Tuesday, September 7, 2010

Integrating Nutch's Language Identifier Into Your Own Java App

I've been doing some analysis of Twitter data lately, and one feature I've needed is a quick way to determine the language of a tweet. Twitter's API does include a language field for each tweet, but as far as I can tell its value is set by the user (maybe when they configure their account?) and does not reflect any intelligent recognition on the part of the Twitter infrastructure. Quite often the field says that a tweet is in English when it clearly is not.

Most of my analysis code is written in Java, so ideally I needed a library written in that language that I could use with my own code.  A quick investigation turned up a few options (and likely there are several others):
  • Google's Compact Language Detection library, which is used in Chrome and drives the translation service that automatically pops up when you view a page not in your default language. Unfortunately, this library appears to be written in C++, and I couldn't find any resources on how to compile and use it separately from Chrome.
  • NGramJ is an open source n-gram byte and character-based language detector based on a previous library implemented in Perl. This library seems to work, but unfortunately it uses the LGPL license.  As my code may be used within IBM, this library was out of the question.
  • cue.language is a stop-word based language detector written in Java by Jonathan Feinberg while at IBM Research and used in the Wordle word visualization site. I briefly considered this library, but ultimately discounted it as I suspect a stop-word-based language detector will not be as effective for recognizing the language of short textual tweets that conceivably might not include a stop word.
  • The Language Identifier library that is a plug-in to the Apache Nutch search engine project. This is the solution that I ultimately chose, based on a recommendation from this thread on Stack Overflow. This code is also made available under the Apache license, which is particularly advantageous for my needs.
In that thread on Stack Overflow, the original poster mentions that it took him only 30 minutes to integrate the Language Identifier into his own project, but he doesn't describe how he did it or what's involved. It's actually quite straightforward, but after spending the time to figure it out myself, I figured I would write a quick post showing how to do it.
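As an aside, the concern about stop-word-based detection on short tweets can be illustrated with a toy sketch. This is not how cue.language is implemented; the class name, the tiny stop-word lists, and the "unknown" fallback are all made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy stop-word language guesser (hypothetical; NOT the cue.language implementation).
final class StopWordGuesser {

    // Tiny made-up stop-word lists, for illustration only.
    private static final Map<String, Set<String>> STOP_WORDS = new LinkedHashMap<>();
    static {
        STOP_WORDS.put("en", new HashSet<>(Arrays.asList("the", "and", "is", "of", "to")));
        STOP_WORDS.put("fr", new HashSet<>(Arrays.asList("le", "la", "et", "est", "de")));
    }

    // Returns the language whose stop words occur most often in the text,
    // or "unknown" when no stop word from any list occurs.
    static String guess(String text) {
        String best = "unknown";
        int bestHits = 0;
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\W+");
        for (Map.Entry<String, Set<String>> entry : STOP_WORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guess("The cat is on the roof")); // prints "en"
        System.out.println(guess("Goooal!!! #worldcup"));    // prints "unknown"
    }
}
```

The second call shows the failure mode: a short text with no stop words gives this kind of detector nothing to work with, whereas an n-gram approach can still look at the characters themselves.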

Here are the steps needed to integrate the Language Identifier into your own project:
  1. Download the Nutch release from http://nutch.apache.org/. Version 1.1 was current when I did this, and I downloaded the file apache-nutch-1.1-bin.tar.gz.
  2. Unpack the Nutch distribution.
  3. Pull out four jar files. Each path below is relative to the root directory of the Nutch distribution that you just unpacked:
    • /nutch-1.1.jar
    • /lib/commons-logging-1.0.4.jar
    • /lib/hadoop-0.20.2-core.jar
    • /plugins/language-identifier/language-identifier.jar
  4. Add the above jar files to the build path and classpath of your existing Java project.
  5. Write some code to use the language identifier in your project.  For example:
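import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.analysis.lang.LanguageIdentifier;

public class MyClass {

    // Returns the language that the Nutch Language Identifier detects for the given text.
    public static String identifyLanguage(String text) {

        // Load the default Nutch configuration from the classpath.
        Configuration conf = NutchConfiguration.create();
        // Note: a new identifier is built on every call; when classifying many
        // texts it is probably worth creating it once and reusing it.
        LanguageIdentifier ld = new LanguageIdentifier(conf);
        return ld.identify(text);
    }
}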
There are several other methods for identifying the language of a text, including from an InputStream. Check out the API documentation for more information on those methods.
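For the tweet-analysis use case, a method like identifyLanguage can be plugged into a small grouping helper. The following is only a sketch: TweetLanguageGrouper and groupByLanguage are hypothetical names, the null check is a defensive assumption rather than documented behavior, and the stub detector in main stands in for the Nutch call so that the example runs without the Nutch jars.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Groups texts by the language code that a pluggable detector returns.
final class TweetLanguageGrouper {

    static Map<String, List<String>> groupByLanguage(List<String> tweets,
                                                     Function<String, String> detector) {
        Map<String, List<String>> byLanguage = new TreeMap<>();
        for (String tweet : tweets) {
            String lang = detector.apply(tweet);
            if (lang == null) {
                lang = "unknown"; // assumption: treat a missing result as unknown
            }
            byLanguage.computeIfAbsent(lang, k -> new ArrayList<>()).add(tweet);
        }
        return byLanguage;
    }

    public static void main(String[] args) {
        // Stand-in detector so the sketch runs on its own.
        // In a real project, pass MyClass::identifyLanguage instead.
        Function<String, String> stub = text -> text.contains("bonjour") ? "fr" : "en";
        List<String> tweets = Arrays.asList("hello world", "bonjour tout le monde", "good morning");
        // prints {en=[hello world, good morning], fr=[bonjour tout le monde]}
        System.out.println(groupByLanguage(tweets, stub));
    }
}
```

Passing the detector as a Function keeps the grouping logic independent of Nutch, which also makes it easy to swap in a different language detection library later.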

A few notes:
  • This library comes pre-trained with n-gram dictionaries for 18 languages, all of which appear to be European languages. It is possible to train the classifier on other languages, and I believe this is documented to some extent (I haven't tried to follow any of this documentation, so I won't link any potential resources yet). I know that I will need to further train the classifier on Asian languages, such as Japanese and Chinese, which may be the topic of a future blog post.
  • The Apache Tika project seems to have a similar, perhaps identical, Language Identifier library. I had some trouble downloading the source for this project, so I didn't get to look into the details of that implementation.
  • It should be relatively simple to remove the dependencies on the nutch-XX.jar and hadoop-XX-core.jar files. Both jars are needed only for their implementations/interfaces of the Configuration object, which as far as I can tell supplies just the minimum and maximum n-gram length used in classification (I've read that 1 and 3 are reasonable values). After reading the code of the Language Identifier, it looks easy to remove the need for the Configuration object, and thus the need for those two dependencies. The logging dependency could also likely be removed, but I did not look into this in detail.
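To make the minimum and maximum values from the last note more concrete, here is a minimal sketch of character n-gram counting. It assumes the two values are the shortest and longest n-gram lengths to extract; NGramCounter and countNGrams are hypothetical names, and this is not the Nutch implementation, which builds and compares language profiles on top of counts like these.

```java
import java.util.Map;
import java.util.TreeMap;

// Counts character n-grams whose length lies between minLength and maxLength.
// Illustrative only; not the Nutch implementation.
final class NGramCounter {

    static Map<String, Integer> countNGrams(String text, int minLength, int maxLength) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int n = minLength; n <= maxLength; n++) {
            // Slide a window of width n across the text.
            for (int start = 0; start + n <= text.length(); start++) {
                counts.merge(text.substring(start, start + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // With lengths 1 to 3, "banana" yields nine distinct n-grams,
        // for example "a" three times and "ana" twice.
        System.out.println(countNGrams("banana", 1, 3));
    }
}
```

Since only two integers are involved, passing them directly as method parameters like this is one way the Configuration object could be avoided.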
Hope this is helpful!