Most of my analysis code is written in Java, so ideally I needed a library written in that language that I could use with my own code. A quick investigation turned up a few options (and likely there are several others):
- Google's Compact Language Detection library, which is used in Chrome and drives the translation service that automatically pops up when you view a page not in your default language. Unfortunately, this library appears to be written in C, and I couldn't find any resources on how to compile and use it separately from Chrome.
- NGramJ is an open source n-gram byte and character-based language detector based on a previous library implemented in Perl. This library seems to work, but unfortunately it uses the LGPL license. As my code may be used within IBM, this library was out of the question.
- cue.language is a stop-word-based language detector written in Java by Jonathan Feinberg while at IBM Research and used in the Wordle word visualization site. I briefly considered this library, but ultimately discounted it because I suspect a stop-word-based detector will not be as effective at recognizing the language of short tweets, which may well not contain a stop word at all.
- The Language Identifier library, a plug-in to the Apache Nutch search engine project. This is the solution I ultimately chose, based on a recommendation from this thread on Stack Overflow. The code is also made available under the Apache license, which is particularly advantageous for my needs.
Here are the steps needed to integrate the Language Identifier into your own project:
- Download the Nutch release from http://nutch.apache.org/. Version 1.1 was current when I did this, so I downloaded the file apache-nutch-1.1-bin.tar.gz.
- Unpack the Nutch distribution.
- Pull out the following four jar files, where each path is relative to the root directory of the Nutch distribution that you just unpacked:
  - /nutch-1.1.jar
  - /lib/commons-logging-1.0.4.jar
  - /lib/hadoop-0.20.2-core.jar
  - /plugins/language-identifier/language-identifier.jar
- Add the above jar files to the build path and classpath of your existing Java project.
- Write some code to use the language identifier in your project. For example:
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.analysis.lang.LanguageIdentifier;

public class MyClass {
    public static String identifyLanguage(String text) {
        Configuration conf = NutchConfiguration.create();
        LanguageIdentifier ld = new LanguageIdentifier(conf);
        return ld.identify(text);
    }
}

There are several other methods for identifying the language of a text, including from an InputStream; check out the API documentation for more information on those methods.
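For completeness, here is how the helper above might be called. The sample sentence and the printed result are purely illustrative; identify() returns a language code such as "en":

public class MyClassDemo {
    public static void main(String[] args) {
        // Illustrative input; the identifier returns a language code such as "en".
        String text = "The quick brown fox jumps over the lazy dog";
        String lang = MyClass.identifyLanguage(text);
        System.out.println("Detected language: " + lang);
    }
}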
A few notes:
- This library comes pre-trained with n-gram dictionaries for 18 languages, all of which appear to be for European languages. It is possible to train the classifier on other languages, and I believe this is documented to some extent (I haven't tried to follow any of this documentation, so I won't link any potential resources yet). I know that I will need to further train the classifier on Asian languages, such as Japanese and Chinese, which may be the topic of a future blog post.
- The Apache Tika project seems to have a similar, perhaps identical, Language Identifier library. I had some trouble downloading the source for this project, so I didn't look into the details of that implementation.
- It should be relatively trivial to remove the dependencies on the nutch-XX.jar and hadoop-XX-core.jar files. Both jars are needed only for their implementations of the Configuration object, which here just supplies the minimum and maximum n-gram lengths used in classification (I've read 1 and 3 are reasonable values for each). After reading the Language Identifier code, it appears the Configuration object could be removed fairly easily, and with it those two dependencies (see the sketch below). The logging dependency could probably be removed as well, but I did not look into this in detail.
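As a small illustration of how little the Configuration actually contributes, you could set the two n-gram bounds explicitly before constructing the identifier. The property names below are my assumption based on nutch-default.xml and may differ in your Nutch version, so treat this as a sketch rather than a verified recipe:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.analysis.lang.LanguageIdentifier;

public class ExplicitNGramBounds {
    public static LanguageIdentifier createIdentifier() {
        Configuration conf = NutchConfiguration.create();
        // Assumed property names; check nutch-default.xml in your distribution.
        conf.setInt("lang.ngram.min.length", 1);
        conf.setInt("lang.ngram.max.length", 3);
        return new LanguageIdentifier(conf);
    }
}

If those two values were simply hard-coded inside the identifier instead, the Configuration object (and with it the Nutch and Hadoop jars) would no longer be needed.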