Tuesday, September 7, 2010

Integrating Nutch's Language Identifier Into Your Own Java App

I've been doing some analysis of Twitter data lately, and one of the things I've needed is a quick method for determining the language of a tweet. Twitter's API does contain a language field for each tweet, but as far as I can tell the value of this field must be set by the user (maybe when they configure their account?) and does not reflect any intelligent recognition on the part of the Twitter infrastructure. Quite often the field specifies that the tweet is in English when it clearly is not.

Most of my analysis code is written in Java, so ideally I needed a library written in that language that I could use with my own code.  A quick investigation turned up a few options (and likely there are several others):
  • Google's Compact Language Detection library, which is used in Chrome and drives the translation service that automatically pops up when you view a page that isn't in your default language. Unfortunately, this library appears to be written in C, and I couldn't find any resources on how to compile and use it separately from Chrome.
  • NGramJ is an open-source language detector that uses byte- and character-level n-grams, based on an earlier library implemented in Perl. This library seems to work, but unfortunately it is released under the LGPL license. As my code may be used within IBM, this library was out of the question.
  • cue.language is a stop-word-based language detector written in Java by Jonathan Feinberg while at IBM Research and used in the Wordle word visualization site. I briefly considered this library, but ultimately discounted it because I suspect a stop-word-based detector will not be as effective at recognizing the language of short tweets, which may well not contain any stop words at all.
  • The Language Identifier library that is a plug-in to the Apache Nutch search engine project. This is the solution that I ultimately chose, based on a recommendation from this thread on Stack Overflow. The code is also made available under the Apache license, which is particularly advantageous for my needs.
In that thread on Stack Overflow, the original poster mentions that it took him only 30 minutes to integrate the Language Identifier into his own project, but he doesn't describe how he did it or what's involved. It's actually quite straightforward, but after spending the time to figure it out myself, I figured I would write a quick post showing how to do it.

Here are the steps needed to integrate the Language Identifier into your own project:
  1. Download the Nutch release from http://nutch.apache.org/. Version 1.1 was current when I did this, so I downloaded the file apache-nutch-1.1-bin.tar.gz.
  2. Unpack the Nutch distribution.
  3. Pull out the following four jar files, where <nutch dir> is the root directory of the Nutch distribution that you just unpacked:
    • <nutch dir>/nutch-1.1.jar
    • <nutch dir>/lib/commons-logging-1.0.4.jar
    • <nutch dir>/lib/hadoop-0.20.2-core.jar
    • <nutch dir>/plugins/language-identifier/language-identifier.jar
  4. Add the above jar files to the build path and classpath of your existing Java project.
  5. Write some code to use the language identifier in your project.  For example:
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.analysis.lang.LanguageIdentifier;

public class MyClass {

    public static String identifyLanguage(String text) {

        // Create a default Nutch configuration and use it to construct the
        // language identifier. (In real code you'd probably want to create
        // these once and reuse them rather than on every call.)
        Configuration conf = NutchConfiguration.create();
        LanguageIdentifier ld = new LanguageIdentifier(conf);

        // identify() returns a short language code such as "en" or "fr"
        return ld.identify(text);
    }
}
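With those jars on the classpath, calling the helper is just a matter of passing in whatever text you want to classify. A tiny (hypothetical) driver class might look like this; if everything is set up correctly, it should print short language codes such as "en" and "fr":

public class LanguageDemo {

    public static void main(String[] args) {
        // Sample strings in English and French; the returned values should
        // be short language codes such as "en" and "fr".
        System.out.println(MyClass.identifyLanguage("The quick brown fox jumps over the lazy dog."));
        System.out.println(MyClass.identifyLanguage("Le renard brun et rapide saute par-dessus le chien paresseux."));
    }
}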
There are several other methods for identifying the language of a text, including from an InputStream. Check out the API documentation for more information on those methods.
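For example, if I remember the API correctly there is an overload of identify() that takes an InputStream (and another that also takes a charset name). Reading from a file would then look roughly like the sketch below; double-check the exact method signatures against the Javadoc for your Nutch version.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.analysis.lang.LanguageIdentifier;
import org.apache.nutch.util.NutchConfiguration;

public class FileLanguageDemo {

    public static String identifyFileLanguage(String path) throws IOException {
        Configuration conf = NutchConfiguration.create();
        LanguageIdentifier identifier = new LanguageIdentifier(conf);

        InputStream in = new FileInputStream(path);
        try {
            // Assumes the identify(InputStream) overload mentioned above;
            // verify the signature in the Javadoc for your version.
            return identifier.identify(in);
        } finally {
            in.close();
        }
    }
}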

A few notes:
  • This library comes pre-trained with n-gram dictionaries for 18 languages, all of which appear to be European. It is possible to train the classifier on other languages, and I believe this is documented to some extent (I haven't tried to follow any of that documentation, so I won't link to any potential resources yet). I know that I will need to further train the classifier on Asian languages, such as Japanese and Chinese, which may be the topic of a future blog post.
  • The Apache Tika project seems to have a similar, perhaps identical, Language Identifier library. I had some trouble downloading the source for that project, so I didn't get a chance to look into its implementation in more detail.
  • It should be relatively trivial to remove the dependencies on the nutch-XX.jar and hadoop-XX-core.jar files. Both of these jars are needed only for the Configuration object, which in this context simply supplies the minimum and maximum n-gram lengths used in classification (I've read that 1 and 3 are reasonable values). After reading the Language Identifier code, it looks like it should be easy to remove the need for the Configuration object, and thus the need for those two dependencies. The logging dependency could probably be removed as well, but I did not look into that in detail.
Hope this is helpful!

Friday, June 11, 2010

Using Dojo with XULRunner

I recently started porting one of my previous research projects, which was originally implemented as a browser extension to Firefox, into a XULRunner application. The advantage of moving to XULRunner is that I will hopefully be able to reuse a bunch of my Firefox extension code, but at the same time I'm hoping to give a lot of that code a revamp. One of the ways that I want to do that is to adopt a better object-oriented approach for my code (or just any object-oriented approach for that matter!), and after not a lot of convincing I decided to go with the object model provided by the Dojo Toolkit.

If you're at all familiar with Dojo, you'll know that it's primarily a toolkit for building browser-based web apps. So how do I go about using it in a Firefox extension or as part of a XULRunner app? It's actually pretty straightforward, but there's not a lot of documentation. For that reason, I thought I'd throw together this quick blog post, both as a reminder for myself and to help out other people who try to do the same thing in the future.

Step #1: Create a custom Dojo build

There are a bunch of resources for how to do this generally, but I'll describe the process quickly. What's most important to know are the few parameters that you'll need to put into your build profile to generate a build of Dojo that will work.

First, download the Dojo SDK. As of today, Dojo downloads are available at http://dojotoolkit.org/download/. Scroll to the bottom of the page to find the SDK, and download either the .tgz or .zip version, depending on which compression format works best for you.

Now we create the custom build. To do this, you need to create a build profile. Some sample build profiles are included in the SDK that you downloaded at <dojo src dir>/util/buildscripts/profiles.

I started by copying the Rhino profile in rhino.profile.js to another file, such as xulrunner.profile.js. In your new build profile, make the following changes:
  • Change hostenvType = "rhino" to hostenvType = "ff_ext".
  • Modify the prefixes section as needed to include the Dojo extensions that you need. I personally kept dojox but removed shrinksafe from the Rhino profile. You might want to add other prefixes, though most of the other libraries (such as dijit) have to do with creating widgets and that capability seems less useful in the XULRunner environment.
Once you've got your build profile finished, you need to create the custom build. The command line for this depends on what you named your build profile in the previous step. If you used the name that I specified above, then the command line would be:

<dojo src dir>/util/buildscripts/build.sh profile=xulrunner action=release

This will create a new release directory at the same level as your dojo src directory.

Step #2: Add the Dojo code to your XULRunner project

How you do this depends a bit on the structure of your XULRunner project. Assuming you're creating a normal app and using the standard conventions, you should have a content directory in your XULRunner project which contains a XUL file (let's call it main.xul) that defines the user interface for your main window.

First, copy the dojo release directory that you created in step #1 inside your content directory. I named this directory dojo, and in my configuration the dojo.js file was located at content/dojo/dojo/dojo.js. The remaining description assumes these locations, so make changes as necessary for your application.

To enable Dojo for your project, add the following lines near the top of the main.xul file:


<script>
    // Specify the name of the package (from chrome.manifest)
    var packageName = "package";
   
    // Determine the current locale so that we can pass it to Dojo
    // Code taken from:
    // https://developer.mozilla.org/En/How_to_enable_locale_switching_in_a_XULRunner_application
    var chromeRegService = Components.classes["@mozilla.org/chrome/chrome-registry;1"].getService();
    var xulChromeReg = chromeRegService.QueryInterface(Components.interfaces.nsIXULChromeRegistry);
    var selectedLocale = xulChromeReg.getSelectedLocale(packageName);           

    // Create the Dojo configuration structure
    var djConfig = {
        isDebug: true,
        locale: selectedLocale,
        baseUrl: 'chrome://highlightxr/content/dojo/dojo/'
    };
</script>
<script src="chrome://highlightxr/content/dojo/dojo/dojo.js" type="application/x-javascript"/>


Note that internationalization is important to me, so I added some extra code to get the current locale from XULRunner and pass that value to Dojo. That may not matter to you, in which case you can just manually set locale in djConfig to "en-US" or whatever the appropriate string may be. 

You'll also need to insert your own package name in the snippet above. This is the name that you specified in your chrome.manifest file.

I hope that helps!

If you need any help with Dojo, I suggest checking out their web site or looking at Dojo: The Definitive Guide or Mastering Dojo: JavaScript and Ajax Tools for Great Web Experiences.

Thursday, April 29, 2010

Mechanical Turk and the NCAA Tournament

Nearly a month and a half ago, the sports enthusiast community was consumed with the annual tradition of filling out their NCAA brackets and predicting the outcome of the tournament. I was watching a segment on SportsCenter about filling out your bracket, and it showed an experiment in which a group of people were brought into a room and each asked to fill out a bracket using just the seed numbers...no team names were shown. In other words, these people were picking winners blindly, based only on seed. I don't remember the exact results, but the host of the segment was surprised at the accuracy, and I seem to remember that the crowd did better than any of the other prediction techniques shown in that same segment. (Unfortunately, I can't find any video of this segment to link to...)

The segment inspired me to try something similar, but I wanted my "crowd" to be a little more informed than the people participating in the ESPN segment. My idea was to show people a small set of facts about the two teams participating in a game, so that they could make a more informed decision about the winner. I also chose not to show the team names, so that any bias for or against well-known teams (e.g., Duke) would not be a factor.

I also needed a crowd of people to answer these questions, and I chose to use Amazon's Mechanical Turk service to provide that crowd. For those of you who may not be familiar with Mechanical Turk, it is a service where people can post small, simple tasks and have other people perform them. The typical task requires some sort of human judgement that can't easily be performed by a computer, such as providing a label for an image, filling out a CAPTCHA, etc. A monetary value is assigned to each task, though these values are often quite small (e.g., $0.01) depending on the difficulty of the task.

Here's how I set up my prediction system:

First, I collected 22 facts about each team participating in the tournament. This was actually the most time-consuming aspect of the entire process. The facts I collected were:
  • Bracket (Midwest, East, West, South)
  • Seed
  • Conference
  • Overall Record
  • Conference Record
  • RPI Rank
  • Strength of Schedule Rank
  • Conference Rank
  • Record Against the Top 25
  • AP Poll Ranking
  • Coaches Poll Ranking
  • Average Points/Game By Leading Scorer
  • Points Scored Per Game
  • Points Allowed Per Game
  • Home Record
  • Away Record
  • Most Recent Streak
  • Turnovers Per Game
  • Team Field Goal %
  • Team Free Throw %
  • Team 3 Point %
  • Total Points Scored
Second, I created a program to generate tasks from the facts I collected and post them to Mechanical Turk. For each game, the program created 10 tasks. Each task presented 10 random facts about the two teams in a game, and then asked the turker (as people who perform tasks on Mechanical Turk are called) to answer three questions:
  • A question about the presented facts of the form, "Which team has a better X?" The purpose of this question is to force the turker to read the facts and to filter out responses from turkers who did not read them. This question was inspired by Aniket Kittur et al.'s seminal paper on using Mechanical Turk for user studies at CHI 2008 (pdf).
  • A prediction for a winner between the two teams.
  • A prediction of the final score between the two teams. This question also turned out to be a filter for turkers who were not sufficiently knowledgeable about basketball. For example, responses with nonsensical or highly unlikely scores (usually far too low, such as 15-7) could be excluded.
Here's a screenshot of one of the tasks that I posted on Mechanical Turk.


Using this program, I collected answers from turkers starting with the first round of games in the tournament. Winners were determined based on which team the majority of turkers had picked, with each turker response weighted equally. Once I had winners for each game in a round, I generated and ran a new set of tasks for the subsequent round until the entire bracket was filled out.
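For the curious, the aggregation itself is nothing fancy. A simplified sketch of the majority-vote logic (not my actual script; the class and method names here are just illustrative) looks something like this:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BracketAggregator {

    /**
     * Returns the team picked by the majority of turkers for a single game.
     * Each response is weighted equally; ties are broken arbitrarily here.
     */
    public static String majorityPick(List<String> turkerPicks) {
        // Count the votes for each team
        Map<String, Integer> votes = new HashMap<String, Integer>();
        for (String pick : turkerPicks) {
            Integer count = votes.get(pick);
            votes.put(pick, count == null ? 1 : count + 1);
        }

        // Return the team with the most votes
        String winner = null;
        int best = -1;
        for (Map.Entry<String, Integer> entry : votes.entrySet()) {
            if (entry.getValue() > best) {
                best = entry.getValue();
                winner = entry.getKey();
            }
        }
        return winner;
    }
}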

Tasks were typically posted for 2-6 hours and I paid $0.02 to $0.05 per prediction depending on how quickly I needed predictions submitted (paying more will often lead to quicker responses, but there is evidence suggesting it can also lead to inferior predictions). I ended up paying more for the predictions in the later rounds because I wanted to complete the bracket before the tournament games began (so I could enter the bracket in ESPN's tournament challenge) and I was running out of time to make that deadline.

The final bracket that the turkers selected is shown below, along with the correctness of their final predictions.


You may also still be able to view the bracket on ESPN's web site.

The first thing to notice is that the turkers primarily picked winners based on seed, which I found a little disappointing.

Turkers did pick a few upsets however. The first round upsets are particularly interesting, where the turkers picked the following teams against seed:
  • 9th seed Northern Iowa over 8th seed UNLV
  • 11th seed Washington over 6th seed Marquette
  • 10th seed St. Mary's over 7th seed Richmond
Interestingly, all of these upsets actually happened. The upsets are also not just from the 8/9 games in which the two teams are generally evenly matched, but from games with larger differences in seeding. Of course, the turkers missed a number of other upsets that happened in the first round (they missed 7 picks overall).

In the second round, turkers picked only 1 upset, 5th seed Butler over 4th seed Vanderbilt. Butler did win the game, though they were playing a 13th seed Murray State team that had eliminated Vanderbilt in the first round. Again, while accurately picking one upset that did happen, the turkers also missed many other upsets, including Northern Iowa's surprising upset of Kansas. The turkers missed 7 picks overall in the second round, though interestingly only one of these was a carry-over from a missed pick in the first round (Georgetown). In other words, all but one of the teams that the turkers missed in the first round went on to lose in the second round.

In the third round, the turkers finally picked an upset that did not happen, though they also correctly picked one that did. The upset that actually occurred was 3rd seed Baylor over 2nd seed Villanova, though of course in the actual tournament Villanova was eliminated by 10th seed St. Mary's in the second round, and Baylor beat St. Mary's in the third round. The upset that the turkers picked but that did not happen was 3rd seed New Mexico over 2nd seed West Virginia. West Virginia did advance, although they beat 11th seed Washington, which had eliminated New Mexico in the previous round. As in the previous rounds, the turkers also missed several upsets. Turkers missed 4 picks overall in the third round, with two being carry-overs from misses in the 2nd round (Kansas and the missed New Mexico upset).

In the remaining rounds, the turkers went by seed and selected all #1 seeds to make the Final Four. Only Duke actually made the Final Four, and that was the final correct pick that the turkers made. The turkers had Kentucky and Kansas in the championship game, with Kansas winning it all, and obviously that did not happen in the real tournament.

While I was disappointed that the turkers picked primarily by seed, there are two features of their predictions that are interesting. First, it is impressive to me that nearly every upset winner that they chose actually won. This suggests that maybe this Mechanical Turk approach has some value for picking likely upsets based on factual data. Second, the turker picks suffered only five "carry-over misses" across their entire bracket, and three of those were due to the early exit of Kansas. This means that even when turkers were wrong about a team, that team would then likely lose in the following round. This may simply be an artifact of picking primarily by seed, but it is an advantageous feature in any bracket.

There are a bunch of interesting aspects of the turker data to look at that I haven't spent time on yet. In particular, there are a number of cases where the same turker provided multiple predictions for the same game (there is no way to prevent this using the current Mechanical Turk qualification system). None of the games I looked at were dominated by predictions from a single turker, so it does not seem that this bias significantly affected my results. Each task also used a different randomly chosen set of facts, so it is possible that a turker would not have even realized that they were providing a prediction for a game they had already predicted.

Overall, I paid about $35 in fees to turkers to collect these predictions. Clearly this was not a good financial investment, as the bracket did not win me any money. It was a fun experiment though, and it refreshed my knowledge of Mechanical Turk, which I may apply in some additional experiments over the summer.

I'll definitely try this experiment again next year, perhaps with some slight modifications, to see if I get similar, and possibly better, results.

-jn
