Friday, June 13, 2014

Letter to my State Senator re: AB's 612 and 2293

If you haven't heard, two bills are about to go up for vote in the California State Senate that have implications for ride-sharing services, such as Uber, Lyft, etc.  You can read the bills here:


Although I am generally in favor of ride-sharing services, some aspects of these bills, especially in AB 612, are not necessarily bad for consumers.  Ultimately, however, I think these bills will stifle innovation.  AB 2293 seems especially bad because it defines drivers as being on the clock as long as they have the app "turned on," whatever that means.

Here is the letter that I wrote to my local state senator regarding these two bills.  If you live in California, please consider sending a similar letter to your local state senator.


Dear Senator Block-

I am a resident of your district, living in downtown San Diego.  I am writing to encourage you to vote no on both AB 612 and AB 2293.

I strongly encourage you to vote no on AB 2293, which in my opinion creates an unreasonable standard for when commercial insurance should apply to ride-sharing services.  Just because an app is "on" does not mean that commercial activity is taking place.  What if I am taking a personal trip to the store and am open to sharing my ride, but no sharing actually happens?  I hope you agree that sharing rides is important from an ecological perspective; a law like this will limit future innovation and interfere with existing services, especially for drivers who participate in multiple services simultaneously.

Another issue is what "on" means.  What if the application is running in the background but not being used or attended to, which is easy to do on modern smartphones?  I am certain drivers will be unnecessarily penalized by large insurance companies in situations where they should have been covered.  I fully expect insurance companies to exploit the law as much as possible, in ways I expect lawmakers did not intend.

I also encourage you to vote no on AB 612.  I don't think it is appropriate that online ride-sharing services are explicitly called out in this law.  Although I am sympathetic to the idea of requiring that primary commercial insurance be offered by corporations like Uber and that background checks be conducted on drivers, I think those should be requirements for all transportation services, and it is unnecessary to call out specific services.

I am also worried about such a law's impact on innovation.  It will obviously create an even larger barrier to entry for these services, which I am not sure is in the best interest of consumers.  Perhaps a different law, one requiring clear disclosure of what insurance, background checks, etc., if any, are part of the sharing service, would have a similar effect while allowing consumers to choose the level of safety they are comfortable with.  To be honest, I think this would be a good idea anyway, as I suspect few consumers have any idea what is required of taxi drivers vs. Uber vs. others.

Thank you for your time and consideration,

Jeffrey Nichols
Ph.D., IBM Research
San Diego Resident

Wednesday, May 15, 2013

IUI 2013 Reviewing Statistics & Results

Over the past year, I served as co-program chair for Intelligent User Interfaces 2013 with Pedro Szekely. We modified the review process somewhat from previous versions of the conference, and feedback from those attending IUI this year was that the program was strong and interesting.

Over the course of at least two blog posts, I want to write a little bit about the process that we used and to discuss the results of that process.  In this second post, I describe the results of the reviewing process and show various graphs and statistics.

In total, 192 papers were submitted to IUI 2013. We accepted 43 of these papers, which is a 22% acceptance rate and within the 20-25% range that we were hoping to achieve.  Submissions came from countries around the world, as can be seen in the following graph.


One question that we had was about the reviewer pool. We intended for the Senior Program Committee (SPC) members to find the most qualified reviewers for each paper they handled rather than, for example, choosing from the small set of people they knew well even if those people were not necessarily qualified. From the following graph, we can see that most reviewers contributed just one review, which would seem to indicate that SPC members did their job of finding uniquely qualified reviewers for many papers.


One question that some people ask about this approach is consistency. Specifically, how can decisions be made consistently when most reviewers review only one or two papers? The answer lies in the work of the SPC members, who aggregate all of the reviews for each paper when writing their meta-review and have visibility across multiple papers to help calibrate. Their knowledge of the reviewers they recruited, and of those reviewers' expertise, also comes into play when writing meta-reviews and calibrating across reviewers. The majority of decisions were also made during the two SPC meetings, where many submissions were discussed, allowing for further calibration.

We also looked at the distribution of reviewer ratings and reviewers' self-reported expertise on the papers they reviewed. These can be seen in the following graphs.


The breakdown of review scores makes some sense given the final acceptance rate. The majority of papers are rejected, so it is not surprising that low scores dominate the overall distribution.

We believe the expertise distribution suggests that this process at least partially helped achieve our goal of finding better reviewers for each submission. First, we're happy that only 20% of reviewers indicated an expertise of 2 or below, though we'd certainly like to see this become even smaller in the future. The large number of 3 ratings is encouraging, especially because in our experience many well-qualified reviewers hesitate to give themselves the top rating of 4, perhaps because they are more aware of what they don't know.

Finally, a controversial decision in this year's process was to eliminate the short paper category (previously a 4 page maximum archival category) and to include explicit language in the call for papers and in the instructions to reviewers to rate the contribution of the paper in proportion to the length of the paper. This is the same practice that has been in place at SIGGRAPH for some time and has recently been adopted by the UIST and CSCW communities. An important question is, what was the impact on shorter papers? Did they have a harder time being accepted under this new policy?


This spreadsheet and graph show the submission results broken down by page length. From the data, we can clearly see that longer papers had a better chance of acceptance and that no papers of 4 pages or less were accepted by the conference. However, very few papers of 4 pages or less were submitted, so it is hard to draw a clear conclusion from this sample. It is well known that shorter papers have greater difficulty getting accepted even when there is an explicit short paper category; informally, we've heard that short paper acceptance rates are often in the 10-15% range. At that rate we might have expected one paper from the 4-pages-and-under category to be accepted, and the fact that none was this year might be ascribed to random chance.

Papers in the 5-7 page range fared better, with an overall acceptance rate of 7.5%. While this is somewhat lower than what we've heard for conferences with an explicit short paper category, we are happy that some short papers were accepted to the conference.

We do wonder if our initial policy of disallowing conditional accepts and shepherding had an impact on short paper acceptance rates. Shorter papers are more difficult to write, and we know that in at least one case an interesting short paper was rejected because it had substantial writing flaws that the reviewers were not confident the authors could address without shepherding during the camera-ready process. Had we had a clearer policy allowing conditional accepts and shepherding at the discretion of the SPC member, perhaps the acceptance rates for shorter papers would have increased a small amount.

Going forward, I suspect that the short paper category will return in future years. The best argument I've heard so far is that members of the AI community may not submit some of their work if the category does not exist, as it often does in AI conferences, and making sure to cater to both the AI and HCI communities will be important if IUI is to grow and thrive.

IUI 2013 Reviewing Process

Over the past year, I served as co-program chair for Intelligent User Interfaces 2013 with Pedro Szekely. We modified the review process somewhat from previous versions of the conference, and feedback from those attending IUI this year was that the program was strong and interesting.

Over the course of at least two blog posts, I want to write a little bit about the process that we used and to discuss the results of that process.  In this first post, I describe our review process.

Our process consisted of the following steps, closely mimicking the process used by the UIST and CHI conferences.

  1. Chairs Recruit SPCs - We invited about 50 people to be on the SPC based on prior program committees, recent accepted papers in the IUI program, and suggestions from various members of the community.  Our goal was to assemble an SPC of a size such that each member would handle approximately 8-10 papers. Choosing the number of papers for each SPC member to handle is a trade-off between workload and decreasing variance in decisions (more submissions handled should lead to less variance), and we chose this range to balance these issues.
  2. SPCs Bid for Papers - After the deadline for abstracts, each SPC member was able to specify their conflicts and bid on the papers they most wanted to handle.
  3. Chairs Assign Papers to SPCs - We assigned papers using a greedy algorithm where all papers with non-competing SPC bids were assigned first, then papers with multiple bids were given assignments that best balanced workload, and then finally papers with no bids were assigned based on our estimate of the match between paper and SPC member.
  4. SPCs Find Reviewers - We did not pre-recruit a program committee to review papers, so SPC members were allowed and encouraged to find the most qualified reviewers to handle each submission. Three reviewers were required for each paper. There are at least two reasons we chose this approach: 1) selecting reviewers from anyone in the world lets SPC members bring in expertise from outside the community when needed, and should also ensure that each reviewer is highly qualified to review the paper. This is especially true compared to processes where the set of reviewers is fixed a priori and SPC members must choose the most qualified from that fixed set. 2) Because the SPC member recruited each reviewer, they should have knowledge of that reviewer's perspective on the paper and their strengths and weaknesses as a reviewer. Thus the SPC member should be better able to interpret and trust the reviews they receive.
  5. Reviewers Submit Reviews - Reviewers had approximately 4 weeks to write reviews, depending on when their SPC member recruited them. We also allowed SPC members to grant additional latitude to their reviewers with the understanding that extensions to reviewers would cut into the time that SPC members would have for writing meta-reviews.
  6. SPCs Write Meta-Reviews - Meta-reviews synthesize the reviews written by the external reviewers and describe the points that the SPC member finds compelling both in favor and against the paper's acceptance. SPC members were given approximately 2 weeks to write their meta-reviews, though this period overlapped the US Thanksgiving holiday.
  7. Chairs Define Cutoff for Rebuttal - To reduce workload on the SPC members, we decided to reject some papers without rebuttal. These were papers that received only neutral or below-neutral scores. Papers that were allowed to continue to rebuttal received one or more scores of 4 or higher, two or more scores of 3, or an SPC score of 3 (see the sketch after this list). This eliminated approximately 50% of papers pre-rebuttal.
  8. Authors Submit Rebuttal - For papers still under consideration, authors were invited to submit a rebuttal of up to 5000 characters focused on correcting any mistakes or misconceptions in the reviews. Most, but not all, eligible authors submitted a rebuttal.
  9. Chairs, SPCs, Reviewers Discuss - Following the submission of rebuttals, reviewers were encouraged to return to the reviewing web site to read the rebuttal, update their reviews, and engage in discussion with the other reviewers, SPC member, and in a few cases ourselves.
  10. Pre-SPC Meeting Decisions - Following a week of discussion, we asked each SPC member to make a preliminary decision for each of their papers: accept, discuss, or reject. These decisions were used to determine thresholds for discussion at the PC meeting. A few papers with high scores and accept decisions by their SPC members were automatically accepted and not chosen to be discussed. Similarly, papers with low scores and reject decisions from their SPC member were automatically rejected and not discussed. Remaining papers, those with discuss decisions and/or within a particular score range, were assigned a secondary SPC member to provide an additional opinion and in many cases an additional review. 
  11. Two Chair/SPC Telecon Meetings - We conducted two remote meetings to discuss papers and reach final decisions for most, if not all, papers. Two meetings were conducted to reflect the timezone needs of our international SPC; one meeting was scheduled to accommodate the US west coast and Asian countries, and the second was scheduled to accommodate the US east coast and Europe (secondary SPC members were assigned with the constraint of ensuring that both SPCs for each paper would be in the same meeting). In most cases, the discussion led to an accept or reject decision. In a few cases, an additional SPC member or a Chair was assigned to provide an additional review and the decision was tabled for later in the week.
  12. Final Accepts - After additional reviews came in during the 4 days after the SPC meetings, final decisions were made for the remaining papers.
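To make the cutoff rule from step 7 concrete, here is a minimal Java sketch of the check. The class and method names are hypothetical, it assumes the usual 1-5 review scale with 3 as the neutral score, and it reads the post's phrase "an SPC score of 3" as 3 or better.

import java.util.List;

// A minimal sketch of the pre-rebuttal cutoff described in step 7.
public class RebuttalCutoff {

    /**
     * A paper continued to rebuttal if it received one or more reviewer
     * scores of 4 or higher, two or more reviewer scores of 3, or an SPC
     * (meta-review) score of 3 or better; everything else was rejected
     * without rebuttal.
     */
    public static boolean continuesToRebuttal(List<Integer> reviewerScores, int spcScore) {
        long fourOrHigher = reviewerScores.stream().filter(s -> s >= 4).count();
        long exactlyThree = reviewerScores.stream().filter(s -> s == 3).count();
        return fourOrHigher >= 1 || exactlyThree >= 2 || spcScore >= 3;
    }
}

Only papers passing a check along these lines moved on to rebuttal and discussion; the rest were rejected pre-rebuttal, which eliminated roughly half of the submissions.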
Overall, I thought the process worked very well. Hurricane Sandy and the US Thanksgiving holiday impacted us a little bit more than I would have liked, slowing down some reviewing and the writing of meta-reviews, which ultimately delayed the release of reviews for rebuttals by a day. 

Some things that worked well:
  • The remote SPC meetings seemed to work particularly well, and I think many of the decisions were improved by the discussions over the phone. I was very worried that the technology would get in the way of a productive meeting, but we were able to muddle through fairly well.
  • Pre-rebuttal rejections removed a large number of papers from consideration with little noticeable impact on the process. At least one author was happy to be able to revise and resubmit their paper elsewhere earlier than they might have been able to under the normal process.
  • SPC bidding provided a lot of information that was highly useful for making the SPC assignments. Thanks to the bidding information, SPC assignments took much less time than I initially anticipated.
  • The week gap between the abstract submission deadline and the full paper submission deadline allowed us to have all paper assignments ready to go by the time full papers were submitted. There was little lag between full paper submission and the beginning of the reviewer assignment process.
Were I to do this again, there are a couple things I would change:
  • Allowing the review deadline to be flexible and managed between the SPCs and their reviewers made the job of tracking and providing reminders more difficult than I would have liked. I would definitely go with a stricter deadline in the future.
  • We initially did not allow for conditional acceptances, although we ultimately allowed a few in the end for papers that clearly described good work but had writing flaws that were too substantial to rely on the authors alone to fix. In the future, the concept of conditional acceptance should be in place from the beginning, so that reviewers and SPC members can take this into account.
Coming up next, some statistics on the results of the reviewing process.



Tuesday, September 7, 2010

Integrating Nutch's Language Identifier Into Your Own Java App

I've been doing some analysis of twitter data lately, and one of the features that I've needed is a quick method for determining the language of a tweet. Twitter's API does contain a language field for each tweet, but as far as I can tell the value of this field must be set by the user (maybe when they configure their account?) and does not reflect any intelligent recognition on the part of the Twitter infrastructure. Quite often the field specifies that the tweet is in English when it clearly is not.

Most of my analysis code is written in Java, so ideally I needed a library written in that language that I could use with my own code.  A quick investigation turned up a few options (and likely there are several others):
  • Google's Compact Language Detection library, which is used in Chrome and drives the translation service that automatically pops up when you view a page not in your default language. Unfortunately, this library appears to be written in C, and I couldn't find any resources on how to compile and use it separately from Chrome.
  • NGramJ is an open-source, n-gram-based (byte- and character-level) language detector based on a previous library implemented in Perl. This library seems to work, but unfortunately it uses the LGPL license.  As my code may be used within IBM, this library was out of the question.
  • cue.language is a stop-word based language detector written in Java by Jonathan Feinberg while at IBM Research and used in the Wordle word visualization site. I briefly considered this library, but ultimately discounted it as I suspect a stop-word-based language detector will not be as effective for recognizing the language of short textual tweets that conceivably might not include a stop word.
  • The Language Identifier library that is a plug-in to the Apache Nutch search engine project. This is the solution that I ultimately chose, based on a recommendation from this thread on Stack Overflow. This code is also made available under the Apache license, which is particularly advantageous for my needs.
In that thread on Stack Overflow, the original poster mentions that it took him only 30 minutes to integrate the Language Identifier into his own project; however, he doesn't describe how he did it or what's involved. It's actually quite straightforward, but after spending the time to figure it out myself, I figured I would write a quick post showing how to do it.

Here are the steps needed to integrate the Language Identifier into your own project:
  1. Download the Nutch release from http://nutch.apache.org/. Version 1.1 was current when I did this, and I downloaded the file apache-nutch-1.1-bin.tar.gz.
  2. Unpack the Nutch distribution.
  3. Pull out four jar files, where <nutch dir> is the root directory of the Nutch distribution that you just unpacked:
    • <nutch dir>/nutch-1.1.jar
    • <nutch dir>/lib/commons-logging-1.0.4.jar
    • <nutch dir>/lib/hadoop-0.20.2-core.jar
    • <nutch dir>/plugins/language-identifier/language-identifier.jar
  4. Add the above jar files to the build path and classpath of your existing Java project.
  5. Write some code to use the language identifier in your project.  For example:
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.analysis.lang.LanguageIdentifier;

public class MyClass {

    public static String identifyLanguage(String text) {

        // Create a default Nutch configuration, which supplies the identifier's
        // settings (e.g., the minimum and maximum n-gram lengths)
        Configuration conf = NutchConfiguration.create();

        // Build the identifier and return the detected language code for the text
        LanguageIdentifier ld = new LanguageIdentifier(conf);
        return ld.identify(text);
    }
}
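
For completeness, here is a quick, hypothetical usage example. The tweet text is made up, and it assumes that identify() returns a language code string such as "en" for English:

public class MyClassTest {

    public static void main(String[] args) {
        // A made-up example tweet, just for illustration
        String tweet = "Just landed in San Diego and the weather is perfect!";

        // Should print a language code, e.g. "en"
        System.out.println(MyClass.identifyLanguage(tweet));
    }
}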
There are several other methods for identifying the language of a text, including from an InputStream. Check out the API documentation for more information on those methods.

A few notes:
  • This library comes pre-trained with n-gram dictionaries for 18 languages, all of which appear to be for European languages. It is possible to train the classifier on other languages, and I believe this is documented to some extent (I haven't tried to follow any of this documentation, so I won't link any potential resources yet). I know that I will need to further train the classifier on Asian languages, such as Japanese and Chinese, which may be the topic of a future blog post.
  • The Apache Tika project seems to have a similar, perhaps identical, Language Identifier library. I had some trouble downloading the source for this project, so I didn't look into the details of that implementation.
  • It should be relatively trivial to remove the dependencies on the nutch-XX.jar and hadoop-XX-core.jar files. Both these jars are needed for their implementations/interfaces of the Configuration object, which only provides a minimum and maximum value for the number of n-grams used in classification (I've read 1 and 3 are reasonable values for each). After reading the code of the Language Identifier, it appears that it should be easily possible to remove the need for the Configuration object, and thus the need for those two dependencies.  The logging dependency could also likely be removed, but I did not look into this in detail.
Hope this is helpful!

Friday, June 11, 2010

Using Dojo with XULRunner

I recently started porting one of my previous research projects, which was originally implemented as a browser extension to Firefox, into a XULRunner application. The advantage of moving to XULRunner is that I should be able to reuse a bunch of my Firefox extension code, while at the same time giving a lot of that code a revamp. One of the ways that I want to do that is to adopt a better object-oriented approach for my code (or just any object-oriented approach for that matter!), and after not a lot of convincing I decided to go with the object model provided by the Dojo Toolkit.

If you're at all familiar with Dojo, you'll know that it's primarily a toolkit for building browser-based web apps. So how do I go about using it in a Firefox extension or as part of a XULRunner app? It's actually pretty straightforward, but there's not a lot of documentation. For that reason, I thought I'd throw together this quick blog post, both for my own memory and to help out other people who try to do the same thing in the future.

Step #1: Create a custom Dojo build

There are a bunch of resources for how to do this generally, but I'll describe the process quickly. What's most important to know are the few parameters that you'll need to put into your build profile to generate a build of Dojo that will work.

First, download the Dojo SDK. As of today, Dojo downloads are available at http://dojotoolkit.org/download/. Scroll to the bottom of the page to find the SDK, and download either the .tgz or .zip versions depending on which compression method works best for you.

Now we create the custom build. To do this, you need to create a build profile. Some sample build profiles are included in the SDK that you downloaded at <dojo src dir>/util/buildscripts/profiles.

I started by copying the Rhino profile in rhino.profile.js to another file, such as xulrunner.profile.js. In your new build profile, make the following changes:
  • Change hostenvType = "rhino" to hostenvType = "ff_ext".
  • Modify the prefixes section as needed to include the Dojo extensions that you need. I personally kept dojox but removed shrinksafe from the Rhino profile. You might want to add other prefixes, though most of the other libraries (such as dijit) have to do with creating widgets and that capability seems less useful in the XULRunner environment.
Once you've got your build profile finished, you need to create the custom build. The command line for this depends on what you named your build profile in the previous step. If you used the name that I specified above, then the command line would be:

<dojo src dir>/util/buildscripts/build.sh profile=xulrunner action=release

This will create a new release directory at the same level as your dojo src directory.

Step #2: Add the Dojo code to your XULRunner project

How you do this depends a bit on the structure of your XULRunner project. Assuming you're creating a normal app and using the standard conventions, you should have a content directory in your XULRunner project which contains a XUL file (let's call it main.xul) that defines the user interface for your main window.

First, copy the dojo release directory that you created in step #1 inside your content directory. I named this directory dojo, and in my configuration the dojo.js file was located at content/dojo/dojo/dojo.js. The remaining description assumes these locations, so make changes as necessary for your application.

To enable Dojo for your project, add the following lines near the top of the main.xul file:


<script>
    // Specify the name of the package (from chrome.manifest)
    var packageName = "package";
   
    // Determine the current locale so that we can pass it to Dojo
    // Code taken from:
    // https://developer.mozilla.org/En/How_to_enable_locale_switching_in_a_XULRunner_application
    var chromeRegService = Components.classes["@mozilla.org/chrome/chrome-registry;1"].getService();
    var xulChromeReg = chromeRegService.QueryInterface(Components.interfaces.nsIXULChromeRegistry);
    var selectedLocale = xulChromeReg.getSelectedLocale(packageName);           

    // Create the Dojo configuration structure
    var djConfig = {
        isDebug: true,
        locale: selectedLocale,
        baseUrl: 'chrome://highlightxr/content/dojo/dojo/'
    };
</script>
<script src="chrome://highlightxr/content/dojo/dojo/dojo.js" type="application/x-javascript"/>


Note that internationalization is important to me, so I added some extra code to get the current locale from XULRunner and pass that value to Dojo. That may not matter to you, in which case you can just manually set locale in djConfig to "en-US" or whatever the appropriate string may be. 

You'll also need to insert your own package name in the snippet above. This is the name that you specified in your chrome.manifest file.

I hope that helps!

If you need any help with Dojo, I suggest checking out their web site or looking at Dojo: The Definitive Guide or Mastering Dojo: JavaScript and Ajax Tools for Great Web Experiences.

Thursday, April 29, 2010

Mechanical Turk and the NCAA Tournament

Nearly a month and a half ago, the sports enthusiast community was consumed with the annual tradition of filling out NCAA brackets and predicting the outcome of the tournament. I was watching a segment on SportsCenter about filling out your bracket, and they showed an experiment in which a group of people were brought into a room and each asked to fill out a bracket using just the seed numbers...no team names were shown. In other words, these people were randomly picking winners based only on seed. I don't remember the exact results; however, the host of the segment was surprised at the accuracy, and I seem to remember that the crowd did better than any of the other techniques for predicting the tournament shown in that same segment. (Unfortunately, I can't find any video of this segment to link to...)

The segment inspired me to try something similar, but I wanted my "crowd" to be a little more informed than the people participating in the ESPN segment. My idea was to show people a small set of facts about the two teams participating in a game, so that they could make a more informed decision about the winner. I chose not to show the team names as well, so that any bias for or against well-known teams (e.g., Duke) would not be a factor.

I also needed a crowd of people to answer these questions, and I chose to use Amazon's Mechanical Turk service to provide that crowd. For those of you who may not be familiar with Mechanical Turk, it is a service where people can post typically small, simple tasks and have other people perform them. The typical task requires some sort of human judgment that can't easily be performed by a computer, such as providing a label for an image, filling out a CAPTCHA, etc. A monetary value is assigned to each task, though these values are often quite small (e.g., $0.01) depending on the difficulty of the task.

Here's how I set up my prediction system:

First, I collected 22 facts about each team participating in the tournament. This was actually the most time-consuming aspect of the entire process. The facts I collected were:
  • Bracket (Midwest, East, West, South)
  • Seed
  • Conference
  • Overall Record
  • Conference Record
  • RPI Rank
  • Strength of Schedule Rank
  • Conference Rank
  • Record Against the Top 25
  • AP Poll Ranking
  • Coaches Poll Ranking
  • Average Points/Game By Leading Scorer
  • Points Scored Per Game
  • Points Allowed Per Game
  • Home Record
  • Away Record
  • Most Recent Streak
  • Turnovers Per Game
  • Team Field Goal %
  • Team Free Throw %
  • Team 3 Point %
  • Total Points Scored
Second, I wrote a program to generate the tasks that would be posted to Mechanical Turk from the facts that I collected. For each game, the program created 10 tasks. Each task presented 10 random facts about the two teams in a game and then asked the turker (what people who perform tasks on Mechanical Turk are called) to answer three questions:
  • A question about the presented facts of the form, "Which team has a better X?" The purpose of this question is to force the turker to read the facts and to filter out responses from turkers who did not read them. This question was inspired by Aniket Kittur et al.'s seminal paper on using Mechanical Turk for user studies at CHI 2008 (pdf).
  • A prediction for a winner between the two teams.
  • A prediction of the final score between the two teams. This question also turned out to be a filter for turkers who were not sufficiently knowledgeable about basketball. For example, responses with nonsensical or highly unlikely scores (usually far too low, such as 15-7) could be excluded.
Here's a screenshot of one of the tasks that I posted on Mechanical Turk.


Using this program, I collected answers from turkers starting with the first round of games in the tournament. Winners were determined based on which team the majority of turkers had picked, with each turker response weighted equally. Once I had winners for each game in a round, I generated and ran a new set of tasks for the subsequent round until the entire bracket was filled out.
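
As a minimal sketch of that aggregation step, the code below counts each valid response as one equally weighted vote and picks the majority winner. The TurkerResponse class, its field names, and the 30-point plausibility threshold are illustrative assumptions, not the actual code I ran.

import java.util.List;

// Hypothetical container for one turker's answers to a single game's task
class TurkerResponse {
    boolean passedFactCheck;   // answered the "Which team has a better X?" question correctly
    int predictedScoreA;       // predicted final score for team A
    int predictedScoreB;       // predicted final score for team B
    String predictedWinner;    // "A" or "B"
}

public class BracketAggregator {

    /**
     * Returns the majority pick ("A" or "B") for one game, weighting each
     * valid turker response equally. Responses that failed the fact-check
     * question or gave implausibly low scores are discarded, as described
     * above (the 30-point floor is just an illustrative guess).
     */
    public static String pickWinner(List<TurkerResponse> responses) {
        int votesForA = 0;
        int votesForB = 0;
        for (TurkerResponse r : responses) {
            boolean plausibleScore = r.predictedScoreA >= 30 && r.predictedScoreB >= 30;
            if (!r.passedFactCheck || !plausibleScore) {
                continue; // filtered out
            }
            if ("A".equals(r.predictedWinner)) {
                votesForA++;
            } else {
                votesForB++;
            }
        }
        return votesForA >= votesForB ? "A" : "B";
    }
}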

Tasks were typically posted for 2-6 hours and I paid $0.02 to $0.05 per prediction depending on how quickly I needed predictions submitted (paying more will often lead to quicker responses, but there is evidence suggesting it can also lead to inferior predictions). I ended up paying more for the predictions in the later rounds because I wanted to complete the bracket before the tournament games began (so I could enter the bracket in ESPN's tournament challenge) and I was running out of time to make that deadline.

The final bracket that the turkers selected is shown below, along with the correctness of their final predictions.


You may also still be able to view the bracket on ESPN's web site.

The first thing to notice is that the turkers primarily picked winners based on seed, which I found a little disappointing.

Turkers did pick a few upsets however. The first round upsets are particularly interesting, where the turkers picked the following teams against seed:
  • 9th seed Northern Iowa over 8th seed UCLA
  • 11th seed Washington over 6th seed Marquette
  • 10th seed St. Mary's over 7th seed Richmond
Interestingly, all of these upsets actually happened. The upsets are also not just from the 8/9 games in which the two teams are generally evenly matched, but from games with larger differences in seeding. Of course, the turkers missed a number of other upsets that happened in the first round (they missed 7 picks overall).

In the second round, turkers picked only 1 upset, 5th seed Butler over 4th seed Vanderbilt. Butler did win the game, though they were playing a 13th seed Murray State team that had eliminated Vanderbilt in the first round. Again, while accurately picking one upset that did happen, the turkers also missed many other upsets, including Northern Iowa's surprising upset of Kansas. The turkers missed 7 picks overall in the second round, though interestingly only one of these was a carry-over from a missed pick in the first round (Georgetown). In other words, all but one of the teams that the turkers missed in the first round went on to lose in the second round.

In the third round, the turkers finally picked an upset that did not happen, though they also correctly picked one upset. The upset that actually occurred was 3rd seed Baylor over 2nd seed Villanova, though of course in the actual tournament Villanova was eliminated by 10th seed St. Mary's in the second round and Baylor beat St. Mary's in the third round. The upset that turkers picked but did not happen was 3rd seed New Mexico over 2nd seed West Virginia. West Virginia won that game, although they beat the 11th seed Washington, which had eliminated New Mexico in the previous round. As in the previous rounds, the turkers also missed several upsets. Turkers missed 4 picks overall in the third round, with two being carry-overs from misses in the 2nd round (Kansas and the missed upset New Mexico).

In the remaining rounds, the turkers went by seed and selected all #1 seeds to make it into the final four. Only Duke actually made the final four, and that was the final correct pick that the turkers made. The turkers had Kentucky and Kansas in the championship game with Kansas winning it all, and obviously that did not happen in the real tournament.

While I was disappointed that the turkers picked primarily by seed, there are two features of their predictions that are interesting. First, it is impressive to me that nearly every upset winner that they chose actually won. This suggests that maybe this Mechanical Turk approach has some value for picking likely upsets based on factual data. Second, the turker picks suffered only five "carry-over misses" across their entire bracket, and three of those were due to the early exit of Kansas. This means that even when turkers were wrong about a team, that team would then likely lose in the following round. This may simply be an artifact of picking primarily by seed, but it is an advantageous feature in any bracket.

There are a bunch of interesting aspects of the turker data to look at that I haven't spent time on yet. In particular, there are a number of cases where the same turker provided multiple predictions for the same game (there is no way to prevent this using the current Mechanical Turk qualification system). None of the games I looked at were dominated by predictions from a single turker, so it does not seem that this bias significantly affected my results. Each task also used a different randomly chosen set of facts, so it is possible that a turker would not have even realized that they were providing a prediction for a game they had already predicted.

Overall, I paid about $35 in fees to turkers to collect these predictions. Clearly this was not a good investment, as the bracket did not win me any money. It was a fun experiment though, and it refreshed my knowledge of Mechanical Turk, which I may apply in some additional experiments over the summer.

I'll definitely try this experiment again next year, perhaps with some slight modifications, to see if I get similar, and possibly better, results.

-jn
