Monday, October 27, 2008

Labels are Noisy Features

Every label is a simplification of something more complicated.
Every label could be wrong with non-zero probability.

A principle I've been mulling over lately is that you just can't trust annotators. Even if inter-annotator agreement is high, the manual they were trained on is a poor indicator of their actual understanding. Obvious examples come from the Netflix challenge: the same person might use completely different features to rate two different movies five stars. One might have a great soundtrack and the other might have some actor or other. Alternatively, in parsing, one might want to distinguish between different kinds of NPs, since the distribution of nouns in a subject NP is different from the distribution of nouns in an object NP.

A non-trivial amount of the work I've seen here at EMNLP, and just from reading over the past couple of months, can be cast as dealing with impoverished labels. Broadly, I think the approaches fall into three categories:
  1. Deterministic Splitting. There are two reasonable ways of doing this. First, along the lines of Klein and Manning, 2003, take your labels (e.g. NP) and add information from other nearby labels (e.g. S) to produce a new label (NP^S). Alternatively, you can "lexicalize" your labels by adding information about observed features (e.g. NP becomes NP-dog). Both moves are shown in the first code sketch after this list.
  2. Addition of Latent Variables. This is like the machine learning version of the above. Instead of deterministically renaming labels, assume that there is a latent variable that controls which label the human assigns and acts as an intermediary between the label and the features. In a sense, turn the label into a feature. For example, if Y is your label, X your features, and Z your latent variables, then add a new latent variable B:

    Y -> Z -> X

    becomes something like

    B -> Y
    B -> Z
    Z -> X

    There's of course a lot more to be discussed here. How big should |Z| be? Is it discrete? What's the interaction with other labels? Some papers that work out (some of) these details are Petrov and Klein, 2008, and McCallum et al., 2006. One might argue that any latent variable problem is an example of this phenomenon, but it seems that in general you gain by reserving one latent variable "just" for the additional layer of indirection. A toy EM sketch of this structure follows the list.

  3. Mixture of Latent Variables. I'm mostly interested in this method for the multilabel setting. Here, you assume that a number of unseen components lead to the actual labels. The version I find most interesting at the moment is inspired by ICA: assume the observed labels are a noisy combination of the "real" labels that actually generated the data (Zhang et al., 2005). The last sketch below is a rough version of this idea.
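
To make (1) concrete, here's a minimal Python sketch of deterministic splitting. The inputs (a label, its parent label, and a head word) and the function names are my own illustration, not any parser's actual API:

    # Parent annotation a la Klein and Manning, 2003: NP under S becomes NP^S.
    def parent_annotate(label, parent):
        return label + "^" + parent

    # Lexicalization: NP headed by "dog" becomes NP-dog.
    def lexicalize(label, head_word):
        return label + "-" + head_word

    print(parent_annotate("NP", "S"))   # NP^S  -- subject NP
    print(parent_annotate("NP", "VP"))  # NP^VP -- object NP
    print(lexicalize("NP", "dog"))      # NP-dog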
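
For (2), here's a toy EM sketch of the B -> Y, B -> Z, Z -> X structure, collapsing the Z -> X chain so that the latent subtype B emits both the observed label Y and a single feature X directly. All the sizes and names are made up and the data is random; this is my own minimal illustration, not Petrov and Klein's actual split-merge procedure:

    import numpy as np

    rng = np.random.default_rng(0)
    N, V, L, K = 500, 20, 2, 3               # examples, vocab size, labels, subtypes per label
    X = rng.integers(0, V, size=N)           # one observed feature per example
    Y = rng.integers(0, L, size=N)           # observed (coarse, noisy) labels
    B = L * K                                # total number of latent subtypes b

    pi = np.full(B, 1.0 / B)                 # P(b)
    p_y = rng.dirichlet(np.ones(L), size=B)  # P(y | b): how b surfaces as a label
    p_x = rng.dirichlet(np.ones(V), size=B)  # P(x | b): how b generates features

    for _ in range(20):
        # E-step: posterior over the latent subtype for every example.
        post = pi * p_y[:, Y].T * p_x[:, X].T        # shape (N, B)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate all three tables from expected counts (lightly smoothed).
        pi = post.sum(axis=0) / N
        for y in range(L):
            p_y[:, y] = post[Y == y].sum(axis=0) + 1e-6
        p_y /= p_y.sum(axis=1, keepdims=True)
        for x in range(V):
            p_x[:, x] = post[X == x].sum(axis=0) + 1e-6
        p_x /= p_x.sum(axis=1, keepdims=True)

After training, each coarse label y is explained by K latent refinements, which is exactly the extra layer of indirection the diagram above buys you.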
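
And for (3), a rough sketch of the ICA view of multilabel data: stack the observed 0/1 labels into an N x L matrix and unmix it into a handful of latent sources. It fabricates its own data and uses scikit-learn's FastICA as a stand-in, so it illustrates the general shape of the idea rather than Zhang et al.'s exact model:

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(1)
    N, L, S = 300, 10, 3                     # examples, observed labels, latent "real" labels

    sources = (rng.random((N, S)) < 0.3).astype(float)  # latent label assignments
    mixing = rng.random((S, L))                         # how latent labels leak into observed ones
    noise = 0.1 * rng.standard_normal((N, L))
    observed = (sources @ mixing + noise > 0.5).astype(float)  # the N x L label matrix we actually see

    ica = FastICA(n_components=S, random_state=0)
    recovered = ica.fit_transform(observed)             # estimated latent labels, shape (N, S)
    print(recovered.shape)                              # (300, 3)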

I don't think any of this is especially profound, but it seems that too often people don't bother to try these simple extensions.