Friday, November 14, 2008

Scala and Bash in the same file

...or really any other language with C-like comments:

// 2>/dev/null; echo '
println("Hello world!");
/*
# Yes this script does two things.
//' > /dev/null
echo "Hello World!"
# */

How does it work?

To Scala, // 2>/dev/null; is just a line comment. To Bash, it looks like an attempt to execute the root directory (with the error message sent to /dev/null), followed by an echo whose quoted argument swallows all of the Scala code, keeping it away from Bash. Scala, in turn, happily ignores everything between the /* and the */.
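To try it, save the snippet to a file (say, hello.scala; the name is arbitrary) and feed it to each interpreter:

$ bash hello.scala
Hello World!
$ scala hello.scala
Hello world!

Bash prints the upper-case greeting, Scala the lower-case one.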

My first polyglot program!

Monday, October 27, 2008

Labels are Noisy Features

Every label is a simplification of something more complicated.
Every label could be wrong with non-zero probability.

A principle I've been mulling over lately is that you just can't trust annotators. Even if interannotator agreement is high, the manual they were trained on is a poor indicator of their understanding. Obvious examples come from the NetFlix challenge: the same person might use completely different features to rate two different movies as five stars. One might have a great soundtrack and the other might have some actor or another. Alternatively, in parsing, one might want to distinguish between different kinds of NPs, since the distribution of nouns in a subject NP is different from the distribution of nouns in an object NP.

A non-trivial amount of the work I've seen here at EMNLP and just from reading in the past couple of months can be cast as dealing with impoverished labels. Broadly, I think the approaches fall into three categories:
  1. Deterministic Splitting. Two reasonable ways of doing this. First, along the lines of Klein and Manning, 2003, take your labels (e.g. NP) and add information from other nearby labels (e.g. S) to produce a new label (NP^S). Alternatively, you can instead "lexicalize" your labels by adding information about observed features (e.g. NP becomes NP-dog).
  2. Addition of Latent Variables. This is like the machine learning version of the above. Instead of deterministically renaming features, assume that there is a latent variable that controls what label the human assigns and acts as an intermediary between the label and the feature. In a sense, turn the label into a feature. For example, if Y is your label, X your features, and Z your latent variables, then add a new latent variable B:

    Y -> Z -> X

    becomes something like

    B -> Y
    B -> Z
    Z -> X

    There's of course a lot more to be discussed here. How big should |Z| be? Is it discrete? What's the interaction with other labels? Some papers that work out (some of) these details are Petrov and Klein, 2008, and McCallum, et al. 2006. One might argue that any latent variable problem is an example of this phenomenon, but it seems that in general you gain by reserving one latent variable "just" for the additional layer of indirection. (The factorization sketched after this list makes the indirection concrete.)

  3. Assume a mixture of latent variables. I'm mostly interested in this method for the multilabel setting. Here, you assume that a number of unseen components lead to the actual labels. The one that I find most interesting at the moment is inspired by ICA, which assumes that the observed labels are a noisy combination of the "real" labels that actually created the data. (Zhang, et al. 2005)
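To make the indirection in (2) concrete, here is the factorization it implies, using the variable names from above (just a sketch; the papers cited differ in the details):

    p(Y, Z, X) = p(Y) p(Z | Y) p(X | Z)

    becomes

    p(B, Y, Z, X) = p(B) p(Y | B) p(Z | B) p(X | Z)

so Y stops being the sole bottleneck between the annotation and the features: B absorbs whatever extra structure the annotator was (implicitly) using when they picked the label.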

I don't think that this is especially profound, but it seems that too often people don't bother to try this simple extension.

Monday, September 22, 2008

Look Ma No Jars!

Anyone who's used Hadoop knows you have to use jars to package your MapReduces. You don't get a chance to specify a CLASSPATH, and your environment variables aren't respected since Hadoop runs as a different user. This is, to be sure, a good idea, but it can be awfully annoying to figure out exactly what your dependencies are.

Luckily, jars have Manifests, which let you specify meta-information like Main-Class and version information. And... a Class-Path listing other jars and directories the jar depends on. And there's our in.

Let's suppose that you have a Hadoop cluster all linked via NFS (or sshfs, or...), you don't like jars, and you've carefully set up your $CLASSPATH to contain all the classes you'd ever care to use. Then try out this script:
#!/usr/bin/env python
import os

home = os.path.expanduser("~")

out = open(os.path.join(home, ".super-manifest.txt"), "w")

out.write("Class-Path: ")

# Turn the CLASSPATH into manifest Class-Path entries.
classpath = os.environ["CLASSPATH"]
for x in classpath.split(':'):
    if x != '':
        if not x.endswith(".jar"):
            x = x + "/"  # dirs must end with a slash
        # Manifest continuation lines must start with a space.
        out.write(" %s \n" % x)

out.close()

os.system("mkdir -p ~/.superlibs/lib")
# A jar containing nothing but the manifest; its Class-Path does the real work.
os.system("jar cmf ~/.super-manifest.txt ~/.superlibs/lib/supererJar.jar")
# Wrap it in a second jar so the manifest jar sits under lib/, where Hadoop
# picks up dependencies from inside the job jar.
os.system("jar cf ~/.superlibs/superJar.jar -C ~/.superlibs/ lib/supererJar.jar")
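With a CLASSPATH of, say, /home/you/classes:/home/you/lib/foo.jar (made-up paths, just for illustration), the generated ~/.super-manifest.txt looks roughly like this:

Class-Path:  /home/you/classes/
  /home/you/lib/foo.jar

Each continuation line begins with a space, which is how jar manifests wrap long values, and directories get the trailing slash the script adds.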
Now, in your code, when you set up your job:
jobConf.setJar("/home/YOURNAME/.superlibs/superJar.jar");
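More fully, the setup might look something like this (a sketch against the old org.apache.hadoop.mapred API; the job name and path are made up):

import org.apache.hadoop.mapred.JobConf

val jobConf = new JobConf()
jobConf.setJobName("look-ma-no-jars")
// Point Hadoop at the shared wrapper jar instead of packaging one per job.
jobConf.setJar("/home/YOURNAME/.superlibs/superJar.jar")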
And give it a spin. Hadoop, with no jars!

Thursday, September 04, 2008

Evolution, and Empiricism

I had a thought last night that I've been toying with all day. It seems obvious that certain kinds of knowledge are completely inaccessible to other animals: language with recursion, "deep" representation of concepts, etc. They don't have this, of course, because Nature hasn't selected for them.

We have (to some extent) both of these traits, but it seems naive to think that we aren't limited in the same way, so we can only assume that there are certain aspects of the universe humanity cannot understand.

This is not to say that I'm advocating the teaching of intelligent design, or the space unicorn, or whatever, and I'm certainly not advocating transhumanism. But it does put into doubt radical empiricism: that only those things we can observe or reason out exist. In a sense, if we can't ever know about some certain aspect of the world, then it's indistinguishable from chance, and should not concern us. So from that anthropocentric point of view, radical empiricism is justified.

But what if we're limited to incomplete knowledge of something? Then any form of empiricism seems less justified.

Enough philosophy.

Wednesday, September 03, 2008

SMR, Hadoop, and Scala

I've more or less finished up the port of SMR to Hadoop. See the linked blog post for more information.

Let me know here (or there) what you'd like to see in future versions of SMR.

Monday, July 14, 2008

Monadic Sampling in Scala

(If you don't know what Monads are, see here. Short answer: they're a design pattern coupled with syntactic sugar for implementing "wrappers" around an arbitrary type T.)

I was initially quite resistant to monads, probably because I didn't really grok the syntax that Haskell was pushing. Seeing Scala's for-comprehensions, which are just like Haskell's do-notation, I've changed my mind, and in an effort to teach myself Monads, I've decided to write a simple library for doing sampling for Bayesian inference. What follows is inspired in part by this paper and this blog series.

For motivation, let's look at the specification of a generative model in standard notation:
θ ~ Dir(α)
z ~ Mult(θ)
w ~ Mult(β_z)
Our goal is to make a syntax that looks a lot like that for generating new data. We'll end up with something that looks like:

for( theta <- Dir(alpha);
     z <- Mult(theta);
     w <- Mult(beta_z))
  yield w;
which isn't so bad.
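Under the hood, Scala desugars that for-comprehension into flatMap and map calls on whatever Dir and Mult return (they're the hypothetical distribution constructors from the notation above), roughly:

Dir(alpha).flatMap { theta =>
  Mult(theta).flatMap { z =>
    Mult(beta_z).map { w => w }
  }
}

so all we need is a type that supports flatMap and map, which is exactly what the Rand trait below provides.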

Let's start with a simple monad in the Scala tradition:
trait Rand[T] {
  def get() : T

  def draw() = get()

  def sample(n : Int) = List.tabulate(n,x => draw());

  def flatMap[E](f : T => Rand[E] ) : Rand[E]

  def map[E](f : T=>E) : Rand[E]

  def filter(p: T=>Boolean) = condition(p);

  def condition(p : T => Boolean) : Rand[T];
}
Thus far, nothing terribly special. Aside from normal monadic stuff and aliases, we have draw and get, which intuitively should give us a sample from the random distribution. sample is just an easy way to get many samples at once.

A default implementation for most of these methods is fairly straightforward:
trait Rand[T] {
  def get() : T // still undefined

  def draw() = get();

  def sample(n : Int) = List.tabulate(n,x => get);

  // def (not val), so that f(get()) is re-evaluated on every draw from the result.
  def flatMap[E](f : T => Rand[E]) = {
    def x = f(get());
    new Rand[E] {
      def get = x.get;
    }
  }

  def map[E](f : T=>E) = {
    def x = f(get());
    new Rand[E] {
      def get = x;
    }
  }

  def filter(p: T=>Boolean) = condition(p);

  // Not the most efficient implementation ever, but meh.
  // Rand.this.get is needed: a bare get inside the anonymous class would
  // recurse on its own get instead of sampling from the underlying Rand.
  def condition(p : T => Boolean) = new Rand[T] {
    def get() = {
      var x = Rand.this.get;
      while(!p(x)) {
        x = Rand.this.get;
      }
      x
    }
  }

}

The only difference from a normal "container" monad like Option is that the Rand returned by map and flatMap re-evaluates get() on every draw, to ensure that we're still being random.
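For instance, a concrete Rand only has to supply get, and anything derived from it via map re-samples on every draw (a quick sketch; coin and heads are made-up names):

val coin = new Rand[Boolean] {
  private val r = new java.util.Random;
  def get = r.nextBoolean;
}

val heads = coin.map(b => if (b) 1 else 0)
heads.draw(); // flips a fresh coin
heads.draw(); // flips another one, so repeated draws can differ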

Now for some basic random number generators. These live in object Rand, but that's omitted here for something akin to clarity. First, we have the analogue of Haskell's return, for completeness.

def always[T](t : T) = new Rand[T] {
  def get = t;
}


Straightforward enough. Now for our building block:

val uniform = new Rand[Double] {
  private val r = new java.util.Random;
  def get = r.nextDouble;
}


Which lets us do things like:

scala> Rand.uniform.sample(10)
res6: List[Double] = List(0.8940037286915604, 0.34021110772450114,
0.2045633072974703, 0.44871569906073616, 0.47697121133477594,
0.8410830818576492, 0.6738322287017577, 0.16060602963773707,
0.602623326916021, 0.34327615862458416)


and:


scala> val twice = for( x <- Rand.uniform) yield x * 2;
twice: java.lang.Object with scalanlp.stats.Rand[Double] =
scalanlp.stats.Rand$$anon$3@5c772046

scala> twice.sample(10)
res7: List[Double] = List(1.8602320334301579, 0.0872446976570771,
0.032309170483379335, 1.9753336995209254, 1.220452839716684,
1.0214181828533413, 1.41457180527561, 1.6988361279393165,
1.460110077486223, 0.6762038442765996)


So we already have some use of do-notation. And if that's not convincing for you, consider sampling two correlated Gaussian variables. First, we need a univariate Gaussian:

val gaussian = new Rand[Double] {
  private val r = new java.util.Random;
  def get = r.nextGaussian;
}

// m is the mean and s is the standard deviation, so s^2 is the variance
def gaussian(m : Double, s : Double) = new Rand[Double] {
  def get = m + s * gaussian.get
}
Then sampling two independent Gaussians is straightforward:

scala> val biGauss = for(x <- Rand.gaussian; y <- Rand.gaussian) yield (x,y)
biGauss: java.lang.Object with scalanlp.stats.Rand[(Double, Double)] =
scalanlp.stats.Rand$$anon$2@caa6635

scala> biGauss.sample(3)

res9: List[(Double, Double)] = List((-0.9601505823179303,-0.1480670696609196),
(0.02594332256575975,0.02401831998712138),
(1.4885591927916324,1.1998923591137476))

Now suppose we want to draw two correlated Gaussians. That is, Gaussians where knowing one tells you something about the other. Drawing on this article, suppose that z1 and z2 are independent draws from the standard normal distribution, and that we are interested in sampling normals (x,y) with means mu1 and mu2, standard deviations s1 and s2, and correlation r. Then we can draw x and y by computing:

x = mu1 + s1 * z1
y = mu2 + s2 * (r * z1 + sqrt(1 - r^2) * z2)

(Since z1 and z2 are independent with unit variance, x has variance s1^2, y has variance s2^2, and their covariance is s1 * s2 * r, i.e. correlation r.)
With that, we can easily write a generator in Scala:

scala> def corrGauss(mu1 : Double, mu2: Double, s1: Double, s2 : Double, r: Double) =
| for( (z1,z2) <- biGauss;
| x = mu1 + s1 * z1;
| y = mu2 + s2 * (r * z1 + Math.sqrt(1 - r * r) * z2))
| yield (x,y);
corrGauss: (Double,Double,Double,Double,Double)java.lang.Object with scalanlp.stats.Rand[(Double, Double)]


And that's it! Next time, I'll write about Multinomials, Dirichlets, and generating more data.

Sunday, May 18, 2008

Dynamic Code Reloading in the Scala Interpreter

A week ago I went to the Scala Liftoff Unconference, which was a pretty fun time. At some point I'll write about a conversation I had with Greg Meredith about Bayes and category theory, but only when I understand it all better myself.

The Platinum Sponsor, ZeroTurnaround, gave everyone there a free copy of JavaRebel, which automatically reloads classes on the JVM whenever you change them. No need to restart your java instance. Pretty sweet.

This got me thinking: one of the things I love about Scala is the interpreter. In particular, one of its greatest advantages is ":load", which compiles and executes your Scala file and lets you play with the results. Unfortunately, the interpreter doesn't work so well once your code base gets over a certain size. But JavaRebel fixes all that: just recompile your class files in your editor, and JavaRebel will autoreload them! Magic!

Example:

foo.scala:

class Foo {
  def bar() = 3;
}


Then, compile it, and open up a modified jrscala interpreter (more on this below):


##########################################################

ZeroTurnaround JavaRebel 1.1
(c) Copyright Webmedia, Ltd, 2007. All rights reserved.

This product is licensed to David Hall

##########################################################

Welcome to Scala version 2.7.1.final (Java HotSpot(TM) Client VM, Java 1.5.0_13).
Type in expressions to have them evaluated.
Type :help for more information.

scala> val f = new Foo
f: Foo = Foo@950feb

scala> f.bar
res0: Int = 3


Now, open back up foo.scala, and change it:

class Foo {
  def bar() = 10;
}


Recompile, and go back to your interpreter:


scala> f.bar
f.bar
JavaRebel: Reloading class 'Foo'.
res1: Int = 10

Magic! Ok, so how did I do this? Well, I just made a copy of the scala launcher script (calling it jrscala) and added two options to its very last line, namely

-noverify -javaagent:/path/to/your/javarebel.jar

So it looks like:

${JAVACMD:=java} $JAVA_OPTS -cp "$TOOL_CLASSPATH" -noverify \
-javaagent:/path/to/your/javarebel.jar -Dscala.home="$SCALA_HOME" \
-Denv.classpath="$CLASSPATH" -Denv.emacs="$EMACS" scala.tools.nsc.MainGenericRunner "$@"

And you're done. Obviously I should be more principled/less hacky about it, but it's pretty slick.