Monday, September 22, 2008

Look Ma No Jars!

Anyone who's used Hadoop knows you have to use jars to package your MapReduce jobs. You don't get a chance to specify a CLASSPATH, and your environment variables aren't respected since Hadoop runs as a different user. This is, to be sure, a good idea, but it can be awfully annoying to figure out exactly what your dependencies are.

Luckily, jars have Manifests, which let you specify meta-information like Main-Class and version information. And... a Class-Path for other dependencies on jars and directories. And there's the in.

Let's suppose that you have a Hadoop cluster all linked via NFS (or sshfs, or...), and you don't like jars, and you've carefully set up your $CLASSPATH to contain all the classes you'd ever care to use. Then try out this script:
#!/usr/bin/env python
# Creates a jar whose manifest Class-Path mirrors $CLASSPATH.
import os

# open() does not expand "~", so expand it explicitly
manifest = os.path.expanduser("~/.super-manifest.txt")

out = open(manifest, "w")
out.write("Class-Path: ")

classpath = os.environ["CLASSPATH"]
for x in classpath.split(':'):
    if x != '':
        if not x.endswith(".jar"):
            x = x + "/"  # dirs must end with a slash
        # the leading space marks a manifest continuation line
        out.write(" %s \n" % x)

out.close()

os.system("mkdir -p ~/.superlibs/lib")
os.system("jar cmf ~/.super-manifest.txt ~/.superlibs/lib/supererJar.jar")
os.system("jar cf ~/.superlibs/superJar.jar -C ~/.superlibs/ lib/supererJar.jar")
Now, in your code, when you set up your job:
jobConf.setJar("/home/YOURNAME/.superlibs/superJar.jar");
And give it a spin. Hadoop, with no jars!
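
If a job still can't find a class, it can help to check which Class-Path actually ended up in the jar's manifest. Here's a minimal sketch of how one might do that in Python; the helper name read_class_path is made up for illustration, and it assumes the manifest is stored as UTF-8 under the standard META-INF/MANIFEST.MF name.

```python
import zipfile


def read_class_path(jar_path):
    """Return the Class-Path entries from a jar's manifest (hypothetical helper)."""
    with zipfile.ZipFile(jar_path) as jar:
        text = jar.read("META-INF/MANIFEST.MF").decode("utf-8")

    # In a manifest, a line starting with a single space continues the
    # previous line; the leading space itself is dropped when joining.
    logical = []
    for line in text.splitlines():
        if line.startswith(" ") and logical:
            logical[-1] += line[1:]
        else:
            logical.append(line)

    # Find the Class-Path header (header names are case-insensitive)
    # and split its value on whitespace.
    for line in logical:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "class-path":
            return value.split()
    return []
```

Note that the Class-Path lives in the inner jar, so you'd point this at the expanded form of ~/.superlibs/lib/supererJar.jar (say, via os.path.expanduser) rather than at superJar.jar itself.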

Thursday, September 04, 2008

Evolution, and Empiricism

I had a thought last night that I've been toying with all day. It seems obvious that certain kinds of knowledge are completely inaccessible to other animals: language with recursion, "deep" representation of concepts, etc. They don't have these, of course, because Nature hasn't selected for them.

We have (to some extent) both of these traits, but it seems naive to think that we aren't limited in the same way, so we can only assume that there are certain aspects of the universe humanity cannot understand.

This is not to say that I'm advocating the teaching of intelligent design, or the space unicorn, or whatever, and I'm certainly not advocating transhumanism. But it does put into doubt radical empiricism: that only those things we can observe or reason out exist. In a sense, if we can't ever know about some certain aspect of the world, then it's indistinguishable from chance, and should not concern us. So from that anthropocentric point of view, radical empiricism is justified.

But what if we're entitled to incomplete knowledge of something? Then any form of empiricism seems less justified.

Enough philosophy.

Wednesday, September 03, 2008

SMR, Hadoop, and Scala

I've more or less finished up the port of SMR to Hadoop. See the blogpost linked for information.

Let me know here (or there) what you'd like to see in future versions of SMR.