Monday, September 22, 2008

Look Ma No Jars!

Anyone who's used Hadoop knows you have to package your MapReduce jobs as jars. You don't get a chance to specify a CLASSPATH, and your environment variables aren't respected since Hadoop runs as a different user. Shipping self-contained jars is, to be sure, a good idea, but it can be awfully annoying to figure out exactly what your dependencies are.

Luckily, jars have Manifests, which let you specify meta-information like Main-Class and version information. And... a Class-Path for other dependencies on jars and directories. And there's the in.
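One wrinkle worth knowing: the JAR specification caps manifest lines at 72 bytes, and a longer value continues on the next line, which must start with a single space. Here is a minimal sketch of what a wrapped Class-Path header looks like. The function name format_class_path and the example paths are made up for illustration.

```python
def format_class_path(entries, width=72):
    """Return a Class-Path manifest header as bytes, wrapped at `width` bytes.

    Continuation lines start with a single space, which the manifest
    parser drops when it joins the pieces back together.
    """
    data = ("Class-Path: " + " ".join(entries)).encode("utf-8")
    lines = [data[:width]]
    rest = data[width:]
    while rest:
        # The leading space counts toward the line length.
        lines.append(b" " + rest[:width - 1])
        rest = rest[width - 1:]
    return b"\n".join(lines) + b"\n"


if __name__ == "__main__":
    # Hypothetical paths, purely for illustration.
    header = format_class_path([
        "/home/me/project/build/classes/",
        "/home/me/libs/commons-logging-1.0.4.jar",
        "/home/me/libs/lucene-core-2.3.2.jar",
    ])
    print(header.decode("utf-8"))
```

Entries are separated by spaces, and directories need a trailing slash so they aren't mistaken for jar files.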

Let's suppose that you have a Hadoop cluster all linked via NFS (or sshfs, or...), and you don't like jars, and you've carefully set up your $CLASSPATH to contain all the classes you'd ever care to use. Then try out this script:
#!/usr/bin/env python
# Creates a jar from a CLASSPATH.
import os

# open() does not expand "~" the way the shell does, so expand it here
manifest = os.path.expanduser("~/.super-manifest.txt")
out = open(manifest, "w")

out.write("Class-Path: ")

classpath = os.environ["CLASSPATH"]
for x in classpath.split(':'):
    if x != '':
        if not x.endswith(".jar") and not x.endswith("/"):
            x = x + "/"  # dirs must end with a slash
        # the leading space marks a manifest continuation line
        out.write(" %s \n" % x)

out.close()

os.system("mkdir -p ~/.superlibs/lib")
# supererJar.jar holds nothing but the manifest
os.system("jar cmf ~/.super-manifest.txt ~/.superlibs/lib/supererJar.jar")
# superJar.jar wraps it under lib/
os.system("jar cf ~/.superlibs/superJar.jar -C ~/.superlibs/ lib/supererJar.jar")
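If you want to check what actually landed in the manifest, a jar is just a zip file, so you can read it back. This is a rough sketch, not part of the script above; read_class_path is a name I made up, and it only looks at the Class-Path header.

```python
import zipfile


def read_class_path(jar_path):
    """Return the Class-Path entries from a jar's manifest as a list."""
    with zipfile.ZipFile(jar_path) as jar:
        text = jar.read("META-INF/MANIFEST.MF").decode("utf-8")
    # Undo line wrapping: a newline followed by one space is a continuation.
    text = text.replace("\r\n", "\n").replace("\n ", "")
    for line in text.split("\n"):
        if line.startswith("Class-Path:"):
            return line[len("Class-Path:"):].split()
    return []
```

Calling read_class_path on the expanded path of ~/.superlibs/lib/supererJar.jar should give back the same directories and jars that were in your $CLASSPATH.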
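Hadoop puts every jar under lib/ in the job jar on the task classpath, so superJar.jar drags supererJar.jar, and with it your whole $CLASSPATH, along for the ride. Now, in your code, when you set up your job: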
jobConf.setJar("/home/YOURNAME/.superlibs/superJar.jar");
And give it a spin. Hadoop, with no jars!
