Showing posts with label hadoop. Show all posts

Monday, September 22, 2008

Look Ma No Jars!

Anyone who's used Hadoop knows you have to use jars to package your MapReduces. You don't get a chance to specify a CLASSPATH, and your environment variables aren't respected since Hadoop runs as a different user. This is, to be sure, a good idea, but it can be awfully annoying to figure out exactly what your dependencies are.

Luckily, jars have Manifests, which let you specify meta-information like Main-Class and version information. And... a Class-Path for other dependencies on jars and directories. And there's the in.
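To make the manifest format concrete, here is a minimal sketch of how a Class-Path header could be laid out by hand. The function name `class_path_header` and the `width` parameter are made up for illustration. It relies on two rules from the JAR file specification: entries are separated by spaces, and a long header is continued on following lines that each start with a single space. The spec caps lines at 72 bytes; the sketch wraps at 70 to leave room for the line terminator, and it assumes plain ASCII paths.

```python
def class_path_header(entries, width=70):
    """Format a Class-Path manifest header, folding long lines.

    Hypothetical helper: assumes ASCII paths, so characters == bytes.
    """
    text = "Class-Path: " + " ".join(entries)
    lines = [text[:width]]
    rest = text[width:]
    while rest:
        # A continuation line starts with one space, leaving width - 1 for content.
        lines.append(" " + rest[:width - 1])
        rest = rest[width - 1:]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Directories end with a slash, jars do not.
    print(class_path_header(["/home/me/classes/", "/home/me/lib/foo.jar"]), end="")
```

A short classpath fits on one line; a long one gets folded, and a manifest reader glues the pieces back together by dropping each newline plus the leading space.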

Let's suppose that you have a Hadoop cluster all linked via NFS (or sshfs, or...), and you don't like jars, and you've carefully set up your $CLASSPATH to contain all the classes you'd ever care to use. Then try out this script:
#!/usr/bin/env python
import os

# open() does not expand "~" on its own, so resolve the home directory first.
out = open(os.path.expanduser("~/.super-manifest.txt"), "w")

out.write("Class-Path: ")

# Creates a Jar from a CLASSPATH:
classpath = os.environ["CLASSPATH"]
for x in classpath.split(':'):
    if x != '':
        if not x.endswith(".jar") and not x.endswith("/"):
            x = x + "/"  # dirs must end with a slash
        # The leading space makes each line a continuation of Class-Path.
        out.write(" %s \n" % x)

out.close()

# Build a manifest-only jar, then wrap it inside a second jar under lib/.
os.system("mkdir -p ~/.superlibs/lib")
os.system("jar cmf ~/.super-manifest.txt ~/.superlibs/lib/supererJar.jar")
os.system("jar cf ~/.superlibs/superJar.jar -C ~/.superlibs/ lib/supererJar.jar")
Now, in your code, when you set up your job:
jobConf.setJar("/home/YOURNAME/.superlibs/superJar.jar");
And give it a spin. Hadoop, with no jars!
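If the job can't find your classes, it helps to check what actually ended up in the jar. As far as I understand it, the trick works because Hadoop puts jars found under lib/ inside the job jar on the task classpath, and the JVM then follows the inner jar's Class-Path. Here is a rough sketch that reads that Class-Path back out of the nested jar using only Python's standard `zipfile` module. The function name `nested_class_path` is made up for illustration, and the default inner path simply matches the script above.

```python
import io
import zipfile


def nested_class_path(outer_jar, inner_name="lib/supererJar.jar"):
    """Return the Class-Path entries of a jar nested inside another jar.

    Hypothetical helper; outer_jar may be a path or a file-like object.
    """
    with zipfile.ZipFile(outer_jar) as outer:
        inner_bytes = outer.read(inner_name)
    with zipfile.ZipFile(io.BytesIO(inner_bytes)) as inner:
        manifest = inner.read("META-INF/MANIFEST.MF").decode("utf-8")
    # Unfold continuation lines: a newline followed by one space is a fold.
    unfolded = manifest.replace("\r\n", "\n").replace("\n ", "")
    for line in unfolded.split("\n"):
        if line.startswith("Class-Path:"):
            return line[len("Class-Path:"):].split()
    return []
```

Calling it with the expanded path of ~/.superlibs/superJar.jar should give back the same directories and jars that were in your $CLASSPATH, directories with a trailing slash.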