Wednesday, April 08, 2009

Bandwidth

Some back-of-the-envelope calculations:

You have 10 billion nerve cells in your brain, with an average of 10,000 synapses per nerve cell. Each synapse fires about 100 times a second, within an order of magnitude. If each firing carries one bit of information, that puts the bandwidth of your brain at around 10 quadrillion (10^16) bits every second, or about a petabyte per second.

Compare that to the bandwidth of the internet, which in 2004 was a mere 4,200 petabytes per year, or roughly 133 gigabytes per second. That is, in 2004, the amount of data transferred on the entire internet was just 0.0135% of the bandwidth going on in an average person.

If we're being extremely optimistic and assuming that the internet will double its bandwidth every year, then it will take roughly 13 years for the internet to reach the bandwidth of a single human brain, and another 35-ish years to reach the thinking power of the world's population.
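
If you want to play with the numbers, here is the whole chain of arithmetic as a small Scala sketch. All of the figures come from the paragraphs above; the rounding differs slightly from the quoted percentages, as befits the back of an envelope.

object BrainBandwidth {
  def main(args: Array[String]): Unit = {
    // Brain: 1e10 neurons x 1e4 synapses x 100 firings/sec x 1 bit each.
    val brainBitsPerSec  = 1e10 * 1e4 * 100     // ~1e16 bits/sec
    val brainBytesPerSec = brainBitsPerSec / 8  // ~1.25e15 bytes/sec: about a petabyte

    // Internet, 2004: 4,200 (decimal) petabytes over one year.
    val internetBytesPerSec = 4200e15 / (365.0 * 24 * 3600)  // ~1.3e11 bytes/sec

    val ratio = internetBytesPerSec / brainBytesPerSec  // ~1e-4, i.e. ~0.01%

    // Doubling every year: years to match one brain, then ~6.7e9 of them.
    def log2(x: Double) = math.log(x) / math.log(2)
    println("years to one brain:  " + log2(1 / ratio))  // ~13
    println("more years to world: " + log2(6.7e9))      // ~33
  }
}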

These calculations aren't entirely fair (not to mention horribly inexact) for a lot of reasons. In particular, there is a lot of error in both directions: the internet isn't going to grow that fast, and the human brain has a lot of redundancy. Regardless, fun talking point.

Friday, February 20, 2009

Excerpt

... known that it is no coincidence that the sign for integration ∫ matches the contour of the Taijitu's interior, nor that the limits of integration were placed at the seeds of the other, the dual centers toward which both yin and yang are inextricably drawn. Leibniz, ever the sinophile, knew that motion and stasis were linked through the slope of time, that change could be captured in a single stroke...

Unfortunately, Leibniz ignored the most crucial point, that the whole is often far more than the limit of sums, and often far less...

Friday, November 14, 2008

Scala and Bash in the same file

...or really any other language with C-like comments:

// 2>/dev/null; echo '
println("Hello world!");
/*
# Yes this script does two things.
//' > /dev/null
echo "Hello World!"
# */

How does it work?

// 2>/dev/null; looks like a comment to Scala, but to Bash it looks like an attempt to execute the root directory (with the error message sent to /dev/null). The 'echo' bit encapsulates all the Scala code in a quoted string (which is then discarded to /dev/null), keeping it away from Bash, and Scala happily ignores the parts between the /* */.
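
(To try it: save the snippet as, say, hello.scala. Running it under bash should print the capital-W greeting from the echo line, while running it through the scala script runner should print the lowercase one from println.)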

My first polyglot program!

Monday, October 27, 2008

Labels are Noisy Features

Every label is a simplification of something more complicated.
Every label could be wrong with non-zero probability.

A principle I've been mulling over lately is that you just can't trust annotators. Even if inter-annotator agreement is high, the manual they were trained with is a poor indicator of their actual understanding. Obvious examples show up in the Netflix challenge: the same person might use completely different features to rate two different movies five stars. One might have a great soundtrack and the other might have some actor or other. Alternatively, in parsing, one might want to distinguish between different kinds of NPs, since the distribution of nouns in a subject NP is different from the distribution of nouns in an object NP.

A non-trivial amount of the work I've seen here at EMNLP and just from reading in the past couple of months can be cast as dealing with impoverished labels. Broadly, I think the approaches fall into three categories:
  1. Deterministic Splitting. There are two reasonable ways of doing this. First, along the lines of Klein and Manning, 2003, take your labels (e.g. NP) and add information from other nearby labels (e.g. S) to produce a new label (NP^S). Alternatively, you can "lexicalize" your labels by adding information about observed features (e.g. NP becomes NP-dog). There's a toy sketch of both transforms after this list.
  2. Addition of Latent Variables. This is like the machine learning version of the above. Instead of deterministically renaming features, assume that there is a latent variable that controls what label the human assigns and acts as an intermediary between the label and the feature. In a sense, turn the label into a feature. For example, if Y is your label, X your features, and Z your latent variables, then add a new latent variable B:

    Y -> Z -> X

    becomes something like

    B -> Y
    B -> Z
    Z -> X

    There's of course a lot more to be discussed here. How big should |Z| be? Is it discrete? How does it interact with the other labels? Some papers that work out (some of) these details are Petrov and Klein, 2008, and McCallum, et al., 2006. One might argue that any latent variable problem is an example of this phenomenon, but it seems that in general you gain by reserving one latent variable "just" for the additional layer of indirection. (The factorization after this list spells out what the diagrams mean.)

  3. Assume a mixture of latent variables. I'm mostly interested in this method for the multilabel setting. Here, you assume that a number of unseen components lead to the actual labels. The approach I find most interesting at the moment is inspired by ICA: it assumes that the observed labels are a noisy combination of the "real" labels that actually generated the data (Zhang, et al., 2005).
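
Here is the toy sketch promised in option 1; both splittings are just relabelings of a tree. The Node class and its head-word field are invented for illustration — only the NP^S and NP-dog transforms come from the text above.

case class Node(label: String, children: List[Node], head: Option[String] = None)

// Parent annotation (Klein and Manning, 2003): NP under S becomes NP^S.
def annotateParents(n: Node, parent: Option[String] = None): Node =
  n.copy(label = parent.fold(n.label)(p => n.label + "^" + p),
         children = n.children.map(annotateParents(_, Some(n.label))))

// Lexicalization: splice in the head word, so NP becomes NP-dog.
def lexicalize(n: Node): Node =
  n.copy(label = n.head.fold(n.label)(w => n.label + "-" + w),
         children = n.children.map(lexicalize))

For example, annotateParents(Node("S", List(Node("NP", Nil, Some("dog"))))) relabels the child to NP^S, while lexicalize relabels it to NP-dog.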
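
And the factorization promised in option 2: the original chain Y -> Z -> X corresponds to p(x, y) = sum_z p(y) p(z|y) p(x|z), while the version with B becomes p(x, y) = sum_b sum_z p(b) p(y|b) p(z|b) p(x|z), so the human-assigned label y is just one more noisy emission of the latent b. (This is my gloss on the diagrams, not a formula taken from the cited papers.)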

I don't think that this is especially profound, but it seems that too often people don't bother to try this simple extension.

Monday, September 22, 2008

Look Ma No Jars!

Anyone who's used Hadoop knows you have to use jars to package your MapReduces. You don't get a chance to specify a CLASSPATH, and your environment variables aren't respected since Hadoop runs as a different user. This is, to be sure, a good idea, but it can be awfully annoying to figure out exactly what your dependencies are.

Luckily, jars have manifests, which let you specify meta-information like Main-Class and version information. And... a Class-Path naming other jars and directories to depend on. And there's our in.
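
For reference, a minimal manifest with a Class-Path looks something like the following (the paths here are made up; note that directory entries must end with a slash):

Manifest-Version: 1.0
Class-Path: lib/some-dep.jar /shared/nfs/classes/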

Let's suppose that you have a hadoop cluster all linked via NFS (or sshfs, or...), and you don't like jars, and you've carefully set up your $CLASSPATH to contain all the classes you'd ever care to use. Then try out this script:
#!/usr/bin/env python
import os

home = os.path.expanduser("~")

out = open(os.path.join(home, ".super-manifest.txt"), "w")

out.write("Class-Path: ")

# Build a manifest Class-Path out of the environment's CLASSPATH:
classpath = os.environ["CLASSPATH"]
for x in classpath.split(':'):
    if x != '':
        if not x.endswith(".jar"):
            x = x + "/"  # dirs must end with a slash
        out.write(" %s \n" % x)  # manifest continuation lines begin with a space

out.close()

os.system("mkdir -p ~/.superlibs/lib")
# Inner jar: just the manifest, carrying the Class-Path.
os.system("jar cmf ~/.super-manifest.txt ~/.superlibs/lib/supererJar.jar")
# Outer jar: wraps the inner jar under lib/, where Hadoop will pick it up.
os.system("jar cf ~/.superlibs/superJar.jar -C ~/.superlibs/ lib/supererJar.jar")
Now, in your code, when you set up your job:
jobConf.setJar("/home/YOURNAME/.superlibs/superJar.jar");
And give it a spin. Hadoop, with no jars!

Thursday, September 04, 2008

Evolution, and Empiricism

I had a thought last night that I've been toying with all day. It seems obvious that certain kinds of knowledge are completely inaccessible to other animals: language with recursion, "deep" representation of concepts, etc. They don't have these capacities, of course, because Nature hasn't selected for them.

We have (to some extent) both of these traits, but it seems naive to think that we aren't limited in the same way, so we can only assume that there are certain aspects of the universe humanity cannot understand.

This is not to say that I'm advocating the teaching of intelligent design, or the space unicorn, or whatever, and I'm certainly not advocating transhumanism. But it does cast doubt on radical empiricism: the view that only those things we can observe or reason out exist. In a sense, if we can't ever know about some aspect of the world, then it's indistinguishable from chance and shouldn't concern us. So from that anthropocentric point of view, radical empiricism is justified.

But what if we can attain only incomplete knowledge of something? Then any form of empiricism seems less justified.

Enough philosophy.

Wednesday, September 03, 2008

SMR, Hadoop, and Scala

I've more or less finished up the port of SMR to Hadoop. See the linked blog post for more information.

Let me know here (or there) what you'd like to see in future versions of SMR.