Friday, July 11, 2008

The MacGyver of data analysis

Jeff Hammerbacher, who runs Facebook's data infrastructure and insight team, is leaving the company. That's too bad for them, considering a hilarious quote from a talk he gave:

Basic statistics is more useful than advanced machine learning.

I can't tell you how many interviews I've had where someone has a really cool project on their resume. Support vector machines, topic analysis on CiteSeer, or whatever... But what it boils down to is someone took toy data set A and plugged it into machine learning library B and took the output and was like, “sweet.”

People with "machine learning" on their resume fall from the sky these days; it seems to be a very sexy discipline. The problem is, if I ask those same people to explain a t-test, they can't tell me what it is.

If I had a MacGyver of data analysis and all he had was a t-test and regression, he would probably be able to do 99.9% of the analyses that we do that are actually useful.

Amen. That, along with the importance of data visualization, is one of the best lessons I've learned working with real data over the last year or two. Here's the entire video, in which he also talks about Hadoop, Scribe, Hive (a data warehousing and analysis platform they've built on Hadoop), and other fun things. The above bit is around 35:00.

(Direct link, hopefully, though the website is weird)
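For concreteness, the two workhorses the quote leans on really do fit in a few lines. Here's a hedged, stdlib-only sketch of a two-sample t statistic (Welch's version) and a simple least-squares regression, on made-up toy data:

```python
import statistics
from math import sqrt

def welch_t(xs, ys):
    """Two-sample Welch's t statistic: difference in means over its standard error."""
    nx, ny = len(xs), len(ys)
    vx, vy = statistics.variance(xs), statistics.variance(ys)
    return (statistics.fmean(xs) - statistics.fmean(ys)) / sqrt(vx / nx + vy / ny)

def simple_regression(xs, ys):
    """Ordinary least squares fit y ~ a*x + b; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Toy data (made up): does group B score higher than group A?
a_scores = [4.1, 3.9, 4.4, 4.0, 4.2]
b_scores = [4.8, 5.1, 4.7, 5.0, 4.9]
t = welch_t(b_scores, a_scores)  # large t => the difference is unlikely to be noise

# Toy regression: recover the line y = 2x + 1 from exact points on it.
slope, intercept = simple_regression([1, 2, 3, 4], [3, 5, 7, 9])
```

Compare |t| against a t distribution (or the rough rule of thumb |t| > 2) to judge significance; the regression line is the other half of the MacGyver toolkit.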

There's lots more to say on statistics vs. machine learning and all that. For another post...

Thursday, July 03, 2008

Link: Today's international organizations

Fascinating -- a review of the current international system, focusing on international organizations (that is, organizations of states): Who runs the world? | Wrestling for influence

Tuesday, July 01, 2008

Bias correction sneak peek!

I really don't have time to write up a full explanation of what this is, so I'll just post the graph instead. Each box is a scatterplot of an AMT worker's responses versus a gold standard, with an attempt to fit a linear model to each worker. The idea is to correct for each worker's bias: under a linear model y ~ ax + b, the correction is correction(y) = (y - b)/a. Arrows show these corrections. Hilariously bad "corrections" happen.

*But* there is also weighting: to get the "correct" answer (the maximum likelihood estimate) from several workers, you weight each worker's corrected response by a^2/stddev^2. Despite the sometimes odd corrections, the cross-validated results from this model correlate better with the gold standard than raw averaging of workers. (Raw averaging is the maximum likelihood solution under a fixed noise model: a = 1, b = 0, and every worker's variance equal.)
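The fit-correct-weight pipeline above can be sketched in plain Python. This is a hedged illustration on simulated data, not the actual code behind the graph: the worker parameters and gold values below are made up, and the fits here are in-sample rather than cross-validated.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares fit y ~ a*x + b; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def corrected(y, a, b):
    """Invert the worker's bias model: if y ~ a*x + b, then x ~ (y - b)/a."""
    return (y - b) / a

# Hypothetical setup: three workers rate the same gold items, each with
# their own slope a, bias b, and noise stddev s.
random.seed(0)
gold = [random.uniform(0, 10) for _ in range(50)]
workers = [(1.2, -0.5, 0.3), (0.8, 1.0, 0.6), (1.0, 0.0, 0.2)]  # (a, b, s)
responses = [[a * x + b + random.gauss(0, s) for x in gold]
             for a, b, s in workers]

# Fit each worker's (a, b) against gold; estimate residual stddev.
fits = []
for ys in responses:
    a, b = fit_line(gold, ys)
    resid = [y - (a * x + b) for x, y in zip(gold, ys)]
    fits.append((a, b, statistics.stdev(resid)))

def combine(i):
    """ML combination: weight each worker's corrected answer by a^2/stddev^2."""
    num = den = 0.0
    for (a, b, s), ys in zip(fits, responses):
        w = a ** 2 / s ** 2
        num += w * corrected(ys[i], a, b)
        den += w
    return num / den

est = [combine(i) for i in range(len(gold))]
raw = [statistics.fmean(ys[i] for ys in responses) for i in range(len(gold))]

rmse = (sum((e - g) ** 2 for e, g in zip(est, gold)) / len(gold)) ** 0.5
rmse_raw = (sum((r - g) ** 2 for r, g in zip(raw, gold)) / len(gold)) ** 0.5
```

On this toy data the bias-corrected, precision-weighted combination beats raw averaging, for the same reason as in the post: raw averaging implicitly assumes every worker has a = 1, b = 0, and equal variance.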

Much better explanation is coming... will be a post I think.