Friday, July 11, 2008

The MacGyver of data analysis

Jeff Hammerbacher, who runs Facebook's data infrastructure and insight team, is leaving the company. That's too bad for them, considering a hilarious quote from a talk he gave:

Basic statistics is more useful than advanced machine learning.

I can't tell you how many interviews I've had where someone has a really cool project on their resume. Support vector machines, topic analysis on CiteSeer, or whatever... But what it boils down to is someone took toy data set A and plugged it into machine learning library B and took the output and was like, “sweet.”

People with "machine learning" on their resume fall from the sky these days; it seems to be a very sexy discipline. The problem is that if I ask them to explain a t-test, those same people can't tell me what it is.

If I had a MacGyver of data analysis and all he had was a t-test and regression, he would probably be able to do 99.9% of the analyses that we do that are actually useful.

Amen. That, along with the importance of data visualization, is one of the best lessons I've learned working with real data over the last year or two. Here's the entire video, in which he also talks about Hadoop, Scribe, Hive (a data warehousing and analysis platform they've built on Hadoop), and other fun things. The above bit is around 35:00.

(Direct link, hopefully, though the website is weird)

There's lots more to say on statistics vs. machine learning and all that. For another post...
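For the curious, the two tools in that hypothetical MacGyver kit really are tiny to write down. Here's an illustrative Python sketch (toy code of my own, not anything from the talk) of a Welch two-sample t statistic and a one-variable least-squares fit, using only the standard library:

```python
import math
import statistics

def welch_t(a, b):
    """Welch's two-sample t statistic: difference in means
    divided by the unpooled standard error."""
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances (n-1)
    se = math.sqrt(va / len(a) + vb / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

def simple_regression(x, y):
    """Ordinary least squares for y ~ a + b*x; returns (intercept, slope)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope
```

In practice you'd reach for R's t.test or lm instead of writing these by hand, but the point stands: the machinery itself is a few lines.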


At 9:16 AM, Blogger William said...

I guess it depends if you're trying to analyze something or if you're trying to build something.

There are times when, frankly, all you care about is getting a classifier that works. You don't care which n-grams are most indicative of spam if you're just trying to filter it from your mailbox.

(Of course you need some amount of understanding of what's going on if you're going to do a few cycles of error analysis and feature design. But you don't need much.)

Also I find it a little ironic that he proposes t-tests as a shibboleth when we have much better tools at our disposal with the advent of cheap computing.

At 11:03 AM, Blogger Brendan said...

I think there are two different problems here: you're looking at weird, complex data and just trying to figure out what's going on -- especially in data-poor domains like UI design, where all you can do is get rough information you're going to use qualitatively anyway -- versus you're trying to build a high-accuracy predictor/classifier, more in the machine learning style. But I think doing the first can help the second, by figuring out what might make useful features. If you go around and plot your proposed features and find they're noisily correlated with what you want to predict, that's good evidence they may be worth turning into features for a full-out classifier. I guess this doesn't work for data as fine-grained as n-grams; but it's interesting to look at the output of an n-gram-based classifier and compare across different document corpora, or whatever. Simple significance testing is helpful there.

Love the slides. I've used randomization methods a little, but I must admit, t-tests or Mann-Whitney tests are really, really easy to apply ad hoc. I mean, if I'm at the R command line and I have two arrays of numbers, it's t.test(a,b) and I'm done. I guess I could use whatever the function is in the bootstrap package, but it's like he said -- the t-test gives you simple information that's still useful.
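(For comparison, a bare-bones randomization method of the kind mentioned above -- a permutation test on the difference in means -- is itself only a few lines. This is an illustrative stdlib-Python sketch, not the bootstrap package's API:)

```python
import random
import statistics

def perm_test(a, b, n_iter=2000, seed=0):
    """Two-sample permutation test: estimated p-value for the observed
    absolute difference in means under random relabeling of the data."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)  # random relabeling of group membership
        diff = abs(statistics.mean(pooled[:len(a)])
                   - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / n_iter
```

No normality assumption, at the cost of a loop instead of a formula.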

At 1:03 PM, Anonymous Bob Carpenter said...

Talk about living in glass houses and throwing stones!

I can't tell you how many interviews I've had where someone has a really cool project on their resume. Clinical trial toxicity analysis for drugs, mean time to failure for components, or whatever... But what it boils down to is someone took a tiny data set A and plugged it into statistical package B and took the output and was like, "cool, man."

People with "statistics" on their resume fall from the sky these days, even if it's not a very sexy discipline. The problem is if I ask them if they can explain stochastic gradient descent for scalability or how to handle missing data, those same people can't tell me what it is.

If I had an Emma Peel of data analysis and all she had was an SVM implemented by SGD and cross-validation software, she would probably be able to do 99.9% of the classification problems that are actually useful.
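(An SVM implemented by SGD really is compact, too -- here's an illustrative Pegasos-style sketch in plain Python, with hinge loss and L2 regularization, on made-up toy data; not production code:)

```python
import random

def svm_sgd(data, lam=0.01, epochs=100, seed=0):
    """Train a linear SVM (hinge loss + L2 penalty) by stochastic
    gradient descent, Pegasos-style: one random example per step."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):  # shuffled pass
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            # subgradient step on lam/2*||w||^2 + max(0, 1 - margin)
            w = [wi * (1 - eta * lam) for wi in w]
            if margin < 1:  # hinge loss is active
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
```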

The key here isn't your tool, it's the critical MacGyver of data analysis that makes the difference. Once you've got MacGyver, it doesn't matter what the tool is. That's the whole point of being MacGyver.

If one looks to the conferences in the field, the above caricature is patently ridiculous. Take the recent ICML. How many of those papers fit the model of "toy data set A versus machine learning library B"?

At 4:49 PM, Blogger Brendan said...

:) Though I think the true MacGyver of machine learning would use boosted decision trees instead of SVMs (fewer parameter tweaks, less need to preprocess data, handles noise better, etc.). But in any case, you could ditch all those weird non-likelihood frequentist estimators and other fancy old stuff.

Machine learning is definitely considered cooler than mainline statistics, so it stands to reason it's going to be misused more often. I think especially once you get out of the community of actual ML researchers (ICML, NIPS) there's more dodgy application work out there.

At 3:33 PM, Blogger Mauricio Monsalve Moreno said...

Machine learning is not really about learning... it's just data exploration. And that's OK for most studies. But when it comes to summarizing information, like when you establish a "natural law", you need the hardcore analytical part of you to do the job.

I don't like t-tests... they're for ratios of normal quantities and only work with *VERY* normal random variables. Regression, on the other hand, is very weak if the user doesn't do a good job. You know, regressions are for linear things! Come on, there are so many kinds of relationships that the best thing you can do is build a model, solve it theoretically, fit it somehow, and use a chi-square or Cramér-von Mises test to demonstrate its quality.

Nevertheless, regression is a very useful tool... and I don't think it's better than machine learning; we know that SVMs and neural nets are almost equivalent to linear regression :P but more general.


