Wednesday, June 18, 2008

Turker classifiers and binary classification threshold calibration

I wrote a big Dolores Labs blog post a few days ago. Click here to read it. I am most proud of the pictures I made for it:



Tuesday, June 17, 2008

Pairwise comparisons for relevance evaluation

Not much on this blog lately, so I'll repost a comment I just wrote on whether to use pairwise or absolute judgments for relevance quality evaluation. (A fun one, I know!)

From this post on the Dolores Labs blog.

The paper being talked about is Here or There: Preference Judgments for Relevance by Carterette et al.


I skimmed through the Carterette paper and it’s interesting. My concern with the pairwise setup is that, to get comparability among query-result pairs, you need annotators to do an O(N^2) amount of work. (Unless you do something horribly complicated with partial orders.) The absolute judgment task scales linearly, of course. Given the AMT environment and a fixed budget, if I stay with the smaller-volume absolute task instead of spending a lot on a quadratic workload, I can simply get a higher number of workers per result and boil out more noise. Of course, if it’s true that the pairwise judgment task is easier, as the paper claims, that might make my spending more efficient. But since the pairwise workload grows quadratically, no matter the cost/benefit ratios, there has to be a tipping point: above some data set size, you’d always want to switch back to absolute judgments.
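
To make the tipping-point argument concrete, here is a rough sketch that counts judgments under each scheme and finds the data set size where exhaustive pairwise judging becomes the more expensive option. The function names, the per-judgment prices, and the workers-per-item counts are all made-up assumptions for illustration, not numbers from the paper or from AMT.

```python
def pairwise_judgments(n_items, workers_per_pair=1):
    """Judgments needed to cover every unordered pair of items."""
    return workers_per_pair * n_items * (n_items - 1) // 2


def absolute_judgments(n_items, workers_per_item=1):
    """Judgments needed to rate every item on its own."""
    return workers_per_item * n_items


def tipping_point(cost_pair, cost_abs, workers_per_item,
                  workers_per_pair=1, max_items=10**6):
    """Smallest n where exhaustive pairwise costs more than absolute.

    Costs are in integer cents (hypothetical prices) to avoid
    floating-point surprises. Returns None if no such n <= max_items.
    """
    for n in range(2, max_items + 1):
        pair_total = cost_pair * pairwise_judgments(n, workers_per_pair)
        abs_total = cost_abs * absolute_judgments(n, workers_per_item)
        if pair_total > abs_total:
            return n
    return None


# Assumed prices: a pairwise judgment costs 1 cent, an absolute one 2 cents,
# and the absolute task uses 5 redundant workers per item to reduce noise.
print(tipping_point(cost_pair=1, cost_abs=2, workers_per_item=5))
```

Even with the pairwise judgment priced at half the absolute one and five-fold redundancy on the absolute side, the crossover in this toy setup arrives at a couple dozen items.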

Absolute judgments are just so much easier to compute with, both for analysis and for use as machine learning training data. I really don’t want to have fancy utility inference or stopping rule schemes just to know the relative ranking of my data. (And I think real-valued scores will always become a necessity. Theoretical microeconomists have made boatloads of theorems about representing preferences by pairwise comparisons. It turns out that when you add enough rationality assumptions, e.g. the sort that are demanded of search engine ranking tasks anyway, then your fancy ordering can always be mapped back to a real-valued utility function.)

I’d be most interested in a paper that compares real-valued scores derived from some sort of pairwise comparison task, versus absolute judgments, and is mindful of the cost tradeoffs in service of an actual goal, like ranking algorithm training.
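
For a sense of what deriving real-valued scores from pairwise comparisons can look like, here is a minimal sketch of a Bradley-Terry fit using the standard iterative update. This is one common technique swapped in for illustration, not the method of the Carterette paper; the `bradley_terry` function and the win counts are hypothetical.

```python
def bradley_terry(n_items, wins, iters=200):
    """Fit Bradley-Terry scores from pairwise preference counts.

    wins[(i, j)] is the number of times item i was preferred over item j.
    Assumes every item wins at least once, so no score collapses to zero.
    """
    scores = [1.0] * n_items
    for _ in range(iters):
        updated = []
        for i in range(n_items):
            # Total number of comparisons item i won.
            won = sum(c for (a, _), c in wins.items() if a == i)
            # Sum over all comparisons involving i of 1 / (s_i + s_opponent).
            denom = 0.0
            for (a, b), c in wins.items():
                if a == i:
                    denom += c / (scores[i] + scores[b])
                elif b == i:
                    denom += c / (scores[i] + scores[a])
            updated.append(won / denom if denom > 0 else scores[i])
        # Rescale so the scores average to 1 (only ratios matter).
        total = sum(updated)
        scores = [s * n_items / total for s in updated]
    return scores


# Made-up judgments over three results: 0 usually beats 1, 1 usually beats 2.
wins = {(0, 1): 8, (1, 0): 2, (1, 2): 7, (2, 1): 3, (0, 2): 9, (2, 0): 1}
scores = bradley_terry(3, wins)
print(scores)
```

The fitted scores can then be used like absolute judgments, for example as regression targets, although getting them still requires collecting the pairwise data in the first place.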

Thursday, June 05, 2008

Clinton-Obama support visualization

This interactive histogram is brilliant. The NYT data visualization folks never fail to impress.

margins.swf (application/x-shockwave-flash Object)