Turker classifiers and binary classification threshold calibration
I wrote a big Dolores Labs blog post a few days ago. Click here to read it. I am most proud of the pictures I made for it:
Not much on this blog lately, so I'll repost a comment I just wrote on whether to use pairwise vs. absolute judgments for relevance quality evaluation. (A fun one, I know!)
I skimmed through the Carterette paper and it’s interesting. My concern with the pairwise setup is that, in order to get comparability among query-result pairs, you need annotators to do an O(N^2) amount of work (unless you do something horribly complicated with partial orders). The absolute judgment task scales linearly, of course. Given the AMT environment and a fixed budget, if I stay with the smaller-volume task, instead of spending a lot on a quadratic taskload, I can simply get more workers per result and boil out more noise. Of course, if it’s true that the pairwise judgment task is easier, as the paper claims, that might make my spending more efficient. But since the pairwise workload grows quadratically while the absolute one grows linearly, no matter the cost/benefit ratios there has to be a tipping point in data set size beyond which you’d always want to switch back to absolute judgments.
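Just to make the scaling argument concrete, here's a back-of-envelope sketch. The worker counts and per-judgment costs are made-up numbers for illustration, not figures from the Carterette paper:

```python
# Back-of-envelope comparison of total annotation cost:
# absolute judgments grow linearly in the number of items,
# a full pairwise design grows quadratically.
from math import comb

def absolute_cost(n_items, workers_per_item, cost_per_judgment=1.0):
    """Linear: each item judged independently by several workers."""
    return n_items * workers_per_item * cost_per_judgment

def pairwise_cost(n_items, workers_per_pair, cost_per_judgment=1.0):
    """Quadratic: every pair of items gets compared."""
    return comb(n_items, 2) * workers_per_pair * cost_per_judgment

# Even if each pairwise judgment is cheaper and needs fewer workers,
# the quadratic pair count eventually dominates.
for n in (10, 100, 1000):
    print(n,
          absolute_cost(n, workers_per_item=5),
          pairwise_cost(n, workers_per_pair=2, cost_per_judgment=0.5))
```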
Absolute judgments are just so much easier to compute with, both for analysis and for use as machine learning training data. I really don’t want to need fancy utility inference or stopping-rule schemes just to know the relative ranking of my data. (And I think real-valued scores always end up being a necessity. Theoretical microeconomists have proved boatloads of theorems about representing preferences by pairwise comparisons. It turns out that once you add enough rationality assumptions, e.g. the sort that are demanded of search engine ranking tasks anyway, your fancy ordering can always be mapped back to a real-valued utility function.)
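For what it's worth, here's a tiny sketch of the kind of utility inference I mean: a Bradley-Terry style fit that maps pairwise win counts back to real-valued scores. The win matrix and function are made up for illustration; the point is just that the ordering-to-utility mapping is doable, not that it's something I'd want to maintain:

```python
# Toy Bradley-Terry fit: recover real-valued scores from pairwise win counts.
import numpy as np

def bradley_terry(wins, n_iter=200):
    """wins[i, j] = number of times item i beat item j. Returns a score per item."""
    n = wins.shape[0]
    p = np.ones(n)
    games = wins + wins.T            # total comparisons per pair
    total_wins = wins.sum(axis=1)
    for _ in range(n_iter):
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        p = total_wins / denom.sum(axis=1)
        p /= p.sum()                 # normalize, scores are only defined up to scale
    return p

# Three items where 0 usually beats 1, and 1 usually beats 2:
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]], dtype=float)
print(bradley_terry(wins))           # scores come out ordered item 0 > 1 > 2
```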
I’d be most interested in a paper that compares real-valued scores derived from some sort of pairwise comparison task against absolute judgments, and that is mindful of the cost tradeoffs in service of an actual goal, like training a ranking algorithm.
This interactive histogram is brilliant. The NYT data visualization folks never fail to impress.