Saturday, January 19, 2008

Moral psychology on Amazon Mechanical Turk

There's a lot of exciting work in moral psychology right now. I've been telling various poor fools who listen to me to read something from Jonathan Haidt or Joshua Greene, but of course there's a sea of too many articles and books of varying quality and intended audience. But just last week Steven Pinker wrote a great NYT magazine article, "The Moral Instinct," which summarizes current research and tries to spell out a few implications. I recommend it highly, if just for presenting so many awesome examples. (Yes, this blog has poked fun at Pinker before. But in any case, he is a brilliant expository writer. The Language Instinct is still one of my favorite popular science books.)

For a while now I've been thinking that recruiting subjects online could lend itself to collecting some really interesting behavioral science data. A few months ago I tried doing this with Amazon Mechanical Turk, a horribly misnamed web service that actually lets you create web-based tasks and pay online workers to do them. Its canonical commercial applications include tedious tasks like search quality evaluation or image labeling, where you really need human data to perform well. You put up, say, several thousand images you want classified as "porn" or "not-porn", say you'll pay workers $0.01 to label ten images, then sit back and watch the data roll in.

So AMT advertises itself as a data annotation or machine learning substitute system, but I think its main innovation is finding out that there are lots and lots of people with free time willing to do online work for very, very low amounts of money. You can run any task you want, including surveys, and people happily respond for mere pennies. (Far below minimum wage, I might add -- their motivation seems to be more like casual gaming or so.) To that end, I tried out running one of the standard moral psych survey questions to see what would happen -- the so-called "trolley problem":

A runaway trolley is hurtling down a track towards five people who have been tied down in its path. If nothing happens, they will be killed. Fortunately, you have a switch which would divert the trolley to a different track. Unfortunately, the other track has one person tied down to it. Should you flip the switch?

It's supposed to be a classic dilemma of consequentialist vs. deontological moral reasoning. Is it acceptable to sacrifice for the greater good? Is it permissible to take an action that will cause a preventable death? And so on. I think it's neat just because when I pose it to people, different folks really do disagree, give different answers, and are willing to argue about it. There are some interesting recent fMRI findings (due to Greene I think?) that people who refuse to flip the switch seem to be engaged in a more emotional response, whereas those who do flip it seem to be using deliberative reasoning systems. (Some, like Greene and Pinker, seem to go further and argue this is a substantive normative reason to favor flipping the switch; whether or not you feel like getting sucked into that debate, there's clearly something interesting happening here.)

So I ran this on AMT; the participants (they call themselves "turkers") had to answer yes or no. Turns out 77% say they'd flip the switch.

I also ran two variant scenarios of the same logical dilemma, to sacrifice one person to save five:

A trolley is hurtling down a track towards five people. You are on a bridge under which it will pass, and you can stop it by dropping a heavy weight in front of it. As it happens, there is a very fat man next to you - your only way to stop the trolley is to push him over the bridge and onto the track, killing him to save five. Should you proceed?

A brilliant transplant surgeon has five patients, each in need of a different organ, each of whom will die without that organ. Unfortunately, there are no organs available to perform any of these five transplant operations. A healthy young traveler, just passing through the city the doctor works in, comes in for a routine checkup. In the course of doing the checkup, the doctor discovers that his organs are compatible with all five of his dying patients. Suppose further that if the young man were to disappear, no-one would suspect the doctor. Should the doctor sacrifice the man to save his other patients?

These two, of course, feel a lot harder to say "Yes" to, but if you were willing to say "Yes" to the original question, it is hard to justify why you would not say "Yes" here too. The participants' responses followed what you would expect: fewer said "Yes" to these scenarios. Here are the Yes/No responses to each of the questions (100 responses for each):

Question           Yes   No
surgeon              2   98
fat man             30   70
switch, save 5      77   23
switch, save 10     82   18
switch, save 15     83   17
switch, save 20     83   17

Only two people thought it was acceptable to sacrifice for organs, and fewer than half as many would push the fat man as would flip the switch (30 vs. 77). I also ran variants of the switch version with more and more people on the tracks; the Yes response creeps upwards but never reaches 100%. The differences among the first three questions are statistically significant (unpaired t-tests, all p<.001; this seems like the wrong test, can anyone correct me?).
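
On the "wrong test" question: one common alternative for comparing Yes/No counts between two groups is a Pearson chi-square test on the 2x2 table of counts. Below is a minimal sketch using the counts from the table above. The function name chi_square_2x2 is made up for illustration, there is no continuity correction, and the test assumes the responses to different questions are independent, which is not strictly true here because some turkers answered more than one question.

```python
import math

def chi_square_2x2(yes_a, no_a, yes_b, no_b):
    """Pearson chi-square test of independence on a 2x2 table (1 degree of freedom).

    Hypothetical helper for illustration; no continuity correction is applied.
    Returns (chi-square statistic, p-value).
    """
    observed = [[yes_a, no_a], [yes_b, no_b]]
    row_totals = [yes_a + no_a, yes_b + no_b]
    col_totals = [yes_a + yes_b, no_a + no_b]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# (Yes, No) counts from the table above, 100 responses per question.
counts = {
    "surgeon": (2, 98),
    "fat man": (30, 70),
    "switch, save 5": (77, 23),
}

pairs = [
    ("surgeon", "fat man"),
    ("fat man", "switch, save 5"),
    ("surgeon", "switch, save 5"),
]

results = {}
for a, b in pairs:
    chi2, p = chi_square_2x2(*counts[a], *counts[b])
    results[(a, b)] = (chi2, p)
    print(f"{a} vs. {b}: chi2 = {chi2:.2f}, p = {p:.2g}")
```

With counts as small as the surgeon row's two Yes answers, Fisher's exact test would arguably be the safer choice; the chi-square version is shown only because it fits in a few lines of standard-library Python.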

What's amazing is how fast responses happen. I started getting responses just minutes after posting the question. I actually posted each of the six questions as a separate, standalone task; but many of the turkers who did one found the rest in the task pool and did them too. (So what was supposed to be a between-subjects design fell into something else, oops!) The whole thing cost $6 and was done in a matter of hours. It's very encouraging -- AMT allows you to very quickly iterate and try out different designs and such. It's a bit of a pain to use, though; Amazon has certainly done a poor job in exploiting its full potential. (They have a form builder which was good enough to quickly write up these tasks, but to do anything moderately sophisticated, even just getting your data back out, you have to write programs against their somewhat mediocre API; you have to know how to use an XML parser, etc. Hm.)

I also tried an explicitly within-subject version, where each participant answered the three basic versions. I was interested in consistency -- presumably very few people would sacrifice for organs but refuse to divert the trolley. For 141 participants, here are the frequencies of the different answer triples:

% with this triple   flip switch?   push fat man?   sacrifice traveler for organs?
42.6                 Y              N               N
29.8                 Y              Y               N
20.6                 N              N               N
5.0                  Y              Y               Y
0.7                  Y              N               Y
0.7                  N              Y               Y
0.7                  N              Y               N

I personally find the most common responses coherent with my own gut reactions -- from left to right, I feel less and less good about sacrificing in each case. Perhaps all people feel the same gut reactions, and use different ad hoc reasons to draw the line in different places?
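
The "less and less good from left to right" pattern can be tallied directly from the triples above. This is a rough sketch: the per-triple counts are an assumption, reconstructed by rounding each reported percentage of the 141 participants, and the variable names are made up for illustration.

```python
N = 141  # participants in the within-subject version

# Percentages from the table above; each triple is
# (flip switch?, push fat man?, sacrifice traveler for organs?).
triple_pct = {
    "YNN": 42.6,
    "YYN": 29.8,
    "NNN": 20.6,
    "YYY": 5.0,
    "YNY": 0.7,
    "NYY": 0.7,
    "NYN": 0.7,
}

# Assumption: recover whole-person counts by rounding pct * N / 100.
counts = {t: round(pct * N / 100) for t, pct in triple_pct.items()}

# Number of Yes answers to each of the three questions.
marginal_yes = [
    sum(c for t, c in counts.items() if t[i] == "Y") for i in range(3)
]

# A triple is "ordered" if a No is never followed by a Yes, i.e. the person
# gets no more willing to sacrifice going from switch to fat man to surgeon.
ordered_count = sum(c for t, c in counts.items() if "NY" not in t)

print("counts per triple:", counts)
print("Yes per question:", marginal_yes)
print("ordered triples:", ordered_count, "of", sum(counts.values()))
```

The "NY" substring check is just a compact way of saying that the Yes answers, if any, all come before the No answers.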

I'm sorry that this post started with neat moral psychology then degenerated into methodology, but hey it's fun. I've seen only two instances of any sort of research paper being written using AMT, both by computer scientists: here's a nice blog post on an information retrieval experiment (it's a great blog, btw), and someone also mentioned to me this one on data processing accuracy. Anyone know of any others? It's clearly an interesting approach.


At 3:06 PM, Anonymous J. Alden Page said...

Interesting post! I linked to it here:

At 7:43 PM, Blogger Ed H. Chi said...

We used AMT to do psychology experiments around summer of 2007, and the results are published in an ACM CHI conference article here:

Aniket Kittur, Ed H. Chi, Bongwon Suh. Crowdsourcing User Studies With Mechanical Turk. In Proceedings of the ACM Conference on Human-factors in Computing Systems (CHI2008). (to appear). ACM Press, 2008. Florence, Italy.

At 11:58 PM, Blogger Brendan said...

Ed, thanks for the link to your paper! Everyone seems to be getting on the AMT wave :) We have a conference paper in review and if it's accepted (or if it's not, I suppose) I'll post about it here...

Do you have the HITs or their templates saved anywhere? I'm curious to see the difference between the two different ones you ran, since you said you found big differences in the quality of Turker responses between them. I read through your CHI 2007 paper ("conflict and cooperation") but couldn't figure out exactly what the task was..

From reading your blog, looks like you folks have already discovered Panos Ipeirotis's blog, and then perhaps ours ( If you've heard or do any more cool AMT work I'd be eager to know.

