Social Science++: Simpson's paradox is so totally solved

My friend Lukas just wrote a great formulation of Simpson's Paradox as a puzzle:

Against left-handed pitchers, Player A has a higher batting average than Player B. Player A does better against right-handed pitchers also.

Is it possible that B has a better average than A?

Here's a beautiful ASCII art visualization that says Yes.

Each star represents some fixed number of at-bats where the player got a hit; pluses represent misses. If you put them in a horizontal line you can see the batting averages (proportions) pretty clearly. The bar lengths carry across rows -- so a longer bar means more at-bats.


Against left-handed pitchers:
A hits |**++| A misses    --> 50% avg.
B hits |*+++| B misses    --> 25% avg.

Against right-handed pitchers:
A hits |**|              A misses    --> 100% avg.
B hits |**************+| B misses    -->  93% avg.  (for many more at-bats!)

But, batting against *ALL* pitchers:
A hits  |****++| A misses                 --> 67% avg.
B hits  |***************++++| B misses    --> 79% avg.

They both do well against right-handed pitchers, but B sees way more of them than A. All those right-handed pitchers B sees help his overall average, but A doesn't get much of a payoff from his high average there, because so few of his at-bats are against right-handers. This effect overwhelms the within-group differences of A outperforming B.

I was tempted to write, "any random pitcher tends to be right-handed, therefore those matter more." But that's not quite the right explanation -- rather, what matters is that B's at-bats are usually against right-handed pitchers, whereas A usually bats against lefties.
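Here's a minimal Python sketch of that weighting, using the counts read off the bars above. The names (`counts`, `group_share`, `overall_average`) are just mine for illustration. The idea is that each player's overall average is a weighted average of his two per-handedness averages, where the weights are his own shares of at-bats against each kind of pitcher.

```python
from fractions import Fraction

# (hits, at_bats) per player and pitcher handedness, read off the ASCII bars
counts = {
    "A": {"left": (2, 4), "right": (2, 2)},
    "B": {"left": (1, 4), "right": (14, 15)},
}

def group_share(player, hand):
    """Fraction of this player's at-bats that came against `hand` pitchers."""
    total = sum(at_bats for _, at_bats in counts[player].values())
    return Fraction(counts[player][hand][1], total)

def overall_average(player):
    """Overall average = sum over groups of (share of at-bats) * (group average)."""
    avg = Fraction(0)
    for hand, (hits, at_bats) in counts[player].items():
        avg += group_share(player, hand) * Fraction(hits, at_bats)
    return avg

for player in ("A", "B"):
    print(player,
          "share vs. right-handers: %.2f" % float(group_share(player, "right")),
          "overall average: %.2f" % float(overall_average(player)))
```

A wins inside each group, but his weights sit mostly on the left-handed group, where both players' averages are low. B's weights sit mostly on the right-handed group, where both are high.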

A better visualization might step up to two dimensions, showing cross-cutting boxes for each group. I am personally of the opinion that cramming more dimensions of data into a visualization can often help understanding, but I don't have time to do it right now so I shouldn't natter on about it.

This was inspired by the pie chart version here [1], which was about group selection in evolutionary theory. Say altruism is socially efficient: a group with altruists does better than a group without altruists. But altruism is individually a bad bet: altruists do worse than free riders in their group. If you do well you have more children, so within any one group altruists always lose out to their fellow free riders.

Surprisingly, the proportion of altruists across multiple groups can still increase. If there's a group with a very high proportion of altruists, the altruists there benefit greatly from each other and so have lots of offspring -- even though the few free riders in that group are doing better still. That altruistic group outgrows the low-altruism groups, so altruists increase in the entire cross-group population. (This gain can only be temporary if groups are fixed: eventually the free riders in the big group overwhelm the altruists.) The effect can be characterized as the altruistic group beating out the selfish groups; this is dubbed "group selection". Group selection works for altruism, but individual selection works against it; in the case of unbalanced groups, group selection is stronger.
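The same arithmetic can be sketched for the altruism story. Everything numeric here is made up by me for illustration (the group sizes, `BENEFIT`, `COST`, and the offspring rule itself); none of it comes from [1]. It just shows one generation in which the altruist share falls inside every group yet rises in the combined population.

```python
from fractions import Fraction

# Hypothetical parameters, chosen only to illustrate the effect
BENEFIT = 2              # extra offspring everyone gets, scaled by the group's altruist share
COST = Fraction(3, 10)   # offspring each altruist gives up by being altruistic

def altruist_share(altruists, free_riders):
    return Fraction(altruists) / (altruists + free_riders)

def next_generation(altruists, free_riders):
    """Return (altruists, free_riders) after one round of reproduction."""
    base = 1 + BENEFIT * altruist_share(altruists, free_riders)
    # Free riders get the full group benefit; altruists get it minus their cost
    return altruists * (base - COST), free_riders * base

groups = [(90, 10), (10, 90)]   # one mostly-altruist group, one mostly-selfish group
after = [next_generation(a, f) for a, f in groups]

def pooled_share(gs):
    return altruist_share(sum(a for a, _ in gs), sum(f for _, f in gs))

for before_g, after_g in zip(groups, after):
    print("within group: %.3f -> %.3f"
          % (float(altruist_share(*before_g)), float(altruist_share(*after_g))))
print("all groups:   %.3f -> %.3f"
      % (float(pooled_share(groups)), float(pooled_share(after))))
```

The altruist-heavy group grows so much faster than the selfish one that it dominates the pooled count, which is the same re-weighting that put B ahead of A.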

Crazy people have written stuff about possible implications of this.

[1] Sober, Elliott, and David Sloan Wilson. 1998. Unto Others: The Evolution and Psychology of Unselfish Behavior.

4 Comments:

At 9:27 AM, L2K said...: Is there anything ascii art can't do?
At 9:48 AM, Mike said...: This comment has been removed by the author.
At 9:59 AM, Anonymous said...: i just got this book on the basics of python to manipulate data and the intro says it won't cover GUIs because they are totally lame, although you need to use them because "ASCII bar charts don't cut it in the Twenty-First Century"

thanks python book!
At 10:17 PM, Brendan O'Connor said...:
>>> def p(w, t): print('|' + '*'*w + '-'*(t-w) + '|', "\t ==> %.2f avg" % (w*1.0/t))
...
>>> a_vs_l = (2,4)
>>> b_vs_l = (1,4)
>>> p(*a_vs_l); p(*b_vs_l)
|**--| ==> 0.50 avg
|*---| ==> 0.25 avg

>>> a_vs_r = (2,2)
>>> b_vs_r = (14,15)
>>> p(*a_vs_r); p(*b_vs_r)
|**| ==> 1.00 avg
|**************-| ==> 0.93 avg

>>> def elemadd(a,b): return [a[i]+b[i] for i in range(len(a))]
>>> p(*elemadd(a_vs_l,a_vs_r)); p(*elemadd(b_vs_l,b_vs_r))
|****--| ==> 0.67 avg
|***************----| ==> 0.79 avg

Wednesday, May 09, 2007
