
Monday, May 19, 2008

conplot - a console plotter

This has to be the most quick-and-dirty data visualizer out there: I wrote an ASCII-art plotter script that takes a column of numbers on stdin and draws a plot on your console. I've been using it for several months to quickly look at numbers on the command line, especially from logs and such. (Back in school I would use gnuplot for this; R is good too. But sometimes you want to move really fast, especially if you have a few hideous perl -pe one-liners on your hands and mucking around with temp files would interrupt your flow.)

Link: github.com/brendano/conplot

"Demo":


$ cat time.log | conplot
14601
oooooooo
oooooo
ooooooooo
oooooooooooo
11269 oooooooo
oooo
ooo
oooo
oo
ooo
7271 ooooo
oooo
oooo
oooo
oooo
oooo
3272 oo
o
o
oo
ooo
ooo
-726 0 76826


I must say, it's way easier to throw some code up on GitHub than onto SourceForge, which is the only other open source code hosting service I've used. I guess Google Code is their biggest competitor in that respect; I haven't tried it.

Tuesday, May 13, 2008

The best natural language search commentary on the internet

With Powerset's launch, there's an awful lot of hot air and crappy blog posts about natural language search being written. Instead of contributing to that mess, I prefer to direct the reader to the best writing on the topic that I've seen: Fernando Pereira's posts on search.

Sunday, April 13, 2008

Are women discriminated against in graduate admissions? Simpson's paradox via R in three easy steps!

R has a fun built-in package, datasets: a whole bunch of easy-to-use, interesting tables of data. I found the famous UC Berkeley admissions data set, from a 1970s study of whether sex discrimination existed in graduate admissions. It's famous for illustrating a particular statistical paradox. Thanks to R's awesome mosaic plot interface, we can see this really easily.

UCBAdmissions is a three-dimensional table (like a matrix): Admit Status x Gender x Dept, with counts for each category as the matrix's values. R's default printing shows the basics just fine. Here's the data for just the first of six departments:
> UCBAdmissions
, , Dept = A

          Gender
Admit      Male Female
  Admitted  512     89
  Rejected  313     19

...

Overall, women have a lower admittance rate than men:
> apply(UCBAdmissions,c(1,2),sum)

          Gender
Admit         M    F
  Admitted 1198  557
  Rejected 1493 1278

This is the phenomenon that prompted a lawsuit against Berkeley which prompted the study that collected this data.
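To put a number on that overall gap, the counts can be turned into rates. This is only a sketch using base R's prop.table on the same summed table; the variable names `overall` and `rates` are mine, not from the study.

```r
# Collapse the Dept dimension: Admit x Gender counts, as in the table above
overall <- apply(UCBAdmissions, c(1, 2), sum)

# Divide each Gender column by its total, so each column sums to 1
rates <- prop.table(overall, margin = 2)

# Admission rate per gender: Male about 0.445, Female about 0.304
round(rates["Admitted", ], 3)
```

So roughly 45% of male applicants were admitted overall, against roughly 30% of female applicants.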

R's plot function is overloaded to do a mosaic plot for this sort of categorical data. Very cool. With just
> plot(UCBAdmissions)

or, playing around after reading Quick-R's page on this:
> install.packages("vcd")
> library(vcd)
> mosaic(UCBAdmissions, condvars=c('Dept'))

We have a plot showing admittance and gender breakdowns per department:



In each department, women have admittance rates similar to men's. This seems to be at odds with the fact that women have a lower admittance rate overall. This discrepancy is an example of Simpson's paradox.

This mosaic also shows the explanation: Selective departments have more female applicants. It's easy to see since the departments are ordered by selectiveness. Departments A and B let in many applicants, but they're mostly male. The reverse is true for the rest. This means that the overall female population takes big admittance hits in departments C through F, while lots of males get in via departments A and B.
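The same story can be read off as plain numbers, if you'd rather check the mosaic than trust your eyes. A rough sketch with base R only; `by_dept`, `admit_rate`, `applicants` and `dept_share` are names I made up for illustration.

```r
# Within each Gender x Dept cell, the fraction admitted vs. rejected
by_dept <- prop.table(UCBAdmissions, margin = c(2, 3))

# Gender x Dept matrix of admission rates
admit_rate <- round(by_dept["Admitted", , ], 2)
print(admit_rate)

# Number of applicants per Gender x Dept (admitted + rejected)
applicants <- apply(UCBAdmissions, c(2, 3), sum)
print(applicants)

# Share of each gender's applicants that went to each department (rows sum to 1)
dept_share <- round(prop.table(applicants, margin = 1), 2)
print(dept_share)
```

The first table gives the per-department rates by gender; the last one shows where each gender's applications were concentrated, which is the part that drives the paradox.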

I think these mosaic plots are impressive for visualizing categorical proportions for high dimensional data sets. Well, by “high” I think I mean, more than 2. I can't think of a better way to see several cross relationships in categorical data at once. And the only tuning I needed to do was play around a bit with the order of those three dimensions.

Sources:
  • R's UCBAdmissions help page. It comes with the standard download of R.
  • R's vcd::mosaic function. I recommend the pdf vignette about it, which has many more pictures of cool mosaic plots.
  • I would post the original 1975 Science paper, but it's not freely available. I hate academic publishers.

Saturday, April 05, 2008

a regression slope is a weighted average of pairs' slopes!

Wow, this is pretty cool:




From an Andrew Gelman article on summarizing a linear regression as a simple difference between upper and lower categories. I get the impression there are lots of weird, misunderstood corners of linear models... (e.g. that "least squares regression" is the maximum likelihood estimator for a linear model with normal noise... I know so many people who didn't learn that from whatever stats course they took, and therefore find it mystifying why squared error should be used... see this other post from Gelman.)
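The identity in the title is easy to check numerically. As I understand it, the least-squares slope equals the average of the slopes between every pair of points, with each pair weighted by its squared distance in x. Here's a quick sketch on simulated data; the data and all variable names are made up for illustration.

```r
set.seed(1)
x <- rnorm(20)
y <- 2 * x + rnorm(20)          # simulated data, true slope 2 plus normal noise

# All unordered pairs (i, j) of observations
idx <- combn(length(x), 2)
dx <- x[idx[1, ]] - x[idx[2, ]]
dy <- y[idx[1, ]] - y[idx[2, ]]

pair_slopes <- dy / dx          # slope of the line through each pair
w <- dx^2                       # assumed weights: squared x-distance of the pair

weighted_avg <- sum(w * pair_slopes) / sum(w)
ols_slope <- unname(coef(lm(y ~ x))[2])

all.equal(weighted_avg, ols_slope)   # TRUE
```

Pairs that are far apart in x dominate the average, which is one way to see why points at the extremes of x have so much leverage on the fit.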

Wednesday, April 02, 2008

Datawocky: More data usually beats better algorithms

This is a great post. I think I've seen it from several sources already...

Saturday, March 29, 2008

Allende's cybernetic economy project

Wow -- teletype machines and cybernetics to run an economy!

Before ’73 Coup, Chile Tried to Find the Right Software for Socialism - New York Times

(note they mean this version of the word "cybernetics")

And here's a better Guardian article on it.

The control room:


I suspect this happens anyway today -- using computers to help decisionmakers run the economy -- but without the cool Star Trek chairs and display screens. Economists at central banks get data and use computers to analyze it, then eventually the data is used to inform decisions.

Though this vision involves more fine-grained data collection and automated analysis. (And other more Internet-y things, like two-way communication between workers and management/government.) I suspect it would be way easier to do today, with better computational infrastructure (CPU, memory, data transmission, and software are all better these days).

Sunday, March 23, 2008

Quick-R, the only decent R documentation on the internet

For R users or wannabes...

I really love R, but it has horrid documentation and a steep learning curve. Recently I was introduced to Quick-R, a really excellent documentation site. I think it's made the system dramatically more useful for me.

Thursday, March 20, 2008

Spending money on others makes you happy

Yes, Money Can Buy Happiness . . . - TierneyLab - Science - New York Times Blog

Tuesday, March 18, 2008

color name study i did

Where does “Blue” end and “Red” begin?

I'm writing some posts on blog.doloreslabs.com and this is the best one so far. Methodology-wise, along the lines of my earlier Amazon Mechanical Turk moral decisions survey...

Sunday, March 09, 2008

PHD Comics: Humanities vs. Social Sciences
