Protecting Yourself in the Age of Information: Simpson’s Paradox
A growing problem in today’s digital age is the propensity for false or misleading information to become mixed with the legitimate, muddying the proverbial waters and making it difficult to navigate the sea of information without getting some of the contaminated muck on yourself. Whether it’s Trump’s constant tirades about “fake news” or the latest online article announcing that a New Cancer Drug Kills 100% of Cancer Cells!, it becomes difficult to separate the fact from the fiction, the advertisement from the news, the informative from the fluff. Statistics are one of the most common sources of confusion, because they can be presented in ways that mislead readers.
Consider Simpson’s Paradox, named after statistician Edward Simpson, which illustrates how surface-level data can fool you by not revealing what lies beneath. The classic example of Simpson’s Paradox involves a case from 1973 when UC Berkeley was sued for gender discrimination against women based on admissions figures:
The data shows that men are significantly more likely to be accepted into UC Berkeley than women. Why is this misleading? Let’s look at the data for admission rates in the six largest departments at UC Berkeley:
Notice something funny? In four of the six departments, women are actually more likely than men to be accepted. Then why do the totals show a higher proportion of men being admitted?
Direct your attention to the row for Department A and compare the number of male and female applicants. Even though women are accepted to Department A at a rate 20 percentage points higher than men, the raw number of women accepted is far lower than the number of men. The same pattern holds for Department B.
Compare that to Department C, where far more women than men are applying, but only 34% of them are being admitted. It turns out women tended to apply to highly competitive departments with low rates of admission, while men tended to gravitate toward less competitive departments with high rates of admission. That difference in where people applied, rather than how each department treated them, explains why the surface-level data appears to suggest gender discrimination.
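To see the mechanism in isolation, here is a minimal Python sketch with two made-up departments. The department names ("Easy" and "Hard") and every number in it are hypothetical, chosen only for illustration; they are not the Berkeley figures. Women have the higher admission rate in each department, yet the lower rate overall, because most of them apply to the department that admits few people.

```python
# Hypothetical numbers for illustration only; these are NOT the Berkeley figures.
# Each entry is (applicants, admitted).
applications = {
    "Easy": {"men": (800, 480), "women": (100, 70)},
    "Hard": {"men": (200, 50), "women": (900, 270)},
}


def admission_rate(applicants, admitted):
    """Fraction of applicants who were admitted."""
    return admitted / applicants


def overall_rate(group):
    """Pool every department's applicants and admits for one group."""
    total_applicants = sum(dept[group][0] for dept in applications.values())
    total_admitted = sum(dept[group][1] for dept in applications.values())
    return total_admitted / total_applicants


# Per-department rates: women come out ahead in both departments.
for name, dept in applications.items():
    rates = {g: round(admission_rate(*dept[g]), 2) for g in ("men", "women")}
    print(name, rates)

# Pooled rates: men come out ahead, because 80% of them applied to "Easy".
print("Overall", {g: round(overall_rate(g), 2) for g in ("men", "women")})
```

The pooled rate is a weighted average of the department rates, and the weights are how many people from each group applied where. Change the weights and the overall ranking can flip without any single department changing its behavior.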
The same effect shows up in sports statistics. Consider the following hypothetical save percentages (save%) for two hockey goalies, Swiss Cheese and Mr. Sieve, across two years:
| Goalie | 2017 | 2018 | 2017 and 2018 |
|---|---|---|---|
| Swiss Cheese | 456/487 (0.936) | 2003/2301 (0.870) | 2459/2788 (0.882) |
| Mr. Sieve | 2116/2312 (0.915) | 1544/1789 (0.863) | 3660/4101 (0.892) |
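Despite Swiss Cheese having the better save% in both 2017 and 2018, his overall save% for the two years is lower than that of Mr. Sieve. If you look at the number of shots each goalie faced in each year, you can begin to understand why. Both goalies did worse in 2018, and Swiss Cheese faced about 83% of his shots that year, while Mr. Sieve faced only about 44% of his. Swiss Cheese’s combined figure is therefore dominated by the bad year, and Mr. Sieve’s leans toward the good one.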
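The arithmetic is easy to check for yourself. The short Python sketch below uses the saves and shots from the table above; the function and variable names are just illustrative choices. It prints each goalie's yearly save%, the combined save%, and the share of his shots that came in 2018.

```python
# (saves, shots faced) per goalie per season, taken from the table above.
seasons = {
    "Swiss Cheese": {2017: (456, 487), 2018: (2003, 2301)},
    "Mr. Sieve": {2017: (2116, 2312), 2018: (1544, 1789)},
}


def save_pct(saves, shots):
    """Fraction of shots faced that were saved."""
    return saves / shots


def combined_save_pct(goalie):
    """Add up saves and shots across seasons first, then divide."""
    total_saves = sum(saves for saves, _ in seasons[goalie].values())
    total_shots = sum(shots for _, shots in seasons[goalie].values())
    return total_saves / total_shots


def share_of_shots(goalie, year):
    """Fraction of a goalie's total shots that came in the given year."""
    total_shots = sum(shots for _, shots in seasons[goalie].values())
    return seasons[goalie][year][1] / total_shots


for goalie, years in seasons.items():
    per_year = {year: round(save_pct(*line), 3) for year, line in years.items()}
    print(
        goalie,
        per_year,
        "combined:", round(combined_save_pct(goalie), 3),
        "share of shots in 2018:", round(share_of_shots(goalie, 2018), 2),
    )
```

Note that the combined figure comes from totaling saves and shots, not from averaging the two yearly percentages. That is why the year with more shots counts for more.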
So what can we take away from understanding Simpson’s Paradox? Simply this: be careful about taking every statistic you see at face value, because, as with anything, there’s usually something hiding under the hood.