An attempt to take a crunchy subject and make it go down smoothly, so people will be less stupid about statistics.
Basic Books, 2019, 448 pages
In this "important and comprehensive" guide to statistical thinking (New Yorker), discover how data literacy is changing the world and gives you a better understanding of life’s biggest problems.
Statistics are everywhere, as integral to science as they are to business, and in the popular media hundreds of times a day. In this age of big data, a basic grasp of statistical literacy is more important than ever if we want to separate the fact from the fiction, the ostentatious embellishments from the raw evidence -- and even more so if we hope to participate in the future, rather than being simple bystanders.
In The Art of Statistics, world-renowned statistician David Spiegelhalter shows readers how to derive knowledge from raw data by focusing on the concepts and connections behind the math. Drawing on real world examples to introduce complex issues, he shows us how statistics can help us determine the luckiest passenger on the Titanic, whether a notorious serial killer could have been caught earlier, and if screening for ovarian cancer is beneficial. The Art of Statistics not only shows us how mathematicians have used statistical science to solve these problems -- it teaches us how we too can think like statisticians. We learn how to clarify our questions, assumptions, and expectations when approaching a problem, and -- perhaps even more importantly -- we learn how to responsibly interpret the answers we receive.
Combining the incomparable insight of an expert with the playful enthusiasm of an aficionado, The Art of Statistics is the definitive guide to stats that every modern person needs.
Do statins reduce heart attacks and strokes?
Do speed cameras reduce accidents?
Is prayer effective?
Why do old men have big ears?
Are more boys born than girls?
Does the Higgs boson exist?
Was Richard III buried in a Leicester parking lot?
The Art of Statistics is a nicely packaged introductory course in statistical reasoning, in which a Cambridge professor and president of the Royal Statistical Society tries to teach some subtle and important theories without making the reader do too much math.
So this is a book about statistics for the layman, and you can hear the author in every chapter pleading for people (politicians, journalists, scientists, and the general public) to be more informed because this shit matters. But as much as the author hand-holds the reader through his examples, you are going to have to look at some numbers, and even do a little math. But if you care enough to read this book, you should know enough math to get through it.
The first few chapters talk about elementary concepts, and why statistics matter. He starts each chapter with some intriguing, sometimes silly examples of questions you can answer with statistical reasoning.
One of his introductory examples is Harold Shipman, Britain's most prolific serial killer. He was a family doctor who, between 1975 and 1998, murdered hundreds of elderly patients before he was caught. Afterwards, investigators wanted to find out whether he could have been detected earlier had anyone been paying attention to the death rate among his patients.
Answer: yes, and in fact he probably could have been caught in the first few years of his career, if the sort of forensic analysis of patient deaths that's done now had been performed then. But just looking at a chart that shows that Dr. Shipman's patients died at a higher rate than those of other GPs is obviously not enough - there are all kinds of confounders and other factors that need to be measured before you can say with any certainty that he was losing patients at a frequency that should really be considered alarming, and Spiegelhalter walks us through the numbers and the data visualizations to show us how it's done.
From there, he goes into many other measurements, from coin flips to number of sexual partners to predicting a child's height based on the heights of their parents. Very obvious ideas like "correlation is not causation" are covered in depth, of course, with some examples that aren't obvious at first glance. Regression models, probability theory, classification trees, bootstrapping, confidence intervals, p-values, Bayes Theorem, the Law of Large Numbers, the Central Limit Theorem - does that sound a little scary? Strap in and read up; if Spiegelhalter had his way this would be basic education, at least for anyone who's graduated college. The world would be a better place, and journalists might not write stories with alarming headlines like "Threefold Variation in UK Bowel Cancer Death Rates" or "Going to university makes you more likely to die of a brain tumor." Also politicians might make decisions with some basic numeracy. Well, we can dream, right?
Two of my favorites:
The Prosecutor's Fallacy
The probability of innocence given the evidence is not the same as the probability of the evidence given innocence. I.e., "If the accused is innocent, there is only a 1 in a billion chance that their DNA would match the evidence at the crime scene" is wrongly interpreted as "Given the DNA evidence, there is only a 1 in a billion chance that the accused is innocent." Spiegelhalter likens this to "If you're the Pope, you're Catholic" being interpreted as meaning the same thing as "If you're Catholic, you're the Pope."
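To see how far apart those two probabilities can be, here is a minimal sketch of my own using Bayes' rule. It is not taken from the book. The assumptions are loud ones: a pool of 10 million people who could plausibly have left the DNA (a made-up figure), every one of them equally likely to be the culprit before the DNA is considered, and a guilty person always matching. The function name `prob_innocent_given_match` is hypothetical.

```python
def prob_innocent_given_match(match_prob_if_innocent, pool_size):
    """P(innocent | DNA match) via Bayes' rule.

    Assumptions (illustrative only, not from the book):
    - exactly one person in the pool is guilty,
    - everyone in the pool is equally likely to be guilty a priori,
    - a guilty person matches with probability 1.
    """
    prior_guilty = 1 / pool_size
    prior_innocent = 1 - prior_guilty
    match_prob_if_guilty = 1.0

    # P(match and innocent) and P(match and guilty)
    match_and_innocent = match_prob_if_innocent * prior_innocent
    match_and_guilty = match_prob_if_guilty * prior_guilty

    return match_and_innocent / (match_and_innocent + match_and_guilty)


if __name__ == "__main__":
    p_evidence_given_innocence = 1e-9   # the "1 in a billion" from the example
    hypothetical_pool = 10_000_000      # made-up number of possible sources
    p_innocence_given_evidence = prob_innocent_given_match(
        p_evidence_given_innocence, hypothetical_pool
    )
    print(f"P(match | innocent) = {p_evidence_given_innocence:.1e}")
    print(f"P(innocent | match) = {p_innocence_given_evidence:.4f}")
```

Under these invented assumptions the chance of innocence given the match comes out at roughly 1 in 100, which is still small but about ten million times larger than 1 in a billion. The exact figure depends entirely on the prior, which is the whole point.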
Simpson's Paradox
The direction of an association between two variables can reverse once you adjust for a confounding factor. For example, admission rates that show women being admitted at a lower rate than men - obvious sexism! - can turn out to mean the opposite once you factor in the programs men and women actually applied for: more women apply to selective programs with a higher overall rate of rejection, but after adjusting for each program's admission rate, women are more likely to be accepted than men. This plays out in many other scenarios.
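The reversal is easier to believe once you see it in numbers. The sketch below uses figures I invented for illustration (they are not the book's data): two hypothetical programs, one less selective and one more selective, with most women applying to the selective one.

```python
# Made-up admissions data: program -> group -> (applied, admitted).
# These numbers are purely illustrative, not from the book.
applications = {
    "less selective": {"men": (800, 480), "women": (200, 140)},
    "more selective": {"men": (200, 40), "women": (800, 200)},
}


def program_rate(program, group):
    """Admission rate for one group within one program."""
    applied, admitted = applications[program][group]
    return admitted / applied


def overall_rate(group):
    """Admission rate for one group pooled across all programs."""
    applied = sum(counts[group][0] for counts in applications.values())
    admitted = sum(counts[group][1] for counts in applications.values())
    return admitted / applied


if __name__ == "__main__":
    for program in applications:
        print(
            f"{program}: men {program_rate(program, 'men'):.0%}, "
            f"women {program_rate(program, 'women'):.0%}"
        )
    print(
        f"overall: men {overall_rate('men'):.0%}, "
        f"women {overall_rate('women'):.0%}"
    )
```

With these numbers women have the higher admission rate in each program (70% vs 60%, and 25% vs 20%), yet the pooled rate favors men (52% vs 34%), simply because most of the women applied where almost everyone gets rejected.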
There's some discussion of communicating data and of data visualization, and of course there's every data science student's favorite problem: predicting which Titanic passengers survived and which ones didn't.
Bayes Theorem (and the dispute between the rival frequentist and Bayesian schools of statistical inference) gets its own chapter. If you think statistics is just hard math with provable right and wrong answers, well, it's more complicated.
Finally, Spiegelhalter talks about the so-called "replication crisis" (in which a large number of scientific papers have been found to have results that cannot be reproduced, leading many to suspect incompetence, fraud, and/or lazy research across many fields), and from there, a discussion of how bias affects statistics, and some proposed principles for ethical data science.
I have done a fair amount of machine learning and data science, so very few ideas in this book were new to me. But I found it very readable, with just enough math to require you to be comfortable with numbers, but not so much that I was straining my brain to remember how to calculate derivatives and integrals. And really, the world would be a better place if everyone knew this much, especially around election time.