Measurement is a Minefield

We have officially entered the era of “big data”.  Hal Varian, chief economist at Google, has gone so far as to describe statistics as “sexy”.

 That may be a stretch.  Still, the more easily we can gather data, and the more often it is used to make managerial and policy decisions, the more important statistics will become.  Mostly that’s good.

 I’m excited that Netflix always seems to know what films I will like.  I’m encouraged that urban police departments can anticipate where violence is likely to break out.  I dream of the day when better public school teachers will get paid more.

 But let me wave one giant yellow flag:  statistics will never be smarter or more honest than the people using them.  “Big data” will inevitably lead to big errors, deliberate or otherwise.

 How might number crunchers with huge quantities of raw information lead us astray? Let us count just a few of the ways.

The point of statistics is to simplify data; simplification invites deceit, or at least a point of view.  The telecommunications companies AT&T and Verizon have recently engaged in an advertising battle that exploits this kind of ambiguity about what is being described.

 One of the primary concerns of most cell phone users is the quality of the service in places where they are likely to make or receive phone calls.  Thus a logical point of comparison between the two firms is the size and quality of their networks.

 While consumers just want decent cell phone service in lots of places, AT&T and Verizon have each come up with a different metric for measuring the somewhat amorphous demand for “decent cell phone service in lots of places.”  Verizon launched an aggressive advertising campaign touting the geographic coverage of its network.  You may remember the maps of the United States that showed the large percentage of the country covered by the Verizon network compared with the relatively paltry coverage of the AT&T network.  The unit of analysis chosen by Verizon is geographic area covered, because the company has more of it.

 AT&T countered by launching a campaign that changed the unit of analysis.  Its billboards advertised that “AT&T covers 97 percent of Americans.”  Note the use of the word “Americans” rather than “America.”  AT&T focused on the fact that most people don’t live in rural Montana or the Arizona desert.  Since the population is not evenly distributed across the physical geography of the United States, the key to good cell service (the campaign argued implicitly) is having a network in place where callers actually live and work, not necessarily where they go camping.  (As someone who spends a fair bit of time in rural New Hampshire, my sympathies are with Verizon on this one.)
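 To see how much the choice of unit matters, here is a minimal sketch in Python.  The three regions and their shares of land and people are invented for illustration and describe neither company's real network; the function name `coverage_share` is likewise my own.

```python
# Hypothetical country split into three regions. The "area" and "population"
# values are made-up percentage shares, not real figures for any carrier.
regions = {
    "urban":      {"area": 3,  "population": 80, "covered": True},
    "small_town": {"area": 27, "population": 17, "covered": True},
    "remote":     {"area": 70, "population": 3,  "covered": False},
}

def coverage_share(regions: dict, unit: str) -> float:
    """Share of the chosen unit ("area" or "population") inside covered regions."""
    total = sum(r[unit] for r in regions.values())
    covered = sum(r[unit] for r in regions.values() if r["covered"])
    return covered / total

by_people = coverage_share(regions, "population")
by_area = coverage_share(regions, "area")

print(f"Coverage by population: {by_people:.0%}")
print(f"Coverage by area:       {by_area:.0%}")
```

 Under these assumed numbers, one and the same network covers 97 percent of the people and 30 percent of the land.  Neither figure is false; they simply answer different questions.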

 Statistical arguments have much in common with bad marriages; the disputants often talk past one another.

We should not confuse precision with accuracy.  Precision reflects the exactitude with which we can express something.  Accuracy measures whether that figure is broadly consistent with the truth—hence the danger of confusing the two.

 If an answer is accurate, then more precision is always better. It’s nice if I can tell you that the nearest gas station is 1.265 miles down the road.  However, it’s a problem if I tell you that while pointing in the wrong direction.  No amount of precision can make up for inaccuracy.

 This point was reinforced to me when my wife gave me a new golf rangefinder for Christmas.  This laser device calculates the exact distance from my golf ball to the hole.  Instead of standing in the fairway and estimating that I was about 150 yards from the pin, I could have my rangefinder tell me that I was exactly 153.2 yards away.

 My golf game got steadily worse.  I fired balls over greens, into traps, and even into the parking lot behind the municipal police station (where all of the police officers park their personal cars).  After three months, I realized that my rangefinder was set to meters rather than yards.  Great precision, lousy accuracy.
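 The size of the unit mix-up is easy to work out.  The sketch below takes the 153.2 reading from my example and assumes, purely for illustration, that the device was reporting meters; the function name is my own invention.

```python
# One yard is defined as exactly 0.9144 meters.
METERS_PER_YARD = 0.9144

def meters_to_yards(meters: float) -> float:
    """Convert a distance in meters to yards."""
    return meters / METERS_PER_YARD

reading = 153.2                       # number shown on the display
true_yards = meters_to_yards(reading) # what that number means if it is meters
gap = true_yards - reading            # error from assuming the unit was yards

print(f"Displayed: {reading:.1f}")
print(f"Distance in yards if the display is in meters: {true_yards:.1f}")
print(f"Gap from reading the wrong unit: {gap:.1f} yards")
```

 The display is precise to a tenth of a unit, yet reading it in the wrong unit puts the figure off by roughly 14 yards.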

 Wall Street learned the same basic lesson, albeit at greater global cost, during the financial crisis.  Prior to 2008, firms throughout the financial industry typically used a common barometer of risk, the Value at Risk model, or VaR.

 The elegance of VaR was that it used probability and other sophisticated modeling techniques to express the entire risk profile of a firm with a single number, a dollar figure no less.  The VaR was sometimes called the “4:15 report” because it was dropped on executives’ desks every afternoon after the American markets had closed.

 As my high school daughter might say, “The models sucked.”  A bad statistical model (e.g. one that assumes real estate prices will not fall) is like a faulty speedometer; it’s worse than none at all.  If you place too much faith in a broken speedometer, you will be oblivious to signs that you are traveling too fast.  In contrast, if there is no speedometer at all, you have no choice but to look around for clues about how fast you are really going.

Collecting bad data is cheaper and easier than ever!  Any method of gathering data that systematically excludes some segment of the relevant population is prone to bias.  Hence the expression “garbage in, garbage out.” 

 If you want to learn about American political attitudes, then you had better poll a group of people who are representative of America.  This should not be 500 people walking out of a movie theater in Manhattan, or 50,000 people who choose to respond to an online survey.

 Gathering data may be cheaper and easier than ever before, but gathering a representative sample of the population is actually getting harder.  In the old days, random telephone dialing across different area codes could generate a representative sample of American households.

 Not anymore.  Young people are less and less likely to have land lines; the area codes on their cell phones no longer tell us anything about where they live.

 Rich people have caller ID and don’t answer calls from strangers.  Busy people won’t make time to take surveys; lonely, unemployed people will.  (The latter could easily lead to a huge, sloppy telephone poll suggesting that 60 percent of Americans are lonely and unemployed.)

 The dangerous thing about polling, or any kind of statistical sampling, is that large, badly drawn samples will make the results look artificially impressive.  A presidential poll of 20,000 Americans will produce results with a tiny margin of error, say plus or minus one percent.

 But if all 20,000 of those Americans live in Washington, DC, or subscribe to the National Review, the poll is rubbish.  You would be better off with a sample of 100, if it were truly representative of America.
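 For readers who want the arithmetic, here is a rough sketch using the standard textbook approximation for the margin of error of a proportion at 95 percent confidence.  It assumes a simple random sample, which is exactly the assumption a badly drawn poll violates.  The 50 percent and 90 percent support figures are hypothetical, chosen only to illustrate the point.

```python
import math

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a sample proportion.

    Assumes a simple random sample of size n; p = 0.5 is the worst case.
    """
    return z * math.sqrt(p * (1 - p) / n)

big_poll = margin_of_error(20_000)  # roughly 0.007, i.e. 0.7 points
small_poll = margin_of_error(100)   # roughly 0.098, i.e. 9.8 points

# Hypothetical scenario: true national support is 50%, but the big poll
# only reached a group in which support is 90%.
true_support = 0.50
biased_group_support = 0.90
bias = biased_group_support - true_support  # does not shrink as n grows

print(f"Margin of error, n=20,000: +/- {big_poll:.1%}")
print(f"Margin of error, n=100:    +/- {small_poll:.1%}")
print(f"Error from sampling the wrong people: {bias:.0%}")
```

 The margin of error captures only random sampling noise.  In this made-up case the huge poll reports a margin under one point while missing the truth by 40, whereas the small representative poll is honest about being good to within ten or so.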

 Smart managers will use data to evaluate performance; smart workers will figure out how to manipulate those data.  The State of New York introduced “scorecards” that evaluate the mortality rates for the patients of cardiologists performing coronary angioplasty, a common treatment for heart disease.  This seems like a perfectly reasonable and helpful use of data.  It would have been a great policy if it hadn’t killed people.

 Cardiologists obviously care about their “scorecard.”  However, the easiest way for a surgeon to improve his mortality rate is not by killing fewer people; presumably most doctors are already trying very hard to keep their patients alive.  The easiest way for a doctor to improve his mortality rate is by refusing to operate on the sickest patients.

 According to a survey conducted by the School of Medicine and Dentistry at the University of Rochester, 79 percent of the doctors said that some of their personal medical decisions had been influenced by the knowledge that mortality data are collected and made public.
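 A little arithmetic shows why turning away the sickest patients is so tempting.  The patient counts and risk levels below are invented for illustration and are not drawn from the New York data; the class and function names are likewise my own.

```python
from dataclasses import dataclass

@dataclass
class PatientGroup:
    name: str
    patients: int
    mortality_risk: float  # assumed chance of death per procedure (made up)

def expected_mortality_rate(groups: list) -> float:
    """Expected deaths divided by total patients across the given groups."""
    total = sum(g.patients for g in groups)
    deaths = sum(g.patients * g.mortality_risk for g in groups)
    return deaths / total

routine = PatientGroup("routine", patients=90, mortality_risk=0.01)
sickest = PatientGroup("sickest", patients=10, mortality_risk=0.20)

treats_everyone = expected_mortality_rate([routine, sickest])
turns_away_sickest = expected_mortality_rate([routine])

print(f"Scorecard if the doctor treats everyone:   {treats_everyone:.1%}")
print(f"Scorecard if the sickest are turned away:  {turns_away_sickest:.1%}")
```

 With these assumed numbers the doctor's published rate falls from 2.9 percent to 1.0 percent without any improvement in skill.  The only thing that changed is who got treated.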

 Even when employees cannot manipulate the data, you had better make darn sure that you are measuring and rewarding what really matters.  Test scores are a good example.  Will better teachers generate better test scores for their students?  Maybe not.

 The Air Force Academy, like the other military academies, randomly assigns its cadets to different sections of standardized core courses, such as introductory calculus.  We can assume that over time all professors get students with a similar range of aptitudes (unlike most universities, where students of different abilities can select into or out of different courses).

 The Air Force Academy also uses the same readings and exams in every section of a particular course.  This standardization helps us answer an important question:  Which professors are most effective?

 The answer:  The professors with less experience and fewer degrees from fancy universities.  These professors have students who typically do better on the standardized exams in the introductory courses.  They also get better student evaluations.

 Clearly these young, motivated instructors are more committed to their teaching than the old crusty professors with PhDs from places like Harvard.

 But hold on.  A deeper look at the same Air Force Academy data provided another relevant finding about student performance over a longer horizon.  Researchers found that in math and science the students with more experienced (and more highly credentialed) professors do better in their mandatory follow-on courses than students who had less experienced professors in the introductory courses.

 One logical explanation is that less experienced instructors are more likely to “teach to the test” in the introductory courses.  This produces impressive exam scores (in the short run) and happy students when it comes to filling out the instructor evaluation.

 Meanwhile, the old, crusty professors focus less on the exam and more on the concepts that matter most in future courses and in life after the Academy—which is what we really care about.

Correlation does not mean causation.  Will more guns make us more or less safe?  That is the question of the day.  I don’t have an answer.  I don’t think anyone else has a conclusive answer either.

 The difficulty of unraveling the complex relationship between guns and crime is a variation on a challenge that plagues most social science research:  places with lots of guns are different from places without lots of guns in ways that have nothing to do with the guns themselves.

 Maybe it’s the guns that explain differential crime rates; maybe it is the differences in the kinds of places that do or do not have lots of gun owners.

 Intellectually honest researchers use very clever methodologies to try to sort out these kinds of confounding factors.  Even then, the results are usually presented with caution.  Shoddy researchers, and those who propagate their work, don’t necessarily bother with such trifles.

 A simple statistical association between two variables rarely tells us much.  When you read that people who eat three servings of kale a day have lower rates of colon cancer, remind yourself of one crucial thing:  people who eat three servings of kale a day are not like the rest of us.
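 Here is a small sketch of how that can happen.  Every number in it is made up: I assume a population split into health-conscious people and everyone else, and I assume kale has no effect at all on cancer risk within either group.

```python
# Hypothetical population. Within each group, kale eaters and non-eaters
# face the SAME assumed cancer risk, so kale has zero causal effect here.
groups = {
    "health_conscious": {"kale": 1000, "no_kale": 1000, "cancer_risk": 0.02},
    "everyone_else":    {"kale": 100,  "no_kale": 5000, "cancer_risk": 0.06},
}

def overall_rate(groups: dict, diet: str) -> float:
    """Expected cancer rate among all people with the given diet."""
    people = sum(g[diet] for g in groups.values())
    cases = sum(g[diet] * g["cancer_risk"] for g in groups.values())
    return cases / people

kale_rate = overall_rate(groups, "kale")
no_kale_rate = overall_rate(groups, "no_kale")

print(f"Cancer rate among kale eaters: {kale_rate:.1%}")
print(f"Cancer rate among non-eaters:  {no_kale_rate:.1%}")
```

 In this invented population the kale eaters show a rate of about 2.4 percent against 5.3 percent for everyone else, even though kale does nothing.  The gap comes entirely from who chooses to eat kale.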

Statistical analysis suggests that much statistical analysis will turn out to be wrong.  John Ioannidis, a Greek doctor and epidemiologist, examined 49 studies published in three prominent medical journals.  Each study had been cited in the medical literature at least a thousand times.

 Roughly one third of the research was subsequently refuted by later work.  Dr. Ioannidis estimates that roughly half of all scientific papers will turn out to be wrong.  His research was published in the Journal of the American Medical Association, one of the journals in which the articles he studied had appeared.

 This creates a certain mind-bending irony:  If Dr. Ioannidis’s research is correct, then there is a good chance that his research is wrong.

 We are indeed in the midst of a data revolution.  But let me finish with some word association:  fire, knives, automobiles, hair removal cream.  Each one of these things serves an important purpose.  Each one makes our lives better.  And each one can cause some serious problems when abused.

 Add statistics to that list.