Data Modelling

Just one number

So often, just one number is not only not enough, it’s positively misleading. We often see statistics quoted that, say, the average number of children per family is 1.8. First off, what sort of average? Mean, median or mode?  It makes a difference. But really, the problem is that a mean (or median or mode) gives us only very limited information. It doesn’t tell us what the data looks like overall: we get no idea of the shape of the distribution, or the range the data covers, or indeed anything other than this single point.

Many traditional actuarial calculations are the same. The net present value of a series of payments tells us nothing about the period of time over which the payments are due, or how variable their amount is — information which is very important in a wide range of circumstances.

Tim Harford has just written a good piece about how the same is true of government statistics, too. He points out that not only is gdp not good for all purposes (a statement that just about everybody agrees with), but that there are lots of other statistics that are good for some purposes but not others. There is no such thing as a single number that measures everything.

And why should there be? Life, the world and everything is variable and complex. There’s no reason to suppose that just one measurement will be able to sum it all up. We can think of the mean (or any other summary statistic) as a very simple model of the data. So simple that it’s abstracted nearly all the complexity away. The model, like any other model, may be useful for some purposes, but it’s never going to be the only possible, or only useful, model


Statistically speaking…

Numbers are often perceived as a sign of respectability. Press releases often include them — it seems so much more believable to say 75.4% of people do such-and-such than to say many or even most people. Quote a specific percentage and people tend to believe it.

The trouble is, the numbers we see in the press are often misleading or just plain wrong. Some recent sources of error include:

  • Journalists writing the story have not fully understood the press release, or the writers of the press release didn’t understand the original results. A common area of confusion is the significance of quoted results, and what that really means. There’s a really good Understanding Uncertainty blog on this. In summary:

Take Paul the Octopus, who correctly predicted 8 football results in a row, which is unlikely (probability 1/256), due to chance. Is it reasonable to say that these results are unlikely to be due to chance (in other words that Paul is psychic)? Of course not, and nobody said this at the time, even after this 2.5 sigma event. So why do they say it about the Higgs Boson?

  • The numbers being compared aren’t like for like. There’s a good Understanding Uncertainty blog on this one, too (it’s an excellent website!). The recent news that Brits are more obese than other Europeans is a case in point: first, the figures for most countries are for people aged 18 and over, but for the Brits (who are in fact, in this case, just the English) are for people aged 16 and over; and second, the data for most countries is based on asking people what they weigh and how tall they are, but the English data is based on actual measurements. And guess what? People don’t always tell the absolute truth when asked how heavy they are.
  • People, and possibly especially journalists, are really unwilling to believe that phenomena are due to chance rather than to causality. I’ve written about this before. For instance, all those stories in the press about such-and-such a local authority being a black spot for whatever health risk is top of the list on that day: often due simply to random variation. In brief, a smaller population is quite likely to have results relatively far from the mean. It’s very easy to over interpret results.

People aren’t always very good at understanding percentages, either, and in particular the difference between percentages and percentage points. And people are really bad at understanding probabilities and risks:

The trouble is, many of us struggle with understanding risk. I realised how tenuous my grasp of risk was when I noticed that 1 in 20 sounded a bigger risk to me, than 5 percent (yes, they’re exactly the same). Representing risk so that people can get a true understanding of it is an art as well as a science.

Which is why giving children lessons in gambling may not be a stupid idea.

There are many people out there doing their best to introduce some sanity into the world. The Understanding Uncertainty website is consistently interesting and well written (have I mentioned that before?), Ben Goldacre has lots of useful stuff, the Guardian’s datablog is just starting a series on statistics (the first article explains samples and how bias can skew results), and Straight Statistics is also well worth a look.