Data Modelling

Just one number

So often, just one number is not only not enough, it’s positively misleading. We often see statistics quoted that, say, the average number of children per family is 1.8. First off, what sort of average? Mean, median or mode? It makes a difference. But really, the problem is that a mean (or median or mode) gives us only very limited information. It doesn’t tell us what the data looks like overall: we get no idea of the shape of the distribution, or the range the data covers, or indeed anything other than this single point.
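To make the difference between the three averages concrete, here’s a quick sketch in Python. The family sizes are invented purely for illustration: a couple of large families pull the mean well away from the “typical” family described by the median and mode.

```python
from statistics import mean, median, mode

# Invented numbers of children in 11 hypothetical families.
children = [0, 0, 0, 0, 1, 1, 2, 2, 2, 3, 11]

print(mean(children))    # 2 -- dragged up by the one big family
print(median(children))  # 1 -- the middle family
print(mode(children))    # 0 -- the most common family
```

Three different “averages” from the same data, and none of them tells you that family sizes here run from 0 to 11.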

Many traditional actuarial calculations are the same. The net present value of a series of payments tells us nothing about the period of time over which the payments are due, or how variable their amount is — information which is very important in a wide range of circumstances.
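A toy calculation shows the point. In this sketch (a flat 5% discount rate and all the amounts are assumed for illustration), a single payment due next year and ten level annual payments are constructed to have exactly the same net present value:

```python
def npv(payments, rate):
    """Net present value of {time_in_years: amount} payments."""
    return sum(amount / (1 + rate) ** t for t, amount in payments.items())

rate = 0.05  # assumed flat discount rate, for illustration only

# Stream A: a single payment of 105 due in one year.
single = {1: 105.0}

# Stream B: ten level annual payments sized to give the same NPV.
annuity_factor = (1 - (1 + rate) ** -10) / rate
level = {t: 100.0 / annuity_factor for t in range(1, 11)}

print(round(npv(single, rate), 2))  # 100.0
print(round(npv(level, rate), 2))   # 100.0 -- same NPV, very different term
```

The single number 100 is identical for both, yet the two streams behave completely differently if, say, interest rates move or payments stop early.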

Tim Harford has just written a good piece about how the same is true of government statistics, too. He points out that not only is GDP not good for all purposes (a statement that just about everybody agrees with), but that there are lots of other statistics that are good for some purposes but not others. There is no such thing as a single number that measures everything.

And why should there be? Life, the world and everything is variable and complex. There’s no reason to suppose that just one measurement will be able to sum it all up. We can think of the mean (or any other summary statistic) as a very simple model of the data. So simple that it’s abstracted nearly all the complexity away. The model, like any other model, may be useful for some purposes, but it’s never going to be the only possible, or only useful, model.


Interesting links

I found these interesting:

  1. Kaprekar’s constant — not everything has to be useful to be appealing and fun.
  2. Apparently the Roman Empire was more equal than the USA, while in Britain income inequality rose faster between 1975 and 2008 than in any other OECD member country.
  3. How to get your keys back if you drop them down a drain.
  4. Talking about big numbers
  5. The UK opens up NHS data, and the EU announces an ‘open by default’ position for public sector information.

Statistically speaking…

Numbers are often perceived as a sign of respectability. Press releases often include them — it seems so much more believable to say 75.4% of people do such-and-such than to say many or even most people. Quote a specific percentage and people tend to believe it.

The trouble is, the numbers we see in the press are often misleading or just plain wrong. Some recent sources of error include:

  • Journalists writing the story have not fully understood the press release, or the writers of the press release didn’t understand the original results. A common area of confusion is the significance of quoted results, and what that really means. There’s a really good Understanding Uncertainty blog on this. In summary:

Take Paul the Octopus, who correctly predicted 8 football results in a row, which is unlikely (probability 1/256) if due to chance alone. Is it reasonable to say that these results are unlikely to be due to chance (in other words, that Paul is psychic)? Of course not, and nobody said this at the time, even after this 2.5 sigma event. So why do they say it about the Higgs Boson?

  • The numbers being compared aren’t like for like. There’s a good Understanding Uncertainty blog on this one, too (it’s an excellent website!). The recent news that Brits are more obese than other Europeans is a case in point: first, the figures for most countries are for people aged 18 and over, but the figures for the Brits (who are in fact, in this case, just the English) are for people aged 16 and over; and second, the data for most countries is based on asking people what they weigh and how tall they are, but the English data is based on actual measurements. And guess what? People don’t always tell the absolute truth when asked how heavy they are.
  • People, and possibly especially journalists, are really unwilling to believe that phenomena are due to chance rather than to causality. I’ve written about this before. For instance, all those stories in the press about such-and-such a local authority being a black spot for whatever health risk is top of the list on that day: often due simply to random variation. In brief, a smaller population is quite likely to have results relatively far from the mean. It’s very easy to over-interpret results.
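The small-population effect is easy to demonstrate by simulation. In this sketch (all the numbers are invented) every area has exactly the same underlying 10% rate of some condition, yet the worst-looking “black spot” almost always turns out to be a small area, through sampling noise alone:

```python
import random

random.seed(42)
TRUE_RATE = 0.10  # the same underlying rate everywhere

def observed_rate(population):
    """Simulated measured rate in one area of the given size."""
    cases = sum(random.random() < TRUE_RATE for _ in range(population))
    return cases / population

small_areas = [observed_rate(100) for _ in range(200)]     # 200 towns of 100
large_areas = [observed_rate(10_000) for _ in range(200)]  # 200 cities of 10,000

# The extremes come from the small areas: pure sampling noise,
# not a real difference in risk.
print(max(small_areas), max(large_areas))
```

Run it and the highest observed rate among the small towns is far above 10%, while the big cities all cluster close to the true rate, even though nothing real distinguishes any of them.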

People aren’t always very good at understanding percentages, either, and in particular the difference between percentages and percentage points. And people are really bad at understanding probabilities and risks:

The trouble is, many of us struggle with understanding risk. I realised how tenuous my grasp of risk was when I noticed that 1 in 20 sounded like a bigger risk to me than 5 percent (yes, they’re exactly the same). Representing risk so that people can get a true understanding of it is an art as well as a science.

Which is why giving children lessons in gambling may not be a stupid idea.
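The percentage-versus-percentage-point confusion mentioned above is worth a worked example (the rates here are invented for illustration). If a rate moves from 10% to 12%, that is a rise of 2 percentage points but a 20 percent increase, and “1 in 20” is exactly the same risk as 5 percent:

```python
old_pct, new_pct = 10.0, 12.0  # invented rates, expressed in percent

points = new_pct - old_pct                       # change in percentage points
relative = (new_pct - old_pct) / old_pct * 100   # relative change in percent

print(points)    # 2.0  -- "up 2 percentage points"
print(relative)  # 20.0 -- "up 20 percent"

# And the two ways of writing the same risk:
print(1 / 20 == 5 / 100)  # True
```

A headline writer can truthfully describe the same change either way, and “up 20 percent” sounds ten times scarier than “up 2 points”.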

There are many people out there doing their best to introduce some sanity into the world. The Understanding Uncertainty website is consistently interesting and well written (have I mentioned that before?), Ben Goldacre has lots of useful stuff, the Guardian’s datablog is just starting a series on statistics (the first article explains samples and how bias can skew results), and Straight Statistics is also well worth a look.


Interesting links

Some things that have recently struck me in one way or another:

  1. Literary references to actuaries aren’t that common
  2. Some interesting graphical representations of relative sizes from xkcd: money (recent) and radiation (older). And from elsewhere: how big is a PhD?
  3. Old news is the latest thing
  4. US/UK culture gap: “Like most US universities, [UC Davis] maintains its own police force, employing (as of 2009) 101 people (including administrators), far more than the largest academic departments. The officer wielding the spray is on record as earning $110,000 in 2010, more than all but the better paid full professors.” More
  5. Social differences on public transport: the tube’s posh.

Interesting links

Some things I’ve found interesting:

  1. Have you seen those Google ads on the tube? The example they give of a strong password isn’t so strong after all. It’s always worth checking the statistics.
  2. The important field – as usual with xkcd, make sure you read the alt-text
  3. Language is not writing, and some myths that arise from the mis-identification
  4. Sometimes, animations are the best way of showing data. This is a great one on global warming.
  5. Don’t believe everything you read on Wikipedia – and remember the alt-text!

More laptop woes

Laptops can contain confidential information, and are inherently less secure than large machines: it is easier to take physical possession of them.

Nationwide Building Society recently had one stolen that contained customer information, and three laptops containing police payroll information were stolen from LogicaCMG, the UK IT services firm.

You have to wonder whether it was absolutely necessary for this information to be on the laptops in the first place. It appears that it may not have been, as Nationwide are saying that the employee who had the laptop stolen may not have been complying with the firm’s security policy. Of course, it’s one thing to have a policy and another for it to be complied with.