Data Modelling

Just one number

So often, just one number is not only not enough, it’s positively misleading. We often see statistics quoted that, say, the average number of children per family is 1.8. First off, what sort of average? Mean, median or mode? It makes a difference. But really, the problem is that a mean (or median or mode) gives us only very limited information. It doesn’t tell us what the data looks like overall: we get no idea of the shape of the distribution, or the range the data covers, or indeed anything other than this single point.
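To see how much the choice of average matters, here’s a toy example in Python; the numbers are invented, chosen so that the three averages all disagree:

```python
from statistics import mean, median, mode

# Hypothetical numbers of children in ten families
children = [0, 0, 1, 1, 1, 2, 2, 2, 2, 7]

print(mean(children))    # 1.8
print(median(children))  # 1.5
print(mode(children))    # 2
```

Three different "averages" of the same data, and none of them reveals the shape of the distribution or the outlying family of seven.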

Many traditional actuarial calculations are the same. The net present value of a series of payments tells us nothing about the period of time over which the payments are due, or how variable their amount is — information which is very important in a wide range of circumstances.
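A quick sketch of the point: two payment streams can share a net present value while differing completely in timing. The 5% discount rate and the figures are arbitrary assumptions for illustration:

```python
rate = 0.05  # assumed annual discount rate

def npv(payments, rate=0.05):
    """Present value of payments[t] due at the end of year t+1."""
    return sum(p / (1 + rate) ** (t + 1) for t, p in enumerate(payments))

single = [1000.0]  # one payment, due in a year's time
target = npv(single)

# A level annual payment over 20 years chosen to give the same NPV
annuity_factor = sum(1 / (1 + rate) ** t for t in range(1, 21))
level = [target / annuity_factor] * 20

print(round(npv(single), 2), round(npv(level), 2))  # identical NPVs
```

Same single number, but one stream is over in a year and the other runs for two decades, with very different exposure to interest rate and inflation risk.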

Tim Harford has just written a good piece about how the same is true of government statistics, too. He points out that not only is GDP not good for all purposes (a statement that just about everybody agrees with), but that there are lots of other statistics that are good for some purposes but not others. There is no such thing as a single number that measures everything.

And why should there be? Life, the world and everything is variable and complex. There’s no reason to suppose that just one measurement will be able to sum it all up. We can think of the mean (or any other summary statistic) as a very simple model of the data. So simple that it’s abstracted nearly all the complexity away. The model, like any other model, may be useful for some purposes, but it’s never going to be the only possible, or only useful, model.


There’s a yotta data out there…

One result of the unrelenting increase in computing power is that the amount of data is now huge. Earlier this year, it was estimated that 295 exabytes of data was being stored around the world in 2007. An exabyte is a billion gigabytes. That’s a lot of data, though admittedly it’s not a yottabyte yet (a million exabytes).
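The arithmetic of these units is easy to lose track of. A quick sketch using decimal prefixes (the 295-exabyte figure is the estimate quoted above):

```python
# Decimal (SI) storage units, in bytes
GIGABYTE = 10 ** 9
EXABYTE = 10 ** 18
YOTTABYTE = 10 ** 24

assert EXABYTE == 10 ** 9 * GIGABYTE    # an exabyte is a billion gigabytes
assert YOTTABYTE == 10 ** 6 * EXABYTE   # a yottabyte is a million exabytes

world_2007 = 295 * EXABYTE              # the 2007 estimate quoted above
print(world_2007 / YOTTABYTE)           # a small fraction of a yottabyte
```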

Not only is data storage technology improving — for example, you can buy a 1 TB (terabyte: 1000 gigabytes) hard drive for well under £100 — but more computing power means that it’s now possible to analyse these huge amounts of data.

It’s changing the way companies do business, too:

As Ron Kohavi at Microsoft memorably put it, objective, fine-grained data are replacing HiPPOs (Highest Paid Person’s Opinions) as the basis for decision-making at more and more companies.

In the past, it may simply have been too hard to collect masses of data and then analyse it. That’s simply not true now. It’ll be interesting to see whether these data-driven decisions turn out to be better than the traditional seat-of-the-pants ones. My bet is that, on the whole, they will.

Data Society

Where the money is

I love the Guardian’s datablog. It consistently presents large quantities of data in interesting interactive ways. Yesterday it took data from the Annual Survey of Hours and Earnings, and presented it in three different ways:

  • Choose a salary, and see how earnings for different jobs compare
  • Choose a job, and see what the earnings are
  • Choose a job group

To my mind, one of the most interesting aspects, and one which the presentation highlights, is the gender gap and how it varies between jobs. It seems to be greatest for the highest paid jobs, on the whole.

Actuaries are lumped in with management consultants, economists and statisticians, not necessarily a totally homogeneous grouping, and have a gender difference of 18%. The difference for corporate managers and senior officials is 39%.

Today, it’s got a geographic analysis based on the same data. Guess what? London and the south east come out top.

One of the best things about the datablog is that, as well as coming up with good ways of presenting the data, it also provides access to the raw data so you can do your own thing, or check that the conclusions are in fact warranted. Great stuff.


Data – it’s where the sporting action is

Game Theory, the Economist sports blog (a fairly loose description), has had a series of articles recently on how technology is affecting sport. Telemetry (including GPS tracking) is being used in Formula 1, sailing, rugby and football, and looks likely to spread to other sports. Technology has been a huge influence in tennis, but it looks as if some of the recent increases in ball speed and spin may be down to old-fashioned causes: the players improving their technique. With the help of hi-tech training methods, of course.

Some sports are embracing technology as a way of assisting referees and umpires and, presumably, supporting fairness and compliance with the rules; others resist its introduction, worrying that it will undermine referees’ authority (or, on a cynical view, that it will detect non-compliance with the rules). But participants in all sports are using technology to improve their training, strategy and tactics. And the technology they are using is centred on data: collecting it and analysing it.

We’re not just talking about professional athletes in the top teams, either. Many of the ordinary runners I know (myself included) use GPS and heart rate monitors in training. It appeals to the inner geek, apart from anything else.

It’ll be interesting to see how this tendency progresses. My prediction, for what it’s worth, is that top class training will become more and more data intensive, and that all sports will, eventually, be dragged kicking and screaming into the data age. As more and more money depends on the outcomes of sporting events, those involved are going to want the results to depend on the athletes, rather than the officials.


Actuarial Data

The new modelling

Data is the new modelling. That is, it’s where all the sexy stuff is going to be over the next few years. Over the last few years, in the insurance industry at least, modelling has been where it’s at. Driven largely by Solvency II, a huge amount of effort has gone into building and, now, validating hugely complex financial models.

But now, in the insurance industry as well as others, data is coming to the fore. After all, what is a model without data? And, as we all know, Garbage In, Garbage Out is one of the fundamental tenets of computing. The FSA has pointed out that data is a key area for the successful introduction of Solvency II and has produced a scoping tool that will help them assess a firm’s data management processes.

And it’s not only Solvency II. At GIRO last week there was an interesting debate over whether telematics will be at the heart of personal motor insurance in ten years’ time. The thing about telematics is that it produces large quantities of data. With the Test-Achats ruling meaning that gender can no longer be used as a rating factor, insurers are going to be looking for other ways of coming up with premiums, and other factors they can take into account. The thing about gender, of course, is that it doesn’t take much data: it’s just a single bit in the database. Other rating factors may have more predictive power, but it’s harder to get at them.

We’re seeing this everywhere, though. As computers continue to get more powerful, and data storage gets ever cheaper (how big is the disk drive on your laptop? — even my phone has 16GB), doing things the rough and ready way with only limited data has fewer and fewer advantages. Big data is becoming mainstream: look at Google, for instance. And why did HP buy Autonomy?

You mark my words, a change is gonna come.

Data risk management

Fiddling the figures: Benford reveals all

Well, some of it, anyway. There’s been quite a lot of coverage on the web recently about Benford’s law and the Greek debt crisis.

As I’m sure you remember, Benford’s law says that in lists of numbers from many real life sources of data, the leading digit isn’t uniformly distributed. In fact, around 30% of leading digits are 1, while fewer than 5% are 9. The phenomenon has been known for some time, and is often used to detect possible fraud – if people are cooking the books, they don’t usually get the distributions right.
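The law itself is simple to state: the probability that the leading digit is d is log10(1 + 1/d). A couple of lines of Python confirm the figures above:

```python
import math

# Benford's law: P(leading digit = d) = log10(1 + 1/d), for d = 1..9
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

print(round(benford[1], 3))  # 0.301 -- about 30% of leading digits are 1
print(round(benford[9], 3))  # 0.046 -- fewer than 5% are 9
```

The nine probabilities sum to one, as they must: the product of (d+1)/d from 1 to 9 telescopes to 10.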

It’s been in the news because it turns out that the macroeconomic data reported by Greece shows the greatest deviation from Benford’s law among all euro states (hat tip Marginal Revolution).

There was also a reported result that the numbers in published accounts in the financial industry deviate more from Benford’s law now than they used to. But it now appears that the analysis may be faulty.

How else can Benford’s law be used? What about testing the results of stochastic modelling, for example? If the phenomena we are trying to model are ones for which Benford’s law works, then the results of the model should comply too.
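A rough sketch of what such a check might look like, using a chi-squared statistic against the Benford frequencies. The "model output" here is just lognormal random draws, an arbitrary stand-in for illustration, chosen because it spans several orders of magnitude:

```python
import math
import random

def leading_digit(x):
    """First significant digit of x (x must be non-zero)."""
    s = f"{abs(x):e}"  # scientific notation, e.g. "3.700000e+02"
    return int(s[0])

def benford_chi2(values):
    """Chi-squared statistic of leading digits against Benford's law."""
    counts = [0] * 10
    for v in values:
        counts[leading_digit(v)] += 1
    n = len(values)
    chi2 = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        chi2 += (counts[d] - expected) ** 2 / expected
    return chi2

# Toy "model output": lognormal draws spanning several orders of magnitude
random.seed(1)
sample = [random.lognormvariate(0, 3) for _ in range(10_000)]
print(benford_chi2(sample))  # compare with chi-squared critical values, 8 d.o.f.
```

A small statistic relative to the chi-squared distribution with eight degrees of freedom suggests the output is consistent with Benford; a large one, as you’d get from data clustered in a narrow range, flags a deviation worth investigating.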


What Google’s about

What is Google about? Is it about search, or advertising? Actually, it’s probably about data.

I’ve been reminded of this by a couple of recent articles I’ve read. The Economist’s Babbage blog has a good piece on the Google Internet bus – a free, mobile cybercafe that operates in India.

“It has covered over 43,000km and passed through 120 towns in 11 states since it hit the road on February 3rd, 2009. Google estimates that 1.6m people have been offered their first online experience as a result.”

And also as a result, Google has a huge amount of data. I was reminded of Google’s appetite for data by a recent review of several books about Google in the London Review of Books. In 2007 Google started up a directory inquiry service in the USA:

“You dialled 1-800-4664-411 and spoke your question to the robot operator, which parsed it and spoke you back the top eight results, while offering to connect your call. It was free, nifty and widely used, especially because – unprecedentedly for a company that had never spent much on marketing – Google chose to promote it on billboards across California and New York State.”

People wondered why Google was doing this – it definitely wasn’t a money making exercise, and indeed only lasted for three years. What Google was doing was collecting a huge amount of phoneme data that it could use in voice recognition technology. It is almost certainly doing the same with its bus in India.

The availability of large amounts of data has changed various aspects of technology in many ways. Off the top of my head, it’s influenced voice recognition, natural language understanding and translation (which are rather different from each other) and of course has had a huge effect on marketing in general. Then there’s the forecasting of epidemics, and various initiatives to make data freely available to all. There is also now much more demand for large amounts of data storage, both physical (disk drives etc) and in software (database technology). It’s a chicken and egg situation with data storage, of course.