'Big Data' is useful but caution is still needed when drawing conclusions
OVER 30 YEARS AGO, in the early days of the personal computer, I worked on the commercialisation of a machine intelligence algorithm from the University of Edinburgh. I demonstrated the product to Bill Gates at Microsoft's headquarters in Seattle; this being 1983, Microsoft fitted into a single three-storey building.
Our program took lots of attributes of measured behaviour - the more the merrier - induced a rule that fitted all the factors, and produced a formula that could then be used to make future decisions.
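That style of rule induction can be sketched in a few lines. The code below is purely illustrative - it is not the Expert-Ease code, and the tiny "weather" dataset is invented - but it shows the general idea: pick the attribute that best splits the examples, recurse, and emit a nested rule.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attrs):
    """Choose the attribute whose split gives the biggest information gain."""
    def gain(a):
        remainder = 0.0
        for v in set(r[a] for r in rows):
            subset = [lab for r, lab in zip(rows, labels) if r[a] == v]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder
    return max(attrs, key=gain)

def induce(rows, labels, attrs):
    """Recursively induce a decision rule (a nested dict) from examples."""
    if len(set(labels)) == 1:          # every example agrees: a leaf
        return labels[0]
    if not attrs:                      # attributes exhausted: majority vote
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attrs)
    rest = [x for x in attrs if x != a]
    return {(a, v): induce([r for r in rows if r[a] == v],
                           [lab for r, lab in zip(rows, labels) if r[a] == v],
                           rest)
            for v in set(r[a] for r in rows)}

# Invented toy data: decide whether to play, from two attributes.
rows = [{"outlook": "sunny", "windy": "no"},
        {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rain",  "windy": "no"},
        {"outlook": "rain",  "windy": "yes"}]
labels = ["yes", "no", "yes", "no"]
tree = induce(rows, labels, ["outlook", "windy"])
# With this data the rule reduces to a single test on "windy".
```

With four examples the induced rule is readable at a glance; the trouble, as I note below, starts when the number of attributes grows.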
It was an intriguing product and Gates seemed impressed. “You'll probably sell 1,500 of these,” he said. He was right: the program, Expert-Ease, sold about 1,500 copies.
What we were missing, in 1983, was the internet, and its access to unlimited data. If you have lots of data, you can potentially draw conclusions from it that are very likely to be correct. And nowadays there really are masses of data - IBM estimates that 2.5 quintillion bytes of data are created every day; that's 2.5 × 10^18.
Welcome to the world of ‘big data’ - the current buzzword of the information industry, and fast becoming the most accepted way of reaching complex decisions based on evidence.
‘Big data’ has driven many initiatives, from the pioneering Tesco Clubcard, which allowed the company to take decisions about what items to stock in which stores, to IBM's $1bn Watson computer, which used facts harvested from the internet to answer question after question correctly and win the TV quiz show Jeopardy!
The book Moneyball by Michael Lewis, later turned into a Brad Pitt film, showed how rigorous analysis of raw performance statistics allowed the Oakland Athletics baseball team to make dramatically better hiring and promotion decisions than their competitors, who were using traditional expert analysis based on human judgement.
These days, pretty much every corporation uses ‘big data’ to make decisions - from trading billions of shares a day on stock markets, to buying online advertising based on keywords and induction. It has turned out to be the only way to cope with the volume and speed of the decisions required.
Steven Levitt and Stephen Dubner caught this mood early with their book Freakonomics, in which they famously correlated trends and argued that crime rates fell in the USA as a direct result of Roe vs Wade, the decision legalising abortion. Their argument was that unwanted babies, who might have grown up to become criminals, were simply never born once abortion became available, and so could not commit the crimes.
Some argue that this conclusion is too glib, and that other factors were at work. It might, for example, have been the removal of lead from petrol that produced the personality changes behind lower levels of criminal behaviour. In any case, the abortion argument certainly doesn't explain similar falls in crime rates in other developed economies over the same period.
Some analysts have produced correlations that are even less believable. The Spurious Correlations website quotes examples that strongly link divorce rates to margarine sales, and oil imports from Norway to drivers killed by trains.
My Expert-Ease program had a basic flaw: once you had introduced lots of factors, the rules it created were often impossible to understand, and where you can't check and test the logic, you have to doubt the efficacy of the system.
Even today, we need to be very careful when drawing conclusions from ‘big data’.