The Case for Data

There is no shortage of data, computing power or storage. Eric Schmidt, former CEO of Google, put it best when he stated: “there were five exabytes of information created between the dawn of civilisation through to 2003, but that much information is now created every two days.”

There has been a huge increase in the amount of data produced by society, and a huge number of life events are now being recorded. Computing power and storage have also increased. As a result, there is no shortage of data, storage or computing power.

Data Opportunities

Together with this large increase in data has been a concomitant increase in computing power, and data storage costs have also dropped. It is not really clear whether these technical developments are a cause or an effect of the large increase in data volumes. What is clear, though, is that processing large amounts of data has become much cheaper.

Added to this, open source analytic software such as R, Python and Hadoop helps lower the financial hurdle, so the main limiting factor between data and conclusions is the ability of the analyst.

Big Data. Is it Really Necessary?

There is a considerable amount of hype about the potential of big data, and some people claim that the more data the better. Basic statistical inference theory tells us that as the sample size increases, the standard error decreases, which leads to more accurate estimates. From a different perspective, machine learning theory also recognises that cross-validation error generally decreases as the training set grows. But by how much does prediction error decrease?

As the number of observations increases, predictions are likely to get better (i.e. the MSE decreases), but this occurs at a decreasing rate. It may therefore be possible to gain a small increase in predictive ability, but the cost of obtaining it is likely to be relatively large.
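
As a rough back-of-envelope sketch of why the returns diminish, the standard error of a sample mean shrinks as 1/sqrt(n), so each halving of the error requires roughly four times as much data (the standard deviation below is an arbitrary illustrative value):

```python
# Back-of-envelope sketch: the standard error of a mean shrinks as
# 1/sqrt(n), so halving the error requires quadrupling the sample size.
import math

sigma = 10.0  # assumed population standard deviation (illustrative only)
for n in [100, 1_000, 10_000, 100_000]:
    se = sigma / math.sqrt(n)
    print(f"n = {n:>7,}: standard error = {se:.3f}")
```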

Learning curves enable the analyst to make predictions regarding how prediction error is likely to decrease as the training set is changed. To do this, the analyst progressively increases the size of the training set while recording the cross-validation error. The resulting curve can be plotted and used to predict what error to expect if the training set became larger. Typically, prediction error decreases but at a decreasing rate, so more accurate predictions can be obtained, but obtaining them may cost a considerable amount.
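
A minimal sketch of this procedure, assuming scikit-learn and matplotlib are available (the synthetic dataset and the linear model are illustrative placeholders, not a recommendation):

```python
# Sketch: record cross-validation MSE while the training set grows,
# then plot the resulting learning curve.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic data standing in for a real dataset.
X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)

train_sizes, train_scores, cv_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.05, 1.0, 10),
    cv=5,
    scoring="neg_mean_squared_error",
)

# scikit-learn negates error metrics, so flip the sign back to MSE
# and average over the cross-validation folds.
cv_mse = -cv_scores.mean(axis=1)

plt.plot(train_sizes, cv_mse, marker="o")
plt.xlabel("Training set size")
plt.ylabel("Cross-validation MSE")
plt.title("Prediction error falls at a decreasing rate")
plt.show()
```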

Insights for Free

Once the data is collected and organised, there is almost zero marginal cost to analyse it. This may mean that the economic gains from analytics do not need to be enormous to pass a cost-benefit test.

If these small gains can be aggregated, they may collectively result in a large gain and/or a competitive advantage. Small gains over time may also compound such that, once again, there is a large collective gain.
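
As a purely hypothetical illustration of that compounding effect, a 1% gain realised each week accumulates to roughly a 68% gain over a year:

```python
# Hypothetical numbers: a 1% gain compounded weekly for a year.
gain_per_week = 1.01
weeks = 52
print(f"Compounded annual gain: {gain_per_week ** weeks - 1:.0%}")  # ~68%
```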

There is also large scope for applying machine learning algorithms. Uses include: assortment optimisation, in-store behaviour analysis, creating micro-segments, personalised medicine, disease patterns, remote patient monitoring, clinical decision support, cross-selling, fraud detection, geo-targeted advertising, insurance pricing based on micro-segments, optimisation based on controlled experiments and, amongst others, crime investigation.

The Case for Data Visualisation

When analysing a large dataset, it can be difficult to communicate its characteristics in a single equation, model or set of summary statistics. As an alternative, because humans have high visual bandwidth, visualisations can convey more information than is possible using purely mathematical summaries.

A classic illustration is Anscombe’s quartet, a series of four datasets that have exactly the same summary statistics: the same mean, variance, correlation and simple linear regression line. Yet when the datasets are plotted, their characteristics are clearly very different.
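
A minimal sketch of verifying this, assuming seaborn is installed (it ships Anscombe’s quartet as a sample dataset):

```python
# Sketch: the four Anscombe datasets share near-identical summary
# statistics despite looking completely different when plotted.
import seaborn as sns

anscombe = sns.load_dataset("anscombe")  # columns: dataset, x, y

for name, group in anscombe.groupby("dataset"):
    print(
        f"Dataset {name}: mean(x)={group.x.mean():.2f}, "
        f"var(x)={group.x.var():.2f}, mean(y)={group.y.mean():.2f}, "
        f"var(y)={group.y.var():.2f}, corr={group.x.corr(group.y):.2f}"
    )

# Plotting reveals how different the datasets really are.
sns.lmplot(data=anscombe, x="x", y="y", col="dataset")
```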

Data First Approach to Insights – Inductive Reasoning

Deductive reasoning is a top-down approach. It starts with a theory and then formulates hypotheses based on that theory. These hypotheses can be tested with controlled experiments and, based on the resulting observations, either rejected or accepted using statistical techniques.

Inductive reasoning is a more bottom-up approach. It starts with the observations and then attempts to detect patterns. Some patterns may occur with such regularity that a theory may be proposed to explain them.

Given the definitions above, it follows that data mining is a form of inductive reasoning. Critics of this type of reasoning like to highlight laughable instances of spurious correlations, such as the correlation between mozzarella cheese consumption and engineering doctorates awarded.
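
A minimal sketch of why such spurious correlations turn up, using nothing but random noise (all numbers here are illustrative):

```python
# Sketch: compare one noise series against many other noise series;
# with enough comparisons, a strong correlation appears by chance.
import numpy as np

rng = np.random.default_rng(0)
n_series, n_points = 1_000, 10
series = rng.normal(size=(n_series, n_points))

target = series[0]
corrs = [abs(np.corrcoef(target, s)[0, 1]) for s in series[1:]]
print(f"Strongest correlation found in pure noise: {max(corrs):.2f}")
```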

However, there are some powerful historical examples that demonstrate the power of inductive reasoning. One is the discovery of the link between smoking and cancer, made not by doctors using deductive techniques but by actuaries in the early 1950s.