
TechTank Response to Problems with Big Data from a New York Times Editorial

People stand in front of a big data analytics logo at the booth of IBM during preparations for the CeBIT trade fair in Hanover, March 9, 2014. The world's biggest computer and software fair will be open to the public from March 10 to 14.

 

In a recent New York Times editorial, Gary Marcus and Ernest Davis argue, “we need to be levelheaded about what big data can – and can’t – do.” They offer important suggestions but ultimately misattribute generic problems with statistical analysis to big data. It is important not to overinterpret data analytics, but the nine arguments they offer against big data should not dissuade data scientists or policymakers from moving ahead with what is a transformational approach to providing systematic feedback.

One. Can’t establish causality

Big data can establish causality if the necessary data is available and certain assumptions are true. More accurately, the problem is that big data users rarely have access to data that would allow for an inference about causality. Determining causality is a high bar to clear. For many fields, particularly in the social sciences, where doing experiments is extremely complicated or unethical, finding correlations is far preferable to doing nothing or merely comparing averages.

Two. Works well as a supplement but can’t replace traditional scientific inquiry

This is undoubtedly true. We are unaware of anyone who believes we should drop scientific inquiry, and Marcus and Davis offer no examples. Big data is a valuable supplement, especially when double-blind placebo trials are not possible or are inordinately expensive.

Three. Big data is easily gamed

Individuals have adapted in the past to cheat big data analytics. People will attempt to game any system when the stakes are high. Cracking the Google search algorithm, for example, would be worth millions to those who could profit from a higher page rank. But few students would try to game big data in a classroom, because the analysis there carries no comparable financial consequences. The lesson here is not to reject big data but to use it carefully when people have an incentive to manipulate it.

Four. Results are sometimes less robust than they originally seem

Their fourth point describes overfitting. This occurs when a model discovers a trend in the data that reflects random variation rather than a real correlation. Overconfidence in interpreting statistical analysis is almost always a bad idea. If done without forethought, some big data analyses can exacerbate problems of overfitting, but this is a case of a craftsman blaming their tools.
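To make overfitting concrete, here is a minimal sketch under stated assumptions. The data is invented for illustration: a true relationship of y = 2x plus random noise. One model is a straight line; the other is a polynomial forced through every training point. The function names are hypothetical, and none of this comes from Marcus and Davis.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolant(xs, ys):
    """Lagrange polynomial that passes through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a model on a set of points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed so the run is repeatable


def noisy(x):
    # Assumed "truth" for this toy example: y = 2x plus Gaussian noise.
    return 2 * x + rng.gauss(0, 1)


train_x = [float(i) for i in range(8)]
train_y = [noisy(x) for x in train_x]
test_x = [i + 0.5 for i in range(7)]  # held-out points between the training points
test_y = [noisy(x) for x in test_x]

line = fit_line(train_x, train_y)
wiggle = fit_interpolant(train_x, train_y)

for name, model in [("line", line), ("interpolant", wiggle)]:
    print(name, "train MSE:", mse(model, train_x, train_y),
          "held-out MSE:", mse(model, test_x, test_y))
```

The interpolating polynomial matches the training data perfectly because it has memorized the noise, so its training error says nothing about how it will do on new points. The straight line admits some training error and will typically hold up better on the held-out data. Checking a model against data it has not seen is the forethought the paragraph above asks for.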

Five. Echo-chamber effect

Echo-chamber effects only occur if you design your big data model poorly. Imagine students taking a math quiz on a computer. If a student answers a question incorrectly, the program prompts the student with a hint or even the correct answer. The student then takes another quiz the next day with similar questions. A big data system should treat these two observations very differently. There is a much better chance the student gets the question right on day two because of the prompt from day one. If a big data system treats these data points the same, it will damage the quality of the analysis. However, if the model accounts for the difference, it will improve the rigor of the prediction.
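The quiz example can be sketched in a few lines. The records below are made up for illustration, and the field names (such as `saw_hint_earlier`) are hypothetical. The only point is that each observation carries a flag saying whether the system itself influenced it, and the analysis uses that flag.

```python
from collections import defaultdict
from typing import NamedTuple


class Attempt(NamedTuple):
    student: str
    day: int
    correct: bool
    saw_hint_earlier: bool  # True if the program already showed this student a hint


# Hypothetical toy data: a wrong answer on day 1 triggers a hint before day 2.
attempts = [
    Attempt("s1", 1, False, False),
    Attempt("s2", 1, False, False),
    Attempt("s3", 1, True, False),
    Attempt("s4", 1, False, False),
    Attempt("s1", 2, True, True),
    Attempt("s2", 2, True, True),
    Attempt("s3", 2, True, False),
    Attempt("s4", 2, False, True),
]


def accuracy(records):
    """Share of attempts answered correctly."""
    return sum(1 for r in records if r.correct) / len(records)


def accuracy_by_hint(records):
    """Accuracy computed separately for hinted and unhinted attempts."""
    groups = defaultdict(list)
    for r in records:
        groups[r.saw_hint_earlier].append(r)
    return {hinted: accuracy(rs) for hinted, rs in groups.items()}


pooled = accuracy(attempts)          # treats every attempt the same
split = accuracy_by_hint(attempts)   # keeps hinted attempts apart

print("pooled accuracy:", pooled)
print("unhinted accuracy:", split[False])
print("hinted accuracy:", split[True])
```

In this toy data the pooled figure sits between the two groups and describes neither. A real system would use a richer model, but keeping track of which observations its own hints influenced is what prevents the echo.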

Six. Type One Error

Their sixth point is a problem that we all face when answering a question. In statistics lingo this is called a Type I error: the chance of getting a false positive. Imagine you take 100 tests to see if you have a disease, and five of those tests report that you are positive when in reality you are healthy. If you trusted those five results, you would be wrong. For every judgment, whether it relies on a blood test, a weather report, or a big data analytic, there is always a chance of a false positive.
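The arithmetic behind the disease-test example is short enough to write out. This sketch assumes a 5 percent false positive rate (inferred from the five-in-100 figure above) and assumes the tests are independent of one another. Both assumptions are for illustration only.

```python
def expected_false_positives(n_tests, false_positive_rate):
    """Average number of false alarms when a healthy person takes n_tests tests."""
    return n_tests * false_positive_rate


def prob_at_least_one_false_positive(n_tests, false_positive_rate):
    """Chance that at least one of n_tests independent tests raises a false alarm."""
    return 1 - (1 - false_positive_rate) ** n_tests


rate = 0.05  # assumed false positive rate per test

print("expected false positives in 100 tests:",
      expected_false_positives(100, rate))
print("chance of at least one false positive in 1 test:",
      prob_at_least_one_false_positive(1, rate))
print("chance of at least one false positive in 100 tests:",
      prob_at_least_one_false_positive(100, rate))
```

A single test is wrong about a healthy person only rarely, but run enough tests and a false alarm somewhere becomes nearly certain. That holds for blood tests and for big data alike, which is why it is a property of statistical testing in general rather than of big data in particular.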

Seven. Prone to giving scientific solutions to subjective questions

This is entirely an issue with people using their data improperly.  Anyone can use or misuse big data.

Eight. Big data works well with common things

Here they point out that having numerous observations improves the quality of analysis. Again, this is true, but it is equally true for any evaluation effort. It is extremely difficult to assess new experiences where we have no reference point.

Nine. The hype

Finally, they argue that the hype is out of control. Big data has tremendous transformational capacity. They are correct that big data is less important than the invention of antibiotics. But big data could provide timely feedback to a pharmacist to prevent a deadly drug interaction.

When I was seven I wanted to know how many baseball cards I had in my collection. So I stacked them into piles of the same height (each pile had the same number of cards) and started to count them one by one. My mom saw me doing this and tried to explain a faster way called multiplication. Seven-year-old me did not buy it. It seemed like make-believe.

In the end, the analysis from Marcus and Davis fails because they mischaracterize big data. It’s not magic but a group of statistical techniques used to identify relationships in large sets of information. Big data is no panacea, but it can find patterns that we could never find with conventional approaches. When used properly, it can lend insight into how our world works.
