Big Data: Don’t Let the Billions of Data Points Blind You to the Problem of Too Few Sources to Check the Results

Many policymakers are interested in the potential of big data to answer questions that otherwise appear out of reach. For example, inspecting search queries for drug names and side effects can provide valuable information that is exceedingly difficult to gather any other way. Millions of people take drugs in untested combinations. Testing each pairing of drugs is not feasible, since it would require an absurdly high number of clinical trials. People who suffer from complications, however, will often take to Google to see if, say, a blood-pressure medicine taken with an antibiotic might lead to a symptom. A spike in paired searches likely indicates a question worth investigating.
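To make the idea of a "spike in paired searches" concrete, here is a minimal sketch. It assumes we already have weekly counts of searches that mention both a drug pair and a symptom; the counts, the function name `is_spike`, and the threshold of three standard deviations are all hypothetical choices for illustration, not a description of how any real system works.

```python
from statistics import mean, stdev

def is_spike(weekly_counts, threshold=3.0):
    """Flag the latest week if it sits far above the earlier baseline.

    weekly_counts: list of counts, oldest first; the last entry is the
    week being checked. The threshold is an arbitrary illustrative choice.
    """
    *baseline, latest = weekly_counts
    if len(baseline) < 2:
        raise ValueError("need at least two baseline weeks")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Perfectly flat baseline: any increase counts as a spike.
        return latest > mu
    return (latest - mu) / sigma > threshold

# Invented counts of searches pairing two drug names with one symptom.
paired_searches = [12, 9, 11, 10, 13, 11, 41]
print(is_spike(paired_searches))
```

A flag like this only says that a question may be worth asking. It cannot say why the searches rose, which is exactly the difficulty discussed next.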

However, low-hanging fruit like drug interactions hides major shortcomings. Consider the similarly heralded Google Flu Trends, which many championed as a major breakthrough. At one point, Google Flu Trends could predict outbreaks two weeks ahead of the CDC based on people’s searches for flu-related keywords. But recent efforts revealed that Google Flu Trends had lost its predictive power.

What happened to Google Flu?

One likely explanation is Google Flu Trends itself: perhaps more people “googled” flu-related items after Google Flu Trends was announced. Perhaps it was a totally different factor: tweaks to Google’s algorithms, changes in user behavior, some unusual symptoms, or any combination of these. In truth, we don’t know. This enigma is at the heart of one of the weaknesses of big data used to study human social behavior: there may be billions of data points, but the sources are actually very few. Specialized search engines create major problems of self-selection, as they are used by particular groups of people. For example, DuckDuckGo is the search engine of choice for privacy-conscious users, a group certain to have different characteristics than the population as a whole. Thus, it is difficult to do rigorous research with big data from search engines and to ground the results, that is, to test them “out of sample”. It’s almost like having only one group of patients to do a clinical trial on and never really having the opportunity to repeat an experiment.

The Model Organism Problem

In a new paper, I explore these structural issues with social media big data, including what I term the “model organism” problem: big data sources are too few, and the structural biases of these few sources cannot be adequately explored. For example, many big data papers rely on Twitter because the data is relatively accessible. However, Twitter data has structural biases due to the characteristics of the platform, which limits message size to 140 characters and encourages the rapid turnover of a conversational medium compared to, say, blogs.

To make matters worse, we can rarely compare and contrast user behavior across platforms, since the data is proprietary and held in large silos owned by a few corporations: Facebook, Google, Twitter, etc. But in real life, people don’t exist or post on just one medium, which renders results relying on a single platform even shakier. Further, these few corporations tweak their algorithms and introduce many changes that make it hard to know why and how something may or may not work.

In the case of Google Flu Trends, we were lucky enough to have the CDC to provide an “out-of-sample” check on the big data results. Unfortunately, there are too many instances where it is not possible to calibrate or control our results. Some of the big data findings may be golden, and some may be bunk. The problem we face is figuring out which is which, and that requires an extensive effort to pay attention to big data methods as well as big data results.
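For readers who want to see what an “out-of-sample” check looks like in practice, here is a minimal sketch. It assumes we have a weekly search-volume series and an independent benchmark for the same weeks (the role the CDC played for flu). Every number below is invented for illustration, and the straight-line model and the helper names `fit_line` and `mean_abs_error` are hypothetical stand-ins, not the method Google or the CDC used.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def mean_abs_error(model, x, y):
    """Average absolute gap between the model's predictions and y."""
    slope, intercept = model
    return mean(abs(slope * a + intercept - b) for a, b in zip(x, y))

# Invented data: search volume keeps climbing, but the benchmark
# illness rate levels off in the later weeks.
searches = [10, 20, 30, 40, 50, 60, 70, 80]
benchmark = [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 4.0, 4.0]

train, held_out = slice(0, 5), slice(5, 8)

# Fit only on the early weeks, then score on weeks the model never saw.
model = fit_line(searches[train], benchmark[train])
in_err = mean_abs_error(model, searches[train], benchmark[train])
out_err = mean_abs_error(model, searches[held_out], benchmark[held_out])

print(f"in-sample error:     {in_err:.2f}")
print(f"out-of-sample error: {out_err:.2f}")
```

In this toy example the model looks nearly perfect on the weeks it was fitted to and does much worse on the held-out weeks. Without an independent benchmark for those later weeks, there would be no way to notice the gap.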