John M. Abowd and Ian M. Schmutte find that methods to protect and secure “big data” may be corrupting that data, compromising the integrity of research and analysis.
As the government and private companies increase the amount of data made available for public use (e.g. Census data, employment surveys, medical data), efforts to protect privacy and confidentiality (through statistical disclosure limitation or SDL) can often cause misleading and compromising effects on economic research and analysis, particularly in cases where data properties are unclear for the end-user.
Data swapping is a particularly insidious method of SDL and is frequently used by important data aggregators like the Census Bureau, the National Center for Health Statistics and others, which interferes with the results of empirical analysis in ways that few economists and other social scientists are aware of.
To encourage more transparency, the authors call for both government statistical agencies as well as the private sector (Amazon, Google, Microsoft, Netfix, Yahoo!, etc.) to release more information about parameters used in SDL methods, and insist that journals and editors publishing such research require documentation of the author’s entire methodological process.