How to tackle the data collection behind China’s AI ambitions

FILE PHOTO: SenseTime surveillance software, which identifies details about people and vehicles, runs during a demonstration at the company's office in Beijing, China, October 11, 2017. REUTERS/Thomas Peter/File Photo

The United States and China are increasingly engaged in a competition over who will dominate the strategic technologies of tomorrow. No technology is as important in that competition as artificial intelligence: Both the United States and China view global leadership in AI as a vital national interest, with China pledging to be the world leader by 2030. As a result, both Beijing and Washington have encouraged massive investment in AI research and development.

Yet the competition over AI is not just about funding. In addition to investments in talent and computing power, high-performance AI also requires data, and lots of it. The competition for AI leadership cannot be won without procuring and compiling large-scale datasets. Although we have some insight into Chinese AI funding generally (see, for example, a recent report from the Center for Security and Emerging Technology on the People’s Liberation Army’s AI investments), we know far less about China’s strategy for data collection and acquisition. Given China’s interest in integrating cutting-edge AI into its intelligence and military enterprise, that blind spot represents a profound vulnerability for U.S. national security. Policymakers in the White House and Congress should thus focus on restricting the largely unregulated data market, not only to protect Americans’ privacy but also to deny China a strategic asset in developing its AI programs.

China’s data-hungry AI projects

Attempts to discover how China’s security agencies are leveraging data for AI development are foiled by, among other things, a lack of international transparency around data flows as well as China’s own regulatory efforts. Domestically, China’s major cybersecurity law, which took effect in 2017, dramatically increased data protection and data localization requirements for firms operating there. Internationally, China launched the Global Initiative on Data Security in September 2020, an effort designed in part to convince Belt and Road countries to adopt its data security practices and standards. These efforts stress the importance of “data security” while nonetheless giving Chinese officials and agencies greater authority and capability to access individual-level data at home and abroad.

China’s regulatory and policy efforts on data security have helped to accelerate its AI development, even as much of the data it uses remains opaque. Chinese authorities view automated mass surveillance systems as a tool to maintain the Communist Party’s hold on power. These systems are built on large stores of data, some of it acquired illicitly from U.S. companies and systems. By virtue of being home to nearly 20% of the global population, China has an advantage in its ability to gather a wide variety of data through multiple avenues. Through its Belt and Road Initiative, the Chinese government is also laying what the UK foreign intelligence chief recently described as “data traps”: expansive efforts to collect critical data and undermine national sovereignty.

China’s most well-documented use of automated systems for social control is its genocidal campaign against the Uighur minority in Xinjiang. Systems there rely on up to 60 data points to determine if someone is in need of “reeducation,” as PBS Frontline reported in 2020. In order to build this system, Chinese developers and officials first had to define Uighur identity in a way that is comprehensible to a computer, requiring the collection of huge amounts of data to build the necessary algorithms. These data points include communication data, video surveillance, DNA samples collected at checkpoints, and whether someone has grown a beard or quit smoking. With this data, the Communist Party has built a surveillance machine and tool of social control that uses AI to identify individuals allegedly susceptible to radicalization and can even follow Uighurs around the world.

China’s campaign against its Uighur minority gives an indication of how it may use AI surveillance technologies in other systems. For example, consider how a similar system might be built to identify American service members or U.S. government officials. Such a tool might combine readily available personal information on service members (provided, perhaps, by the 2015 breach of the Office of Personnel Management), pattern-of-life data (perhaps from digital location data), and ubiquitous facial recognition technology to train an algorithm to identify military service members. Content uploaded to the Chinese-owned viral video app TikTok, for example, could be used to develop a model of service members who are expressing discontent, are susceptible to influence operations, or could be recruited by Chinese spies (perhaps with the help of purloined travel data). The vulnerability lies not in any individual piece of data but in the ability to aggregate it and draw inferences from it that can be weaponized against the United States and its allies.

However, our understanding of how China is attempting to use AI in national-security applications is clouded by the broad data collection underway by Chinese entities. The Chinese state has long relied on private sector firms to build technological capacity, and the 2017 cybersecurity law and subsequent regulatory measures have made it easier for government officials and agencies to request and access private sector data. This has led to fears that companies like TikTok might be used to supply data to intelligent computing initiatives. TikTok recently began to collect biometric identifiers from U.S. users, and while the parent company ByteDance denies that it supplies the Chinese state with data, it is data stores like this that have been used to build other Chinese AI systems. Other cases of nefarious data collection, such as a prenatal test used by millions of women around the world that supplies gene data to a Chinese company with links to the Chinese military, or the acquisition of American data from various ad tech and other commercial sources, compound fears that our understanding of Chinese data collection and its uses may be paltry at best.

Potential policy responses

As policymakers in Washington assess the future of the U.S. relationship with China, it will be crucial to focus on technological competition and AI technologies, especially the data used to train the algorithms that power them. China’s surveillance policies mean that Beijing can construct powerful datasets for intelligence and military applications, provided it can get useful training data for the algorithms it has already developed. Developing a better understanding of the scope and quality of these datasets should be a top policy priority for U.S. officials and regulators.

First, policymakers need to push for greater algorithmic transparency and establish robust privacy protections that limit the data that can be collected, aggregated, and inferred about people residing in the United States. Regulators need to fully understand the consequences of the use or misuse of algorithms to benefit commercial interests over the rights of individuals and how commercial interests can influence national-security issues. From elected officials to government acquisition specialists, policymakers in the United States also need to educate themselves about AI, and about the data that powers AI applications, if they are ever going to effectively oversee its deployment. There’s little excuse for policymakers to claim they cannot understand how the current data ecosystem works or to blithely accept industry’s claims of self-regulation. Algorithmic failures can have significant negative impacts on the lives of those who are subjected to their decisions, and while China is using its algorithms to suppress dissent, American companies must lead in AI transparency. Regulators must set standards that require transparency, auditability, and replicability for all data used to develop algorithms.

Second, the fact that China can buy U.S. advertising technology and data but the United States cannot buy China’s offers a potential economic lever to push for greater transparency. Policymakers should consider eliminating China’s access to U.S. data and treating U.S. data as a national-security asset, through export controls and other mechanisms. This would require the data collection ecosystem to be better defined and regulated. The vast majority of apps on Google’s and Apple’s app stores are developed and run by foreign companies; in many cases, data collected on American users through these apps are likely leaving the United States. We should be deeply concerned with what China and other foreign actors are doing with U.S. persons’ data, both now and in the future. Placing greater controls on the collection and aggregation of U.S. data, and on China’s access to it, is a good first step toward eliminating an obvious national-security vulnerability.

Third, U.S. government agencies should have a stronger mandate and, where necessary, new authorities to oversee the data industry, especially for data used in AI. For example, the Federal Trade Commission should have greater ability to audit and penalize technology firms for collecting and selling U.S. citizens’ data. Likewise, the Securities and Exchange Commission should have greater ability to scrutinize how foreign entities plan to use the data and algorithms acquired as part of mergers and acquisitions of U.S. firms, and whether foreign ownership stakes in American firms can facilitate data access. Similarly, the Committee on Foreign Investment in the United States should more closely scrutinize any sales of U.S. companies to, or mergers with, entities affiliated with the CCP, which would help to prevent China and other malign actors from using shell companies to purchase data they might otherwise be restricted from accessing. The Department of Defense should also develop robust technological tools, policy recommendations, and best practices to protect service members’ private devices from persistent surveillance.

Fourth, policymakers should seek to ensure that China’s agenda-setting efforts at international standards bodies related to data use, AI governance, and other standards related to internet communications technologies do not trap countries into using only products developed and produced in China. China and Russia’s recognition of the strategic importance of AI governance should serve as a warning for policymakers that these countries view AI as a developing strategic capability. The United States and its allies should ensure that international standards bodies are not captured by authoritarian regimes and used to undermine democratic societies.

Finally, to counter China’s opaque approach to AI development and data collection, the United States should fund academic research that develops creative ideas to assess China’s advances. For example, a group of researchers led by the sociologist Charles Gomez has built an ecological model of academic research across different academic disciplines based on citations, which might be used to determine the areas in which China is rising to prominence. While this doesn’t specifically address data challenges, it provides an interesting model for developing mechanisms for further research and an example of how creative investigative approaches can offer insights on closed political systems.

Not only should AI-enabled applications in the United States be more auditable and transparent, but the data they are trained on should also be viewed and regulated as a national-security asset, protected from foreign collection. We should not expect China to advertise data-related vulnerabilities in our AI ecosystem for us. Instead, the United States should aggressively investigate, audit, and interrogate the data underlying modern machine learning and become the standard bearer for the responsible development and deployment of AI.

Jessica Dawson, Ph.D., is the information warfare division chief at the Army Cyber Institute.  
Tarah Wheeler is a contributing editor to TechStream, a Cyber Project Fellow at the Belfer Center for Science and International Affairs at Harvard University’s Kennedy School of Government, and an International Security Fellow at New America.

The opinions expressed here are those of the authors and in no way represent the official position of the Department of Defense, United States Army, United States government, or Harvard University.