A growing body of work has identified a digital language divide: the disparity between languages in terms of digital content availability, accessibility, and technological support. This disparity between the world’s dominant languages (e.g., English, Spanish, Chinese, and French) and low-resourced languages, such as Malagasy and Navajo, stems from a variety of factors and harms communities whose languages lack sufficient resources or technological support to thrive in the digital realm.
These factors include the lack of substantial investment in language technology for under-resourced languages, limited representation in digital and educational platforms, and insufficient inclusion in AI training datasets, resulting in biased or unrepresentative algorithms. As a result, speakers of these low-resourced languages are often excluded from the benefits of digital technologies and experience bias or discrimination. These problems exacerbate the global digital divide and highlight the urgent need for more equitable language representation in the digital age.
This “digital language divide” is closely tied to the exploitation of data annotators in the Global South, who play a crucial role in developing, refining, and labeling data for AI technologies. Our recent article underscores how these workers face significant challenges due to their marginalized status, as their contributions are often undervalued, and their diverse cultural insights are often overlooked. Many of these annotators work on projects involving dominant global languages, such as English, rather than their native languages. This forces them to use resources such as Google Translate to get through their tasks. Moreover, despite their essential contributions, these annotators often face poor working conditions and low pay, reflecting broader economic inequalities between the Global North and South.
Workers in the Global South contribute 90% of the labor that powers the world economy, yet they receive only 21% of global income, with wages that are 87% to 95% lower than those in the North for work of equal skill. This economic exploitation underscores a broader pattern of inequality, where the contributions of workers from less developed regions primarily benefit and reflect the values of the more privileged societies and languages.
Addressing the digital language divide thus requires not only technological advancements, but also a commitment to equitable resource distribution and the empowerment of local innovation. These actions will ensure that all languages across the world, and their speakers, can benefit from advancements in digital technology.
How AI is reshaping multilingual machine translation
Given the challenges of the digital language divide, multilingual machine translation technologies have the potential to both mitigate and exacerbate these issues. As of December 2024, Google Translate supports 249 languages, and in 2022, the company announced an initiative to support 1,000 of the most spoken languages. In April 2024, one AI company reported that OpenAI’s ChatGPT supported over 90 languages. Anthropic claims that its Claude large language model performs best in English, Portuguese, French, and German, but “knows more than a dozen languages” and can translate to “varying degrees of success.”
While these efforts are laudable, they exclude languages that have less digital representation. Moreover, prominent multilingual datasets used to train natural language processing (NLP) models have also demonstrated systematic issues. A manual audit of 205 language-specific corpora revealed that many are mislabeled, use nonstandard or ambiguous language codes, or lack usable text, and that less than 50% of their sentences are of acceptable quality.
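The audit criteria described above (unusable text, low sentence quality) can be illustrated with a toy screening heuristic. The function names and thresholds below are illustrative assumptions for the sketch, not the auditors' actual methodology:

```python
def looks_usable(sentence: str, min_words: int = 3, min_alpha_ratio: float = 0.7) -> bool:
    """Toy heuristic: a sentence is 'usable' if it has enough words and is
    mostly alphabetic text rather than markup, boilerplate, or numeric debris.
    The thresholds here are illustrative assumptions, not audited values."""
    words = sentence.split()
    if len(words) < min_words:
        return False
    alpha = sum(c.isalpha() for c in sentence)
    return alpha / max(len(sentence), 1) >= min_alpha_ratio

def audit(corpus: list[str]) -> float:
    """Return the fraction of sentences that pass the usability check."""
    if not corpus:
        return 0.0
    return sum(looks_usable(s) for s in corpus) / len(corpus)

sample = [
    "Ny fiteny malagasy dia tenenina any Madagasikara.",  # ordinary Malagasy text
    "<html><body>404 Not Found</body></html>",            # markup debris
    "12345 67890",                                        # numeric junk
]
print(audit(sample))  # → 0.3333333333333333 (one of three sentences passes)
```

Real audits of this kind also rely on human review and language-identification checks; the point of the sketch is simply that a large share of crawled "corpus" text can fail even very permissive quality filters.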
The disparity in multilingual access is clear. Yet even larger companies with broader multilingual coverage have repeatedly demonstrated lackluster performance on low-resourced languages in comparison to African-centric models. For instance, one study found that Meta exaggerated the capabilities of its No Language Left Behind model, reporting that the system delivered high-quality translations between 200 languages, including 55 African languages. However, researchers found that the company’s discussions of the model mentioned neither the low dataset quality for various African languages nor the fact that model performance lags in benchmarks of colloquial, everyday speech.
This recurring issue suggests that, while it is feasible to develop more effective translation models, larger companies may prioritize scaling and improving their technologies across numerous high-resourced languages at the expense of depth and quality in low-resourced languages. In contrast, smaller, localized initiatives often exhibit a greater understanding and integration of the cultural and linguistic contexts necessary for high-quality translations. This localized knowledge and targeted approach enable these smaller projects to outperform the broader, less nuanced efforts of bigger corporations.
The importance of localization in AI tools
Because of the excitement surrounding commercial large language models, investment in smaller, community-focused organizations has declined, as investors are mistakenly led to believe that multilingual translation offerings from Big Tech companies have made smaller companies redundant. However, these perceptions overlook the crucial need for localized initiatives, which are sensitive to the unique cultural and social contexts of specific communities. For example, Google Translate’s inclusion of the Romani language, while seemingly beneficial, has potential risks. It could expose vulnerable communities to disinformation, manipulation, marginalization, and abuse without their consent. This concern is highlighted by instances in Hungary and Romania in which law enforcement misused Romani language translations to target and oppress Romani youth.
This situation illustrates the critical need for localized, community-focused translation initiatives that involve community consent and active participation in how their languages are used in translation models. Such approaches ensure that translations do more than broaden linguistic access with a broad-brush approach; they also respect and protect cultural heritage while safeguarding against misuse.
The path forward
Efforts to close the digital language divide in a responsible manner must go beyond merely adding more languages to datasets. They must also address the power dynamics and biases that shape how these languages are represented and used. This involves recognizing data workers as collaborators and sources of knowledge, rather than just cheap labor. Participatory approaches in scaling NLP research for low-resourced languages have demonstrated that it is feasible to bring local experts without formal technical training into the machine translation development process in a sustainable and responsible manner. By involving these workers more directly in the AI development process, developers can create systems that are more culturally sensitive and better aligned with the needs and goals of the communities they serve.
Companies should also work with local academic institutions and community-based organizations as sources of linguistic expertise to support efforts that expand language offerings of machine translation models. These institutions may have expert scholars who could be consulted regarding data collection and annotation, while also verifying data representation or identifying further sources. Moreover, such experts could serve as gatekeepers for establishing relations of trust and informed consent between community members and AI researchers. In recognizing that many of the world’s languages are oral and serve primarily local functions, emerging work in relational NLP advocates for data collection methods that are community-focused and give local experts an opportunity to reshape data collection methods to align with their values and agendas.
A significant step toward bridging this divide is exemplified by initiatives like Project Elevate Black Voices (EBV), a collaboration between Google and Howard University to create a high-quality African American English (AAE) linguistic dataset. The project highlights the importance of community involvement and responsible data collection, ensuring that the voices of marginalized communities are represented authentically in AI systems. By allowing Howard University to own and manage the dataset, the project aims to build trust and ensure that the data benefit the Black community, reflecting a model that could be adapted for other underrepresented languages and cultures.
Another opportunity to better engage data workers and local community organizations in the AI ecosystem is to support localized AI development initiatives. This involves providing funding and resources to smaller-scale projects that focus on specific linguistic and cultural contexts. Grassroots organizations—like Ghana NLP, Lesan.ai, and Masakhane—are leading the way in developing AI tools tailored to the linguistic and cultural contexts of their communities. Supporting and collaborating with these types of organizations is crucial for fostering trustworthy AI in Africa, as they provide platforms for entrepreneurs and mentorship opportunities for students.
The success of these localized initiatives underscores the importance of community involvement and the potential for more inclusive and effective AI development when local knowledge and expertise are leveraged as opposed to exploited. By building local expertise, we can ensure that communities have the knowledge and skills needed to contribute to and benefit from AI development as stakeholders, rather than as mere consumers or sources of labor.
Efforts to increase AI transparency could involve incentivizing the creation and sharing of open-access language datasets that are focused on low-resourced languages and establishing regulatory frameworks that ensure that open-source datasets and resources created by local organizations and communities are not exploited by Big Tech corporations. While openness is crucial for AI development, it can also threaten the agency and ownership of the communities involved. For instance, open licensing regimes, like Creative Commons, may lead to unintended commercialization of data without benefiting the original contributors. Therefore, a nuanced approach to openness is essential, one that considers the diverse needs of African AI researchers, local communities, and commercial entities. This approach may require reforms to legal frameworks like copyright and privacy laws to ensure that they prioritize the public good and balance transparency with community ownership.
Closing the digital language divide requires equitable resources
Addressing the digital language divide requires a multifaceted approach that goes beyond technological advancements and instead focuses on equitable resource distribution, empowerment of local innovation, and the recognition of the vital contributions of data workers from the Global South. The limitations of multilingual translation technologies underscore the necessity for localized, community-focused initiatives that are sensitive to the unique cultural and social contexts of specific communities. Supporting grassroots organizations and fostering collaborative efforts can lead to AI systems that are more inclusive, culturally aware, and beneficial to all linguistic groups. Building local expertise and ensuring fair labor practices are essential steps toward a more equitable and just digital future, where all languages and their speakers can fully participate in and benefit from technological advancements. This path forward not only mitigates existing disparities, but also promotes the development of AI that truly serves the diverse needs of global communities.
Acknowledgements and disclosures
Google and Meta are general, unrestricted donors to the Brookings Institution. The findings, interpretations, and conclusions posted in this piece are solely those of the authors and are not influenced by any donation.
Commentary
Closing the gap: A call for more inclusive language technologies
December 12, 2024