Artificial intelligence (AI) small language models (SLMs) are designed to perform specific language tasks using far fewer resources than the larger models that have dominated the field so far. As the release of DeepSeek-R1 demonstrated, techniques often used in SLMs, such as distillation and reinforcement learning, can produce models that train faster, consume less energy, and run more efficiently on devices with limited computational power and low-bandwidth connectivity. Typically ranging from a few million to a few billion parameters, SLMs are optimized for targeted tasks and fine-tuned on smaller datasets.
Small models, big impact
SLMs can respond to, analyze, and translate a variety of languages that, unlike English and Mandarin, are not widely used or recorded in digital formats. By reducing computational and data requirements, SLMs offer a practical solution for communities where high-end computing infrastructure is unavailable. (For comparison, while larger models like GPT‑3 contain nearly 175 billion parameters, smaller models typically rely on fewer than 20 billion to 30 billion.)
Many SLMs are distilled from large language models (LLMs), where the LLM generates the data used to train the smaller model. Most notably, DeepSeek achieved greater model efficiency by distilling its own models from open-source LLMs such as Meta’s LLaMA and Alibaba’s Qwen systems; OpenAI, in turn, has alleged that DeepSeek also distilled from OpenAI’s models. Distillation is increasingly common as a way of fine-tuning smaller models: Microsoft distilled OpenAI’s GPT-4 to develop its Phi SLMs, Google distilled Gemini 1.5 Pro into the smaller 1.5 Flash model, and Meta also plans to use distillation in its models. Other SLMs are fine-tuned on smaller, curated datasets; both approaches create models that are more targeted in scope and purpose.
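The core idea behind classic distillation can be sketched in a few lines: the student model is trained to match the teacher's "softened" output distribution rather than hard labels. The sketch below is purely illustrative, using made-up logits over a three-word vocabulary rather than any real model's outputs (production pipelines, including the data-generation approach described above, are far more involved):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a temperature > 1 flattens them."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the student's soft predictions to the teacher's
    soft targets -- the quantity a distilled student is trained to minimize."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Hypothetical logits over a 3-word vocabulary for one prediction step.
teacher = np.array([4.0, 1.0, 0.5])
close_student = np.array([3.8, 1.1, 0.4])  # mimics the teacher well
far_student = np.array([0.2, 3.0, 1.0])    # disagrees with the teacher

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
# loss_far > loss_close: gradient descent on this loss pushes the
# smaller student toward reproducing the teacher's behavior.
```

The temperature parameter is the key design choice: softening both distributions exposes the teacher's relative confidence across wrong answers, which carries more information than a single hard label.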
SLMs’ real-time processing capabilities and efficient deployment on affordable hardware make them particularly suitable for low- and middle-income countries (LMICs) where bandwidth and computational resources are more constrained. SLMs can be trained on much smaller, language-specific datasets, facilitating the development of tailored language tools that support preservation and revitalization efforts through applications like spellcheckers, word predictors, machine translation systems, and digital documentation platforms.
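As an illustration of how modest such tailored tools can be, a word predictor can start from nothing more than next-word counts over a small corpus. The sketch below uses an invented three-sentence English corpus as a stand-in for a community-collected dataset:

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count, for each word, how often each other word follows it."""
    model = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            model[prev][nxt] += 1
    return model

def predict_next(model, word, k=1):
    """Return the k most frequent continuations of `word`."""
    return [w for w, _ in model[word.lower()].most_common(k)]

# Illustrative corpus; a real tool would train on community-provided text.
corpus = [
    "the river flows north",
    "the river is wide",
    "the forest is old",
]
model = train_bigram_model(corpus)
print(predict_next(model, "the"))  # → ['river'] ("river" follows "the" twice)
```

Even this toy approach runs comfortably on a low-end phone; real predictive-text systems refine the same idea with smoothing and longer contexts rather than replacing it wholesale.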
Furthermore, since SLMs are typically built for specific tasks, techniques such as chain-of-thought distillation make it easier to verify internal processing steps, increasing explainability and interpretability. Starting from a narrower dataset, applying model compression techniques, or both can also lead to more accurate output within the model’s target domain.
Indigenous language model applications
Indigenous languages play a critical role in preserving cultural identity and transmitting unique worldviews, traditions, and knowledge, but at least 40% of the world’s 6,700 languages are currently endangered. The United Nations declared 2022-2032 as the International Decade of Indigenous Languages to draw attention to this threat, in hopes of supporting the revitalization of these languages and preservation of access to linguistic resources.
Building on the advantages of SLMs, several initiatives have successfully adapted these models specifically for Indigenous languages. Such Indigenous language models (ILMs) represent a subset of SLMs that are designed, trained, and fine-tuned with input from the communities they serve.
Case studies and applications
- Meta released No Language Left Behind (NLLB-200), a 54 billion–parameter open-source machine translation model that supports 200 languages as part of Meta’s universal speech translator project. The model includes support for languages with limited translation resources. While its breadth of language coverage is novel, NLLB-200 can struggle to capture the intricacies of local context for low-resource languages and, given the scarcity of digitized monolingual data, often relies on machine-translated sentence pairs gathered from across the internet.
- Lelapa AI’s InkubaLM-0.4B is an SLM with applications for low-resource African languages. Trained on 1.9 billion tokens across languages including isiZulu, Yoruba, Swahili, and isiXhosa, InkubaLM-0.4B (with 400 million parameters) builds on Meta’s LLaMA 2 architecture, yielding a far smaller model than the original 7-billion-parameter LLaMA 2 pretrained model.
- IBM Research Brazil and the University of São Paulo have collaborated on projects aimed at preserving Brazilian Indigenous languages such as Guarani Mbya and Nheengatu. These initiatives emphasize co-creation with Indigenous communities and address concerns about cultural exposure and language ownership. Initial efforts included electronic dictionaries, word prediction, and basic translation tools. Notably, when a prototype writing assistant for Guarani Mbya raised concerns about exposing their language and culture online, project leaders paused further development pending community consensus.
- Researchers have fine-tuned pre-trained models for Nheengatu using linguistic educational sources and translations of the Bible, with plans to incorporate community-guided spellcheck tools. Because the Bible translations, primarily produced by colonial priests, often sounded archaic and could reflect cultural abuse and violence, they were classified as potentially “toxic” data that would not be used in any deployed system without explicit Indigenous community agreement.
- Researchers at Bina Nusantara University in Jakarta have applied Meta’s XLSR model to develop speech recognition systems for the Orang Rimba, an Indigenous people in Indonesia. The project obtains local Orang Rimba speech data to fine-tune the XLS-R model into a speech recognition model, seeking to address the scarcity of high-quality data while still adapting technology to local linguistic features.
- ILM development builds on longstanding research and open-source models. Developers continue to refine and distill models and build on existing datasets. An early example is AfriBERTa, a multilingual language model released in 2021, trained on 11 Indigenous African languages and less than 1 GB of text. Four years later, new studies found that with targeted fine-tuning, smaller models adapted from AfriBERTa can outperform larger counterparts such as RoBERTa (which has 125 million parameters and was trained on English) and Meta’s XLM-R (550 million parameters and trained on over 100 languages) on tasks like sentiment analysis.
- In northern Europe, research groups like Divvun and Giellatekno received funding from the Norwegian government to develop digital support for the Indigenous Sámi languages, including spellcheckers, predictive text systems, and machine translation. Native speakers developed the project, which, since deployment, has proven to be a crucial tool for people in Sápmi, Greenland, and the Faroe Islands.
- Active research communities, such as Masakhane, an organization aiming to strengthen natural language processing research across African languages, and Deep Learning Indaba, an annual conference aiming to strengthen African expertise in machine learning and AI, are advancing ILM development in Africa through research competitions, events, and collaborative research support through online spaces.
Despite these advancements, limited resources and funding remain critical hurdles for the sustainable development of language technologies. The scarcity of textual and spoken data limits model training, and linguistic complexity poses additional challenges: in Mi’kmaq, a polysynthetic Indigenous language spoken across Canada’s Atlantic provinces, a single word can convey the meaning of an entire English sentence.
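Part of the difficulty with polysynthetic languages shows up at the tokenization stage: a single long word fragments into many subword pieces, each of which a model must learn from scarce data. The sketch below illustrates the effect with a greedy longest-match tokenizer (in the spirit of WordPiece-style tokenizers), using an invented word and vocabulary rather than real Mi’kmaq:

```python
def greedy_subword_tokenize(word, vocab):
    """Greedy longest-match subword segmentation. Falls back to single
    characters when no vocabulary piece matches at the current position."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# Hypothetical vocabulary containing only short pieces of the target word.
vocab = {"ke", "ta", "pu", "kuwe", "mi"}
long_word = "ketapukuwemi"  # invented stand-in for a long polysynthetic word

pieces = greedy_subword_tokenize(long_word, vocab)
# One 12-character word becomes 5 tokens: ['ke', 'ta', 'pu', 'kuwe', 'mi']
```

When whole "sentences" are packed into single words this way, token sequences grow long and individual morphemes appear rarely in training data, which is one concrete reason models built for analytic languages like English transfer poorly.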
Collaborating with Indigenous communities
Ensuring ethical development of ILMs is crucial to protecting Indigenous knowledge from exploitation, misrepresentation, and unwanted exposure. Involving community input and participation throughout data collection, model training, and deployment helps ensure that model outputs respect traditional knowledge and that technological outcomes align with community values.
A notable example of ethical data stewardship comes from Te Hiku Media, a small nonprofit radio station in New Zealand. When approached by Lionbridge—a Massachusetts-based translation company—to transcribe hundreds of hours of annotated Māori audio, Te Hiku Media declined a $45-per-hour offer. The organization maintained that true data sovereignty means the Māori people, rather than for-profit organizations, should benefit financially from their language. Instead, Te Hiku created data license agreements to ensure that projects built with Māori data benefit the Māori people as a whole.
Recently, Meta launched a Language Technology Partner Program, building on the NLLB-200 project. The program invites partners to contribute speech recordings and written text toward free, open-source machine translation models in exchange for access to technical workshops led by Meta’s research teams. While working directly with these communities is an important first step (Meta also announced a partnership with the Nunavut territorial government in Canada on the Inuktitut and Inuinnaqtun languages), it is unclear whether communities will be able to access the collected data or easily replicate these models. And while Meta has committed to making the models available for free, the long-term accessibility of these tools remains uncertain when the data is aggregated within a for-profit company.
Restricting a language from digitalization entirely, such as with the Guarani Mbya model discussed above, prevents beneficial access to AI-enabled translation and editing tools as well as more diverse training for ILMs on differing language characteristics. However, some Indigenous groups have valid concerns about sharing their linguistic data with large tech companies, especially if there is little transparency about the final product or no clear way to benefit from its commercialization. This can create an extractive relationship.
One approach to addressing this issue is fostering close, transparent relationships with Indigenous leaders to incorporate community input. Another is for Indigenous communities to release existing data under Creative Commons or similar licenses that ensure credit and restrict commercial exploitation. A recent example, the Esethu Framework, provides a community-centric data license that grants research access and allows African-owned entities to use datasets commercially; non-African entities must pay licensing fees, with the aim of creating long-term jobs and ensuring fair compensation for often underpaid data collection workers. By setting more specific terms, such licenses can also encourage greater community participation in data collection, which in turn supports researchers.
Additionally, developers can use long-term impact assessments to continuously evaluate and communicate best use cases for ILMs and identify any adverse outcomes. Conversely, without community collaboration, developers risk creating ILMs with training sets that lack specificity and accuracy, potentially leading to outputs that distort a language’s meaning, or risk overfitting their existing SLMs to a shallow rendering of cultural nuance.
What is on the horizon?
DeepSeek has recently drawn attention to the development and advantages of such models. While DeepSeek-V3 is a larger model with 671 billion parameters, the company has released smaller models in parallel, with parameter counts ranging from 6 billion to 27 billion. By releasing decentralized, largely open-weight AI tools, DeepSeek may reduce both reliance on the more centralized models of large tech companies and the costs of adopting foundational models in LMICs. At the same time, DeepSeek did not publicly disclose much about its data sourcing, so it is unclear whether its developers consulted Indigenous communities or included Indigenous language material when compiling the training data. Additionally, the high-end chips and infrastructure funding that DeepSeek reportedly had access to remain out of reach for researchers and developers in low-resource contexts.
Advancing SLMs in Indigenous contexts requires further research to evaluate their effectiveness across diverse linguistic and cultural settings. Additionally, studies demonstrating model use in environments with limited access to cloud computing or desktop devices will help develop more efficient models for prompting and better support communities that rely on mobile devices.
While the idea of a “universal translator” remains a distant goal for world travelers, SLMs have already demonstrated superior performance in some local contexts. Their fine-tuning on domain-specific data allows them to capture local dialects, idiomatic expressions, and cultural nuances more accurately than generalized large models. In turn, this not only supports language preservation but also enhances digital inclusion in underserved regions.
The evolution of ILMs teaches several key lessons for the broader deployment of SLMs. First, fine-tuning small, context-specific datasets can yield models that are not only resource-efficient but also more accurate in their target domains. Second, community-driven development is crucial; projects that integrate local input, respect cultural nuances, and prioritize data sovereignty tend to achieve more effective and sustainable outcomes. Finally, by embracing ethical AI practices and fostering collaboration between technology developers and Indigenous communities, SLMs can serve as a blueprint for responsible and impactful innovation in diverse linguistic and cultural contexts.
Acknowledgements and disclosures
Microsoft, Meta, Google, and IBM are general, unrestricted donors to the Brookings Institution. The findings, interpretations, and conclusions posted in this piece are solely those of the authors and are not influenced by any donations.
The authors would like to thank Jenalea Rajab and Atnafu Tonja for their invaluable insights.
Commentary
Can small language models revitalize Indigenous languages?
March 19, 2025