CommonLingua is a 2.35-million-parameter language identification model trained on 2,482,568 paragraphs from Structured Wikipedia and Common Corpus. It was trained by Pleias in partnership with the GSMA's "AI Language Models in Africa, by Africa, for Africa" initiative.
CommonLingua is based on a byte-level hybrid architecture that combines three conv1D layers with an attention layer. It was originally designed for large-scale classification of pretraining data and was intentionally trained on diverse data sources, especially realistic documents with OCR errors, with a particular focus on the long tail: 61 African languages are supported, including languages with almost no coverage elsewhere. Since CommonLingua is trained exclusively on open data under free licenses, we release the entire original dataset with detailed licensing information. As of 2026, CommonLingua is the best-performing model on the CommonLID benchmark, with significant gains over the previous baseline.
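To illustrate what "byte-level" input means in practice, here is a minimal sketch of how text might be turned into model input without a tokenizer or vocabulary. It is not CommonLingua's actual preprocessing, which the description above does not specify: the function name `encode_bytes`, the sequence length, and the padding ID of 256 are all assumptions chosen for illustration.

```python
def encode_bytes(text: str, max_len: int = 512, pad_id: int = 256) -> list[int]:
    """Map text to a fixed-length list of byte IDs.

    Hypothetical helper: max_len and pad_id are illustrative assumptions,
    not values taken from CommonLingua.
    """
    # UTF-8 bytes are integers in 0..255, so any script (or OCR noise)
    # is representable without an out-of-vocabulary case.
    ids = list(text.encode("utf-8"))[:max_len]
    # Truncation works on bytes, so it may cut a multi-byte character in half.
    # Pad with an ID outside the byte range so padding is distinguishable.
    return ids + [pad_id] * (max_len - len(ids))


sample = "Habari ya asubuhi"
ids = encode_bytes(sample, max_len=24)
print(ids)
```

A sequence like this would then be embedded and passed through the convolutional and attention layers. Because every input reduces to the same 256 byte values, languages with very little training data need no dedicated vocabulary entries.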
| Organization Type: | For-profit business / social enterprise / B Corp |
|---|---|
| Status: | Active |
| Related Links: | |
| Founded: | 2026 |
| Last Modified: | 5/1/2026 |
| Added on: | 4/30/2026 |