Project - CommonLingua

CommonLingua

https://huggingface.co/PleIAs/CommonLingua
France 🇫🇷

CommonLingua is a 2.35-million-parameter language identification model trained on 2,482,568 paragraphs from Structured Wikipedia and Common Corpus. It was developed by Pleias in partnership with the GSMA's "AI Language Models in Africa, by Africa, for Africa" initiative.

CommonLingua is based on a byte-level hybrid architecture that combines three conv1D layers with an attention layer. It was originally designed for large-scale classification of pretraining data and was deliberately trained on diverse data sources, especially realistic documents with OCR errors, with a particular focus on the long tail: 61 African languages are supported, including languages with almost no coverage. Since CommonLingua is trained exclusively on open data under free licenses, we release the entire original dataset with detailed licensing attribution. As of 2026, CommonLingua is the best-performing model on the CommonLID benchmark, with significant gains over the previous baseline.
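To illustrate what "byte-level" input means in practice, here is a minimal sketch of how text in any script can be turned into a fixed-length sequence of integer ids without a tokenizer. This is not CommonLingua's actual preprocessing: the function name `encode_bytes`, the `max_len` value, and the use of 256 as a padding id are all assumptions made for illustration.

```python
def encode_bytes(text: str, max_len: int = 512, pad_id: int = 256) -> list[int]:
    """Map text to a fixed-length list of UTF-8 byte ids (0-255).

    Hypothetical helper: max_len and pad_id are illustrative choices,
    not values taken from the CommonLingua release.
    """
    # Every string, whatever its script, becomes a sequence of bytes in 0..255.
    ids = list(text.encode("utf-8"))[:max_len]
    # Pad short inputs with an id outside the byte range.
    return ids + [pad_id] * (max_len - len(ids))


samples = {
    "English": "The quick brown fox",
    "Amharic": "ሰላም ለዓለም",
    "Yoruba": "Ẹ káàárọ̀",
}

for name, text in samples.items():
    ids = encode_bytes(text, max_len=32)
    n_bytes = len(text.encode("utf-8"))
    print(f"{name}: {len(text)} characters, {n_bytes} bytes, first ids {ids[:6]}")
```

Because the input vocabulary is just the 256 possible byte values plus padding, a model of this kind needs no language-specific vocabulary, which is one plausible reason a byte-level design suits languages with little existing tooling.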

Organization Type:	For-profit business / social enterprise / B Corp
Status:	Active
Related Links:	Launch post: Pleias and GSMA Launch CommonLingua
Founded:	2026
Last Modified:	5/1/2026
Added on:	4/30/2026

Civic Tech Field Guide
