Ethical data as a foundation

Pleias was founded in Paris in 2024 by Pierre-Carl Langlais, Anastasia Stasenko and Ivan Yamshchikov, and operates out of Station F. The lab's core bet is that the data layer, not raw scale, determines model quality. Its Common Corpus, at roughly 2 trillion tokens, is the largest openly licensed multilingual dataset for LLM pretraining. The work was accepted at ICLR, and the corpus has been reused by organizations including Anthropic, IBM, StepFun and Elastic.

Small models, synthetic data

Pleias pretrains compact reasoning models entirely on open and synthetic data, including the Pleias-RAG family of small retrieval-augmented reasoners that quote their sources. Its SYNTH pipeline generates synthetic pretraining datasets fully autonomously, while products like Synth (agent training data), Stratum (document structuring) and Common Corpus power enterprise deployments. A 600M-parameter model trained for Paris transit operator RATP reportedly outperformed closed models 200 times its size on domain tasks.

European open AI

Pleias collaborates with NVIDIA, Mozilla and the Wikimedia Foundation, and deploys on-premise models in regulated sectors (transport, banking, telecom, energy and healthcare), making it a distinctive European voice for provably clean, open AI training.