Pantagruel: Unified Self-Supervised Encoders for French Text and Speech

Language Resources and Evaluation Conference (LREC), 2026
Oral presentation

PDF   Code

Authors and Affiliations

Phuong-Hang Le: Saclay AI; Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG
Valentin Pelloin: INA (Institut National de l’Audiovisuel)
Arnault Chatelain: CREST (École Polytechnique, ENSAE, CNRS)
Maryem Bouziane: Avignon Université, LIA
Mohammed Ghennai: Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG
Qianwen Guan: LLF (Université Paris Cité and CNRS)
Kirill Milintsevich: INA (Institut National de l’Audiovisuel)
Salima Mdhaffar: Avignon Université, LIA
Aidan Mannion: Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG
Nils Defauw: Univ. Grenoble Alpes, EFELIA-MIAI, IUT2 Grenoble, LIG
Shuyue Gu: LLF (Université Paris Cité and CNRS)
Alexandre Audibert: Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG
Marco Dinarelli: Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG
Yannick Estève: Avignon Université, LIA
Lorraine Goeuriot: Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG
Steffen Lalande: INA (Institut National de l’Audiovisuel)
Nicolas Hervé: INA (Institut National de l’Audiovisuel)
Maximin Coavoux: Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG
François Portet: Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG
Étienne Ollion: CREST (École Polytechnique, ENSAE, CNRS)
Marie Candito: LLF (Université Paris Cité and CNRS)
Maxime Peyrard: Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG
Solange Rossato: Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG
Benjamin Lecouteux: Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG
Aurélie Nardy: Univ. Grenoble Alpes, Lidilem
Gilles Sérasset: Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG
Vincent Segonne: Université Bretagne Sud, CNRS, IRISA
Solène Evain: IRIT, Université de Toulouse, CNRS, Toulouse INP, UT3
Diandra Fabre: Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG
Didier Schwab: Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG

Published

2026

Abstract
We release Pantagruel, a new family of self-supervised encoder models for French text and speech. Instead of predicting modality-tailored targets such as textual tokens or speech units, Pantagruel learns contextualized target representations in the feature space, allowing modality-specific encoders to capture linguistic and acoustic regularities more effectively. Separate models are pre-trained on large-scale French corpora, including Wikipedia, OSCAR, and CroissantLLM for text, together with MultilingualLibriSpeech, LeBenchmark, and INA-100k for speech. INA-100k is a newly introduced 100,000-hour corpus of French audio derived from the archives of the Institut National de l’Audiovisuel (INA), the national repository of French radio and television broadcasts, providing highly diverse audio data. We evaluate Pantagruel across a broad range of downstream tasks spanning both modalities, including tasks from standard French benchmarks such as FLUE and LeBenchmark. Across these tasks, Pantagruel models achieve performance competitive with or superior to strong French baselines such as CamemBERT, FlauBERT, and LeBenchmark2.0, while maintaining a shared architecture that can seamlessly handle either speech or text inputs. These results confirm the effectiveness of feature-space self-supervised objectives for French representation learning and highlight Pantagruel as a robust foundation for multimodal speech-text understanding.

Overview of the Pantagruel architecture. The network starts with a modality-specific pre-net to extract feature vectors from the input text/speech sequence. These features are input to a teacher encoder, while randomly chosen visible tokens (in blue) are input to a student encoder. A lightweight decoder predicts the teacher’s latent representations from the student’s outputs. For text input, an additional masked language modeling (MLM) loss is used. The teacher’s parameters are updated as an exponential moving average (EMA) of the student’s. After training, only the embedding layer and the student encoder are used for fine-tuning on downstream tasks.
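To make the training loop described above concrete, here is a minimal, purely illustrative NumPy sketch of one such feature-space self-supervised step. It is not the Pantagruel implementation: single linear maps stand in for the Transformer student/teacher encoders and the decoder, the dimensions, masking ratio, EMA decay, and learning rate are made-up values, and the text-side MLM loss is omitted. It only shows the data flow: the teacher produces latent targets from the full sequence, the student sees a partially masked input, the decoder regresses the teacher's latents at masked positions, and the teacher then tracks the student via EMA.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8          # feature dimension (illustrative)
T = 10         # sequence length (illustrative)
TAU = 0.999    # EMA decay (illustrative)

# Toy single-layer "encoders": one weight matrix each, standing in for the
# real Transformer student/teacher encoders and the lightweight decoder.
student_W = rng.normal(size=(D, D)) * 0.1
teacher_W = student_W.copy()               # teacher initialized as a copy of the student
decoder_W = rng.normal(size=(D, D)) * 0.1
mask_emb = rng.normal(size=D) * 0.1        # learned mask embedding (here: fixed random vector)

def training_step(x, student_W, teacher_W, decoder_W, lr=1e-2):
    """One feature-space SSL step: predict teacher latents at masked positions."""
    # Teacher sees the full sequence and produces contextual latent targets.
    targets = x @ teacher_W
    # Mask ~40% of positions; the student sees a mask embedding there.
    # (In the real model, Transformer context lets masked positions be
    # inferred from visible ones; this linear toy only shows the data flow.)
    mask = rng.random(T) < 0.4
    if not mask.any():
        mask[0] = True
    x_student = x.copy()
    x_student[mask] = mask_emb
    # Student encodes, decoder predicts the teacher's latents.
    preds = (x_student @ student_W) @ decoder_W
    # Regression loss over masked positions only.
    diff = preds[mask] - targets[mask]
    loss = float(np.mean(diff ** 2))
    # Gradient step for the student (decoder gradients omitted for brevity).
    grad_student = x_student[mask].T @ (2.0 * diff @ decoder_W.T) / mask.sum()
    student_W -= lr * grad_student
    # EMA update: the teacher slowly tracks the student.
    teacher_W[:] = TAU * teacher_W + (1.0 - TAU) * student_W
    return loss

x = rng.normal(size=(T, D))
loss = training_step(x, student_W, teacher_W, decoder_W)
```

After training, as the caption notes, only the (student-side) embedding layer and student encoder would be kept for fine-tuning; the teacher and decoder are discarded.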

Citation

@inproceedings{le2026pantagruel,
  title   = {Pantagruel: Unified Self-Supervised Encoders for French Text and Speech},
  author  = {Phuong-Hang Le and
            Valentin Pelloin and
            Arnault Chatelain and
            Maryem Bouziane and
            Mohammed Ghennai and
            Qianwen Guan and
            Kirill Milintsevich and
            Salima Mdhaffar and
            Aidan Mannion and
            Nils Defauw and
            Shuyue Gu and
            Alexandre Audibert and
            Marco Dinarelli and
            Yannick Est{\`e}ve and
            Lorraine Goeuriot and
            Steffen Lalande and
            Nicolas Herv{\'e} and
            Maximin Coavoux and
            Fran{\c c}ois Portet and
            {\'E}tienne Ollion and
            Marie Candito and
            Maxime Peyrard and
            Solange Rossato and
            Benjamin Lecouteux and
            Aur{\'e}lie Nardy and
            Gilles S{\'e}rasset and
            Vincent Segonne and
            Sol{\`e}ne Evain and
            Diandra Fabre and
            Didier Schwab},
  booktitle    = {Proceedings of the 15th Language Resources and Evaluation Conference (LREC)},
  publisher    = {European Language Resources Association},
  year         = {2026}
}