FlauBERT: Unsupervised Language Model Pre-training for French

The Language Resources and Evaluation Conference (LREC), 2020

PDF   Code   Slides   Video   Publisher

Authors and Affiliations

Hang Le (Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG)
Loïc Vial (Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG)
Jibril Frej (Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG)
Vincent Segonne (Université Paris Diderot)
Maximin Coavoux (Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG)
Benjamin Lecouteux (Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG)
Alexandre Allauzen (E.S.P.C.I, CNRS LAMSADE, PSL Research University)
Benoît Crabbé (Université Paris Diderot)
Laurent Besacier (Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG)
Didier Schwab (Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG)

Published

2020

Abstract
Language models have become a key step to achieve state-of-the-art results in many different Natural Language Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient way to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their contextualization at the sentence level. This has been widely demonstrated for English using contextualized representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified evaluation protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared with the research community for further reproducible experiments in French NLP.
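
The released models can be loaded through the Hugging Face transformers library. Below is a minimal sketch of extracting contextual representations; the flaubert/flaubert_base_cased checkpoint name and the example sentence are assumptions based on the public release, not code taken from the paper.

# Minimal sketch: contextual representations with FlauBERT.
# Assumes the transformers library and the flaubert/flaubert_base_cased
# checkpoint published on the Hugging Face hub.
import torch
from transformers import FlaubertModel, FlaubertTokenizer

model_name = "flaubert/flaubert_base_cased"
tokenizer = FlaubertTokenizer.from_pretrained(model_name)
model = FlaubertModel.from_pretrained(model_name)

sentence = "Le chat dort sur le canapé."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Token-level representations of shape (batch_size, sequence_length, hidden_size),
# which can be fine-tuned for the FLUE tasks (classification, paraphrasing,
# NLI, parsing, word sense disambiguation).
print(outputs.last_hidden_state.shape)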

The architecture of FlauBERT. This is the encoder of the Transformer (Vaswani et al., 2017), under the pre-norm configuration. In an updated implementation of the Transformer (Vaswani et al., 2018), layer normalization is applied before each sub-layer (attention or FFN module) by default, rather than after each residual block as in the original implementation (Vaswani et al., 2017). These configurations are called pre-norm and post-norm, respectively. It was observed by Vaswani et al. (2018), and confirmed by later work (e.g., Wang et al., 2019b; Xu et al., 2019; Nguyen and Salazar, 2019), that pre-norm helps stabilize training. For training FlauBERT-LARGE, we employed the pre-norm configuration and stochastic depth for the Transformer encoder.
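
In code, the difference between the two configurations comes down to where layer normalization sits relative to the residual connection. The sketch below is illustrative only, not the authors' training code; sublayer stands for either the self-attention or the feed-forward module.

# Illustrative PyTorch sketch of post-norm vs. pre-norm residual blocks.
import torch.nn as nn

def post_norm_block(x, sublayer, norm: nn.LayerNorm):
    # Original Transformer (Vaswani et al., 2017):
    # layer normalization is applied after the residual addition.
    return norm(x + sublayer(x))

def pre_norm_block(x, sublayer, norm: nn.LayerNorm):
    # Pre-norm configuration (used for FlauBERT-LARGE):
    # normalize before the sub-layer and keep an identity residual path,
    # which has been observed to stabilize training.
    return x + sublayer(norm(x))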

Citation

@inproceedings{le2020flaubert,
  author       = {Hang Le and
                  Lo{\"{\i}}c Vial and
                  Jibril Frej and
                  Vincent Segonne and
                  Maximin Coavoux and
                  Benjamin Lecouteux and
                  Alexandre Allauzen and
                  Beno{\^{\i}}t Crabb{\'{e}} and
                  Laurent Besacier and
                  Didier Schwab},
  title        = {FlauBERT: Unsupervised Language Model Pre-training for French},
  booktitle    = {Proceedings of the Twelfth Language Resources and Evaluation Conference (LREC)},
  pages        = {2479--2490},
  publisher    = {European Language Resources Association},
  year         = {2020}
}