
Teratec 2022 Forum
Wednesday, June 15 - Technical Workshop

Workshop 08 - 2:00 to 5:30 pm

High performance AI in the industry
Chaired by Cristel Saudemont, France Director, Supercomputing & AI, Higher Education and Research, and Frédéric Parienté, Senior Manager in the Solutions Architecture and Engineering group, Nvidia

BigScience: Collaboratively training a large multilingual language model
By Lucile Saulnier, Machine Learning Engineer, and Thomas Wang, Machine Learning Engineer, Hugging Face

Over the last few years, pre-trained self-supervised language models have proven useful for many applications across many domains. These models aim to discover general representations from large amounts of text without the need for human annotation - a time-consuming and costly task. The representations they produce are extremely valuable because they usually reduce significantly - if not entirely - the volume of annotated data and the training time required for downstream applications. It is therefore not difficult to imagine the impact these models can have on society. Unfortunately, only a few organizations in the world have the resources to train such models, especially since one way to significantly improve their results is to exponentially increase the model size, the volume of training data, and thus the computational resources required to train them. As a result, the scientific community relies on what these resource-rich groups are willing to publish to understand how such models are built, how they work, and how they can be further improved.
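As an illustration of this transfer, the short Python sketch below (not from the talk; the checkpoint bert-base-multilingual-cased, the two-class task, and the toy sentences are assumptions chosen for illustration) starts from a publicly available pre-trained model in the Hugging Face transformers library and fine-tunes only a small classification head on a handful of labeled examples, rather than training a language model from scratch.

    # Minimal sketch: reusing pre-trained self-supervised representations for a
    # downstream classification task. Checkpoint and labels are illustrative.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Start from a publicly available pre-trained checkpoint instead of
    # training a language model from scratch on unannotated text.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=2
    )

    # Only this small annotated set is needed for fine-tuning; the general
    # language representations come from the self-supervised pre-training.
    texts = ["Great results on the benchmark.", "The model failed to converge."]
    labels = torch.tensor([1, 0])
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**inputs, labels=labels)
    print(float(outputs.loss))  # fine-tuning would minimize this loss
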

BigScience is a one-year research project whose main ambition is to train a 176-billion-parameter language model, on the same order of magnitude as GPT-3 (OpenAI's proprietary model), in a transparent, public, and collaborative manner. To do so, more than 1,000 researchers from both academia and industry have gathered into 30 working groups to make decisions at every step of the way: the creation of multilingual datasets, the design of the model, the engineering challenges, the formulation of a new license for the model, the legal treatment of personally identifiable information within training datasets, the development of evaluation tools, and finally reflections on downstream applications in different domains, such as the biomedical field.

These efforts have resulted in:

  • the creation, as a community effort, of a dataset covering over 46 languages;
  • the training of a 176-billion-parameter language model on the Jean Zay supercomputer, using a total of 384 A100 GPUs for 4 months;
  • the open-sourcing of the tools used during these efforts (see the sketch after this list);
  • the publication of multiple research papers.
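
To illustrate the open-sourcing effort, the hedged sketch below loads a released checkpoint with the Hugging Face transformers library and generates a short continuation. The repository id "bigscience/bloom" is an assumption about where the checkpoint is published, and the full 176-billion-parameter model requires several hundred gigabytes of memory, so a smaller released variant would typically be used in practice.

    # Illustrative sketch: loading an openly released BigScience checkpoint.
    # The repository id "bigscience/bloom" is an assumption; the full model is
    # very large, so a smaller variant may be preferable on modest hardware.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
    model = AutoModelForCausalLM.from_pretrained("bigscience/bloom")

    prompt = "BigScience is a collaborative project that"
    inputs = tokenizer(prompt, return_tensors="pt")
    generated = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(generated[0], skip_special_tokens=True))
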
Biography: Lucile Saulnier is a Machine Learning Engineer at Hugging Face, where she develops and supports the use of open-source tools. She is also actively involved in deep learning research projects such as BigScience, a one-year collaborative project aiming to produce a large multilingual language model and a very large multilingual text dataset on the Jean Zay supercomputer.
Biography: Thomas Wang is a Machine Learning Engineer at Hugging Face. He joined the BigScience project, a collaborative effort to train a large language model on the Jean Zay supercomputer, focusing in particular on modeling and data acquisition. Thomas Wang is a graduate of École polytechnique and obtained the MVA master's degree in 2019.

