Pre-trained language models are an important infrastructure capability that supports many use cases, such as classification and generation. Many current monolingual models focus on English, while customers who speak other languages also need pre-trained models for their use cases. In addition, multilingual models may not perform ideally on some downstream tasks in certain languages, so one may want to pre-train a monolingual model for a specific language to improve performance.
In this blog, we provide a general guideline for pre-training and fine-tuning language models using Hugging Face and the Gaudi HPU developed by Habana. For illustration, we use pre-training language models for answer-agnostic question generation in Korean as a running example.
Figure 1. Overview of pre-training and fine-tuning with Hugging Face.
Figure 1 shows the overview of pre-training and fine-tuning with Hugging Face. Specifically, one can follow the steps summarized below.
Choose the model. Hugging Face Transformers provides many state-of-the-art models across different modalities and backends (we focus on language models and PyTorch for now). Roughly speaking, language models can be grouped into two main classes based on the downstream use cases. (Check this list for supported models on Hugging Face.)
Representation models are more suitable for classification tasks such as named-entity recognition. For example,
Prepare the pre-train corpus. Hugging Face Datasets provides useful toolkits to prepare and share data for different use cases (again we focus on NLP for now). Check this tutorial to get started. There are also many public resources that could serve as potential corpora (some of them are also available from Hugging Face; see this page). For example,
Wiki Dump: A complete copy of all Wikimedia wikis.
CC-100: Constructed using the URLs and paragraph indices from the CC-Net repository by processing January-December 2018 Common Crawl snapshots.
Train the tokenizer. Once the model is chosen and the pre-train corpus is prepared, one may also want to train the tokenizer (associated with the model) on the pre-train corpus from scratch. Hugging Face Tokenizers provides a pipeline to train different types of tokenizers. Follow this example to get started. Some commonly used tokenizers include
Pre-train the model. Hugging Face Transformers also provides convenient wrappers for training deep neural networks. In particular,
DataCollator: There are many pre-defined DataCollator classes that can meet the requirements of different models and pre-train tasks (objectives). One can also build a customized DataCollator on top of the existing ones if needed.
TrainingArguments/Trainer: With the convenient wrapper for the training loop, one can simply specify hyperparameters (learning rate, batch size, etc.) in TrainingArguments and pass them, along with the chosen model, pre-train corpus and trained tokenizer, to Trainer for training. One can also build a customized Trainer on top of the existing ones if needed.
Fine-tune the model. Depending on the use case, one can now fine-tune the pre-trained model for different downstream tasks.
Prepare data: as before, Hugging Face Datasets can be used to prepare and share data.
Train: as before, Hugging Face Transformers (DataCollator, Trainer, etc.) can be used to train the model.
Evaluate: Hugging Face Evaluate includes lots of commonly used metrics for different domains (again we focus on NLP for now). Check this tour to get started and this page for the list of supported metrics.
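To make the role of a DataCollator in the steps above more concrete, here is a minimal pure-Python sketch of what a padding collator for sequence-to-sequence training does: it pads inputs and labels in a batch to a common length, builds an attention mask, and marks padded label positions so the loss ignores them. This is an illustration only, not the Hugging Face implementation; the function name collate_seq2seq and the default pad_token_id of 0 are assumptions made for this example. The value -100 matches the default ignore index of PyTorch's cross-entropy loss.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def collate_seq2seq(
    features: List[Dict[str, List[int]]], pad_token_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Right-pad a list of examples so that every row has the same length.

    Hypothetical helper for illustration; pad_token_id=0 is an assumption.
    """
    max_src = max(len(f["input_ids"]) for f in features)
    max_tgt = max(len(f["labels"]) for f in features)

    batch: Dict[str, List[List[int]]] = {
        "input_ids": [],
        "attention_mask": [],
        "labels": [],
    }
    for f in features:
        src_pad = max_src - len(f["input_ids"])
        tgt_pad = max_tgt - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * src_pad)
        # 1 marks a real token, 0 marks padding the model should not attend to.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * src_pad)
        # Padded label positions are excluded from the loss.
        batch["labels"].append(f["labels"] + [IGNORE_INDEX] * tgt_pad)
    return batch


examples = [
    {"input_ids": [5, 6, 7], "labels": [8, 9]},
    {"input_ids": [5], "labels": [8, 9, 10]},
]
batch = collate_seq2seq(examples)
```

A customized DataCollator for a new pre-train objective typically adds its corruption step (for example, masking) before this padding step.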
For our running example, the specification is summarized as follows (one can also use our script as a simple template and replace the model, data, etc. to get started).
Choose the model. As our use case is question generation (answer-agnostic) in Korean, we consider the ProphetNet/XLM-ProphetNet model (check the next sub-section for an overview) and the goal is to provide ProphetNet-KoBase/Large model checkpoints that could be fine-tuned for question generation in Korean.
Prepare the pre-train corpus (script for preparing corpus). In addition to Wiki Dumps and CC-100 mentioned before, we also consider the following sources for our pre-train corpus (the base pre-train corpus is around 16GB and the large pre-train corpus is around 75GB):
Petition: Data collected from the Blue House National Petition (2017.08 ~ 2019.03).
Train the tokenizer (script for training the tokenizer). We train the (base/large) SentencePiece tokenizer (associated with XLM-ProphetNet) with a vocabulary size of 32K on the (base/large) pre-train corpus.
Pre-train the model (script for preparing pre-train data and script for pre-training). We define our customized DataCollator and Seq2SeqTrainer to adopt the future n-gram prediction objective (a new sequence-to-sequence pre-train task proposed by this paper). We pre-train the base model (~125M parameters) on the 16GB base corpus and the large model (~400M parameters) on the 75GB large corpus.
Fine-tune the model (script for preparing fine-tune data and script for fine-tuning). As our downstream task is question generation (answer-agnostic), we consider KLUE-MRC and KorQuAD v1.0 as potential datasets for fine-tuning. We use BLEU scores as evaluation metrics.
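As a rough illustration of the BLEU metric mentioned in the last step, the sketch below computes a corpus-level BLEU score with clipped n-gram precision up to 4-grams, uniform weights and a brevity penalty, using one reference per generated question. It is a simplified stand-in for illustration, not the metric implementation used in our scripts; the function names are hypothetical, and splitting Korean text on whitespace is an assumption (real evaluations usually rely on a dedicated tokenizer, which changes the score).

```python
import math
from collections import Counter
from typing import Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(
    hypotheses: Sequence[Sequence[str]],
    references: Sequence[Sequence[str]],
    max_n: int = 4,
) -> float:
    """Simplified corpus-level BLEU with a single reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each n-gram count by its count in the reference.
            matches[n - 1] += sum(
                min(count, ref_counts[gram]) for gram, count in hyp_counts.items()
            )
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than their references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Whitespace-tokenized toy example (tokenization is an assumption).
reference = "한국 의 수도 는 어디 입니까 ?".split()
hypothesis = "한국 의 수도 는 어디 인가요 ?".split()
score = corpus_bleu([hypothesis], [reference])
```

Because this sketch applies no smoothing, a hypothesis without any matching 4-gram scores 0, which is one reason BLEU is usually reported at the corpus level rather than per sentence.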
Overview of ProphetNet
ProphetNet is a Transformer-based Seq2Seq model with an n-stream self-attention mechanism and a future n-gram prediction objective (with continuous masked span). See Figure 2 for an overview.
Figure 2. Overview of ProphetNet. Figure adopted from the original paper.
Compared with traditional Seq2Seq models that only optimize one-step-ahead prediction, ProphetNet also learns n-step-ahead prediction. This objective tends to provide extra guidance such that the model plans for future tokens and is less prone to overfitting on strong local correlations. For more details on the model, check the original paper (section 2).
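The idea of n-step-ahead prediction can be sketched in a few lines of Python. Stream 0 is the usual next-token target, and stream j asks each decoding position to also predict the token j steps further ahead; positions without a future token are filled with an ignore value. This only illustrates the objective. It is not how ProphetNet or the Hugging Face implementation lays out its tensors, and the helper name future_ngram_targets is hypothetical.

```python
from typing import List

IGNORE_INDEX = -100  # positions with this label would be excluded from the loss


def future_ngram_targets(target_ids: List[int], n: int = 2) -> List[List[int]]:
    """Build n target streams; stream j holds the token j steps further ahead.

    Hypothetical helper for illustration only.
    """
    streams = []
    for j in range(n):
        # Shift left by j and pad the tail so every stream keeps the same length.
        pad = min(j, len(target_ids))
        streams.append(target_ids[j:] + [IGNORE_INDEX] * pad)
    return streams


# With n = 2, each position is trained on the next token and the one after it.
streams = future_ngram_targets([11, 12, 13, 14], n=2)
```

The training loss would then combine the per-stream losses, for instance as a weighted sum; the weighting is described in the original paper.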
For our running example, we mask a span of 10 tokens in every 64 tokens: 80% of the masked tokens are replaced with the [MASK] token, 10% are replaced with randomly picked tokens, and 10% are left unchanged. We set n=2 for n-gram prediction. For more details on the pre-train setting and the choice of n, check the original paper (section 3.1 and section 3.6).
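A minimal sketch of this masking scheme is shown below, assuming that each block of 64 tokens gets one contiguous span of 10 tokens at a random offset and that a trailing block shorter than the span is left untouched. Both choices, as well as the function name mask_spans and its parameters, are assumptions for illustration and may differ from our actual DataCollator.

```python
import random
from typing import List, Optional, Tuple


def mask_spans(
    token_ids: List[int],
    mask_id: int,
    vocab_size: int,
    block_size: int = 64,
    span_len: int = 10,
    seed: Optional[int] = None,
) -> Tuple[List[int], List[int]]:
    """Corrupt one contiguous span per block with the 80/10/10 rule.

    Returns the corrupted ids and the positions the model must reconstruct.
    """
    rng = random.Random(seed)
    corrupted = list(token_ids)
    masked_positions: List[int] = []
    for block_start in range(0, len(token_ids), block_size):
        block_end = min(block_start + block_size, len(token_ids))
        if block_end - block_start < span_len:
            continue  # assumption: skip a trailing block that is too short
        # randint is inclusive, so the span always ends inside the block.
        start = rng.randint(block_start, block_end - span_len)
        for pos in range(start, start + span_len):
            masked_positions.append(pos)
            draw = rng.random()
            if draw < 0.8:
                corrupted[pos] = mask_id  # 80%: replace with [MASK]
            elif draw < 0.9:
                # 10%: random token (may hit special ids in this simple sketch)
                corrupted[pos] = rng.randrange(vocab_size)
            # remaining 10%: keep the original token
    return corrupted, masked_positions


tokens = list(range(1000, 1128))  # 128 dummy token ids, i.e. two blocks of 64
corrupted, positions = mask_spans(tokens, mask_id=4, vocab_size=32000, seed=0)
```

The original tokens at the returned positions form the decoder target, while the corrupted sequence is fed to the encoder.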
Figure 3. Pre-train loss over steps for base model.
Gaudi HPU (Habana Processor Unit) is the deep learning processor developed by Habana. In particular,
Gaudi1 “is the first-generation processor implemented in 16nm providing leading training price/performance”.
Gaudi2 “is the second-generation processor implemented in 7nm delivering performance leadership and efficiency”.
According to this post, Gaudi1 performs between the NVIDIA V100 and A100 GPUs for some use cases, and Gaudi2 outperforms the NVIDIA A100 GPU for some use cases. The migration from GPU to HPU is relatively smooth and consists of two parts:
Environment. It is recommended to use the docker image from Habana Vault for now, as it is configured for HPU with the latest software such as SynapseAI. Passing HPUs to the container works a little differently from passing GPUs. One can follow this example:
## Pull the image
docker pull \
## Run container
docker run -it --rm \
-e HABANA_VISIBLE_DEVICES=all \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
## Check HPU status (similar to nvidia-smi)
Script. Optimum-Habana provides the interface between Hugging Face and HPU. To migrate a script that is based on Hugging Face and GPU, one can follow this example. For distributed training, check this example. Then one can just run the modified script in the container.
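As a rough sketch of what such a migration looks like, the function below swaps the usual Trainer and TrainingArguments for GaudiTrainer and GaudiTrainingArguments from optimum.habana, following the pattern in the Optimum Habana documentation. The helper name build_gaudi_trainer, the hyperparameter values and the gaudi_config_name argument (a Gaudi configuration one would need to supply for the model) are placeholders for this example, and argument names may change between Optimum Habana releases, so check the version installed in the container.

```python
def build_gaudi_trainer(model, train_dataset, data_collator, gaudi_config_name,
                        output_dir="./output"):
    """Create a trainer that runs on HPU instead of GPU (illustrative sketch).

    gaudi_config_name is a placeholder: the name or path of a Gaudi
    configuration suitable for the chosen model.
    """
    # Imported lazily because optimum-habana is typically only installed
    # inside the Habana container.
    from optimum.habana import GaudiTrainer, GaudiTrainingArguments

    args = GaudiTrainingArguments(
        output_dir=output_dir,
        use_habana=True,        # run on HPU
        use_lazy_mode=True,     # lazy execution mode
        gaudi_config_name=gaudi_config_name,
        per_device_train_batch_size=8,  # placeholder hyperparameters
        learning_rate=1e-4,
        num_train_epochs=1,
    )
    return GaudiTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        data_collator=data_collator,
    )
```

Inside the container, one would call build_gaudi_trainer(...).train() in the same way as with the regular Trainer; the rest of the script (model, dataset, DataCollator) stays largely unchanged.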
Note that we have only tested some fine-tuning examples on Gaudi1 for now, and Gaudi2 will be ready for testing soon. For more information, check Habana's developer site.
Also note that this part (training deep learning models on Gaudi) is at an early investigation stage.
All the scripts and documentation are available at this repo.