Stronger Open-Source Llama 2, Available for Commercial Use: The Landscape of Large Models Has Changed Overnight

This article is translated from Machine Heart (机器之心).

Overnight, the landscape of large models has changed dramatically.

Llama has long been regarded as the most capable open-source large model in the AI community. However, due to licensing restrictions, it has never been free for commercial use.

Today, Meta has finally released the long-awaited free and commercially usable version, Llama 2.

The Llama 2 model series released by Meta includes three parameter sizes: 7 billion, 13 billion, and 70 billion. A 34-billion-parameter variant was also trained but not released; it is mentioned only in the technical report.

According to the introduction, compared with Llama 1, Llama 2 was trained on 40% more data, has double the context length, and adopts grouped-query attention. Specifically, the Llama 2 pretrained models were trained on 2 trillion tokens, and the Chat models were fine-tuned on over 1 million human annotations.

The published evaluation results show that Llama 2 outperforms other open-source language models on many external benchmarks, including tests of reasoning, coding, proficiency, and knowledge.

Next, let’s explore Llama 2 in detail from the technical report published by Meta.

In summary, Llama 2 is a family of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Among them, Llama 2-Chat is optimized specifically for dialogue use cases.

In addition to outperforming open-source models on most benchmarks, the Llama 2 series may also be a suitable substitute for closed-source models, based on Meta's human evaluations of usefulness and safety.

Meta provides a detailed description of the fine-tuning and safety improvement methods used for Llama 2-Chat, so that the community can build on this work and contribute to the responsible development of large language models.

Pre-Training

To create a brand new Llama 2 model series, Meta built upon the pre-training method described in the Llama 1 paper. They used an optimized autoregressive transformer and made some modifications to improve performance.

Specifically, Meta performed more robust data cleaning, updated the data mix, trained on 40% more total tokens, and doubled the context length. Table 1 below compares Llama 2 and Llama 1 in detail.

The training corpus of Llama 2 is a mix of data from publicly available sources and does not include data from Meta's products or services. Llama 2 adopts most of the pretraining settings and model architecture of Llama 1, including the standard Transformer architecture, pre-normalization with RMSNorm, the SwiGLU activation function, and rotary positional embeddings (RoPE).
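For readers who want a concrete picture of two of these components, below is a minimal PyTorch-style sketch of RMSNorm pre-normalization and a SwiGLU feed-forward block. It illustrates the general technique only; it is not Meta's implementation, and the layer sizes are left to the caller.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm used for pre-normalization in Llama-style models."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by the RMS of the activations, then apply a learned scale.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: silu(x W1) * (x W3), projected back with W2."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x):
        return self.w2(nn.functional.silu(self.w1(x)) * self.w3(x))
```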

Regarding the hyperparameters, Meta uses the AdamW optimizer for training, with β_1 = 0.9, β_2 = 0.95, and eps = 10^−5. The cosine learning rate schedule is employed with a warm-up of 2000 steps, and the final learning rate is decayed to 10% of the peak learning rate.
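As a rough illustration of this schedule (a sketch, not Meta's training code), the helper below computes the learning rate at a given step, assuming a chosen total step count. The 2,000-step warm-up and the decay to 10% of the peak follow the description above; the peak learning rate in the example is illustrative only.

```python
import math

def cosine_lr_with_warmup(step: int, peak_lr: float, total_steps: int,
                          warmup_steps: int = 2000, min_ratio: float = 0.1) -> float:
    """Cosine learning-rate schedule with linear warm-up, decaying to 10% of the peak."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)

# Example: an illustrative peak LR of 3e-4 over 500k total steps.
print(cosine_lr_with_warmup(step=1000, peak_lr=3e-4, total_steps=500_000))
```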

Figure 5 below illustrates the training loss curve of Llama 2 under these hyperparameter settings.

In terms of training hardware, Meta pretrained the models on its Research Super Cluster (RSC) and on internal production clusters, both of which use NVIDIA A100 GPUs.

Regarding the carbon footprint of pretraining, Meta estimated the carbon emissions generated by pretraining the Llama 2 models from the GPUs' power consumption and carbon efficiency, following methods from prior research.

Evaluation of Llama 2 Pretrained Model

Meta reported the results of Llama 1, Llama 2 Base Model, MPT (MosaicML), Falcon, and other open-source models on standard academic benchmarks.

Table 3 below summarizes the overall performance of these models on a range of popular benchmarks, and the results indicate that Llama 2 outperforms Llama 1.

Beyond open-source models, Meta also compared Llama 2 70B with closed-source models, as shown in Table 4. Llama 2 70B approaches GPT-3.5 on MMLU and GSM8K, but there is still a significant gap on coding benchmarks.

Furthermore, on almost all benchmarks, Llama 2 70B performs on par with or better than Google's PaLM (540B), although there is still a considerable gap to GPT-4 and PaLM-2-L.

Fine-Tuning

Llama 2-Chat is the result of several months of research and iterative application of alignment techniques, including instruction tuning and RLHF, and required significant computational and annotation resources.

Supervised Fine-Tuning (SFT)

Third-party supervised fine-tuning data is available from many sources, but Meta found that much of it lacked diversity and quality, particularly for aligning LLMs with dialogue-style instructions. Therefore, they first focused on collecting several thousand high-quality SFT examples, as shown in Table 5 below.

During fine-tuning, each sample consists of a prompt and an answer. To ensure the model's sequence length is properly filled, Meta concatenates all prompts and answers in the training set, using a special token to separate the prompt and answer segments. Under the autoregressive objective, the loss on tokens from the user prompt is zeroed out, so gradients are propagated only through answer tokens. Finally, the model is fine-tuned for two epochs.
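A minimal sketch of this loss masking, under the common convention (used, for example, by PyTorch's cross-entropy loss and Hugging Face causal LMs) that label positions set to -100 are ignored. The separator token and helper name are illustrative assumptions, not Meta's implementation.

```python
import torch

def build_sft_example(prompt_ids, answer_ids, sep_id, ignore_index=-100):
    """Concatenate prompt and answer, zeroing out the loss on prompt tokens.

    prompt_ids / answer_ids: lists of token ids; sep_id: a special separator token.
    Labels use ignore_index on the prompt segment so that, under the
    autoregressive objective, gradients flow only through answer tokens.
    """
    input_ids = prompt_ids + [sep_id] + answer_ids
    labels = [ignore_index] * (len(prompt_ids) + 1) + answer_ids
    return torch.tensor(input_ids), torch.tensor(labels)

# Usage: feed (input_ids, labels) to a causal LM; positions labeled -100
# are skipped by the cross-entropy loss.
```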

RLHF

RLHF is a model-training procedure applied to a fine-tuned language model to further align its behavior with human preferences and instruction following. Meta collected data representing sampled human preferences, in which human annotators choose which of two model outputs they prefer. This human feedback is then used to train a reward model, which learns the annotators' preference patterns and can then automate preference decisions.

Table 6 presents statistics of the reward modeling data Meta collected over time and compares it with several open-source preference datasets. They gathered a large dataset of over 1 million binary comparisons based on human-specified criteria, referred to as Meta reward modeling data.

Note that the number of tokens in prompts and answers varies depending on the textual domain. Prompts from summaries and online forums are usually longer, while prompts in dialogue settings are typically shorter. Compared to existing open-source datasets, the preference data in this paper contains more dialogue turns and has a longer average length.

The reward model takes a model response and its corresponding prompt (including context from previous turns) as input and outputs a scalar score representing the quality of the generation (e.g., usefulness and safety). Using this score as the reward, Meta optimized Llama 2-Chat during RLHF to better align with human preferences and improve usefulness and safety.
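According to the technical report, the reward model is trained with a binary ranking loss over chosen/rejected response pairs, optionally with a margin term reflecting how strongly annotators preferred one response. A minimal sketch of that loss follows; the function name and batch handling are illustrative.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(score_chosen, score_rejected, margin=None):
    """Binary ranking loss: -log(sigmoid(r_chosen - r_rejected - margin)).

    score_chosen / score_rejected: reward-model scores (tensors) for the
    preferred and rejected responses to the same prompt; the optional
    margin encodes how much better annotators rated the chosen response.
    """
    diff = score_chosen - score_rejected
    if margin is not None:
        diff = diff - margin
    return -F.logsigmoid(diff).mean()

# Example with dummy scores for a batch of two comparisons.
loss = reward_ranking_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.5]))
```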

In each batch of human preference annotations used for reward modeling, Meta holds out 1000 samples as a test set to evaluate the model and refers to the set of all prompts in the corresponding test set as “meta-usefulness” and “meta-safety” respectively.

Table 7 reports the accuracy results. As expected, Meta’s own reward model performs the best on the internal test set collected based on Llama 2-Chat. The “usefulness” reward model performs the best on the “meta-usefulness” test set, and similarly, the “safety” reward model performs the best on the “meta-safety” test set.

Overall, Meta’s reward model outperforms all baseline models, including GPT-4. Interestingly, even though GPT-4 was not directly trained or specifically designed for this reward modeling task, it performs better than other non-meta reward models.

Scaling Trends

Meta studied the scaling trends of the reward model with respect to data and model size, fine-tuning models of different sizes on increasing amounts of reward model data collected week over week. Figure 6 below reports these trends, showing the expected result that larger models achieve higher performance given similar amounts of data.

As more batches of annotated human preference data arrived, it became possible to train better reward models and collect more prompts. Meta therefore trained successive versions of the RLHF models, referred to as RLHF-V1, …, RLHF-V5.

Two main algorithms were used for RLHF fine-tuning:

  • Proximal Policy Optimization (PPO)
  • Rejection sampling fine-tuning (a minimal sketch follows below)
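Below is a rough sketch of the rejection-sampling step (assumed interfaces, not Meta's code): sample several candidate responses per prompt, keep the one the reward model scores highest, and then fine-tune on the selected responses.

```python
def rejection_sample(prompts, generate, reward, k=4):
    """For each prompt, draw k candidate responses and keep the highest-reward one.

    generate(prompt) -> str and reward(prompt, response) -> float are assumed
    interfaces to the policy model and the reward model. The selected
    (prompt, best_response) pairs are then used for supervised fine-tuning,
    as in rejection sampling fine-tuning.
    """
    selected = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        best = max(candidates, key=lambda r: reward(prompt, r))
        selected.append((prompt, best))
    return selected
```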

RLHF Results

Firstly, let’s discuss the model-based evaluation results. Figure 11 below reports the progress of different SFT and RLHF versions in terms of safety and usefulness, evaluated through Meta’s internal safety and usefulness reward models.

Let’s take a look at the human evaluation results. As shown in Figure 12, the Llama 2-Chat model performs significantly better than the open-source models in both single-turn and multi-turn prompts. Specifically, Llama 2-Chat 7B outperforms MPT-7B-chat on 60% of the prompts, and Llama 2-Chat 34B demonstrates over 75% overall win rate compared to the similarly sized Vicuna-33B and Falcon 40B models.

Meta also points out several limitations of the human evaluation. Although the results indicate that Llama 2-Chat is on par with ChatGPT in human evaluations, the following caveats should be kept in mind:

  • By academic and research standards, the prompt set used here is large (about 4,000 prompts). However, it does not cover the real-world usage of these models, which is likely far more diverse.
  • Prompt diversity may be another factor affecting the results; for example, the prompt set does not include any coding- or reasoning-related prompts.
  • Only the final generation of a multi-turn conversation is evaluated. A more interesting evaluation would be to ask the model to complete a task and rate the overall experience across multiple turns.
  • Human evaluation of generative models is inherently subjective and noisy, so evaluating with a different prompt set or different instructions may yield different results.

Safety

This study evaluated the safety of Llama 2 along three key dimensions, using three commonly used benchmarks:

  • Truthfulness: whether the language model produces false information, evaluated with the TruthfulQA benchmark.
  • Toxicity: whether the language model produces "toxic," rude, or harmful content, evaluated with the ToxiGen benchmark.
  • Bias: whether the language model produces biased content, evaluated with the BOLD benchmark.

Pretraining Safety

Firstly, the pretraining data is vital for the model. Meta conducted experiments to assess the safety of the pretraining data.

In this study, the researchers measured the "toxicity" of the English data in the pretraining corpus using a HateBERT classifier fine-tuned on the ToxiGen dataset. The specific results are shown in Figure 13 below:
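A hedged sketch of this kind of measurement using the Hugging Face transformers pipeline is shown below. The checkpoint name is a placeholder assumption, not necessarily the exact classifier Meta used.

```python
from transformers import pipeline

# Placeholder checkpoint: substitute a HateBERT-style classifier
# fine-tuned on ToxiGen, as described in the text above.
toxicity_clf = pipeline("text-classification", model="path/to/toxigen-finetuned-hatebert")

def toxicity_score(document: str) -> float:
    """Return the classifier's confidence for its predicted label.

    Label names and score semantics depend on the chosen checkpoint,
    so in practice you would check which label corresponds to "toxic".
    """
    result = toxicity_clf(document)[0]
    return result["score"]

# Documents scored as toxic above a chosen threshold would be counted
# toward the corpus-level statistics reported in Figure 13.
```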

To analyze the issue of bias, this study conducted statistical analysis on the pronouns and identity-related terms in the pre-training corpus, as shown in Table 9 below:

Furthermore, in terms of language distribution, the Llama 2 corpus covers the languages and respective proportions shown in Table 10.

Safety Fine-tuning

Specifically, Meta employs the following techniques in safety fine-tuning:

  1. Supervised Safety Fine-Tuning
  2. Safety RLHF
  3. Safety Context Distillation

During the early development of Llama 2-Chat, Meta observed that it could generalize from the safety demonstrations given during supervised fine-tuning. The model quickly learned to write detailed safe responses, address safety concerns, explain why a topic might be sensitive, and provide additional useful information. In particular, when the model produces safe responses, they are often more detailed than what the average annotator writes. Therefore, after collecting only a few thousand supervised demonstrations, Meta switched entirely to RLHF to teach the model how to write more nuanced responses. Another benefit of comprehensive tuning with RLHF is that it makes the model more robust to adversarial (jailbreak) attempts.

Meta first conducts safety RLHF by collecting human preference data on safety: annotators write prompts they believe could elicit unsafe behavior, compare the responses of multiple models to those prompts, and select the safest response according to a set of guidelines. The human preference data is then used to train a safety reward model, and the adversarial prompts are reused during the RLHF stage to sample from the model.

As shown in Figure 15, Meta uses the average reward model score as the performance result of the model in terms of safety and usefulness. Meta observes that the model’s performance in handling risks and adversarial prompts significantly improves when they increase the proportion of safety data.

Finally, Meta refined the RLHF pipeline with context distillation. This involves generating safer model responses by prefixing the prompt with a safety preprompt such as "You are a safe and responsible assistant," and then fine-tuning the model on the safer responses without the preprompt, which effectively distills the safety preprompt (the context) into the model.

Meta uses a targeted approach in which the safety reward model decides, for each sample, whether to use the context-distilled output.
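A rough sketch of how such targeted context distillation data could be constructed (the generate and safety_reward callables are assumed interfaces, not Meta's pipeline): generate a response with the safety preprompt attached, and keep the (original prompt, safer response) pair only when the safety reward model actually scores the distilled response higher.

```python
SAFETY_PREPROMPT = "You are a safe and responsible assistant."

def build_context_distillation_data(prompts, generate, safety_reward, min_gain=0.0):
    """Select context-distilled examples for safety fine-tuning.

    generate(prompt) -> str and safety_reward(prompt, response) -> float are
    assumed interfaces. For each adversarial prompt, compare the response
    generated with and without the preprompt, and keep the distilled pair
    (prompt without preprompt, safer response) only when the safety reward
    model scores the distilled response higher by at least min_gain.
    """
    distilled = []
    for prompt in prompts:
        plain = generate(prompt)
        prefixed = generate(f"{SAFETY_PREPROMPT}\n{prompt}")
        if safety_reward(prompt, prefixed) - safety_reward(prompt, plain) > min_gain:
            distilled.append((prompt, prefixed))  # fine-tuned without the preprompt
    return distilled
```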

The chart below (Figure 17) shows the overall percentage of violations and safety ratings for various LLMs.

The following image (Figure 18) illustrates the percentage of violations in single-turn and multi-turn conversations. One trend across models is that multi-turn conversations are more likely to elicit unsafe responses. Even so, Llama 2-Chat still performs well relative to the baselines, particularly in multi-turn conversations.

The chart below (Figure 19) shows the percentage of safety violations for different LLMs across different categories.
