Zephyr-7B-α: Can a Low-Cost LLM Perform Better Than Llama-70B?
An Introduction to HuggingFace's New Fine-Tuned Model Zephyr-7B-α
It has been a while since I last introduced new open-source language models. Recently, much of the tech community has noticed Zephyr, a 7B model, outperforming the Llama-2-70B model on MT-Bench, a benchmark that assesses the quality of chatbots. Zephyr's high scores have drawn the interest of AI developers and enthusiasts, and increased my curiosity about how they were achieved.
1. Performance in Metrics
Similar to MistralOrca, the Zephyr-7B-α model, a fine-tuned version of the Mistral model, has delivered an impressive performance. It is competitive with the Llama-2-70B-Chat model on numerous benchmarks and outperforms it on MT-Bench, as illustrated in the spider chart below.
MT-Bench stands for Multi-turn Benchmark, which consists of a multi-turn question set. It was introduced as a way to verify the agreement between LLM judges (e.g. GPT-4) and human preferences in the evaluation of LLM-based chat assistants. The evaluation uses an iterative, dialogue-based format, as opposed to single-turn or one-time response evaluations.
The table below shows the impressive score on MT-Bench. More details about MT-Bench can be found here.
An analysis of the model's average performance across various other measures, including ARC, HellaSwag, MMLU, and TruthfulQA, shows an equally impressive standing. The model's average score across these metrics is 66.08, on par with Mistral OpenOrca's 66.08, surpassing Llama-13B-Chat's 56.9 and Mistral-7B's 60.45. The only real competition is the Llama-2-70B-Chat model, with a slightly higher score of 66.8.
These compelling results position Zephyr Alpha as a useful and competitive model given its small size, holding immense potential for a wide range of low-cost applications.
2. How Was It Trained
The charm of this model lies more in its training process than in its raw performance. The HuggingFace team conducted standard fine-tuning on a 200k subset of the UltraChat dataset, keeping the dialogues considered helpful.
For the preference-training method, they adopted Direct Preference Optimization (DPO), an alternative to the PPO algorithm in RLHF that eliminates the need for a separate reward model, and used it to align the model with user preferences. It is exciting to see DPO prove stable, completing in hours a training run that would usually require days with PPO.
The HF team additionally leveraged UltraFeedback for DPO, a dataset featuring 64k prompts and completions for a wide range of tasks. According to HF's report, the total compute cost, using TRL and DeepSpeed ZeRO-3, was around $500 for eight hours on 16xA100s.
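For readers who want to try the same recipe, a DPO run with TRL looks roughly like the minimal sketch below. This is not the team's actual training script: the checkpoint name, dataset, and hyperparameters are placeholders, and the exact DPOTrainer arguments depend on your TRL version.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

# Placeholders: use your own SFT checkpoint and a preference dataset with
# "prompt", "chosen", and "rejected" text columns.
model = AutoModelForCausalLM.from_pretrained("my-sft-checkpoint", torch_dtype=torch.bfloat16)
ref_model = AutoModelForCausalLM.from_pretrained("my-sft-checkpoint", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("my-sft-checkpoint")
train_dataset = load_dataset("my-preference-dataset", split="train")

trainer = DPOTrainer(
    model,
    ref_model,
    beta=0.1,  # strength of the preference constraint
    args=TrainingArguments(output_dir="zephyr-dpo", per_device_train_batch_size=2,
                           learning_rate=5e-7, num_train_epochs=1, bf16=True),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()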
Why is DPO more stable?
The HuggingFace team found that PPO is extremely sensitive to hyperparameter choices and generally a pain to train with, because there are three models to deal with (the reference model, the active model, and the reward model). For example, small changes to the learning rate or batch size would give significantly different training dynamics, where the model would exhibit "mode collapse" and simply converge to repetitive answers.
In contrast, it took them about two days to get DPO up and running, and they found it to be remarkably stable across hyperparameter choices. You can find more details about DPO training here.
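That stability is easier to see from the objective itself: DPO only needs the trainable policy and a frozen reference copy of the SFT model, with no reward model in the loop. Below is a minimal sketch of the DPO loss in PyTorch, assuming you already have per-sequence log-probabilities of the chosen and rejected completions; the names are illustrative, not taken from the Zephyr codebase.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # How much more the policy prefers "chosen" over "rejected",
    # measured relative to the frozen reference model.
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = beta * (policy_logratios - ref_logratios)
    # Maximize the likelihood that the chosen completion is ranked above the rejected one.
    return -F.logsigmoid(logits).mean()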
3. Play With Zephyr
There is an immediate way to play with the Zephyr-7B model. HuggingFace has created a space for inference.
https://huggingfaceh4-zephyr-chat.hf.space/
First, I wanted to see whether replacing reward-model learning with DPO could lead to problematic generations, so let's challenge it with an ethical question, such as: how do you make a poison that no one can recognize? From the answer below, the model follows common human values and refuses to give specific information, which shows that even with the reward model removed from training, the low-cost DPO process can still keep a small model under control.
I don't have many opportunities to evaluate multi-turn conversations with Zephyr and other models, so from a practical perspective, I tested one of the prompts from the LangChain Hub for RAG projects.
This is a prompt for very simple retrieval QA over a given context. Prompt template:
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don’t know the answer, just say that you don’t know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:
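If you want to reuse it programmatically, this template can also be pulled from the LangChain Hub; the sketch below assumes the rlm/rag-prompt entry (which matches the wording above) and requires the langchainhub package.
from langchain import hub

# Pull the RAG prompt template from the LangChain Hub (assumed entry name).
prompt_template = hub.pull("rlm/rag-prompt")
print(prompt_template.format(
    question="List the advantages of AutoGen compared to normal agents.",
    context="<paste the retrieved context here>",
))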
I copied an introduction page of the paper https://arxiv.org/abs/2308.08155, which describes the AutoGen framework, into the {context} section and asked the model to extract the key information through the {question} "List the advantages of AutoGen compared to normal agents."
The generated text summarizes the key points from the document in an organized way, which puts it well ahead of other 7B models in terms of retrieval and extraction. Unfortunately, when you compare it to the text generated by the Llama-2-70B-Chat model with the same prompt, you will find that the latter sticks more closely to the question, which asks for a comparison with "normal agents".
It looks like more comparisons are worth doing in order to get a full picture of Zephyr's performance.
4. Code for Inference
After several rounds of iteration, the pipeline() function in the HuggingFace Transformers library, combined with a prompt template, is quite user-friendly.
Install the transformers and accelerate packages.
!pip install transformers accelerate
Initialize the text-generation pipeline with the zephyr-7b-alpha model, the torch.bfloat16 tensor dtype, and automatic device mapping.
import torch
from transformers import pipeline

# Load Zephyr-7B-α in bfloat16 and let accelerate place it on the available device(s).
pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-alpha",
                torch_dtype=torch.bfloat16, device_map="auto")
Define a chat-style message format for the prompt. Its structure is almost the same as that of OpenAI's chat models.
messages = [
{
"role": "system",
"content": """
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Context: {the_context_of_Arxiv_paper}
""",
},
{"role": "user", "content": "List the advantages of AutoGen compared to normal agents."},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
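For reference, the rendered prompt should look roughly like the snippet below: Zephyr's chat template wraps each turn in special tokens and appends an empty assistant turn for generation. The exact string depends on the tokenizer version, so treat this as an approximation.
<|system|>
You are an assistant for question-answering tasks. ...</s>
<|user|>
List the advantages of AutoGen compared to normal agents.</s>
<|assistant|>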
As a final step, call pipe() with the prepared chat prompt and parameters specifying how it should generate new text.
# A low temperature keeps the answer focused on extraction; sampling parameters can be tuned.
outputs = pipe(prompt, max_new_tokens=800, do_sample=True, temperature=0.2, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
Since Zephyr-7B is quite a small model, we can smoothly run inference on a T4 GPU in a free Google Colab account.
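If the T4's 16 GB of memory turns out to be tight with bfloat16 weights, one option (not part of the original setup) is to quantize the weights to 4-bit with bitsandbytes; a rough sketch, assuming the bitsandbytes package is installed:
import torch
from transformers import BitsAndBytesConfig, pipeline

# Quantize the weights to 4-bit to reduce GPU memory usage (requires bitsandbytes).
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-alpha",
                device_map="auto", model_kwargs={"quantization_config": quant_config})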
That's awesome! It's now possible to build a totally free chatbot with a Llama-2-70B-equivalent model and explore further.