Running AutoGen+Gemma (or Any Model), Here is the Ultimate Solution
A Quick Guide to Using Any Custom Model in AutoGen via the CustomModelClient Method
In the previous article, I demonstrated how to integrate Google's small language model, Gemma, with the AutoGen framework by deploying the Ollama server as an inference tool to run Gemma locally on Kaggle's free-tier GPU environment. This approach allowed us to leverage Gemma's impressive capabilities while keeping full control over the generative application, without relying on paid APIs or exposing private data.
However, using Ollama ties you to the models available on its platform. Although that model library is rich and kept up to date, this becomes a challenge when you want to run a fine-tuned version or a model you trained yourself.
In this article, I will dive into AutoGen's built-in mechanism for inference with custom models, again using Gemma-7b for a quick demonstration. This mechanism lets you port any model, API, or even hardcoded response into your LLM agents, opening up a world of possibilities for highly customized and optimized multi-agent applications without relying on any external service.
1. Introduction to the CustomModelClient
To meet the growing demand for running customized open-source LLMs, AutoGen introduced a custom model client mechanism: you define a class, here called CustomModelClient, that adheres to the ModelClient protocol so that AutoGen can delegate inference to your own model.
1. Class definition
A typical implementation requires defining the CustomModelClient class and implementing five functions.
from types import SimpleNamespace


class CustomModelClient:
    def __init__(self, config, **kwargs):
        print(f"CustomModelClient config: {config}")

    def create(self, params):
        num_of_responses = params.get("n", 1)

        # You can create your own response data class;
        # SimpleNamespace is used here for simplicity,
        # as long as it adheres to the ModelClientResponseProtocol.
        response = SimpleNamespace()
        response.choices = []
        response.model = "model_name"  # should match the OAI_CONFIG_LIST registration

        for _ in range(num_of_responses):
            text = "this is a dummy text response"
            choice = SimpleNamespace()
            choice.message = SimpleNamespace()
            choice.message.content = text
            choice.message.function_call = None
            response.choices.append(choice)
        return response

    def message_retrieval(self, response):
        choices = response.choices
        return [choice.message.content for choice in choices]

    def cost(self, response) -> float:
        response.cost = 0
        return 0

    @staticmethod
    def get_usage(response):
        return {}
At a minimum, you need to implement all five functions: __init__(), create(), message_retrieval(), cost(), and get_usage(). In particular:
The __init__() should contain the code for loading or deploying the model, or for setting parameters if you don't have to load the model locally.
The create() should return the response from the model's generation, and the response should follow the format defined by ModelClientResponseProtocol in client.py. I will use examples to explain this in a later section.
The message_retrieval() should return a list of the generated messages.
The cost() and get_usage() are responsible for providing token usage data.
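If you want usage to show up in AutoGen's cost reporting, you can go one step further than the empty dict above. The sketch below assumes your create() implementation stores token counts on the response object; the exact keys (prompt_tokens, completion_tokens, total_tokens, cost, model) follow the OpenAI-style usage format and should be double-checked against the ModelClient protocol in client.py.

    # Inside CustomModelClient (illustrative sketch, not the required implementation)
    def cost(self, response) -> float:
        # A locally hosted model has no per-token price, so report zero cost.
        response.cost = 0
        return 0

    @staticmethod
    def get_usage(response):
        # Assumes create() stored these counts on the response object.
        return {
            "prompt_tokens": getattr(response, "prompt_tokens", 0),
            "completion_tokens": getattr(response, "completion_tokens", 0),
            "total_tokens": getattr(response, "total_tokens", 0),
            "cost": getattr(response, "cost", 0),
            "model": response.model,
        }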
2. Define the LLM config list
To feed the custom model its initial parameters for local deployment, we add extra fields to the LLM config entry that specify how to run the model.
{
    "model": "model_name",
    "model_client_cls": "CustomModelClient",
    "device": "cuda",
    "n": 1,
    "params": {
        "max_length": 1000
    }
}
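This entry is loaded like any other AutoGen config list and then filtered for the custom client class; extra fields such as device and params are passed through untouched to your client's config. A minimal sketch, assuming the entry above is stored in the OAI_CONFIG_LIST environment variable or file:

import autogen

# Load all config entries and keep only those handled by our custom client.
config_list = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={"model_client_cls": ["CustomModelClient"]},
)

llm_config = {"config_list": config_list}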
3. Register the model
Moving forward, we register the model client with the agent that should use it for generation. Make sure this registration is called before the AutoGen conversation starts.
my_agent.register_model_client(model_client_cls=CustomModelClient)
For a typical AutoGen application, once these custom-model steps are completed, we simply create the agents and tasks the way we normally do in the AutoGen framework.
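Putting the pieces together, a minimal end-to-end flow might look like the sketch below. The agent names and the task prompt are made up for illustration; only register_model_client and the standard AutoGen agent classes come from the framework.

import autogen

config_list = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={"model_client_cls": ["CustomModelClient"]},
)

# Create the agents as usual, pointing the assistant at the custom config entry.
assistant = autogen.AssistantAgent("assistant", llm_config={"config_list": config_list})
user_proxy = autogen.UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",
    code_execution_config=False,
)

# Register the custom client before any conversation starts.
assistant.register_model_client(model_client_cls=CustomModelClient)

# Kick off a simple two-agent chat.
user_proxy.initiate_chat(assistant, message="Write a haiku about local LLMs.")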
2. AutoGen + Local Gemma
With that in mind, let's see in practice how to port the local Gemma-7b model to AutoGen using HuggingFace's Transformers library. Make sure you have at least 20GB of GPU memory in your local environment to run this application, or use a Kaggle Notebook, which gives you up to 2x T4 GPUs with about 30GB of VRAM for free.
Following the CustomModelClient implementation guide above, we add the code for downloading, setting up, and running inference with the model into the relevant functions.
a. Install packages
First, install the necessary packages, most of which are needed for loading the model and running inference via Transformers.
!pip install --quiet --upgrade pyautogen~=0.2.0b4 torch git+https://github.com/huggingface/transformers sentencepiece
!pip install --quiet --upgrade accelerate bitsandbytes
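The code snippets below assume roughly the following imports; adjust them to your environment if needed.

import os
from types import SimpleNamespace

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    GenerationConfig,
)

import autogen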
b. Implement the model
Define the config list first. In this demo, we will use the Gemma-7b-it model with 4-bit quantization so that it fits into the Kaggle notebook. Later, in the code walkthrough, I will explain why a 7B model consumes so much computational resource in this architecture and give you an optimization guide.
import os
os.environ['OAI_CONFIG_LIST'] ='[{"model": "google/gemma-7b-it","model_client_cls": "CustomModelClient","device": "cuda","n": 1,"params": {"max_length": 1000}}]'
Since we are going to load the model from HuggingFace, make sure the "model" field is the exact path/name shown on the model card page on HuggingFace. model_client_cls is the name of the class that you will define later. You can set device to cuda for a GPU environment, or cpu for a CPU or TPU environment.
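If you are unsure what to put in the device field, a quick check like the following (purely illustrative) tells you what the environment offers:

import torch

# Pick the device field based on what is actually available in the environment.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}, GPUs visible: {torch.cuda.device_count()}")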
Now comes the most critical step: defining the CustomModelClient. In __init__(), we load the pre-trained model with AutoModelForCausalLM.from_pretrained(), enabling 4-bit quantization by setting load_in_4bit=True in a BitsAndBytesConfig, and load the tokenizer with AutoTokenizer.from_pretrained().
# custom client with custom model loader
class CustomModelClient:
    def __init__(self, config, **kwargs):
        print(f"CustomModelClient config: {config}")
        self.device = config.get("device", "cpu")

        quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
        self.model = AutoModelForCausalLM.from_pretrained(
            config["model"],
            quantization_config=quantization_config,
            low_cpu_mem_usage=True,
            device_map="auto",
            torch_dtype=torch.float16,
        )
        # Alternative without 4-bit quantization:
        # self.model = AutoModelForCausalLM.from_pretrained(config["model"], device_map="auto", torch_dtype=torch.bfloat16).to(self.device)
        self.model_name = config["model"]

        self.tokenizer = AutoTokenizer.from_pretrained(config["model"], use_fast=False)
        self.tokenizer.pad_token_id = self.tokenizer.eos_token_id

        # params are set by the user and consumed by the user since they are
        # providing a custom model, so anything can be done here
        gen_config_params = config.get("params", {})
        self.max_length = gen_config_params.get("max_length", 256)

        print(f"Loaded model {config['model']} to {self.device}")
Then, define the create() function. A proper generation process has two main steps:
1. Use tokenizer.apply_chat_template() to tokenize the input text that the AutoGen framework constructs from roles/messages.
2. Use model.generate() and tokenizer.decode() to iterate over the output texts and wrap them as a list into the response structure.
    def create(self, params):
        if params.get("stream", False) and "messages" in params:
            raise NotImplementedError("Local models do not support streaming.")
        else:
            num_of_responses = params.get("n", 1)
            response = SimpleNamespace()

            chat_template = replace_system_role_with_user(params["messages"])
            inputs = self.tokenizer.apply_chat_template(
                chat_template, return_tensors="pt", add_generation_prompt=True
            ).to(self.device)
            inputs_length = inputs.shape[-1]

            # add inputs_length to max_length
            max_length = self.max_length + inputs_length
            generation_config = GenerationConfig(
                max_length=max_length,
                eos_token_id=self.tokenizer.eos_token_id,
                pad_token_id=self.tokenizer.eos_token_id,
            )

            response.choices = []
            response.model = self.model_name

            for _ in range(num_of_responses):
                outputs = self.model.generate(inputs, generation_config=generation_config)
                # Decode only the newly generated text, excluding the prompt
                text = self.tokenizer.decode(outputs[0, inputs_length:])
                choice = SimpleNamespace()
                choice.message = SimpleNamespace()
                choice.message.content = text
                choice.message.function_call = None
                response.choices.append(choice)
            return response
Note that you cannot pass AutoGen's structured input messages params["messages"] directly to the tokenizer as the chat_template, because Gemma's tokenizer will raise errors when the prompt does not follow the strict "user/assistant/user/assistant..." alternation, and unfortunately AutoGen cannot guarantee that. Therefore, we define a helper function replace_system_role_with_user() to modify the original input messages: a) replace the "system" role with "user", and b) insert a dummy assistant message "ok." wherever a user message generated by AutoGen is not followed by an assistant message.
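Based on the behavior just described, a minimal sketch of such a helper could look like this (an illustrative implementation, not necessarily the exact code used in the full post):

def replace_system_role_with_user(messages):
    """Rewrite AutoGen messages so Gemma sees a strict user/assistant alternation."""
    fixed = []
    for msg in messages:
        role = msg.get("role", "user")
        # a) Gemma's chat template has no "system" role, so downgrade it to "user".
        if role == "system":
            role = "user"
        # b) If two user turns would end up back to back, pad with a dummy assistant reply.
        if fixed and fixed[-1]["role"] == "user" and role == "user":
            fixed.append({"role": "assistant", "content": "ok."})
        fixed.append({"role": role, "content": msg.get("content", "")})
    return fixed

With this in place, apply_chat_template() receives a clean alternating conversation, and Gemma generates without complaining about the role order.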