Running AutoGen+Gemma (or Any Model), Here is the Ultimate Solution

A Quick Guide to Using Any Custom Model in AutoGen via the CustomModelClient Method

Yeyu Huang
Mar 14, 2024
Image by author

In the previous article, I demonstrated how to integrate Google's small language model, Gemma, with the AutoGen framework by deploying the Ollama server as a local inference backend on Kaggle's free-tier GPU environment. This approach let us leverage Gemma's impressive capabilities while keeping full control over our generative applications, without relying on paid APIs or exposing private data.

However, with Ollama you are limited to the models available on its platform. Although its model library is quite rich and up to date, it becomes a challenge when you want to use a fine-tuned version or a model you have trained yourself.

In this article, I will dive into AutoGen's built-in support for inference with custom models, again using Gemma-7b for a quick demonstration. This lets you port any model, API, or even hardcoded response into your LLM agents. By leveraging this built-in functionality, developers open up a world of possibilities for highly customized and optimized multi-agent applications without relying on any external service.


1. Introduction to the CustomModelClient

To meet the strong demand for running customized open-source LLMs, AutoGen introduced the ModelClient protocol: you implement your own client class (called CustomModelClient in this article) that adheres to this protocol so the framework can run external model inference.

1. Class definition

A typical implementation defines the CustomModelClient class by implementing the following five functions.

from types import SimpleNamespace  # simple container used to mimic the response object

class CustomModelClient:
    def __init__(self, config, **kwargs):
        print(f"CustomModelClient config: {config}")

    def create(self, params):
        num_of_responses = params.get("n", 1)

        # can create my own data response class
        # here using SimpleNamespace for simplicity
        # as long as it adheres to the ModelClientResponseProtocol

        response = SimpleNamespace()
        response.choices = []
        response.model = "model_name" # should match the OAI_CONFIG_LIST registration

        for _ in range(num_of_responses):
            text = "this is a dummy text response"
            choice = SimpleNamespace()
            choice.message = SimpleNamespace()
            choice.message.content = text
            choice.message.function_call = None
            response.choices.append(choice)
        return response

    def message_retrieval(self, response):
        choices = response.choices
        return [choice.message.content for choice in choices]

    def cost(self, response) -> float:
        response.cost = 0
        return 0

    @staticmethod
    def get_usage(response):
        return {}

At a minimum, you need to implement __init__(), create(), and the three remaining functions (a quick usage sketch follows this list). Specifically:

  • The __init__() should contain the code for loading or deploying the model, or for setting parameters if you don't need to load the model locally.

  • The create() should return the response from the model generation, and the response should follow the format defined by the ModelClientResponseProtocol in client.py. I will use examples to explain this in a later section.

  • The message_retrieval() should return a list of generated messages.

  • The cost() and get_usage() report cost and token usage data.
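
To illustrate how these methods fit together, here is a hypothetical exercise of the dummy client defined above; in a real application, AutoGen calls these methods for you during the conversation:

client = CustomModelClient({"model": "model_name", "n": 1})
response = client.create({"messages": [{"role": "user", "content": "Hello"}], "n": 1})

print(client.message_retrieval(response))    # ['this is a dummy text response']
print(client.cost(response))                 # 0
print(CustomModelClient.get_usage(response)) # {}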

2. Define the LLM config list

To feed the custom model with its initial parameters for local deployment, we add extra fields to the config entry that specify how to run the model.

{
    "model": "model_name",
    "model_client_cls": "CustomModelClient",
    "device": "cuda",
    "n": 1,
    "params": {
        "max_length": 1000,
    }
}
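
This entry typically lives in the OAI_CONFIG_LIST (an environment variable or a JSON file). Below is a minimal sketch of loading it with the standard pyautogen helper and filtering for our custom client class:

import autogen

# Build the config list from the OAI_CONFIG_LIST environment variable (or a file)
# and keep only the entries served by our custom client class.
config_list = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={"model_client_cls": ["CustomModelClient"]},
)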

3. Register the model

Moving forward, we register the model client with the agent that needs generative capability. Make sure this registration is called before the AutoGen conversation starts.

my_agent.register_model_client(model_client_cls=CustomModelClient)

For a typical AutoGen application, once these custom model steps are completed, we simply create the agents and tasks as we normally would in the AutoGen framework.
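
Putting the three steps together, a minimal end-to-end sketch could look like this (the agent names and the task message are illustrative, not from the original code):

import autogen

# config_list built as in the previous step
config_list = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={"model_client_cls": ["CustomModelClient"]},
)

# Create the agents as usual; the assistant is backed by the custom model config.
assistant = autogen.AssistantAgent("assistant", llm_config={"config_list": config_list})
user_proxy = autogen.UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=1,
    code_execution_config=False,
)

# Register the custom client before starting the conversation.
assistant.register_model_client(model_client_cls=CustomModelClient)

user_proxy.initiate_chat(assistant, message="Tell me a fun fact about language models.")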

2. AutoGen + Local Gemma

With that in mind, let's see in practice how to port the local Gemma-7b model to AutoGen using HuggingFace's Transformers library. Make sure you have at least 20GB of GPU memory in your local environment to run this application (a 7B model in float16 already needs roughly 14GB for the weights alone), or use a Kaggle Notebook, which gives you up to 2xT4 GPUs with about 30GB of VRAM for free.

Following the CustomModelClient implementation guide above, we add the code for model downloading, setup, and inference to the relevant functions.

a. Install packages

First, install the necessary packages, most of which are used for model loading and inference via Transformers.

!pip install --quiet --upgrade pyautogen~=0.2.0b4 torch git+https://github.com/huggingface/transformers sentencepiece
!pip install --quiet --upgrade accelerate bitsandbytes

b. Implement the model

Define the config list first. In this demo, we will use the Gemma-7b-it model with 4-bit quantization so that it fits into the Kaggle notebook. I will explain later, in the code walkthrough, why a 7B model consumes so much computational resource in this architecture, along with an optimization guide.

import os
os.environ['OAI_CONFIG_LIST'] ='[{"model": "google/gemma-7b-it","model_client_cls": "CustomModelClient","device": "cuda","n": 1,"params": {"max_length": 1000}}]'

Since we are going to load the model from HuggingFace, make sure the "model" field is the exact path/name from the model card page on HuggingFace. model_client_cls is the name of the class you will define later. You can set device to cuda for a GPU environment or cpu for a CPU or TPU environment.

Now comes the most critical step: defining the CustomModelClient. In __init__(), we load the pre-trained model with AutoModelForCausalLM.from_pretrained(), quantized to 4-bit by setting load_in_4bit=True, and load the tokenizer with AutoTokenizer.from_pretrained().

# custom client with custom model loader
import torch
from types import SimpleNamespace
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, GenerationConfig

class CustomModelClient:
    def __init__(self, config, **kwargs):
        print(f"CustomModelClient config: {config}")
        self.device = config.get("device", "cpu")
        quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

        self.model = AutoModelForCausalLM.from_pretrained(config["model"],
                                                          quantization_config=quantization_config,#.to(self.device)
                                                          low_cpu_mem_usage=True,
                                                          device_map='auto',
                                                          torch_dtype=torch.float16)
        #self.model = AutoModelForCausalLM.from_pretrained(config["model"], device_map='auto', torch_dtype=torch.bfloat16).to(self.device)
        self.model_name = config["model"]
        self.tokenizer = AutoTokenizer.from_pretrained(config["model"], use_fast=False)
        self.tokenizer.pad_token_id = self.tokenizer.eos_token_id

        # params are set by the user and consumed by the user since they are providing a custom model
        # so anything can be done here
        gen_config_params = config.get("params", {})
        self.max_length = gen_config_params.get("max_length", 256)

        print(f"Loaded model {config['model']} to {self.device}")

Then, define the create() function. A proper generation process involves two main steps:

  • Use tokenizer.apply_chat_template() to tokenize the input text that the AutoGen framework constructs from roles/messages.

  • Use model.generate() and tokenizer.decode() to iterate over the output texts and wrap them as a list into the response structure.

    def create(self, params):
        if params.get("stream", False) and "messages" in params:
            raise NotImplementedError("Local models do not support streaming.")
        else:
            num_of_responses = params.get("n", 1)

            response = SimpleNamespace()

            chat_template = replace_system_role_with_user(params["messages"])
            inputs = self.tokenizer.apply_chat_template(
                chat_template, return_tensors="pt", add_generation_prompt=True
            ).to(self.device)
            inputs_length = inputs.shape[-1]

            # add inputs_length to max_length
            max_length = self.max_length + inputs_length
            generation_config = GenerationConfig(
                max_length=max_length,
                eos_token_id=self.tokenizer.eos_token_id,
                pad_token_id=self.tokenizer.eos_token_id,
            )

            response.choices = []
            response.model = self.model_name

            for _ in range(num_of_responses):
                outputs = self.model.generate(inputs, generation_config=generation_config)
                # Decode only the newly generated text, excluding the prompt
                text = self.tokenizer.decode(outputs[0, inputs_length:])
                choice = SimpleNamespace()
                choice.message = SimpleNamespace()
                choice.message.content = text
                choice.message.function_call = None
                response.choices.append(choice)

            return response

Note that you cannot pass the AutoGen-structured input messages, params["messages"], directly to the tokenizer's chat template: the Gemma tokenizer raises an error when the prompt does not strictly follow the "user/assistant/user/assistant..." alternation, and unfortunately AutoGen cannot guarantee that. We therefore define a helper function replace_system_role_with_user() that modifies the original input messages to a) replace the "system" role with "user", and b) insert a dummy assistant message "ok." wherever a user message generated by AutoGen is not followed by an assistant message.
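
Based on the two rules above, a minimal sketch of this helper could look like the following (the exact implementation in the full code may differ):

def replace_system_role_with_user(messages):
    """Rewrite AutoGen messages so they alternate user/assistant, as Gemma's chat template requires."""
    fixed = []
    for msg in messages:
        # rule a) treat "system" messages as "user" messages
        role = "user" if msg["role"] == "system" else msg["role"]
        # rule b) if two user turns would end up adjacent, pad with a dummy assistant turn
        if fixed and fixed[-1]["role"] == "user" and role == "user":
            fixed.append({"role": "assistant", "content": "ok."})
        fixed.append({"role": role, "content": msg["content"]})
    return fixed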
