Add One Function and Your AutoGen App Can Use Open-Source LLMs Locally
A Quick Guide for Building AutoGen with Open-Source LLMs
As a tutorial writer committed to demonstrating the creative LLM application framework AutoGen, I have previously guided you through integrating AutoGen with a functional web UI, helping you explore the multifaceted capabilities of multi-agent conversational chatbots. Building on the enthusiastic feedback from tech communities, I am going to present a new chapter in this journey. The upcoming tutorial is designed to empower developers and hobbyists alike with a straightforward yet useful enhancement for AutoGen: integrating open-source large language models for local deployment. This addition is a strategic pivot that can significantly reduce the cost of using AutoGen, a frequent complaint given its token consumption, and running open-source models locally also keeps your data private.
There are various methodologies for deploying open-source models, such as Ollama and FastChat. In this tutorial, I will introduce a very simple approach that adds only one function to your AutoGen project, without touching any existing code, and lets you switch your GPT-based agents to open-source models from HuggingFace.
Without further ado, let’s dive in.
AutoGen Framework
If you’re unfamiliar with my prior tutorials or the AutoGen framework, let me give you a quick description of what AutoGen is.
AutoGen by Microsoft is an LLM development framework for creating automated conversations between multiple AI agents that complete complex tasks autonomously. It is praised for its customizable LLM-powered agents and seamless integration with human input.
AutoGen distinguishes itself with two primary benefits. Firstly, it offers users a wide range of valuable conversational templates that cater to essential and widespread communication situations, including group, collaborative, and 1-on-1 chats. Furthermore, AutoGen supplies an extensive collection of programmable templates within its code repository, which users can effortlessly adapt and integrate into their own applications.
The second advantage is that AutoGen integrates with other techniques such as the OpenAI Assistant API, RAG, and function calling to enhance the LLM-based agents’ capabilities with additional knowledge sources.
For assistant agents with LLM configurations, AutoGen’s official notebooks demonstrate only OpenAI models.
Fortunately, at a more technical level, AutoGen’s text generation relies on OpenAI’s APIs. This means that with a flexible interface designed to mimic OpenAI’s APIs, we can run text generation with open-source models locally. Building on this premise, vLLM steps into the spotlight: it offers a compatible interface that seamlessly redirects OpenAI API calls to an inference server running open-source models.
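To make the premise concrete, here is a minimal sketch of what such a redirection could look like in an AutoGen config list. This is an illustration only, not the tutorial’s final code: the model name and URL are placeholders, and depending on your pyautogen version the endpoint key may be base_url or api_base.
# Illustrative sketch: point an AutoGen config entry at a local
# OpenAI-compatible endpoint instead of OpenAI itself.
# The model name and URL below are placeholders.
local_config_list = [
    {
        "model": "mistralai/Mistral-7B-Instruct-v0.1",  # any model your local server hosts
        "base_url": "http://localhost:8000/v1",  # "api_base" in older pyautogen versions
        "api_key": "NULL",  # local servers typically ignore the key
    }
]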
vLLM
vLLM is an accessible and efficient library designed to provide fast and affordable large language model inference. It delivers high serving throughput and efficiently manages memory for attention operations with its PagedAttention feature. By continuously batching requests and utilizing CUDA/HIP graph technology, vLLM ensures rapid model execution.
Additionally, vLLM is user-friendly and versatile: it easily integrates with well-known HuggingFace models and supports high-capacity serving with several decoding methods, such as parallel sampling and beam search. vLLM’s user experience is enriched by its capability to stream outputs and provide an OpenAI-compatible API server, ensuring compatibility with various environments. It offers broad hardware support, including NVIDIA and AMD GPUs.
You will find its supported models in this list; although the mainstream model architectures are already included, the maintainers still accept user submissions to add more models.
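To give a sense of how little setup is involved, vLLM’s OpenAI-compatible server can typically be started with a single command. The model name below is just an example, and the exact entrypoint and flags may vary between vLLM versions.
# Start vLLM's OpenAI-compatible API server (illustrative example;
# the model name is a placeholder and flags may differ by vLLM version)
!python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.1 \
    --host 0.0.0.0 --port 8000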
Code Walkthrough
This demo focuses on a language model intended for local deployment, which requires a GPU for computing power. For a straightforward demonstration, I will showcase it in Google Colab.
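Before going further, it is worth confirming that a GPU is actually attached to your Colab runtime, for example with:
# quick check that a GPU is available in the runtime
!nvidia-smi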
MathChat using the GPT-4 model
Firstly, let’s create a simple AutoGen app with two agents that solve math problems using the GPT-4 model, as in normal usage. It’s quite easy to implement in the AutoGen framework.
a. Install packages
!pip install --quiet pyautogen openai
b. Define an LLM configuration
The LLM config is an essential component that equips an AutoGen agent with the relevant LLM capabilities. Here we use only one GPT-4 model.
import autogen

llm_config = {
    "timeout": 600,
    "cache_seed": 55,  # change the seed for different trials
    "config_list": autogen.config_list_from_json(
        "OAI_CONFIG_LIST",
        filter_dict={"model": ["gpt-4-1106-preview"]},
    ),
    "temperature": 0,
}
You can define OAI_CONFIG_LIST as an environment variable. At this point, we only need to add the gpt-4-1106-preview model.
import os
os.environ['OAI_CONFIG_LIST'] ='[{"model": "gpt-4-1106-preview","api_key": "sk-Your_OPENAI_Key"}]'
c. Construct Agents
Then, let’s construct two agents: an Assistant Agent, which runs GPT-4 at the backend, and a Math User Proxy Agent, which acts as the user proxy that provides the math problems and executes code.
from autogen.agentchat.contrib.math_user_proxy_agent import MathUserProxyAgent

# create an AssistantAgent instance named "assistant"
assistant = autogen.AssistantAgent(
    name="assistant",
    llm_config=llm_config,
    is_termination_msg=lambda x: "TERMINATE" in x.get("content"),
)

# create a MathUserProxyAgent instance named "mathproxyagent"
mathproxyagent = MathUserProxyAgent(
    name="mathproxyagent",
    human_input_mode="NEVER",
    is_termination_msg=lambda x: "TERMINATE" in x.get("content"),
    code_execution_config={
        "work_dir": "work_dir",
        "use_docker": False,
    },
    max_consecutive_auto_reply=5,
)
d. Answer Generation
OK, now we can start the conversation and let the agents generate an answer.
task1 = """
Find all $x$ that satisfy the inequality $(2x+10)(x+3)<(3x+9)(x+8)$.
"""
mathproxyagent.initiate_chat(assistant, problem=task1)
You can find the output in my Colab Notebook here: the Assistant Agent generates runnable Python code that solves the inequality, and the Math Proxy Agent directly executes it in its environment and prints the answer.
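For reference, the code the assistant writes is typically along these lines, shown here as a sketch with sympy rather than the verbatim output:
# A sketch of the kind of code the assistant generates (not the verbatim output):
# solve the inequality symbolically with sympy.
from sympy import symbols, solve_univariate_inequality

x = symbols("x")
solution = solve_univariate_inequality((2*x + 10)*(x + 3) < (3*x + 9)*(x + 8), x)
print(solution)  # the solution is x < -14 or x > -3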
MathChat using Open-Source LLMs
Now for the open-source models. Let’s see how easy it is to add them to your AutoGen project. Make sure you have at least 16 GB of GPU memory in your runtime environment; you can also subscribe to Colab Pro to get a V100/A100 GPU environment.