How to Use Ollama to Run Google's Gemma Locally in AutoGen
A Quick Tutorial on Creating Gemma Agents in AutoGen Locally
In the previous article, we introduced Gemma, Google’s latest open-source language model, which delivers impressive capabilities at a lower cost, albeit with inevitable limitations. We explored its power in orchestrating multiple agents through the AutoGen framework, using a remote API from OpenRouter’s inference service. Now we’re turning our attention to deploying Gemma locally on our own hardware or servers, which gives us full control without relying on paid services or exposing our private data.
There are several efficient local inference options in the open-source LLM community: llama.cpp for low-level deployment, Ollama for a nicely wrapped server, and Hugging Face’s Transformers for its rich model resources.
Developers building AutoGen applications are already familiar with its conversable patterns and tools on top of GPT models. However, when taking a project to market on this framework, 24/7 service reliability, long-term cost, and maintainability all become risks if language model inference runs remotely on a third-party service. As a result, more AutoGen project owners are looking for open-source language models they can deploy locally, while balancing model size (and therefore hardware cost) against performance.
AutoGen + Gemma
As a newly released open-source language model from Google, Gemma-7B posts benchmark scores that beat Llama-2 7B and 13B, and even slightly edge out Mistral-7B, which in practice sits at the top of the 7B class.
To further probe its performance in multi-agent systems, I tested a simple conversation pattern with coding and debugging tasks through a third-party Gemma API, and Gemma was not as good as its benchmark scores suggest. Surprisingly, in the next test, its writing and especially its orchestration (really, reasoning) capabilities went well beyond my expectations. With very simple system prompts that I hadn’t refined much, the group chat orchestrator did a good job of selecting the proper speakers. You can find the details in my last tutorial —
With that in mind, let’s move the AutoGen application entirely to a local environment with Gemma models.
In this article, we will use Ollama as the inference tool to run Gemma in Jupyter notebooks on Kaggle with a free-tier account, which gives us a dual T4 GPU environment to simulate a local setup for free.
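Before touching any AutoGen code, the notebook needs a running Ollama server with the Gemma weights pulled. Below is a minimal sketch of that setup, assuming a Linux notebook environment with internet access and Ollama’s official install script at https://ollama.com/install.sh; the exact steps may differ on your machine.

```python
# Minimal setup sketch for a notebook environment (assumptions: Linux,
# internet access, official install script at https://ollama.com/install.sh).
import subprocess
import time

# Install the Ollama binary (one-off step).
subprocess.run("curl -fsSL https://ollama.com/install.sh | sh", shell=True, check=True)

# Start the Ollama server in the background; it listens on localhost:11434 by default.
server = subprocess.Popen(["ollama", "serve"])
time.sleep(5)  # give the server a moment to come up

# Download the Gemma 7B weights (several GB on the first pull).
subprocess.run(["ollama", "pull", "gemma:7b"], check=True)
```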
AutoGen x Ollama x Gemma
Let’s see how to integrate Gemma into AutoGen by using the Ollama toolset.
Ollama is a platform that lets you run large language models on your own devices. Its optimized inference backend keeps models fast while staying frugal with memory and compute.
The main idea of using Ollama is to run a local Ollama server and then send chat-completion requests to its OpenAI-compatible API.
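As a rough illustration of that idea, the sketch below points AutoGen’s `config_list` at a local Ollama server. It assumes a recent Ollama build that exposes the OpenAI-compatible `/v1` route on the default port 11434, and that the `gemma:7b` model has already been pulled; the agent names and the test message are just placeholders.

```python
# A minimal AutoGen-to-Ollama sketch (assumptions: Ollama running locally on
# port 11434 with the OpenAI-compatible /v1 endpoint, gemma:7b already pulled).
import autogen

config_list = [
    {
        "model": "gemma:7b",                      # model tag pulled via Ollama
        "base_url": "http://localhost:11434/v1",  # local Ollama server, OpenAI-compatible route
        "api_key": "ollama",                      # any non-empty placeholder; Ollama ignores it
    }
]

assistant = autogen.AssistantAgent(
    name="gemma_assistant",
    llm_config={"config_list": config_list},
)

user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config=False,
)

user_proxy.initiate_chat(assistant, message="Write a haiku about local LLMs.")
```

Because the endpoint mimics OpenAI’s API, the rest of the AutoGen setup (agents, group chats, tools) stays the same as with GPT models; only the `config_list` changes.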
Without further ado, let’s walk through the code.