How to Use Ollama to Run Google's Gemma Locally in AutoGen

A Quick Tutorial on Creating Gemma Agents in AutoGen Locally

Yeyu Huang
Mar 05, 2024
[Image by author]

In the previous article, we introduced Gemma, Google's latest open-source language model, which delivers impressive capabilities at a lower cost, with some inevitable limitations. We explored its power in orchestrating multiple agents through the AutoGen framework using a remote API from OpenRouter's inference service. Now we're turning our attention to deploying Gemma locally on our own hardware or servers, giving us full control without relying on paid services and without exposing our private data.

There are several efficient local inference options in the open-source LLM community: for example, llama.cpp for low-level deployment, Ollama for a nicely wrapped server, and Hugging Face's Transformers for its rich model resources.


Developers building AutoGen applications are by now familiar with its conversable patterns and tools when backed by GPT models. However, when taking a project under this framework to market, 24/7 service reliability, long-term cost, and maintainability all become risks if language-model inference runs remotely with a third party. As a result, more AutoGen project owners are looking for open-source language models they can deploy locally, with the additional consideration of balancing model size (hardware cost) against performance.

AutoGen + Gemma

As a newly released open-source language model from Google, Gemma-7B has posted benchmark scores that beat Llama-2 7B and 13B and even slightly edge out Mistral-7B, which in practice has been the leader among 7B models.

To check how that translates into real multi-agent usage, I tested a simple conversation pattern with coding and debugging tasks through a third-party Gemma API, and Gemma was not as good as its benchmark scores suggest. Surprisingly, in the next test its writing and especially its orchestration (reasoning, really) capabilities went well beyond my expectations: with very simple system prompts that I hadn't refined much, the group-chat orchestrator did a good job of selecting the proper speakers. You can find the details in my last tutorial —

Is Gemma Capable of Building Multi-agent Applications in AutoGen? (Yeyu Huang, February 28, 2024)

With that in mind, let's move the AutoGen application entirely to the local environment with Gemma models.

In this article, we will use Ollama as the inference tool to run Gemma in a Jupyter notebook on Kaggle with a free-tier account, whose two T4 GPUs let us simulate a local dual-GPU environment for free.
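As a rough sketch (not part of the original walkthrough), this is how Ollama could be bootstrapped from inside a notebook environment; the install-script URL and the gemma:7b tag follow Ollama's public defaults, but treat the exact commands and timing as assumptions to adapt to your own setup.

```python
# Hypothetical notebook-side bootstrap for Ollama (adapt to your environment).
import subprocess, time

# Install Ollama via its official install script.
subprocess.run(
    "curl -fsSL https://ollama.com/install.sh | sh",
    shell=True, check=True,
)

# Start the Ollama server in the background; it listens on localhost:11434 by default.
server = subprocess.Popen(["ollama", "serve"])
time.sleep(5)  # give the server a moment to come up

# Pull the Gemma 7B weights (several GB, so the first run takes a while).
subprocess.run(["ollama", "pull", "gemma:7b"], check=True)
```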

AutoGen x Ollama x Gemma

Let’s see how to integrate Gemma into AutoGen by using the Ollama toolset.

Ollama is a platform that lets you run large language models on your own devices. Its inference engine makes models efficient in both speed and resource demand.
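As a quick illustration (a sketch I'm adding here, not code from the walkthrough below), you can talk to a running Ollama server directly through its native REST API; this assumes the server is on its default port 11434 and that gemma:7b has already been pulled.

```python
# Minimal sketch: query a local Ollama server via its native REST API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma:7b",
        "prompt": "Explain what a multi-agent framework is in one sentence.",
        "stream": False,  # return the full completion in a single JSON response
    },
)
print(resp.json()["response"])
```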

The main idea of using Ollama is to run a local Ollama server and then send OpenAI-compatible API requests to that server for chat-completion tasks.
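To make that idea concrete, here is a minimal sketch of an AutoGen configuration pointed at a local Ollama endpoint; the agent names and the prompt are illustrative, and it assumes Ollama's OpenAI-compatible endpoint is available at the default /v1 path.

```python
# Sketch: wiring AutoGen to a local Ollama server through its OpenAI-compatible API.
# Assumes `ollama serve` is running locally and gemma:7b has been pulled.
from autogen import ConversableAgent

config_list = [
    {
        "model": "gemma:7b",                      # model tag as known to Ollama
        "base_url": "http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
        "api_key": "ollama",                      # any non-empty placeholder; not checked locally
    }
]

assistant = ConversableAgent(
    name="gemma_assistant",
    llm_config={"config_list": config_list},
)

user = ConversableAgent(
    name="user",
    llm_config=False,          # this agent only relays messages, no LLM behind it
    human_input_mode="NEVER",
)

# One-shot exchange to confirm the local model answers through AutoGen.
user.initiate_chat(assistant, message="Say hello in one short sentence.", max_turns=1)
```

The same config_list can then be reused for the other agents and the group-chat manager in the full example.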

Without further ado, let’s walk through the code.
