Build a Real-time Voice & Video Chat App with Function Calling by Gemini 2.0 Multimodal Live API
Gemini Development Tutorial V2
Last time, we took a deep dive into Google’s Gemini 2.0 and got really excited about how it can handle text, images, sound, and video all at once and in real time. It’s pretty amazing how natural and human-like it feels! We didn’t just stop at playing around in Google AI Studio, though: we built our own chat app based on Google’s demo to show off what this tech can do.
In that app, we focused on building a chat interface that accepts human voice and video input and asynchronously generates voice or text responses using the high-performance, low-latency Gemini 2.0 Flash experimental model. This was all achieved with Google’s experimental Multimodal Live API, which is accessed over a WebSocket connection. We implemented an intermediate server that holds the WebSocket connection to the API and forwards messages to and from a front-end client. The main code walkthrough covered the server-side implementation in Python, which is the key to driving Gemini’s Multimodal Live API, and then the client-side implementation in HTML and JavaScript for real-time voice and video interaction.
The result is impressive: quick responses with tolerance for interruptions, plus high-quality reasoning over voice and images and strong context-aware generation. However, this is only the starting point of what’s possible with Gemini 2.0. Today, we’re going to dive deeper into a specific extension to make these interactions even more powerful: Function Calling.
Function calling is a critical feature that allows AI assistants like our Gemini-powered chat app to interact with external systems and APIs. Instead of just generating text or audio, the model can understand a user’s request and generate a structured function call, including the parameters needed to invoke a function, which our app then executes. This effectively turns the Gemini-powered app into more than a multimodal chatbot: it becomes a powerful agent capable of performing real-world tasks, and, most importantly, one that feels more human-like and interactive. The good news is that the experimental Multimodal Live API of Gemini 2.0 supports function calling using the same declaration schema as the standard generative API.
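In other words, the model never runs code itself; it generates a structured call that our app executes and whose result it then uses to answer. Conceptually, the generated call looks something like the simplified illustration below (the set_light_values name and its parameters are the dummy light-control tool we build later in this post):

# Simplified illustration of a model-generated function call.
function_call = {
    "name": "set_light_values",
    "args": {"brightness": 30, "color_temp": "warm"}
}

# The app, not the model, executes the matching function and returns
# the result so the model can finish its answer:
# result = set_light_values(**function_call["args"])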
New Project
Today, we will make an interesting improvement to our previous chat app: adding function calling so the app can execute functions internally while preserving the existing voice and video interactions.
For example, imagine a scene from “Iron Man” or “The Avengers” where you give the AI assistant Jarvis a command like “Dim the lights to 30% and set them to warm mode for movie night” or “Turn on all lights to full brightness with daylight temperature, I need to focus”. With function calling, the model can identify the right function to call against the lighting API and make the change happen in real time, just like in the movies. It bridges the gap between understanding the intent of a request and carrying out the actual action in the real world.
Let’s start to build this app.
In this app, we will define a dummy function that controls the light and displays the light settings the model decides on. The program flow looks like this:
When a user’s prompt requires the model to call a function, the model sends a tool call message to the server via the WebSocket. The server then executes the required function, constructs a function response with the result, and sends it back to the Gemini API over the same WebSocket; the model then continues generating the final response. As with the text and audio responses, this process works on streaming chunks, with the data exchange happening asynchronously and in real time.
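To make this flow concrete, here is a rough sketch of the tool-call handling loop on the server side. It assumes the google-genai live session API (session.receive() yielding chunks that may carry a tool_call) and the dummy set_light_values function we declare later; exact class and field names can vary between SDK versions, so treat it as an outline rather than the final implementation.

from google.genai import types

async def handle_tool_calls(session):
    # Read streaming chunks from the Gemini Multimodal Live session.
    async for response in session.receive():
        if response.tool_call is not None:
            function_responses = []
            for call in response.tool_call.function_calls:
                if call.name == "set_light_values":
                    # Execute the function locally with the model's arguments.
                    result = set_light_values(**call.args)
                    function_responses.append(
                        types.FunctionResponse(
                            name=call.name,
                            id=call.id,
                            response={"result": result},
                        )
                    )
            # Send the results back so the model can continue its answer.
            await session.send(
                types.LiveClientToolResponse(function_responses=function_responses)
            )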
Code Walkthrough — Server
Now, let’s take a look at the code.
As usual, we will mostly focus on the server-side implementation in Python.
Install the dependencies first if you haven’t done so. Remember to install the google-genai package for this experimental Multimodal Live API.
pip install google-genai==0.3.0
Import and Config
In the code, import all the necessary modules and define the model name and the API key, which you can find in Google AI Studio.
import asyncio
import json
import os
import websockets
from google import genai
import base64
# Set your Google API key here (or load it from your environment)
os.environ['GOOGLE_API_KEY'] = ''
MODEL = "gemini-2.0-flash-exp"  # use your model ID

client = genai.Client(
    http_options={
        'api_version': 'v1alpha',
    }
)
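One more thing to prepare for function calling: the tool declaration is passed in the setup config when the Live API session is opened. Below is a sketch of what that config can look like; the declaration schema follows the standard Gemini function-calling format, and the names (set_light_values, brightness, color_temp) are simply the ones we use for our dummy light tool.

# Tool declaration for the Live API setup config (sketch; adjust to your own tool).
TOOLS = [{
    "function_declarations": [{
        "name": "set_light_values",
        "description": "Set the brightness and color temperature of a room light.",
        "parameters": {
            "type": "OBJECT",
            "properties": {
                "brightness": {
                    "type": "NUMBER",
                    "description": "Light level from 0 (off) to 100 (full brightness)."
                },
                "color_temp": {
                    "type": "STRING",
                    "description": "Color temperature: 'daylight', 'cool' or 'warm'."
                }
            },
            "required": ["brightness", "color_temp"]
        }
    }]
}]

# The config is later passed to client.aio.live.connect(model=MODEL, config=CONFIG).
CONFIG = {
    "generation_config": {"response_modalities": ["AUDIO"]},
    "tools": TOOLS
}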
Tool Declaration
Then, we need to declare the tool function set_light_values that the model will call. This function does nothing but return the light parameters (brightness and color temperature) from the model’s request, which lets us verify the call result and the quality of the model’s response.
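A minimal sketch of this dummy function is shown below; the parameter names match the tool declaration above, and the function simply prints and echoes back the requested settings so we can see exactly what the model asked for.

def set_light_values(brightness, color_temp):
    """Dummy light controller: no real hardware, just echo the settings back."""
    # Display the settings the model decided on, then return them so the
    # tool response can be sent back to the model.
    print(f"Setting light: brightness={brightness}, color_temp={color_temp}")
    return {
        "brightness": brightness,
        "color_temp": color_temp
    }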