How to Build a Real-time Screen Sharing Assistant with Gemini 2.0 Multimodal Live API
Gemini Development Tutorial V3
In the previous tutorials of the Gemini 2.0 series, we built the core functionality of a self-hosted real-time voice and video chatbot, and then added function calling so the model can invoke external tools and APIs. Both are practical applications that showcase the fast response, human-like interaction, and enhanced reasoning enabled by the Gemini 2.0 Multimodal Live API.
In this tutorial, we’ll focus on another practical application of the model, one you may have already tried in Google AI Studio and been impressed by its performance and user experience. That’s right, we’ll be building a real-time screen sharing assistant that can work with you via voice interaction, and we’ll dive deeper into both the frontend and backend architecture design and code implementation.
The Google AI Studio offers a great starting point for experimenting with Gemini 2.0’s multimodal capabilities. In the ‘Stream Realtime’ feature, a “Share your screen” block allows for simultaneous text, audio, and screen interaction. However, for true customizability, we must build our own application using the underlying API.
Let’s now get started!
Architecture
First, let’s see the overall architecture of the application.
Our architecture, as before, involves two WebSocket connections: one between the client and the server, and another between the server and the Gemini API. The server acts as an intermediary, forwarding messages in both directions and managing the real-time streaming. More specifically, the server code is almost identical to that of the basic multimodal chatbot we developed in a previous tutorial, so if you have already read it, you can skip this quick recap and jump to the client development section.
Code Walkthrough — Server
The server, implemented in Python, is responsible for two main tasks: handling the client WebSocket connections and managing the Gemini API connection.
You need to install and import the websockets and google-genai libraries, set the API key for the model gemini-2.0-flash-exp, and create a Gemini client using the API version v1alpha.
# pip install --upgrade google-genai==0.3.0
import asyncio
import json
import os
import websockets
from google import genai
import base64
# Set your Gemini API key here (left empty in this snippet)
os.environ['GOOGLE_API_KEY'] = ''
MODEL = "gemini-2.0-flash-exp" # use your model ID
client = genai.Client(
http_options={
'api_version': 'v1alpha',
}
)
At the bottom of the code, we call websockets.serve to start a server on a specified port. Each WebSocket connection from a client triggers the handler gemini_session_handler.
async def main() -> None:
    async with websockets.serve(gemini_session_handler, "localhost", 9083):
        print("Running websocket server localhost:9083...")
        await asyncio.Future()  # Keep the server running indefinitely

if __name__ == "__main__":
    asyncio.run(main())
Inside the gemini_session_handler, we use the client.aio.live.connect() function to establish a connection with the Gemini API, passing the config data: the response_modalities coming from the client’s first message, plus the system_instruction that we set to instruct the model to act as a screen sharing assistant.
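A minimal sketch of that setup is shown below. It assumes the client’s first message is a JSON payload with a "setup" key carrying the generation config (a convention from our earlier chatbot tutorial, not something the API requires), and the system prompt wording here is just an illustrative placeholder; the exact config shape accepted by client.aio.live.connect() may also differ between google-genai versions.

async def gemini_session_handler(client_websocket):
    """Handle one browser connection: open a Gemini Live session, then relay messages."""
    try:
        # Assumed client convention: the first message carries the session config,
        # e.g. {"setup": {"generation_config": {"response_modalities": ["AUDIO"]}}}
        config_message = await client_websocket.recv()
        config_data = json.loads(config_message)
        config = config_data.get("setup", {})

        # Ask the model to behave as a screen sharing assistant
        # (the exact prompt text is up to you).
        config["system_instruction"] = (
            "You are a helpful assistant for screen sharing sessions. "
            "Answer the user's questions based on what you see on their screen."
        )

        async with client.aio.live.connect(model=MODEL, config=config) as session:
            print("Connected to Gemini API")
            # ... start the send/receive forwarding tasks here (described next) ...
    except Exception as e:
        print(f"Error in Gemini session: {e}")
    finally:
        print("Gemini session closed.")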
After that, the handler will focus on the message forwarding actions:
- The send_to_gemini function captures messages from the client, extracts the audio and image data, and sends them to the Gemini API.
- The receive_from_gemini function listens for responses from the Gemini API and unpacks the text or audio data to be sent back to the client.
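A condensed sketch of these two coroutines follows. The "realtime_input" / "media_chunks" JSON shape is our own client-side convention (an assumption here, defined later in the client code), and the session.send() / session.receive() calls follow the google-genai 0.3.0 live API; treat this as an outline rather than the full implementation.

async def send_to_gemini(client_websocket, session):
    """Forward audio chunks and screen frames from the browser to Gemini."""
    async for message in client_websocket:
        data = json.loads(message)
        for chunk in data.get("realtime_input", {}).get("media_chunks", []):
            if chunk["mime_type"] in ("audio/pcm", "image/jpeg"):
                # Chunks arrive base64-encoded from the browser and are
                # passed through to the Live API as-is.
                await session.send({"mime_type": chunk["mime_type"],
                                    "data": chunk["data"]})

async def receive_from_gemini(client_websocket, session):
    """Forward Gemini's streamed text and audio back to the browser."""
    while True:
        async for response in session.receive():
            if response.server_content is None:
                continue
            model_turn = response.server_content.model_turn
            if model_turn:
                for part in model_turn.parts:
                    if getattr(part, "text", None):
                        await client_websocket.send(json.dumps({"text": part.text}))
                    elif getattr(part, "inline_data", None):
                        audio_b64 = base64.b64encode(part.inline_data.data).decode("utf-8")
                        await client_websocket.send(json.dumps({"audio": audio_b64}))
            if response.server_content.turn_complete:
                print("<Turn complete>")

# Inside gemini_session_handler, both directions run concurrently, e.g.:
#     await asyncio.gather(send_to_gemini(client_websocket, session),
#                          receive_from_gemini(client_websocket, session))

Running the two coroutines concurrently means the upstream capture of audio and screen frames never blocks the downstream playback of the model’s responses, which is what keeps the interaction feeling real-time.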