How to Build a Real-time Screen Sharing Assistant with Gemini 2.0 Multimodal Live API
Gemini Development Tutorial V3
In the previous tutorials of the Gemini 2.0 series, we built the core functionality of a self-hosted real-time voice and video chatbot, and then added function calling so the model can invoke external tools and APIs. Both are practical applications that showcase the fast response, human-like interaction, and enhanced reasoning enabled by the Gemini 2.0 Multimodal Live API.
In this tutorial, we’ll focus on another practical application of the model, one you may have already tried in Google AI Studio and been impressed by its performance and user experience. That’s right, we’ll be building a real-time screen sharing assistant that can work with you via voice interaction, and we’ll dive deeper into both the frontend and backend architecture design and code implementation.
The Google AI Studio offers a great starting point for experimenting with Gemini 2.0’s multimodal capabilities. In the ‘Stream Realtime’ feature, a “Share your screen” block allows for simultaneous text, audio, and screen interaction. However, for true customizability, we must build our own application using the underlying API.
Let’s now get started!
Architecture
First, let’s see the overall architecture of the application.
Our architecture, as before, involves two WebSocket connections: one between the client and the server, and another between the server and the Gemini API. The server acts as an intermediary, forwarding messages in both directions and managing the real-time streaming. More specifically, the server code is almost identical to that of the basic multimodal chatbot we developed in a previous tutorial, so if you have already read it, you can skip this quick recap and jump to the client development section.
Code Walkthrough — Server
The server, implemented in Python, is responsible for two main tasks: handling the client WebSocket connections and managing the Gemini API connection.
You need to install and import the websockets and google-genai libraries, set the API key for the model gemini-2.0-flash-exp, and create a Gemini client using the API version v1alpha.
# pip install --upgrade google-genai==0.3.0
import asyncio
import json
import os
import websockets
from google import genai
import base64
# Set your Gemini API key here (left empty in this snippet)
os.environ['GOOGLE_API_KEY'] = ''
MODEL = "gemini-2.0-flash-exp" # use your model ID
client = genai.Client(
http_options={
'api_version': 'v1alpha',
}
)
At the bottom of the code, we call websockets.serve to start a server on a specified port. Each WebSocket connection from a client triggers the handler gemini_session_handler.
async def main() -> None:
    async with websockets.serve(gemini_session_handler, "localhost", 9083):
        print("Running websocket server localhost:9083...")
        await asyncio.Future()  # Keep the server running indefinitely

if __name__ == "__main__":
    asyncio.run(main())
Inside the gemini_session_handler, we use the client.aio.live.connect() function to establish a connection with the Gemini API, passing the config data: the response_modalities coming from the client’s first message, plus the system_instruction that we set to instruct the model to act as a screen sharing assistant.
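A minimal sketch of that setup is shown below. It assumes the client’s first message is a JSON payload with a "setup" key carrying the generation config (a convention from our earlier chatbot tutorial, not something the API requires), and the system prompt wording here is just an illustrative placeholder; the exact config shape accepted by client.aio.live.connect() may also differ between google-genai versions.

async def gemini_session_handler(client_websocket):
    """Handle one browser connection: open a Gemini Live session, then relay messages."""
    try:
        # Assumed client convention: the first message carries the session config,
        # e.g. {"setup": {"generation_config": {"response_modalities": ["AUDIO"]}}}
        config_message = await client_websocket.recv()
        config_data = json.loads(config_message)
        config = config_data.get("setup", {})

        # Ask the model to behave as a screen sharing assistant
        # (the exact prompt text is up to you).
        config["system_instruction"] = (
            "You are a helpful assistant for screen sharing sessions. "
            "Answer the user's questions based on what you see on their screen."
        )

        async with client.aio.live.connect(model=MODEL, config=config) as session:
            print("Connected to Gemini API")
            # ... start the send/receive forwarding tasks here (described next) ...
    except Exception as e:
        print(f"Error in Gemini session: {e}")
    finally:
        print("Gemini session closed.")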
After that, the handler will focus on the message forwarding actions:
- The send_to_gemini function captures messages from the client, extracts the audio and image data, and sends them to the Gemini API.
- The receive_from_gemini function listens for responses from the Gemini API and unpacks the text or audio data to be sent back to the client.
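A condensed sketch of these two coroutines follows. The "realtime_input" / "media_chunks" JSON shape is our own client-side convention (an assumption here, defined later in the client code), and the session.send() / session.receive() calls follow the google-genai 0.3.0 live API; treat this as an outline rather than the full implementation.

async def send_to_gemini(client_websocket, session):
    """Forward audio chunks and screen frames from the browser to Gemini."""
    async for message in client_websocket:
        data = json.loads(message)
        for chunk in data.get("realtime_input", {}).get("media_chunks", []):
            if chunk["mime_type"] in ("audio/pcm", "image/jpeg"):
                # Chunks arrive base64-encoded from the browser and are
                # passed through to the Live API as-is.
                await session.send({"mime_type": chunk["mime_type"],
                                    "data": chunk["data"]})

async def receive_from_gemini(client_websocket, session):
    """Forward Gemini's streamed text and audio back to the browser."""
    while True:
        async for response in session.receive():
            if response.server_content is None:
                continue
            model_turn = response.server_content.model_turn
            if model_turn:
                for part in model_turn.parts:
                    if getattr(part, "text", None):
                        await client_websocket.send(json.dumps({"text": part.text}))
                    elif getattr(part, "inline_data", None):
                        audio_b64 = base64.b64encode(part.inline_data.data).decode("utf-8")
                        await client_websocket.send(json.dumps({"audio": audio_b64}))
            if response.server_content.turn_complete:
                print("<Turn complete>")

# Inside gemini_session_handler, both directions run concurrently, e.g.:
#     await asyncio.gather(send_to_gemini(client_websocket, session),
#                          receive_from_gemini(client_websocket, session))

Running the two coroutines concurrently means the upstream capture of audio and screen frames never blocks the downstream playback of the model’s responses, which is what keeps the interaction feeling real-time.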