From Gemini API to Local: Building a Fully Open-Source Realtime Multimodal Assistant

Development Stack - Whisper + Qwen + Kokoro

Yeyu Huang
Mar 09, 2025

The real-time multimodal tutorial continues. This time, instead of focusing on a cloud-based API like Google’s Gemini 2.0 Multimodal Live API, I explored something similar with a completely different technical stack: a fully open-source, locally deployed multimodal real-time assistant that leverages Whisper for ASR (Automatic Speech Recognition), Qwen2.5-VL for multimodal generation, and Kokoro for TTS (Text-to-Speech). This project marks a big step away from depending on external API services, toward a more customizable, private, and possibly more affordable solution.

Let’s see the demo video first:

In my previous tutorials, I showed what you can do with the Gemini Multimodal Live API by building interactive chat applications that handle real-time audio and images, including a Next.js serverless chat app, a screen-sharing app, and mobile versions. I also pointed out some current limitations of Google Gemini, since it’s still at an experimental stage. For example, the real-time functionality only works with the gemini-2.0-flash-preview model, a voice-only session can last at most 15 minutes, and only 3 sessions can be concurrently active per API key…

This new demo project addresses those limitations directly. While it might not (yet!) match the raw performance of a top-class, integrated model like Gemini, it offers a compelling set of advantages:

  1. No Rate Limits, No Fees: As mentioned above, experimental APIs often come with strict rate limits on tokens and sessions. A local solution eliminates these constraints entirely. Except for the infrastructure cost (a GPU machine or cloud instance), there are no usage fees, which makes it ideal for experimentation, prototyping, and even deployment in scenarios where cost is a major factor.

  2. Customization and Extensibility: The open-source nature of this project opens up many possibilities. Want to add Japanese support? Swap out the image recognition for a different model? Both are easy to do because the modular design allows fine-grained control and adaptation to specific needs.

  3. Performance Tuning: Each stage of the multimodal pipeline — speech recognition, language processing, and text-to-speech — can be independently optimized. You can choose models that prioritize speed, accuracy, or a balance between the two. This level of control is simply not available with a fixed API.

  4. Data Privacy: For applications dealing with sensitive information, keeping data local is essential. This system ensures that audio, video, and text data never leave your machine/server, offering a significant advantage in terms of privacy and security.

Please note that this isn’t about replacing cloud APIs entirely; they have their place. This is about exploring an alternative for those who need greater control, flexibility, and privacy in real-time multimodal LLM applications.

System Architecture: A Three-Stage Pipeline

The system operates as a three-stage pipeline, each stage handled by a dedicated open-source model:

  1. Speech-to-Text (Whisper): Incoming audio is processed by the Whisper model (specifically, openai/whisper-large-v3-turbo in this implementation) to transcribe spoken words into text.

  2. Multimodal Processing and Generation (Qwen2.5-VL): The transcribed text, along with a captured image, is fed into the Qwen2.5-VL model (Qwen/Qwen2.5-VL-7B-Instruct). This model, quantized to 4-bit for efficiency, generates a text-based response.

  3. Text-to-Speech (Kokoro): Finally, the generated text is converted back into speech using the Kokoro TTS engine.
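To make the stack concrete, here is a rough sketch of how the three models could be loaded. It assumes recent versions of the transformers, bitsandbytes, and kokoro packages; the exact arguments (dtype, device placement, Kokoro's lang_code) are illustrative choices, not taken from the project code:

import torch
from transformers import (AutoProcessor, BitsAndBytesConfig,
                          Qwen2_5_VLForConditionalGeneration, pipeline)
from kokoro import KPipeline

# 1. Speech-to-text: Whisper large-v3-turbo
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16,
    device="cuda",
)

# 2. Multimodal generation: Qwen2.5-VL-7B-Instruct, quantized to 4-bit
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
qwen_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    quantization_config=quant_config,
    device_map="auto",
)
qwen_processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# 3. Text-to-speech: Kokoro (the "a" code selects American English voices)
tts = KPipeline(lang_code="a")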

The entire process is orchestrated via a WebSocket connection, allowing for real-time interaction. Here’s a block diagram illustrating the flow:

[Block diagram of the three-stage pipeline]

Key Components and Workflow:

  1. Client (Browser): Captures audio and video streams using standard web APIs. Sends audio data (PCM) and images (JPEG) encoded as base64 via WebSocket.

  2. WebSocket Server (Python): Handles the WebSocket connection, manages the audio segmentation, and coordinates the three processing stages.

  3. Audio Segment Detector: A key component that analyzes the incoming audio stream for speech activity. It uses energy thresholds and silence detection to identify meaningful segments of speech, preventing the system from processing continuous noise (a simplified sketch follows this list).

  4. Whisper Transcriber: Converts the detected speech segments into text.

  5. Qwen Multimodal Processor: Processes the transcribed text and the most recent image to generate a relevant response.

  6. Kokoro TTS Processor: Synthesizes speech from the generated text.

  7. Audio Playback (Client): Receives the synthesized audio (base64 encoded) via WebSocket and plays it back.
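As mentioned in item 3, below is a simplified, hypothetical version of an energy-based segment detector. The class name, thresholds, and method signature are illustrative placeholders rather than the project's actual implementation:

import numpy as np

class AudioSegmentDetector:
    """Hypothetical energy-based speech segmenter for 16 kHz, 16-bit mono PCM."""

    def __init__(self, sample_rate=16000, energy_threshold=0.015,
                 silence_duration=0.8, min_speech_duration=0.5):
        self.energy_threshold = energy_threshold
        self.max_silence = int(silence_duration * sample_rate)    # silence samples that end a segment
        self.min_speech = int(min_speech_duration * sample_rate)  # discard segments shorter than this
        self.speech_buffer = np.array([], dtype=np.float32)
        self.in_speech = False
        self.silence_count = 0

    def add_audio(self, pcm_bytes: bytes):
        """Feed one PCM chunk; return a finished speech segment (float32 array) or None."""
        chunk = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32) / 32768.0
        energy = float(np.sqrt(np.mean(chunk ** 2))) if chunk.size else 0.0

        if not self.in_speech:
            if energy > self.energy_threshold:      # speech just started
                self.in_speech = True
                self.speech_buffer = chunk
            return None

        self.speech_buffer = np.concatenate([self.speech_buffer, chunk])
        if energy > self.energy_threshold:
            self.silence_count = 0                  # still speaking
            return None

        self.silence_count += chunk.size
        if self.silence_count < self.max_silence:
            return None                             # short pause, keep listening

        segment = self.speech_buffer                # enough trailing silence: close the segment
        self.speech_buffer = np.array([], dtype=np.float32)
        self.in_speech, self.silence_count = False, 0
        return segment if segment.size >= self.min_speech else None

A returned segment can then be handed straight to Whisper for transcription.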

WebSocket API

This WebSocket API is designed to be fully compatible with the Gemini Multimodal Live API, so that our existing client apps, such as the camera chat and screen-sharing apps, can be reused.

Let me quickly recap the definition.

The server communicates with clients via a WebSocket connection established at ws://0.0.0.0:9073 (or the configured address). The API uses JSON messages for both requests and responses.
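To show how the pieces hang together, here is a bare-bones version of such a server loop. It assumes the websockets package (a recent version, where the handler receives a single connection argument); the handler body is only a placeholder for the pipeline described above:

import asyncio
import base64
import json
import websockets

async def handle_client(websocket):
    latest_image = None
    async for raw in websocket:
        message = json.loads(raw)
        for chunk in message.get("realtime_input", {}).get("media_chunks", []):
            if chunk["mime_type"] == "audio/pcm":
                pcm = base64.b64decode(chunk["data"])
                # Feed pcm into the audio segment detector. When a segment completes,
                # run Whisper -> Qwen2.5-VL (with latest_image) -> Kokoro and reply with
                # await websocket.send(json.dumps({"audio": base64_encoded_tts_audio}))
            elif chunk["mime_type"] == "image/jpeg":
                latest_image = base64.b64decode(chunk["data"])

async def main():
    async with websockets.serve(handle_client, "0.0.0.0", 9073):
        await asyncio.Future()  # run forever

asyncio.run(main())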

1. Initial Connection and Configuration:

Upon connecting, the client should send an initial message. While the server currently doesn’t process this initial message’s content, sending an empty JSON object ({}) is recommended for future compatibility.
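For example, with the websockets client library (an assumption; any WebSocket client works), the handshake could look like this:

import asyncio
import json
import websockets

async def connect(uri="ws://localhost:9073"):
    ws = await websockets.connect(uri)
    await ws.send(json.dumps({}))  # initial configuration message, currently ignored by the server
    return ws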

2. Sending Audio and Images (Client to Server):

The client sends audio and image data within a JSON object. The key structure is as follows:

{
  "realtime_input": {
    "media_chunks": [
      {
        "mime_type": "audio/pcm",
        "data": "base64_encoded_audio_data"
      },
      {
        "mime_type": "image/jpeg",
        "data": "base64_encoded_image_data"
      }
    ]
  }
}

The data field of the audio chunk contains base64-encoded PCM audio (16-bit, typically at a 16 kHz sample rate).

For backward compatibility, the server also accepts images sent directly in an “image” field:

{
    "image": "base64_encoded_image_data"
}
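As an illustration, a Python client could compose and send the realtime_input message with a small helper like the one below (the function name and arguments are mine, not part of the project):

import base64
import json

async def send_media(ws, pcm_bytes: bytes, jpeg_bytes: bytes = None):
    """Send one audio chunk, plus optionally the latest camera frame, in the format above."""
    chunks = [{
        "mime_type": "audio/pcm",
        "data": base64.b64encode(pcm_bytes).decode("ascii"),
    }]
    if jpeg_bytes is not None:
        chunks.append({
            "mime_type": "image/jpeg",
            "data": base64.b64encode(jpeg_bytes).decode("ascii"),
        })
    await ws.send(json.dumps({"realtime_input": {"media_chunks": chunks}}))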

3. Receiving Audio (Server to Client):

The server sends synthesized audio back to the client in the following JSON format:

{
  "audio": "base64_encoded_audio_data"
}
  • audio: Contains the base64-encoded audio data (PCM, 16-bit, typically at a 24kHz sample rate, but this depends on the Kokoro configuration). The client is responsible for decoding this data and playing it back.
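For testing, a Python client can decode the response and dump it to a WAV file roughly like this (a sketch assuming 24 kHz mono 16-bit PCM; the helper name and file path are mine):

import base64
import json
import wave

def save_response_audio(raw_message: str, path: str = "reply.wav", sample_rate: int = 24000):
    """Decode a server response and write the PCM payload as a playable WAV file."""
    message = json.loads(raw_message)
    pcm_bytes = base64.b64decode(message["audio"])
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)             # mono
        wav.setsampwidth(2)             # 16-bit samples
        wav.setframerate(sample_rate)   # matches Kokoro's typical 24 kHz output
        wav.writeframes(pcm_bytes)

In the browser client, the same bytes are instead fed into the Web Audio API for immediate playback.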
