Clone the Gemini Multimodal Realtime App Locally with Gemma 3, Whisper, Kokoro

An Updated Tutorial for Open-source Solution of Multimodal Realtime Service

Yeyu Huang
Mar 24, 2025

Remember that real-time multimodal project we built? The one where we ditched the cloud API limitations of Google’s Gemini and created our own local, real-time assistant using Whisper, Qwen 2.5, and Kokoro? Well, it’s time for a major upgrade. We’re going to make it feel much closer to the responsiveness of Gemini — or maybe even better! This is about achieving a truly natural and intuitive interaction speed, enabling interruptions, and significantly reducing GPU demands.

Let’s recap what we did before, then I’ll show you how we supercharged this project.

Recap: The Open-Source Multimodal Live Solution

In the previous tutorial, we built a system capable of processing real-time audio and images and producing spoken responses. The key was a three-stage pipeline (a rough code sketch follows the list):

  1. Whisper (small model): For transcribing incoming audio.

  2. Qwen 2.5 VL 7B Instruct model: For multimodal processing — combining the transcribed text with an image to generate a text response.

  3. Kokoro: For text-to-speech, converting the text response back into audio and sending it to the user’s front end.
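To make that recap concrete, here is a minimal sketch of how those three stages glued together. It is not the project's exact code: it assumes openai-whisper is installed, and `run_qwen_vl` / `run_kokoro_tts` are hypothetical wrappers standing in for the Qwen 2.5 VL and Kokoro calls, whose APIs vary by version.

```python
# Minimal sketch of the original three-stage pipeline (not the exact project code).
# Assumes openai-whisper; run_qwen_vl and run_kokoro_tts are hypothetical wrappers
# standing in for the Qwen 2.5 VL and Kokoro stages.
import whisper

asr_model = whisper.load_model("small")   # stage 1: Whisper small for transcription

def answer_turn(audio_path, image, run_qwen_vl, run_kokoro_tts):
    # 1) Transcribe the incoming audio segment.
    transcript = asr_model.transcribe(audio_path)["text"]
    # 2) Combine the transcript with the latest image to get a text reply.
    reply_text = run_qwen_vl(prompt=transcript, image=image)
    # 3) Convert the reply to speech and hand the audio back to the client.
    return run_kokoro_tts(reply_text)
```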

This approach gave us several huge advantages:

  • No API Limits: No rate limits, no token counting, no session timeouts. Freedom!

  • Customization: We could swap out models, tweak parameters, and add support for new languages — total control.

  • Privacy: All the data stays on our own machine, crucial for sensitive applications.

  • Cost-Effectiveness: Apart from the initial hardware investment (or cloud GPU rental), there were no ongoing usage fees.

It’s important to note that this wasn’t about replacing cloud APIs. They have their strengths. This was about providing an alternative — a powerful option for those who need more control, flexibility, and privacy.

We used a WebSocket API designed to be compatible with the Gemini Multimodal Live API, so existing projects (especially front-end code) could be easily reused.
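For orientation, a stripped-down version of that WebSocket entry point might look like the sketch below. The JSON shape (base64 `media_chunks` inside `realtime_input`) mirrors what Gemini-style front ends typically send, but treat the field names, the port, and the `on_audio` / `on_image` stubs as illustrative assumptions rather than the project's exact code.

```python
# Sketch of a Gemini-Live-style WebSocket receive loop using the `websockets` package.
# The message shape, port, and the on_audio/on_image stubs are illustrative only.
import asyncio
import base64
import json
import websockets

async def on_audio(pcm_bytes: bytes):
    ...  # placeholder: feed the audio into the segment detector

async def on_image(image_bytes: bytes):
    ...  # placeholder: cache the most recent camera/screen frame

async def handle_client(websocket):
    async for raw in websocket:
        msg = json.loads(raw)
        for chunk in msg.get("realtime_input", {}).get("media_chunks", []):
            data = base64.b64decode(chunk["data"])
            if chunk["mime_type"].startswith("audio/"):
                await on_audio(data)
            elif chunk["mime_type"].startswith("image/"):
                await on_image(data)

async def main():
    async with websockets.serve(handle_client, "0.0.0.0", 9083):
        await asyncio.Future()  # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```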

As you can see (and hear!), while it worked, it wasn’t perfect. Here were the key areas we wanted to address:

  1. Response Speed: There was a noticeable delay between speaking and hearing the AI’s response. It didn’t feel like a real-time interaction.

  2. Interruptibility: If you started speaking while the AI was talking, it wouldn’t stop. You had to wait for the entire response, which isn’t how natural conversations work.

  3. GPU Requirements: (You can’t see this in the demo, but it’s important!) The previous version needed a GPU with at least 17GB of VRAM, a barrier for some users.

The Updated Backend Service — Addressing the Challenges

Let’s look at the updated system diagram:

Block Diagram

The diagram clearly shows the flow:

  • Audio and image data travel from the client (browser) to the server via a WebSocket.

  • The AudioSegmentDetector identifies speech segments and triggers the WhisperTranscriber (a bare-bones detector sketch follows this list).

  • The transcribed text and image go to the GemmaMultimodalProcessor.

  • Crucially, we see the parallel processing of “Initial Text” and “Remaining Text” by the KokoroTTSProcessor.

  • The “Cancel Tasks” node, triggered by new speech, demonstrates interruptibility, with control signals (dashed arrows) to stop ongoing generation and TTS.
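To give a feel for what the detector does, here is a bare-bones energy-threshold segmenter. The real AudioSegmentDetector is more elaborate and its thresholds are tuned; the numbers below are made-up placeholders.

```python
# Minimal sketch of an energy-based speech segment detector. The real
# AudioSegmentDetector is more elaborate; the threshold and durations here
# are illustrative placeholders, not the project's tuned values.
import numpy as np

class SimpleSegmentDetector:
    def __init__(self, sample_rate=16000, energy_threshold=0.01, min_silence_s=0.8):
        self.energy_threshold = energy_threshold
        self.min_silence_samples = int(min_silence_s * sample_rate)
        self.buffer = []          # frames of the segment being built
        self.silence_run = 0      # consecutive quiet samples

    def feed(self, frame: np.ndarray):
        """Feed a float32 PCM frame; returns a finished segment or None."""
        if np.sqrt(np.mean(frame ** 2)) > self.energy_threshold:
            self.buffer.append(frame)
            self.silence_run = 0
        elif self.buffer:
            self.buffer.append(frame)
            self.silence_run += len(frame)
            if self.silence_run >= self.min_silence_samples:
                segment = np.concatenate(self.buffer)
                self.buffer, self.silence_run = [], 0
                return segment    # hand off to the WhisperTranscriber
        return None
```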

We addressed the three challenges head-on:

  • Slow Response Speed: We implemented streaming output and parallel TTS. The Gemma 3 model generates text token-by-token, and we immediately synthesize speech from the initial portion of the response using synthesize_initial_speech(). This happens in parallel with the ongoing text generation, so the user hears a response almost instantly. The remaining text is synthesized using synthesize_remaining_speech() for higher quality. (Both the streaming split and the cancellation logic are sketched after this list.)

  • Lack of Interruptibility: The AudioSegmentDetector now continuously monitors for new speech, even during TTS. If detected, cancel_current_tasks() is called, immediately canceling any ongoing generation and TTS tasks. The system then processes the new speech input.

  • High GPU Usage: We switched to the Gemma-3-4B model with 8-bit quantization, which dramatically reduced VRAM usage from 17GB to around 8GB (a loading sketch follows this list). We also added conversation history to improve response quality.

  • Unnatural Responses: On top of those three fixes, we added logic to filter out pure punctuation, single-word utterances, and common filler sounds so that the AI no longer responds to unintentional noises.
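For the quantization piece, loading the model looks roughly like this. It assumes a transformers release that ships Gemma 3 support plus bitsandbytes installed; the exact class name and model id may need adjusting for your environment.

```python
# Sketch: loading Gemma-3-4B-it in 8-bit to cut VRAM roughly in half.
# Assumes a recent transformers with Gemma 3 support plus bitsandbytes;
# class names and the model id may differ slightly across versions.
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration, BitsAndBytesConfig

model_id = "google/gemma-3-4b-it"
quant_config = BitsAndBytesConfig(load_in_8bit=True)

processor = AutoProcessor.from_pretrained(model_id)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quant_config,   # 8-bit weights via bitsandbytes
    device_map="auto",
    torch_dtype=torch.bfloat16,         # compute dtype for non-quantized parts
)
```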
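And here is a rough asyncio sketch of the “speak the first sentence immediately, cancel everything when the user talks over it” behavior. The generate_stream, synthesize_initial_speech, and synthesize_remaining_speech callables are placeholders that mirror the function names mentioned above, not the project's actual implementations.

```python
# Sketch of parallel TTS plus interruption via asyncio task cancellation.
# generate_stream / synthesize_initial_speech / synthesize_remaining_speech
# are placeholders mirroring the names above, not the project's exact code.
import asyncio

current_tasks: set = set()   # every in-flight generation/TTS task

async def cancel_current_tasks():
    # Called the moment the detector hears new speech: stop talking immediately.
    for task in list(current_tasks):
        task.cancel()
    current_tasks.clear()

def _track(task: asyncio.Task):
    current_tasks.add(task)
    task.add_done_callback(current_tasks.discard)

async def respond(transcript, image, generate_stream,
                  synthesize_initial_speech, synthesize_remaining_speech):
    first_sentence, rest = "", []
    async for token in generate_stream(transcript, image):   # token-by-token output
        if not first_sentence:
            rest.append(token)
            if token.rstrip().endswith((".", "!", "?")):
                first_sentence, rest = "".join(rest), []
                # Speak the first sentence right away, while generation continues.
                _track(asyncio.create_task(synthesize_initial_speech(first_sentence)))
        else:
            rest.append(token)
    if rest:
        # Higher-quality synthesis of whatever text came after the first sentence.
        _track(asyncio.create_task(synthesize_remaining_speech("".join(rest))))
```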

Let’s see the new demo in action:
