SpeechKit

kombify SpeechKit is the speech-to-text (STT) and text-to-speech (TTS) framework that powers voice interaction in kombify AI. It runs as a standalone service with a Go backend and React frontend, connecting to your Companions and the Voice Agent pipeline.

SpeechKit lives in its own repository: kombify-SpeechKit. It is a separate component from the main kombify-AI service.

Two modes of operation

SpeechKit provides two distinct voice interaction modes:

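Assist Mode

Standard STT/TTS pipeline for kombify AI Companions.

Flow: Microphone input → STT transcription → Companion processes request → TTS speech output

Push-to-talk or voice activity detection (VAD)
Audio visualization with waveforms
Works with any configured Companion
Provider hot-switching (change STT provider without restart)

Voice Agent Mode

Real-time bidirectional voice conversation through the Gemini Live API, connected over WebRTC. You can interrupt and redirect the conversation naturally. Requires a Google Gemini API key.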

Supported STT providers

SpeechKit supports six speech-to-text providers. You can switch between them at runtime without restarting the service.

Provider	Type	Latency	Notes
Azure Speech	Cloud	Low	Microsoft Cognitive Services
Google Cloud STT	Cloud	Low	Google Cloud Speech-to-Text
Deepgram	Cloud	Very low	Optimized for real-time streaming
OpenAI Whisper API	Cloud	Medium	OpenAI hosted Whisper
Groq Whisper	Cloud	Low	Groq-accelerated Whisper inference
Faster Whisper	Local	Medium	Runs entirely on your machine, no cloud needed

For a fully local setup with no cloud dependencies, use Faster Whisper for STT and Qwen3-TTS for TTS.
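As a rough illustration, the provider table above can be modeled as data and filtered for a local-only setup. This is a sketch in Python; the `SttProvider` class and the `local_only` helper are hypothetical names for illustration, not part of SpeechKit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SttProvider:
    """One row of the provider table (hypothetical model, not SpeechKit's API)."""
    name: str
    kind: str      # "Cloud" or "Local"
    latency: str   # qualitative label taken from the table


# Values copied from the table above.
PROVIDERS = [
    SttProvider("Azure Speech", "Cloud", "Low"),
    SttProvider("Google Cloud STT", "Cloud", "Low"),
    SttProvider("Deepgram", "Cloud", "Very low"),
    SttProvider("OpenAI Whisper API", "Cloud", "Medium"),
    SttProvider("Groq Whisper", "Cloud", "Low"),
    SttProvider("Faster Whisper", "Local", "Medium"),
]


def local_only(providers):
    """Return the providers that run without any cloud dependency."""
    return [p for p in providers if p.kind == "Local"]


if __name__ == "__main__":
    for provider in local_only(PROVIDERS):
        print(provider.name)
```

Filtering on the Type column leaves Faster Whisper as the only STT choice for a setup with no cloud dependencies.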

Text-to-speech

SpeechKit uses Qwen3-TTS via the Kokoro pipeline for text-to-speech. This runs locally on your machine — no cloud API required.

Architecture

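SpeechKit runs as a self-contained service with two components, a React frontend and a Go backend. The backend connects to the STT providers, the TTS pipeline, and the Gemini Live API: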

┌──────────────────────────────────┐
│         React Frontend           │
│       (Vite 6 / React 19)        │
│  Audio capture, visualization,   │
│  provider selection UI           │
└──────────────┬───────────────────┘
               │ WebSocket / HTTP
┌──────────────▼───────────────────┐
│          Go Backend              │
│        (HTTP on :8787)           │
│  STT routing, TTS pipeline,      │
│  WebRTC (Voice Agent Mode)       │
└──────────────┬───────────────────┘
               │
    ┌──────────▼──────────┐
    │   STT Providers     │
    │   TTS (Qwen3/Kokoro)│
    │   Gemini Live API   │
    └─────────────────────┘

Tech stack:

Go 1.25 (backend HTTP server, WebSocket handling via gorilla/websocket)
React 19 + Vite 6 (frontend UI)
WebRTC for Gemini Live API connectivity

Configuration

API keys (BYOK)

SpeechKit uses the BYOK (Bring Your Own Keys) model. You provide API keys for whichever cloud STT providers you want to use.

1. Open the SpeechKit UI

Navigate to the SpeechKit frontend in your browser.

2. Select an STT provider

Choose your preferred provider from the dropdown. You can switch providers at any time without restarting.

3. Enter your API key

Provide the API key for the selected cloud provider. For Faster Whisper (local), no key is needed.
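The key requirement in the last step amounts to a small rule: cloud providers need a key, and the local provider does not. A minimal sketch of that rule follows, assuming a hypothetical `check_key` helper and hard-coded provider names; how SpeechKit actually validates keys is not documented here.

```python
from typing import Optional

# Cloud providers from the supported-providers table; each needs a user-supplied key.
CLOUD_PROVIDERS = {
    "Azure Speech",
    "Google Cloud STT",
    "Deepgram",
    "OpenAI Whisper API",
    "Groq Whisper",
}
# The only local provider; it runs without a key.
LOCAL_PROVIDERS = {"Faster Whisper"}


def check_key(provider: str, api_key: Optional[str]) -> bool:
    """Return True if the selection is usable under the BYOK model (hypothetical helper)."""
    if provider in LOCAL_PROVIDERS:
        return True
    if provider in CLOUD_PROVIDERS:
        # Reject missing, empty, or whitespace-only keys.
        return bool(api_key and api_key.strip())
    raise ValueError(f"unknown STT provider: {provider!r}")
```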

Provider selection

You can hot-switch between STT providers during a session. SpeechKit routes audio to the currently selected provider without requiring a restart.
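One way to picture hot-switching is a router that holds the current selection and looks it up for every audio chunk. The sketch below is an assumption about the general pattern, not SpeechKit's Go implementation; `SttRouter`, `Transcriber`, and the stub providers are hypothetical.

```python
import threading
from typing import Callable, Dict

# A transcriber takes raw audio bytes and returns text (hypothetical signature).
Transcriber = Callable[[bytes], str]


class SttRouter:
    """Routes audio to the currently selected provider; the selection can change at runtime."""

    def __init__(self, providers: Dict[str, Transcriber], initial: str) -> None:
        if initial not in providers:
            raise KeyError(initial)
        self._providers = dict(providers)
        self._current = initial
        self._lock = threading.Lock()

    def select(self, name: str) -> None:
        """Switch providers without recreating the router (the hot switch)."""
        if name not in self._providers:
            raise KeyError(name)
        with self._lock:
            self._current = name

    def transcribe(self, audio: bytes) -> str:
        # Read the selection under the lock, then call the provider outside it
        # so a slow transcription does not block a concurrent switch.
        with self._lock:
            transcriber = self._providers[self._current]
        return transcriber(audio)


if __name__ == "__main__":
    # Stub transcribers stand in for real provider clients.
    router = SttRouter(
        {
            "Deepgram": lambda audio: f"deepgram:{len(audio)}",
            "Faster Whisper": lambda audio: f"local:{len(audio)}",
        },
        initial="Deepgram",
    )
    print(router.transcribe(b"\x00\x01"))
    router.select("Faster Whisper")
    print(router.transcribe(b"\x00\x01"))
```

Because the selection is read once per call, a chunk already being transcribed finishes on the old provider, and the next chunk goes to the new one.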

Quick start

Assist Mode

1. Start SpeechKit

Launch the SpeechKit service. The Go backend starts on port 8787 and serves the React frontend.

2. Configure a provider

Select an STT provider and enter your API key (or choose Faster Whisper for local processing).

3. Start speaking

Use push-to-talk or enable VAD, then speak your request. The transcription is sent to your Companion, and the response is spoken back via Qwen3-TTS.
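A single Assist Mode turn (audio in, transcription, Companion reply, speech out) can be sketched as three composed stages. The function and parameter names below are hypothetical, and the stages are stubs; real stages would call the selected STT provider, a Companion, and Qwen3-TTS.

```python
from typing import Callable


def assist_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    ask_companion: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one Assist Mode turn: captured audio in, spoken reply out."""
    text = transcribe(audio)       # STT transcription
    reply = ask_companion(text)    # Companion processes the request
    return synthesize(reply)       # TTS speech output


if __name__ == "__main__":
    # Stub stages for illustration only.
    spoken = assist_turn(
        b"raw-microphone-bytes",
        transcribe=lambda audio: "what time is it",
        ask_companion=lambda text: f"You asked: {text}",
        synthesize=lambda reply: reply.encode("utf-8"),
    )
    print(spoken)
```

Passing the stages in as callables keeps the turn logic independent of which STT provider is currently selected.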

Voice Agent Mode

1. Configure Gemini API key

Voice Agent Mode requires a Google Gemini API key for the Gemini Live API.

2. Switch to Voice Agent Mode

Select Voice Agent Mode in the SpeechKit UI.

3. Start a conversation

Begin speaking. The Gemini Live API provides real-time bidirectional audio, so you can interrupt and redirect the conversation naturally.

Platform support

Platform	Status
Windows	Production-ready
Linux	Planned

SpeechKit is currently Windows-first. Linux support is on the roadmap.

Further reading

Voice interaction: User-facing voice features in kombify AI
BYOK setup: Configure your own API keys for AI providers
