Zero-shot voice cloning with Kokoro fallback — because conversations hit different when your AI sounds like someone you know ❤️
A practical guide to bridging OmniVoice and Kokoro for zero-shot voice cloning with local fallback, Telegram-ready OGG output, and zero downtime.
The Problem
You’re chatting with your AI assistant — it’s fast, smart, helpful. But every reply comes back in the same robotic voice as every other AI on the planet. There’s no warmth. No personality. It doesn’t sound like your assistant.
With Hermes Agent, you can change that. This bridge gives your AI a voice clone — a warm, familiar voice that makes every conversation feel personal. Whether it’s your partner’s voice, your own, or a custom design, the AI speaks like someone you know. It transforms dry status updates and technical replies into something that actually feels like a conversation.
The technical recipe:
- Primary: Voice cloning via OmniVoice — 600+ languages, zero-shot cloning from 3–10 seconds of audio, GPU-accelerated
- Fallback: Kokoro — fast, lightweight local TTS on CPU (no GPU required), so your AI never goes silent
- Output: OGG Opus — the format Telegram requires for native voice bubbles
The bridge script ties it all together: it tries OmniVoice first, drops to Kokoro if the server is down, and pipes everything through FFmpeg for the correct output format.
Architecture
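In outline, the flow is "try the primary, fall back on any failure". Here is a minimal sketch of that pattern, using stand-in functions rather than the real OmniVoice and Kokoro calls. The names `synthesize_with_fallback`, `omnivoice_down`, and `kokoro_stub` are hypothetical and exist only for illustration.

```python
from typing import Callable, List, Sequence


def synthesize_with_fallback(text: str,
                             backends: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each backend in order and return the first non-empty result."""
    errors: List[str] = []
    for backend in backends:
        try:
            audio = backend(text)
            if audio:  # empty output counts as a failure too
                return audio
            errors.append(f"{backend.__name__}: empty audio")
        except Exception as exc:
            errors.append(f"{backend.__name__}: {exc}")
    raise RuntimeError("all TTS backends failed: " + "; ".join(errors))


# Hypothetical stand-ins for the real OmniVoice and Kokoro calls.
def omnivoice_down(text: str) -> bytes:
    raise ConnectionError("server unreachable")


def kokoro_stub(text: str) -> bytes:
    return b"RIFF" + text.encode("utf-8")
```

With `[omnivoice_down, kokoro_stub]` as the backend list, the call returns the stub's bytes. The bridge script follows the same order, then hands the audio to FFmpeg for conversion.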
The Script
Download omnivoice-tts-bridge.py: sanitized, ready to configure.
Features
- ✅ Configurable via environment variables — no hardcoded IPs or paths
- ✅ Falls back gracefully if OmniVoice is unreachable
- ✅ Accepts {input_path}/{output_path} args, so it is compatible with Hermes Agent command providers
- ✅ Auto-detects output format from file extension (.ogg/.mp3/.wav)
- ✅ Uploads reference audio on first run if not already on the server
- ✅ 30-poll retry loop with 3-second intervals (90 seconds total timeout)
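As an illustration of the extension-based format detection, here is a small sketch. It is not the script's exact logic: the script splits on the last dot and treats anything that is not .ogg or .mp3 as WAV, while this version also lowercases the extension. `detect_format` and `SUPPORTED_FORMATS` are names made up for this example.

```python
import os

SUPPORTED_FORMATS = {"ogg", "mp3", "wav"}


def detect_format(output_path: str, default: str = "wav") -> str:
    """Return the output format implied by the file extension."""
    # splitext only looks at the final path component, so dots in
    # directory names do not confuse it.
    ext = os.path.splitext(output_path)[1].lower().lstrip(".")
    return ext if ext in SUPPORTED_FORMATS else default
```

Falling back to WAV for unknown extensions mirrors the script, which simply renames the downloaded WAV in that case.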
Quick Start
# 1. Install dependencies
pip install requests
# 2. Set up your environment
export OMNIVOICE_HOST="192.168.1.50"
export OMNIVOICE_PORT="8001"
export REF_AUDIO_LOCAL="/path/to/your/voice_sample.wav"
# 3. Run it
python3 omnivoice-tts-bridge.py /tmp/input.txt /tmp/output.ogg
Environment Variables
| Variable | Default | Description |
|---|---|---|
| OMNIVOICE_HOST | 192.168.1.10 | OmniVoice server host |
| OMNIVOICE_PORT | 8001 | OmniVoice server port |
| KOKORO_URL | http://localhost:8880/v1/audio/speech | Kokoro API endpoint |
| REF_AUDIO_REMOTE | "" | Pre-uploaded path on OmniVoice server |
| REF_AUDIO_LOCAL | ./ref_audio.wav | Local reference audio for upload |
| TTS_VOICE | af_sky | Kokoro fallback voice name |
| TTS_SPEED | 1.0 | Speech speed multiplier |
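If you would rather validate these settings up front, one option is to load them into a small config object. This is a sketch under that assumption; the script itself reads each variable at module level. `BridgeConfig` and `load_config` are hypothetical names, and the defaults are copied from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BridgeConfig:
    host: str
    port: int
    kokoro_url: str
    ref_remote: str
    ref_local: str
    voice: str
    speed: float

    @property
    def omnivoice_url(self) -> str:
        return f"http://{self.host}:{self.port}"


def load_config(env: Mapping[str, str] = os.environ) -> BridgeConfig:
    """Build a config from environment-style key/value pairs."""
    try:
        port = int(env.get("OMNIVOICE_PORT", "8001"))
        speed = float(env.get("TTS_SPEED", "1.0"))
    except ValueError as exc:
        # Report a bad numeric value once, before any network call happens.
        raise ValueError(f"invalid numeric setting: {exc}") from exc
    return BridgeConfig(
        host=env.get("OMNIVOICE_HOST", "192.168.1.10"),
        port=port,
        kokoro_url=env.get("KOKORO_URL", "http://localhost:8880/v1/audio/speech"),
        ref_remote=env.get("REF_AUDIO_REMOTE", ""),
        ref_local=env.get("REF_AUDIO_LOCAL", "./ref_audio.wav"),
        voice=env.get("TTS_VOICE", "af_sky"),
        speed=speed,
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to test without touching your shell.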
Setting Up OmniVoice
OmniVoice runs best on a machine with a GPU (NVIDIA CUDA recommended, Intel Arc XPU also supported).
# Install
pip install omnivoice
# Start the web demo
omnivoice-demo --ip 0.0.0.0 --port 8001
Setting Up Kokoro (Fallback)
Kokoro runs on CPU — lightweight, fast, always-available.
# Clone and install
git clone https://github.com/remsky/Kokoro-FastAPI
cd Kokoro-FastAPI
pip install -r requirements.txt
# Start the server
python3 server.py --port 8880
Verify it works:
curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input":"Hello world","voice":"af_sky","response_format":"wav"}' \
  -o test.wav && file test.wav
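The `file test.wav` check can also be done in Python, which is handy if you want to confirm that a response body really is audio and not an error page. This is a minimal sketch: `looks_like_wav` only inspects the RIFF/WAVE header, and `silent_wav` is a hypothetical helper that builds a test clip in memory.

```python
import io
import wave


def looks_like_wav(data: bytes) -> bool:
    """Check the 12-byte RIFF/WAVE header; does not validate the audio itself."""
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"


def silent_wav(seconds: float = 0.1, rate: int = 24000) -> bytes:
    """Build a short silent mono 16-bit WAV in memory, for trying out the check."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()
```

A JSON or HTML error body fails this check immediately, so you catch a misconfigured endpoint before FFmpeg does.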
Integrating with Hermes Agent
Add the bridge as a custom command TTS provider in your Hermes config:
hermes config set tts.provider omnivoice
hermes config set tts.providers.omnivoice.type command
hermes config set tts.providers.omnivoice.command \
  'python3 /path/to/omnivoice-tts-bridge.py {input_path} {output_path}'
hermes config set tts.providers.omnivoice.output_format ogg
hermes config set tts.providers.omnivoice.voice_compatible true
Every text_to_speech call now routes through the bridge: OmniVoice first, with automatic fallback to Kokoro.
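To test the command by hand, it helps to see how the two placeholders could be filled in. The sketch below is an assumption about how a command provider might expand the template, not Hermes Agent's actual implementation; `expand_command` is a made-up helper.

```python
import shlex
from typing import List


def expand_command(template: str, input_path: str, output_path: str) -> List[str]:
    """Split the template into argv first, then substitute the placeholders."""
    # Splitting before substituting keeps paths that contain spaces
    # as a single argument.
    argv = shlex.split(template)
    return [
        arg.replace("{input_path}", input_path).replace("{output_path}", output_path)
        for arg in argv
    ]
```

The resulting list can be passed straight to `subprocess.run` without a shell, which avoids quoting problems in file names.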
Why This Pattern Works
Three layers of resilience:
- Network check — TCP connect to OmniVoice host:port before even calling the API
- HTTP health check — Verify the server responds with 200 OK
- API polling — 30 attempts with 3-second intervals, fallback on any error
This means your voice pipeline stays up even when the primary is down. If your GPU server is offline for maintenance, you still get speech, just from the fallback model. The caller sees a failure only if Kokoro is unavailable too, in which case the script exits with status 1.
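The third layer, polling with a fixed budget, can be sketched as a small reusable loop. This is an illustration of the idea rather than the script's code; `poll` is a hypothetical helper, and the `sleep` parameter exists only so the loop can be tested without waiting.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll(fetch: Callable[[], Optional[T]],
         attempts: int = 30,
         interval: float = 3.0,
         sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Call fetch() up to `attempts` times, sleeping `interval` seconds before each try.

    Returns the first non-None result, or None when the budget is used up.
    """
    for _ in range(attempts):
        sleep(interval)
        try:
            result = fetch()
        except Exception:
            continue  # a failed request just costs one attempt
        if result is not None:
            return result
    return None
```

With the defaults, the loop spends 30 × 3 = 90 seconds sleeping before it gives up, at which point the caller can switch to the fallback.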
The Full Script
#!/usr/bin/env python3
"""
OmniVoice TTS Bridge: Voice Clone with Kokoro Fallback

Reads text from {input_path}, tries OmniVoice (voice cloning),
falls back to local Kokoro if the server is unreachable.

Environment variables (all optional):
    OMNIVOICE_HOST    OmniVoice server host (default: 192.168.1.10)
    OMNIVOICE_PORT    OmniVoice server port (default: 8001)
    KOKORO_URL        Kokoro API endpoint (default: http://localhost:8880/v1/audio/speech)
    REF_AUDIO_REMOTE  Path to reference audio on OmniVoice server (default: auto-upload)
    REF_AUDIO_LOCAL   Local path to reference audio for upload (default: ./ref_audio.wav)
    TTS_VOICE         Kokoro fallback voice name (default: af_sky)
    TTS_SPEED         Speech speed (default: 1.0)

Usage:
    python3 omnivoice-tts-bridge.py <input_path> <output_path>
"""
import os, sys, json, time, subprocess, socket
import requests
# ── Configuration ────────────────────────────────────────────────
OMNIVOICE_HOST = os.getenv("OMNIVOICE_HOST", "192.168.1.10")
OMNIVOICE_PORT = int(os.getenv("OMNIVOICE_PORT", "8001"))
OMNIVOICE_URL = f"http://{OMNIVOICE_HOST}:{OMNIVOICE_PORT}"
KOKORO_URL = os.getenv("KOKORO_URL",
                       "http://localhost:8880/v1/audio/speech")
KOKORO_VOICE = os.getenv("TTS_VOICE", "af_sky")
TTS_SPEED = float(os.getenv("TTS_SPEED", "1.0"))
REF_REMOTE = os.getenv("REF_AUDIO_REMOTE", "")
REF_LOCAL = os.getenv("REF_AUDIO_LOCAL",
                      os.path.join(os.path.dirname(__file__), "ref_audio.wav"))
# ── Input ─────────────────────────────────────────────────────────
input_path = sys.argv[1] if len(sys.argv) > 1 else "/dev/stdin"
output_path = sys.argv[2] if len(sys.argv) > 2 else "/tmp/tts_output.ogg"
with open(input_path, encoding="utf-8") as f:
    text = f.read().strip()
if not text:
    print("Empty input")
    sys.exit(1)
# ── Helpers ───────────────────────────────────────────────────────
def to_ogg(wav_path, ogg_path):
    # OGG Opus is the format Telegram needs for native voice bubbles.
    # check=True raises if ffmpeg fails instead of leaving a broken file.
    subprocess.run(["ffmpeg", "-y", "-i", wav_path,
                    "-c:a", "libopus", "-b:a", "64k", "-vbr", "on",
                    "-f", "ogg", ogg_path],
                   capture_output=True, timeout=30, check=True)
def to_mp3(wav_path, mp3_path):
    subprocess.run(["ffmpeg", "-y", "-i", wav_path,
                    "-codec:a", "libmp3lame", "-b:a", "128k", mp3_path],
                   capture_output=True, timeout=30, check=True)
def fallback_kokoro():
    # Synthesize with local Kokoro, write output_path, then exit the process.
    # Note: the fallback always encodes OGG Opus, whatever the output extension.
    try:
        r = requests.post(KOKORO_URL, json={
            "input": text, "voice": KOKORO_VOICE,
            "speed": TTS_SPEED, "response_format": "wav",
            "model": "kokoro", "stream": False
        }, timeout=30)
        r.raise_for_status()  # an HTTP error body is not audio
        p = subprocess.Popen(["ffmpeg", "-y", "-i", "pipe:0",
                              "-c:a", "libopus", "-b:a", "64k", "-vbr", "on",
                              "-f", "ogg", output_path],
                             stdin=subprocess.PIPE, stdout=subprocess.DEVNULL,
                             stderr=subprocess.DEVNULL)
        p.communicate(r.content)
        if p.returncode != 0:
            raise RuntimeError(f"ffmpeg exited with code {p.returncode}")
        print(f"FALLBACK kokoro: {output_path}")
        sys.exit(0)
    except Exception as e:
        print(f"Fallback also failed: {e}")
        sys.exit(1)
def check_host(host, port):
    # Cheap TCP reachability probe with a 2-second timeout.
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(2)
        ok = s.connect_ex((host, port)) == 0
        s.close()
        return ok
    except Exception:
        return False
# ── Step 1: Check OmniVoice server ─────────────────────────────
if not check_host(OMNIVOICE_HOST, OMNIVOICE_PORT):
    print("OmniVoice unreachable, falling back to Kokoro")
    fallback_kokoro()
try:
    # The port is open; confirm the web app itself answers.
    r = requests.get(f"{OMNIVOICE_URL}/", timeout=3)
    if r.status_code != 200:
        fallback_kokoro()
except Exception:
    fallback_kokoro()
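# ── Step 2: Reference audio ────────────────────────────────────
ref_path = REF_REMOTE
if not ref_path:
    if not os.path.exists(REF_LOCAL):
        fallback_kokoro()
    try:
        # Upload the local sample; the response is a list of server-side paths.
        with open(REF_LOCAL, "rb") as f:
            r = requests.post(f"{OMNIVOICE_URL}/gradio_api/upload",
                              files={"files": (os.path.basename(REF_LOCAL), f)},
                              timeout=30)
        r.raise_for_status()
        ref_path = r.json()[0]
    except Exception:
        # A failed upload should trigger the fallback, not a traceback.
        fallback_kokoro()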
# ── Step 3: Estimate duration ──────────────────────────────────
word_count = len(text.split())
duration = max(10, min(120, int(word_count * 0.35)))
# ── Step 4: Call OmniVoice clone ───────────────────────────────
payload = {
    "data": [
        text, "Auto",
        {"path": ref_path, "meta": {"_type": "gradio.FileData"}},
        "", "", 40, 2.0, True, 0.9, duration, True, True
    ]
}
try:
    resp = requests.post(f"{OMNIVOICE_URL}/gradio_api/call/_clone_fn",
                         json=payload, timeout=15)
    event_id = resp.json().get("event_id")
    if not event_id:
        fallback_kokoro()
except Exception:
    fallback_kokoro()
# ── Step 5: Poll for result ────────────────────────────────────
result_url = f"{OMNIVOICE_URL}/gradio_api/call/_clone_fn/{event_id}"
for attempt in range(30):
    try:
        time.sleep(3)
        r = requests.get(result_url, stream=True, timeout=15)
        lines = [l.decode() for l in r.iter_lines() if l]
        content = "\n".join(lines)
        for line in lines:
            if not line.startswith("data:"):
                continue
            result_data = json.loads(line[5:])
            # Skip payloads without results, such as "data: null".
            if not isinstance(result_data, list) or not result_data:
                continue
            audio_info = result_data[0]
            if not audio_info:
                continue
            dl = requests.get(audio_info["url"], timeout=30)
            tmp_wav = output_path + ".wav"
            with open(tmp_wav, "wb") as f:
                f.write(dl.content)
            # Convert according to the requested file extension.
            fmt = output_path.split(".")[-1]
            if fmt == "ogg":
                to_ogg(tmp_wav, output_path)
            elif fmt == "mp3":
                to_mp3(tmp_wav, output_path)
            else:
                os.rename(tmp_wav, output_path)
            if os.path.exists(tmp_wav):
                os.remove(tmp_wav)
            print(f"OK omnivoice: {output_path}")
            sys.exit(0)
        # No audio in this response; give up early if the server reports an error.
        if "error" in content and attempt > 2:
            fallback_kokoro()
    except Exception:
        continue
# ── Timeout → fallback ─────────────────────────────────────────
print("OmniVoice timeout, falling back to Kokoro")
fallback_kokoro()
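A quick worked example of the Step 3 duration estimate, pulled out into a function so you can try values. The formula is the one from the script; the function name `estimate_duration` is introduced here for illustration only.

```python
def estimate_duration(text: str) -> int:
    """Estimate clip length in seconds: 0.35 s per word, clamped to 10..120."""
    word_count = len(text.split())
    return max(10, min(120, int(word_count * 0.35)))
```

A two-word reply still requests 10 seconds, 100 words come out at 35 seconds, and anything past roughly 343 words is capped at 120 seconds. At 0.35 seconds per word, the estimate assumes a speaking rate of about 170 words per minute.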
License
MIT — use it, fork it, share it. No attribution required.