Hermes Agent TTS Bridge: Give Your AI a Voice Clone

Zero-shot voice cloning with Kokoro fallback — because conversations hit different when your AI sounds like someone you know ❤️

A practical guide to bridging OmniVoice and Kokoro for zero-shot voice cloning with local fallback, Telegram-ready OGG output, and zero downtime.

The Problem

You’re chatting with your AI assistant — it’s fast, smart, helpful. But every reply comes back in the same robotic voice as every other AI on the planet. There’s no warmth. No personality. It doesn’t sound like your assistant.

With Hermes Agent, you can change that. This bridge gives your AI a voice clone — a warm, familiar voice that makes every conversation feel personal. Whether it’s your partner’s voice, your own, or a custom design, the AI speaks like someone you know. It transforms dry status updates and technical replies into something that actually feels like a conversation.

The technical recipe:

Primary: Voice cloning via OmniVoice — 600+ languages, zero-shot cloning from 3–10 seconds of audio, GPU-accelerated
Fallback: Kokoro — fast, lightweight local TTS on CPU (no GPU required), so your AI never goes silent
Output: OGG Opus — the format Telegram requires for native voice bubbles

The bridge script ties it all together: it tries OmniVoice first, drops to Kokoro if the server is down, and pipes everything through FFmpeg for the correct output format.
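The try-primary-then-fallback flow can be sketched in a few lines. This is a minimal illustration rather than the bridge's actual code: `synthesize_with_fallback`, `broken_engine`, and `local_engine` are hypothetical names, and each engine is assumed to return WAV bytes or raise on failure.

```python
from typing import Callable, Tuple

Synth = Callable[[str], bytes]  # hypothetical signature: text in, WAV bytes out


def synthesize_with_fallback(text: str, primary: Synth, fallback: Synth) -> Tuple[str, bytes]:
    """Try the primary engine; on any error, use the fallback instead.

    Returns a (source, audio) pair so the caller can log which engine spoke.
    """
    try:
        return "primary", primary(text)
    except Exception as exc:  # network errors, timeouts, bad responses
        print(f"primary failed ({exc}), using fallback")
        return "fallback", fallback(text)


def broken_engine(text: str) -> bytes:
    """Stand-in for an unreachable OmniVoice server."""
    raise ConnectionError("server down")


def local_engine(text: str) -> bytes:
    """Stand-in for Kokoro; returns fake audio bytes."""
    return b"RIFF" + text.encode()


source, audio = synthesize_with_fallback("hello", broken_engine, local_engine)
print(source, len(audio))
```

Returning the source label alongside the audio keeps the fallback visible in logs without making it an error for the caller.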

Architecture

┌──────────────────────────────────────┐
│              Text Input              │
│   (from AI agent / CLI / webhook)    │
└──────────────┬───────────────────────┘
               │
               ▼
 ┌─────────────────────────────┐
 │   Check OmniVoice ready?    │
 │  (TCP connect + HTTP 200)   │──── No ────▶ Kokoro Fallback
 └─────────────┬───────────────┘                       │
               │ Yes                                   │
               ▼                                       ▼
      ┌─────────────────────┐              ┌─────────────────────────┐
      │ OmniVoice API       │              │ Kokoro API              │
      │ POST /_clone_fn     │              │ POST /v1/audio/speech   │
      │ poll SSE for result │              │ (af_sky / af_bella)     │
      └────────┬────────────┘              └───────────┬─────────────┘
               │                                       │
               ▼                                       ▼
          WAV output                              WAV output
               │                                       │
               └──────────────────┬────────────────────┘
                                  ▼
                 ┌─────────────────────────────────────┐
                 │          FFmpeg Conversion          │
                 │   WAV → OGG Opus (64k, Telegram)    │
                 │   or WAV → MP3 (128k, other uses)   │
                 └────────────────┬────────────────────┘
                                  ▼
                 ┌─────────────────────────────────────┐
                 │      Final .ogg file delivered      │
                 │     to Telegram as voice bubble     │
                 └─────────────────────────────────────┘
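The conversion stage at the bottom of the diagram comes down to choosing FFmpeg arguments from the output extension. The sketch below builds the argument list without running FFmpeg. `ffmpeg_args` is a hypothetical helper; the codec and bitrate flags are the ones the bridge script passes.

```python
import os
from typing import List


def ffmpeg_args(wav_path: str, out_path: str) -> List[str]:
    """Build an FFmpeg command line for the format implied by out_path."""
    ext = os.path.splitext(out_path)[1].lower()
    base = ["ffmpeg", "-y", "-i", wav_path]
    if ext == ".ogg":
        # Opus in an OGG container, 64 kbit/s VBR (the Telegram voice format)
        return base + ["-c:a", "libopus", "-b:a", "64k", "-vbr", "on",
                       "-f", "ogg", out_path]
    if ext == ".mp3":
        return base + ["-codec:a", "libmp3lame", "-b:a", "128k", out_path]
    raise ValueError(f"no conversion defined for {ext!r}")


print(" ".join(ffmpeg_args("in.wav", "reply.ogg")))
```

Keeping the argument builder separate from the `subprocess` call makes the format logic easy to check on a machine without FFmpeg installed.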

The Script

Download omnivoice-tts-bridge.py — sanitized, ready to configure.

Features

✅ Configurable via environment variables — no hardcoded IPs or paths
✅ Falls back gracefully if OmniVoice is unreachable
✅ Accepts {input_path} / {output_path} args — compatible with Hermes Agent command providers
✅ Auto-detects output format from file extension (.ogg / .mp3 / .wav) on the OmniVoice path; the Kokoro fallback always writes OGG Opus
✅ Uploads reference audio on first run if not already on the server
✅ 30-poll retry loop with 3-second intervals (about 90 seconds of waiting, plus request time, before falling back)
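The retry budget in the last bullet can be expressed as a small generic helper. This is a sketch under stated assumptions: `poll` and its parameters are hypothetical names, and the `sleep` argument is injectable only so the timing can be checked without waiting.

```python
import time
from typing import Callable, Optional


def poll(fetch: Callable[[], Optional[bytes]], attempts: int = 30,
         interval: float = 3.0,
         sleep: Callable[[float], None] = time.sleep) -> Optional[bytes]:
    """Call fetch() up to `attempts` times, sleeping `interval` seconds before each try.

    Returns the first non-None result, or None if every attempt came back empty.
    An exception raised by fetch() counts as an empty attempt.
    """
    for _ in range(attempts):
        sleep(interval)
        try:
            result = fetch()
        except Exception:
            continue
        if result is not None:
            return result
    return None


# Demo: the result arrives on the third attempt; sleeping is disabled here.
answers = iter([None, None, b"audio"])
print(poll(lambda: next(answers), sleep=lambda seconds: None))
```

With the defaults, an unanswered poll spends 30 × 3 = 90 seconds sleeping; time spent inside `fetch()` adds to that.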

Quick Start

# 1. Install dependencies
pip install requests

# 2. Set up your environment
export OMNIVOICE_HOST="192.168.1.50"
export OMNIVOICE_PORT="8001"
export REF_AUDIO_LOCAL="/path/to/your/voice_sample.wav"

# 3. Run it
python3 omnivoice-tts-bridge.py /tmp/input.txt /tmp/output.ogg

Environment Variables

Variable	Default	Description
`OMNIVOICE_HOST`	`192.168.1.10`	OmniVoice server host
`OMNIVOICE_PORT`	`8001`	OmniVoice server port
`KOKORO_URL`	`http://localhost:8880/v1/audio/speech`	Kokoro API endpoint
`REF_AUDIO_REMOTE`	`""`	Pre-uploaded path on OmniVoice server
`REF_AUDIO_LOCAL`	`./ref_audio.wav`	Local reference audio for upload
`TTS_VOICE`	`af_sky`	Kokoro fallback voice name
`TTS_SPEED`	`1.0`	Speech speed multiplier
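One way to read these variables is to collect them into a single typed object. The sketch below is illustrative only: `BridgeConfig` and `load_config` are hypothetical names, and the defaults are copied from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BridgeConfig:
    omnivoice_host: str
    omnivoice_port: int
    kokoro_url: str
    ref_audio_remote: str
    ref_audio_local: str
    tts_voice: str
    tts_speed: float

    @property
    def omnivoice_url(self) -> str:
        """Base URL derived from host and port."""
        return f"http://{self.omnivoice_host}:{self.omnivoice_port}"


def load_config(env: Mapping[str, str] = os.environ) -> BridgeConfig:
    """Read each variable from env, using the documented default when it is unset."""
    return BridgeConfig(
        omnivoice_host=env.get("OMNIVOICE_HOST", "192.168.1.10"),
        omnivoice_port=int(env.get("OMNIVOICE_PORT", "8001")),
        kokoro_url=env.get("KOKORO_URL", "http://localhost:8880/v1/audio/speech"),
        ref_audio_remote=env.get("REF_AUDIO_REMOTE", ""),
        ref_audio_local=env.get("REF_AUDIO_LOCAL", "./ref_audio.wav"),
        tts_voice=env.get("TTS_VOICE", "af_sky"),
        tts_speed=float(env.get("TTS_SPEED", "1.0")),
    )


print(load_config({"OMNIVOICE_HOST": "192.168.1.50"}).omnivoice_url)
```

Passing the environment in as a mapping keeps the parsing testable without touching the real process environment.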

Setting Up OmniVoice

OmniVoice runs best on a machine with a GPU (NVIDIA CUDA recommended, Intel Arc XPU also supported).

# Install
pip install omnivoice

# Start the web demo
omnivoice-demo --ip 0.0.0.0 --port 8001

Reference audio: Record 3–10 seconds of clean speech, save as WAV. The bridge script uploads it automatically on first run.
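Before uploading, it can help to confirm that the sample really falls in the 3–10 second range. The sketch below uses only the standard-library `wave` module, which reads uncompressed PCM WAV files. `wav_duration_seconds` and `check_reference` are hypothetical helper names, not part of the bridge.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file: frame count divided by sample rate."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def check_reference(path: str, min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """True if the clip length is inside the recommended range."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

Calling `check_reference("/path/to/your/voice_sample.wav")` before the first run catches a clip that is too short or too long without a round trip to the server.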

Setting Up Kokoro (Fallback)

Kokoro runs on CPU — lightweight, fast, always-available.

# Clone and install
git clone https://github.com/remsky/Kokoro-FastAPI
cd Kokoro-FastAPI
pip install -r requirements.txt

# Start the server
python3 server.py --port 8880

Verify it works:

curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input":"Hello world","voice":"af_sky","response_format":"wav"}' \
  -o test.wav && file test.wav

The request above asks for WAV through `response_format`. The bridge script sends the same field and handles the FFmpeg conversion to OGG itself.
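The same check can be run from Python, where the header test stands in for `file test.wav`. This is a sketch, not part of the bridge: `looks_like_wav` and `kokoro_smoke_test` are hypothetical names, and it uses the standard-library `urllib` with the same endpoint and JSON body as the curl command.

```python
import json
import urllib.request

KOKORO_URL = "http://localhost:8880/v1/audio/speech"  # endpoint from the curl command


def looks_like_wav(data: bytes) -> bool:
    """True if the payload starts with a RIFF/WAVE header."""
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"


def kokoro_smoke_test(url: str = KOKORO_URL, voice: str = "af_sky") -> bool:
    """POST a short phrase and check that WAV audio comes back (needs a running server)."""
    body = json.dumps({"input": "Hello world", "voice": voice,
                       "response_format": "wav"}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.status == 200 and looks_like_wav(resp.read())
```

`kokoro_smoke_test()` raises if the server is unreachable, which is usually what a setup check should do.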

Integrating with Hermes Agent

Add the bridge as a custom command TTS provider in your Hermes config:

hermes config set tts.provider omnivoice
hermes config set tts.providers.omnivoice.type command
hermes config set tts.providers.omnivoice.command \
  'python3 /path/to/omnivoice-tts-bridge.py {input_path} {output_path}'
hermes config set tts.providers.omnivoice.output_format ogg
hermes config set tts.providers.omnivoice.voice_compatible true

Now every text_to_speech call routes through the bridge — OmniVoice with auto-fallback to Kokoro.
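To exercise the same command line by hand, the `{input_path}` and `{output_path}` placeholders can be filled in with a few lines of Python. This is only an illustration of the file-in, file-out contract. How Hermes Agent expands the template internally is not shown in this guide, and `expand_command` is a hypothetical helper.

```python
import shlex
from typing import List


def expand_command(template: str, input_path: str, output_path: str) -> List[str]:
    """Fill the two placeholders and split the result into an argv list."""
    # Quote the paths first so spaces in them survive shlex.split
    filled = template.format(input_path=shlex.quote(input_path),
                             output_path=shlex.quote(output_path))
    return shlex.split(filled)


argv = expand_command(
    "python3 /path/to/omnivoice-tts-bridge.py {input_path} {output_path}",
    "/tmp/input.txt", "/tmp/output.ogg")
print(argv)
```

The resulting list can be passed to `subprocess.run(argv)` once the script path points at a real file. The bridge then reads the text file and writes the audio file, exiting 0 on success.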

Why This Pattern Works

Three layers of resilience:

Network check — TCP connect to OmniVoice host:port before even calling the API
HTTP health check — Verify the server responds with 200 OK
API polling — 30 attempts with 3-second intervals, fallback on any error

This means your voice pipeline survives an OmniVoice outage. If your GPU server is down for maintenance, you still get speech, just from the fallback model. The caller only sees a failure (exit code 1) if Kokoro is unavailable too.
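The first two layers can be sketched with the standard library alone. The function names below (`tcp_open`, `http_ok`, `omnivoice_ready`) are hypothetical. The timeouts match the 2-second and 3-second values the bridge script uses, and `urllib` replaces `requests` to keep the example dependency-free.

```python
import socket
import urllib.request


def tcp_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Layer 1: can a TCP connection be opened at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Layer 2: does the server answer the request with 200 OK?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:  # refused, timed out, HTTP error status, malformed reply
        return False


def omnivoice_ready(host: str, port: int) -> bool:
    """Run the cheap TCP probe first; only then spend time on the HTTP request."""
    return tcp_open(host, port) and http_ok(f"http://{host}:{port}/")
```

The TCP probe fails fast when the machine is off or the port is closed, so the HTTP check only runs against hosts that are at least listening.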

The Full Script

#!/usr/bin/env python3
"""
OmniVoice TTS Bridge — Voice Clone with Kokoro Fallback

Reads text from {input_path}, tries OmniVoice (voice cloning),
falls back to local Kokoro if the server is unreachable.

Environment variables (all optional):
  OMNIVOICE_HOST   — OmniVoice server host (default: 192.168.1.10)
  OMNIVOICE_PORT   — OmniVoice server port (default: 8001)
  KOKORO_URL       — Kokoro API endpoint  (default: http://localhost:8880/v1/audio/speech)
  REF_AUDIO_REMOTE — Path to reference audio on OmniVoice server (default: auto-upload)
  REF_AUDIO_LOCAL  — Local path to reference audio for upload (default: ./ref_audio.wav)
  TTS_VOICE        — Kokoro fallback voice name (default: af_sky)
  TTS_SPEED        — Speech speed (default: 1.0)

Usage:
  python3 omnivoice-tts-bridge.py <input_path> <output_path>
"""

import os, sys, json, time, subprocess, socket
import requests

# ── Configuration ────────────────────────────────────────────────
OMNIVOICE_HOST = os.getenv("OMNIVOICE_HOST", "192.168.1.10")
OMNIVOICE_PORT = int(os.getenv("OMNIVOICE_PORT", "8001"))
OMNIVOICE_URL  = f"http://{OMNIVOICE_HOST}:{OMNIVOICE_PORT}"

KOKORO_URL  = os.getenv("KOKORO_URL",
    "http://localhost:8880/v1/audio/speech")
KOKORO_VOICE = os.getenv("TTS_VOICE", "af_sky")
TTS_SPEED    = float(os.getenv("TTS_SPEED", "1.0"))

REF_REMOTE = os.getenv("REF_AUDIO_REMOTE", "")
REF_LOCAL  = os.getenv("REF_AUDIO_LOCAL",
    os.path.join(os.path.dirname(__file__), "ref_audio.wav"))

# ── Input ─────────────────────────────────────────────────────────
input_path  = sys.argv[1] if len(sys.argv) > 1 else "/dev/stdin"
output_path = sys.argv[2] if len(sys.argv) > 2 else "/tmp/tts_output.ogg"

with open(input_path) as f:
    text = f.read().strip()

if not text:
    print("Empty input")
    sys.exit(1)

# ── Helpers ───────────────────────────────────────────────────────
def to_ogg(wav_path, ogg_path):
    # check=True raises CalledProcessError when FFmpeg exits non-zero
    subprocess.run(["ffmpeg", "-y", "-i", wav_path,
        "-c:a", "libopus", "-b:a", "64k", "-vbr", "on",
        "-f", "ogg", ogg_path], capture_output=True, timeout=30, check=True)

def to_mp3(wav_path, mp3_path):
    subprocess.run(["ffmpeg", "-y", "-i", wav_path,
        "-codec:a", "libmp3lame", "-b:a", "128k", mp3_path],
        capture_output=True, timeout=30, check=True)

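def fallback_kokoro():
    """Synthesize with Kokoro, then exit: 0 on success, 1 on failure."""
    try:
        r = requests.post(KOKORO_URL, json={
            "input": text, "voice": KOKORO_VOICE,
            "speed": TTS_SPEED, "response_format": "wav",
            "model": "kokoro", "stream": False
        }, timeout=30)
        r.raise_for_status()  # never hand an HTTP error body to FFmpeg
        # The fallback always encodes OGG Opus, whatever output_path's extension
        p = subprocess.Popen(["ffmpeg", "-y", "-i", "pipe:0",
            "-c:a", "libopus", "-b:a", "64k", "-vbr", "on",
            "-f", "ogg", output_path],
            stdin=subprocess.PIPE, stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL)
        p.communicate(r.content)
        if p.returncode != 0:
            raise RuntimeError(f"ffmpeg exited with code {p.returncode}")
        print(f"FALLBACK kokoro: {output_path}")
        sys.exit(0)
    except Exception as e:
        print(f"Fallback also failed: {e}")
        sys.exit(1)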

def check_host(host, port):
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(2)
        ok = s.connect_ex((host, port)) == 0
        s.close()
        return ok
    except Exception:
        return False

# ── Step 1: Check OmniVoice server ─────────────────────────────
if not check_host(OMNIVOICE_HOST, OMNIVOICE_PORT):
    print(f"OmniVoice unreachable, falling back to Kokoro")
    fallback_kokoro()

try:
    r = requests.get(f"{OMNIVOICE_URL}/", timeout=3)
    if r.status_code != 200:
        fallback_kokoro()
except Exception:
    fallback_kokoro()

# ── Step 2: Reference audio ────────────────────────────────────
ref_path = REF_REMOTE
if not ref_path:
    if not os.path.exists(REF_LOCAL):
        fallback_kokoro()
    try:
        with open(REF_LOCAL, "rb") as f:
            r = requests.post(f"{OMNIVOICE_URL}/gradio_api/upload",
                files={"files": (os.path.basename(REF_LOCAL), f)}, timeout=30)
        r.raise_for_status()
        ref_path = r.json()[0]
    except Exception:
        # A failed upload should degrade to Kokoro, not crash the bridge
        fallback_kokoro()

# ── Step 3: Estimate duration ──────────────────────────────────
word_count = len(text.split())
duration = max(10, min(120, int(word_count * 0.35)))

# ── Step 4: Call OmniVoice clone ───────────────────────────────
payload = {
    "data": [
        text, "Auto",
        {"path": ref_path, "meta": {"_type": "gradio.FileData"}},
        "", "", 40, 2.0, True, 0.9, duration, True, True
    ]
}

try:
    resp = requests.post(f"{OMNIVOICE_URL}/gradio_api/call/_clone_fn",
        json=payload, timeout=15)
    event_id = resp.json().get("event_id")
    if not event_id:
        fallback_kokoro()
except Exception as e:
    fallback_kokoro()

# ── Step 5: Poll for result ────────────────────────────────────
result_url = f"{OMNIVOICE_URL}/gradio_api/call/_clone_fn/{event_id}"
for attempt in range(30):
    try:
        time.sleep(3)
        r = requests.get(result_url, stream=True, timeout=15)
        lines = [l.decode() for l in r.iter_lines() if l]
        for line in lines:
            if not line.startswith("data:"):
                continue
            result_data = json.loads(line[5:])
            # Skip events whose data is null or empty (no audio yet)
            if not result_data or not result_data[0]:
                continue
            audio_info = result_data[0]
            dl = requests.get(audio_info["url"], timeout=30)
            dl.raise_for_status()
            tmp_wav = output_path + ".wav"
            with open(tmp_wav, "wb") as f:
                f.write(dl.content)
            # Pick the conversion from the output file's extension
            fmt = os.path.splitext(output_path)[1].lstrip(".").lower()
            if fmt == "ogg":
                to_ogg(tmp_wav, output_path)
            elif fmt == "mp3":
                to_mp3(tmp_wav, output_path)
            else:
                os.replace(tmp_wav, output_path)
            if os.path.exists(tmp_wav):
                os.remove(tmp_wav)
            print(f"OK omnivoice: {output_path}")
            sys.exit(0)
        # Give up early if the stream keeps reporting an error
        if any("error" in l for l in lines) and attempt > 2:
            fallback_kokoro()
    except Exception:
        continue

# ── Timeout → fallback ─────────────────────────────────────────
print("OmniVoice timeout, falling back to Kokoro")
fallback_kokoro()
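The result parsing in Step 5 can be pulled out into a pure function, which makes it checkable without a server. This is a sketch: `extract_audio_url` is a hypothetical helper, and the sample stream is made up to show the shape the script expects (a `data:` line holding a JSON list whose first item has a `url` key), not captured from a real OmniVoice run.

```python
import json
from typing import Iterable, Optional


def extract_audio_url(sse_lines: Iterable[str]) -> Optional[str]:
    """Return the first audio URL found in `data:` lines, or None."""
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip `event:` lines and anything else
        try:
            payload = json.loads(line[5:])
        except json.JSONDecodeError:
            continue
        # Expect a non-empty list whose first element is a file description
        if isinstance(payload, list) and payload and isinstance(payload[0], dict):
            url = payload[0].get("url")
            if url:
                return url
    return None


# Illustrative stream only; field values are invented for the example.
sample = [
    "event: heartbeat",
    "data: null",
    "event: complete",
    'data: [{"path": "/tmp/out.wav", "url": "http://192.168.1.10:8001/file=/tmp/out.wav"}]',
]
print(extract_audio_url(sample))
```

Separating parsing from the download and conversion steps also keeps the polling loop's `try` block from hiding parsing mistakes behind a silent retry.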

License

MIT — use it, fork it, share it. No attribution required.