When NetworkChuck Inspires a Mad Scientist: How I Built “Svetlana”, a SIP-Based AI Voice Assistant

If you are plugged into the homelab and tech community on YouTube, you probably saw NetworkChuck’s video: “I built a phone system because no one answers me.” In that video, Chuck demonstrates how much fun it is to bridge old-school analog telephony with modern network infrastructure using a 3CX PBX and an Analog Telephone Adapter (ATA) to voice-control Home Assistant.

Like any self-respecting tech enthusiast, watching that video triggered a massive wave of “I need to build this, but crazier.” Enter Svetlana.

Svetlana isn’t just a basic bridge to a cloud-based AI. She is a fully custom, self-hosted, SIP-based AI voice assistant integrated directly into a local 3CX phone system. When you call the designated extension, Svetlana answers, processes your speech locally, communicates with a custom LLM backend, and speaks back to you. Oh, and because a good assistant keeps you informed, she proactively initiates an outbound call to the user’s extension every morning at 10:00 AM sharp to deliver a complete homelab and house status report.

The Architecture Overview

To bring Svetlana to life, I had to link multiple open-source AI projects together and pipe them cleanly through a VoIP stack. The infrastructure maps out like this:

You (User Extension)
        │
        │  SIP / RTP
        ▼
  3CX PBX (PBX IP)
        │
        │  SIP INVITE → Bot Extension
        ▼
  Svetlana Bot (Bot IP:5060)
        │
        ├─► faster-whisper  (STT - Local Speech-to-Text)
        ├─► Hermes Agent    (AI Brain / LLM with Home Assistant access)
        └─► Kokoro TTS      (High-fidelity Voice Synthesis)

The underlying pipeline is fully self-hosted within a local subnet:

  • 3CX PBX handles the call routing.
  • faster-whisper handles lightning-fast local speech-to-text (STT).
  • Hermes Agent serves as the central AI orchestration brain, holding deep native integrations into Home Assistant.
  • Kokoro TTS provides human-like voice responses (using the af_heart voice profile).

Deep Dive: The Hermes Agent Integration

The core intelligence of Svetlana relies heavily on the Hermes Agent. Beyond processing dialogue text, the Hermes Agent acts as a local, action-oriented orchestrator. It is reached through a local HTTP API endpoint and has full, stateful access to the Home Assistant component registries. This allows the LLM to interpret intent, query critical sensor states (like server temperatures or live power metrics), and trigger targeted automation scripts during the active phone call.
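A minimal sketch of what the hand-off to the agent could look like, using only the Python standard library. The URL, the "message" request field, and the "reply" response field are all assumptions made for illustration; the real Hermes Agent API may use different paths and field names.

```python
import json
import urllib.request

# Hypothetical endpoint: adjust to wherever the agent actually listens.
HERMES_URL = "http://127.0.0.1:8080/chat"


def build_hermes_request(prompt: str, url: str = HERMES_URL) -> urllib.request.Request:
    """Wrap the transcribed text in a JSON POST request (assumed payload shape)."""
    body = json.dumps({"message": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_hermes(prompt: str, timeout: float = 30.0) -> str:
    """Send the prompt and return the agent's text ("reply" is an assumed field)."""
    request = build_hermes_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload.get("reply", "")
```

Keeping the request builder separate from the network call makes the payload easy to inspect without a running agent.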

Step 1: Setting up the 3CX SIP Trunk

Following NetworkChuck’s blueprint, I started by configuring a Generic SIP Trunk (IP-based) in 3CX, naming it Svetlana BOT. I pointed the registrar straight to my Linux bot environment on port 5060.

Since it’s an internal, local subnet deployment, no authentication was required for the trunk. I set the default route so that any call to the bot’s extension routed straight to Svetlana.

Everything passed the 3CX Trunk Checker with flying colors. But when I picked up my physical extension to make the first test call, things immediately fell apart.

The Diagnostics: Defeating the Voicemail Loop & Registration Gremlins

In the spirit of a true homelab journey, things rarely work on the first try. I encountered two massive roadblocks that required some serious low-level network debugging.

Issue #1: The Phantom ATA Battle

I have a Grandstream FXS adapter to connect analog hardware in the house. Without my noticing, Line 1 of the ATA was registered on the same extension I had assigned to the bot, while Line 2 was mapped to my main desk extension. This meant the Grandstream hardware was actively competing with my custom Python script for incoming SIP packets on the bot’s extension.

  • The Fix: I remapped the ATA Line 1 to a completely separate, newly created extension, clearing the path for Svetlana to control her extension unimpeded.

Issue #2: The “Offline” Extension Trap

The core Python backend of the bot (svetlana.py) was utilizing a library to listen passively for incoming SIP INVITE packets. However, because it never initiated a SIP REGISTER request to the 3CX server, 3CX assumed the extension was offline and automatically forwarded every single call directly to voicemail.

I tried using an open-source client called baresip to handle registration in the background, but it introduced a port conflict: baresip listened on 5061, while my script ran on 5060. 3CX would route the call to the registered port (5061), causing my main script to miss it completely.

The Elegant Solution: Native Digest Authentication in Python

To fix the registration issues, I engineered a native SIP REGISTER loop directly into my Python architecture.

The trick was bypassing the Address already in use error. Since the voice utility locks down UDP port 5060 to listen for inbound calls, I bound a separate registration socket to a completely different port (e.g., 5062). However, inside the outbound SIP payload, the Via and Contact headers are manually written to advertise port 5060. 3CX receives the registration from the alternate port, but knows to send incoming phone calls to port 5060.
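The two-port trick can be sketched roughly as follows. The function names, the Expires value, and the default ports are illustrative assumptions rather than the code in svetlana.py; the point is only that the socket binds to one port while the headers advertise another.

```python
import socket
import uuid


def build_register(pbx_ip, bot_ip, extension, cseq, call_id,
                   advertised_port=5060, expires=120, auth_header=None):
    """Build a SIP REGISTER whose Via and Contact name the call-handling port."""
    branch = "z9hG4bK" + uuid.uuid4().hex[:16]  # RFC 3261 branch magic cookie
    tag = uuid.uuid4().hex[:8]
    lines = [
        f"REGISTER sip:{pbx_ip} SIP/2.0",
        f"Via: SIP/2.0/UDP {bot_ip}:{advertised_port};branch={branch}",
        "Max-Forwards: 70",
        f"From: <sip:{extension}@{pbx_ip}>;tag={tag}",
        f"To: <sip:{extension}@{pbx_ip}>",
        f"Call-ID: {call_id}",
        f"CSeq: {cseq} REGISTER",
        f"Contact: <sip:{extension}@{bot_ip}:{advertised_port}>",
        f"Expires: {expires}",  # assumed value; must exceed the refresh interval
    ]
    if auth_header:
        lines.append(auth_header)  # e.g. a Proxy-Authorization line
    lines.append("Content-Length: 0")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def open_registration_socket(bot_ip, reg_port=5062):
    """Bind a second UDP socket so port 5060 stays free for inbound calls."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bot_ip, reg_port))
    return sock
```

Whether a PBX answers on the socket's source port or on the port named in Via can vary between servers; the rport parameter from RFC 3581 exists to request replies on the source port, should the responses go missing.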

Furthermore, 3CX challenges registration attempts, responding with a 407 Proxy Authentication Required status code. I wrote a custom MD5 digest authentication routine to handle this handshake:

Svetlana Bot (Reg Port)  ── REGISTER (No Auth) ──►  3CX PBX
Svetlana Bot (Reg Port)  ◄── 407 Proxy Auth Required ──  3CX PBX (Contains Nonce)

*Compute Local MD5 Digest Auth*
HA1 = MD5(Username : Realm : Password)
HA2 = MD5("REGISTER" : URI)
Response = MD5(HA1 : Nonce : HA2)

Svetlana Bot (Reg Port)  ── REGISTER + Proxy-Authorization ──►  3CX PBX
Svetlana Bot (Reg Port)  ◄── 200 OK (Successfully Registered!) ──  3CX PBX

This loop refreshes every 55 seconds to ensure Svetlana stays perfectly active and online in the eyes of the PBX.
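The digest computation above translates almost line for line into Python's hashlib. This is a sketch of the simple case shown in the handshake (no qop parameter); if a server sends qop="auth", the response also needs nc and cnonce values, which this example does not handle. The helper names and the challenge parser are illustrative assumptions.

```python
import hashlib
import re


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def parse_challenge(header_value: str) -> dict:
    """Pull quoted key="value" pairs (realm, nonce, ...) out of a 407 challenge."""
    return dict(re.findall(r'(\w+)="([^"]*)"', header_value))


def digest_response(username, realm, password, method, uri, nonce) -> str:
    ha1 = md5_hex(f"{username}:{realm}:{password}")
    ha2 = md5_hex(f"{method}:{uri}")
    return md5_hex(f"{ha1}:{nonce}:{ha2}")


def proxy_authorization(username, password, uri, challenge) -> str:
    """Build the header line for the second, authenticated REGISTER."""
    params = parse_challenge(challenge)
    realm, nonce = params["realm"], params["nonce"]
    response = digest_response(username, realm, password, "REGISTER", uri, nonce)
    return (
        f'Proxy-Authorization: Digest username="{username}", realm="{realm}", '
        f'nonce="{nonce}", uri="{uri}", response="{response}", algorithm=MD5'
    )
```

The returned line can be appended to the REGISTER headers, with the CSeq number incremented for the retry.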

The Inbound Magic: A Local AI Conversational Flow

Once registered, the live call handling works like a beautifully orchestrated symphony of open-source AI:

  1. The Inbound Connection: You dial the bot extension. 3CX hits the Python script via a SIP INVITE.
  2. Audio Stream: The call is answered, and a custom call protocol takes control of the raw RTP audio stream.
  3. The Greeting: Svetlana speaks through the line: “Hey love. Talk to me.”
  4. Voice Activity Detection: The incoming audio buffers locally. The script tracks the energy threshold. When audio energy surpasses the threshold, it marks the user as speaking. After exactly 0.8 seconds of silence, it stops recording.
  5. Speech-To-Text: The raw buffer feeds into faster-whisper for near-instant transcription.
  6. The Brain: The text prompt is sent via an HTTP API to the Hermes Agent. The agent evaluates the query, calls the local Home Assistant API if home infrastructure data is needed, and formulates a response.
  7. Text-To-Speech: The agent’s text response streams into Kokoro TTS.
  8. Audio Processing: ffmpeg takes Kokoro’s crisp 48 kHz stereo PCM output and down-samples it to the 8 kHz mono telephony standard.
  9. The Feedback Loop: The audio plays back over the RTP stream as PCMU (G.711 µ-law) audio.
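The voice activity detection in step 4 can be sketched as a small energy gate. This example assumes the incoming audio has already been decoded to 16-bit signed little-endian PCM and arrives in 20 ms frames; the threshold of 500 and the function names are illustrative guesses that would need tuning against a real line.

```python
import math
import struct

FRAME_MS = 20           # assumed frame length
SILENCE_LIMIT_MS = 800  # stop after 0.8 seconds of silence


def frame_rms(frame: bytes) -> float:
    """Root-mean-square energy of one frame of 16-bit little-endian PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[:count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def collect_utterance(frames, threshold=500.0,
                      frame_ms=FRAME_MS, silence_limit_ms=SILENCE_LIMIT_MS) -> bytes:
    """Record from the first loud frame until the silence limit is reached."""
    needed_silent = -(-silence_limit_ms // frame_ms)  # ceiling division
    recorded = []
    speaking = False
    silent_run = 0
    for frame in frames:
        loud = frame_rms(frame) > threshold
        if not speaking:
            if loud:  # energy crossed the threshold: the caller started talking
                speaking = True
                recorded.append(frame)
            continue
        recorded.append(frame)
        silent_run = 0 if loud else silent_run + 1
        if silent_run >= needed_silent:
            break
    return b"".join(recorded)
```

The returned buffer includes the trailing silence; it could be trimmed before transcription, although a short tail is usually harmless for speech-to-text.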

Taking It Further: Proactive Outbound Morning Reports

While calling an AI on an old phone is cool, having the AI call you is next-level. I implemented a cron-style background scheduler, run under systemd, that triggers every morning at 10:00 AM.
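The daily trigger can live outside the script entirely (a systemd timer or a crontab entry). As an in-process alternative, here is a minimal sketch that computes the next 10:00 run in local time and sleeps until then; the function names are hypothetical and time zones and DST shifts are ignored for brevity.

```python
import time as time_module
from datetime import datetime, time, timedelta


def next_run(now: datetime, at: time = time(10, 0)) -> datetime:
    """Return the next occurrence of `at`, strictly after `now`."""
    candidate = datetime.combine(now.date(), at)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def run_daily(job, at: time = time(10, 0)) -> None:
    """Block forever, calling job() once per day at the given local time."""
    while True:
        delay = (next_run(datetime.now(), at) - datetime.now()).total_seconds()
        time_module.sleep(max(delay, 0))
        job()
```

An external timer has the advantage of surviving a crash of the bot process, while the in-process loop keeps everything in one service.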

Svetlana initiates a raw UDP SIP INVITE targeting my main desk extension. Behind the scenes, the script dynamically compiles a specialized context payload for the Hermes Agent:

“Give me a brief morning report of the house and homelab. Include: indoor temperature, energy usage, servers online/offline, and anything notable. Address me by name. Keep it under 5 sentences.”

Because the Hermes Agent queries Home Assistant natively via JSON-RPC, it generates an up-to-date summary of the current metrics. The response text is fed to Kokoro TTS, converted to PCMU (G.711 µ-law) at 8 kHz via ffmpeg, split into packets with hand-built 12-byte RTP headers, and read aloud to me through the physical phone line.
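The last two stages, conversion and packetization, can be sketched like this. The file paths and function names are placeholders; the header layout follows the fixed 12-byte RTP header from RFC 3550, and payload type 0 is the static assignment for PCMU. At 8 kHz with one byte per sample, a 20 ms packet carries 160 bytes.

```python
import struct

PCMU_PAYLOAD_TYPE = 0     # static RTP payload type for G.711 µ-law
SAMPLES_PER_PACKET = 160  # 20 ms at 8 kHz, one byte per sample


def ffmpeg_pcmu_args(src: str, dst: str) -> list:
    """Arguments for converting any input to raw 8 kHz mono µ-law."""
    return ["ffmpeg", "-y", "-i", src, "-ar", "8000", "-ac", "1", "-f", "mulaw", dst]


def rtp_header(seq: int, timestamp: int, ssrc: int,
               payload_type: int = PCMU_PAYLOAD_TYPE, marker: bool = False) -> bytes:
    """Pack the fixed 12-byte RTP header (version 2, no padding/extension/CSRC)."""
    byte0 = 2 << 6
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1,
                       seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def packetize(pcmu: bytes, ssrc: int, first_seq: int = 0, first_ts: int = 0):
    """Yield RTP packets; the timestamp advances by one per audio sample."""
    for offset in range(0, len(pcmu), SAMPLES_PER_PACKET):
        index = offset // SAMPLES_PER_PACKET
        header = rtp_header(first_seq + index, first_ts + offset, ssrc,
                            marker=(index == 0))
        yield header + pcmu[offset:offset + SAMPLES_PER_PACKET]
```

The argument list can be passed to subprocess.run, and each packet would then be sent over UDP roughly every 20 ms so the phone plays the audio in real time.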

When I pick up my desk phone at 10:00 AM, Svetlana gives me a concise, automated breakdown of my entire digital life.

Future Blueprint

The project is currently running stably as a native systemd service on my local server, but a mad scientist’s homelab is never truly completed. My upcoming milestones for Svetlana include:

  • True Exterior Routing: Assigning a dedicated public VoIP number to the environment so I can dial into my home AI from anywhere in the world.
  • Two-Way Outbound Conversations: Expanding the morning call script from a one-way broadcast into a fully interactive, multi-turn dialogue driven dynamically by the Hermes Agent.
  • Emergency Alert Intercepts: Programming Home Assistant to trigger automated emergency outbound calls via Svetlana if critical sensors trip (e.g., smoke detection, water leakage, or door security breaches).

NetworkChuck proved that telecommunication engineering can be an incredible playground. By shifting the computing stack entirely to local AI engines, you can turn a vintage telephone framework into a futuristic, context-aware command center.

Would you let an AI call your extension every morning? Let me know your thoughts in the comments below!