
AI Voice Agents for Customer Support Using Asterisk + LLMs

December 02, 2024 • 7 min read • By Amey Lokare

🤖 Introduction

Imagine calling customer support and instead of navigating a frustrating menu tree, you're greeted by an intelligent AI agent that understands natural language, answers questions accurately, and routes you to the right department, all in under 10 seconds. This isn't science fiction anymore.

I recently built a production-grade AI voice agent system that integrates Asterisk PBX with Large Language Models (LLMs) to handle real customer support calls. In this post, I'll show you the complete architecture, implementation challenges, and real-world performance metrics.

🎯 What We're Building

A fully automated voice agent that can:

  • βœ… Answer incoming calls and greet callers naturally
  • βœ… Understand speech using Whisper speech-to-text
  • βœ… Process queries using LLMs (Llama 3.1 70B or GPT-4)
  • βœ… Respond with voice using text-to-speech (Piper TTS or ElevenLabs)
  • βœ… Access knowledge bases via RAG (Retrieval-Augmented Generation)
  • βœ… Transfer calls to human agents when needed
  • βœ… Log conversations for quality assurance and training

πŸ—οΈ System Architecture

┌─────────────┐      SIP/RTP      ┌──────────────┐
│   Caller    │ ◄───────────────► │   Asterisk   │
│ (Customer)  │                   │     PBX      │
└─────────────┘                   └──────┬───────┘
                                         │ AMI/AGI
                                         ▼
                                  ┌──────────────┐
                                  │    Python    │
                                  │ Orchestrator │
                                  └──────┬───────┘
                                         │
               ┌─────────────────────────┼─────────────────────────┐
               │                         │                         │
               ▼                         ▼                         ▼
       ┌────────────────┐         ┌──────────────┐         ┌────────────────┐
       │    Whisper     │         │     LLM      │         │   Piper TTS    │
       │ Speech-to-Text │         │ (Llama/GPT)  │         │ Text-to-Speech │
       └────────────────┘         └──────┬───────┘         └────────────────┘
                                         │
                                         ▼
                                  ┌──────────────┐
                                  │  Vector DB   │
                                  │  (ChromaDB)  │
                                  │ RAG Pipeline │
                                  └──────────────┘

πŸ› οΈ Tech Stack Breakdown

| Component | Technology | Purpose |
|---|---|---|
| PBX Core | Asterisk 20+ with AGI/AMI | Call handling, audio streaming |
| Speech Recognition | Whisper Large-v3 (local) or Deepgram API | Real-time transcription with 95%+ accuracy |
| LLM Brain | Llama 3.1 70B (quantized) or GPT-4 | Natural language understanding & generation |
| Text-to-Speech | Piper TTS (local) or ElevenLabs API | Natural voice synthesis |
| Knowledge Base | ChromaDB + LangChain RAG | Company docs, FAQs, product info |
| Orchestration | Python (FastAPI + asyncio) | Coordinate all components |
| Audio Processing | FFmpeg + pydub | Format conversion, noise reduction |

💻 Implementation: Step-by-Step

Step 1: Asterisk Dialplan Configuration

First, set up the dialplan to route incoming calls to our AGI script:

; /etc/asterisk/extensions.conf

[ai-agent-incoming]
exten => _X.,1,NoOp(AI Voice Agent Starting)
 same => n,Answer()
 same => n,Set(CHANNEL(hangup_handler_push)=cleanup,s,1)
 same => n,AGI(agi://127.0.0.1:4573/voice-agent)
 same => n,Hangup()

[cleanup]
exten => s,1,NoOp(Call cleanup for ${CHANNEL})
 same => n,Return()
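
For inbound calls to reach that context, the SIP trunk's endpoint must point at it. A minimal, illustrative pjsip.conf sketch; the section name and contact below are placeholders, not part of the original setup:

; /etc/asterisk/pjsip.conf (illustrative sketch; adapt to your SIP provider)
[sip-provider]
type = endpoint
context = ai-agent-incoming   ; send inbound calls to the dialplan above
disallow = all
allow = ulaw
aors = sip-provider

[sip-provider]
type = aor
contact = sip:sip.provider.example.com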

Step 2: Python AGI Server Setup

from fastapi import FastAPI, WebSocket
from asterisk.agi import AGI
import whisper
import subprocess
import asyncio
import logging
import requests

logger = logging.getLogger(__name__)

app = FastAPI()

# Load Whisper model (do this once at startup)
whisper_model = whisper.load_model("large-v3")


class VoiceAgent:
    def __init__(self, agi):
        self.agi = agi
        self.conversation_history = []

    async def greet_caller(self):
        """Initial greeting"""
        greeting = "Hello! I'm an AI assistant. How can I help you today?"
        await self.speak(greeting)

    async def listen(self, timeout=10):
        """Record caller's speech and transcribe"""
        # Record audio from caller
        audio_file = f"/tmp/recording_{self.agi.env['agi_uniqueid']}.wav"
        self.agi.record_file(audio_file, format='wav', timeout=timeout * 1000)

        # Transcribe using Whisper
        result = whisper_model.transcribe(audio_file)
        return result['text']

    async def think(self, user_input):
        """Process with LLM (RAG context and prompt are built inside query_llm)"""
        response = await self.query_llm(user_input)
        self.conversation_history.append((user_input, response))
        return response

    async def speak(self, text):
        """Convert text to speech and play"""
        # Generate audio using Piper TTS
        audio_file = f"/tmp/tts_{hash(text)}.wav"
        subprocess.run(
            ['piper', '--model', 'en_US-lessac-medium', '--output_file', audio_file],
            input=text.encode()
        )

        # Play to caller (Asterisk expects the file name without extension)
        self.agi.stream_file(audio_file.replace('.wav', ''))

    async def query_llm(self, prompt):
        """Query LLM with RAG context"""
        # Retrieve relevant docs from vector DB (retrieve_context is defined in Step 3)
        context = retrieve_context(prompt)

        # Build final prompt
        full_prompt = f"""You are a helpful customer support agent.

Context from knowledge base:
{context}

Conversation history:
{self.format_history()}

Customer: {prompt}

Agent:"""

        # Call LLM (example using ollama)
        response = requests.post('http://localhost:11434/api/generate', json={
            'model': 'llama3.1:70b',
            'prompt': full_prompt,
            'stream': False
        })
        return response.json()['response']

    def format_history(self):
        """Render previous turns for the prompt"""
        return "\n".join(f"Customer: {q}\nAgent: {a}" for q, a in self.conversation_history)


# AGI endpoint
@app.post("/voice-agent")
async def handle_call(agi_data: dict):
    agi = AGI()
    agent = VoiceAgent(agi)

    try:
        # Greet caller
        await agent.greet_caller()

        # Conversation loop
        max_turns = 10
        for turn in range(max_turns):
            # Listen to customer
            user_input = await agent.listen()

            if not user_input or len(user_input) < 3:
                await agent.speak("I didn't catch that. Could you repeat?")
                continue

            # Check for transfer keywords
            if any(word in user_input.lower() for word in ['human', 'agent', 'representative']):
                await agent.speak("Let me transfer you to a human agent.")
                agi.exec_command('Dial', 'SIP/agent-queue')
                break

            # Process with LLM
            response = await agent.think(user_input)

            # Respond
            await agent.speak(response)

            # Check if resolved
            if 'goodbye' in user_input.lower() or 'thank you' in user_input.lower():
                await agent.speak("You're welcome! Have a great day!")
                break

    except Exception as e:
        logger.error(f"Error in voice agent: {e}")
        await agent.speak("I'm having technical difficulties. Transferring you now.")
        agi.exec_command('Dial', 'SIP/agent-queue')


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=4573)

Step 3: RAG Knowledge Base Setup

from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load company knowledge base
documents = load_company_docs()  # FAQs, product info, policies

# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = text_splitter.split_documents(documents)

# Create embeddings
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Create vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Retrieval function
def retrieve_context(query, k=3):
    results = vectorstore.similarity_search(query, k=k)
    return "\n\n".join([doc.page_content for doc in results])
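
As a quick sanity check before wiring this into the voice loop, you can query the store directly; the question below is just an example:

# Example query against the knowledge base (illustrative question)
print(retrieve_context("What is your refund policy?"))
# Prints the three most relevant chunks, separated by blank lines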

⚡ Performance Optimization

Latency Breakdown (Target: <3 seconds total)

| Component | Before Optimization | After Optimization |
|---|---|---|
| Speech Recognition (Whisper) | 2.5s | 0.8s (GPU + streaming) |
| LLM Response (Llama 70B) | 4.2s | 1.3s (quantization + vLLM) |
| Text-to-Speech (Piper) | 1.8s | 0.6s (pre-warmed model) |
| Total Response Time | 8.5s | 2.7s ✅ |

Optimization Techniques:

  1. Model Quantization: 4-bit quantized Llama 70B runs 3x faster
  2. GPU Acceleration: CUDA for Whisper + vLLM for inference
  3. Streaming Transcription: Start processing before audio finishes
  4. Model Pre-warming: Keep models loaded in memory
  5. Async Processing: Pipeline stages run in parallel
  6. Caching: Common responses cached for instant playback (see the sketch after this list)
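
A minimal sketch of that caching idea, assuming the VoiceAgent.speak() method from Step 2; the cache directory and md5-based file naming are illustrative:

import hashlib
import os

TTS_CACHE_DIR = "/tmp/tts_cache"
os.makedirs(TTS_CACHE_DIR, exist_ok=True)

def tts_cache_path(text):
    """Stable file name for a given response text."""
    key = hashlib.md5(text.encode()).hexdigest()
    return os.path.join(TTS_CACHE_DIR, f"{key}.wav")

def get_cached_audio(text):
    """Return the cached WAV path if this response was already synthesized, else None."""
    path = tts_cache_path(text)
    return path if os.path.exists(path) else None

Inside speak(), check get_cached_audio(text) before invoking Piper and write new audio to tts_cache_path(text), so repeated phrases like the greeting play back instantly.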

⚠️ Challenges & Solutions

Challenge 1: Voice Activity Detection (VAD)

Problem: It's hard to know when the caller has finished speaking; too short a silence window cuts them off, too long a window feels unresponsive.

Solution: Implemented adaptive VAD that combines energy detection with a 1.5s silence threshold: once the signal energy stays below the threshold for 1.5 seconds, the speaker is assumed to have finished.
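
A simplified sketch of that energy check, assuming 16-bit mono audio frames; the RMS threshold is illustrative and would be tuned per deployment:

import audioop

SILENCE_THRESHOLD = 500   # RMS level treated as silence (tune per line/codec)
SILENCE_DURATION = 1.5    # seconds of silence that end the caller's turn

def caller_finished(frames, frame_ms=30):
    """Return True once the trailing frames add up to 1.5s of low-energy audio."""
    needed = int(SILENCE_DURATION * 1000 / frame_ms)
    if len(frames) < needed:
        return False
    return all(audioop.rms(frame, 2) < SILENCE_THRESHOLD for frame in frames[-needed:])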

Challenge 2: LLM Hallucinations

Problem: The LLM would sometimes make up information that wasn't in the knowledge base.

Solution: Strict prompt engineering ("Only answer based on the provided context. If unsure, say 'Let me transfer you to a specialist.'") combined with confidence scoring on retrievals.
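
A sketch of the retrieval confidence gate, assuming the Chroma store from Step 3; similarity_search_with_score returns (document, distance) pairs, and the threshold here is illustrative:

MAX_DISTANCE = 0.6   # reject retrievals that are too far from the query

def retrieve_context_with_confidence(query, k=3):
    results = vectorstore.similarity_search_with_score(query, k=k)
    relevant = [doc for doc, distance in results if distance <= MAX_DISTANCE]
    if not relevant:
        return None   # caller can fall back to "Let me transfer you to a specialist."
    return "\n\n".join(doc.page_content for doc in relevant)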

Challenge 3: Background Noise

Problem: Mobile callers in noisy environments caused transcription errors.

Solution: Added noise reduction preprocessing with FFmpeg:

ffmpeg -i input.wav -af "highpass=f=200, lowpass=f=3000" output.wav
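
Since pydub is already in the stack, the same band-pass cleanup can be applied from Python; the cutoffs below mirror the FFmpeg filter above:

from pydub import AudioSegment

def denoise(path_in, path_out):
    """Apply a 200 Hz high-pass and 3 kHz low-pass before transcription."""
    audio = AudioSegment.from_wav(path_in)
    audio = audio.high_pass_filter(200).low_pass_filter(3000)
    audio.export(path_out, format="wav")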

Challenge 4: Natural Conversation Flow

Problem: The agent sounded robotic and didn't handle interruptions well.

Solution: Added conversation context tracking, filler words ("hmm", "let me check"), and interrupt detection: when the caller starts speaking, TTS playback stops immediately.
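
A rough sketch of the barge-in idea, assuming speak() can run as a cancellable task and a hypothetical caller_is_speaking() helper (for example, the VAD energy check above applied to the inbound audio):

import asyncio

async def speak_interruptible(agent, text, poll_interval=0.1):
    """Play TTS, but cancel playback as soon as the caller starts talking."""
    playback = asyncio.create_task(agent.speak(text))
    while not playback.done():
        if await caller_is_speaking(agent):   # hypothetical VAD check on inbound audio
            playback.cancel()                 # stop TTS immediately
            break
        await asyncio.sleep(poll_interval)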

📊 Real-World Results

  • 73% of calls resolved without a human agent
  • 2.7s average response time
  • 95% transcription accuracy
  • 4.2/5 customer satisfaction

Cost Comparison:

| Approach | Cost per 1,000 Calls | Notes |
|---|---|---|
| Human Agents | $1,500 - $3,000 | Variable quality, limited hours |
| Cloud APIs (Deepgram + GPT-4) | $150 - $300 | Easy setup, recurring costs |
| Local AI (Our Setup) | $15 - $30 | Hardware upfront, minimal ongoing |

🚀 Advanced Features

1. Multi-Language Support

Whisper automatically detects the caller's language; configure TTS voices for Spanish, French, and so on.
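
A small sketch of the idea: Whisper's transcribe() result includes a detected language field, which can select the TTS voice. The non-English voice names below are illustrative Piper models:

PIPER_VOICES = {
    "en": "en_US-lessac-medium",
    "es": "es_ES-sharvard-medium",
    "fr": "fr_FR-siwis-medium",
}

result = whisper_model.transcribe(audio_file)
voice = PIPER_VOICES.get(result["language"], "en_US-lessac-medium")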

2. Sentiment Analysis

Detect frustrated callers and auto-escalate to human agents.
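
One way to do this is to reuse the same ollama endpoint as a lightweight classifier; the prompt and keyword check below are illustrative:

import requests

def caller_is_frustrated(transcript_so_far):
    """Ask the LLM to classify sentiment; escalate on 'frustrated' or 'angry'."""
    prompt = (
        "Classify the customer's sentiment in this call transcript as one of: "
        "calm, neutral, frustrated, angry. Reply with a single word.\n\n"
        f"{transcript_so_far}"
    )
    resp = requests.post('http://localhost:11434/api/generate', json={
        'model': 'llama3.1:70b', 'prompt': prompt, 'stream': False
    })
    return resp.json()['response'].strip().lower() in ('frustrated', 'angry')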

3. Call Summarization

After the call ends, the LLM generates a summary for the CRM:

summary = llm.generate(f"""Summarize this call:

Transcript: {full_transcript}

Summary format:
- Issue:
- Resolution:
- Next steps:
- Sentiment:
""")

4. Dynamic Knowledge Updates

Real-time RAG updates when company docs change; no retraining needed.
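
A minimal sketch, assuming the Chroma store and text splitter from Step 3; load_updated_docs() is a placeholder for whatever job syncs your company docs:

new_docs = load_updated_docs()              # placeholder doc-sync hook
new_chunks = text_splitter.split_documents(new_docs)
vectorstore.add_documents(new_chunks)       # available to the very next retrieval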

🎯 Conclusion

Building AI voice agents isn't just possible; it's practical and cost-effective today. The combination of Asterisk's battle-tested telephony with modern LLMs creates a system that can handle real customer interactions with impressive accuracy.

Key Takeaways:

  • βœ… Local models (Whisper + Llama) can match cloud APIs at 1/10th the cost
  • βœ… Response time under 3 seconds is achievable with optimization
  • βœ… RAG prevents hallucinations and keeps responses accurate
  • βœ… 70%+ automation rate frees human agents for complex issues
  • βœ… System scales horizontallyβ€”add more GPU servers as needed

Next Steps in This Series:

  • πŸ“ Smart IVR with Whisper + GPT (coming next)
  • πŸ“ Streaming calls into RAG pipelines for insights
  • πŸ“ Real-time sentiment analysis with WebSockets

💬 Building your own AI voice agent? I'm happy to discuss architecture choices, help with model selection, or debug integration issues. Feel free to reach out!
