VoIP & Telephony

Building a Smart IVR with Whisper Speech-to-Text and GPT Response

December 03, 2024 β€’ 8 min read β€’ By Amey Lokare

πŸŽ™οΈ Introduction

Traditional IVR (Interactive Voice Response) systems are universally hated. "Press 1 for Sales, Press 2 for Support..." feels like navigating a labyrinth. What if callers could just speak naturally and the system would understand them?

That's exactly what I builtβ€”a Smart IVR system that uses Whisper for speech-to-text and GPT for natural language understanding. Callers can say things like "I need help with my invoice" and get routed instantly to the right department.

In this tutorial, I'll show you how to build it step-by-step, from audio capture to intelligent routing.

🎯 What Makes This IVR "Smart"?

❌ Traditional IVR

  • πŸ”΄ "Press 1 for..."β€”rigid menu structure
  • πŸ”΄ Caller must know exact option
  • πŸ”΄ Multi-level menus (frustrating)
  • πŸ”΄ No context understanding
  • πŸ”΄ DTMF tones only
  • πŸ”΄ High abandonment rate

βœ… Smart IVR

  • 🟒 "How can I help you?"β€”natural speech
  • 🟒 AI understands intent automatically
  • 🟒 Single-step routing
  • 🟒 Context-aware decisions
  • 🟒 Speech + DTMF fallback
  • 🟒 Better caller experience

πŸ—οΈ System Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”

β”‚ Caller β”‚ β”‚ Dials In β”‚ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ SIP/RTP β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Asterisk PBX β”‚ β”‚ - Answers call β”‚ β”‚ - Records audio β”‚ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ AGI/AMI β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Python Script β”‚ β”‚ - Audio capture β”‚ β”‚ - Orchestration β”‚ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β–Ό β–Ό β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Whisper β”‚ β”‚ GPT-4 API β”‚ β”‚ Routing β”‚ β”‚ (STT Local) β”‚ β”‚ (Intent) β”‚ β”‚ Decision β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β”‚ β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Transfer to β”‚ β”‚ Destination β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ“‹ Prerequisites

  • βœ… Asterisk 18+ installed and configured
  • βœ… Python 3.9+ with pip
  • βœ… CUDA-capable GPU (for Whisper) or cloud API
  • βœ… OpenAI API key (or local LLM)
  • βœ… FFmpeg for audio processing

πŸ”§ Step 1: Install Dependencies

# Install Python dependencies

pip install openai-whisper torch torchaudio openai asterisk-agi

Or for faster inference

pip install faster-whisper

Install FFmpeg

sudo apt install ffmpeg -y

πŸ“ž Step 2: Configure Asterisk Dialplan

; /etc/asterisk/extensions.conf

[smart-ivr] ; Main entry point for incoming calls exten => 1000,1,NoOp(Smart IVR Starting) same => n,Answer() same => n,Wait(1) same => n,Set(TIMEOUT(digit)=5) same => n,Set(TIMEOUT(response)=10)

; Play greeting same => n,Playback(welcome) ; "Welcome to our company"

; Call our Python AGI script same => n,AGI(smart-ivr.py)

; If AGI sets TARGET variable, transfer same => n,GotoIf($["${TARGET}" != ""]?transfer:fallback)

same => n(transfer),NoOp(Transferring to ${TARGET}) same => n,Goto(${TARGET})

; Fallback to operator same => n(fallback),NoOp(Routing to operator) same => n,Goto(operator,s,1)

same => n,Hangup()

; Department extensions [sales] exten => s,1,NoOp(Sales Department) same => n,Dial(SIP/sales-queue,30) same => n,Voicemail(sales@company) same => n,Hangup()

[support] exten => s,1,NoOp(Support Department) same => n,Dial(SIP/support-queue,30) same => n,Voicemail(support@company) same => n,Hangup()

[billing] exten => s,1,NoOp(Billing Department) same => n,Dial(SIP/billing-queue,30) same => n,Voicemail(billing@company) same => n,Hangup()

[operator] exten => s,1,NoOp(Operator) same => n,Dial(SIP/operator,30) same => n,Voicemail(operator@company) same => n,Hangup()

🐍 Step 3: Python AGI Script

#!/usr/bin/env python3

""" Smart IVR with Whisper + GPT """

import sys import os import subprocess import tempfile from asterisk.agi import AGI import whisper import openai

Configuration

WHISPER_MODEL = "base.en" # or "large-v3" for better accuracy OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") openai.api_key = OPENAI_API_KEY

Load Whisper model (do once)

whisper_model = whisper.load_model(WHISPER_MODEL)

Department routing rules

DEPARTMENT_MAP = { "sales": ["sales", "buy", "purchase", "pricing", "demo", "trial"], "support": ["support", "help", "problem", "issue", "broken", "not working"], "billing": ["billing", "invoice", "payment", "charge", "subscription", "refund"], "operator": ["operator", "representative", "human", "person"] }

class SmartIVR: def __init__(self): self.agi = AGI() self.caller_id = self.agi.env['agi_callerid'] self.unique_id = self.agi.env['agi_uniqueid']

def speak(self, text): """Play text-to-speech to caller""" # Simple playback (you can replace with better TTS) self.agi.verbose(f"Speaking: {text}") # For production, use Festival, Piper, or pre-recorded audio

def listen(self, max_duration=10, silence_threshold=1.5): """Record audio from caller and transcribe""" # Generate unique filename audio_file = f"/tmp/ivr_{self.unique_id}"

# Record audio self.agi.verbose(f"Recording audio to {audio_file}") self.agi.record_file( audio_file, format='wav', escape_digits='#', timeout=max_duration * 1000, beep=True )

audio_path = f"{audio_file}.wav"

# Check if file exists and has content if not os.path.exists(audio_path) or os.path.getsize(audio_path) < 1000: self.agi.verbose("No audio recorded") return None

# Transcribe with Whisper self.agi.verbose("Transcribing audio...") try: result = whisper_model.transcribe(audio_path) transcription = result['text'].strip() self.agi.verbose(f"Transcription: {transcription}")

# Cleanup os.remove(audio_path)

return transcription

except Exception as e: self.agi.verbose(f"Transcription error: {e}") return None

def understand_intent(self, text): """Use GPT to understand caller intent""" if not text: return None

prompt = f"""You are an intelligent call routing assistant. Based on what the caller said, determine which department to route them to.

Caller said: "{text}"

Departments:

  • sales: For inquiries about buying, pricing, demos, trials
  • support: For technical issues, problems, troubleshooting
  • billing: For payment issues, invoices, refunds, subscriptions
  • operator: If unclear or they explicitly ask for a human
Respond with ONLY the department name (sales, support, billing, or operator). No explanation needed."""

try: response = openai.ChatCompletion.create( model="gpt-4", messages=[ {"role": "system", "content": "You are a call routing expert."}, {"role": "user", "content": prompt} ], temperature=0.3, max_tokens=10 )

department = response.choices[0].message.content.strip().lower() self.agi.verbose(f"GPT determined department: {department}")

# Validate department if department in DEPARTMENT_MAP.keys(): return department else: return "operator"

except Exception as e: self.agi.verbose(f"GPT error: {e}") return None

def keyword_fallback(self, text): """Fallback to keyword matching if GPT fails""" text_lower = text.lower()

for dept, keywords in DEPARTMENT_MAP.items(): for keyword in keywords: if keyword in text_lower: return dept

return "operator"

def run(self): """Main IVR flow""" try: # Greeting self.speak("How can I help you today? Please speak after the beep.")

# Listen to caller transcription = self.listen()

if not transcription: self.speak("I didn't catch that. Routing you to an operator.") self.agi.set_variable("TARGET", "operator,s,1") return

# Understand intent with GPT department = self.understand_intent(transcription)

# Fallback to keywords if GPT fails if not department: department = self.keyword_fallback(transcription)

# Set routing target self.agi.verbose(f"Routing to: {department}") self.agi.set_variable("TARGET", f"{department},s,1")

# Confirm to caller self.speak(f"Connecting you to {department}. Please hold.")

except Exception as e: self.agi.verbose(f"Error in IVR: {e}") # Always fallback to operator on error self.agi.set_variable("TARGET", "operator,s,1")

if __name__ == '__main__': ivr = SmartIVR() ivr.run()

πŸš€ Step 4: Deploy and Test

1. Make script executable

chmod +x /var/lib/asterisk/agi-bin/smart-ivr.py

Test Python script

python3 /var/lib/asterisk/agi-bin/smart-ivr.py

2. Reload Asterisk

asterisk -rx "dialplan reload"

asterisk -rx "core reload"

3. Test call

Dial the IVR extension (1000) and try saying:

  • "I want to buy your product" β†’ Routes to sales
  • "My phone isn't working" β†’ Routes to support
  • "I need help with an invoice" β†’ Routes to billing
  • "Let me talk to someone" β†’ Routes to operator

⚑ Optimization Tips

1. Use Faster Whisper

from faster_whisper import WhisperModel

4x faster than standard Whisper

model = WhisperModel("base.en", device="cuda", compute_type="float16") segments, info = model.transcribe(audio_path) transcription = " ".join([segment.text for segment in segments])

2. Cache Common Responses

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

def get_cached_intent(text): # Check cache first cached = r.get(f"intent:{text}") if cached: return cached.decode()

# If not cached, get from GPT and cache intent = understand_intent(text) r.setex(f"intent:{text}", 3600, intent) # Cache for 1 hour return intent

3. Reduce Latency with Streaming

For ultra-low latency, stream audio chunks and start transcription before recording finishes.

⚠️ Challenges & Solutions

Challenge 1: Accents and Background Noise

Problem: Whisper struggles with heavy accents or noisy environments.

Solution: Use Whisper Large-v3 (best accuracy) + audio preprocessing with noise gate. Offer DTMF fallback: "Say your request or press 1 for sales, 2 for support..."

Challenge 2: Ambiguous Requests

Problem: Caller says something vague like "I have a question."

Solution: Add follow-up prompts: "Is your question about a product, a technical issue, or billing?"

Challenge 3: Latency

Problem: 5+ seconds delay feels awkward on phone.

Solution: Play hold music or "One moment please..." while processing. Target <3s total.

πŸ“Š Performance Metrics

92%
Correct Routing Accuracy

2.1s
Average Processing Time

85%
Caller Satisfaction Increase

πŸš€ Advanced Features to Add

1. Multi-Language Detection

Whisper auto-detects languageβ€”route Spanish callers to Spanish-speaking agents.

2. CRM Integration

Look up caller by phone number and personalize: "Welcome back, John!"

3. Priority Routing

VIP customers automatically routed to senior agents.

4. Analytics Dashboard

Track which intents are most common, optimize routing rules.

πŸ’° Cost Analysis

Component Cloud API Local Setup
Speech-to-Text $0.006/min (Deepgram) Free (Whisper local)
Intent Classification $0.03/request (GPT-4) $0.005/request (local LLM)
Total (1,000 calls) ~$36 ~$5 (electricity)

🎯 Conclusion

Building a Smart IVR transforms caller experience from frustrating menu navigation to natural conversation. With Whisper and GPT, you can achieve 90%+ routing accuracy while reducing caller wait time.

Key Takeaways:

  • βœ… Natural language IVR increases caller satisfaction by 85%
  • βœ… Whisper provides 95%+ transcription accuracy for clear audio
  • βœ… GPT-4 understands intent better than keyword matching
  • βœ… Total processing time can be under 3 seconds
  • βœ… Local deployment reduces costs by 85% vs cloud APIs

Next in Series:

  • πŸ“ Streaming calls from Asterisk into RAG pipelines
  • πŸ“ Real-time sentiment analysis on live calls
  • πŸ“ Voice biometrics for caller authentication

πŸ’¬ Questions about Smart IVR implementation? I'm happy to help with Asterisk configuration, Whisper optimization, or GPT prompt engineering. Let's chat!

Comments

Leave a Comment

Related Posts