Real-Time Speech-to-Text with Whisper and WebRTC: Building Voice Interfaces
Building real-time voice interfaces requires low-latency speech recognition and seamless audio streaming. I've integrated OpenAI Whisper with WebRTC to create production-ready voice transcription systems that work in browsers without plugins.
🎯 Why Whisper + WebRTC?
OpenAI Whisper provides state-of-the-art speech recognition:
- High accuracy across languages
- Robust to background noise
- Handles accents and dialects well
WebRTC complements it on the capture side:
- Direct browser audio capture (no plugins)
- Low-latency streaming
- Secure peer-to-peer connections
📋 Architecture Overview
```
Browser (WebRTC) → Audio Stream → Backend (Python) → Whisper → Transcription → WebSocket → Frontend
```
Components
1. Frontend: WebRTC captures microphone audio
2. Backend: Receives audio chunks via WebSocket
3. Whisper Processing: Converts audio to text
4. Real-time Updates: Streams transcriptions back to client
💻 Implementation
1. Frontend: WebRTC Audio Capture
```javascript
// Connect to the transcription backend
const websocket = new WebSocket('ws://localhost:8000/ws/transcribe');

// Display transcriptions as they stream back
websocket.onmessage = (event) => {
  const { text } = JSON.parse(event.data);
  console.log('Transcript:', text);
};

// Capture microphone audio
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const mediaRecorder = new MediaRecorder(stream, {
  mimeType: 'audio/webm;codecs=opus'
});

mediaRecorder.ondataavailable = (event) => {
  if (event.data.size > 0) {
    // Send each chunk to the backend via WebSocket.
    // Note: only the first chunk carries the WebM header, so a
    // production setup reassembles chunks server-side.
    websocket.send(event.data);
  }
};

mediaRecorder.start(1000); // Emit a chunk every second
```
2. Backend: WebSocket Server (Python)
```python
from fastapi import FastAPI, WebSocket
import whisper
import tempfile
import os

app = FastAPI()
model = whisper.load_model("base")  # or "small", "medium", "large"

@app.websocket("/ws/transcribe")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            # Receive audio chunk
            audio_data = await websocket.receive_bytes()

            # Whisper decodes audio via ffmpeg, so it needs a file path,
            # not a file-like object
            with tempfile.NamedTemporaryFile(suffix=".webm", delete=False) as f:
                f.write(audio_data)
                audio_path = f.name

            try:
                # Transcribe
                result = model.transcribe(audio_path, language="en")
                text = result["text"].strip()
            finally:
                os.unlink(audio_path)

            # Send transcription back; no_speech_prob is reported per
            # segment, not on the top-level result (lower = more likely speech)
            if text:
                segments = result.get("segments", [])
                await websocket.send_json({
                    "text": text,
                    "no_speech_prob": segments[0]["no_speech_prob"] if segments else 0.0,
                })
    except Exception as e:
        print(f"Error: {e}")
```
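To sanity-check the endpoint without a browser, a small script using the `websockets` package can stream a file and print whatever comes back. This is a minimal sketch; the file path is a placeholder for any short recording you have:

```python
import asyncio
import websockets

async def main():
    # Placeholder test file; any short webm/opus recording will do
    with open("test.webm", "rb") as f:
        audio = f.read()

    async with websockets.connect("ws://localhost:8000/ws/transcribe") as ws:
        await ws.send(audio)
        reply = await ws.recv()
        print("Server said:", reply)

asyncio.run(main())
```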
3. Laravel Integration
For Laravel projects, I use a hybrid approach:
```php
// routes/web.php
Route::post('/api/transcribe', [TranscriptionController::class, 'transcribe']);

// app/Http/Controllers/TranscriptionController.php
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Http;

public function transcribe(Request $request)
{
    $audioFile = $request->file('audio');

    // Forward the upload to the Python Whisper service
    $response = Http::attach('audio', $audioFile->getContent(), 'audio.webm')
        ->post('http://whisper-service:8000/transcribe');

    return response()->json([
        'text' => $response->json('text'),
        'language' => $response->json('language'),
    ]);
}
```
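The post doesn't show the HTTP side of the Whisper service this controller talks to. A minimal FastAPI sketch of what it might look like, with the route name and response shape assumed to match the controller above:

```python
from fastapi import FastAPI, File, UploadFile
import tempfile
import os
import whisper

app = FastAPI()
model = whisper.load_model("base")

@app.post("/transcribe")
async def transcribe(audio: UploadFile = File(...)):
    # Persist the upload so Whisper's ffmpeg loader can read it
    data = await audio.read()
    with tempfile.NamedTemporaryFile(suffix=".webm", delete=False) as f:
        f.write(data)
        path = f.name
    try:
        result = model.transcribe(path)
    finally:
        os.unlink(path)
    return {"text": result["text"].strip(), "language": result["language"]}
```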
⚡ Performance Optimization
1. Chunking Strategy
Send audio in small chunks (1-2 seconds) for lower latency:
```javascript
// Send chunks every 1 second
mediaRecorder.start(1000);
```
2. Model Selection
Choose the right Whisper model for your latency and accuracy needs (a quick benchmark sketch follows the list):
- tiny: Fastest, lower accuracy (39M params)
- base: Good balance (74M params) ← Recommended
- small: Better accuracy (244M params)
- medium: High accuracy (769M params)
- large: Best accuracy, slowest (1550M params)
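If you'd rather measure the trade-off on your own hardware than trust the list, a rough benchmark is easy to write. A minimal sketch; `sample.wav` is a hypothetical test clip you'd supply yourself:

```python
import time
import whisper

# Hypothetical test clip; substitute any short recording you have
AUDIO = "sample.wav"

for name in ["tiny", "base", "small"]:
    model = whisper.load_model(name)
    start = time.perf_counter()
    result = model.transcribe(AUDIO)
    elapsed = time.perf_counter() - start
    print(f"{name:>5}: {elapsed:5.1f}s  {result['text'][:60]!r}")
```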
3. Caching
Cache common phrases and commands:
```python
import tempfile
from functools import lru_cache

@lru_cache(maxsize=1000)
def transcribe_cached(audio_bytes: bytes) -> str:
    # bytes are hashable, so identical chunks are only transcribed once
    with tempfile.NamedTemporaryFile(suffix=".webm") as f:
        f.write(audio_bytes)
        f.flush()
        return model.transcribe(f.name)["text"]
```
4. GPU Acceleration
Use GPU for faster processing:
```python
import torch
import whisper

# Check if CUDA is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device)

# fp16 only helps on GPU; disable it on CPU to avoid a warning
result = model.transcribe("audio.webm", fp16=(device == "cuda"))
```
🎤 Use Cases I've Built
1. Real-Time Call Transcription
Transcribe VoIP calls in real-time for:
- Customer support logs
- Compliance recording
- Live captions
2. Voice-Controlled Dashboards
Voice commands for the following (a minimal routing sketch appears after the list):
- System controls
- Data queries
- Navigation
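Mapping transcripts to actions doesn't need NLP to get started; keyword matching goes a long way. A minimal sketch, assuming a hypothetical `COMMANDS` table of trigger phrases mapped to handlers:

```python
# Hypothetical table mapping spoken phrases to dashboard actions
COMMANDS = {
    "show sales": lambda: print("Loading sales dashboard..."),
    "refresh data": lambda: print("Refreshing data..."),
    "go home": lambda: print("Navigating home..."),
}

def route_command(transcript: str) -> bool:
    """Run the first action whose trigger phrase appears in the transcript."""
    text = transcript.lower().strip()
    for phrase, action in COMMANDS.items():
        if phrase in text:
            action()
            return True
    return False  # No matching command; ignore or ask the user to repeat
```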
3. Accessibility Features
- Live captions for video calls
- Voice-to-text for forms
- Hands-free navigation
🔧 Advanced Features
VAD (Voice Activity Detection)
Only process audio when speech is detected:
```python
import webrtcvad

vad = webrtcvad.Vad(2)  # Aggressiveness 0-3 (3 = most aggressive filtering)

def is_speech(audio_chunk):
    # Expects 16-bit mono PCM frames of exactly 10, 20, or 30 ms
    return vad.is_speech(audio_chunk, sample_rate=16000)
```
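webrtcvad is strict about frame sizes, so in practice you slice the incoming PCM buffer into fixed-length frames before classifying. A small sketch, assuming 16 kHz 16-bit mono PCM and the `vad` instance above:

```python
SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per 16-bit sample

def speech_frames(pcm: bytes):
    """Yield only the 30 ms frames that VAD classifies as speech."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[i:i + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame
```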
Language Detection
Auto-detect language before transcription:
```python
# Omitting the language triggers auto-detection
result = model.transcribe(audio_path, language=None)
detected_lang = result["language"]
```
Punctuation & Formatting
Post-process transcriptions for better readability:
```python
import re

def format_transcription(text):
    # Remove stray spaces before punctuation
    text = re.sub(r'\s+([.!?])', r'\1', text)
    # Uppercase the first letter of each sentence without
    # lowercasing the rest (str.capitalize would mangle acronyms)
    text = '. '.join(s[:1].upper() + s[1:] for s in text.split('. '))
    return text
```
🚀 Production Considerations
1. Error Handling
```python
import logging

logger = logging.getLogger(__name__)

def safe_transcribe(audio_path):
    try:
        return model.transcribe(audio_path)
    except Exception as e:
        # Fall back to a simpler model or surface the error
        logger.error(f"Transcription failed: {e}")
        return {"text": "", "error": str(e)}
```
2. Rate Limiting
Prevent abuse:
```python
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/transcribe")
@limiter.limit("10/minute")
async def transcribe(request: Request):
    ...
```
3. Monitoring
Track the following (a latency-logging sketch follows the list):
- Transcription latency
- Accuracy metrics
- Error rates
- Resource usage
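For latency in particular, a thin wrapper around the transcribe call is often enough to get started. A minimal sketch using only the standard library:

```python
import logging
import time

logger = logging.getLogger("transcription")

def transcribe_with_metrics(audio_path: str) -> dict:
    """Transcribe and log how long it took and how much text came back."""
    start = time.perf_counter()
    try:
        result = model.transcribe(audio_path)
    except Exception:
        logger.exception("Transcription failed after %.2fs", time.perf_counter() - start)
        raise
    latency = time.perf_counter() - start
    logger.info("Transcribed %d chars in %.2fs", len(result["text"]), latency)
    return result
```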
💡 Real-World Example
I built a real-time transcription system for VoIP calls:
1. Audio Capture: WebRTC captures call audio
2. Streaming: Audio chunks sent via WebSocket every 1 second
3. Processing: Whisper transcribes in real-time
4. Display: Live transcript appears in dashboard
5. Storage: Transcripts saved to database for compliance
Result: Sub-2-second latency, 95%+ accuracy, handles multiple languages.
🎓 Key Takeaways
- WebRTC enables browser-based audio capture
- Whisper provides excellent accuracy out of the box
- Chunking reduces latency significantly
- GPU acceleration speeds up processing
- Caching reduces redundant processing
Conclusion
Whisper + WebRTC is a powerful combination for building voice interfaces. The key is balancing latency, accuracy, and resource usage based on your specific needs. Start with the base model and optimize from there.