Real-Time Speech-to-Text with Whisper and WebRTC: Building Voice Interfaces
Building real-time voice interfaces requires low-latency speech recognition and seamless audio streaming. I've integrated OpenAI Whisper with WebRTC to create production-ready voice transcription systems that work in browsers without plugins.
🎯 Why Whisper + WebRTC?
OpenAI Whisper provides state-of-the-art speech recognition:
- High accuracy across languages
- Robust to background noise
- Handles accents and dialects well
WebRTC complements it on the capture side:
- Direct browser audio capture (no plugins)
- Low-latency streaming
- Secure peer-to-peer connections
📋 Architecture Overview
```
Browser (WebRTC) → Audio Stream → Backend (Python) → Whisper → Transcription → WebSocket → Frontend
```
Components
1. Frontend: WebRTC captures microphone audio
2. Backend: Receives audio chunks via WebSocket
3. Whisper Processing: Converts audio to text
4. Real-time Updates: Streams transcriptions back to client
💻 Implementation
1. Frontend: WebRTC Audio Capture
```javascript
// Connect to the transcription backend
const websocket = new WebSocket('ws://localhost:8000/ws/transcribe');

// Display transcriptions as they stream back
websocket.onmessage = (event) => {
  const { text } = JSON.parse(event.data);
  console.log('Transcript:', text);
};

// Capture microphone audio
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const mediaRecorder = new MediaRecorder(stream, {
  mimeType: 'audio/webm;codecs=opus'
});

mediaRecorder.ondataavailable = (event) => {
  if (event.data.size > 0) {
    // Send each chunk to the backend via WebSocket.
    // Note: only the first chunk carries the WebM header, so a
    // production setup reassembles chunks server-side.
    websocket.send(event.data);
  }
};

mediaRecorder.start(1000); // Emit a chunk every second
```
2. Backend: WebSocket Server (Python)
```python
from fastapi import FastAPI, WebSocket
import whisper
import tempfile
import os

app = FastAPI()
model = whisper.load_model("base")  # or "small", "medium", "large"

@app.websocket("/ws/transcribe")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            # Receive audio chunk
            audio_data = await websocket.receive_bytes()

            # Whisper decodes audio via ffmpeg, so it needs a file path,
            # not a file-like object
            with tempfile.NamedTemporaryFile(suffix=".webm", delete=False) as f:
                f.write(audio_data)
                audio_path = f.name

            try:
                # Transcribe
                result = model.transcribe(audio_path, language="en")
                text = result["text"].strip()
            finally:
                os.unlink(audio_path)

            # Send transcription back; no_speech_prob is reported per
            # segment, not on the top-level result (lower = more likely speech)
            if text:
                segments = result.get("segments", [])
                await websocket.send_json({
                    "text": text,
                    "no_speech_prob": segments[0]["no_speech_prob"] if segments else 0.0,
                })
    except Exception as e:
        print(f"Error: {e}")
```
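To sanity-check the endpoint without a browser, a small script using the `websockets` package can stream a file and print whatever comes back. This is a minimal sketch; the file path is a placeholder for any short recording you have:

```python
import asyncio
import websockets

async def main():
    # Placeholder test file; any short webm/opus recording will do
    with open("test.webm", "rb") as f:
        audio = f.read()

    async with websockets.connect("ws://localhost:8000/ws/transcribe") as ws:
        await ws.send(audio)
        reply = await ws.recv()
        print("Server said:", reply)

asyncio.run(main())
```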
3. Laravel Integration
For Laravel projects, I use a hybrid approach:
```php
// routes/web.php
Route::post('/api/transcribe', [TranscriptionController::class, 'transcribe']);

// app/Http/Controllers/TranscriptionController.php
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Http;

public function transcribe(Request $request)
{
    $audioFile = $request->file('audio');

    // Forward the upload to the Python Whisper service
    $response = Http::attach('audio', $audioFile->getContent(), 'audio.webm')
        ->post('http://whisper-service:8000/transcribe');

    return response()->json([
        'text' => $response->json('text'),
        'language' => $response->json('language'),
    ]);
}
```
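The post doesn't show the HTTP side of the Whisper service this controller talks to. A minimal FastAPI sketch of what it might look like, with the route name and response shape assumed to match the controller above:

```python
from fastapi import FastAPI, File, UploadFile
import tempfile
import os
import whisper

app = FastAPI()
model = whisper.load_model("base")

@app.post("/transcribe")
async def transcribe(audio: UploadFile = File(...)):
    # Persist the upload so Whisper's ffmpeg loader can read it
    data = await audio.read()
    with tempfile.NamedTemporaryFile(suffix=".webm", delete=False) as f:
        f.write(data)
        path = f.name
    try:
        result = model.transcribe(path)
    finally:
        os.unlink(path)
    return {"text": result["text"].strip(), "language": result["language"]}
```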
⚡ Performance Optimization
1. Chunking Strategy
Send audio in small chunks (1-2 seconds) for lower latency:
```javascript
// Send chunks every 1 second
mediaRecorder.start(1000);
```
2. Model Selection
Choose the right Whisper model for your latency and accuracy needs (a quick benchmark sketch follows the list):
- tiny: Fastest, lower accuracy (39M params)
- base: Good balance (74M params) ← Recommended
- small: Better accuracy (244M params)
- medium: High accuracy (769M params)
- large: Best accuracy, slowest (1550M params)
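If you'd rather measure the trade-off on your own hardware than trust the list, a rough benchmark is easy to write. A minimal sketch; `sample.wav` is a hypothetical test clip you'd supply yourself:

```python
import time
import whisper

# Hypothetical test clip; substitute any short recording you have
AUDIO = "sample.wav"

for name in ["tiny", "base", "small"]:
    model = whisper.load_model(name)
    start = time.perf_counter()
    result = model.transcribe(AUDIO)
    elapsed = time.perf_counter() - start
    print(f"{name:>5}: {elapsed:5.1f}s  {result['text'][:60]!r}")
```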
3. Caching
Cache common phrases and commands:
```python
import tempfile
from functools import lru_cache

@lru_cache(maxsize=1000)
def transcribe_cached(audio_bytes: bytes) -> str:
    # bytes are hashable, so identical chunks are only transcribed once
    with tempfile.NamedTemporaryFile(suffix=".webm") as f:
        f.write(audio_bytes)
        f.flush()
        return model.transcribe(f.name)["text"]
```
4. GPU Acceleration
Use GPU for faster processing:
```python
import torch
import whisper

# Check if CUDA is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device)

# fp16 only helps on GPU; disable it on CPU to avoid a warning
result = model.transcribe("audio.webm", fp16=(device == "cuda"))
```
🎤 Use Cases I've Built
1. Real-Time Call Transcription
Transcribe VoIP calls in real-time for:
- Customer support logs
- Compliance recording
- Live captions
2. Voice-Controlled Dashboards
Voice commands for the following (a minimal routing sketch appears after the list):
- System controls
- Data queries
- Navigation
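Mapping transcripts to actions doesn't need NLP to get started; keyword matching goes a long way. A minimal sketch, assuming a hypothetical `COMMANDS` table of trigger phrases mapped to handlers:

```python
# Hypothetical table mapping spoken phrases to dashboard actions
COMMANDS = {
    "show sales": lambda: print("Loading sales dashboard..."),
    "refresh data": lambda: print("Refreshing data..."),
    "go home": lambda: print("Navigating home..."),
}

def route_command(transcript: str) -> bool:
    """Run the first action whose trigger phrase appears in the transcript."""
    text = transcript.lower().strip()
    for phrase, action in COMMANDS.items():
        if phrase in text:
            action()
            return True
    return False  # No matching command; ignore or ask the user to repeat
```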
3. Accessibility Features
- Live captions for video calls
- Voice-to-text for forms
- Hands-free navigation
🔧 Advanced Features
VAD (Voice Activity Detection)
Only process audio when speech is detected:
```python
import webrtcvad

vad = webrtcvad.Vad(2)  # Aggressiveness 0-3 (3 = most aggressive filtering)

def is_speech(audio_chunk):
    # Expects 16-bit mono PCM frames of exactly 10, 20, or 30 ms
    return vad.is_speech(audio_chunk, sample_rate=16000)
```
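webrtcvad is strict about frame sizes, so in practice you slice the incoming PCM buffer into fixed-length frames before classifying. A small sketch, assuming 16 kHz 16-bit mono PCM and the `vad` instance above:

```python
SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per 16-bit sample

def speech_frames(pcm: bytes):
    """Yield only the 30 ms frames that VAD classifies as speech."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[i:i + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame
```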
Language Detection
Auto-detect language before transcription:
```python
# Omitting the language triggers auto-detection
result = model.transcribe(audio_path, language=None)
detected_lang = result["language"]
```
Punctuation & Formatting
Post-process transcriptions for better readability:
```python
import re

def format_transcription(text):
    # Remove stray spaces before punctuation
    text = re.sub(r'\s+([.!?])', r'\1', text)
    # Uppercase the first letter of each sentence without
    # lowercasing the rest (str.capitalize would mangle acronyms)
    text = '. '.join(s[:1].upper() + s[1:] for s in text.split('. '))
    return text
```
🚀 Production Considerations
1. Error Handling
```python
import logging

logger = logging.getLogger(__name__)

def safe_transcribe(audio_path):
    try:
        return model.transcribe(audio_path)
    except Exception as e:
        # Fall back to a simpler model or surface the error
        logger.error(f"Transcription failed: {e}")
        return {"text": "", "error": str(e)}
```
2. Rate Limiting
Prevent abuse:
```python
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/transcribe")
@limiter.limit("10/minute")
async def transcribe(request: Request):
    ...
```
3. Monitoring
Track the following (a latency-logging sketch follows the list):
- Transcription latency
- Accuracy metrics
- Error rates
- Resource usage
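For latency in particular, a thin wrapper around the transcribe call is often enough to get started. A minimal sketch using only the standard library:

```python
import logging
import time

logger = logging.getLogger("transcription")

def transcribe_with_metrics(audio_path: str) -> dict:
    """Transcribe and log how long it took and how much text came back."""
    start = time.perf_counter()
    try:
        result = model.transcribe(audio_path)
    except Exception:
        logger.exception("Transcription failed after %.2fs", time.perf_counter() - start)
        raise
    latency = time.perf_counter() - start
    logger.info("Transcribed %d chars in %.2fs", len(result["text"]), latency)
    return result
```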
💡 Real-World Example
I built a real-time transcription system for VoIP calls:
1. Audio Capture: WebRTC captures call audio
2. Streaming: Audio chunks sent via WebSocket every 1 second
3. Processing: Whisper transcribes in real-time
4. Display: Live transcript appears in dashboard
5. Storage: Transcripts saved to database for compliance
Result: Sub-2-second latency, 95%+ accuracy, handles multiple languages.
🎓 Key Takeaways
- WebRTC enables browser-based audio capture
- Whisper provides excellent accuracy out of the box
- Chunking reduces latency significantly
- GPU acceleration speeds up processing
- Caching reduces redundant processing
Conclusion
Whisper + WebRTC is a powerful combination for building voice interfaces. The key is balancing latency, accuracy, and resource usage based on your specific needs. Start with the base model and optimize from there.