I Built a Local LLM Chat App—Here's What Actually Works
🎯 The Goal
I wanted to build a chat interface for local LLMs—something I could use without sending data to OpenAI or other cloud services. Privacy matters, and I wanted full control over my AI interactions.
Simple goal, right? Three attempts later, I finally got it working. Here's what I learned.
❌ Attempt 1: Direct API Calls (Failed)
My first approach was the obvious one: call Ollama's API directly from Laravel and wait for the full response.
```php
// First attempt - too slow
$response = Http::timeout(120)->post('http://localhost:11434/api/generate', [
    'model' => 'llama2',
    'prompt' => $userMessage,
    'stream' => false,
]);
```
Problem: The response time was terrible. 15-30 seconds per message. Users would think the app was broken.
Why it failed: The synchronous call blocked the whole HTTP request until generation finished, and Ollama's generation is slow on CPU (I didn't have a GPU at the time).
❌ Attempt 2: WebSockets with Queue Jobs (Better, But Still Failed)
I tried using Laravel queues with WebSockets for real-time streaming:
```php
// Second attempt - complex and buggy
class ProcessLLMResponse implements ShouldQueue
{
    public function __construct(public string $prompt) {}

    public function handle()
    {
        $response = Http::timeout(300)
            ->withOptions(['stream' => true])
            ->post('http://localhost:11434/api/generate', [
                'model' => 'llama2',
                'prompt' => $this->prompt,
                'stream' => true,
            ]);

        // Stream chunks via WebSocket as they arrive from Ollama,
        // reading the response through the underlying PSR-7 stream
        $body = $response->toPsrResponse()->getBody();
        while (! $body->eof()) {
            broadcast(new LLMChunkReceived($body->read(1024)));
        }
    }
}
```
Problem: Too complex. WebSocket connections dropped, queue jobs failed silently, and debugging was a nightmare.
Why it failed: Over-engineered for what should be simple. The complexity introduced more bugs than it solved.
✅ Attempt 3: Server-Sent Events (Success!)
Finally, I tried Server-Sent Events (SSE). This was the sweet spot.
```php
// routes/web.php
Route::get('/api/chat/stream', [ChatController::class, 'stream']);

// ChatController.php
public function stream(Request $request)
{
    return response()->stream(function () use ($request) {
        $prompt = $request->input('message');

        // Build the JSON body with json_encode so the prompt can't break out of the command
        $payload = json_encode([
            'model'  => 'llama2',
            'prompt' => $prompt,
            'stream' => true,
        ]);

        // Passing the command as an array (PHP 7.4+) bypasses the shell entirely;
        // -N keeps curl from buffering its output
        $process = proc_open(
            ['curl', '-s', '-N', '-X', 'POST', 'http://localhost:11434/api/generate', '-d', $payload],
            [['pipe', 'r'], ['pipe', 'w'], ['pipe', 'w']],
            $pipes
        );

        while (!feof($pipes[1])) {
            $chunk = fgets($pipes[1]);
            if ($chunk) {
                $data = json_decode($chunk, true);
                echo "data: " . json_encode([
                    'text' => $data['response'] ?? '',
                    'done' => $data['done'] ?? false,
                ]) . "\n\n";

                if (ob_get_level() > 0) {
                    ob_flush();
                }
                flush();
            }
        }

        fclose($pipes[1]);
        proc_close($process);
    }, 200, [
        'Content-Type'  => 'text/event-stream',
        'Cache-Control' => 'no-cache',
        'Connection'    => 'keep-alive',
    ]);
}
```
Why this worked:
- Simple to implement
- Real-time streaming without WebSocket complexity
- Works with standard HTTP
- Easy to debug
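For what it's worth, the same loop can also be written without shelling out to curl, using Laravel's HTTP client and the underlying PSR-7 stream. I stuck with the curl version above and haven't benchmarked this variant, so treat it as a sketch rather than a drop-in replacement:

```php
// Sketch: same SSE loop, but via Laravel's HTTP client instead of curl + proc_open
$prompt = $request->input('message');

$response = Http::withOptions(['stream' => true])
    ->post('http://localhost:11434/api/generate', [
        'model'  => 'llama2',
        'prompt' => $prompt,
        'stream' => true,
    ]);

// Ollama emits one JSON object per line; buffer until a full line is available
$body   = $response->toPsrResponse()->getBody();
$buffer = '';

while (! $body->eof()) {
    $buffer .= $body->read(256);

    while (($pos = strpos($buffer, "\n")) !== false) {
        $line   = substr($buffer, 0, $pos);
        $buffer = substr($buffer, $pos + 1);

        $data = json_decode($line, true);
        if (is_array($data)) {
            echo "data: " . json_encode([
                'text' => $data['response'] ?? '',
                'done' => $data['done'] ?? false,
            ]) . "\n\n";
            flush();
        }
    }
}
```

The extra line-buffering is there because read() can hand back partial lines, and json_decode needs a complete object to work with.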
📊 Performance Numbers
Here are the actual numbers from my testing:
| Model | Response Time | Tokens/sec | Quality |
|---|---|---|---|
| Llama 2 7B | 8-12s | 12-15 | Good |
| Mistral 7B | 6-10s | 18-22 | Better |
| CodeLlama 7B | 10-15s | 10-14 | Excellent for code |
Hardware: AMD Ryzen 9 9950X3D, 64GB RAM, no GPU (CPU-only inference)
🎨 Frontend Implementation
The frontend uses EventSource to consume the SSE stream:
```js
// chat.js
const eventSource = new EventSource('/api/chat/stream?message=' + encodeURIComponent(message));

eventSource.onmessage = function (event) {
    const data = JSON.parse(event.data);

    if (data.done) {
        eventSource.close();
        return;
    }

    // Append chunk to chat
    appendToChat(data.text);
};

function appendToChat(text) {
    const chatDiv = document.getElementById('chat');
    const lastMessage = chatDiv.lastElementChild;

    if (lastMessage && lastMessage.classList.contains('streaming')) {
        lastMessage.textContent += text;
    } else {
        const newMessage = document.createElement('div');
        newMessage.classList.add('streaming');
        newMessage.textContent = text;
        chatDiv.appendChild(newMessage);
    }
}
```
⚠️ Challenges I Faced
1. Memory Issues
Ollama can consume a lot of RAM. I had to limit concurrent requests to prevent OOM errors:
```php
// Rate limiting - the named limiter is registered in a service provider's boot()
RateLimiter::for('llm-chat', function ($request) {
    return Limit::perMinute(5)->by($request->user()->id);
});
```
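One thing the snippet doesn't show is that a named limiter only kicks in once it's attached to the route through the throttle middleware. A minimal sketch, assuming the /api/chat/stream route from earlier:

```php
// routes/web.php - attach the named limiter to the streaming route
Route::get('/api/chat/stream', [ChatController::class, 'stream'])
    ->middleware('throttle:llm-chat');
```

Note this caps requests per minute rather than true concurrency, which is a simpler proxy for keeping Ollama's memory use in check.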
2. Timeout Handling
Long responses would timeout. I increased PHP's execution time:
```php
// In ChatController
set_time_limit(300); // 5 minutes
```
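One way to wire this in is at the top of the stream() callback itself, so the limit covers the whole generation. A minimal sketch, reusing the controller from earlier (keep in mind the web server or any proxy in front of PHP can enforce its own timeout on top of this):

```php
// Hypothetical placement: reset the limit inside the long-lived stream callback
return response()->stream(function () use ($request) {
    set_time_limit(300); // applies to this response only

    // ... the proc_open / fgets loop from the controller shown earlier ...
}, 200, [
    'Content-Type'  => 'text/event-stream',
    'Cache-Control' => 'no-cache',
    'Connection'    => 'keep-alive',
]);
```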
3. Error Handling
Ollama sometimes fails. I added proper error handling:
```php
try {
    // ... the proc_open / fgets loop inside the stream() callback ...
} catch (\Exception $e) {
    echo "data: " . json_encode([
        'error'   => 'LLM service unavailable',
        'message' => $e->getMessage(),
        'done'    => true, // signals the frontend to close the EventSource
    ]) . "\n\n";

    if (ob_get_level() > 0) {
        ob_flush();
    }
    flush();
}
```
✅ What Works Best
- Server-Sent Events for streaming (not WebSockets)
- Mistral 7B for best speed/quality balance
- Rate limiting to prevent resource exhaustion
- Simple architecture - don't over-engineer
- Error handling - LLMs can be flaky
🎯 Final Architecture
User → Laravel Route → SSE Stream → Ollama API → Stream Response → User
Simple, effective, and it works. No queues, no WebSockets, no complexity.
💡 Key Takeaways
- Start simple. SSE is easier than WebSockets for this use case.
- Test with real models. Tutorial examples don't show real performance.
- Handle errors gracefully. LLM services can be unreliable.
- Rate limit everything. Local LLMs are resource-intensive.
- Don't over-engineer. Simple solutions often work best.
Would I build it differently next time? Probably not. This architecture works well for my needs, and sometimes the simple solution is the right solution.