AI & Machine Learning

I Built a Local LLM Chat App—Here's What Actually Works

December 18, 2024 · 4 min read · By Amey Lokare

🎯 The Goal

I wanted to build a chat interface for local LLMs—something I could use without sending data to OpenAI or other cloud services. Privacy matters, and I wanted full control over my AI interactions.

Simple goal, right? Three attempts later, I finally got it working. Here's what I learned.

❌ Attempt 1: Direct API Calls (Failed)

My first approach was the obvious one: call Ollama's API directly from Laravel and wait for the full response.

// First attempt - a single blocking request (too slow)
// Uses Laravel's HTTP client (Illuminate\Support\Facades\Http)
$response = Http::timeout(120)->post('http://localhost:11434/api/generate', [
    'model' => 'llama2',
    'prompt' => $userMessage,
    'stream' => false, // wait for the complete response before returning anything to the user
]);

Problem: Response times were terrible: 15-30 seconds per message, with nothing on screen until the full answer arrived. Users would think the app was broken.

Why it failed: The synchronous call blocked the entire request until generation finished, and Ollama's generation is slow on CPU (I didn't have a GPU at the time).

❌ Attempt 2: WebSockets with Queue Jobs (Better, But Still Failed)

I tried using Laravel queues with WebSockets for real-time streaming:

// Second attempt - complex and buggy
class ProcessLLMResponse implements ShouldQueue
{
    public function __construct(private string $prompt) {}

    public function handle(): void
    {
        $response = Http::timeout(300)
            ->withOptions(['stream' => true])
            ->post('http://localhost:11434/api/generate', [
                'model' => 'llama2',
                'prompt' => $this->prompt,
                'stream' => true,
            ]);

        // Read the streamed body and push each chunk to the browser over WebSockets
        $body = $response->toPsrResponse()->getBody();
        while (!$body->eof()) {
            broadcast(new LLMChunkReceived($body->read(1024)));
        }
    }
}

Problem: Too complex. WebSocket connections dropped, queue jobs failed silently, and debugging was a nightmare.

Why it failed: It was over-engineered for what should have been a simple feature. The extra moving parts introduced more bugs than they solved.

✅ Attempt 3: Server-Sent Events (Success!)

Finally, I tried Server-Sent Events (SSE). This was the sweet spot.

// routes/web.php
Route::get('/api/chat/stream', [ChatController::class, 'stream']);

// ChatController.php
public function stream(Request $request)
{
    return response()->stream(function () use ($request) {
        $prompt = $request->input('message');

        // Build the JSON payload safely instead of interpolating user input into the command
        $payload = json_encode([
            'model' => 'llama2',
            'prompt' => $prompt,
            'stream' => true,
        ]);

        $process = proc_open(
            'curl -s -X POST http://localhost:11434/api/generate -d ' . escapeshellarg($payload),
            [0 => ['pipe', 'r'], 1 => ['pipe', 'w'], 2 => ['pipe', 'w']],
            $pipes
        );

        // Ollama streams one JSON object per line; relay each one as an SSE event
        while (!feof($pipes[1])) {
            $chunk = fgets($pipes[1]);
            if ($chunk) {
                $data = json_decode($chunk, true);
                echo "data: " . json_encode([
                    'text' => $data['response'] ?? '',
                    'done' => $data['done'] ?? false,
                ]) . "\n\n";
                if (ob_get_level() > 0) {
                    ob_flush();
                }
                flush();
            }
        }

        proc_close($process);
    }, 200, [
        'Content-Type' => 'text/event-stream',
        'Cache-Control' => 'no-cache',
        'Connection' => 'keep-alive',
    ]);
}
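
One caveat if the app sits behind nginx: proxy buffering can hold the SSE chunks back until a buffer fills, which makes the stream look stalled. Sending the X-Accel-Buffering header is a common fix. A small, optional tweak to the header array above (only relevant when nginx is proxying the app):

// SSE response headers with nginx proxy buffering disabled
// (pass as the third argument to response()->stream())
$headers = [
    'Content-Type'      => 'text/event-stream',
    'Cache-Control'     => 'no-cache',
    'Connection'        => 'keep-alive',
    'X-Accel-Buffering' => 'no', // ask nginx not to buffer this response
];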

Why this worked:

  • Simple to implement
  • Real-time streaming without WebSocket complexity
  • Works with standard HTTP
  • Easy to debug

📊 Performance Numbers

Here are the actual numbers from my testing:

Model          Response Time   Tokens/sec   Quality
Llama 2 7B     8-12s           12-15        Good
Mistral 7B     6-10s           18-22        Better
CodeLlama 7B   10-15s          10-14        Excellent for code

Hardware: AMD Ryzen 9 9950X3D, 64GB RAM, no GPU (CPU-only inference)
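
If you want to reproduce the tokens/sec column, Ollama's final streamed chunk from /api/generate includes generation stats (eval_count for generated tokens, eval_duration in nanoseconds), so the rate can be computed directly. A minimal sketch, assuming those fields are present in the final chunk:

// Tokens per second from the stats Ollama attaches to the final streamed chunk
function tokensPerSecond(array $finalChunk): ?float
{
    $tokens = $finalChunk['eval_count'] ?? null;    // number of generated tokens
    $nanos  = $finalChunk['eval_duration'] ?? null; // generation time in nanoseconds

    if (!$tokens || !$nanos) {
        return null; // stats missing, e.g. the request was aborted
    }

    return $tokens / ($nanos / 1e9);
}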

🎨 Frontend Implementation

The frontend uses EventSource to consume the SSE stream:

// chat.js
const eventSource = new EventSource('/api/chat/stream?message=' + encodeURIComponent(message));

eventSource.onmessage = function(event) {
    const data = JSON.parse(event.data);

    if (data.done) {
        // Mark the message as finished so the next response starts a fresh bubble
        const last = document.getElementById('chat').lastElementChild;
        if (last) {
            last.classList.remove('streaming');
        }
        eventSource.close();
        return;
    }

    // Append chunk to chat
    appendToChat(data.text);
};

function appendToChat(text) {
    const chatDiv = document.getElementById('chat');
    const lastMessage = chatDiv.lastElementChild;

    // Keep extending the in-progress message, or start a new one
    if (lastMessage && lastMessage.classList.contains('streaming')) {
        lastMessage.textContent += text;
    } else {
        const newMessage = document.createElement('div');
        newMessage.classList.add('streaming');
        newMessage.textContent = text;
        chatDiv.appendChild(newMessage);
    }
}

⚠️ Challenges I Faced

1. Memory Issues

Ollama can consume a lot of RAM. I had to limit concurrent requests to prevent OOM errors:

// Rate limiting - defined in a service provider's boot() method
RateLimiter::for('llm-chat', function ($request) {
    return Limit::perMinute(5)->by($request->user()->id);
});
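
Defining the limiter isn't enough on its own; it has to be attached to the streaming route. A minimal sketch using Laravel's throttle middleware with the limiter name from above:

// routes/web.php - apply the named limiter to the SSE endpoint
Route::get('/api/chat/stream', [ChatController::class, 'stream'])
    ->middleware('throttle:llm-chat');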

2. Timeout Handling

Long responses would time out. I increased PHP's execution time limit:

// In ChatController
set_time_limit(300); // 5 minutes
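
Related to timeouts: with a response that can stream for minutes, it's also worth stopping work when the browser disconnects, otherwise curl keeps relaying to nobody. A sketch of how the relay loop could be guarded (connection_aborted() and ignore_user_abort() are standard PHP; the exact placement is an assumption, not code from the controller above):

// At the top of the streaming callback
set_time_limit(300);      // allow up to 5 minutes of streaming
ignore_user_abort(true);  // keep running long enough to clean up after a disconnect

while (!feof($pipes[1])) {
    // Stop relaying once the client has gone away (detected after a failed flush)
    if (connection_aborted()) {
        break;
    }

    // ... read a chunk, echo the SSE event, flush (as in the controller above) ...
}

proc_close($process); // clean up the curl process either way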

3. Error Handling

Ollama sometimes fails. I added proper error handling:

try {
    // Stream response (the relay loop from the controller above)
} catch (\Exception $e) {
    echo "data: " . json_encode([
        'error' => 'LLM service unavailable',
        'message' => $e->getMessage(),
        'done' => true, // tell the frontend to close the EventSource
    ]) . "\n\n";
    if (ob_get_level() > 0) {
        ob_flush();
    }
    flush();
}

✅ What Works Best

  1. Server-Sent Events for streaming (not WebSockets)
  2. Mistral 7B for the best speed/quality balance (see the model-swap sketch after this list)
  3. Rate limiting to prevent resource exhaustion
  4. Simple architecture - don't over-engineer
  5. Error handling - LLMs can be flaky
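
Switching between those models is just a payload change. If you want to flip between them without editing the controller, the model name can live in config; a sketch, where the services.ollama.model key and OLLAMA_MODEL env variable are conventions of my own rather than anything Laravel or Ollama require:

// config/services.php
'ollama' => [
    'model' => env('OLLAMA_MODEL', 'mistral'),
],

// In the controller, when building the Ollama payload
$payload = json_encode([
    'model'  => config('services.ollama.model'),
    'prompt' => $prompt,
    'stream' => true,
]);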

🎯 Final Architecture

User → Laravel Route → SSE Stream → Ollama API → Stream Response → User

Simple, effective, and it works. No queues, no WebSockets, no complexity.

💡 Key Takeaways

  • Start simple. SSE is easier than WebSockets for this use case.
  • Test with real models. Tutorial examples don't show real performance.
  • Handle errors gracefully. LLM services can be unreliable.
  • Rate limit everything. Local LLMs are resource-intensive.
  • Don't over-engineer. Simple solutions often work best.

Would I build it differently next time? Probably not. This architecture works well for my needs, and sometimes the simple solution is the right solution.
