I Built a Local LLM Chat App—Here's What Actually Works
🎯 The Goal
I wanted to build a chat interface for local LLMs—something I could use without sending data to OpenAI or other cloud services. Privacy matters, and I wanted full control over my AI interactions.
Simple goal, right? Three attempts later, I finally got it working. Here's what I learned.
❌ Attempt 1: Direct API Calls (Failed)
My first approach was the obvious one: call Ollama's API directly from Laravel and wait for the full response.
```php
// First attempt - too slow
$response = Http::timeout(120)->post('http://localhost:11434/api/generate', [
    'model' => 'llama2',
    'prompt' => $userMessage,
    'stream' => false,
]);
```
Problem: The response time was terrible. 15-30 seconds per message. Users would think the app was broken.
Why it failed: The synchronous call blocked the whole HTTP request until generation finished, and Ollama's generation is slow on CPU (I didn't have a GPU at the time).
❌ Attempt 2: WebSockets with Queue Jobs (Better, But Still Failed)
I tried using Laravel queues with WebSockets for real-time streaming:
```php
// Second attempt - complex and buggy
class ProcessLLMResponse implements ShouldQueue
{
    public function __construct(public string $prompt) {}

    public function handle()
    {
        $response = Http::timeout(300)
            ->withOptions(['stream' => true])
            ->post('http://localhost:11434/api/generate', [
                'model' => 'llama2',
                'prompt' => $this->prompt,
                'stream' => true,
            ]);

        // Stream chunks via WebSocket as they arrive from Ollama,
        // reading the response through the underlying PSR-7 stream
        $body = $response->toPsrResponse()->getBody();
        while (! $body->eof()) {
            broadcast(new LLMChunkReceived($body->read(1024)));
        }
    }
}
```
Problem: Too complex. WebSocket connections dropped, queue jobs failed silently, and debugging was a nightmare.
Why it failed: Over-engineered for what should be simple. The complexity introduced more bugs than it solved.
✅ Attempt 3: Server-Sent Events (Success!)
Finally, I tried Server-Sent Events (SSE). This was the sweet spot.
```php
// routes/web.php
Route::get('/api/chat/stream', [ChatController::class, 'stream']);

// ChatController.php
public function stream(Request $request)
{
    return response()->stream(function () use ($request) {
        $prompt = $request->input('message');

        // Build the JSON body with json_encode so the prompt can't break out of the command
        $payload = json_encode([
            'model'  => 'llama2',
            'prompt' => $prompt,
            'stream' => true,
        ]);

        // Passing the command as an array (PHP 7.4+) bypasses the shell entirely;
        // -N keeps curl from buffering its output
        $process = proc_open(
            ['curl', '-s', '-N', '-X', 'POST', 'http://localhost:11434/api/generate', '-d', $payload],
            [['pipe', 'r'], ['pipe', 'w'], ['pipe', 'w']],
            $pipes
        );

        while (!feof($pipes[1])) {
            $chunk = fgets($pipes[1]);
            if ($chunk) {
                $data = json_decode($chunk, true);
                echo "data: " . json_encode([
                    'text' => $data['response'] ?? '',
                    'done' => $data['done'] ?? false,
                ]) . "\n\n";

                if (ob_get_level() > 0) {
                    ob_flush();
                }
                flush();
            }
        }

        fclose($pipes[1]);
        proc_close($process);
    }, 200, [
        'Content-Type'  => 'text/event-stream',
        'Cache-Control' => 'no-cache',
        'Connection'    => 'keep-alive',
    ]);
}
```
Why this worked:
- Simple to implement
- Real-time streaming without WebSocket complexity
- Works with standard HTTP
- Easy to debug
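For what it's worth, the same loop can also be written without shelling out to curl, using Laravel's HTTP client and the underlying PSR-7 stream. I stuck with the curl version above and haven't benchmarked this variant, so treat it as a sketch rather than a drop-in replacement:

```php
// Sketch: same SSE loop, but via Laravel's HTTP client instead of curl + proc_open
$prompt = $request->input('message');

$response = Http::withOptions(['stream' => true])
    ->post('http://localhost:11434/api/generate', [
        'model'  => 'llama2',
        'prompt' => $prompt,
        'stream' => true,
    ]);

// Ollama emits one JSON object per line; buffer until a full line is available
$body   = $response->toPsrResponse()->getBody();
$buffer = '';

while (! $body->eof()) {
    $buffer .= $body->read(256);

    while (($pos = strpos($buffer, "\n")) !== false) {
        $line   = substr($buffer, 0, $pos);
        $buffer = substr($buffer, $pos + 1);

        $data = json_decode($line, true);
        if (is_array($data)) {
            echo "data: " . json_encode([
                'text' => $data['response'] ?? '',
                'done' => $data['done'] ?? false,
            ]) . "\n\n";
            flush();
        }
    }
}
```

The extra line-buffering is there because read() can hand back partial lines, and json_decode needs a complete object to work with.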
📊 Performance Numbers
Here are the actual numbers from my testing:
| Model | Response Time | Tokens/sec | Quality |
|---|---|---|---|
| Llama 2 7B | 8-12s | 12-15 | Good |
| Mistral 7B | 6-10s | 18-22 | Better |
| CodeLlama 7B | 10-15s | 10-14 | Excellent for code |
Hardware: AMD Ryzen 9 9950X3D, 64GB RAM, no GPU (CPU-only inference)
🎨 Frontend Implementation
The frontend uses EventSource to consume the SSE stream:
```js
// chat.js
const eventSource = new EventSource('/api/chat/stream?message=' + encodeURIComponent(message));

eventSource.onmessage = function (event) {
    const data = JSON.parse(event.data);

    if (data.done) {
        eventSource.close();
        return;
    }

    // Append chunk to chat
    appendToChat(data.text);
};

function appendToChat(text) {
    const chatDiv = document.getElementById('chat');
    const lastMessage = chatDiv.lastElementChild;

    if (lastMessage && lastMessage.classList.contains('streaming')) {
        lastMessage.textContent += text;
    } else {
        const newMessage = document.createElement('div');
        newMessage.classList.add('streaming');
        newMessage.textContent = text;
        chatDiv.appendChild(newMessage);
    }
}
```
⚠️ Challenges I Faced
1. Memory Issues
Ollama can consume a lot of RAM. I had to limit concurrent requests to prevent OOM errors:
```php
// Rate limiting - the named limiter is registered in a service provider's boot()
RateLimiter::for('llm-chat', function ($request) {
    return Limit::perMinute(5)->by($request->user()->id);
});
```
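One thing the snippet doesn't show is that a named limiter only kicks in once it's attached to the route through the throttle middleware. A minimal sketch, assuming the /api/chat/stream route from earlier:

```php
// routes/web.php - attach the named limiter to the streaming route
Route::get('/api/chat/stream', [ChatController::class, 'stream'])
    ->middleware('throttle:llm-chat');
```

Note this caps requests per minute rather than true concurrency, which is a simpler proxy for keeping Ollama's memory use in check.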
2. Timeout Handling
Long responses would timeout. I increased PHP's execution time:
```php
// In ChatController
set_time_limit(300); // 5 minutes
```
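One way to wire this in is at the top of the stream() callback itself, so the limit covers the whole generation. A minimal sketch, reusing the controller from earlier (keep in mind the web server or any proxy in front of PHP can enforce its own timeout on top of this):

```php
// Hypothetical placement: reset the limit inside the long-lived stream callback
return response()->stream(function () use ($request) {
    set_time_limit(300); // applies to this response only

    // ... the proc_open / fgets loop from the controller shown earlier ...
}, 200, [
    'Content-Type'  => 'text/event-stream',
    'Cache-Control' => 'no-cache',
    'Connection'    => 'keep-alive',
]);
```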
3. Error Handling
Ollama sometimes fails. I added proper error handling:
```php
try {
    // ... the proc_open / fgets loop inside the stream() callback ...
} catch (\Exception $e) {
    echo "data: " . json_encode([
        'error'   => 'LLM service unavailable',
        'message' => $e->getMessage(),
        'done'    => true, // signals the frontend to close the EventSource
    ]) . "\n\n";

    if (ob_get_level() > 0) {
        ob_flush();
    }
    flush();
}
```
✅ What Works Best
- Server-Sent Events for streaming (not WebSockets)
- Mistral 7B for best speed/quality balance
- Rate limiting to prevent resource exhaustion
- Simple architecture - don't over-engineer
- Error handling - LLMs can be flaky
🎯 Final Architecture
User → Laravel Route → SSE Stream → Ollama API → Stream Response → User
Simple, effective, and it works. No queues, no WebSockets, no complexity.
💡 Key Takeaways
- Start simple. SSE is easier than WebSockets for this use case.
- Test with real models. Tutorial examples don't show real performance.
- Handle errors gracefully. LLM services can be unreliable.
- Rate limit everything. Local LLMs are resource-intensive.
- Don't over-engineer. Simple solutions often work best.
Would I build it differently next time? Probably not. This architecture works well for my needs, and sometimes the simple solution is the right solution.