Running Multiple LLMs: My GPU Memory Management Nightmare
🎯 The Goal
I wanted to run multiple LLMs simultaneously: one for code, one for writing, one for analysis. Simple, right?
Wrong. GPU memory management became a nightmare.
The problem: each LLM needs its own chunk of GPU memory, and my RTX 4090 only has 24GB to share between them. That sounds like plenty. It isn't.
💥 What Went Wrong
Problem 1: Memory Fragmentation
Repeatedly loading and unloading models fragmented GPU memory. Eventually allocations started failing even though, on paper, I had "enough" memory free.
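A quick way to watch this happen is to compare how much memory live tensors actually use with how much PyTorch has reserved from the driver. Here's a diagnostic sketch you could run between loads (the tag string is just a label):

```python
import torch

def report_gpu_memory(tag: str) -> None:
    """Print allocated vs. reserved GPU memory; a growing gap hints at fragmentation."""
    allocated = torch.cuda.memory_allocated() / 1e9  # bytes held by live tensors
    reserved = torch.cuda.memory_reserved() / 1e9    # bytes the caching allocator holds from the driver
    print(f"[{tag}] allocated={allocated:.1f} GB, reserved={reserved:.1f} GB")

report_gpu_memory("after loading model A")
# torch.cuda.memory_summary() prints a full per-pool breakdown when things look off
```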
Problem 2: Model Switching Overhead
Switching between models was slow. Loading a model took 30-60 seconds. Not practical for real-time use.
Problem 3: OOM Errors
Out-of-memory errors were frequent. Models would crash mid-inference.
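If you want the crash to be recoverable rather than fatal, recent PyTorch versions raise a dedicated exception you can catch. A defensive sketch (`model` and `prompt` stand in for whatever you're running, and a single retry after clearing the cache is just one reasonable policy):

```python
import torch

def generate_safely(model, prompt):
    """Try generation once; on GPU OOM, free cached blocks and retry a single time."""
    try:
        return model.generate(prompt)
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()       # hand cached, unused blocks back to the allocator
        return model.generate(prompt)  # retry once; if this fails too, let it raise
```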
✅ Solutions That Worked
Solution 1: Model Quantization
Using quantized models (4-bit, 8-bit) reduced memory usage significantly:
```python
# 4-bit quantization via bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit precision
    bnb_4bit_compute_dtype=torch.float16,  # run the matmuls in fp16
)

model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=quantization_config,
)
```
Impact: Reduced memory usage by 75%. Could run 3-4 models instead of 1.
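The 75% figure matches simple back-of-envelope math on weight storage. A rough sketch (the 7B parameter count is an illustrative assumption, and it ignores KV cache and activation overhead):

```python
# Approximate weight memory for a hypothetical 7B-parameter model
params = 7e9
fp16_gb = params * 2.0 / 1e9  # 2 bytes per weight  -> ~14.0 GB
int4_gb = params * 0.5 / 1e9  # 0.5 bytes per weight -> ~3.5 GB

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB, "
      f"saving {1 - int4_gb / fp16_gb:.0%}")  # -> saving 75%
```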
Solution 2: Model Caching
Keep every model resident in GPU memory instead of unloading it after each use:
```python
# Cache models in memory once at startup
# (load_model is whatever loader you use, e.g. the quantized one above)
models = {
    'code': load_model('code-llm'),
    'writing': load_model('writing-llm'),
    'analysis': load_model('analysis-llm'),
}

# Use cached models: pick one per request, nothing gets reloaded
response = models['code'].generate(prompt)
```
Impact: No loading overhead. Instant switching between models.
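If loading all three up front makes startup too slow, the same idea works lazily: pay the 30-60 second load once, on the first request, then keep the model resident. A sketch reusing the `load_model` helper and task names from the snippet above:

```python
# Lazy variant of the cache: load a model the first time its task is requested,
# then serve every later request from memory with no reload.
_model_cache = {}

def get_model(task: str):
    if task not in _model_cache:
        _model_cache[task] = load_model(f"{task}-llm")  # 'code' -> 'code-llm', etc.
    return _model_cache[task]

response = get_model('code').generate(prompt)
```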
Solution 3: Memory Pooling
PyTorch already pools GPU memory through its caching allocator; the fix was to manage that pool deliberately, clearing stale cached blocks and capping how much of the card the process can claim:
```python
import torch

torch.cuda.empty_cache()  # return cached blocks that no longer back live tensors
torch.cuda.set_per_process_memory_fraction(0.9)  # cap this process at 90% of VRAM
```
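Recent PyTorch builds also expose an allocator setting aimed directly at fragmentation; it's worth knowing about, though your mileage may vary. It has to be set before CUDA is initialized, so a minimal sketch looks like this:

```python
import os

# Configure PyTorch's caching allocator before any CUDA work happens;
# expandable segments make the memory pool less prone to fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # importing torch (and loading models) after setting the variable ensures it is picked up
```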
📊 Results
| Approach | Models Running | Memory Used | Stability |
|---|---|---|---|
| Before (Full Precision) | 1 | 22GB | Unstable |
| After (Quantized + Caching) | 3-4 | 18-20GB | Stable |
💡 Key Takeaways
- Quantization is essential for running multiple models
- Model caching eliminates switching overhead
- Memory pooling prevents fragmentation
- 24GB VRAM can run 3-4 quantized models
- Plan your memory usage carefully
Running multiple LLMs is possible, but it requires careful memory management. Quantization and caching are your friends.