You don’t start with a data center. You start with a problem.
My problem was simple: I wanted AI that just works. Cloud-first for speed, local when I need privacy. No GPU management on my main machine.
What I built looks weird from the outside. A Raspberry Pi doing brain work. A gaming mini-PC acting as the heavy lifter. A Surface Go and a Pixel 8 as edge nodes. None of it matches the “proper” architecture diagrams.
But that’s the pattern: start with constraints, build what works, add sophistication later.
How the Pieces Connect
The Pi orchestrates. The cloud does the thinking. The EVO handles privacy-sensitive work. The GL.iNet Flint 2 routes everything — and provides WireGuard VPN for remote access from the Surface and Pixel when I’m not home.
The Pi 5 Runs the Show
Eight gigabytes. Plenty when you’re not running models.
The Pi is pure infrastructure — no inference, just orchestration:
- OpenClaw — the orchestrator, always listening
- SearXNG — local search, no API limits
- fmem — memory system, semantic search
- Hermes — personal AI assistant (Docker container, for family use)
- Browser Node — disposable Chromium container for web automation (Docker)
No Ollama here. The Pi routes requests to Ollama Cloud by default, hands off to EVO when I need local processing. This keeps the Pi cool, responsive, and reliable.
Roughly 5GB of headroom means no memory pressure. The Pi runs 24/7 without breaking a sweat.
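The cloud-first routing rule is simple enough to sketch. This is an illustrative toy, not OpenClaw's actual logic — the endpoint URLs and the `sensitive` flag are my assumptions, stand-ins for however the real gateway tags requests.

```python
# Minimal sketch of the routing rule: cloud by default, EVO when private.
# Endpoints and the `sensitive` flag are illustrative assumptions,
# not OpenClaw's real API.

OLLAMA_CLOUD = "https://ollama.com"    # primary: zero local overhead
EVO_LOCAL = "http://evo.lan:11434"     # hypothetical LAN address for the EVO

def route(prompt: str, sensitive: bool = False) -> str:
    """Return the inference endpoint for a request.

    Everything goes to the cloud unless the caller marks it sensitive,
    in which case it stays on the local network.
    """
    return EVO_LOCAL if sensitive else OLLAMA_CLOUD
```

The point of the pattern is that the default path carries no local cost at all; the local path is opt-in, per request.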
Where the Thinking Happens
Primary: Ollama Cloud
Most queries go to ollama.com. No local GPU management, no memory pressure, no model updates to track.
Tiers: Free (with session limits), Pro ($20/mo for frontier models), Max ($100/mo for heavy use). I use the free tier for day-to-day; Pro when I need frontier models.
Local: EVO X2
When I need privacy — sensitive documents, work data, personal notes — the request routes to the EVO instead. Local inference, nothing leaves the network.
Privacy note: “Nothing leaves the network” means the request doesn’t go to cloud. The Pi gateway still sees all traffic. This is “not sent to cloud” privacy, not adversarial security — if someone compromises the Pi, they see everything.
This isn’t cloud-native. It’s cloud-first, local-when-needed.
The EVO Wakes When Needed
The EVO doesn’t run 24/7. It’s the heavy lifter — 96GB unified memory, Radeon 890M GPU, NPU for inference.
What makes it work:
| Component | Why It Matters |
|---|---|
| 96GB unified | Models don’t need to fit in VRAM — CPU and GPU share the pool |
| ROCm 7.1.1 | AMD’s CUDA alternative, experimental but working |
| Distrobox | Container isolation without losing hardware access |
| Bazzite | Immutable OS, atomic updates, SteamOS-like experience on desktop hardware |
What runs here:
- Ollama (local privacy) — When cloud isn’t appropriate. GLM 4.7 Flash is my preferred model (has a KV cache bug I work around), but Qwen 3.5 runs without issues. Also serves as the backend for coding agents (Pi, Opencode) and other services that need local LLM access.
- ComfyUI — Image generation (FLUX Schnell, Real-ESRGAN)
The EVO wakes when I need local inference. Most of the time, Ollama Cloud handles the load.
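The post doesn't specify how the EVO wakes; Wake-on-LAN is one common way to do on-demand power-up from an always-on box like the Pi. A sketch under that assumption — the MAC address is a placeholder, and it presumes WoL is enabled on the EVO's NIC and in firmware.

```python
import socket

def wol_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet: 6 x 0xFF, then the MAC 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    return b"\xff" * 6 + mac_bytes * 16

def wake(mac: str, broadcast: str = "255.255.255.255") -> None:
    """Send the magic packet to the LAN broadcast address (UDP port 9)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(wol_packet(mac), (broadcast, 9))

# wake("aa:bb:cc:dd:ee:ff")  # placeholder MAC -- substitute the EVO's NIC
```

With something like this on the Pi, "wakes when needed" becomes one function call before routing a local request.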
Edge Devices
The GL.iNet Flint 2 sits at the edge — it’s the router that connects everything and provides WireGuard VPN for remote access.
Why the Flint 2?
- WireGuard at line rate — no CPU bottleneck, no latency penalty
- Surface Go 2 and Pixel 8 connect through it when I’m not home
- Routes traffic to the Pi gateway without exposing services directly
The Surface Go 2 runs Ultramarine Linux — Fedora-based, not Atomic. Performance matters on constrained hardware; immutable distros add weight, and Ultramarine keeps things light. Plus, it ships Surface kernel support out of the box.
The Pixel 8 is my daily driver. Both connect to the Pi gateway through standard OpenClaw channels — no dedicated node software needed.
Neither runs heavy compute. They’re terminals with a direct line to the Pi — whether I’m at my desk or on WireGuard from somewhere else.
Why This Layout Works
Keep data close to compute.
| Service | Location | Why |
|---|---|---|
| OpenClaw | Pi | Orchestrates everything, must be always-on |
| fmem | Pi | Memory queries need low latency |
| SearXNG | Pi | Search during conversations, no API limits |
| Hermes | Pi (Docker) | Family assistant, lightweight container |
| Ollama Cloud | Remote | Primary inference, zero local overhead |
| Ollama Local | EVO | Privacy-sensitive work, offline fallback |
| ComfyUI | EVO | GPU-required, not time-critical |
What doesn’t move: the orchestrator and the memory. What moves: heavy compute to where the RAM is.
This is the same pattern as grocery shopping. You don’t check every aisle. You check the list. The list is the cache — local, fast, filtered.
What I Didn’t Build
I didn’t build a Kubernetes cluster. No Proxmox, no TrueNAS, no homelab staples.
This isn’t competing with homelabs — it’s a different focus. Kubernetes would help with GPU scheduling if I needed it. Proxmox would help if I were running many services. I’m not. I’m running three things: orchestration, memory, inference.
| Homelab Focus | AI Lab Focus |
|---|---|
| Service availability | Inference speed |
| High availability | Privacy-first |
| Many services | Few services, deep |
| GUI dashboards | CLI and APIs |
I don’t need five nines uptime. I need inference available, my sensitive data local when needed, and minimal cloud costs.
What It Costs
| Component | Hardware | Power Draw | Est. Cost/Month |
|---|---|---|---|
| Pi 5 | 8GB, always-on | ~5W | ~$0.50 |
| GL.iNet Flint 2 | Router, WireGuard | ~6W | ~$0.60 |
| EVO X2 | 96GB, on-demand | ~120W active | ~$5 (occasional use) |
| Surface Go 2 | On-demand | ~15W | Negligible |
| Pixel 8 | Personal device | — | — |
| Ollama Cloud | Remote inference | — | Free tier / $20 Pro |
Total: ~$6/month power + Ollama Cloud tier.
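The per-device numbers above imply an electricity rate of roughly $0.14/kWh — that rate is my back-of-envelope assumption, not stated in the table. The arithmetic:

```python
RATE_PER_KWH = 0.14  # assumed electricity rate in USD; adjust for your utility

def monthly_cost(watts: float, hours_per_day: float = 24.0) -> float:
    """Convert a steady power draw to dollars per 30-day month."""
    kwh = watts * hours_per_day * 30 / 1000
    return kwh * RATE_PER_KWH

# Pi 5 always-on: 5 W * 720 h = 3.6 kWh -> about $0.50/month,
# matching the table. The EVO's ~$5 figure reflects occasional duty,
# not 24/7 at 120 W (which would be closer to $12).
```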
How It Stays Secure
Nothing accepts inbound connections from the public internet. The lab runs on a private network — no open ports, no port forwarding. The handful of public-facing services are published through outbound Cloudflare tunnels instead of exposed directly.
The GL.iNet Flint 2 handles the perimeter:
- WireGuard VPN for remote access — Surface Go and Pixel 8 connect securely from anywhere
- Line-rate encryption — no CPU bottleneck, no noticeable latency
- Routes all traffic through the Pi, nothing bypasses the gateway
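A client-side WireGuard config for the Surface or Pixel looks roughly like this — keys, addresses, and the endpoint are placeholders, and the subnet is invented for illustration. The `AllowedIPs = 0.0.0.0/0` line is what makes "nothing bypasses the gateway" hold while roaming: all traffic, not just lab traffic, rides the tunnel.

```ini
# Hypothetical client config -- keys, addresses, endpoint are placeholders.
[Interface]
PrivateKey = <client-private-key>
Address = 10.0.0.2/32
DNS = 10.0.0.1              # resolve names through the home network

[Peer]
PublicKey = <flint2-public-key>
Endpoint = home.example.com:51820
AllowedIPs = 0.0.0.0/0      # route everything through the tunnel
PersistentKeepalive = 25    # keep NAT mappings alive on mobile networks
```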
What’s exposed:
- SpudHub — Dashboard via Cloudflare tunnel
- BingeWatching — Entertainment tracking
- Foundry VTT — Tabletop gaming (on-demand)
What’s not exposed:
- OpenClaw API
- Ollama endpoints
- fmem, SearXNG, Hermes
Cloudflare tunnels (cloudflared) handle the routing. No inbound connections. The tunnel dials out, Cloudflare routes traffic back. If the tunnel dies, the service disappears — no stale attack surface.
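A cloudflared ingress config for this split looks roughly like the sketch below — the tunnel UUID, hostnames, and local ports are placeholders (30000 is Foundry VTT's default, the rest are guesses). The catch-all rule is what keeps everything unlisted invisible.

```yaml
# Hypothetical cloudflared config -- UUID, hostnames, ports are placeholders.
tunnel: <tunnel-uuid>
credentials-file: /etc/cloudflared/<tunnel-uuid>.json

ingress:
  - hostname: spudhub.example.com
    service: http://localhost:8080      # SpudHub dashboard
  - hostname: binge.example.com
    service: http://localhost:8081      # BingeWatching
  - hostname: foundry.example.com
    service: http://localhost:30000     # Foundry VTT default port
  - service: http_status:404            # everything else: not found
```

OpenClaw, Ollama, fmem, SearXNG, and Hermes simply have no ingress rule, so they cannot be reached from outside even if someone guesses a hostname.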
Where This Breaks
This architecture isn’t for everyone.
Pi 5 limits:
- 8GB RAM is fine without Ollama — 5GB+ buffer
- No GPU. But with cloud-first, this doesn’t matter
- microSD is too slow for this workload — the OS runs from an external SSD instead
EVO X2 limits:
- ROCm 7.1.1 works for Ollama out of the box (vLLM has issues, but I’m not using it)
- Binary execution requires approval in Distrobox containers — by design, but can be disabled with yolo mode if needed
- Not always-on without accepting power cost
The mistake: copying this because it looks cool. Don’t.
The pattern: Start with your constraints. Build what works. Add sophistication when the constraint bites.
The Point
You don’t need a data center. You need a Raspberry Pi and a clear idea of what you’re optimizing for.
For me, it was: cloud-first for convenience, local for privacy.
The Pi orchestrates. Ollama Cloud thinks. The EVO handles sensitive work. Edge devices are terminals with a direct line home.
Same pattern as grocery delivery: let someone else stock the warehouse, cook in your own kitchen when it matters.
Disclaimer: This architecture reflects my constraints — power cost priority, local-first preference, no need for enterprise HA. If you’re running a production workload, build for your constraints, not mine.