
Infrastructure on a Raspberry Pi 🔗

I did it again. I set up a Kubernetes cluster on a few Raspberry Pis. The previous iteration was about 4 years ago, but the tooling has improved since then: better Arm64 support across the board, and my newest teammate, Claude.

What I wanted: a reliable place for my own playground that runs at a capped cost. A self-hosted solution for side projects where I don’t have to think twice about keeping instances up. I also wanted to maintain a certain amount of comfort—I don’t buy into the “it’s fine to be more manual for side projects” mentality. It’s fine, but simple automation isn’t costly to set up.

That’s where Raspberry Pis come in. Small but mighty, around a hundred dollars apiece depending on the exact spec. Cheaper than a full-on server or desktop computer, more predictable than auto-scaling cloud compute.

My cluster runs GitOps for declarative infrastructure, CI/CD for automated deployments, high-availability storage, and comprehensive telemetry. I wanted the ability to self-host projects without monthly subscriptions, and I wanted a production-like setup, not because I think Raspberry Pis are the hardware of the future, but because of the operational comfort it brings to my side projects, and a belief that doing things right will generally save you some trouble somewhere down the line.

It turns out you can have both operational comfort and hardware constraints.

Foundation: Reproducible Infrastructure 🔗

Automation is King. Production infrastructure starts with predictability. If you can’t reproduce your environment reliably, you can’t recover quickly when things break. Cloud teams use Infrastructure as Code and declarative configuration management. Homelab clusters can too.

GitOps is the way. No more ad-hoc config patches that get forgotten by all. Instead, describe desired state in Git, let a controller (Flux in my case, lighter than ArgoCD) reconcile the cluster to match. Every change is version-controlled. Every deployment is auditable. Broke something? Revert the commit.
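
For the curious, the core of a Flux setup is just two resources: a GitRepository watching the config repo, and a Kustomization reconciling a path from it. The repo URL, path, and names below are illustrative:

```yaml
# Watch the infra repo for changes (URL and names are placeholders).
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: infra
  namespace: flux-system
spec:
  interval: 10m
  url: https://github.com/example/infra
  ref:
    branch: main
---
# Reconcile the cluster to match ./clusters/home in that repo.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: infra
  path: ./clusters/home
  prune: true
```

The `prune: true` bit is what makes "revert the commit" a real rollback: resources removed from Git get removed from the cluster.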

Flux has been solid. However, Flux’s polling is awkward when iterating fast: polling every minute doesn’t really make sense, but waiting 10 minutes for a change I just pushed to be picked up was too slow. So I enabled webhooks for faster response times, and I no longer have to reconcile manually every time. I’ll see whether I need to move the entire cluster over to webhooks at some point, but the centralization of the config is already great.
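
A webhook in Flux is a Receiver resource; the notification-controller then exposes an endpoint (under a `/hook/<digest>` path) that a GitHub webhook can point at. Names here are illustrative:

```yaml
# Tell Flux to re-check the GitRepository when GitHub pushes an event.
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
  name: github-receiver
  namespace: flux-system
spec:
  type: github
  events: ["push"]
  secretRef:
    name: webhook-token   # shared secret, also configured on the GitHub side
  resources:
    - apiVersion: source.toolkit.fluxcd.io/v1
      kind: GitRepository
      name: infra
```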

The principle extends to hardware provisioning: cloud-init scripts, automated cluster operations and secret management. Adding the last node took roughly 30 minutes across three steps: copying the image (longest part), generating cloud-init scripts, and joining the cluster.
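
The distro isn’t named in this post, but the server/agent terminology later suggests k3s; assuming that, the cloud-init user-data for a new agent node is roughly this sketch (hostname, server address, and token are placeholders):

```yaml
#cloud-config
# Hypothetical user-data for a new node joining an existing cluster.
hostname: pi-node-3
package_update: true
runcmd:
  # Join the cluster as an agent; the token comes from the first server.
  - curl -sfL https://get.k3s.io | K3S_URL=https://10.0.0.10:6443 K3S_TOKEN=<node-token> sh -
```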

Ansible could handle this more elegantly, but I decided to skip the Python dependency and stick with Makefiles for simplicity. I can always revisit this if scaling beyond a few nodes requires it.
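
A Makefile for this kind of provisioning can stay very small; target names, paths, and the templating scheme below are made up (and recipe lines need tab indentation):

```makefile
# Illustrative provisioning targets; everything here is a sketch.
NODE ?= pi-node-3
DISK ?= /dev/sda

image: ## flash the base image to an SD card (the slow part)
	dd if=build/base.img of=$(DISK) bs=4M status=progress

cloud-init: ## render user-data for a new node from a template
	sed 's/{{HOSTNAME}}/$(NODE)/' templates/user-data.tmpl > build/user-data

seal: ## encrypt a secret so it can live in the GitOps repo
	kubeseal --format yaml < build/secret.yaml > secrets/$(NODE)-secret.yaml
```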

Operational Excellence: CI/CD 🔗

Fast, reliable iteration relies on predictability. Infrastructure is one half of that; the dev cycle is the other. I wouldn’t consider an actual project without unit tests and CI. Going through a manual checklist for deployments is a strong no-go in my mind: I’m bound to miss a step, and my attention should be on building the next feature, not on whether I deployed correctly.

The value of Continuous Integration is different in my case: I work alone on most projects, so the typical CI benefits (catching merge conflicts, cross-team issues) don’t apply. But consistency and pre-deploy validation are still valuable. “Works on my machine,” right? I tend to have more faith in CI.

Continuous Deployment is the real win, and what I was chasing. Code it, test it, merge it, and if it passes CI, it ships. No manual approval gates, no SSH sessions, no “I think I deployed it but let me check.” Automation that reduces friction instead of adding layers.

I chose Woodpecker CI over GitHub Actions (cost and control) or Jenkins (dated, heavy, unsuitable for constrained hardware). Woodpecker runs on the cluster itself: lightweight, YAML pipelines, fast builds, accessible logs. It just works—exactly what production teams want from CI/CD.

Woodpecker CI runs well. I definitely hit some config gaps, but overall it’s far better than Jenkins. Once set up, it’s a modern CI system with YAML config in the project. Complex pipelines and storage sharing might become challenging, but I’ll address those when needed.

Building images from CI proved complex. I tried different versions of buildx and podman but ran into persistent socket and permission issues. Kaniko solved this (though it’s now unmaintained), and the Woodpecker CI plugin made it straightforward.
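
For reference, a sketch of what such a pipeline looks like. The kaniko plugin’s image name and settings are from memory and may not match the exact plugin; the test step is a stand-in for whatever the project actually uses:

```yaml
# .woodpecker.yml sketch; plugin image and settings are illustrative.
steps:
  test:
    image: python:3.12
    commands:
      - pip install -r requirements.txt
      - pytest
  publish:
    image: woodpeckerci/plugin-kaniko   # assumed plugin name
    settings:
      registry: registry.example.com
      repo: registry.example.com/myapp
      tags: ${CI_COMMIT_SHA}
    when:
      - branch: main
```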

CD presented a challenge: my project code and infrastructure config live in separate repos, requiring cross-repo commits for deployment. Rather than managing git credentials across repos, I trigger a GitHub Action on my infra repo from CI on success. This approach decouples the systems cleanly and avoids credential complexity.
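
On the infra side, the wiring looks something like this; the event type, file paths, and payload fields are invented for illustration. The final Woodpecker step just POSTs to the GitHub `repository_dispatch` endpoint with a token scoped to the infra repo:

```yaml
# .github/workflows/deploy.yml in the infra repo (names illustrative).
name: bump-image
permissions:
  contents: write
on:
  repository_dispatch:
    types: [app-released]
jobs:
  bump:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Update the image tag from the dispatch payload and commit;
      # Flux reconciles the cluster from there.
      - run: |
          sed -i "s/tag: .*/tag: ${{ github.event.client_payload.sha }}/" apps/myapp/values.yaml
          git config user.name ci-bot
          git config user.email ci-bot@example.com
          git commit -am "deploy myapp ${{ github.event.client_payload.sha }}"
          git push
```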

The Docker Registry remains one of the worst components to set up. At least there’s now a multi-arch off-the-shelf image that solves the cold-start problem: if you self-host your Docker registry in the cluster, how do you restart the registry itself? The HTTPS default for self-hosted registries is a pain: it bleeds into how the pods are set up and involves more configuration on the nodes, because certificates are handled differently.
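
Concretely, on k3s (assuming that distro) the per-node configuration means telling containerd about the registry’s certificate in `/etc/rancher/k3s/registries.yaml`; hostname and paths here are placeholders:

```yaml
# /etc/rancher/k3s/registries.yaml, needed on every node.
mirrors:
  "registry.example.com":
    endpoint:
      - "https://registry.example.com"
configs:
  "registry.example.com":
    tls:
      # CA for the self-signed/internal certificate the registry serves.
      ca_file: /etc/ssl/certs/registry-ca.crt
```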

Resilience: Surviving Failures at Pi Scale 🔗

Automation and reproducible infrastructure only matter if the platform stays running. Production systems expect failures—disk corruption, network issues, nodes going down—and plan for them. A two-node cluster is a ticking time bomb. Time to add real resilience.

High Availability 🔗

Kubernetes consensus requires at least three control-plane nodes to tolerate a single failure. This isn’t a cloud luxury—it’s the price of a cluster you can trust. With only two nodes, losing one meant losing the entire control plane. With three, the cluster survives individual node failures.

Adding nodes was straightforward thanks to automated provisioning. Converting my existing agent node to a server took a couple of commands and some cloud-init script patching (I had to add the join-server option). Now I have three server nodes and an actual chance of surviving hardware failures.
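
With k3s (again an assumption from the server/agent terminology), the join-server patch amounts to one flag difference in the cloud-init runcmd; addresses and token are placeholders:

```yaml
#cloud-config
runcmd:
  # "server" plus --server joins the node to the existing control plane
  # instead of registering it as an agent.
  - curl -sfL https://get.k3s.io | sh -s - server --server https://10.0.0.10:6443 --token <node-token>
```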

Storage Replication 🔗

High availability means nothing if your data lives on a single SD card. Local storage isn’t reliable enough for infrastructure you want to trust. NFS would work, but I wanted storage self-contained within the cluster itself. Longhorn solved this—replicated storage volumes across nodes. A single Pi failing doesn’t take down persistent data.

An advantage of a fresh cluster: set it up right the first time instead of migrating PVCs later. Longhorn has been solid; replication works. It also simplifies cluster expansion: bump the replica count on the volume (or set it in the StorageClass), and you have storage redundancy. There’s more work to do: I’m currently running RWO (single-node read-write) volumes, and RWX (read-write-many) is a good future step.
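
The replica count is a StorageClass parameter (it can also be adjusted per volume); a sketch with illustrative values:

```yaml
# StorageClass requesting 3-way replication across nodes.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-replicated
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"
```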

Observability 🔗

Redundancy keeps the cluster alive. Observability shows the actual health. I always look for signs that the system is stable rather than trusting the absence of failures.

Observability is probably the odd one out in a resource-constrained environment. It feels like overkill for a homelab, and it’s part of the reason I had to add a node. In practice, K9s only gave me so much: sometimes things would chug, and I’d come back to find the cluster recovering, with no history of what had happened. I figured it would be a good opportunity to peek a bit further under the hood of the services I generally take for granted.

Grafana, VictoriaMetrics, Loki, and Alloy provide the observability I need. Dashboards load fast, logs are queryable, metrics show what’s happening. Running this footprint on Pi hardware is resource-intensive. A good observability stack is heavyweight by nature. I purposely scaled it down—while valuable, it shouldn’t consume 10x the resources of the actual applications it’s monitoring.
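
Scaling the stack down mostly means pinning retention and resource limits in the Helm values. A hedged sketch of the kind of values I mean, for a single-node VictoriaMetrics; exact keys vary by chart and version:

```yaml
# Illustrative Helm values; key names depend on the chart.
server:
  retentionPeriod: 2w
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      memory: 512Mi
```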

Working with AI for Infrastructure 🔗

I got a cluster up and running in roughly a week. I didn’t track exact hours, but it consumed significant free time.

Claude accelerated some tasks while complicating others. I loved having Flux and Cloudflare Tunnels set up in minutes. On the other hand, Woodpecker and the observability stack took much longer than they would have if I were more hands-on from the start.

Expanding boilerplate was a breeze. Secret management is fairly repetitive, and a bit of scripting around sealed secrets covered it. The agent really saved me a lot of time on tedious setup. Some operations are also pretty well documented: I barely had to be involved in setting up cloud-init; I described the expectations and it appeared. My main tweak was around how nodes join, because the agent kept hardcoding things a bit too much for my taste.

Fighting Helm charts makes me appreciate my SRE teammates even more. The chart versioning and gaps between application configuration and chart configuration can be huge depending on the application. It made the agent trip a few times, increasing the disconnect between my expectations and what the agent was finding.

Flux definitely provides a great base for working with Helm charts. It’s much better than the fragmented management of Helm charts directly. Considering the overhead of understanding the charts, it still feels better than rewriting Kustomize for each app. At least it’s a decent trade-off for me since my goal remains being able to work on my side projects more than designing the perfect production environment.

This is where the agent struggled most: choosing between embedded configs and ConfigMaps, or debugging why ConfigMaps weren’t being picked up by deployments. What trips humans also trips agents; I was able to unblock the agent quickly because I tripped on the same issue in the past. Every knowledge gap I had was amplified by the agent suggesting outdated patterns or missing subtle configuration details.

Was it worth it? 🔗

I’ll need to use it more to know for certain, but so far, I am very happy with the setup. Better automation than some actual production setups. I can’t wait to play with it more.

It brought me closer to areas critical to my work that are often managed for me. Some exposure to the day-to-day struggles of your teammates never hurts. I wanted infrastructure I could trust and operate the way I like working: automated, observable, and predictable.

Is it overkill for a homelab? Absolutely. But the effort isn’t that big considering the operational burden of manual management in the long run. There’s elegance in CD—you focus on the work, and deployments happen automatically. All in all, it took less than 2 weeks of calendar time, and I wasn’t full-time on it. It’s an investment, but you get a production-like setup for your side projects. If this resonates with you, it’s not as overwhelming to set up as it sounds.

Being cost-conscious doesn’t mean being fragile. Being self-hosted doesn’t mean losing the comfort of production patterns. Constraint doesn’t require compromise. I am happy with this approach, but if you have better patterns, I’d love to hear them.