It’s cheaper to buy more hardware than to optimize your code. I have heard that a lot in my career. While I generally agree, it rests on the premise that optimization and fine-tuning are hard… but they don’t always have to be, do they?
I have definitely seen incidents, outages, and plain cost spikes come from poorly configured deployments. While the value of getting resource requirements exactly right is often low, especially in the earlier stages of a company, misconfigurations tend to compound: the wrong pod claims too much memory, often hiding a leak that memory limits would have caught early; a greedy pod hogs all the CPU; a critical pod gets evicted under resource pressure.
When running my Raspberry Pis, I ran into similar issues. Nodes saturate because an LLM is hogging all the resources, the poorly configured pods pay the price, and the sword of eviction falls. Kubernetes decides to evict my nginx servers… tough luck.
In a typical prod system, you have metrics and other systems that help flag bad resource usage. When I asked my agent to review my resource configuration, it definitely pointed out issues. It becomes a cycle of reviewing pod specs, finding resource usage trends, adjusting, and validating. And as applications evolve, you have to repeat it regularly to make sure those configs are still valid.
There must be a better way.
Facing bad resource usage, there are a few obvious checks: identifying which workloads are missing resource requests or limits, which are asking for too much, and which are constantly at the edge.
I didn’t want to figure out how to feed my entire stack to my agent. I also didn’t want to blow up my usage by letting the agent burn countless cycles figuring out how to query the data. And I didn’t want to deploy some heavyweight solution that does this but also a whole bunch of other things I don’t really care about at the moment.
So I wrote Winston: a small pod, written in Go, backed by a SQLite database persisted on a small PVC. It polls the resource metrics every minute, aggregates them, and runs checks to find exuberant pods:
- missing limits: no boundary set; the pod can consume unbounded resources.
- missing requests: no baseline declared; results in a lower QoS class and higher eviction risk.
- danger zone: running at over 90% of its limits; the pod is likely underprovisioned.
- over-provisioned: running at less than 20% of requests; wasteful resource allocation.
- ghost limit: running at less than 10% of limits; the limit provides little safety value.
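The five checks above are simple threshold comparisons. Here is a minimal sketch of how I would express them in Go; the names and types are my own for illustration, not Winston’s actual code:

```go
package main

import "fmt"

// Sample summarizes one container's observed usage against its
// declared request and limit, all in the same unit (e.g. millicores
// or bytes). Zero means "not set".
type Sample struct {
	Usage, Request, Limit float64
}

// Classify returns every finding that applies to a sample, using the
// thresholds from the list above.
func Classify(s Sample) []string {
	var findings []string
	if s.Limit == 0 {
		findings = append(findings, "missing limits")
	}
	if s.Request == 0 {
		findings = append(findings, "missing requests")
	}
	if s.Limit > 0 && s.Usage >= 0.9*s.Limit {
		findings = append(findings, "danger zone")
	}
	if s.Request > 0 && s.Usage < 0.2*s.Request {
		findings = append(findings, "over-provisioned")
	}
	if s.Limit > 0 && s.Usage < 0.1*s.Limit {
		findings = append(findings, "ghost limit")
	}
	return findings
}

func main() {
	// A container using 95m of a 100m CPU limit is in the danger zone.
	fmt.Println(Classify(Sample{Usage: 95, Request: 50, Limit: 100}))
}
```

A single sample can trigger several findings at once, which is why the checks are independent rather than a switch.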
Winston comes with a /metrics endpoint to feed your favorite OTel stack. There’s also an embedded static UI and an endpoint that returns a Markdown or JSON report of exuberant pods. Feed that report to an agent that has access to your Flux repo, and all that’s left to do is cherry-pick the actions. For example, here’s what my agent suggested:
| Workload | CPU avg | CPU max | Mem avg | Mem max | → CPU req/limit | → Mem req/limit |
|---|---|---|---|---|---|---|
| alloy | 16m | 105m | 386Mi | 572Mi | 50m / 200m | 400Mi / 640Mi |
| config-reloader | — | 1m | — | 28Mi | 5m / 10m | 32Mi / 48Mi |
| grafana | 27m | 192m | 117Mi | 128Mi | 50m / 300m | 128Mi / 256Mi |
| kube-state-metrics | 3m | 8m | 25Mi | 28Mi | 10m / 50m | 32Mi / 64Mi |
| node-exporter | 4m | 16m | 23Mi | 26Mi | 10m / 50m | 32Mi / 64Mi |
| vm-operator | 3m | 6m | 56Mi | 63Mi | 10m / 100m | 64Mi / 128Mi |
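The avg/max columns come from a straightforward reduction over the per-minute samples. A sketch of that aggregation, assuming the samples are already loaded into a slice (Winston actually keeps them in SQLite):

```go
package main

import "fmt"

// aggregate reduces a window of usage samples to the avg/max pair
// shown in the report table.
func aggregate(samples []float64) (avg, max float64) {
	if len(samples) == 0 {
		return 0, 0
	}
	var sum float64
	for _, s := range samples {
		sum += s
		if s > max {
			max = s
		}
	}
	return sum / float64(len(samples)), max
}

func main() {
	// Three one-minute CPU samples, in millicores.
	avg, max := aggregate([]float64{10, 20, 30})
	fmt.Printf("avg=%.0fm max=%.0fm\n", avg, max) // avg=20m max=30m
}
```

The suggested request/limit values then just add headroom on top of avg and max; the exact margin is the agent’s judgment call, not a fixed formula.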
Winston closed the resource definition gaps quickly and gave me better visibility into effective capacity. Curious about your thoughts.
I am Winston Wolfe, I solve problems. May I come in?