Open-sourcing our Karpenter provider for Hetzner
An open-source Karpenter provider for Hetzner Cloud: cost-optimal Kubernetes node autoscaling that launches the cheapest server type for the pending pods.
Today we are open-sourcing a Karpenter provider for Hetzner Cloud, under Apache 2.0. It implements Karpenter’s CloudProvider interface against the Hetzner Cloud API, so Karpenter can provision, consolidate, and replace Hetzner servers as Kubernetes nodes, choosing the cheapest server type that fits the pending pods.
We built it for our own infrastructure. Paperclip.inc runs a managed, EU-hosted platform on Kubernetes, with the cluster built on Hetzner and Talos. Those workloads are bursty: the cluster adds capacity when work arrives and gives it back when the work finishes, and each node should be the smallest server that fits. Karpenter does exactly that, but it had no Hetzner provider. The official list covers AWS, Azure, GCP, Cluster API and a handful of others; Hetzner was absent.
Cluster Autoscaler vs Karpenter on Hetzner
Hetzner Kubernetes autoscaling has so far meant the Cluster Autoscaler, usually wired up by kube-hetzner or Cluster API. The Cluster Autoscaler works against fixed node groups: you predefine a pool of one server type, and it adds or removes nodes of exactly that type. To offer a second size you define a second pool, and the scheduler picks between pools rather than picking the right machine for the work.
Karpenter inverts that. It looks at the pods that cannot be scheduled, considers the full catalog of instance types, and launches the single cheapest node that fits them. There are no node groups to size by hand. When a node goes empty or underused, Karpenter consolidates it away. On a provider with as many server types as Hetzner, across shared and dedicated CPU and both x86 and Arm, that selection shows up directly in the bill.
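The selection rule described above can be sketched in a few lines of Python. This is a simplified illustration, not the provider's actual code: the type names, sizes, and prices are hypothetical, and the real scheduler also accounts for things this sketch ignores, such as system overhead, DaemonSets, and packing pods across several nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerType:
    name: str            # hypothetical type name, not a real Hetzner SKU
    arch: str            # "amd64" or "arm64"
    cpu: float           # vCPUs
    memory_gib: float
    hourly_price: float  # illustrative number, not a real price


@dataclass(frozen=True)
class PodRequest:
    cpu: float
    memory_gib: float


def cheapest_fit(pods, catalog, allowed_archs=("amd64", "arm64")):
    """Return the lowest-priced server type that can hold all pods, or None."""
    need_cpu = sum(p.cpu for p in pods)
    need_mem = sum(p.memory_gib for p in pods)
    candidates = [
        s for s in catalog
        if s.arch in allowed_archs
        and s.cpu >= need_cpu
        and s.memory_gib >= need_mem
    ]
    return min(candidates, key=lambda s: s.hourly_price, default=None)


# Made-up catalog for the example.
CATALOG = [
    ServerType("small-x86", "amd64", 2, 4, 0.010),
    ServerType("medium-x86", "amd64", 4, 8, 0.020),
    ServerType("medium-arm", "arm64", 4, 8, 0.015),
    ServerType("large-x86", "amd64", 8, 16, 0.040),
]

pending = [PodRequest(1.5, 2), PodRequest(1.0, 3)]  # needs 2.5 vCPU, 5 GiB
choice = cheapest_fit(pending, CATALOG)
print(choice.name)  # medium-arm
```

Restricting `allowed_archs` to `("amd64",)` makes the same call return `medium-x86`, which is the effect an architecture requirement has on the catalog the scheduler is allowed to choose from.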
How the provider works
The provider implements Karpenter’s CloudProvider interface against the Hetzner Cloud API:
- Cheapest node that fits. It reads Hetzner’s live per-location pricing and offers Karpenter the lowest-cost server type that satisfies the pending pods, across shared (CPX, CX) and dedicated (CCX) CPU.
- x86 and Arm. Both architectures are first class, so you can steer workloads onto the cheaper Ampere (CAX) machines wherever your images support arm64.
- Talos and Ubuntu. A node joins from a Talos machine config or an Ubuntu cloud-init document, supplied through a Kubernetes Secret so no join credentials ever sit in a manifest.
- Drift and consolidation. Empty and underused nodes are removed automatically. A server whose image, network, firewall, server type, location or labels no longer match its HCloudNodeClass is flagged as drifted and replaced.
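The drift check in the last bullet amounts to comparing desired and observed values field by field. A minimal Python sketch, assuming both sides can be flattened into plain dictionaries; the field names and values here are hypothetical, and the real comparison is likely richer (for example, a NodeClass lists several allowed locations while a server has exactly one):

```python
# Fields the post lists as drift triggers; the dictionary keys are assumptions.
DRIFT_FIELDS = ("image", "network", "firewall", "server_type", "location", "labels")


def drift_reasons(desired, observed):
    """Return the fields where the running server differs from the desired spec."""
    return [f for f in DRIFT_FIELDS if desired.get(f) != observed.get(f)]


desired = {
    "image": "talos-new",        # hypothetical image name
    "network": 123456,
    "firewall": "fw-1",          # hypothetical firewall name
    "server_type": "medium-arm", # hypothetical type name
    "location": "nbg1",
    "labels": {"pool": "default"},
}

# A server launched from an older image that also picked up an extra label.
observed = dict(desired, image="talos-old", labels={"pool": "default", "extra": "x"})

print(drift_reasons(desired, observed))  # ['image', 'labels']
```

An empty result means the server still matches its spec. A non-empty one gives the reasons that would be attached to the replacement.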
One detail worth calling out, because it is the kind of thing that quietly breaks autoscalers: Karpenter garbage-collects nodes by periodically listing the instances the provider manages and deleting any NodeClaim whose server is gone. Our List() is scoped by a karpenter.sh/cluster label, so two clusters sharing a single Hetzner project never see each other’s servers and can never garbage-collect each other’s nodes. Getting that boundary right is the difference between an autoscaler and an outage.
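To make the boundary concrete, here is a small Python sketch of a label-scoped listing feeding a garbage-collection pass. It models the idea only: the `Server` class, the cluster names, and the claim mapping are hypothetical, and only the `karpenter.sh/cluster` label key comes from the description above.

```python
from dataclasses import dataclass, field

CLUSTER_LABEL = "karpenter.sh/cluster"  # label key named in the post


@dataclass
class Server:
    id: int
    labels: dict = field(default_factory=dict)


def list_owned(servers, cluster_name):
    """Return only the servers labelled as belonging to this cluster."""
    return [s for s in servers if s.labels.get(CLUSTER_LABEL) == cluster_name]


def orphaned_claims(claims, servers, cluster_name):
    """Names of NodeClaims (name -> server id) whose server is gone."""
    live = {s.id for s in list_owned(servers, cluster_name)}
    return sorted(name for name, sid in claims.items() if sid not in live)


# One Hetzner project shared by two clusters plus an unrelated server.
project = [
    Server(1, {CLUSTER_LABEL: "blue"}),
    Server(2, {CLUSTER_LABEL: "green"}),
    Server(3, {}),
]

# Cluster "blue" tracks two claims; server 9 was deleted out of band.
blue_claims = {"claim-a": 1, "claim-b": 9}

print(orphaned_claims(blue_claims, project, "blue"))  # ['claim-b']
print([s.id for s in list_owned(project, "green")])   # [2]
```

Because every lookup goes through `list_owned`, cluster "blue" never sees server 2 or 3 at all, so nothing it does during cleanup can touch them.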
Install
The controller ships as a multi-arch image and an OCI Helm chart, both signed with cosign and published with SLSA provenance and an SBOM.
    kubectl create namespace karpenter
    kubectl -n karpenter create secret generic hcloud-token \
      --from-literal=token="$HCLOUD_TOKEN"
    helm install karpenter-provider-hetzner \
      oci://ghcr.io/paperclipinc/charts/karpenter-provider-hetzner --version 1.0.0 \
      --namespace karpenter \
      --set clusterName=my-cluster \
      --set auth.secretRef.name=hcloud-token
You also need Karpenter’s own NodePool and NodeClaim CRDs installed, the same as any Karpenter setup.
Define your nodes
An HCloudNodeClass says how to build a node: which network, which image, how to bootstrap it. Here is a Talos worker on a private network, pinned to an exact image by label so you get the precise Talos version and the system extensions you baked in (gVisor, in our case):
    apiVersion: karpenter.hetzner.cloud/v1alpha1
    kind: HCloudNodeClass
    metadata:
      name: default
    spec:
      locations: ["nbg1"]
      networkID: 123456  # your Hetzner private network ID
      imageSelector:
        family: talos
        selector:
          caph-image-name: talos-v1.13.3-gvisor  # pin the exact snapshot
      userDataSecretRef:  # Talos worker machineconfig, from a Secret
        namespace: karpenter
        name: talos-worker
        key: userData
      placementGroupStrategy: spread  # spread nodes across Hetzner hardware
      enablePublicIPv4: false  # private-network cluster, skip the IPv4 charge
A NodePool says what Karpenter is allowed to launch and when to consolidate:
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: default
    spec:
      template:
        spec:
          nodeClassRef:
            group: karpenter.hetzner.cloud
            kind: HCloudNodeClass
            name: default
          requirements:
            - key: kubernetes.io/arch
              operator: In
              values: ["amd64"]
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["on-demand"]
      limits:
        cpu: "32"
      disruption:
        consolidationPolicy: WhenEmptyOrUnderutilized
        consolidateAfter: 1m
That is the whole setup. Karpenter now provisions the cheapest server that fits your pending pods and removes it once it is idle.
Cheaper nodes on Arm
The cheapest right-sized node is often an Ampere Arm machine. Define a second NodePool that only offers arm64, and let your arm64-capable workloads land there:
    requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["arm64"]  # Karpenter picks from Hetzner's CAX line
Pods select their architecture with a normal nodeSelector or nodeAffinity on kubernetes.io/arch. As long as your images are multi-arch, the work flows to whichever pool is cheaper for it.
Operating it
The controller exposes Prometheus metrics under the karpenter_hetzner_ prefix and ships a ServiceMonitor for Prometheus Operator. A couple of queries we watch:
    # how often nodes are being replaced, by reason
    sum by (reason) (rate(karpenter_hetzner_drift_detected_total[1h]))
    # p90 server-create latency
    histogram_quantile(0.9, rate(karpenter_hetzner_server_create_duration_seconds_bucket[1h]))
When a NodeClass will not become ready, the controller writes a Warning Event explaining why (a missing network, an unresolvable image, a bad userData Secret), so kubectl describe hcloudnodeclass gives you the answer instead of leaving you to guess.
Why we open-sourced it
We run on open infrastructure, and we give back to it. Our Kubernetes operators are open, and so is this. Running a managed cloud well means depending on tools you can read, fix, and carry with you, and the fastest way to keep those tools honest is to share them. If you run Kubernetes on Hetzner, you should not have to rebuild this layer yourself.
It is also how we think the cost question should be answered. The cheapest cloud bill is the one where every machine is the right size for the moment, added when the work shows up and removed when it leaves. That discipline is part of how we keep a sovereign European cloud affordable, and it is now yours to use too.
Try it
    helm install karpenter-provider-hetzner \
      oci://ghcr.io/paperclipinc/charts/karpenter-provider-hetzner --version 1.0.0
The code, the Helm chart, and the Talos and Ubuntu bootstrap guides are on GitHub. We run it in production, so if you hit something, open an issue. We will be in there too.
FAQ
Does Karpenter work on Hetzner Cloud?
Yes. The open-source Karpenter provider for Hetzner implements Karpenter’s CloudProvider interface against the Hetzner Cloud API, so Karpenter can provision, consolidate, and replace Hetzner servers as Kubernetes nodes.
How is it different from the Cluster Autoscaler on Hetzner?
The Cluster Autoscaler scales fixed node groups of a predefined server type. Karpenter picks the cheapest server type that fits the pending pods from the whole Hetzner catalog, with no node groups to size by hand, and consolidates idle nodes automatically.
Does it support Arm and Talos?
Yes. Both amd64 and arm64 (the Ampere CAX line) are first class, and nodes can bootstrap from a Talos machine config or Ubuntu cloud-init supplied through a Kubernetes Secret.
Is it production-ready?
It is at v1.0.0, with unit and controller tests and an end-to-end suite (provision, join, drift, consolidation) validated on a live Talos cluster. Releases are signed and ship an SBOM and provenance. We run it in our own production cloud.