Open-sourcing our Karpenter provider for Hetzner

An open-source Karpenter provider for Hetzner Cloud: cost-optimal Kubernetes node autoscaling that launches the cheapest server type for the pending pods.

Today we are open-sourcing a Karpenter provider for Hetzner Cloud, under Apache 2.0. It implements Karpenter’s CloudProvider interface against the Hetzner Cloud API, so Karpenter can provision, consolidate, and replace Hetzner servers as Kubernetes nodes, choosing the cheapest server type that fits the pending pods.

We built it for our own infrastructure. Paperclip.inc runs a managed, EU-hosted platform on Kubernetes, with the cluster built on Hetzner and Talos. Those workloads are bursty: the cluster adds capacity when work arrives and gives it back when the work finishes, and each node should be the smallest server that fits. Karpenter does exactly that, but it had no Hetzner provider. The official list covers AWS, Azure, GCP, Cluster API and a handful of others; Hetzner was absent.

Cluster Autoscaler vs Karpenter on Hetzner

On Hetzner, Kubernetes autoscaling has so far meant the Cluster Autoscaler, usually wired up by kube-hetzner or Cluster API. The Cluster Autoscaler works against fixed node groups: you predefine a pool of one server type, and it adds or removes nodes of exactly that type. To offer a second size you define a second pool, and the autoscaler chooses which pool to grow rather than choosing the right machine for the work.

Karpenter inverts that. It looks at the pods that cannot be scheduled, considers the full catalog of instance types, and launches the single cheapest node that fits them. There are no node groups to size by hand. When a node goes empty or underused, Karpenter consolidates it away. On a provider with as many server types as Hetzner, across shared and dedicated CPU and both x86 and Arm, that selection shows up directly in the bill.
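
To make the selection concrete, here is a minimal Python sketch of cheapest-fit. It is not the provider's code. The ServerType and Pod classes, the cheapest_fit function, and every name, size and price in CATALOG are invented for illustration. It only sums the pending requests, and ignores the other things Karpenter weighs, such as daemonset overhead, allocatable versus raw capacity, and splitting pods across several nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerType:
    name: str
    arch: str          # "amd64" or "arm64"
    cpu: int           # vCPUs
    memory_gib: float
    hourly_eur: float  # made-up price, for illustration only


@dataclass(frozen=True)
class Pod:
    cpu: float         # requested vCPUs
    memory_gib: float  # requested memory


def cheapest_fit(pods, catalog, arch=None):
    """Return the cheapest server type that fits the summed requests, or None."""
    need_cpu = sum(p.cpu for p in pods)
    need_mem = sum(p.memory_gib for p in pods)
    candidates = [
        s for s in catalog
        if s.cpu >= need_cpu
        and s.memory_gib >= need_mem
        and (arch is None or s.arch == arch)
    ]
    return min(candidates, key=lambda s: s.hourly_eur, default=None)


# Hypothetical catalog: names, sizes and prices are invented.
CATALOG = [
    ServerType("small-x86", "amd64", 2, 4, 0.010),
    ServerType("medium-x86", "amd64", 4, 8, 0.020),
    ServerType("medium-arm", "arm64", 4, 8, 0.015),
    ServerType("large-x86", "amd64", 8, 16, 0.040),
]

pending = [Pod(1.5, 2), Pod(1.0, 3)]  # 2.5 vCPU and 5 GiB in total
choice = cheapest_fit(pending, CATALOG)
print(choice.name)  # medium-arm
```

Passing arch="amd64" makes the same call return medium-x86, the next cheapest fit in this made-up catalog. A fixed node group of large-x86 machines would have paid for the largest size regardless.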

How the provider works

The provider implements Karpenter’s CloudProvider interface against the Hetzner Cloud API:

  • Cheapest node that fits. It reads Hetzner’s live per-location pricing and offers Karpenter the lowest-cost server type that satisfies the pending pods, across shared (CPX, CX) and dedicated (CCX) CPU.
  • x86 and Arm. Both architectures are first class, so you can steer workloads onto the cheaper Ampere (CAX) machines wherever your images support arm64.
  • Talos and Ubuntu. A node joins from a Talos machine config or an Ubuntu cloud-init document, supplied through a Kubernetes Secret so no join credentials ever sit in a manifest.
  • Drift and consolidation. Empty and underused nodes are removed automatically. A server whose image, network, firewall, server type, location or labels no longer matches its HCloudNodeClass is flagged as drifted and replaced.
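
As a rough illustration of the drift check in the last bullet, the Python sketch below compares a server's observed fields with what its node class asks for. The dictionaries, the field names, and the drifted_fields helper are assumptions made for this example. It compares single values for equality, while an HCloudNodeClass expresses some of these as selectors or lists (locations, imageSelector), so treat it as the shape of the idea only.

```python
# The six properties the post lists as drift triggers.
DRIFT_FIELDS = ("image", "network", "firewall", "server_type", "location", "labels")


def drifted_fields(desired, observed):
    """Return the names of the fields where the server no longer matches."""
    return [f for f in DRIFT_FIELDS if desired.get(f) != observed.get(f)]


# Hypothetical desired state; values are placeholders.
node_class = {
    "image": "talos-v1.13.3-gvisor",
    "network": 123456,
    "firewall": None,
    "server_type": "example-type",
    "location": "nbg1",
    "labels": {"pool": "default"},
}

# A server that was built from an older image.
server = dict(node_class, image="talos-older-example")

print(drifted_fields(node_class, server))      # ['image']
print(drifted_fields(node_class, node_class))  # []
```

A non-empty result is what would mark the node as drifted and queue it for replacement.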

One detail worth calling out, because it is the kind of thing that quietly breaks autoscalers: Karpenter garbage-collects nodes by periodically listing the instances the provider manages and deleting any NodeClaim whose server is gone. Our List() is scoped by a karpenter.sh/cluster label, so two clusters sharing a single Hetzner project never see each other’s servers and can never garbage-collect each other’s nodes. Getting that boundary right is the difference between an autoscaler and an outage.
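
The effect of that scoping can be sketched in a few lines of Python. Everything here is hypothetical: the server and NodeClaim dictionaries, the cluster names, and the three helper functions. The unclaimed_servers check also assumes a cleanup pass that removes servers with no NodeClaim, which the post does not describe; it is included only to show what an unscoped listing would put at risk.

```python
CLUSTER_LABEL = "karpenter.sh/cluster"


def list_managed(servers, cluster):
    """Return only the servers labelled as belonging to `cluster`."""
    return [s for s in servers if s["labels"].get(CLUSTER_LABEL) == cluster]


def orphaned_claims(node_claims, listed):
    """NodeClaims whose server is missing from the listed servers."""
    live = {s["id"] for s in listed}
    return [c for c in node_claims if c["server_id"] not in live]


def unclaimed_servers(node_claims, listed):
    """Listed servers that no NodeClaim in this cluster points at."""
    claimed = {c["server_id"] for c in node_claims}
    return [s for s in listed if s["id"] not in claimed]


# One Hetzner project shared by two clusters (IDs invented).
project = [
    {"id": 1, "labels": {CLUSTER_LABEL: "blue"}},
    {"id": 2, "labels": {CLUSTER_LABEL: "green"}},
]
# The blue cluster's NodeClaims; server 3 has already been deleted.
blue_claims = [
    {"name": "claim-a", "server_id": 1},
    {"name": "claim-b", "server_id": 3},
]

scoped = list_managed(project, "blue")
print([c["name"] for c in orphaned_claims(blue_claims, scoped)])   # ['claim-b']
print([s["id"] for s in unclaimed_servers(blue_claims, scoped)])   # []
print([s["id"] for s in unclaimed_servers(blue_claims, project)])  # [2]
```

The last line is the failure mode: with an unscoped listing, the green cluster's server looks like a leak to the blue cluster.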

Install

The controller ships as a multi-arch image and an OCI Helm chart, both signed with cosign and published with SLSA provenance and an SBOM.

kubectl create namespace karpenter
kubectl -n karpenter create secret generic hcloud-token \
  --from-literal=token="$HCLOUD_TOKEN"
helm install karpenter-provider-hetzner \
  oci://ghcr.io/paperclipinc/charts/karpenter-provider-hetzner --version 1.0.0 \
  --namespace karpenter \
  --set clusterName=my-cluster \
  --set auth.secretRef.name=hcloud-token

You also need Karpenter’s own NodePool and NodeClaim CRDs installed, the same as any Karpenter setup.

Define your nodes

An HCloudNodeClass says how to build a node: which network, which image, how to bootstrap it. Here is a Talos worker on a private network, pinned to an exact image by label so you get the precise Talos version and the system extensions you baked in (gVisor, in our case):

apiVersion: karpenter.hetzner.cloud/v1alpha1
kind: HCloudNodeClass
metadata:
  name: default
spec:
  locations: ["nbg1"]
  networkID: 123456 # your Hetzner private network ID
  imageSelector:
    family: talos
    selector:
      caph-image-name: talos-v1.13.3-gvisor # pin the exact snapshot
  userDataSecretRef: # Talos worker machineconfig, from a Secret
    namespace: karpenter
    name: talos-worker
    key: userData
  placementGroupStrategy: spread # spread nodes across Hetzner hardware
  enablePublicIPv4: false # private-network cluster, skip the IPv4 charge

A NodePool says what Karpenter is allowed to launch and when to consolidate:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.hetzner.cloud
        kind: HCloudNodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  limits:
    cpu: "32"
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

That is the whole setup. Karpenter now provisions the cheapest server that fits your pending pods and removes it once it is idle.

Cheaper nodes on Arm

The cheapest right-sized node is often an Ampere Arm machine. Define a second NodePool that only offers arm64, and let your arm64-capable workloads land there:

requirements:
  - key: kubernetes.io/arch
    operator: In
    values: ["arm64"] # Karpenter picks from Hetzner's CAX line

Pods select their architecture with a normal nodeSelector or nodeAffinity on kubernetes.io/arch. As long as your images are multi-arch, the work flows to whichever pool is cheaper for it.

Operating it

The controller exposes Prometheus metrics under the karpenter_hetzner_ prefix and ships a ServiceMonitor for Prometheus Operator. A couple of queries we watch:

# how often nodes are being replaced, by reason
sum by (reason) (rate(karpenter_hetzner_drift_detected_total[1h]))
# p90 server-create latency
histogram_quantile(0.9, rate(karpenter_hetzner_server_create_duration_seconds_bucket[1h]))
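
If the p90 number looks odd, it helps to remember how Prometheus derives it: histogram_quantile finds the bucket that contains the requested rank and interpolates linearly inside it. The Python sketch below mirrors that calculation for positive-valued buckets. The bucket boundaries and counts are invented, not real measurements from the controller, and the function assumes 0 < q <= 1 and at least one finite bucket.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    buckets: (upper_bound, cumulative_count) pairs, ending with (math.inf, total).
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, below = buckets[i - 1] if i > 0 else (0.0, 0)
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Invented server-create durations, in seconds.
example = [(5, 10), (10, 60), (30, 95), (math.inf, 100)]
print(round(histogram_quantile(0.9, example), 2))  # 27.14
```

The estimate can only be as fine as the bucket layout, so a p90 that sits in a wide bucket is an approximation rather than a measured value.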

When a NodeClass will not become ready, the controller writes a Warning Event explaining why (a missing network, an unresolvable image, a bad userData Secret), so kubectl describe hcloudnodeclass gives you the answer instead of leaving you to guess.

Why we open-sourced it

We run on open infrastructure, and we give back to it. Our Kubernetes operators are open, and so is this. Running a managed cloud well means depending on tools you can read, fix, and carry with you, and the fastest way to keep those tools honest is to share them. If you run Kubernetes on Hetzner, you should not have to rebuild this layer yourself.

It is also how we think the cost question should be answered. The cheapest cloud bill is the one where every machine is the right size for the moment, added when the work shows up and removed when it leaves. That discipline is part of how we keep a sovereign European cloud affordable, and it is now yours to use too.

Try it

helm install karpenter-provider-hetzner \
  oci://ghcr.io/paperclipinc/charts/karpenter-provider-hetzner --version 1.0.0

The code, the Helm chart, and the Talos and Ubuntu bootstrap guides are on GitHub. We run it in production, so if you hit something, open an issue. We will be in there too.

FAQ

Does Karpenter work on Hetzner Cloud?

Yes. The open-source Karpenter provider for Hetzner implements Karpenter’s CloudProvider interface against the Hetzner Cloud API, so Karpenter can provision, consolidate, and replace Hetzner servers as Kubernetes nodes.

How is it different from the Cluster Autoscaler on Hetzner?

The Cluster Autoscaler scales fixed node groups of a predefined server type. Karpenter picks the cheapest server type that fits the pending pods from the whole Hetzner catalog, with no node groups to size by hand, and consolidates idle nodes automatically.

Does it support Arm and Talos?

Yes. Both amd64 and arm64 (the Ampere CAX line) are first class, and nodes can bootstrap from a Talos machine config or Ubuntu cloud-init supplied through a Kubernetes Secret.

Is it production-ready?

It is at v1.0.0, with unit and controller tests and an end-to-end suite (provision, join, drift, consolidation) validated on a live Talos cluster. Releases are signed and ship an SBOM and provenance. We run it in our own production cloud.