Doghouse

Site Reliability Engineer

  • Terraform
  • Docker
  • Python

About the company

Our client is building a cloud platform for high-throughput, compute-heavy workloads. They operate large-scale infrastructure where failure modes are real, capacity is finite, and reliability needs to be engineered, not "handled".

About the role

We're seeking a Senior SRE who will own production reliability end-to-end for our client: define SLIs/SLOs, run error budget conversations, and ship changes that reduce incidents and improve latency (p95/p99). You'll build automation to kill toil, improve deployment safety (canary/rollback), and turn observability into signal rather than noise.

This is a bare-metal environment: think Linux, datacenters, physical fleets, and real hardware constraints, not managed services. You'll work close to the metal across Kubernetes internals (scheduling, autoscaling behavior, kubelet pressure/evictions, etcd/control plane), Linux performance (CPU/memory/IO contention), and network debugging (DNS/TCP/TLS, packet loss, congestion). On-call is part of the job, but success is measured by how much you reduce it.

Requirements & Benefits

• Production Engineering experience running bare metal / on-prem / data center infrastructure (not public cloud only)
• Deep hands-on expertise in Linux systems debugging and performance (CPU, memory, IO, kernel-level behaviors)
• Strong understanding of networking (DNS/TCP/TLS, latency, packet loss, congestion, troubleshooting under load)
• Strong Kubernetes experience beyond manifests: scheduler behavior, autoscaling edge cases, kubelet pressure/evictions, etcd/control plane
• Experience with Terraform, Docker, Helm, and modern CI/CD practices
• Coding skills in Go, Python, and/or C

Location: Amsterdam – hybrid
Total compensation: up to €180k

If you're looking for complexity and a new place to nerd out on infrastructure optimization, we'd love to hear from you.