Infrastructure Software Engineer
Software / AI Infrastructure Engineer – GPU Clusters – Up to Principal – Remote
This position is open to candidates working remotely in the United States or Canada.
Our client is a cloud technology company driving the next generation of AI infrastructure. They empower organizations to build and scale AI and ML solutions without the need for large in-house teams or heavy upfront infrastructure costs. Their global team of engineers works at the forefront of GPU cloud computing, supporting businesses across industries to solve complex, real-world problems.
The company operates with a flat structure, minimal bureaucracy, and a strong focus on ownership, speed, and technical excellence. Engineers work closely with customers and internal teams to design scalable solutions and influence product direction, creating direct impact on how modern AI platforms are built and operated.
The Role
They are looking for someone to build the automation and lifecycle systems that power a global, large-scale GPU-cluster fleet. This is a hands-on engineering role at the intersection of software and physical infrastructure. You will design and build systems that provision, configure, test, and manage physical hardware at scale, working close to the metal by interfacing directly with servers, networks, and management controllers to support highly automated and reliable infrastructure operations.
You will work with cutting-edge NVIDIA hardware that most engineers never get close to, and you will help design systems that are often redesigned within weeks, because that is the pace. If you thrive in environments where speed, autonomy, and real engineering ownership matter, this role is for you.
Responsibilities
- Design and develop backend services and automation tooling in Python
- Build and maintain provisioning, testing, and lifecycle management systems for physical hardware, including software that runs directly on bare-metal environments
- Integrate with Linux systems using shell scripting and low-level tooling, and implement CI/CD pipelines for infrastructure-focused software
- Work across networking layers (IPv4/IPv6, DHCP, DNS, network boot) and interface with hardware management controllers and their protocols
- Design NoSQL data stores for system state and orchestration
- Support ARM64 architectures, write clear documentation, and maintain operational excellence across large machine fleets
What You'll Bring
- Strong Python engineering experience with a solid Linux and shell scripting background
- Hands-on familiarity with bare-metal servers, networking fundamentals, and hardware management interfaces and APIs
- Experience with CI/CD pipelines and NoSQL databases
- The ability to debug complex issues spanning software, hardware, and networks
- A strong ownership mindset and clear communication skills in a distributed team
Nice to Have
- Experience operating infrastructure at large scale, running ARM platforms in production, or working in hardware testing and factory provisioning
- A background in infrastructure automation, internal platform tooling, or open-source systems software
Interview Process
- Preliminary interview
- Technical coding interview
- Final technical deep dive
The Offer
- Base salary up to 250K USD plus bonus and RSUs
- Remote role within the US/Canada
- No take-home assignments at any stage of the process