[Remote] Principal Site Reliability Engineer - AI Infrastructure Operations

100% Remote Full-time Open now

Note: This is a remote role open to candidates in the USA. Nscale is a GPU cloud provider focused on AI, offering high-performance infrastructure for AI start-ups and large enterprises. The company is seeking a Principal Site Reliability Engineer to lead reliability strategy, design foundational systems, and drive operational excellence across its AI Infrastructure Operations team.

Responsibilities

  • Owning and evolving the long-term reliability strategy for Nscale’s AI and HPC infrastructure
  • Designing and leading the development of large-scale control-plane systems, automation frameworks, and operational tooling
  • Defining reliability standards, SLO frameworks, and operational best practices used across multiple teams
  • Acting as a senior technical escalation point during critical incidents, guiding resolution and ensuring systemic fixes
  • Identifying structural reliability risks and driving cross-functional initiatives to address them at the architectural level
  • Partnering with Engineering, Network Operations, and Fleet Operations leadership to influence platform design and operational maturity
  • Mentoring senior and mid-level engineers, raising the overall quality and effectiveness of SRE practices
  • Driving measurable improvements in availability, MTTR, cost efficiency, and operational scalability

Skills

  • 10+ years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering roles operating complex, large-scale infrastructure
  • Expert-level software engineering skills, with a strong track record of building production-grade automation and systems
  • Deep expertise in Linux, networking, and distributed systems design at scale
  • Extensive experience debugging and resolving failures across hardware, OS, networking, and application layers
  • Proven ability to lead technical initiatives across teams without direct authority
  • Strong systems-thinking mindset, with the ability to balance reliability, velocity, and cost
  • Deep hands-on experience with AI or HPC platforms, including GPUs, high-speed interconnects (InfiniBand/RDMA), and workload schedulers (e.g. SLURM)
  • Experience designing observability systems for high-cardinality, high-throughput environments
  • Familiarity with Kubernetes at scale and hybrid or bare-metal cloud architectures
  • A history of driving step-change improvements in reliability, scalability, or operational efficiency

Benefits

  • Highly competitive package (base + equity) with reviews every 12 months.
  • In addition to base salary, this role may be eligible for bonus, equity, and/or commission programs.
  • Nscale may offer a competitive benefits package including medical, dental, vision, flexible paid time off, parental leave, and retirement plan participation.
  • Human-First Flexibility: We treat you as humans first. 🫶🏽 Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.
  • Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you, wherever you work.

Company Overview

  • Nscale builds AI data centers and provides GPU cloud infrastructure that companies use to train, run, and scale large AI models. It was founded in 2024 and is headquartered in London, England, with a workforce of 201-500 employees. Its website is https://www.nscale.com.