[Remote] Senior Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. Autodesk helps innovators turn their ideas into reality through software. The company is seeking a Senior Site Reliability Engineer to build and operate reliable, secure, and scalable cloud services for Autodesk GovCloud products, with a focus on improving production services and establishing operational excellence practices.
Responsibilities
- Serve as a primary owner for the reliability, availability, performance, operability, and capacity of one or more production services
- Deploy, operate, maintain, and continuously improve production services running in Autodesk GovCloud environments
- Partner with engineering teams to ensure services are designed with reliability, scalability, security, and operability in mind
- Define and operate reliability practices such as SLOs/SLIs, error budgets, production readiness reviews, service reviews, and operational health reviews
- Build automation to improve deployment safety, operational efficiency, incident response, and service recovery
- Design, develop, and maintain software, automation, and tooling that improve the reliability, scalability, and efficiency of production systems
- Implement and improve monitoring, alerting, logging, tracing, and observability capabilities across supported services
- Lead and participate in incident response, troubleshooting, and post-incident reviews focused on learning and continuous improvement
- Develop and maintain operational documentation, runbooks, and recovery procedures
- Scale and enhance resilience testing and Gameday practices to validate system behavior, recovery capabilities, and operational readiness
- Continuously identify and eliminate operational toil through software engineering, automation, and process improvement
- Ensure supported services remain compliant with Autodesk security, privacy, and regulatory requirements, including FedRAMP and related controls where applicable
- Participate in a 24x7 on-call rotation for production services
- Function effectively in a fast-paced environment while helping establish and mature operational excellence practices for Autodesk GovCloud
Skills
- B.S. or higher in Computer Science, Engineering, or a related technical discipline, or equivalent practical experience
- 7+ years of experience in Site Reliability Engineering, Software Engineering, Platform Engineering, Cloud Infrastructure, or Production Operations
- Experience operating and supporting customer-facing production services in large-scale cloud environments
- Strong understanding of reliability engineering principles, including SLOs/SLIs, observability, incident management, capacity planning, production readiness, and automation
- Experience with AWS, Azure, or other public cloud platforms
- Experience developing automation using languages such as Python, Go, Java, PowerShell, Bash, or similar
- Experience with Infrastructure as Code, CI/CD pipelines, deployment automation, and modern cloud operations practices
- Understanding of security, compliance, and operational risk management in production environments
- Strong written and verbal communication skills
- 10+ years of experience operating highly available, customer-facing production systems (preferred)
- Experience with AWS GovCloud, FedRAMP, IL4/IL5, or other regulated cloud environments
- Experience supporting services with stringent availability, reliability, and security requirements
- Experience with containers, Kubernetes, cloud-native architectures, APIs, load balancing, networking, DNS, and distributed systems
- Experience with observability platforms such as Splunk, Dynatrace, Datadog, CloudWatch, or similar technologies
- Experience operating databases, storage platforms, messaging systems, and caching technologies
- Experience designing and implementing operational automation at scale
- Experience leading or participating in Gamedays, disaster recovery exercises, resilience testing, or operational readiness reviews
- Strong incident management experience, including technical leadership during major incidents and stakeholder communication
- Strong collaboration skills and ability to work effectively across engineering, security, compliance, and operations teams
- Passion for building reliable, secure, and scalable systems that customers can trust
Benefits
- Annual cash bonuses
- Commissions for sales roles
- Stock grants
- A comprehensive benefits package
Company Overview