Site Reliability Engineer (SRE)
Position: Site Reliability Engineer (SRE)
Location: Fully Remote (Offices in Limassol, Kyiv, London, Tbilisi)
Working Hours: Availability to work between 5 PM and 8 AM CET
Company Overview:
Our client is one of the fastest-growing B2B iGaming solutions providers in Europe, with over 100 remote team members across the continent. They specialize in delivering high-quality software platforms, payment solution integrations, marketing tools, and technical support to clients in the online casino and betting sectors. As they continue to expand, they are looking for a talented and growth-oriented individual to help enhance and streamline their infrastructure. The company offers a dynamic and supportive environment where your input is valued and your professional growth is encouraged. Don’t miss the opportunity to join their exciting journey!
Role Overview:
As a Site Reliability Engineer (SRE), you will bridge the gap between development and operations to ensure that services and platforms remain reliable, scalable, and performant, even under high transaction volumes and regulatory requirements. You will work closely with backend engineers, DevOps, InfoSec, and operational teams to build automation, improve observability, and respond to incidents.
Key Requirements:
- 2–5 years in SRE / Infrastructure / Platform / Production DevOps
- Strong Linux experience in production
- Networking: TCP/IP, DNS, HTTP, load balancers, TLS
- Kubernetes in production (cluster ops, networking, ingress)
- AWS experience (EC2, ALB/NLB, RDS, S3, IAM, EKS or self-managed K8s)
- Terraform, Ansible (IaC), Helm (optional)
- Observability tools: Prometheus, Alertmanager, Grafana, ELK, Loki
- Containers and image lifecycle (Docker)
- Troubleshooting across application, network, and infrastructure layers
- CI/CD pipelines: Jenkins, GitLab CI, GitHub Actions, ArgoCD
- Incident response experience and participation in post-incident reviews
- Availability for late-evening and night shifts (17:00–01:00 or 00:00–08:00 CET)
Bonus Skills:
- Experience with high-load or real-time systems
- CDNs, log aggregation, real-time analytics
- Scripting: Python, Bash, Go
- Knowledge of Java/PHP ecosystems
- Databases: PostgreSQL, MySQL, MongoDB
- Message systems: Kafka, Redis, RabbitMQ
- External API integrations
Key Responsibilities:
- Ensure reliability, scalability, and performance of distributed services
- Operate and improve Kubernetes clusters
- Manage AWS-based infrastructure
- Build and maintain IaC with Terraform and Ansible
- Enhance monitoring, logging, and alerting stacks
- Handle production incidents end-to-end and reduce MTTR
- Maintain SLOs, SLIs, and error budgets for critical systems
- Automate operations and reduce manual toil
- Collaborate with engineering teams to embed SRE practices
Success Metrics: