Site Reliability Engineer
Job Ref: 1120873
What’s it all about?
We’re helping an international software and Gen AI business that’s not just growing—it’s thriving! They’ve built a platform handling over 25 billion events daily and are looking for an experienced Site Reliability Engineer to join the journey.
The Site Reliability Engineering Team here is responsible for provisioning and maintaining the cloud infrastructure from development through production, and for working with the wider Engineering and Product Teams to ensure the product suite is not only reliable but also cost-efficient. The platform is built on Google Kubernetes Engine and leverages several other Google technologies such as Memorystore, Cloud Datastore, Pub/Sub, BigQuery and Vertex AI, as well as services from other vendors such as Amazon SES.
You’ll work in an environment where your ideas aren’t just heard—they’re implemented! From shaping the infrastructure to collaborating across teams, you’ll play a pivotal role in keeping things running smoothly while innovating for what’s next.
What you’ll be doing?
- Automate everything: You’ll build and manage infrastructure with tools like Terraform and Ansible, ensuring systems are scalable, secure, and efficient
- Solve real problems: Debug production issues, fix them fast, and put measures in place to prevent them
- Build smarter systems: Design and monitor new services alongside developers, ensuring SLIs and SLOs align with performance and reliability goals
- Collaborate across teams: Work with developers, product managers, and security to keep the platform cutting-edge and compliant
- Stay ahead of the curve: Proactively manage capacity and plan for growth
What you bring?
- Cloud expertise: You know your way around cloud infrastructure, with hands-on experience using Terraform, Ansible, or similar tools
- Coding chops: Proficiency in Python, Go, or a similar language—and the willingness to pick up new ones as needed
- Systems thinker: You understand how systems fail, where bottlenecks hide, and how to design for resilience
- Great communicator: You can translate technical details into clear, actionable documentation for your team
- Metrics-driven mindset: You can talk performance, cost analysis, and operational metrics like a pro
Why we're excited
- Scale: Be part of a team powering a platform processing over 25 billion events a day—and growing fast
- Progression opportunities: Whether you want to deepen your technical skills or step into leadership, the path is yours to shape
- Broad exposure: Work with cutting-edge tech like Kubernetes, BigQuery, Vertex AI, and more
- Global impact: What you build here matters—your work will directly influence systems that countless businesses across the world depend on
- Dynamic environment: Forget the corporate grind—this is a place where ideas flow, growth happens, and your work truly makes a difference
Why this role is different?
This isn’t just about keeping the lights on. You’ll be at the heart of a team that doesn’t settle for “good enough.” From scaling the infrastructure to improving processes and mentoring others, you’ll shape the future of a rapidly growing platform—and your own career along the way.
Working Policy?
Hybrid: think around 3 days a week in the office in Sheffield.
For full details, email Daniel Koseoglu: daniel@affecto.co.uk or call 0114 401 0521.
© 2025 Affecto Recruitment Ltd