Senior Engineer, Site Reliability Engineering (SRE)

Our client is an internationally renowned trading firm looking for a passionate and talented engineer to cultivate the SRE philosophy, processes and technologies from within the company. This person will join the Platform Group but will be the individual driving SRE across the entire company, collaborating with talented individuals from the DevOps and Cloud teams to create excellence across the firm. The role requires hands-on engineering excellence, both on-premises and in the cloud, to create reliable and scalable trading and production platforms. Take the opportunity to transform a globally elite trading firm and increase performance across all aspects of engineering.

Responsibilities:

Promote SRE best practices to optimize infrastructure for scale
Take ownership of the firm's processes and systems, conducting postmortems and creating a culture of perfection
Lead by example in SRE principles and engineering across teams
Use observability and monitoring tools such as Prometheus, Grafana, Loki and Amazon CloudWatch for clarity on performance and health
Define Sentry standards for tracking application reliability and issues
Establish reliability standards for the Kubernetes environment, structured for top performance and efficiency
Automate where appropriate
Collaborate across teams and with developers to improve fault tolerance through blameless postmortems and SLOs
Review performance metrics and SLOs to spot issues before they escalate

Core Tech Stack:

Languages: Python, Java, Node.js, C#, Shell
Public cloud: AWS
CI/CD: TeamCity, Octopus, Jenkins
Configuration Management: Puppet, Ansible
Infrastructure as Code: Terraform, CloudFormation
Application Management: Kubernetes, Docker, Helm
OS: Linux and Windows
Observability: Prometheus, Amazon CloudWatch, Sentry, Grafana, Loki

Requirements:

Minimum 5 years in SRE or a related field, handling complex low-latency systems.
Bachelor's in engineering, computer science, or similar, or equivalent experience.
Proficiency in SRE tools such as Prometheus, Grafana, and Amazon CloudWatch.
Expertise in Kubernetes, Docker, and both cloud and on-premises hosting.
Skilled in scripting with Python, Bash, or Go for automation tasks.
Strong background in CI/CD, agile practices, and Linux administration.
Excellent problem-solving, communication, and documentation skills.
