Senior Engineer, Site Reliability Engineering (SRE)
Our client is an internationally renowned trading firm looking for a passionate and talented engineer to cultivate the SRE philosophy, processes, and technologies from within the company. This person will join the Platform Group but will be the individual driving SRE across the entire company, collaborating with talented individuals from the DevOps and Cloud teams to create excellence across the firm. The role requires hands-on engineering excellence both on-prem and in the cloud to create reliable and scalable trading/production platforms. Take the opportunity to transform a globally elite trading firm and increase performance across all engineering aspects.
Responsibilities:
- Promote SRE best practices to optimize infrastructure for scale
- Take ownership of the firm's processes and systems, conducting post-mortems and creating a culture of perfection
- Lead by example in SRE principles and engineering across teams
- Use observability/monitoring tools like Prometheus, Grafana, Loki, and Amazon CloudWatch for clarity on performance and health
- Define Sentry standards for application reliability and issues
- Create reliability standards for the Kubernetes environment and a structure for top performance and efficiency
- Automate where appropriate
- Collaborate across teams and with developers on fault tolerance through blameless post-mortems and SLOs
- Review performance metrics and SLOs to spot issues preemptively
Core Tech Stack:
- Languages: Python, Java, Node.js, C#, Shell
- Public cloud: AWS
- CI/CD: TeamCity, Octopus, Jenkins
- Configuration Management: Puppet, Ansible
- Infrastructure as Code: Terraform, CloudFormation
- Application Management: Kubernetes, Docker, Helm
- OS: Linux and Windows
- Observability: Prometheus, Amazon CloudWatch, Sentry, Grafana, Loki
Requirements:
- Minimum 5 years in SRE or related field, handling complex low-latency systems.
- Bachelor's degree in engineering, computer science, or a similar field, or equivalent experience.
- Proficiency in SRE tools like Prometheus, Grafana, and AWS CloudWatch.
- Expertise in Kubernetes, Docker, and both cloud and on-premises hosting.
- Skilled in scripting with Python, Bash, or Go for automation tasks.
- Strong background in CI/CD, agile practices, and Linux administration.
- Excellent problem-solving, communication, and documentation skills.