As a Site Reliability Engineer, you will diagnose and resolve problems in our highly available production systems and build solutions and automation to prevent future issues. You will collaborate with the team to plan, implement, and maintain Azure cloud-based solutions with a focus on networking, virtual servers, web applications, databases, storage, and overall security.
In this role, you will define and verify standards for configuration, monitoring, reliability, and performance. You will also set up the collection of logs, metrics, and events on production systems to create continuous real-time feedback mechanisms that give the support and development teams insight, including metrics reporting and dashboard creation.
As a Site Reliability Engineer, you will support requests from the Engineering and QA teams for configuration changes, permissions, and access. You will make recommendations to the DevOps and development teams on the reliability, maintainability, availability, security, scalability, and performance of the system, as well as the efficiency of the team.
You will drive and improve the whole lifecycle of operational readiness – from inception and design, through deployment, operation, and refinement. Additionally, you will continually drive down time-to-detect and time-to-resolve through improved outlier detection and real-time root cause analysis.
Furthermore, you will drive organizational awareness of the importance of availability, applying a creative lens to its many facets. You will design, develop, and implement software and processes that improve the stability, scalability, availability, and latency of our enterprise SaaS products. You will perform root cause analysis and work to implement preventative measures.
As a Site Reliability Engineer, you will assist in reviewing and maintaining our security posture using tools such as Azure Secure Score and CIS Benchmarks. You will participate in software releases and deployments and drive the adoption of operational best practices across critical services, continually looking to lower the operational barriers to improved reliability.
You will document and detail areas for improvement to bolster architecture, design, technical requirements, and service specifications. You will provide technical leadership and mentoring in the application of new technologies and systems management methodologies.