Site Reliability Engineer

We are seeking a Site Reliability Engineer who can diagnose and solve problems in our highly available production systems and build solutions to prevent future issues. You will work collaboratively to plan, implement, and maintain Azure cloud-based solutions, and to define and verify standards for configuration, monitoring, reliability, and performance. As a member of our team, you will be responsible for improving the whole lifecycle of operational readiness and for continuously reducing time-to-detect and time-to-resolve through improved outlier detection and real-time root cause analysis.

What will be your key responsibilities:

As a Site Reliability Engineer, you will diagnose and solve problems in our highly available production systems and build solutions and automation to prevent future issues. You will collaborate with the team to plan, implement, and maintain Azure cloud-based solutions with a focus on networking, virtual servers, web applications, databases, storage, and overall security.

In this role, you will define and verify standards for configuration, monitoring, reliability, and performance. You will also set up the collection and routing of logs, metrics, and events on production systems to create continuous real-time feedback mechanisms that provide insights to the support and development teams, including metrics reporting and dashboard creation.

As a Site Reliability Engineer, you will support requests from the Engineering and QA teams for configuration changes, permissions, and access. You will make recommendations to the DevOps and development teams on the reliability, maintainability, availability, security, scalability, and performance of the system, as well as the efficiency of the team.

You will drive and improve the whole lifecycle of operational readiness – from inception and design, through deployment, operation, and refinement. Additionally, you will continually drive down time-to-detect and time-to-resolve through improved outlier detection and real-time root cause analysis.

Furthermore, you will drive organizational awareness of the importance of availability, applying a creative lens to the many facets of availability. You will design, develop, and implement software and processes that improve the stability, scalability, availability, and latency of our enterprise SaaS products. You will perform root cause analysis and work to implement preventative measures.

As a Site Reliability Engineer, you will assist in reviewing and maintaining security posture through facilities such as Azure Secure Score and CIS Benchmarks. You will participate in software releases and deployments and drive operational best practice adoption across critical services, continually looking to lower operational barriers to achieving improved reliability.

You will document and detail areas of improvement to bolster architecture, design, technical requirements, and service specifications. You will demonstrate technical leadership and mentoring on the application of new technologies and systems management methodologies.

What experience should you have:

To be considered for this role, you should have:

Technical expertise:

  • 4+ years of experience in a related position
  • Strong knowledge of Linux/Unix/Windows operating systems
  • Proficiency with infrastructure-as-code tools such as Terraform or CloudFormation
  • Experience with container orchestration tools like Kubernetes, Docker Swarm, or Mesos
  • Familiarity with cloud platforms like Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP)
  • Experience with database technologies such as MySQL, PostgreSQL, or Oracle
  • Proficiency with at least one programming language such as Python, Ruby, or Go
  • Knowledge of networking principles and protocols, including DNS, TCP/IP, and HTTP/HTTPS
  • Experience with monitoring and logging tools such as Prometheus, Grafana, or ELK stack
  • Knowledge of security best practices and experience implementing them in a production environment

Problem-solving skills:

  • Ability to analyze complex systems and identify areas for improvement
  • Experience with root cause analysis and incident management processes
  • Strong troubleshooting skills and ability to quickly resolve issues
  • Familiarity with testing and deployment processes to ensure high-quality software releases
  • Ability to work under pressure and prioritize competing demands
  • Strong communication skills to effectively collaborate with team members and stakeholders

Collaborative abilities:

  • Experience working in a cross-functional team environment, collaborating with developers, DevOps engineers, and other stakeholders
  • Ability to mentor and guide other team members on best practices and new technologies
  • Strong communication and interpersonal skills to effectively share information and build relationships with team members
  • Ability to work in an Agile or DevOps environment with a focus on continuous improvement and collaboration
  • Willingness to work during US business hours

What do you get in return:

Join a team of skilled professionals working collaboratively to maintain highly available production systems and secure cloud-based solutions for our enterprise SaaS products. As a valued member of our team, you will have the opportunity to:

  • Collaborate with experts in their respective fields, including DevOps engineers, developers, and QA testers, to deliver high-quality work and continuously improve our products and systems.
  • Work remotely from anywhere in the world, as part of a fully remote team, and enjoy a mutually agreed schedule, built around core US working hours, that allows you to balance your work and personal life.
  • Join a team of passionate professionals who prioritize maintaining a positive work culture and fostering an environment that encourages growth and learning.
  • Work primarily with US-based colleagues, providing you with the opportunity to collaborate with people from diverse backgrounds and skill sets.
  • Use your technical expertise to make a significant impact on the reliability, maintainability, availability, security, scalability, and performance of our products and systems, and ensure the protection of our customers' data from cyber threats.
  • Work in a supportive environment that values your contribution, provides you with the resources and training you need to grow in your career, and encourages technical leadership and mentoring.
  • Enjoy a 40-hour workweek that provides you with a healthy work-life balance and the time to pursue your personal and professional goals outside of work.

At our company, we are committed to providing our employees with a rewarding work experience, a positive culture, and opportunities for growth and development. Join us and help us build the best products and systems in the industry while advancing your career and achieving your personal and professional goals.
