SITE RELIABILITY ENGINEER
We're seeking a Site Reliability Engineer (SRE) with on-call duties to manage and oversee our SAFEQ Cloud print services hosted on AWS. This new colleague will play a critical role in ensuring system reliability, embedding SRE principles in the company's culture and processes, and responding to system emergencies as part of a 24/7 on-call setup.
RESPONSIBILITIES:
- Monitor and manage the SAFEQ Cloud print service, ensuring high availability and reliability within the AWS environment.
- Develop and implement tools and practices for automating routine tasks to improve system scalability and resilience.
- Set up alerts and monitoring metrics for proactive identification and mitigation of system issues.
- Participate in capacity planning and performance tuning.
- Collaborate with software engineering teams to ensure seamless deployment, efficient troubleshooting, and effective crisis management.
- Conduct root cause analysis and post-mortems following system incidents; define corrective actions and preventive measures.
- Act as an educator and advocate for SRE best practices; train and mentor cross-functional teams in SRE principles.