As a Site Reliability Engineer, you will play a key role in keeping our AWS-based cloud data platform reliable and running smoothly. You will refine and expand existing observability practices, design and automate custom alerts, and create CI/CD workflows and playbooks in GitLab for service upgrades and security patches.
You will also write runbooks and playbooks for system updates and disaster recovery scenarios. A critical part of your role will be reducing alert noise by setting appropriate thresholds and implementing dynamic alerting. In addition, you will establish a framework for manual health checks across the system, including daily and weekly routines.