Senior Site Reliability Engineer (SRE) - Mission-Critical SaaS Cloud Products

Product · Contract · Atlanta, United States

Job description

Key Responsibilities

Reliability and Performance Management

  • Design, implement, and maintain highly available, scalable, and resilient cloud-native architectures for mission-critical SaaS products.

  • Develop and implement SLOs, SLIs, and SLAs to measure and improve service reliability.

  • Continuously optimize system performance and resource utilization across multiple cloud platforms.

  • Fine-tune and optimize application performance by analyzing code, traces, and database queries.

Incident Management and Troubleshooting

  • Lead incident response efforts, effectively troubleshooting complex issues to minimize downtime and impact.

  • Reduce Mean Time to Recover (MTTR) through proactive monitoring, automated alerting, and efficient problem-solving techniques.

  • Conduct thorough Root Cause Analysis (RCA) for all major incidents and implement preventive measures.

Observability and Monitoring

  • Design and implement end-to-end observability solutions across our distributed systems.

  • Develop and maintain comprehensive monitoring strategies using tools such as the ELK Stack, Prometheus, and Grafana.

  • Create and optimize product status dashboards to provide real-time visibility into system health and performance.

Automation and Infrastructure as Code (IaC)

  • Implement Infrastructure as Code practices using tools like Terraform.

  • Develop and maintain automated deployment pipelines and CI/CD workflows.

  • Create self-healing systems and automate routine operational tasks to reduce manual intervention.

Cloud-Agnostic Architecture

  • Design and implement cloud-agnostic solutions that can operate efficiently across multiple cloud providers.

  • Develop expertise in event-driven architectures and related technologies (e.g., Apache Kafka, Azure Event Hubs, Redis, MongoDB Atlas, Azure IoT Hub).

  • Implement and manage containerized applications using Kubernetes across different cloud environments.

Continuous Improvement

  • Regularly review and refine operational practices to enhance efficiency and reliability.

  • Stay updated with the latest industry trends and technologies in SRE, cloud computing, and DevOps.

  • Contribute to the development of internal tools and frameworks to support SRE practices.

Requirements

  • Strong knowledge of cloud platforms, particularly Azure and its associated services.

  • Expertise in observability tools (ELK Stack, Dynatrace, Prometheus).

  • Expertise in containerization technologies such as Docker and Kubernetes.

  • Understanding of event-driven architecture and database technologies (MongoDB Atlas, Azure SQL, PostgreSQL).

  • Proficiency in IaC and CI/CD tools such as Terraform and GitHub Actions.

  • Proficiency in one or more of Python, .NET, or Java.

  • Strong understanding of networking concepts, load balancing, and security practices.

