Senior Site Reliability Engineer

Engineering · Full-time · Remote

Job description

🌎 About Us

At TeamSnap, we believe that when the world connects through sports, the world becomes better. TeamSnap is a sports and communication platform dedicated to taking the work out of play in youth sports. We also believe our jobs should excite us, our teammates should support us and our bosses should inspire us. We empower our people to bring big ideas and tiny egos, landing us on Outside Magazine’s list of “Best Places to Work” and Built In’s “100 Best Remote-First Places to Work.”

TeamSnap is seeking a Senior Site Reliability Engineer to join our remote infrastructure team. You will play a pivotal role in ensuring a seamless experience for both our developers and users. By driving improvements in the development lifecycle, automating tasks, and taking our development tools to the next level with AI, you'll be the backbone of our product initiatives.

As a key member of our engineering team, you will work alongside our infrastructure team to architect and build scalable, highly available systems that serve millions of daily users and some of the largest youth and amateur sports organizations in the world. We value collaboration and regularly participate in pair sessions and virtual team swarms to stay connected and improve the team and company.

What You'll Do:

  • You'll build scalable, reliable systems using cutting-edge technologies like Kubernetes, Docker, Terraform and public cloud platforms, ensuring our applications reach a global audience.
  • Collaborating across teams, you'll identify pain points in the development lifecycle and build tools to improve efficiency and reliability.
  • You'll also be on the front lines during incidents, working closely with engineers across the company to quickly resolve issues and strengthen our infrastructure.
  • You'll be a champion for system reliability, continuously optimizing performance, monitoring systems, and leading incident response efforts.
  • By proactively addressing issues and exploring innovative solutions, you'll ensure the smooth operation and resilience of our platform, providing an exceptional user experience.

What Will Set You Up for Success:

  • 5+ years of SRE or equivalent experience: Demonstrated success building and maintaining large-scale production systems.
  • Experience with Kubernetes, Docker, cloud platforms (ideally GCP), and IaC tools like Terraform, and a proven ability to monitor, scale, debug and harden web services and APIs.
  • Strong analytical and communication skills, with experience working with product engineers and participating in on-call rotations.
  • Proficiency in at least one of our core languages (Go, Elixir, TypeScript) to automate and improve operational efficiency.
  • Experience guiding and mentoring junior engineers to grow their skills and knowledge.

Bonus Points:

  • Experience defining and implementing SLOs.
  • Advanced knowledge of monitoring, logging, and tracing tools (e.g., Prometheus, Grafana, ELK stack).
  • Expertise in GCP, AWS or Azure (especially in areas relevant to observability).
  • Interest in applying generative AI to SRE practices.