Senior Cloud Ops Engineer

Operations · Full-time · Lehi, United States

Job description

Since its inception in 2003, when visionary college students set out to transform online rent payment, Entrata has evolved into a global leader serving property owners, managers, and residents. Entrata has been honored with awards including the Utah Business Fast 50, the Silicon Slopes Hall of Fame (Software Company, 2022), and the Women Tech Council Shatter List. Our comprehensive software suite spans rent payments, insurance, leasing, maintenance, marketing, and communication tools, reshaping property management worldwide.

Our 2,200+ global team members, from top executives to part-time employees, embody intelligence and adaptability and engage actively in the business. With offices across Utah, Texas, India, and the Netherlands, Entrata blends startup innovation with established stability, evident in our transparent communication values and executive town halls. Our product isn't just desirable; it's industry essential. At Entrata, we passionately refine living experiences, uphold collective excellence, embrace boldness and resilience, and prioritize diverse perspectives, endeavoring to craft a better world to live in.

The Cloud Operations Engineering team is central to the continuous improvement of our products, people, processes, and technologies. As a Senior Cloud Operations Engineer, you will play a critical role in driving the reliability and scalability of our production systems while enabling development teams to focus on delivering innovative new features and services.

Responsibilities:

  • Lead efforts to enhance the reliability, repeatability, and flexibility of our production systems by developing and utilizing software tools that streamline operations.
  • Mentor and promote a CloudOps mindset across the development organization, fostering a culture of site reliability engineering and DevOps best practices.
  • Collaborate with Engineering, Architecture, and InfoSec teams to improve operational health, security, growth, usability, and quality of our applications.
  • Develop and implement comprehensive monitoring, logging, tagging, and other feedback mechanisms to ensure transparency and improve the customer experience.
  • Continuously enhance system performance and reliability by creating and maintaining frameworks for Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs).
  • Optimize incident response processes through alerting, troubleshooting, automation, playbooks, and root-cause analysis, and actively participate in the Incident Response team.
  • Leverage cloud technologies to improve performance, reliability, quality, and cost-efficiency.
  • Drive the deployment, scaling, and management of distributed systems on cloud platforms like AWS, with a focus on cloud-native architecture and application performance.

Minimum Qualifications:

  • 7+ years of software development experience, including at least 2 years in a senior role focused on Site Reliability Engineering (SRE), DevOps, or platform automation.
  • Strong expertise in building and expanding Application Performance Monitoring (APM) systems such as New Relic or Dynatrace.
  • In-depth understanding of modern cloud-native architecture, with experience in building, deploying, and managing distributed systems on AWS or other cloud providers.
  • Proficiency with CI/CD tools such as GitHub Actions, CircleCI, Jenkins, or similar.
  • Hands-on experience with Kubernetes (K8s) for orchestration and Argo CD for continuous deployment in cloud environments.
  • Strong analytical skills for debugging, troubleshooting, and resolving complex technical problems.
  • Fluency in one or more programming languages, along with familiarity with scripting languages.
  • Ability to manage on-call duties and respond to out-of-band requests as needed.

Preferred Qualifications:

  • 5+ years of experience as a Site Reliability Engineer in a cloud environment, preferably with AWS.
  • AWS Certifications.
  • Extensive experience in a high-volume or critical production environment.
  • Expertise in networking, network analysis, and performance troubleshooting using tools like tcpdump.
  • Proven ability to analyze and troubleshoot large-scale distributed systems effectively.
