Senior Site Reliability Engineer

Job description

This position is for applicants in Latin America.

As a Senior Site Reliability Engineer, you will play a critical role in ensuring the reliability, scalability, and performance of Rocket.Chat. Your expertise in designing, implementing, and maintaining robust infrastructure will be instrumental in delivering exceptional user experiences.

Mandatory Hard Skills 🎯

  • Strong proficiency in Linux/Unix systems administration;
  • Proficiency in scripting languages such as Python, Go or Bash;
  • In-depth knowledge of cloud platforms such as AWS, Azure, or GCP;
  • Experience with containerization tools such as Docker and container orchestration platforms such as Kubernetes;
  • Proficiency in monitoring tools such as Prometheus and Grafana for collecting, analyzing, and visualizing system metrics, logs, and events;
  • Experience with CI/CD pipelines and tools such as ArgoCD;
  • Solid understanding of networking fundamentals, including TCP/IP, DNS, DHCP, VLANs, routing, and firewalls;
  • Familiarity with database technologies such as MySQL, PostgreSQL, MongoDB, or Redis.

Desirable Hard Skills 💕 

  • Familiarity with database technologies such as MongoDB or Redis;
  • Familiarity with agile management tools such as Jira;
  • Knowledge of JavaScript.

Soft Skills

  • Collaboration with development teams to ensure that applications are designed with reliability and scalability in mind;
  • Excellent problem-solving and troubleshooting skills;
  • Effective communication and collaboration skills with both technical and non-technical stakeholders;
  • Strong analytical skills to identify root causes of complex issues and develop effective solutions;
  • Leadership skills to guide and inspire team members, especially during incidents or critical situations;
  • Commitment to continuous learning and to staying current with emerging technologies and trends in the field.

What You'll Do 🖥️

  • Develop and maintain Infrastructure as Code (IaC) using tools like Terraform; 
  • Automate deployment processes to achieve consistent and repeatable infrastructure provisioning;
  • Configure and maintain CI/CD automation pipelines;
  • Observability: Leverage diverse data sources for troubleshooting, optimization, and ensuring system reliability; perform ad-hoc querying and analysis of observability data using tools like Elasticsearch or Grafana;
  • Designing for reliability: Continuously monitor and plan for capacity increases to accommodate traffic growth and ensure that the infrastructure remains fault-tolerant under varying load conditions;
  • Post mortems: Take leadership and accountability in writing blameless post mortems; make sure each post mortem has clear action items; own those action items by designing and implementing solutions for them;
  • Disaster Recovery: Lead teams in disaster recovery procedures; assign DR tasks to less senior engineers during DR practices; lead a DR practice at least once a year; create DR plans for critical systems; suggest and implement improvements to disaster recovery processes, tools, and automation to enhance the organization's readiness and reduce recovery time;
  • Network Security: Be well-versed in network security principles, able to assess the security of complex network architectures, and able to make informed decisions about security configurations, monitoring, and incident response to protect critical systems and data;
  • Incident Management: Coordinate the efforts of responding teams efficiently and ensure that communication flows both among the responders and to those interested in the incident’s progress;
  • Coding: Be proficient in at least one scripting or programming language (e.g. Go or Bash); create scripts and automation tools to streamline operational tasks; have a good understanding of IaC principles and practices; understand and use tools such as Ansible or Terraform for configuration management and provisioning;
  • Documentation: Proactively suggest and maintain documentation that describes SRE processes and procedures; continuously improve it;
  • Cloud computing: In-depth knowledge of and hands-on experience with one or more major cloud providers; expertise in configuring and managing cloud networking components, such as Virtual Private Clouds (VPCs), subnets, load balancers, and security groups;
  • Containers and Orchestration: In-depth knowledge of container technology, including Docker and container runtimes; experience with Kubernetes networking concepts, including Services, Ingress, and Network Policies; mastery of Kubernetes architecture, including control plane components (API server, etcd, controller manager, scheduler) and worker nodes (kubelet, container runtime).

Benefits

Our goal is to make your routine as a Rocketeer feel enjoyable, exciting, and comfortable in a 100% remote environment. So, you’ll receive a set of benefits to improve your remote work experience! They include a flexible schedule, unlimited Paid Time Off, language and tech courses, stock options, a multicultural environment with colleagues in over 26 countries, a vibrant company culture, and more! 

About Rocket.Chat 🚀

Rocket.Chat is the world's largest open-source communications platform. Built for organizations needing more control over their communications, it enables collaboration between colleagues, partners, customers, communities, and even platforms without compromising data ownership, customizations, or integrations.

Tens of millions of users in over 150 countries and organizations such as Deutsche Bahn, the U.S. Navy, and Credit Suisse trust Rocket.Chat every day to keep their communications completely private and secure. At Rocket.Chat, we believe in reconnecting the world, one conversation at a time! See yourself in that? Apply now!

Check out our handbook for more information about our rocket.
