Rocket.Chat
Senior Site Reliability Engineer

Job description

Job Title: Senior Site Reliability Engineer

Level: Senior

Working Hours: Full Time (40h/Week)

Contract: Contractor (PJ), Employee (CLT, Brazil)

Location: Remote

Your Team 👥

You will report to our Senior Engineering Manager and join the Engineering team. On TheOrg you can view the complete structure of our organisation, including information about every team member, hiring managers and the size of each department.

Your Responsibilities ✏️

As a Senior Site Reliability Engineer, you will play a critical role in ensuring the reliability, scalability, and performance of Rocket.Chat. Your expertise in designing, implementing, and maintaining robust infrastructure will be instrumental in delivering exceptional user experiences.

Mandatory Hard Skills 🎯

Strong proficiency in Linux/Unix systems administration;
Proficiency in scripting languages such as Python, Go or Bash;
In-depth knowledge of cloud platforms such as AWS, Azure, or GCP;
Experience with containerization tools such as Docker and container orchestration platforms such as Kubernetes;
Proficiency in monitoring tools such as Prometheus and Grafana for collecting, analyzing, and visualizing system metrics, logs, and events;
Experience with CI/CD pipelines and tools such as ArgoCD;
Solid understanding of networking fundamentals, including TCP/IP, DNS, DHCP, VLANs, routing, and firewalls;
Familiarity with database technologies such as MySQL, PostgreSQL, MongoDB, or Redis.

Desirable Hard Skills 💕

Familiarity with database technologies such as MongoDB or Redis;
Familiarity with agile management tools such as Jira;
Knowledge of JavaScript.

Soft Skills ✨

Collaboration with development teams to ensure that applications are designed with reliability and scalability in mind;
Excellent problem-solving and troubleshooting skills;
Effective communication and collaboration skills with both technical and non-technical stakeholders;
Strong analytical skills to identify root causes of complex issues and develop effective solutions;
Leadership skills to guide and inspire team members, especially during incidents or critical situations;
Commitment to continuous learning, staying up to date with emerging technologies and trends in the field;
Passion: Genuine enthusiasm for what you do and how it contributes to our company's mission;
Dream: Proactively seek out opportunities and challenges to achieve extraordinary results. If you're someone who takes initiative and is always striving to improve, you'll fit right in;
Own: Take ownership of your work, set high standards for yourself, and be accountable for outcomes, demonstrating a strong sense of responsibility and commitment;
Trust: Recognize the importance of trust and support, and actively work towards a collaborative and inclusive workplace;
Share: Communicate openly and transparently, ensuring clarity and honesty in interactions.

What You'll Do 🖥️

Develop and maintain Infrastructure as Code (IaC) using tools like Terraform;
Automate deployment processes to achieve consistent and repeatable infrastructure provisioning;
Configure and maintain CI/CD automation pipelines;
Observability: Proficient in leveraging diverse data sources for troubleshooting, optimization, and ensuring system reliability; skilled in ad-hoc querying and analysis of observability data using tools like Elasticsearch or Grafana;
Designing for reliability: Continuously monitor and plan for capacity increases to accommodate traffic growth and ensure that the infrastructure remains fault-tolerant under varying load conditions;
Post mortems: Take leadership and accountability in writing blameless post mortems; make sure each post mortem has clear action items; design and implement solutions for those action items;
Disaster Recovery: Lead teams in disaster recovery procedures; assign DR tasks to less senior engineers during DR practices; lead a DR practice at least once a year; create DR plans for critical systems; suggest and implement improvements to disaster recovery processes, tools, and automation to enhance the organization's readiness and reduce recovery time;
Network Security: Be well-versed in network security principles, able to assess the security of complex network architectures, and able to make informed decisions about security configurations, monitoring, and incident response to protect critical systems and data;
Incident Management: Coordinate the efforts of responding teams efficiently and ensure that communication flows both between the responders and those interested in the incident’s progress;
Coding: Be proficient in at least one scripting or programming language (e.g. Go, Bash); create scripts and automation tools to streamline operational tasks; have a good understanding of IaC principles and practices; understand and use configuration management tools such as Ansible or Terraform;
Documentation: Proactively suggest and maintain information that describes processes and procedures related to SRE; constantly improve documentation;
Cloud computing: In-depth knowledge and hands-on experience with one or more major cloud providers; expertise in configuring and managing cloud networking components, such as Virtual Private Clouds (VPCs), subnets, load balancers, and security groups;
Containers and Orchestration: In-depth knowledge of container technology, including Docker and container runtimes; experience with Kubernetes networking concepts, including Services, Ingress, and Network Policies; mastery of Kubernetes architecture, including control plane components (API server, etcd, controller manager, scheduler) and worker nodes (kubelet, container runtime).

Benefits ✨

Flexible Working Hours
Fully Remote
Unlimited Paid Time Off
Holidays and Vacation Days
Company Laptop and Headphones
Remote Benefit
iTalki
Courses and Books
Stock Options
Multicultural environment with colleagues in over 26 countries
Vibrant Company Culture

Check out our handbook to dive into each of our awesome benefits! At Rocket.Chat, we have tailored base pay ranges according to work locations. This approach ensures that we can competitively and consistently compensate our employees across different geographic markets.

About Rocket.Chat 🚀

Rocket.Chat is the world's largest open-source communications platform. Built for organizations needing more control over their communications, it enables collaboration between colleagues, partners, customers, communities, and even platforms without compromising data ownership, customizations, or integrations.

Tens of millions of users in over 150 countries and organizations such as Deutsche Bahn, the U.S. Navy and Credit Suisse trust Rocket.Chat every day to keep their communications completely private and secure. At Rocket.Chat, we believe in reconnecting the world, one conversation at a time!

See yourself in that? Apply now! Check out our handbook for more information about our rocket.
