Xero
Lead Engineer - SRE (Reliability Enablement)

Engineering · Full-time · Auckland, New Zealand · Remote possible

Job description

Xero is a beautiful, easy-to-use platform that helps small businesses and their accounting and bookkeeping advisors grow and thrive.

At Xero, our purpose is to make life better for people in small business, their advisors, and communities around the world. This purpose sits at the centre of everything we do. We support our people to do the best work of their lives so that they can help small businesses succeed through better tools, information and connections. Because when they succeed they make a difference, and when millions of small businesses are making a difference, the world is a more beautiful place.

About the team

Reliability Enablement (AKA Reliability Rangers)

As a member of our Reliability Enablement team at Xero, you’ll help teams deliver a great customer experience through a better understanding of the behaviour and operation of their systems. We do this by focusing on post-incident analysis and advocating for learning from incidents, by engaging with teams across the organization through specialized reliability enablement and consulting, and by running SRE workshops and training.

There will be a lot of variety in your work as part of Reliability Enablement: you may be embedded within an engineering portfolio, or ‘home’ in our central reliability enablement team. Regardless of your current focus, you will be an advocate for reliability and incident learning, as well as an active member of our SRE On Call function, providing specialist incident commander capabilities for complex major and critical incidents.

How you’ll make an impact

When you are ‘home’ in the central reliability enablement team, you could be doing any combination of the following:
Investigating operational surprises and supporting teams in post-incident activities
Conducting in-depth incident analysis and maximizing post-incident learning across the organization
Completing short-term reliability consultancy and enablement engagements, such as SLO reviews and facilitating pre-mortems

When you are an embedded SRE, you could spend several months immersed with a product engineering portfolio, working alongside teams to uplift system reliability and robustness through the following:
Improving on-call health, uplifting observability and addressing any operational hotspots
Identifying, planning and leading implementation of reliability uplift work and initiatives
Supporting delivery of strategic features and initiatives with reliability and distributed systems expertise
Observing and improving rituals and practices relating to production operations, incident response and incident learning

What you’ll bring with you

Required
Solid experience in logging, monitoring and observability of a highly distributed system
Leading incident management and response and troubleshooting efforts, including critical, complex and high severity incidents
Post-incident reviews, incident analysis and learning from incidents
Experience working in a tech or product company with comparable scale and complexity
Systems thinking: reasoning about how systems and components interact and how they respond to failure
Proficiency in one or more object-oriented programming languages (C#, JavaScript, Java, Python, etc.) or experience with infrastructure-as-code (e.g. Terraform, CloudFormation)
Experience in technical leadership, setting technical direction
Experience in leading delivery of technical initiatives in an operational, site reliability or platform engineering capacity

Preferred
Experience working with cloud providers such as AWS, Azure or GCP
Experience with designing, developing and operating distributed systems and large scale software systems
Strong experience delivering technical initiatives in an operational, site reliability or platform engineering capacity
The ability to solve engineering challenges outside of your own team, including using influence rather than authority to enact change
Demonstrated experience in reliability concepts like capacity management, autoscaling, deployment and release safety, software strategies for reliability, fault tolerance and graceful failure
Experience implementing customer-focused Service Level Objectives (SLOs)
Experience using software engineering to solve operational and reliability challenges
Understanding of human factors, safety science and resilience engineering
Experience working in environments with advanced security and networks

Org chart

Manager

Chris Patalano

EVP Product Engineering

Peers

Mark Bruce

Principal Engineer

Simon Gow

Principal Engineer

Carl Jackson

Lead Engineer
