Site Reliability Engineer

Engineering · Full-time · Piedmont, Italy

Job description

Who is Bold Commerce?

Bold Commerce powers personalized checkout experiences for leading omnichannel retailers and direct-to-consumer brands.

As a leader in the composable commerce space, Bold makes checkout better, boosting profitability by enabling personalized, customer-specific checkout flows designed to increase the Checkout Power Trio of conversion, average order value (AOV), and lifetime value (LTV), not just conversion. Built with a composable and headless architecture, Bold Checkout fits with any commerce stack, making it easy to overcome platform limitations. Leading omnichannel retailers like Harry Rosen and Staples Canada trust Bold Checkout with their business.

Named one of Built In Austin’s Best Places to Work, Canada’s Top Employers for Young People, and Manitoba’s Top Employers, we're a dynamic team that truly cares about building the future of ecommerce. We live by the BUILDERS Code, a shared set of practices, beliefs, and values that help shape this remote-first company.

Founded in 2012, with team members (Builders) located throughout Canada and the U.S., and backed by investors like OMERS Ventures, WhiteCap Venture Partners, and Round13 Capital, Bold is leading the way to a better, composable ecommerce future.

About the role

Bold is looking for a Site Reliability Engineer (SRE) to enhance the reliability, scalability, and performance of our software systems and infrastructure. You’ll work closely with Engineering and IT Operations teams to design and maintain robust systems that meet our service-level objectives (SLOs) and drive value for our merchants.

What you’ll do

  • Design and manage scalable, fault-tolerant infrastructure for SaaS services.

  • Develop and implement proactive monitoring, alerting, and incident response processes to address system issues.

  • Optimize system performance through capacity planning, load testing, and performance tuning.

  • Automate tasks and streamline deployments using configuration management and infrastructure-as-code practices.

  • Collaborate with development teams to ensure efficient software deployment and release management.

  • Conduct root cause analysis and post-incident reviews to drive continuous improvements.

  • Stay updated on best practices and emerging technologies in site reliability engineering.

  • Contribute to the architecture of monitoring and performance systems.

  • Train team members on tools and processes.

  • Balance feature development speed with adherence to SLOs.

  • Effectively manage project execution.

What we’re looking for

  • Bachelor’s degree in Computer Science, Engineering, or a related field.

  • 5+ years of experience as an SRE or in a similar SaaS/cloud-based role.

  • Expertise in Linux/Unix systems administration and shell scripting, plus proficiency in at least one programming language (e.g., Python, Go, Ruby).

  • Experience with automation and configuration management tools (e.g., Ansible, Chef, Puppet, Terraform) and cloud platforms (AWS, Azure, GCP).

  • Proficiency with container and orchestration technologies such as Docker and Kubernetes, along with a solid understanding of networking concepts.

  • Skilled in using monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack) and incident management systems.

  • Strong problem-solving abilities, with excellent prioritization and communication skills.

  • Proven ability to build trust and maintain strong relationships both internally and externally.

  • Willingness to work flexible hours, including occasional overnight maintenance and participation in an on-call rotation once a month.