Senior Site Reliability Engineer Supporting NVIDIA

Engineering · CA, United States

Job description

Join the Sustainable Talent team supporting NVIDIA as a Senior Site Reliability Engineer in the Infrastructure, Planning, and Process organization. This is a W-2 full-time contract based in Santa Clara, CA, with hybrid work options. We offer competitive pay of $75 to $90/hr, based on factors such as experience, education, and location, and provide full benefits, PTO, and an amazing company culture!

As an SRE, you will troubleshoot and manage our client's on-premises infrastructure to support various software engineering teams company-wide. Keen attention to detail, problem-solving abilities, and a solid knowledge base are essential.

What you’ll be doing:

  • Working on systems deployed in NVIDIA's internal cloud, making them available and reliable for our end users.
  • Monitoring system performance and troubleshooting issues related to CPU, memory, disk, and network utilization.
  • Providing high-quality user support.
  • Monitoring KPIs and making sure that the team’s SLAs are met.
  • Managing and maintaining production Kubernetes clusters.
  • Driving automation of monitoring to gain more insight into application and system health.
  • Crafting and implementing critical metrics using various analytics methods and dashboards.
  • Reusing AI techniques to extract useful signals about machines and jobs from the generated data.

What we need to see:

  • Proven SRE experience in an L1 support role with on-call responsibilities, ideally 5+ years.
  • Proficiency in troubleshooting Linux OS issues, such as SSH and performance problems.
  • Experience troubleshooting networking issues such as DNS and DHCP, and familiarity with networking principles and protocols, including TCP/IP and VLANs.
  • Hands-on experience with monitoring and alerting tools such as Prometheus, Grafana, Elastic, or similar.
  • Strong understanding and practical experience with REST API calls.
  • Proficiency in basic scripting; familiarity with Python or a similar programming language is a plus.
  • Knowledge of Ansible roles and playbooks, Jenkins CI/CD processes, and deployment experience with Kubernetes.
  • Experience with the Kickstart process for automated Linux installations.
  • Experience managing and troubleshooting Linux systems, as well as managing systems in data centers, using tools like BMC (Redfish), KVM, and IPMI.
  • Background in SQL databases such as MySQL and time-series databases such as Prometheus.
  • Experience with data analytics and visualization tools like Kibana, Grafana, and Splunk.
  • Proficient with source code management and binary repository systems like GitLab, GitHub, Artifactory, and Perforce.
  • Advanced knowledge of standard security methodologies and practices.
  • Bachelor’s degree in Computer Science, Information Technology, or related field, or equivalent experience.

Ways to stand out from the crowd:

  • Working knowledge of OpenStack.
  • Previous experience managing NVIDIA hardware such as GPUs and Tegras.
  • Prior experience with large-scale operations teams.
  • Experience managing Windows server infrastructure.
  • Outstanding interpersonal skills and ability to communicate effectively with all levels of management.
  • Ability to analyze complex problems, design simple systems that function efficiently with minimal support, and thrive in a multi-tasking environment with evolving priorities.

Sustainable Talent is an M/F+, disabled, and veteran equal employment opportunity and affirmative action employer.