Join the Sustainable Talent team supporting NVIDIA as a Senior Site Reliability Engineer in the Infrastructure, Planning, and Process organization. This is a W-2 full-time contract based in Santa Clara, CA, with hybrid work options. We offer competitive pay of $75 to $90/hr, based on factors such as experience, education, and location, and provide full benefits, PTO, and an amazing company culture!
As an SRE, you will troubleshoot and manage our client's on-premises infrastructure to support software engineering teams company-wide. Keen attention to detail, problem-solving abilities, and a solid knowledge base are essential.
What you’ll be doing:
- Working on systems deployed in NVIDIA's internal cloud, making them available and reliable for our end users.
- Monitoring system performance and troubleshooting issues related to CPU, memory, disk, and network utilization.
- Providing high-quality user support.
- Monitoring KPIs and making sure that the team’s SLAs are met.
- Managing and maintaining production Kubernetes clusters.
- Driving automation of monitoring to gain more insight into application and system health.
- Crafting and implementing critical metrics using various analytics methods and dashboards.
- Reusing AI techniques to extract useful signals about machines and jobs from the data generated.
What we need to see:
- Proven SRE experience in an L1 support role with on-call responsibilities, ideally 5+ years.
- Proficiency in troubleshooting Linux OS issues, such as SSH and performance problems.
- Experience troubleshooting networking issues such as DNS and DHCP, and familiarity with networking principles and protocols, including TCP/IP and VLANs.
- Hands-on experience with monitoring and alerting tools such as Prometheus, Grafana, Elastic, or similar.
- Strong understanding and practical experience with REST API calls.
- Proficiency in basic scripting; familiarity with Python or a similar programming language is a plus.
- Knowledge of Ansible roles and playbooks, Jenkins CI/CD processes, and deployment experience with Kubernetes.
- Experience with the Kickstart process for automated Linux installations.
- Experience managing and troubleshooting Linux systems, as well as managing systems in data centers, using tools like BMC (Redfish), KVM, and IPMI.
- Background in SQL databases such as MySQL and time-series databases such as Prometheus.
- Experience with data analytics and visualization tools like Kibana, Grafana, and Splunk.
- Proficiency with source code management and binary repository systems like GitLab, GitHub, Artifactory, and Perforce.
- Advanced knowledge of standard methodologies related to security.
- Bachelor’s degree in Computer Science, Information Technology, or related field, or equivalent experience.
Ways to stand out from the crowd:
- Working knowledge of OpenStack.
- Previous experience managing NVIDIA hardware such as GPUs and Tegra devices.
- Prior experience with large-scale operations teams.
- Experience managing Windows Server infrastructure.
- Outstanding interpersonal skills and ability to communicate effectively with all levels of management.
- Ability to analyze complex problems, design simple systems that function efficiently with minimal support, and thrive in a multi-tasking environment with evolving priorities.
Sustainable Talent is an M/F+, disabled, and veteran equal employment opportunity and affirmative action employer.