Firm Name: Tesla
Job Title: Site Reliability Engineer (SRE), Supercomputing
Education Requirement: Graduate
Job Hours: 8
Payment: $20-$30/Hour
Job City: Palo Alto
Job Details:
What to Expect.
Tesla's Supercomputing team has direct access to the infrastructure for high-performance computing and machine learning, including virtual simulations, Autopilot hardware, silicon design, and Dojo. Because demand for data and optimized compute resources is growing rapidly, cluster builds are becoming larger and more complex. Continually improving and automating deployment, monitoring, self-healing, and alerting processes is essential to the success of our engineering teams. As the reach and impact of our Autopilot/AI and R&D organizations grow, so does the significance of this team and its work.
As a Site Reliability Engineer on our Supercomputing team, you will be responsible for maintaining and improving our infrastructure so that our engineering teams have the tools and resources they need to be productive. This includes managing our HPC clusters, monitoring compute, GPU, and network metrics, writing configuration management scripts, and collaborating with our Data Center team to plan the efficient operation of hundreds of servers and add capacity to our GPU clusters.
What You'll Do.
- Support the AI/ML cluster infrastructure on the GPU and Dojo platforms, focusing on systems automation, configuration management, and large-scale deployment.
- Extend the pipeline for cluster health monitoring and auto-recovery.
- Work with users to investigate application performance problems.
- Tune and improve our servers, storage, and network in collaboration with hardware and storage suppliers.
- Create Ansible playbooks to manage configurations.
- Tune Linux system performance.
- Manage HPC workloads, clusters, and applications.
- Perform systems engineering and automation in Python, Golang, or Bash/shell.
- Participate in an on-call rotation.
What You'll Bring.
- Bachelor's degree in electrical engineering, computer science, or a related field.
- At least 3 additional years of comparable experience, or evidence of exceptional ability appropriate to the position.
- Solid understanding of Ubuntu/RHEL operating system fundamentals and performance tuning.
- Proficiency with configuration management tools such as Ansible.
- Demonstrated familiarity with Linux operating system internals, filesystems, disk/storage technologies, and storage protocols.
- Knowledge of scripting and high-level programming languages such as Python, Golang, and Bash.
- Experience building large clusters in collaboration with network and data center teams.
- At least five years of experience with configuration management tools (Ansible, etc.), monitoring and alerting tools (Prometheus, Grafana, Telegraf, Splunk, etc.), or HPC workload managers (SLURM, LSF, etc.).
- At least three years of working experience with GPU-based computing, high-throughput, low-latency networks, and/or high-performance storage systems.
- Familiarity with Slurm and distributed parallel file system storage management is a plus.
Benefits.
In addition to competitive pay, as a full-time Tesla employee you are eligible for the following benefits starting on your first day of work:
- Two medical plan options from Aetna, PPO and HSA, both without payroll deductions.
- Family-building benefits covering adoption, surrogacy, and other methods.
- Dental (including orthodontic coverage) and vision plans, both available without payroll deductions.
- Company-paid HSA contribution if you enroll in the High Deductible Aetna Medical Plan with HSA.
- Healthcare and dependent care Flexible Spending Accounts (FSA).
- LGBTQ+ care concierge services.
- Employee stock purchase programs, a 401(k) with employer match, and additional financial benefits.
- Employer-paid basic life, AD&D, short-term disability, and long-term disability insurance.
- Program for occupational therapy.
- Paid holidays as well as paid sick and vacation time (flex time for salaried positions).
- Backup child care and parenting support resources.
- Optional benefits such as critical illness, hospital indemnity, accident, and pet insurance, plus theft protection and legal services.
- Weight loss and smoking cessation programs.
- Tesla Babies program.
- Commuter benefits.
- Employee discounts and perks.
Expected Compensation.
- Annual salaries range from $104,000 to $348,000, plus bonuses and stock awards.
The pay offered may vary depending on individual factors, including market location, job-related knowledge, skills, and experience. Depending on the position offered, the total compensation package may also include other elements. Instructions on how to participate in these benefit plans will be provided when an employee accepts a job offer.