Staff Software Engineer, ML Platform

Wayve • Sunnyvale, California, United States • 1w ago

The role

We are looking for a Staff Software Engineer to drive the direction of the Wayve Machine Learning platform. The ML Platform team owns the machine learning training infrastructure and works with users to ensure that this infrastructure is reliable and efficiently utilised.

Key responsibilities:

You will take ownership of the training infrastructure, which is used to run large distributed training jobs. Your technical decisions will drive high-quality projects that ensure availability, reliability and scalability of the system.
You will work across functions with machine learning research engineers to optimise models so that they train efficiently, maximising their use of hardware resources and improving their reliability and observability.
You will collaborate with technical and non-technical stakeholders to understand current user needs and identify future bottlenecks.
You will guide and mentor mid-level engineers and promote high software engineering standards.

Example projects:

Training job scheduling and orchestration, e.g. tooling to schedule jobs across multiple cloud providers depending on model needs and hardware availability.
Tooling that provides thousands of GPUs simultaneously to our driving simulator, which we use to test the driving performance of our models off road.
Profiling training jobs with tools such as NVIDIA Nsight, identifying bottlenecks and optimising the models to increase efficiency.

About you

In order to set you up for success in this role at Wayve, we’re looking for the following skills and experience.

Essential

Minimum of 10 years of experience in platform engineering or a similar field, with a proven track record of designing and scaling resilient systems.
Proficiency in Python, with the ability to mentor engineers on best practices and scalable design.
Extensive experience with concurrent, parallel and distributed computing, including performance tuning and optimisation for large-scale applications.
Comprehensive knowledge of cloud platforms, preferably Azure, including architecture design, cost optimisation, security best practices and declarative configuration (Terraform).
Proven experience with containerisation and orchestration technologies, including advanced knowledge of Docker and Kubernetes.
Leadership and mentorship experience, guiding mid-level engineers, driving technical decision-making and collaborating with cross-functional teams to align engineering initiatives with business goals.
Passion for building stable and scalable infrastructure that empowers users to train large models seamlessly, efficiently and at scale.

Desirable

Experience with ML frameworks, preferably PyTorch, with a strong understanding of their internal workings and optimisation strategies.
Proven ability to profile, optimise and scale ML training jobs using advanced tools such as NVIDIA Nsight or TensorRT.