The role
We are looking for a Staff Software Engineer to drive the direction of the Wayve Machine Learning platform. The ML Platform team owns the machine learning training infrastructure and works with users to ensure that this infrastructure is reliable and efficiently utilised.
Key responsibilities:
- You will take ownership of the training infrastructure, which is used for distributed training of large jobs. Your technical decisions will drive high-quality projects that ensure the availability, reliability and scalability of the system.
- You will work cross-functionally with machine learning research engineers to optimise models so that they train efficiently, maximising their use of hardware resources and improving their reliability and observability.
- You will collaborate with technical and non-technical stakeholders to understand current user needs and identify future bottlenecks.
- You will guide and mentor mid-level engineers and promote high software engineering standards.
Example projects:
- Training job scheduling and orchestration, e.g. tooling to schedule jobs across multiple cloud providers depending on model needs and hardware availability.
- Tooling that provides thousands of GPUs simultaneously to our driving simulator, which we use to test the driving performance of our models off the road.
- Profiling training jobs with tools such as NVIDIA Nsight, identifying bottlenecks and optimising the models to increase efficiency.
About you
In order to set you up for success in this role at Wayve, we’re looking for the following skills and experience.
Essential
- Minimum of 10 years' experience in platform engineering or a similar field, with a proven track record of designing and scaling resilient systems.
- Proficiency in Python, with the ability to mentor engineers on best practices and scalable design.
- Extensive experience with concurrent, parallel and distributed computing, including performance tuning and optimisation for large-scale applications.
- Comprehensive knowledge of cloud platforms, preferably Azure, including architecture design, cost optimisation, security best practices and declarative configuration (Terraform).
- Proven experience with containerisation and orchestration technologies, including advanced knowledge of Docker and Kubernetes.
- Leadership and mentorship experience, guiding mid-level engineers, driving technical decision-making and collaborating with cross-functional teams to align engineering initiatives with business goals.
- Passion for building stable and scalable infrastructure that empowers users to train large models seamlessly, efficiently and at scale.
Desirable
- Experience with ML frameworks, preferably PyTorch, with a strong understanding of their internal workings and optimisation strategies.
- Proven ability to profile, optimise and scale ML training jobs using advanced tools such as NVIDIA Nsight or TensorRT.