Nvidia Industry · Engineering

Senior Software Developer, AI Networking

CHF 150'000 – 170'000 / year
ZÜRICH

Description

NVIDIA is changing the world of AI Networking with groundbreaking technology. We are excited to add an AI Networking Software Developer to our AI Networking software development and co-design team. We work with the latest NVIDIA hardware and technologies. We do full-stack benchmarking of data-center-scale systems for AI training and inference, as well as lower-level benchmarks. We strive for automation and develop many tools in-house, while adopting community-accepted practices and frameworks. We also give back to the community by developing our own tools in public GitHub repositories. Our goal is to ensure that large-scale systems deliver the expected performance in practice, not just on paper, by uncovering bottlenecks and driving continuous improvements.

Responsibilities

  • Develop AI networking communication frameworks and applications running in production on the world’s largest supercomputers and data centers.
  • Develop production tools and benchmarks used by multiple teams inside and outside NVIDIA.
  • Enable new AI models within our benchmarking infrastructure and deliver insights through end-to-end analysis of large-scale workloads across hardware and software stacks.
  • Design and implement automation systems, including large-scale parameter search to identify optimal configurations across complex systems.
  • Collaborate closely with networking and hardware teams to co-design new features and software interfaces in a fast-paced, evolving environment.

Qualifications

  • B.Sc. or M.Sc. degree in Computer Science or Software Engineering and 5+ years of experience, or equivalent experience.
  • Professional Python development experience. We seek individuals who build maintainable, long-lived tools that do not impose a heavy maintenance burden on the team.
  • Solid Linux expertise and passion for working extensively in command-line environments.
  • Ability to work across a broad and evolving stack, from hardware and networking up to large-scale AI systems running across entire clusters, with a strong drive to learn.

Ways to stand out from the crowd:

  • Knowledge of and/or experience with the modern AI ecosystem: PyTorch, LLMs, inference, and training.
  • Familiarity with cluster orchestration systems such as Slurm or Kubernetes.
  • Knowledge of MPI, HPC, InfiniBand, Ethernet, and networking in general.
  • Experience with performance optimization.