AIML - Senior ML/RL Training Infrastructure Engineer
Description
As a core member of our ML infrastructure team, you will design, build, and scale the systems that enable large-scale reinforcement learning for Apple’s foundation models. You will focus on TPU-based training with JAX, developing robust, high-performance RL pipelines that support distributed actor/learner architectures, efficient experience replay, and large-scale environment execution.
In this role, you will work across the full stack of RL training systems—from low-level performance tuning and compiler optimization to cluster-level orchestration and resource management. You will ensure that training pipelines are efficient, reliable, reproducible, and observable, enabling research teams to iterate quickly and explore more complex RL environments and models.
Your work will directly impact the scalability, throughput, and stability of RL experiments, helping to unlock new capabilities in agentic reasoning, decision-making, and policy learning for Apple’s foundation models. This position is ideal for engineers who enjoy distributed systems, high-performance ML frameworks, and building the infrastructure that makes large-scale RL research possible.
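To give a concrete flavor of the actor/learner and experience-replay systems described above, here is a minimal, illustrative JAX sketch, not Apple's internal stack: a toy uniform replay buffer feeding a jitted Q-learning update. All names (`Transition`, `ReplayBuffer`, `q_loss`, `learner_step`) and the use of the optax library are assumptions made for this example.

```python
# Illustrative sketch only -- not Apple's stack. Assumes optax is available;
# Transition, ReplayBuffer, q_loss, and learner_step are hypothetical names.
import collections
import random

import jax
import jax.numpy as jnp
import optax

Transition = collections.namedtuple("Transition", "obs action reward next_obs done")

class ReplayBuffer:
    """Fixed-capacity uniform experience replay, the simplest variant."""
    def __init__(self, capacity=100_000):
        self._storage = collections.deque(maxlen=capacity)

    def add(self, transition):
        self._storage.append(transition)

    def sample(self, batch_size):
        batch = random.sample(self._storage, batch_size)
        # Stack each field into a batched array for the learner.
        return Transition(*(jnp.stack(xs) for xs in zip(*batch)))

def q_values(params, obs):
    # A single linear Q-head keeps the sketch tiny.
    return obs @ params["w"] + params["b"]

def q_loss(params, target_params, batch, gamma=0.99):
    # One-step TD error: the core objective of a value-based learner.
    q = jnp.take_along_axis(q_values(params, batch.obs),
                            batch.action[:, None], axis=1).squeeze(1)
    next_q = q_values(target_params, batch.next_obs).max(axis=1)
    target = batch.reward + gamma * (1.0 - batch.done) * next_q
    return jnp.mean((q - jax.lax.stop_gradient(target)) ** 2)

tx = optax.adam(1e-3)

@jax.jit  # XLA-compiled learner step; on TPUs this is where throughput lives
def learner_step(params, target_params, opt_state, batch):
    loss, grads = jax.value_and_grad(q_loss)(params, target_params, batch)
    updates, opt_state = tx.update(grads, opt_state)
    return optax.apply_updates(params, updates), opt_state, loss
```

In a production pipeline, many actor processes would run environments and push transitions into the buffer while the learner consumes batches asynchronously; the sketch collapses that to a single process.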
Minimum Qualifications
- PhD or MSc in Computer Science, Computer Engineering, or a closely related field.
- Hands-on experience designing, building, or maintaining large-scale ML training infrastructure.
- Strong proficiency with PyTorch or JAX and experience running training workloads on GPUs/TPUs.
- Solid understanding of distributed systems concepts (parallelism strategies, fault tolerance, synchronization).
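As a concrete illustration of the parallelism and synchronization concepts in the last item, below is a minimal data-parallel gradient step in JAX. The function and variable names are invented for the example, and newer codebases often express the same pattern with `jax.jit` plus `jax.sharding` rather than `pmap`.

```python
# Minimal sketch, not a prescribed approach. Each device computes gradients
# on its shard of the batch; jax.lax.pmean performs the all-reduce that keeps
# replicas synchronized -- the core collective behind data-parallel training.
import functools

import jax
import jax.numpy as jnp

def loss(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

@functools.partial(jax.pmap, axis_name="devices")
def parallel_grad(w, x, y):
    return jax.lax.pmean(jax.grad(loss)(w, x, y), axis_name="devices")

n = jax.local_device_count()
w = jnp.zeros((n, 8, 1))        # parameters replicated across devices
x = jnp.ones((n, 32, 8))        # global batch sharded along the device axis
y = jnp.ones((n, 32, 1))
grads = parallel_grad(w, x, y)  # identical (averaged) gradients on every device
```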
Preferred Qualifications
- Practical experience developing or optimizing training loops, RL pipelines, or large-scale model-training frameworks.
- Strong software engineering skills in Python, with emphasis on reliability, debuggability, and high-performance execution.
- Deep experience with PyTorch/JAX internals and XLA, including debugging and performance profiling on GPU/TPU architectures.
- Expertise in distributed RL training patterns, including actor/learner architectures, experience replay, and parallel environment execution.
- Experience building training services, orchestration tools, or automated pipelines for large-scale experiments.
- Proven success diagnosing bottlenecks in large-scale ML jobs (I/O, input pipelines, kernel performance, memory, compilation); a brief profiling sketch follows this list.
- Strong software engineering practices: code quality, design reviews, testing, observability, CI/CD.
- Experience working with cloud-scale clusters or specialized accelerators (TPU v5/v6, GPU, custom hardware).
- Contributions to ML frameworks, distributed training libraries, or high-performance computing systems.
- Excellent communication and collaboration skills for working with research and engineering partners.
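On the bottleneck-diagnosis point above, a small illustration of the most common pitfall: JAX dispatches work asynchronously, so naive wall-clock timing measures dispatch rather than execution. The `step` function below is an arbitrary stand-in workload, not a real training step.

```python
# Illustrative sketch of accelerator-aware timing and trace capture.
# `step` is an arbitrary stand-in workload, not a real training step.
import time

import jax
import jax.numpy as jnp

@jax.jit
def step(x):
    return (x @ x.T).sum()

x = jnp.ones((4096, 4096))
step(x).block_until_ready()   # first call includes one-time XLA compilation

with jax.profiler.trace("/tmp/jax-trace"):   # inspect in TensorBoard/Perfetto
    t0 = time.perf_counter()
    step(x).block_until_ready()              # block to time execution, not dispatch
    print(f"steady-state step: {time.perf_counter() - t0:.4f}s")
```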