Huawei Industry · Research

Senior Researcher: AI Computing Systems

CHF 130'000 – 150'000 / year
ZÜRICH
COMPUTER VISION · ML SYSTEMS · LLM · RAG · PYTORCH

Description

Huawei Technologies Switzerland AG is looking for a strong researcher with hands-on LLM and RAG experience who can help build and optimize techniques such as KV-cache precomputation, KV reuse/blending (e.g., CacheBlend-style), and sparse attention / selective recompute. You will work both close to the metal (attention kernels and profiling) and at the system level (vLLM/LMCache-style stacks), turning research ideas into robust, high-performance code.

Responsibilities:

  • Design and implement RAG acceleration techniques that reduce time-to-first-token (TTFT) and improve throughput (e.g., document KV precomputation, reuse, caching policies).
  • Develop KV-cache reuse / blending pipelines and integrate them into inference stacks (batching, paging, eviction, correctness/quality trade-offs).
  • Implement and optimize sparse attention / selective attention paths, including mask construction and block-granularity strategies.
  • Work with PyTorch and modern attention backends/kernels (e.g., FlashAttention / FlashInfer-like kernels), profiling and optimizing performance.
  • Stay up to date with the latest research and open-source progress in LLM inference, KV caching, and RAG systems, and translate it into practical improvements.

Qualifications:

  • PhD in Computer Science, Electrical Engineering, or a related field.
  • Strong software engineering skills in Python, with substantial PyTorch experience (model internals, attention/KV cache concepts, performance-aware coding).
  • Solid understanding of transformer inference fundamentals: prefill vs decode, KV cache layout, masking, batching, latency/throughput trade-offs.
  • Experience benchmarking and profiling LLM inference workloads and diagnosing performance bottlenecks.
  • Strong communication skills and comfort collaborating across research and engineering.

Preferred Qualifications (Nice to Have):

  • Experience with vLLM and/or LMCache (integration, debugging, extending attention/KV-cache logic).
  • Familiarity with attention kernel stacks and customization (FlashAttention/FlashInfer, Triton, CUDA extensions, custom ops).
  • Practical experience building RAG pipelines (retrieval, chunking, indexing, reranking) and understanding how retrieval interacts with inference latency.
  • Contributions to open-source projects or publications/technical reports in AI systems, LLM inference, caching, or storage-aware ML systems.
  • Systems background (Linux, performance engineering, storage/IO, memory hierarchy) and comfort working close to hardware.

Why join us:

  • Collaborate with world-class scientists and engineers in an open, curiosity-driven environment.
  • Access to state-of-the-art technology and tools.
  • Opportunities for professional growth and development.
  • Competitive salary and a high quality of life in Zurich, in the heart of Europe.
  • Last but certainly not least: be part of innovative projects that make a difference.