Huawei
Industry · Research
Senior Researcher: AI Computing Systems
CHF 130'000 – 150'000 / year
ZÜRICH
COMPUTER VISION · ML SYSTEMS · LLM · RAG · PYTORCH
Description
Huawei Technologies Switzerland AG is looking for a strong researcher with hands-on LLM + RAG experience who can help build and optimize techniques such as KV-cache precomputation, KV reuse/blending (e.g., CacheBlend-style), and sparse attention / selective recompute. You will work close to the metal (attention kernels + profiling) and at the system level (vLLM/LMCache-style stacks), turning research ideas into robust, high-performance code.
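To make the above concrete, here is a minimal sketch of document KV precomputation with a Hugging Face causal LM in PyTorch. The model choice, function names, and greedy step are illustrative assumptions, not a description of our stack:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative stand-ins: any causal LM with a KV cache works the same way.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def precompute_doc_kv(doc_text: str):
    """Prefill once over a retrieved document and keep its KV cache."""
    doc_ids = tok(doc_text, return_tensors="pt").input_ids
    out = model(doc_ids, use_cache=True)
    # past_key_values holds one (key, value) pair per layer; persisting it
    # is the essence of document KV precomputation.
    return doc_ids, out.past_key_values

@torch.no_grad()
def first_token_with_cached_doc(question: str, doc_kv):
    """Prefill only the question on top of the cached document state.

    Skipping the document prefill is what cuts time-to-first-token.
    NOTE: recent transformers versions mutate the cache in place, so a real
    system would snapshot/copy doc_kv before reusing it across queries.
    """
    q_ids = tok(question, return_tensors="pt").input_ids
    out = model(q_ids, past_key_values=doc_kv, use_cache=True)
    return out.logits[:, -1].argmax(-1)  # greedy first token, for illustration
```

Reusing caches across multiple or reordered documents is where CacheBlend-style blending comes in: naively concatenated per-document caches have stale cross-attention, so a small fraction of tokens is selectively recomputed to repair it.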
Responsibilities:
- Design and implement RAG acceleration techniques that reduce time-to-first-token (TTFT) and improve throughput (e.g., document KV precomputation, reuse, caching policies).
- Develop KV-cache reuse / blending pipelines and integrate them into inference stacks (batching, paging, eviction, correctness/quality trade-offs).
- Implement and optimize sparse attention / selective attention paths, including mask construction and block-granularity strategies (see the block-mask sketch after this list).
- Work with PyTorch and modern attention backends/kernels (e.g., FlashAttention / FlashInfer-like kernels), profiling and optimizing performance.
- Stay up to date with the latest research and open-source progress in LLM inference, KV caching, and RAG systems, and translate it into practical improvements.
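As a flavor of the mask-construction work, below is a minimal block-granularity sparse-attention mask in PyTorch, assuming a fixed block size and a common local-window-plus-attention-sink pattern; all names and defaults are illustrative:

```python
import torch

def block_sparse_mask(seq_len: int, block: int = 64,
                      local_blocks: int = 2, sink_blocks: int = 1) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask: True = attend, False = skip."""
    n_blocks = (seq_len + block - 1) // block
    keep = torch.zeros(n_blocks, n_blocks, dtype=torch.bool)
    for q in range(n_blocks):
        keep[q, :sink_blocks] = True                        # attention sinks
        keep[q, max(0, q - local_blocks + 1):q + 1] = True  # local window
    # Expand to token granularity, crop to seq_len, and enforce causality.
    mask = keep.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    mask = mask[:seq_len, :seq_len]
    return mask & torch.ones(seq_len, seq_len, dtype=torch.bool).tril()
```

A mask like this can be passed as attn_mask to torch.nn.functional.scaled_dot_product_attention (where True means "take part in attention"); production paths would instead drive a block-sparse kernel directly so that skipped blocks are never computed at all.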
Qualifications:
- PhD in Computer Science, Electrical Engineering, or a related field.
- Strong software engineering skills in Python, with substantial PyTorch experience (model internals, attention/KV cache concepts, performance-aware coding).
- Solid understanding of transformer inference fundamentals: prefill vs decode, KV cache layout, masking, batching, latency/throughput trade-offs.
- Experience benchmarking and profiling LLM inference workloads and diagnosing performance bottlenecks (see the profiling sketch after this list).
- Strong communication skills and comfort collaborating across research + engineering.
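For reference, profiling an attention-heavy microbenchmark with torch.profiler looks roughly like the sketch below; the shapes, dtype, and SDPA call are illustrative, and a CUDA device is assumed:

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Illustrative shapes: (batch, heads, seq_len, head_dim), fp16 on GPU.
q = torch.randn(1, 16, 4096, 128, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    for _ in range(10):
        torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
    torch.cuda.synchronize()  # make sure all GPU work is captured

# Kernel-level view of where time goes (e.g., which attention backend ran).
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```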
Preferred Qualifications (Nice to Have):
- Experience with vLLM and/or LMCache (integration, debugging, extending attention/KV-cache logic).
- Familiarity with attention kernel stacks and customization (FlashAttention/FlashInfer, Triton, CUDA extensions, custom ops).
- Practical experience building RAG pipelines (retrieval, chunking, indexing, reranking) and understanding how retrieval interacts with inference latency (see the retrieval sketch after this list).
- Contributions to open-source projects or publications/technical reports in AI systems, LLM inference, caching, or storage-aware ML systems.
- Systems background (Linux, performance engineering, storage/IO, memory hierarchy) and comfort working close to hardware.
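For orientation, the retrieval side of a RAG pipeline can be sketched in a few lines; the fixed-size chunker and cosine "reranker" below are deliberately simplistic stand-ins (real pipelines use ANN indexes and cross-encoder rerankers):

```python
import torch
import torch.nn.functional as F

def chunk(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Fixed-size chunking with overlap; real systems often split on structure."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def retrieve(query_emb: torch.Tensor, chunk_embs: torch.Tensor, k: int = 8):
    """Coarse retrieval: cosine similarity over an in-memory index.

    query_emb: (1, d); chunk_embs: (n, d). At scale this becomes an ANN index.
    """
    sims = F.cosine_similarity(query_emb, chunk_embs)
    return sims.topk(min(k, sims.numel())).indices

def rerank(query_emb, chunk_embs, candidates, k: int = 3):
    """Stand-in reranker: a real one scores (query, chunk) with a cross-encoder."""
    sims = F.cosine_similarity(query_emb, chunk_embs[candidates])
    return candidates[sims.topk(min(k, sims.numel())).indices]
```

Every chunk that survives reranking lands in the prompt, so retrieval depth directly sets prefill length, which is exactly where the KV-precomputation work described above pays off.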
Why join us:
- Collaborate with world-class scientists and engineers in an open, curiosity-driven environment.
- Access to state-of-the-art technology and tools.
- Opportunities for professional growth and development.
- Competitive salary and a high quality of life in Zurich, in the heart of Europe.
- Last but certainly not least: be part of innovative projects that make a difference.