Huawei
Industry · Research
Senior Researcher: AI Computing Systems
CHF 130'000 – 150'000 / year
ZÜRICH
COMPUTER VISION · ML SYSTEMS · LLM · RAG · PYTORCH
Description
Huawei Technologies Switzerland AG is looking for a strong researcher with hands-on LLM + RAG experience who can help build and optimize techniques such as KV-cache precomputation, KV reuse/blending (e.g., CacheBlend-style), and sparse attention / selective recompute. You will work close to the metal (attention kernels + profiling) and at the system level (vLLM/LMCache-style stacks), turning research ideas into robust, high-performance code.
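To make the above concrete, here is a minimal sketch of document KV precomputation with a Hugging Face causal LM in PyTorch. The model choice, function names, and greedy step are illustrative assumptions, not a description of our stack:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative stand-ins: any causal LM with a KV cache works the same way.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def precompute_doc_kv(doc_text: str):
    """Prefill once over a retrieved document and keep its KV cache."""
    doc_ids = tok(doc_text, return_tensors="pt").input_ids
    out = model(doc_ids, use_cache=True)
    # past_key_values holds one (key, value) pair per layer; persisting it
    # is the essence of document KV precomputation.
    return doc_ids, out.past_key_values

@torch.no_grad()
def first_token_with_cached_doc(question: str, doc_kv):
    """Prefill only the question on top of the cached document state.

    Skipping the document prefill is what cuts time-to-first-token.
    NOTE: recent transformers versions mutate the cache in place, so a real
    system would snapshot/copy doc_kv before reusing it across queries.
    """
    q_ids = tok(question, return_tensors="pt").input_ids
    out = model(q_ids, past_key_values=doc_kv, use_cache=True)
    return out.logits[:, -1].argmax(-1)  # greedy first token, for illustration
```

Reusing caches across multiple or reordered documents is where CacheBlend-style blending comes in: naively concatenated per-document caches have stale cross-attention, so a small fraction of tokens is selectively recomputed to repair it.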
Responsibilities:
- Design and implement RAG acceleration techniques that reduce time-to-first-token (TTFT) and improve throughput (e.g., document KV precomputation, reuse, caching policies).
- Develop KV-cache reuse / blending pipelines and integrate them into inference stacks (batching, paging, eviction, correctness/quality trade-offs).
- Implement and optimize sparse attention / selective attention paths, including mask construction and block-granularity strategies (see the block-mask sketch after this list).
- Work with PyTorch and modern attention backends/kernels (e.g., FlashAttention / FlashInfer-like kernels), profiling and optimizing performance.
- Stay up to date with the latest research and open-source progress in LLM inference, KV caching, and RAG systems, and translate it into practical improvements.
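As a flavor of the mask-construction work, below is a minimal block-granularity sparse-attention mask in PyTorch, assuming a fixed block size and a common local-window-plus-attention-sink pattern; all names and defaults are illustrative:

```python
import torch

def block_sparse_mask(seq_len: int, block: int = 64,
                      local_blocks: int = 2, sink_blocks: int = 1) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask: True = attend, False = skip."""
    n_blocks = (seq_len + block - 1) // block
    keep = torch.zeros(n_blocks, n_blocks, dtype=torch.bool)
    for q in range(n_blocks):
        keep[q, :sink_blocks] = True                        # attention sinks
        keep[q, max(0, q - local_blocks + 1):q + 1] = True  # local window
    # Expand to token granularity, crop to seq_len, and enforce causality.
    mask = keep.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    mask = mask[:seq_len, :seq_len]
    return mask & torch.ones(seq_len, seq_len, dtype=torch.bool).tril()
```

A mask like this can be passed as attn_mask to torch.nn.functional.scaled_dot_product_attention (where True means "take part in attention"); production paths would instead drive a block-sparse kernel directly so that skipped blocks are never computed at all.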
Qualifications:
- PhD in Computer Science, Electrical Engineering, or a related field.
- Strong software engineering skills in Python, with substantial PyTorch experience (model internals, attention/KV cache concepts, performance-aware coding).
- Solid understanding of transformer inference fundamentals: prefill vs decode, KV cache layout, masking, batching, latency/throughput trade-offs.
- Experience benchmarking and profiling LLM inference workloads and diagnosing performance bottlenecks (see the profiling sketch after this list).
- Strong communication skills and comfort collaborating across research + engineering.
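For reference, profiling an attention-heavy microbenchmark with torch.profiler looks roughly like the sketch below; the shapes, dtype, and SDPA call are illustrative, and a CUDA device is assumed:

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Illustrative shapes: (batch, heads, seq_len, head_dim), fp16 on GPU.
q = torch.randn(1, 16, 4096, 128, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    for _ in range(10):
        torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
    torch.cuda.synchronize()  # make sure all GPU work is captured

# Kernel-level view of where time goes (e.g., which attention backend ran).
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```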
Preferred Qualifications (Nice to Have):
- Experience with vLLM and/or LMCache (integration, debugging, extending attention/KV-cache logic).
- Familiarity with attention kernel stacks and customization (FlashAttention/FlashInfer, Triton, CUDA extensions, custom ops).
- Practical experience building RAG pipelines (retrieval, chunking, indexing, reranking) and understanding how retrieval interacts with inference latency (see the retrieval sketch after this list).
- Contributions to open-source projects or publications/technical reports in AI systems, LLM inference, caching, or storage-aware ML systems.
- Systems background (Linux, performance engineering, storage/IO, memory hierarchy) and comfort working close to hardware.
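For orientation, the retrieval side of a RAG pipeline can be sketched in a few lines; the fixed-size chunker and cosine "reranker" below are deliberately simplistic stand-ins (real pipelines use ANN indexes and cross-encoder rerankers):

```python
import torch
import torch.nn.functional as F

def chunk(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Fixed-size chunking with overlap; real systems often split on structure."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def retrieve(query_emb: torch.Tensor, chunk_embs: torch.Tensor, k: int = 8):
    """Coarse retrieval: cosine similarity over an in-memory index.

    query_emb: (1, d); chunk_embs: (n, d). At scale this becomes an ANN index.
    """
    sims = F.cosine_similarity(query_emb, chunk_embs)
    return sims.topk(min(k, sims.numel())).indices

def rerank(query_emb, chunk_embs, candidates, k: int = 3):
    """Stand-in reranker: a real one scores (query, chunk) with a cross-encoder."""
    sims = F.cosine_similarity(query_emb, chunk_embs[candidates])
    return candidates[sims.topk(min(k, sims.numel())).indices]
```

Every chunk that survives reranking lands in the prompt, so retrieval depth directly sets prefill length, which is exactly where the KV-precomputation work described above pays off.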
Why join us:
- Collaborate with world-class scientists and engineers in an open, curiosity-driven environment.
- Access to state-of-the-art technology and tools.
- Opportunities for professional growth and development.
- Competitive salary and a high quality of life in Zurich, in the heart of Europe.
- Last but certainly not least: be part of innovative projects that make a difference.