Computer Vision Research Internship: Image-to-Sequence Modeling (e.g. Transformers)
Description
We are offering a research-focused internship aimed at advancing machine learning methods for complex visual understanding tasks. The project centers on deep learning architectures for image-to-sequence modeling, including Transformers, attention mechanisms, and modern sequence and representation-learning frameworks, applied to challenging and highly structured computer vision problems. It contributes to long-term research efforts to improve performance, robustness, and generalization in large-scale visual applications. The position is ideal for experienced master’s students, for PhD collaborations, and for candidates preparing for a research career in industry or academia.
Responsibilities
You will work closely with experienced ML researchers and engineers on cutting-edge research at the intersection of computer vision and sequence modeling. Your work will include:
- Designing and experimenting with new ML architectures for structured visual data.
- Evaluating alternative modeling paradigms (e.g., encoder–decoder, hybrid Transformer models, sequence-based representations).
- Investigating techniques for improving robustness, generalization, and multi-view reasoning.
- Running systematic experiments, ablations, and error analyses to validate research hypotheses.
This project provides opportunities for novel model design, extensive experimentation, and scholarly research. You will contribute to long-term innovation in our technology, with potential real-world impact for millions of users.
Qualifications
MSc or PhD student in Computer Science, Machine Learning, Artificial Intelligence, or a related field with a strong research focus. Candidates should have a solid foundation in machine learning theory, neural networks, and computer vision.
Essential Skills:
- Proficiency in Python and deep learning frameworks such as PyTorch.
- Practical experience designing, training, and evaluating neural networks, including CNNs and Transformer-based architectures.
- Strong analytical and problem-solving abilities, with the capability to interpret experimental results and iterate effectively.
- Familiarity with research best practices, including reproducibility, controlled experiments, and ablation studies.
Desirable Skills:
- Prior research experience in computer vision, pattern recognition, sequence modeling, or image-to-sequence architectures.
- Experience training large-scale models or working with foundation-style architectures.
- Contributions to publications, preprints, or open-source machine learning projects.
- Strong communication skills and the ability to work independently in a research-oriented environment.
Benefits
- A highly skilled team and a fun environment where you can put your enthusiasm for computer vision challenges and cutting-edge technologies to use.
- Hackathons, summer parties, company outings, and other regular events.
- Office in the city center of Zurich.