Find ideal training strategies (parallelism approaches, precision trade-offs) for a variety of model sizes and compute loads
Profile, debug, and optimize single and multi-GPU operations using tools like Nsight and stack trace viewers to understand what's actually happening at the hardware level
Analyze and improve the whole training pipeline end to end (efficient data storage, data loading, distributed training, checkpoint/artifact saving, logging, …)
Set up scalable systems for experiment tracking, data/model versioning, and experiment insights
Design, deploy and maintain large-scale ML training clusters running SLURM for distributed workload orchestration
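Responsibilities like the SLURM orchestration above might look roughly like the following batch script. This is a minimal sketch, not a script from this role: the partition-free header, node/GPU counts, port, and the `train.py` entrypoint are all illustrative placeholders.

```shell
#!/bin/bash
# Sketch: minimal SLURM batch script for multi-node PyTorch training.
# Resource numbers and train.py are illustrative assumptions.
#SBATCH --job-name=train
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=96
#SBATCH --time=24:00:00

# Use the first allocated node as the rendezvous host.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# One torchrun launcher per node; torchrun spawns one worker per GPU.
srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc-per-node=8 \
  --rdzv-backend=c10d \
  --rdzv-endpoint="$MASTER_ADDR:$MASTER_PORT" \
  train.py
```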
### Ideal Candidate Profile
Familiarity with the latest and most effective techniques in optimizing training and inference workloads—not from reading papers, but from implementing them
Deep understanding of GPU memory hierarchy and computation capabilities—knowing what the hardware can do theoretically and what prevents us from achieving it
Experience optimizing for both memory-bound and compute-bound operations and understanding when each constraint matters
Expertise with efficient attention algorithms and their performance characteristics at different scales
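The memory-bound vs. compute-bound distinction above can be made concrete with a back-of-the-envelope roofline check. The peak-FLOPs and bandwidth figures below are illustrative (roughly A100-class), not values from this posting; real numbers should come from vendor specs or measurement.

```python
# Sketch: classify an op as memory- or compute-bound via arithmetic intensity.
# Hardware peaks below are illustrative assumptions (~A100-class, BF16).

PEAK_FLOPS = 312e12          # ~312 TFLOP/s tensor-core peak (illustrative)
PEAK_BW = 2.0e12             # ~2 TB/s HBM bandwidth (illustrative)
RIDGE = PEAK_FLOPS / PEAK_BW # FLOPs per byte at the roofline ridge point

def bound_kind(flops: float, bytes_moved: float) -> str:
    """Return which hardware limit dominates for an op."""
    intensity = flops / bytes_moved  # arithmetic intensity, FLOP/byte
    return "compute-bound" if intensity >= RIDGE else "memory-bound"

# Example 1: large GEMM, C = A @ B with M = N = K = 8192 in bf16 (2 bytes/elem).
M = N = K = 8192
gemm_flops = 2 * M * N * K                 # multiply-accumulate count
gemm_bytes = 2 * (M * K + K * N + M * N)   # read A, B; write C

# Example 2: elementwise add of two 8192 x 8192 bf16 tensors.
n = 8192 * 8192
add_flops = n
add_bytes = 2 * 3 * n  # read two inputs, write one output

print(bound_kind(gemm_flops, gemm_bytes))  # compute-bound
print(bound_kind(add_flops, add_bytes))    # memory-bound
```

The same arithmetic explains why fused attention kernels help: fusing avoids materializing the full score matrix in HBM, raising arithmetic intensity toward the compute-bound regime.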
### Nice to Have
Experience implementing custom GPU kernels and integrating them into PyTorch
Experience with diffusion and autoregressive models and understanding of their specific optimization challenges
Familiarity with high-performance storage solutions (VAST, blob storage) and understanding of their performance characteristics for ML workloads
Experience with managing SLURM clusters at scale
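Evaluating storage performance for ML workloads, as mentioned above, often starts with a crude throughput probe. The sketch below uses only the standard library; file and chunk sizes are illustrative, and `path` would point at the mount under test (e.g. a VAST or blob-storage gateway). Note that the OS page cache can inflate the measured read rate on a warm file.

```python
# Sketch: crude sequential-read throughput probe for a storage mount.
# Sizes are illustrative; page-cache effects make this an optimistic number.
import os
import tempfile
import time

def measure_read_mbps(path: str, size_mb: int = 64, chunk: int = 1 << 20) -> float:
    """Write a throwaway file under `path`, then time a sequential read of it."""
    data = os.urandom(chunk)
    fname = os.path.join(path, "io_probe.bin")
    with open(fname, "wb") as f:
        for _ in range(size_mb * (1 << 20) // chunk):
            f.write(data)
    start = time.perf_counter()
    read = 0
    with open(fname, "rb") as f:
        while block := f.read(chunk):
            read += len(block)
    elapsed = time.perf_counter() - start
    os.remove(fname)
    return read / (1 << 20) / elapsed  # MB/s

print(f"{measure_read_mbps(tempfile.gettempdir()):.0f} MB/s")
```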