Shaping change: this is what drives us at Forschungszentrum Jülich. As a member of the Helmholtz Association with some 7,600 employees, we conduct interdisciplinary research into a digitalized society, a climate-friendly energy system, and a sustainable economy. We focus on the natural, life, and engineering sciences in the fields of information, energy, and bioeconomy. We combine this with expertise in high-performance computing and artificial intelligence using unique scientific infrastructures.
Your Job
As an MLOps Engineer, you will work at the intersection of AI enablement and production-grade operations, helping teams run model inference services that are dependable, scalable, and secure. Your key responsibilities include:
Model Onboarding & Deployment
Take AI models from internal and external users into production inference workflows. Package and deploy models as containerized inference endpoints. Define and operate validation gates covering functional checks, performance benchmarks, and safety/security requirements before production rollout.
Platform Operations & Reliability
Run and maintain BLABLADOR as an operational inference suite within JAIF's environment, aligned with the goal of enabling high-performance, secure, privacy-compliant AI applications. Ensure reliability through comprehensive monitoring, logging, alerting, incident response, and post-incident improvements. Handle day-2 operations including upgrades, patching, lifecycle management, and capacity planning.
Scalability, Performance & Efficiency
Tune inference services to optimize latency, throughput, batching, GPU utilization, and multi-tenant fairness.
Your Profile
We are looking for a colleague who can confidently operate modern infrastructure and understands what it takes to run AI inference services in production.
Requirements
- Excellent master's degree in Computer Science, Mathematics, or a comparable field
- Relevant experience in platform operations, site reliability engineering, DevOps, or infrastructure engineering
- Practical expertise with containers and OCI tooling (e.g., Docker/Podman, image builds, registries, runtime troubleshooting)
- Experience operating cloud technologies and orchestration (e.g., Kubernetes and its ecosystem)
- Very good command of written and spoken English with extensive vocabulary
Nice to Have
- Infrastructure-as-code and CI/CD experience (e.g., Terraform/Ansible, GitOps)
- Observability stacks (e.g., Prometheus/Grafana, ELK/OpenSearch, Loki, OpenTelemetry)
- Familiarity with AI model inference engines and serving stacks (e.g., TensorRT, Triton, vLLM serving, or comparable inference frameworks)
- Experience with web service deployment, such as reverse proxies, TLS certificates, load balancing, and API endpoints
- Experience with GPU scheduling/operations and performance debugging on accelerated systems
- Strong Linux skills (process/network debugging, automation, scripting)
- Prior work in (or with) research computing, scientific environments, or multi-tenant platforms
Our Benefits for You
- Working Environment: A highly relevant mission at the intersection of exascale computing and AI, including JAIF's goal of enabling broad access to JUPITER and its planned cloud-based inference capabilities.
- Work-Life Balance: Optimal conditions for balancing work and private life, as well as a family-friendly company policy. The option of flexible working (in terms of location) is generally available after consultation and in line with upcoming tasks and (on-site) appointments.
- Flexibility: Flexible working time models, including options close to full-time, allow you to tailor your working hours to suit your individual needs.
- Knowledge & Further Training: Your professional development is important to us – we provide targeted, individual support.
- Health & Well-Being: Your health is important to us. You can look forward to a comprehensive occupational health management program with a wide range of offerings.
- Vacation: You will receive 30 days of vacation plus additional days off (e.g. between Christmas and New Year's).
- Fair Remuneration: Depending on your existing qualifications and the tasks assigned to you, you will be classified in pay grade 13 of the TVöD Bund (Collective Agreement for the Public Service).