Backend Python Engineer | ML | Infrastructure | Reliability
Fully remote; ideally Europe-based.
We’re hiring a Backend Software Engineer to own and operate mission-critical Django services that orchestrate large-scale ML inference workflows in production.
About the Role
This is a hands-on, end-to-end ownership role focused on building reliable, high-throughput backend systems — not a research role, not a pure infra role, and not a ticket-driven support position.
Responsibilities
- Design, build, and run Django services in production
- Own high-throughput async workflows using queues, workers, and schedulers
- Implement safe orchestration patterns: retries, idempotency, rate limiting, backpressure
- Define and operate SLOs, error budgets, alerts, and on-call
- Lead incident response and write postmortems that drive real improvements
- Build end-to-end observability (metrics, logs, traces, dashboards, runbooks)
- Improve reliability of service integrations using timeouts, circuit breakers, and fallbacks
- Work closely with ML engineers to productionise inference pipelines
- Own CI/CD and deployment workflows for backend services
- Use Infrastructure as Code (Terraform) to support reliability and scale
- Optimise performance and cost across compute, storage, databases, and external APIs
Qualifications
- Strong experience as a Python backend engineer owning production systems
- Hands-on experience running Django in production (ORM, migrations, performance tuning)
- Experience building and operating asynchronous job systems (Celery, RQ, Arq, or similar)
- Experience with workflow/orchestration systems (Temporal, Prefect, Airflow, Step Functions, etc.)
- Solid understanding of distributed systems reliability (timeouts, retries, idempotency, rate limiting, backpressure)
- Experience defining and operating SLOs/SLAs and participating in on-call
- Strong Linux, networking, and debugging fundamentals
- Experience with AWS and/or GCP
- Practical experience using Terraform as part of a wider system
Preferred Skills
- Experience running ML inference or training systems at scale
- Familiarity with MLOps tooling (SageMaker, Vertex AI, Kubeflow, MLflow, Argo)
- Experience with observability stacks (OpenTelemetry, Prometheus, Grafana, ELK/Loki)
- Experience operating Postgres and Redis in high-throughput environments
- Startup or greenfield system ownership experience