Backend Python Engineer | ML | Infrastructure | Reliability
Fully remote; ideally Europe-based.
We’re hiring a Backend Software Engineer to own and operate mission-critical Django services that orchestrate large-scale ML inference workflows in production.
About the Role
This is a hands-on, end-to-end ownership role focused on building reliable, high-throughput backend systems — not a research role, not a pure infra role, and not a ticket-driven support position.
Responsibilities
- Design, build, and run Django services in production
- Own high-throughput async workflows using queues, workers, and schedulers
- Implement safe orchestration patterns: retries, idempotency, rate limiting, backpressure
- Define and operate SLOs, error budgets, alerts, and on-call
- Lead incident response and write postmortems that drive real improvements
- Build end-to-end observability (metrics, logs, traces, dashboards, runbooks)
- Improve reliability of service integrations using timeouts, circuit breakers, and fallbacks
- Work closely with ML engineers to productionise inference pipelines
- Own CI/CD and deployment workflows for backend services
- Use Infrastructure as Code (Terraform) to support reliability and scale
- Optimise performance and cost across compute, storage, databases, and external APIs
Qualifications
- Strong experience as a Python backend engineer owning production systems
- Hands-on experience running Django in production (ORM, migrations, performance tuning)
- Experience building and operating asynchronous job systems (Celery, RQ, Arq, or similar)
- Experience with workflow/orchestration systems (Temporal, Prefect, Airflow, Step Functions, etc.)
- Solid understanding of distributed systems reliability (timeouts, retries, idempotency, rate limiting, backpressure)
- Experience defining and operating SLOs/SLAs and participating in on-call
- Strong Linux, networking, and debugging fundamentals
- Experience with AWS and/or GCP
- Practical experience using Terraform as part of a wider system
Preferred Skills
- Experience running ML inference or training systems at scale
- Familiarity with MLOps tooling (SageMaker, Vertex AI, Kubeflow, MLflow, Argo)
- Experience with observability stacks (OpenTelemetry, Prometheus, Grafana, ELK/Loki)
- Experience operating Postgres and Redis in high-throughput environments
- Startup or greenfield system ownership experience