SR2 | Socially Responsible Recruitment | Certified B Corporation™

Backend Python Engineer | ML | Infrastructure | Reliability

Fully remote - ideally Europe-based.

We’re hiring a Backend Software Engineer to own and operate mission-critical Django services that orchestrate large-scale ML inference workflows in production.


About the Role

This is a hands-on, end-to-end ownership role focused on building reliable, high-throughput backend systems — not a research role, not a pure infra role, and not a ticket-driven support position.


Responsibilities

  • Design, build, and run Django services in production
  • Own high-throughput async workflows using queues, workers, and schedulers
  • Implement safe orchestration patterns: retries, idempotency, rate limiting, backpressure (sketched below)
  • Define and operate SLOs, error budgets, alerts, and on-call
  • Lead incident response and write postmortems that drive real improvements
  • Build end-to-end observability (metrics, logs, traces, dashboards, runbooks)
  • Improve reliability of service integrations using timeouts, circuit breakers, and fallbacks
  • Work closely with ML engineers to productionise inference pipelines
  • Own CI/CD and deployment workflows for backend services
  • Use Infrastructure as Code (Terraform) to support reliability and scale
  • Optimise performance and cost across compute, storage, databases, and external APIs
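
For illustration, a minimal sketch of the kind of orchestration pattern the bullets above describe, assuming Celery with a Redis broker; the task name, payload shape, retry settings, and model endpoint are hypothetical, not taken from this posting.

import hashlib
import json

import requests
from celery import Celery

app = Celery(
    "inference",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

@app.task(
    bind=True,
    autoretry_for=(requests.ConnectionError, requests.Timeout),
    retry_backoff=True,                  # exponential backoff between retries
    retry_kwargs={"max_retries": 5},
    rate_limit="100/m",                  # per-worker rate limiting
    acks_late=True,                      # redeliver if a worker dies mid-task
)
def run_inference(self, payload: dict) -> dict:
    # Idempotency key derived from the payload, so retries and duplicate
    # deliveries can be deduplicated by the downstream service.
    idempotency_key = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()
    response = requests.post(
        "https://inference.internal/predict",   # hypothetical endpoint
        json={"idempotency_key": idempotency_key, **payload},
        timeout=30,                             # hard timeout, never unbounded
    )
    response.raise_for_status()
    return response.json()

In practice, backpressure would usually come from the queue layer as well (bounded prefetch, worker concurrency limits) rather than from rate_limit alone; the sketch only shows where each concern sits.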



Qualifications

  • Strong experience as a Python backend engineer owning production systems
  • Hands-on experience running Django in production (ORM, migrations, performance tuning)
  • Experience building and operating asynchronous job systems (Celery, RQ, Arq, or similar)
  • Experience with workflow/orchestration systems (Temporal, Prefect, Airflow, Step Functions, etc.)
  • Solid understanding of distributed systems reliability (timeouts, retries, idempotency, rate limiting, backpressure) - see the example below
  • Experience defining and operating SLOs/SLAs and participating in on-call
  • Strong Linux, networking, and debugging fundamentals
  • Experience with AWS and/or GCP
  • Practical experience using Terraform as part of a wider system
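
As a concrete, simplified illustration of the timeout / circuit breaker / fallback pattern referenced above, a sketch in plain Python; the failure threshold, cooldown, and fallback semantics are assumptions made for the example, not requirements of the role.

import time
from typing import Any, Callable, Optional

class CircuitBreaker:
    """Open after `failure_threshold` consecutive failures, then short-circuit
    calls to the fallback until `reset_after` seconds have elapsed."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[..., Any], *args: Any,
             fallback: Any = None, **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback                    # open: fail fast, shed load
            self.opened_at = None                  # half-open: allow one attempt
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback
        self.failures = 0
        return result

# Usage: wrap a downstream call with a hard timeout and a safe default.
# breaker = CircuitBreaker()
# result = breaker.call(requests.get, url, timeout=2, fallback=None)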


Preferred Skills

  • Experience running ML inference or training systems at scale
  • Familiarity with MLOps tooling (SageMaker, Vertex AI, Kubeflow, MLflow, Argo)
  • Experience with observability stacks (OpenTelemetry, Prometheus, Grafana, ELK/Loki) - illustrated below
  • Experience operating Postgres and Redis in high-throughput environments
  • Startup or greenfield system ownership experience
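
For illustration, a minimal instrumentation sketch assuming the Prometheus Python client (prometheus_client); the metric names, labels, and port are hypothetical.

import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "inference_requests_total",
    "Inference requests handled, by outcome",
    ["outcome"],
)
LATENCY = Histogram(
    "inference_latency_seconds",
    "End-to-end inference latency in seconds",
)

@LATENCY.time()
def handle_request() -> None:
    # Stand-in for real work (queueing, model call, post-processing).
    time.sleep(random.uniform(0.01, 0.1))
    REQUESTS.labels(outcome="success").inc()

if __name__ == "__main__":
    start_http_server(9000)   # expose /metrics for Prometheus to scrape
    while True:
        handle_request()

Dashboards and alerts would then be built on these series in Grafana, with traces and logs correlated separately (for example via OpenTelemetry and Loki).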