Posted 01 July, 2026

Senior Site Reliability Engineer

Obsidian Security
Australia, Full Time
Reference: 102_700101_5237083008

About the DevOps / SRE Team

The DevOps/SRE team at Obsidian ensures that engineering excellence translates into stable, scalable, and high-performing production systems. We work closely with Engineering, Quality Engineering, and Customer Support to deliver end-to-end services that bring code to life and maintain our world-class SaaS security platform.

As part of our Sydney team, you will also play a foundational role in building Sherlock - our AI-powered SRE agent - owning the infrastructure that enables autonomous incident detection, root cause analysis, and remediation at scale.

What You'll Do

Platform Reliability

  • Support and maintain the service quality of our customer-facing SaaS security platform
  • Address complex challenges around scalability, reliability, observability, and cost efficiency
  • Collaborate with Engineering teams to maintain and enhance Helm charts, application deployment, monitoring, and CI/CD pipelines
  • Embed into the engineering team so that you understand the application deeply
  • Define service verification strategies and implement them as part of the CI/CD process to meet SLAs
  • Improve developer experience by optimizing CI/CD workflows and performance
  • Participate in the on-call rotation, providing 24/7 support in coordination with our global SRE team
  • Monitor, debug, and optimize production infrastructure and services on AWS/GCP
  • Own and evolve the observability stack: design and maintain Prometheus/Mimir metrics pipelines, Grafana dashboards, Loki log aggregation, and distributed tracing (e.g. Tempo, Jaeger, or OpenTelemetry)
  • Define and instrument SLIs/SLOs across services; build alerting strategies that reduce noise and surface actionable signals

AI SRE Agent (Sherlock)

  • Own the Kubernetes infrastructure for Sherlock: five independently scaled worker pools, each tuned for its agent's compute profile with HPA autoscaling
  • Design and maintain the Cloud SQL schema, migration pipeline, task queue (SKIP LOCKED), and pgvector IVFFlat index for 1,000+ RCA entries
  • Build Grafana dashboards covering queue depth, worker latency, agent error rates, accuracy trends, and P50/P95 speed
  • Own and maintain the benchmark CI gate in GitLab that blocks any prompt-version merge that regresses accuracy by more than 5% or speed by more than 15%
  • Deliver capacity planning and cost dashboards for Sherlock's GKE node pools
  • By month 3, serve as the primary on-call engineer for all Sherlock infrastructure

What We're Looking For

Required

  • 4+ years of experience in a DevOps or SRE role supporting SaaS services on GCP and/or AWS
  • Bachelor's degree in Computer Science or related field
  • Production Kubernetes experience: you have authored and owned Deployments, HPAs, and resource limits - not just applied YAML
  • Strong proficiency in Kubernetes, microservices architecture, Helm, GitLab CI/CD, and ArgoCD
  • Deep hands-on experience with the Grafana observability stack: Prometheus/Mimir (metrics), Loki (logs), and distributed tracing (Tempo, Jaeger, or OpenTelemetry)
  • Ability to design SLI/SLO frameworks, build alerting rules, and reduce alert fatigue across complex microservices
  • PostgreSQL fluency: schema design, indexing, migrations, and query optimisation
  • Async / queue-based architecture experience: debugged stuck queues, consumer lag, and duplicate processing
  • Programming proficiency in Python or Go
  • Strong ownership mindset and comfort with production on-call responsibility

Highly Desired

  • GCP expertise: Cloud SQL, GKE, IAM, Pub/Sub
  • pgvector or other vector database experience
  • CI/CD pipeline ownership (GitLab CI or GitHub Actions)
  • Familiarity with LLM APIs (Anthropic, Bedrock, or Vertex)
  • Understanding of AI agent design patterns and frameworks
  • Experience with Kafka, Elasticsearch, ScyllaDB, Databricks, Dagster, Sentry, or Kong
