About the role
Iru is the AI-powered security & IT platform used by the world’s fastest-growing companies to secure their users, apps, and devices. Built for the AI era, Iru unifies identity & access, endpoint security & management, and compliance automation—collapsing the stack and giving IT & security time and control back.
Iru is backed by some of the smartest investors in tech—General Catalyst, Tiger Global, Felicis, Greycroft, and First Round Capital. In July 2024, Iru raised $100 million from General Catalyst, valuing the company at $850 million. Customers include Notion, Cursor, Lovable, Replit, and Mercor, and Iru partners with industry leaders such as ServiceNow and AWS. Iru was named to Forbes’ America’s Best Startup Employers 2025 list for employee engagement and satisfaction.
The Opportunity
We are looking for a Senior SRE to own how we detect, respond to, and learn from incidents, and to drive consistent observability across services and teams. This role sits at the intersection of reliability engineering and cross-team enablement—you will work alongside our Infrastructure team to complement their platform-building work with a sharp focus on operational excellence and measurable reliability. You will partner with engineering and platform teams to reduce MTTD and MTTR, and to make reliability measurable, repeatable, and ultimately team-owned.
Responsibilities
- Lead and refine the incident lifecycle: detection, triage, communication, mitigation, resolution, and post-incident review.
- Define and maintain severity models, escalation paths, on-call expectations, and runbooks/playbooks—keeping them current and usable under pressure.
- Facilitate blameless postmortems; turn findings into tracked remediations and shared learning that reduces repeat incidents.
- Improve coordination during major incidents: roles, tooling, customer/stakeholder updates, and handoffs.
- Partner with security, support, and product on incident communications and regulatory or contractual obligations where applicable.
Requirements
- Experience: 5+ years in SRE, production engineering, or equivalent, including on-call responsibility for customer-facing systems.
- Incidents: Proven experience running or significantly improving incident response (process, tooling, or both) in a distributed systems environment.
- Observability: Deep, hands-on experience with Datadog—building dashboards, monitors, and instrumentation standards across multiple teams or services. Experience with metrics, logging, and tracing at scale.
- SLI/SLO Programs: Demonstrated experience defining SLOs/SLIs and error budget policies in production; comfortable working with teams to codify the metrics that underpin their reliability posture.
- Systems: Strong understanding of Linux, networking, distributed systems failure modes, and cloud or hybrid infrastructure (Kubernetes, load balancers, databases, queues).
- Automation: Proficiency in at least one of Go, Python, or similar for tooling and automation; comfort with IaC concepts (Terraform or equivalent).
- Communication: Clear written and verbal communication; ability to facilitate discussions during high-pressure incidents and deliberate postmortems alike.
- Collaboration: Track record of influencing without direct authority and driving adoption across engineering teams.