About Datadog :
We're on a mission to build the best platform in the world for engineers to understand and scale their systems, applications, and teams.
We operate at high scale (trillions of data points per day), providing always-on alerting, metrics visualization, logs, and application tracing for tens of thousands of companies.
Our engineering culture values pragmatism, honesty, and simplicity to solve hard problems the right way.
The team :
The Resilience Engineering group at Datadog focuses on improving resilience in our software and staff. We work on defining our on-call tooling and incident response process for the entire company, constantly iterating on it through the lessons we learn from production.
We help out during the most complex production incidents: our Resilience Engineers excel at troubleshooting and have a passion for problem solving and efficiency.
We also build the chaos platform and tooling so that engineers can take a measured approach to breaking systems to test their resilience, and can reproduce past bugs and incidents to verify their remediation.
The opportunity :
When we design systems, our Software Engineers and Site Reliability Engineers invest heavily in making them reliable and robust.
However, it wouldn’t be pragmatic to expect our systems to be perfect and never fail. Being prepared to deal with unknown failures, from both a technical and an organizational standpoint, is the core work of Resilience Engineers.
You will :
You will also help train our on-call staff, preparing newcomers for their on-call responsibilities and refreshing the rest of the staff on what we’ve learned from past incidents.
Bonus points :