Data

Data Pipelines & Pagers: A Data Engineer's Guide to Oncall Excellence

In the high-stakes world of streaming entertainment, data pipeline reliability isn't just about keeping jobs running—it's about maintaining the quality of user experience for millions of global subscribers. This talk draws from real-world experiences managing Netflix's data infrastructure to reveal critical patterns and antipatterns in building resilient data systems.


We'll dive deep into the evolution of our alert ecosystem, exploring how we transformed from a reactive "fix-it-when-it-breaks" approach to a proactive reliability framework. Through practical examples, we'll examine common pitfalls in data pipeline monitoring, the dangers of unclear alert definitions, and the challenge of balancing alert sensitivity.


Key areas of discussion include:

* Implementing effective alerting strategies for complex data workflows
* Designing actionable runbooks that bridge the gap between documentation and incident response
* Building ownership models that scale with distributed data teams
* Quantifying and communicating business impact during pipeline incidents
* Automating recovery procedures while maintaining data integrity

Attendees will walk away with concrete strategies for improving their data pipeline reliability, clear guidelines for implementing effective alerting, and a framework for building robust incident response processes. Whether you're handling terabytes or petabytes, these battle-tested patterns will help you build more resilient data systems.


Target Audience: Data Engineers, Analytics Engineers, and Data Platform Engineers who design, maintain, and support data pipelines in production environments.

20 Mar, 12:50 pm - 1:00 pm PST

Register for move(data) 2025!