stage(squidington)

Better data testing with the data (error) generating process

07 Dec
 
11:10 am - 11:20 am PST

About the Session

Statisticians often approach probabilistic modeling by first understanding the conceptual data generating process. However, when validating messy real-world data, the technical aspects of the data generating process are largely ignored.

In this talk, I will argue the case for developing more semantically meaningful and well-curated data tests by incorporating both conceptual and technical aspects of "how the data gets made".

To illustrate these concepts, we will explore the NYC subway rides open dataset to see how the simple act of reasoning about real-world events and their collection through ETL processes can help craft far more sensitive and expressive data quality checks. I will also demonstrate instrumenting such checks using new features in the dbt-utils package (pending approval of a PR that I recently authored).
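To make the flavor of such checks concrete, here is a minimal Python sketch of a grouped completeness check (all data and names are hypothetical, not from the actual dataset or dbt-utils): a table-level range check on the hour column would pass, but reasoning about the data generating process ("every station's turnstiles report every hour") exposes missing observations per group.

```python
from collections import defaultdict

# Hypothetical subway rides: (station, hour, entry_count) records.
# Station A is silently missing hour 2; station B is missing hours 2 and 3.
rides = [
    ("A", 0, 120), ("A", 1, 95), ("A", 3, 80),
    ("B", 0, 40), ("B", 1, 35),
]

def check_grouped_completeness(rows, expected_hours):
    """Return, per station, the expected hours with no observations.

    A naive table-level test (e.g. "hour is between 0 and 23") cannot
    detect these gaps; grouping by the reporting unit can.
    """
    seen = defaultdict(set)
    for station, hour, _ in rows:
        seen[station].add(hour)
    return {
        station: sorted(expected_hours - hours)
        for station, hours in seen.items()
        if expected_hours - hours
    }

print(check_grouped_completeness(rides, {0, 1, 2, 3}))
# {'A': [2], 'B': [2, 3]}
```

In a dbt project the same idea would live in a test rather than application code; the sketch above only illustrates the reasoning pattern.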

Audience members should leave this talk with a clear framework in mind for ideating better tests for their own pipelines.

Prior work inspiring this talk comes from past blog posts on grouped data checks (https://www.emilyriederer.com/post/grouping-data-quality/) and common causes of error in ETL pipelines (https://www.emilyriederer.com/post/data-error-gen/), as well as an in-review PR to dbt-utils (to be reviewed and, per initial communications with the dbt team, approved before this conference).

Want to join the movement?

Register today for move(data)!