Data Quality with Databricks Delta Live Tables

Sandip Roy
5 min read · Aug 8, 2022


What is Data Quality?

Data quality is the measure of how well suited a data set is to serve its specific purpose. Measures of data quality are based on data quality characteristics such as accuracy, completeness, consistency, validity, uniqueness, and timeliness.

Now, while you generally write unit tests for your code, do you also test your data? Incoming data quality can make or break your application. Incorrect, missing, or malformed data can have a large impact on production systems.

Data Quality in Databricks

Though Databricks is known for its excellence in data processing, it has recently released new frameworks to make data governance easier and more efficient. Without going into much detail on how governance is handled overall, in this article we discuss how Databricks' Delta Live Tables (DLT) framework applies data quality rules and segregates records into cleansed (good) and quarantined (bad) sets: calculating data quality metrics on your dataset, defining and verifying data quality constraints, and so on. The advantage is that instead of implementing checks and verification algorithms on your own, you can focus on describing how your data should look.

Without spending too much time on the theoretical aspects, let's jump into a few cases/scenarios. To begin, let's start with the sample data set below:
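
For illustration, here is a minimal sketch of what such a dataset might look like, created with PySpark in a Databricks notebook. The column names (id, name, age, email), the sample values, and the landing path are all assumptions made for this walkthrough, not the original dataset.

```python
# Hypothetical sample data; column names, values, and the landing path are
# assumptions made for this illustration. `spark` is available in a
# Databricks notebook session.
sample = [
    (1, "Alice", 34, "alice@example.com"),
    (2, "Bob", None, "bob@example.com"),   # missing age -> should be flagged
    (3, None, 29, "carol@example.com"),    # missing name -> should be flagged
]
(spark.createDataFrame(sample, ["id", "name", "age", "email"])
      .write.mode("overwrite")
      .json("/tmp/person_raw_landing"))
```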

Then read the raw data into a DLT table called person_raw, as below:
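
A minimal sketch of that ingestion step, assuming the JSON landing path from the hypothetical sample above:

```python
import dlt

# Raw ingestion table; the source path and format are assumptions carried
# over from the hypothetical sample data above.
@dlt.table(comment="Raw person records ingested as-is")
def person_raw():
    return spark.read.json("/tmp/person_raw_landing")
```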

Once data is read into the person_raw table, apply the data quality rules and add a bad-record indicator against each record, like below:
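
A sketch of that single-pass flagging approach; the rules below (non-null name, plausible age) are assumptions standing in for whatever quality rules apply to your data:

```python
import dlt
from pyspark.sql import functions as F

# Illustrative data quality rules; real rules depend on your dataset.
rules = {
    "valid_name": "name IS NOT NULL",
    "valid_age": "age IS NOT NULL AND age BETWEEN 0 AND 120",
}
all_rules_pass = " AND ".join(rules.values())

@dlt.table(comment="Person records with a bad-record indicator column")
def person_flagged():
    # is_quarantined is true when at least one rule is violated
    return dlt.read("person_raw").withColumn(
        "is_quarantined", F.expr(f"NOT ({all_rules_pass})")
    )
```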

Then create a workflow via Workflows → Delta Live Tables with the following sample specification:
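
The pipeline settings end up as JSON; a minimal sketch is shown below, where the pipeline name, the notebook path, and the target database sample_db are all hypothetical values chosen for this illustration:

```json
{
  "name": "person_dq_pipeline",
  "target": "sample_db",
  "development": true,
  "continuous": false,
  "libraries": [
    { "notebook": { "path": "/Shared/dlt/person_dq_notebook" } }
  ]
}
```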

Now if we run this workflow, we can see how records are marked as good or bad in the tables created under the metastore database. Please note that if you don't specify a target database (e.g. sample_db as shown above), the tables are created under the default database.

Here’s the result:

So through the column is_quarantined = true, you can track all the records that failed the validation process.
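
For example, a quick check of the flagged records could look like this; the database and table names (sample_db, person_flagged) are the assumed names from the sketches above:

```python
# Inspect the records that failed validation; the database and table names
# are the hypothetical ones used in the earlier sketches.
spark.sql("""
    SELECT *
    FROM sample_db.person_flagged
    WHERE is_quarantined = true
""").show(truncate=False)
```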

Now, while this approach is highly performant (it scans the records only once), the trade-off is that, since it doesn't use the expectations framework, data quality metrics do not appear in the event logs or the pipelines UI.

In order to have richer DQ metrics, we would have to go for the expectation-based approach, where we can bifurcate good and bad records into different tables, fail the pipeline if there are any bad records, get data validation statistics as metrics, and so on.

I'll illustrate sample implementations (how we segregate good and bad records) below to give an idea. Please note that you can use the expect, expect_or_drop, and expect_or_fail expectations in Python or SQL to define a single data quality constraint, while you use the @expect_all, @expect_all_or_drop, and @expect_all_or_fail decorators to define one or more data quality constraints in Python pipelines.
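
As a rough sketch (not the exact code behind the screenshots), splitting good and bad records with expectations could look like the following; the rule set and table names are assumptions carried over from the earlier sketches:

```python
import dlt

# Illustrative rules, reused from the earlier sketch.
rules = {
    "valid_name": "name IS NOT NULL",
    "valid_age": "age IS NOT NULL AND age BETWEEN 0 AND 120",
}

@dlt.table(comment="Records that pass all data quality rules")
@dlt.expect_all_or_drop(rules)
def person_clean():
    return dlt.read("person_raw")

@dlt.table(comment="Records that violate at least one data quality rule")
def person_quarantine():
    # Keep only the rows that fail at least one rule.
    failing = "NOT ({})".format(" AND ".join(rules.values()))
    return dlt.read("person_raw").where(failing)
```

With this layout, the expectation metrics for person_clean appear in the event log and the pipeline UI, while person_quarantine keeps the rejected rows available for inspection.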

Once the above is executed, the pipeline graph will look like this:

Also, when you check the metastore database, you can see the following:
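
For instance, listing the tables in the target database (again assuming the hypothetical sample_db target):

```python
# List the tables DLT created under the target database (name is an assumption).
spark.sql("SHOW TABLES IN sample_db").show(truncate=False)
```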

Again, this approach has its own disadvantage: it generates the quarantine table by processing the data twice. If you don't want this performance overhead, you can use the constraints directly within a query to generate a column indicating the validation status of a record (i.e. the previous approach demonstrated above).

But you can easily see from the above implementations how quickly and efficiently we can build a scalable data validation framework with minimal effort. Hope you liked this!

References

https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-quickstart.html

https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-expectations.html

Thanks for reading. In case you want to share your case studies or want to connect, please ping me via LinkedIn.
