Data Validation at Scale with Spark/Databricks

Sandip Roy
6 min read · Mar 11, 2022


What is Data Quality?

Data quality is the measure of how well suited a data set is to serve its specific purpose. Measures of data quality are based on data quality characteristics such as accuracy, completeness, consistency, validity, uniqueness, and timeliness.

Now, while you generally write unit tests for your code, do you also test your data? Incoming data quality can make or break your application. Incorrect, missing, or malformed data can have a large impact on production systems.

(Figure: a typical data ingestion pipeline with data quality checks)

Data Quality in the Big Data World

Apache Spark has become the de facto technology for big data ingestion and transformation, and it becomes even more robust with the managed platform provided by Databricks for working with data at scale.

Without spending too much time on the theoretical aspects, let’s jump into a few cases/scenarios (I’ll be using Databricks notebooks for demo purposes):

Anomaly Detection
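The original notebook cells are screenshots and aren’t reproduced here, so below is a minimal PySpark sketch of the idea: two hypothetical snapshots of the same data set are joined on a key column (id), and the query flags which records changed and exactly which columns changed. The column names and sample rows are my own illustration, not the data from the original demo.

```python
# A minimal sketch of a record-level anomaly/diff check (hypothetical data and columns).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Two snapshots of the same data set, e.g. yesterday's load vs. today's load.
yesterday = spark.createDataFrame(
    [(1, "Alice", "IN"), (2, "Bob", "US"), (3, "Carol", "UK")],
    ["id", "name", "country"])
today = spark.createDataFrame(
    [(1, "Alice", "IN"), (2, "Bob", "CA"), (3, "Caroline", "UK")],
    ["id", "name", "country"])

compare_cols = ["name", "country"]

diff = (
    yesterday.alias("old")
    .join(today.alias("new"), on="id", how="inner")
    # Build an array holding the name of every column whose value differs.
    .withColumn(
        "changed_columns",
        F.array_remove(
            F.array(*[
                F.when(F.col(f"old.{c}") != F.col(f"new.{c}"), F.lit(c)).otherwise(F.lit(""))
                for c in compare_cols
            ]),
            ""))
    # Keep only the records where at least one column changed.
    .filter(F.size("changed_columns") > 0)
    .select(
        "id",
        "changed_columns",
        *[F.col(f"old.{c}").alias(f"{c}_old") for c in compare_cols],
        *[F.col(f"new.{c}").alias(f"{c}_new") for c in compare_cols]))

diff.show(truncate=False)
```

Collecting the changed column names into an array keeps the whole comparison inside Spark, so the same pattern scales to wide tables by simply extending compare_cols.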

Here’s the result: for each changed record, you get the key, the list of columns that changed, and the old and new values side by side.

What’s there beyond anomaly detection?

Above we can clearly see which records changed and specifically which columns changed. Now that we are comfortable with a basic anomaly check, how do we address broader aspects like the following:

  • How can missing values lead to failures in production systems that require non-null values (NullPointerException)?
  • How can changes in the distribution of data lead to unexpected outputs of machine learning models?
  • How can aggregations of incorrect data lead to wrong business decisions?

Well, there are several open-source data quality frameworks, viz. Apache Griffin, Great Expectations, and Deequ, as well as Delta Live Tables (DLT) from Databricks, that facilitate this.

In our case, we will be using Deequ from AWS. It allows you to calculate data quality metrics on your dataset, define and verify data quality constraints, and be informed about changes in the data distribution. Instead of implementing checks and verification algorithms on your own, you can focus on describing how your data should look. Deequ supports you by suggesting checks for you. Deequ is implemented on top of Apache Spark and is designed to scale with large datasets (think billions of rows) that typically live in a distributed filesystem or a data warehouse.

First, let’s configure the required library, deequ (pydeequ for PySpark), as shown below:
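The configuration cell in the original post is a screenshot; a minimal setup sketch, assuming a recent pydeequ release and a Spark 3.x cluster (adjust SPARK_VERSION and the jar coordinates to your environment), looks roughly like this:

```python
# Minimal setup sketch for pydeequ on PySpark/Databricks.
# Install pydeequ on the cluster (e.g. %pip install pydeequ) and make sure the
# Deequ jar matching your Spark version is available.
import os
os.environ["SPARK_VERSION"] = "3.3"  # newer pydeequ releases read the Spark version from here

import pydeequ
from pyspark.sql import SparkSession

# On Databricks a SparkSession already exists; elsewhere, pull the Deequ jar via Maven.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())
```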

Then let’s start with the sample data set below:
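The sample data in the original post is also a screenshot; the hypothetical data set below (columns id, name, email, age) is deliberately seeded with null values, a duplicate id, a negative age, and an over-long name so that some of the checks in the next step fail:

```python
# Hypothetical sample data; the nulls, duplicate id, negative age and over-long
# name are deliberate so that some of the checks below will fail.
data = [
    (1, "Alice", "alice@example.com", 34),
    (2, "Bob", None, 29),
    (2, "Charlie", "charlie@example.com", 41),
    (3, None, "dana@example.com", -5),
    (4, "An unreasonably long name that exceeds the limit", "eve@example.com", 23),
]
df = spark.createDataFrame(data, ["id", "name", "email", "age"])
df.show(truncate=False)
```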

And then run quality checks on it as below:
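Here is a sketch of the verification step with pydeequ, wiring up a handful of the constraints listed later in this post (the exact checks used in the original notebook may differ):

```python
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

check = (
    Check(spark, CheckLevel.Error, "Basic data quality checks")
    .hasSize(lambda size: size >= 3)                     # at least 3 rows ingested
    .isComplete("id")                                    # id must never be null
    .isUnique("id")                                      # id must be unique
    .isComplete("name")                                  # name must never be null
    .hasMaxLength("name", lambda length: length <= 20)   # names capped at 20 characters
    .isComplete("email")                                 # email must never be null
    .isNonNegative("age"))                               # age must not be negative

result = (
    VerificationSuite(spark)
    .onData(df)
    .addCheck(check)
    .run())

result_df = VerificationResult.checkResultsAsDataFrame(spark, result)
result_df.show(truncate=False)
```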

Here’s the result: a check-results data frame with one row per constraint, including the constraint, constraint_status, and constraint_message columns.

Through the column constraint_status = “Failure”, you can track all the constraints that failed during the validation process.

Though Deequ provides an overall data quality report, it doesn’t fetch the individual bad records that failed the constraints. However, we can construct methods that create dynamic queries to identify bad records, taking our cue from the Deequ constraint implementations that we used against each dataset.

I’ll illustrate two sample implementations (for “CompletenessConstraint” and “MaxLengthConstraint”) to give an idea.

Sample I
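The original cell isn’t shown, so here is a sketch of the idea for “CompletenessConstraint”: parse the failed constraint string to find the column name, then build a dynamic filter that pulls the rows where that column is null. The helper name and the string parsing are my own hypothetical illustration, keyed off the constraint strings Deequ emits in its check results.

```python
from pyspark.sql import functions as F

def completeness_bad_records(df, constraint_text):
    """Hypothetical helper: given a failed constraint string such as
    'CompletenessConstraint(Completeness(name,None))', extract the column
    name and return the records that violate it (rows where it is null)."""
    column = constraint_text.split("Completeness(")[1].split(",")[0]
    return df.filter(F.col(column).isNull())

failed = (
    result_df
    .filter(F.col("constraint_status") == "Failure")
    .filter(F.col("constraint").contains("CompletenessConstraint"))
    .collect())

for row in failed:
    print(f"Bad records for: {row['constraint']}")
    completeness_bad_records(df, row["constraint"]).show(truncate=False)
```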

Sample II
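Similarly, a sketch for “MaxLengthConstraint”: extract the column name from the failed constraint and select the rows whose values exceed the length limit. Note that Deequ’s constraint string does not carry the assertion threshold itself, so in this hypothetical helper the limit is looked up from our own check definitions.

```python
from pyspark.sql import functions as F

# Hypothetical: keep the thresholds used when the checks were defined, since the
# constraint string in the results does not include the assertion.
max_length_limits = {"name": 20}

def max_length_bad_records(df, constraint_text, limits):
    """Hypothetical helper: given a failed constraint string such as
    'MaxLengthConstraint(MaxLength(name,None))', extract the column name and
    return the rows whose values are longer than the configured limit."""
    column = constraint_text.split("MaxLength(")[1].split(",")[0]
    return df.filter(F.length(F.col(column)) > limits[column])

failed = (
    result_df
    .filter(F.col("constraint_status") == "Failure")
    .filter(F.col("constraint").contains("MaxLengthConstraint"))
    .collect())

for row in failed:
    print(f"Bad records for: {row['constraint']}")
    max_length_bad_records(df, row["constraint"], max_length_limits).show(truncate=False)
```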

So you can easily see from the above implementations how quickly and efficiently we can build a scalable data validation framework with minimal effort. You can view the full set of constraints/checks available for your use below:

  1. hasSize — calculates the data frame size and runs the assertion on it.
  2. isComplete — asserts that a column is complete, i.e. contains no null values.
  3. hasCompleteness — runs an assertion on a column’s completeness (the fraction of non-null values).
  4. isUnique — asserts on a column uniqueness.
  5. isPrimaryKey — asserts on a column(s) primary key characteristics.
  6. hasUniqueness — asserts on uniqueness in a single or combined set of key columns.
  7. hasDistinctness — asserts on the distinctness in a single or combined set of key columns.
  8. hasUniqueValueRatio — asserts on the unique value ratio in a single or combined set of key columns.
  9. hasNumberOfDistinctValues — asserts on the number of distinct values a column has.
  10. hasHistogramValues — asserts on column’s value distribution.
  11. hasEntropy — asserts on a column entropy.
  12. hasMutualInformation — asserts on a mutual information between two columns.
  13. hasApproxQuantile — asserts on an approximated quantile.
  14. hasMinLength — asserts on the minimum length of the column.
  15. hasMaxLength — asserts on the maximum length of the column.
  16. hasMin — asserts on the minimum of the column.
  17. hasMax — asserts on the maximum of the column.
  18. hasMean — asserts on the mean of the column.
  19. hasSum — asserts on the sum of the column.
  20. hasStandardDeviation — asserts on the standard deviation of the column.
  21. hasApproxCountDistinct — asserts on the approximate count distinct of the given column.
  22. hasCorrelation — asserts on the Pearson correlation between two columns.
  23. satisfies — runs the given condition on the data frame.
  24. hasPattern — checks for pattern compliance.
  25. containsCreditCardNumber — verifies against a Credit Card pattern.
  26. containsEmail — verifies against an e-mail pattern.
  27. containsURL — verifies against an URL pattern.
  28. containsSocialSecurityNumber — verifies against the Social security number pattern for the US.
  29. hasDataType — verifies against the fraction of rows that conform to the given data type.
  30. isNonNegative — asserts that a column contains no negative values.
  31. isPositive — asserts that a column contains only positive values.
  32. isLessThan — asserts that, in each row, the value of columnA < the value of columnB.
  33. isLessThanOrEqualTo — asserts that, in each row, the value of columnA ≤ the value of columnB.
  34. isGreaterThan — asserts that, in each row, the value of columnA > the value of columnB.
  35. isGreaterThanOrEqualTo — asserts that, in each row, the value of columnA ≥ the value of columnB.
  36. isContainedIn — asserts that every non-null value in a column is contained in a set of predefined values.

For simplicity and from a priority perspective, I’ve discussed only the critical and commonly used scenarios from our day-to-day work, but the Deequ framework also covers advanced aspects like full-fledged metrics calculation (Deequ computes data quality metrics, i.e. statistics such as completeness, maximum, or correlation) and automatic constraint suggestion (automated methods that profile the data to infer useful constraints).
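For completeness, here is a hedged sketch of those two pieces as exposed by pydeequ: metrics are computed with an AnalysisRunner, and suggested constraints come from a ConstraintSuggestionRunner. The analyzer choices and columns are illustrative, reusing the hypothetical df from earlier.

```python
import json
from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Size, Completeness, Maximum
from pydeequ.suggestions import ConstraintSuggestionRunner, DEFAULT

# Metrics calculation: compute statistics such as size, completeness and maximum.
analysis_result = (
    AnalysisRunner(spark)
    .onData(df)
    .addAnalyzer(Size())
    .addAnalyzer(Completeness("email"))
    .addAnalyzer(Maximum("age"))
    .run())
AnalyzerContext.successMetricsAsDataFrame(spark, analysis_result).show(truncate=False)

# Constraint suggestion: profile the data and let Deequ propose useful checks.
suggestion_result = (
    ConstraintSuggestionRunner(spark)
    .onData(df)
    .addConstraintRule(DEFAULT())
    .run())
print(json.dumps(suggestion_result, indent=2))
```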

References

https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/

https://github.com/awslabs/deequ

https://databricks.com/session_na21/data-quality-with-or-without-apache-spark-and-its-ecosystem

Thanks for reading. In case you want to share your case studies or want to connect, please ping me via LinkedIn.
