Data Quality Overview

Every time a pipeline task runs, it reads from input tables and writes to interim and output tables. definity agents run as part of each task, extracting metrics specific to each table, to the operation being performed, and to the time the task runs. Because the metrics are collected in real time during task execution, real-time reactions can be applied as soon as a metric indicates a quality issue.
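
As a rough illustration of this flow, consider the minimal Python sketch below. The function and table names are hypothetical and do not reflect definity's actual API; the point is only that metrics are captured inside the task, while it runs.

```python
# Hypothetical sketch of in-task metric extraction (not definity's actual API).
import time

def read_table(name):
    # Placeholder read: a real task would load this from storage.
    return [{"id": 1, "value": 10.0}, {"id": 2, "value": None}]

def extract_metrics(table_name, rows):
    # Metrics are captured at the moment the task touches the table.
    return {"table": table_name, "row_count": len(rows), "collected_at": time.time()}

def run_task():
    rows = read_table("orders")
    metrics = extract_metrics("orders", rows)   # collected in real time
    if metrics["row_count"] == 0:               # react while the task is running
        raise RuntimeError("Quality issue detected: empty input table")
    # ... transform and write to interim/output tables here ...

run_task()
```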

1. AI-Powered Data Quality Monitoring

Modern data pipelines are complex, and maintaining data quality manually can be challenging. definity uses AI and ML techniques to learn pipeline behavior and automatically generate data quality tests. This ensures that data remains accurate, complete, and consistent over time.

2. Real-Time Proactive Observability

Unlike traditional tools that analyze data after execution, definity runs within the pipeline itself with a minimal footprint. This enables real-time monitoring and proactive incident handling.

Key Real-Time Data Quality Features:

| Analysis Type | Proactive Intervention | Impact |
| --- | --- | --- |
| Detect stale / faulty input data | Prevent pipeline execution | Saves resources & prevents downstream impact |
| Detect faulty output data | Divert output to a quarantine location | Prevents downstream contamination, allows debugging |
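
A minimal sketch of how such interventions could be wired into a task follows. The helper names, freshness threshold, and quarantine path are all illustrative assumptions, not definity's API:

```python
# Hypothetical sketch of proactive interventions (all names are illustrative).
from datetime import datetime, timedelta

MAX_INPUT_AGE = timedelta(hours=6)  # assumed freshness threshold

def input_is_stale(last_updated: datetime) -> bool:
    return datetime.utcnow() - last_updated > MAX_INPUT_AGE

def run_pipeline(input_last_updated: datetime, output_rows: list, output_faulty: bool):
    # Stale input: prevent execution before any compute is spent.
    if input_is_stale(input_last_updated):
        raise RuntimeError("Input is stale; skipping run to save resources")

    # Faulty output: divert to a quarantine location instead of production.
    destination = "warehouse/quarantine/" if output_faulty else "warehouse/prod/"
    print(f"Writing {len(output_rows)} rows to {destination}")
```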

definity's Data Quality Metrics

definity tracks a comprehensive set of data quality metrics across multiple dimensions. These metrics ensure the integrity of data by detecting anomalies, inconsistencies, and inefficiencies.

3. Metric Types

definity categorizes its data quality metrics into five analytical pillars:

| Metric Type | Description |
| --- | --- |
| Time | Measures freshness of data. E.g., Freshness. |
| Structure | Tracks schema evolution and changes in data structure. E.g., Column Count, Schema Changes. |
| Content | Ensures valid data distributions and uniqueness. E.g., Null Percent, Value Histogram. |
| Volume | Tracks data growth and size variations. E.g., Row Count, File Size. |
| Behavior | Monitors changes in data behavior and code modifications. |

4. Column-Level Data Quality Metrics

definity provides detailed column-level metrics to assess individual data points.

| Metric Name | Description |
| --- | --- |
| Null Percent | Percentage of null values in a column. |
| Distinct Count | Number of unique values in a column. |
| Unique Percent | Percentage of unique values out of non-null entries. |
| Value Histogram | Distribution of values for low-cardinality fields. |
| Min/Max/Avg/Std Dev | Summary statistics for numeric columns. |
| Data Freshness | Time difference between read time and the latest timestamp in the data. |
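
To make these definitions concrete, here is a small self-contained Python sketch that computes several of the column-level metrics above over a list of values. It mirrors the definitions in the table, not definity's implementation:

```python
# Illustrative computation of column-level metrics (not definity's implementation).
import statistics
from collections import Counter

def column_metrics(values):
    non_null = [v for v in values if v is not None]
    metrics = {
        "null_percent": 100.0 * (len(values) - len(non_null)) / len(values),
        "distinct_count": len(set(non_null)),
        # Unique percent is measured against non-null entries, per the table above.
        "unique_percent": 100.0 * len(set(non_null)) / len(non_null) if non_null else 0.0,
        "value_histogram": dict(Counter(non_null)),  # useful for low-cardinality fields
    }
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    if numeric:
        metrics.update(
            min=min(numeric),
            max=max(numeric),
            avg=statistics.fmean(numeric),
            std_dev=statistics.pstdev(numeric),
        )
    return metrics

print(column_metrics([1, 2, 2, None, 3]))
# {'null_percent': 20.0, 'distinct_count': 3, 'unique_percent': 75.0, ...}
```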

5. Table-Level Data Quality Metrics

definity ensures data quality at the table level by tracking key metadata properties with minimal computational overhead.

Volume Metrics:

| Metric Name | Description |
| --- | --- |
| Row Count | Total number of records in a table. |
| Output Bytes | Total size of data written. |
| File Size | Total size of data read. |
| Partition Count | Number of partitions accessed. |

Schema & Freshness Metrics:

| Metric Name | Description |
| --- | --- |
| Column Count | Number of columns in a table. |
| Schema Changes | Tracks changes in table structure. |
| Table Freshness | Time elapsed since the last update. |
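
A minimal sketch of how such table-level metrics can be derived from cheap metadata follows; the snapshot structure and attribute names are illustrative assumptions:

```python
# Illustrative table-level metrics from basic metadata (names are hypothetical).
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TableSnapshot:
    rows: list            # list of dict records
    last_updated: datetime

def table_metrics(snapshot: TableSnapshot) -> dict:
    columns = set()
    for row in snapshot.rows:
        columns.update(row.keys())
    return {
        "row_count": len(snapshot.rows),
        "column_count": len(columns),
        # Table Freshness: time elapsed since the table was last updated.
        "freshness_seconds": (datetime.utcnow() - snapshot.last_updated).total_seconds(),
    }
```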

6. Automated Data Quality Tests

definity's Test Engine automatically generates data quality tests based on historical execution patterns. A test defines a valid range for a specific metric, triggering an alert when the metric deviates beyond its expected range.

How Automated Tests Work:

  1. Baseline Establishment: definity requires 4-5 successful pipeline executions to establish an expected range for each metric.
  2. Test Generation: Once a baseline is formed, definity generates tests to validate future executions.
  3. Anomaly Detection: If a metric falls outside the expected range, an alert is triggered (a minimal sketch of this logic follows the list).
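
For intuition, the baseline-and-check logic can be sketched in plain Python like this. The window size and the three-sigma band are illustrative choices, not definity's actual model:

```python
# Illustrative baseline + anomaly check for one metric (parameters are assumptions).
import statistics

MIN_RUNS = 4  # the docs cite 4-5 successful runs before a baseline forms

def expected_range(history, k=3.0):
    # Build a mean +/- k*std band from past successful executions.
    if len(history) < MIN_RUNS:
        return None  # not enough history yet; no test is generated
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    return (mean - k * std, mean + k * std)

def check(history, current):
    band = expected_range(history)
    if band is None:
        return "baseline-pending"
    low, high = band
    return "alert" if not (low <= current <= high) else "pass"

print(check([1000, 1020, 990, 1010], 400))  # alert: row count far below baseline
```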

Test Actions & Handling

| Test Action | Description |
| --- | --- |
| Alert | Raise an internal alert (Email/Slack support coming soon). |
| Pass | Ignore the test failure temporarily. |
| Break | Stop pipeline execution (future feature). |
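
A hedged sketch of how a test result might be dispatched to these actions; the dispatcher and its behavior here are hypothetical, not definity's implementation:

```python
# Illustrative dispatch of test actions (function and action names are hypothetical).
def handle_test_result(action: str, test_name: str):
    if action == "alert":
        print(f"[ALERT] {test_name} failed")      # e.g., forwarded to Email/Slack later
    elif action == "pass":
        pass                                       # temporarily ignore the failure
    elif action == "break":
        raise RuntimeError(f"{test_name} failed; stopping pipeline")  # future feature
    else:
        raise ValueError(f"Unknown test action: {action}")
```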

Conclusion

definity's comprehensive data quality observability enables organizations to maintain high data integrity and optimize costs. By leveraging AI-driven automated tests, real-time monitoring, and deep metric insights, definity ensures that data pipelines operate smoothly and efficiently.