Data Quality Overview

Every time a pipeline task runs, it reads from input tables and writes to interim and output tables. definity agents run as part of each task, extracting metrics specific to each table, to the operation being performed, and to the time the task runs. Because the metrics are collected in real time during task execution, real-time reactions can be applied as soon as a metric indicates a quality issue.
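
As a rough illustration of this flow, consider the minimal Python sketch below. The function and table names are hypothetical and do not reflect definity's actual API; the point is only that metrics are captured inside the task, while it runs.

```python
# Hypothetical sketch of in-task metric extraction (not definity's actual API).
import time

def read_table(name):
    # Placeholder read: a real task would load this from storage.
    return [{"id": 1, "value": 10.0}, {"id": 2, "value": None}]

def extract_metrics(table_name, rows):
    # Metrics are captured at the moment the task touches the table.
    return {"table": table_name, "row_count": len(rows), "collected_at": time.time()}

def run_task():
    rows = read_table("orders")
    metrics = extract_metrics("orders", rows)   # collected in real time
    if metrics["row_count"] == 0:               # react while the task is running
        raise RuntimeError("Quality issue detected: empty input table")
    # ... transform and write to interim/output tables here ...

run_task()
```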

1. AI-Powered Data Quality Monitoring

Modern data pipelines are complex, and maintaining data quality manually can be challenging. definity uses AI and ML techniques to learn pipeline behavior and automatically generate data quality tests. This ensures that data remains accurate, complete, and consistent over time.

2. Real-Time Proactive Observability

Unlike traditional tools that analyze data after execution, definity runs within the pipeline itself with a minimal footprint. This enables real-time monitoring and proactive incident handling.

Key Real-Time Data Quality Features:

| Analysis Type | Proactive Intervention | Impact |
| --- | --- | --- |
| Detect stale / faulty input data | Prevent pipeline execution | Saves resources & prevents downstream impact |
| Detect faulty output data | Divert output to a quarantine location | Prevents downstream contamination, allows debugging |
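
A minimal sketch of how such interventions could be wired into a task follows. The helper names, freshness threshold, and quarantine path are all illustrative assumptions, not definity's API:

```python
# Hypothetical sketch of proactive interventions (all names are illustrative).
from datetime import datetime, timedelta

MAX_INPUT_AGE = timedelta(hours=6)  # assumed freshness threshold

def input_is_stale(last_updated: datetime) -> bool:
    return datetime.utcnow() - last_updated > MAX_INPUT_AGE

def run_pipeline(input_last_updated: datetime, output_rows: list, output_faulty: bool):
    # Stale input: prevent execution before any compute is spent.
    if input_is_stale(input_last_updated):
        raise RuntimeError("Input is stale; skipping run to save resources")

    # Faulty output: divert to a quarantine location instead of production.
    destination = "warehouse/quarantine/" if output_faulty else "warehouse/prod/"
    print(f"Writing {len(output_rows)} rows to {destination}")
```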

definity's Data Quality Metrics

definity tracks a comprehensive set of data quality metrics across multiple dimensions. These metrics ensure the integrity of data by detecting anomalies, inconsistencies, and inefficiencies.

3. Metric Types

definity categorizes its data quality metrics into five analytical pillars:

| Metric Type | Description |
| --- | --- |
| Time | Measures freshness of data. E.g., Freshness. |
| Structure | Tracks schema evolution and changes in data structure. E.g., Column Count, Schema Changes. |
| Content | Ensures valid data distributions and uniqueness. E.g., Null Percent, Value Histogram. |
| Volume | Tracks data growth and size variations. E.g., Row Count, File Size. |
| Behavior | Monitors changes in data behavior and code modifications. |

4. Column-Level Data Quality Metrics

definity provides detailed column-level metrics to assess individual data points.

| Metric Name | Description |
| --- | --- |
| Null Percent | Percentage of null values in a column. |
| Distinct Count | Number of unique values in a column. |
| Unique Percent | Percentage of unique values out of non-null entries. |
| Value Histogram | Distribution of values for low-cardinality fields. |
| Min/Max/Avg/Std Dev | Summary statistics for numeric columns. |
| Data Freshness | Time difference between read time and the latest timestamp in the data. |
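
To make these definitions concrete, here is a small self-contained Python sketch that computes several of the column-level metrics above over a list of values. It mirrors the definitions in the table, not definity's implementation:

```python
# Illustrative computation of column-level metrics (not definity's implementation).
import statistics
from collections import Counter

def column_metrics(values):
    non_null = [v for v in values if v is not None]
    metrics = {
        "null_percent": 100.0 * (len(values) - len(non_null)) / len(values),
        "distinct_count": len(set(non_null)),
        # Unique percent is measured against non-null entries, per the table above.
        "unique_percent": 100.0 * len(set(non_null)) / len(non_null) if non_null else 0.0,
        "value_histogram": dict(Counter(non_null)),  # useful for low-cardinality fields
    }
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    if numeric:
        metrics.update(
            min=min(numeric),
            max=max(numeric),
            avg=statistics.fmean(numeric),
            std_dev=statistics.pstdev(numeric),
        )
    return metrics

print(column_metrics([1, 2, 2, None, 3]))
# {'null_percent': 20.0, 'distinct_count': 3, 'unique_percent': 75.0, ...}
```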

5. Table-Level Data Quality Metrics

definity ensures data quality at the table level by tracking key metadata properties with minimal computational overhead.

Volume Metrics:

| Metric Name | Description |
| --- | --- |
| Row Count | Total number of records in a table. |
| Output Bytes | Total size of data written. |
| File Size | Total size of data read. |
| Partition Count | Number of partitions accessed. |

Schema & Freshness Metrics:

| Metric Name | Description |
| --- | --- |
| Column Count | Number of columns in a table. |
| Schema Changes | Tracks changes in table structure. |
| Table Freshness | Time elapsed since the last update. |
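
A minimal sketch of how such table-level metrics can be derived from cheap metadata follows; the snapshot structure and attribute names are illustrative assumptions:

```python
# Illustrative table-level metrics from basic metadata (names are hypothetical).
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TableSnapshot:
    rows: list            # list of dict records
    last_updated: datetime

def table_metrics(snapshot: TableSnapshot) -> dict:
    columns = set()
    for row in snapshot.rows:
        columns.update(row.keys())
    return {
        "row_count": len(snapshot.rows),
        "column_count": len(columns),
        # Table Freshness: time elapsed since the table was last updated.
        "freshness_seconds": (datetime.utcnow() - snapshot.last_updated).total_seconds(),
    }
```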

6. Automated Data Quality Tests

definity's Test Engine automatically generates data quality tests based on historical execution patterns. A test defines a valid range for a specific metric, triggering an alert when the metric deviates beyond its expected range.

How Automated Tests Work:

  1. Baseline Establishment: definity requires 4-5 successful pipeline executions to establish an expected range for each metric.
  2. Test Generation: Once a baseline is formed, definity generates tests to validate future executions.
  3. Anomaly Detection: If a metric falls outside the expected range, an alert is triggered (a minimal sketch of this logic follows the list).
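
For intuition, the baseline-and-check logic can be sketched in plain Python like this. The window size and the three-sigma band are illustrative choices, not definity's actual model:

```python
# Illustrative baseline + anomaly check for one metric (parameters are assumptions).
import statistics

MIN_RUNS = 4  # the docs cite 4-5 successful runs before a baseline forms

def expected_range(history, k=3.0):
    # Build a mean +/- k*std band from past successful executions.
    if len(history) < MIN_RUNS:
        return None  # not enough history yet; no test is generated
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    return (mean - k * std, mean + k * std)

def check(history, current):
    band = expected_range(history)
    if band is None:
        return "baseline-pending"
    low, high = band
    return "alert" if not (low <= current <= high) else "pass"

print(check([1000, 1020, 990, 1010], 400))  # alert: row count far below baseline
```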

Test Actions & Handling

| Test Action | Description |
| --- | --- |
| Alert | Raise an internal alert (Email/Slack support coming soon). |
| Pass | Ignore the test failure temporarily. |
| Break | Stop pipeline execution (future feature). |
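
A hedged sketch of how a test result might be dispatched to these actions; the dispatcher and its behavior here are hypothetical, not definity's implementation:

```python
# Illustrative dispatch of test actions (function and action names are hypothetical).
def handle_test_result(action: str, test_name: str):
    if action == "alert":
        print(f"[ALERT] {test_name} failed")      # e.g., forwarded to Email/Slack later
    elif action == "pass":
        pass                                       # temporarily ignore the failure
    elif action == "break":
        raise RuntimeError(f"{test_name} failed; stopping pipeline")  # future feature
    else:
        raise ValueError(f"Unknown test action: {action}")
```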

Conclusion

definity's comprehensive data quality observability enables organizations to maintain high data integrity and optimize costs. By leveraging AI-driven automated tests, real-time monitoring, and deep metric insights, definity ensures that data pipelines operate smoothly and efficiently.