Data Quality Overview
Every time a pipeline task runs, it reads from input tables and writes to interim and output tables. definity agents run as part of each task, extracting metrics specific to each table, to the operation being performed, and to the time the task runs. Because metrics are collected in real time during task execution, real-time reactions can be applied when a metric indicates a quality issue.
1. AI-Powered Data Quality Monitoring
Modern data pipelines are complex, and maintaining data quality manually can be challenging. definity uses AI and ML techniques to learn pipeline behavior and automatically generate data quality tests. This ensures that data remains accurate, complete, and consistent over time.
2. Real-Time Proactive Observability
Unlike traditional tools that analyze data after execution, definity runs within the pipeline with minimal footprint. This allows for real-time monitoring and proactive incident handling.
Key Real-Time Data Quality Features:
| Analysis Type | Proactive Intervention | Impact |
|---|---|---|
| Detect stale / faulty input data | Prevent pipeline execution | Saves resources & prevents downstream impact |
| Detect faulty output data | Divert output to a quarantine location | Prevents downstream contamination, allows debugging |
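The "prevent pipeline execution" intervention above can be sketched as a simple pre-execution gate. This is an illustrative example only, not definity's API: the function and parameter names (`is_fresh`, `max_staleness`) are hypothetical.

```python
from datetime import datetime, timedelta

def is_fresh(latest_input_ts: datetime, now: datetime,
             max_staleness: timedelta) -> bool:
    """Return True if the input table was updated recently enough to run.

    Hypothetical pre-execution check: a real agent would collect
    latest_input_ts from the input table's metadata at read time.
    """
    return now - latest_input_ts <= max_staleness

now = datetime(2024, 1, 2, 12, 0)
stale = datetime(2024, 1, 1, 0, 0)   # 36 hours old
fresh = datetime(2024, 1, 2, 10, 0)  # 2 hours old
limit = timedelta(hours=6)

print(is_fresh(stale, now, limit))   # stale input -> skip execution
print(is_fresh(fresh, now, limit))   # fresh input -> proceed
```

A task runner would call such a check before starting the job, skipping execution (and saving compute) when the input is stale.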
definity's Data Quality Metrics
definity tracks a comprehensive set of data quality metrics across multiple dimensions. These metrics ensure the integrity of data by detecting anomalies, inconsistencies, and inefficiencies.
3. Metric Types
definity categorizes its data quality metrics into five analytical pillars:
| Metric Type | Description |
|---|---|
| Time | Measures freshness of data. E.g., Freshness. |
| Structure | Tracks schema evolution and changes in data structure. E.g., Column Count, Schema Changes. |
| Content | Ensures valid data distributions and uniqueness. E.g., Null Percent, Value Histogram. |
| Volume | Tracks data growth and size variations. E.g., Row Count, File Size. |
| Behavior | Monitors changes in data and code modifications. |
4. Column-Level Data Quality Metrics
definity provides detailed column-level metrics to assess individual data points.
| Metric Name | Description |
|---|---|
| Null Percent | Percentage of null values in a column. |
| Distinct Count | Number of unique values in a column. |
| Unique Percent | Percentage of unique values out of non-null entries. |
| Value Histogram | Distribution of values for low-cardinality fields. |
| Min/Max/Avg/Std Dev | Summary statistics for numeric columns. |
| Data Freshness | Measures the time difference between read time and latest timestamp. |
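To make the column-level metrics above concrete, here is a minimal sketch that computes several of them over a plain Python list. This is illustrative only; it is not definity's implementation, which runs inside the pipeline engine rather than over in-memory lists.

```python
from statistics import mean, pstdev

def column_metrics(values):
    """Compute a few of the column-level metrics listed above.

    Illustrative sketch: null percent, distinct count, unique percent,
    and summary statistics for numeric columns.
    """
    non_null = [v for v in values if v is not None]
    metrics = {
        "null_percent": 100.0 * (len(values) - len(non_null)) / len(values),
        "distinct_count": len(set(non_null)),
        # Unique percent is measured over non-null entries only.
        "unique_percent": 100.0 * len(set(non_null)) / len(non_null),
    }
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    if numeric:
        metrics.update(min=min(numeric), max=max(numeric),
                       avg=mean(numeric), std_dev=pstdev(numeric))
    return metrics

m = column_metrics([10, 20, 20, None, 40])
# null_percent=20.0, distinct_count=3, unique_percent=75.0, avg=22.5
```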
5. Table-Level Data Quality Metrics
definity ensures data quality at the table level by tracking key metadata properties with minimal computational overhead.
Volume Metrics:
| Metric Name | Description |
|---|---|
| Row Count | Total number of records in a table. |
| Output Bytes | Total size of data written. |
| File Size | Total size of data read. |
| Partition Count | Number of partitions accessed. |
Schema & Freshness Metrics:
| Metric Name | Description |
|---|---|
| Column Count | Number of columns in a table. |
| Schema Changes | Tracks changes in table structure. |
| Table Freshness | Time elapsed since last update. |
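The idea behind the Schema Changes and Column Count metrics can be sketched as a diff of two schema snapshots. This is a hypothetical illustration (`schema_diff` is not a definity function), showing the kind of structural change such metrics surface.

```python
def schema_diff(old_cols, new_cols):
    """Compare two schema snapshots and report structural changes.

    Illustrative sketch: returns added/removed columns and the
    current column count, analogous to the table-level metrics above.
    """
    return {
        "added": sorted(set(new_cols) - set(old_cols)),
        "removed": sorted(set(old_cols) - set(new_cols)),
        "column_count": len(new_cols),
    }

diff = schema_diff(["id", "name", "ts"], ["id", "name", "email", "ts"])
# added=['email'], removed=[], column_count=4
```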
6. Automated Data Quality Tests
definity's Test Engine automatically generates data quality tests based on historical execution patterns. A test defines a valid range for a specific metric, triggering an alert when the metric deviates beyond its expected range.
How Automated Tests Work:
- Baseline Establishment: definity requires 4-5 successful pipeline executions to establish an expected range for each metric.
- Test Generation: Once a baseline is formed, definity generates tests to validate future executions.
- Anomaly Detection: If a metric falls outside the expected range, an alert is triggered.
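The three steps above can be sketched as follows, under the assumption that a test's expected range is the historical mean plus or minus a few standard deviations (the model definity actually learns may differ):

```python
from statistics import mean, pstdev

def expected_range(history, k=3.0, min_runs=4):
    """Establish a baseline range from prior runs, or None if too few.

    Assumed model: mean +/- k standard deviations over the metric's
    history; min_runs mirrors the 4-5 executions noted above.
    """
    if len(history) < min_runs:
        return None  # baseline not yet established
    mu, sigma = mean(history), pstdev(history)
    return (mu - k * sigma, mu + k * sigma)

def evaluate(history, value, k=3.0):
    """Return 'learning', 'pass', or 'alert' for a new metric value."""
    rng = expected_range(history, k)
    if rng is None:
        return "learning"
    low, high = rng
    return "pass" if low <= value <= high else "alert"

runs = [1000, 1020, 990, 1010, 1005]  # e.g., row counts from prior runs
print(evaluate(runs, 1008))           # within the learned range
print(evaluate(runs, 1500))           # deviation -> alert
```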
Test Actions & Handling
| Test Action | Description |
|---|---|
| Alert | Raise an internal alert (Email/Slack support coming soon). |
| Pass | Ignore test failure temporarily. |
| Break | Stop pipeline execution (future feature). |
Conclusion
definity's comprehensive data quality observability enables organizations to maintain high data integrity and optimize costs. By leveraging AI-driven automated tests, real-time monitoring, and deep metric insights, definity ensures that data pipelines operate smoothly and efficiently.