Overview

Pipelines, Tasks, Transformations & Tables

Pipeline: Is a set of Tasks. Executed by an orchestrator (e.g Apache Airflow/Oozie etc.)
Task: An actual computation job that is being executed. for example A spark Job or a DBT project. A Task is comprised from a set of Transformations (TFs)
Transformation: A single logical action such as an SQL query, a Spark Action or a DBT Model. RF

Definity agents are running as part of the Tasks execution and can observe all transformations and their Input/Output datasets. This is how Definity builds the task-dataset lineage and can also obtain column level lineage.

Point-In-Time (PIT) & Run Time

Run Time Pipeline are executed every defined time period (for example every day @ 12:00) each specific start time is the Pipeline/Task/.
Point In Time (PIT) Each such execution also has a logical time indicating the input data consumed. This is the PIT or logical date. A user can run the logical date 01/01/2020-22:00 with Run Time of 08/01/2024-14:00 and then again with Run Time of 09/01/2024-17:00

Metrics

Metrics are the basic building blocks of the monitoring in Definity. A metric is a numeric value that was measured on a specific asset (table/column or pipeline/task/transformation) in a specific execution (Run Time).

For example:

Pipeline: A / Task: B / PIT: 07-25-23 / Run Time: 07-25-26 / Table: T1
Metric type: volume count - Value: 1177500

Metric types

Metrics are split into five main analytical pillars of data observability:

Time - metrics that measure the time dimension of the pipeline's data and execution. E.g., Freshness, Duration.
Structure - metrics that measure the data structure consumed and generated by the pipeline. E.g., Schema (size, types, order).
Content - metrics that measure the data values distribution and behavior. E.g., Uniqueness, Null, Value Distribution, Binning.
Volume - metrics that measure the data volume. E.g., Total Row Count, Read Count, Write Count, Partitions, Files, Bytes.
Behavior - metrics that measure changes in the data pipeline itself, the code, and the environment setup.

Metric profiles

Metadata metrics - metrics that can be extracted for "free", with no additional query resource overhead. These metrics are calculated by the data platform (e.g., Spark, DBT, DWH).
One-pass metrics - metrics that can be extracted with low (limited) additional query resource overhead, requiring only a single pass on the data.
Advanced metrics - metrics that require more advanced data analysis and might require data shuffle or multiple scans.

Tests

A test is defined as the valid range of values for a specific metric.

If in a new execution of the pipeline the metric will have a value outside this range, the test will fail and an alert will be generated.

definity Test Engine begins to generate automated tests for a specific metric (for a specific pipeline) after a baseline for the metric was established.

I.e., tests are only created after min ~4-5 pipeline executions are completed successfully with the specific metric.
If a new metric is opted-in, ~4-5 successful pipeline executions are also required (to establish a baseline), before tests will be generated for it.

Test action definition:

Name	Action	Comments
Alert	Raise an alert	Currently alerts are internal in Definity. Email/Slack integration will be added
Pass	Ignore test failure	Used for temporary muting tests
Break	Stop pipeline execution	Currently not supported

Pipelines, Tasks, Transformations & Tables​

Point-In-Time (PIT) & Run Time

Metrics​

Metric types​

Metric profiles​

Tests​

Test action definition:​