
Overview

Data Applications Lifecycle – Three Critical Phases

The lifecycle of data applications involves three key phases:

  • Infrastructure Upgrades: Upgrading core components such as Spark, Iceberg, Python, or internal infrastructure.
  • Migrations: Transitioning from YARN to Kubernetes, on-premises to cloud, Databricks to EMR, etc.
  • Version Releases: Deploying new application versions and bug fixes.

Infrastructure upgrades and migrations require coordinated efforts between platform teams and application owners. While platform teams ensure the new environment is supported, application owners must update their builds accordingly and conduct thorough validation:

  1. Run side-by-side executions of both versions on identical input data.
  2. Monitor and compare data and performance metrics to ensure parity.
  3. If discrepancies arise, perform a Root Cause Analysis (RCA).

These validation steps make upgrades and migrations manual and resource-intensive, so organizations often postpone them, which degrades platform performance and increases costs.

Similarly, ongoing version releases and fixes can introduce unexpected performance regressions or business-impacting issues, yet they are rarely tested before deployment to surface these risks.

Definity automates these validation steps (1-3), enabling low-impact migrations and upgrades.


Definity CI Workflow

Step 1: Automating Data Application Staging

Objective: Run any pipeline consisting of multiple Spark jobs, reading production input data while redirecting all intermediate and output writes to a staging path.

Required Definity Agent Configuration:

Name                                          Description
spark.definity.output.table.suffix            Suffix appended to all output table names.
spark.definity.output.database.suffix         Suffix added to output database names.
spark.definity.output.database.baseLocation   Base location for all generated output databases.
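
As a rough illustration, the three settings above could be assembled into `spark-submit` arguments as sketched below. Only the three configuration keys come from the table; the suffix value, the base location, and the helper names `staging_conf` and `to_spark_submit_args` are hypothetical. Attaching the Definity agent itself may need further settings that are not covered here.

```python
from __future__ import annotations

# Sketch only: the keys come from the configuration table above;
# both values below are hypothetical placeholders.
STAGING_SUFFIX = "_staging"
STAGING_BASE_LOCATION = "s3://example-bucket/staging"


def staging_conf(suffix: str, base_location: str) -> dict[str, str]:
    """Build the three output-redirection settings for a staging run."""
    return {
        "spark.definity.output.table.suffix": suffix,
        "spark.definity.output.database.suffix": suffix,
        "spark.definity.output.database.baseLocation": base_location,
    }


def to_spark_submit_args(conf: dict[str, str]) -> list[str]:
    """Flatten a conf mapping into repeated `--conf key=value` arguments."""
    args: list[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    conf = staging_conf(STAGING_SUFFIX, STAGING_BASE_LOCATION)
    print(" ".join(to_spark_submit_args(conf)))
```

Keeping the settings in one helper lets the baseline run and the candidate run use the same mechanism with different suffixes, so their outputs stay separate.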

Step 2: Comparative Analysis Between Versions

Once both runs are generated using Definity's agent configuration, a comparative analysis report can be created:

[Figure: Comparison Report]

To ensure accuracy, select the same pipeline version and the same point-in-time (PIT) snapshot to compare equivalent datasets:

[Figure: Version Selection]

Adjust threshold values to define acceptable data and performance metric deviations.
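
To make the idea of threshold values concrete, here is a minimal sketch of a per-metric check based on relative deviation. It is not Definity's implementation: the metric names, the threshold values, and the functions `relative_deviation` and `find_violations` are all assumptions made for illustration.

```python
from __future__ import annotations


def relative_deviation(baseline: float, candidate: float) -> float:
    """Relative change of candidate vs. baseline, with a zero baseline handled explicitly."""
    if baseline == 0:
        return 0.0 if candidate == 0 else float("inf")
    return abs(candidate - baseline) / abs(baseline)


def find_violations(
    baseline: dict[str, float],
    candidate: dict[str, float],
    thresholds: dict[str, float],
    default_threshold: float = 0.0,
) -> dict[str, float]:
    """Return the metrics whose relative deviation exceeds their threshold."""
    violations: dict[str, float] = {}
    for name, base_value in baseline.items():
        if name not in candidate:
            # A metric missing from the candidate run is always a violation.
            violations[name] = float("inf")
            continue
        deviation = relative_deviation(base_value, candidate[name])
        if deviation > thresholds.get(name, default_threshold):
            violations[name] = deviation
    return violations


if __name__ == "__main__":
    # Hypothetical metrics from two runs on the same input data.
    baseline_run = {"row_count": 1000, "duration_sec": 120.0}
    candidate_run = {"row_count": 1000, "duration_sec": 150.0}
    # Data must match exactly; duration may drift by up to 10%.
    limits = {"row_count": 0.0, "duration_sec": 0.10}
    print(find_violations(baseline_run, candidate_run, limits))
```

In this sketch, data metrics get a zero tolerance while performance metrics get a looser one, which matches the usual expectation that output data should be identical and runtime may vary slightly.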


Step 3: Root Cause Analysis (RCA)

If discrepancies exceed defined thresholds, drill down to identify the root cause of the changes:

A. Determine whether discrepancies stem from data changes or technical execution issues.

[Figure: Data Distribution Issue example]

B. Analyze affected metrics collected by Definity to identify out-of-bound shifts.

C. Use the lineage view to pinpoint where in the pipeline the issue originated.

D. Quickly diagnose environment or code changes contributing to unexpected shifts.
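
The reasoning behind step C can be sketched in a few lines: among the datasets flagged with out-of-bound shifts, the likely origin is one whose direct inputs are all unflagged. This is an illustration of the idea only, not Definity's lineage API; the dataset names and the function `find_origins` are hypothetical.

```python
from __future__ import annotations


def find_origins(upstream: dict[str, list[str]], flagged: set[str]) -> set[str]:
    """Return the flagged datasets that have no flagged direct input."""
    return {
        node
        for node in flagged
        if not any(parent in flagged for parent in upstream.get(node, []))
    }


if __name__ == "__main__":
    # Hypothetical lineage: each dataset maps to its direct inputs.
    lineage = {
        "cleaned_events": ["raw_events"],
        "daily_agg": ["cleaned_events", "dim_users"],
        "report": ["daily_agg"],
    }
    # Datasets whose metrics moved outside the accepted thresholds.
    shifted = {"daily_agg", "report"}
    print(find_origins(lineage, shifted))
```

Here `report` is flagged only because it reads from `daily_agg`, so the investigation would start at the job that writes `daily_agg`.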


By automating these steps, Definity streamlines migrations, upgrades, and release validations, reducing manual effort while minimizing performance risks.