Agent Footprint

Definity agents are designed to have negligible impact on your production data pipelines. The agents run passively within your existing Spark and DBT jobs, collecting observability data with minimal resource overhead.

Types of Instrumentation

Definity's agent uses three types of instrumentation, each with a different footprint profile:

1. Driver Metadata Collection

Metrics are collected in the Spark driver or the DBT process by hooking into the native event system - no additional computation is required:

  • Lineage - Captured from Spark's logical query plans
  • Row counts, bytes, partitions - From Spark's scan/write metrics
  • Execution times, stage and task counts - From Spark's event listeners
  • Schema information - From catalog metadata
  • DBT model results - From DBT adapter statistics

Footprint: Purely passive - metrics are a byproduct of normal execution.
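To illustrate why this collection is purely passive, here is a minimal sketch of event-driven metric aggregation. The event bus, event fields, and class names are illustrative stand-ins - Definity's actual agent hooks into Spark's native listener APIs, not this toy structure:

```python
# Sketch: a listener that aggregates metrics from events the engine
# already emits as part of normal execution. The counts exist in the
# event payload; the listener only sums them - no extra computation.
from collections import defaultdict

class MetricListener:
    """Accumulates per-task metrics delivered via task-end events."""

    def __init__(self):
        self.totals = defaultdict(int)

    def on_task_end(self, event):
        # Row/byte counts are a byproduct of the task itself.
        self.totals["rows"] += event["records_read"]
        self.totals["bytes"] += event["bytes_read"]
        self.totals["tasks"] += 1

listener = MetricListener()
for event in [{"records_read": 100, "bytes_read": 4096},
              {"records_read": 250, "bytes_read": 8192}]:
    listener.on_task_end(event)

print(dict(listener.totals))  # {'rows': 350, 'bytes': 12288, 'tasks': 2}
```

Because the listener only reacts to events the engine fires anyway, the job's execution path is unchanged.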

2. Executor Observability (Spark only)

A Java agent runs inside each executor JVM, enabling deeper observability beyond what the driver can see:

  • Skew analysis - Detecting data skew by aggregating per-task timing distributions
  • Key-value skew analysis - Identifying skewed keys by sampling data during reads/writes
  • Thread dumps - Capturing executor thread state to diagnose UDF performance issues and long idle times
  • Executor and node info - Machine-level metrics such as CPU, memory, and executor environment details

Footprint: The Java agent intercepts data and JVM state in-process with negligible overhead through sampling. No additional data scans are triggered.
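The skew-analysis idea above can be sketched in a few lines. The ratio metric and sampling rate below are illustrative choices, not Definity's actual algorithm:

```python
# Sketch: detecting skew from per-task timing distributions, and
# sampling keys during reads/writes instead of scanning everything.
import random
import statistics

def skew_ratio(task_durations_ms):
    """Max-over-median ratio: a cheap signal for straggler tasks."""
    return max(task_durations_ms) / statistics.median(task_durations_ms)

durations = [100, 110, 95, 105, 900]  # one straggler task
print(f"skew ratio: {skew_ratio(durations):.1f}")  # 8.6 -> likely skew

def sample_key_counts(keys, rate=0.1, seed=7):
    """Count only a sample of keys to spot hot keys at low cost."""
    rng = random.Random(seed)
    counts = {}
    for k in keys:
        if rng.random() < rate:
            counts[k] = counts.get(k, 0) + 1
    return counts
```

Both techniques operate on data already flowing through the executor, which is why no additional scans are triggered.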

3. Data Quality Queries (Opt-In)

When users explicitly enable data quality metrics for specific columns, Definity either adds dedicated queries or extends already-running ones, using Spark Observe to collect the requested metrics in a single pass:

  • Null percentage - Count aggregation per column
  • Distinct values - Distinct count per column
  • Value distributions - Histogram of column values

Footprint: Controlled and explicit - only runs for columns the customer has opted into, with Spark Observe minimizing the overhead by piggybacking on existing data scans where possible.
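A pure-Python stand-in shows how all three metrics can be gathered in a single pass over a column; in Spark these would be aggregation expressions attached to an existing scan rather than this loop:

```python
# Sketch: single-pass computation of opt-in quality metrics for one
# column (null percentage, distinct count, value histogram).
from collections import Counter

def column_quality(values):
    nulls, seen, hist = 0, set(), Counter()
    for v in values:          # one pass over the data
        if v is None:
            nulls += 1
        else:
            seen.add(v)
            hist[v] += 1
    n = len(values)
    return {
        "null_pct": nulls / n if n else 0.0,
        "distinct": len(seen),
        "histogram": dict(hist),
    }

print(column_quality(["a", None, "b", "a"]))
# {'null_pct': 0.25, 'distinct': 2, 'histogram': {'a': 2, 'b': 1}}
```

Collecting all requested metrics in one traversal is what keeps the overhead bounded even when several metrics are enabled for the same column.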

Measured Resource Impact

Definity continuously monitors its own agent footprint through dedicated metrics (definity_driver_overhead and definity_executors_overhead). Based on over 1.5 million production task executions:

Driver Overhead

Based on our measurements and analysis, driver overhead is approximately 0.5% for short-running tasks (up to 5 minutes), and lower as tasks get longer.

Executor Overhead

Executor overhead is measured as accumulated vcore-time across all executors. The cost impact is < 0.2% of total compute resources.
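To make the metric concrete, here is how an overhead fraction over accumulated vcore-time might be computed. The cluster sizes and timings below are illustrative numbers, not Definity's measurements:

```python
# Sketch: agent overhead as a fraction of total accumulated vcore-time.

def overhead_fraction(agent_vcore_seconds, total_vcore_seconds):
    """Fraction of total compute consumed by the agent across all executors."""
    return agent_vcore_seconds / total_vcore_seconds

# e.g. 10 executors x 4 vcores x 600 s of job time = 24,000 vcore-seconds
total = 10 * 4 * 600
# suppose the agent accumulated 40 vcore-seconds of CPU across the fleet
agent = 40
print(f"{overhead_fraction(agent, total):.2%}")  # 0.17% -> under the 0.2% bound
```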

Real Production Data

These measurements are derived from over 1.5 million task executions across production Spark workloads.

No Customer Data Collection

Definity agents do not collect actual customer data values by default. Only metadata about the data is collected:

  • ✅ Table names, column names, schemas
  • ✅ Row counts, null counts, distinct counts
  • ✅ Execution times, resource usage
  • ❌ Actual row values or column content (unless opt-in for value distribution metrics)
  • ❌ Query results or data samples

Note: Value distribution metrics (histogram of column values) are opt-in only and require explicit customer configuration for specific columns.

How Definity Minimizes Footprint

Definity is designed from the ground up to minimize impact on production workloads:

Passive Observation Architecture

  1. Leverage Spark's existing metadata - Extract metrics from Spark's query execution listeners and metrics that are already computed
  2. No additional queries by default - Metadata metrics require zero extra data scans or queries
  3. Event-driven collection - Hook into Spark's native event system rather than actively polling

Efficient Execution Strategy

  1. Asynchronous reporting - Metrics are reported to the server asynchronously, never blocking job completion
  2. Batched updates - Multiple metrics are batched together to minimize network calls
  3. Selective computation - Distribution metrics (requiring extra queries) are opt-in only
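The asynchronous, batched reporting pattern from points 1 and 2 can be sketched as follows. The class, batch size, and flush interval are hypothetical; the real agent's transport and batching parameters are internal details:

```python
# Sketch: a background thread drains a queue and ships metrics in
# batches, so reporting never blocks the job thread.
import queue
import threading

class AsyncReporter:
    def __init__(self, send, batch_size=50, flush_secs=1.0):
        self._q = queue.Queue()
        self._send = send              # network call; mocked below
        self._batch_size = batch_size
        self._flush_secs = flush_secs
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def report(self, metric):
        self._q.put(metric)            # enqueue and return immediately

    def _run(self):
        batch, running = [], True
        while running:
            try:
                item = self._q.get(timeout=self._flush_secs)
                if item is None:       # shutdown sentinel
                    running = False
                else:
                    batch.append(item)
            except queue.Empty:
                pass                   # periodic flush on timeout
            if batch and (len(batch) >= self._batch_size
                          or not running or self._q.empty()):
                self._send(batch)      # one call for many metrics
                batch = []

    def close(self):
        self._q.put(None)
        self._worker.join()

sent = []
reporter = AsyncReporter(sent.append, batch_size=2)
for m in ("rows=350", "bytes=12288", "tasks=2"):
    reporter.report(m)
reporter.close()
print(sum(len(batch) for batch in sent))  # 3 metrics delivered
```

Since `report()` only enqueues, the job thread is never blocked on the network, and batching amortizes each call's fixed cost across many metrics.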

Summary

Definity's agent is designed to be a transparent observer of your data pipelines, backed by real production data:

  • Driver overhead: ~0.5% or less of task runtime (based on production analysis)
  • Executor overhead: < 0.2% of vcore-time/compute cost (negligible resource impact)
  • No additional data scans: Metadata collection uses existing Spark execution data
  • Opt-in data quality metrics: Only run when explicitly configured per column

Bottom line: Definity provides comprehensive observability (lineage, metrics, execution details) by passively observing Spark's existing query execution - no additional data scans or queries for metadata instrumentation, and negligible compute cost for executor-side collection.