# Getting Started
Definity's Spark Agent provides comprehensive observability for your Apache Spark applications, capturing lineage, metrics, and execution details automatically.
## Quick Start
To get started with the Spark Agent:
1. **Install** - Download the agent JAR that matches your Spark version
2. **Configure** - Add the required Spark configuration parameters
3. **Integrate** - Set up with PySpark, Airflow, Databricks, or another platform (an Airflow sketch follows this list)
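As a concrete sketch of these three steps, the snippet below wires the same configuration into an Airflow DAG. It assumes the `apache-airflow-providers-apache-spark` package is installed; the DAG id, application path, and JAR location are illustrative only.

```python
# Sketch: submitting a Definity-instrumented Spark job from Airflow.
# Assumes the apache-airflow-providers-apache-spark provider is installed;
# the paths and ids below are placeholders.
import pendulum
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(dag_id="my_pipeline", start_date=pendulum.datetime(2024, 1, 1), schedule=None) as dag:
    run_job = SparkSubmitOperator(
        task_id="my_task",
        application="/jobs/my_app.py",  # hypothetical job script
        jars="definity-spark-agent-3.5_2.12-0.75.1.jar",
        conf={
            "spark.plugins": "ai.definity.spark.plugin.DefinitySparkPlugin",
            "spark.definity.server": "https://app.definity.run",
            "spark.definity.pipeline.name": "my-pipeline",
            "spark.definity.task.name": "my-task",
        },
    )
```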
## Key Features
- **Automatic Lineage Tracking** - Capture data lineage across tables, files, and queries
- **Execution Metrics** - Monitor performance, resource usage, and time-series metrics
- **Custom Metrics** - Report domain-specific KPIs and data quality measurements
- **Multi-Task Support** - Track shared Spark clusters and logical task boundaries (see the sketch after this list)
- **Skew Detection** - Identify and diagnose data skew issues
- **Platform Integration** - Native support for Databricks, EMR, and more
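For Multi-Task Support, the exact mechanism is documented in Tracking Modes. Purely as an illustration, the sketch below assumes that logical task boundaries on a shared session can be marked by updating `spark.definity.task.name` at runtime; that conf-key usage, and the table names and paths, are hypothetical.

```python
# Hypothetical sketch: two logical tasks sharing one SparkSession.
# Whether updating spark.definity.task.name at runtime marks a task
# boundary is an assumption; see Tracking Modes for the supported API.
spark.conf.set("spark.definity.task.name", "ingest-orders")
orders = spark.read.parquet("s3://my-bucket/raw/orders")  # hypothetical path
orders.write.mode("overwrite").saveAsTable("staging.orders")

spark.conf.set("spark.definity.task.name", "aggregate-orders")
(spark.table("staging.orders")
    .groupBy("customer_id").count()
    .write.mode("overwrite").saveAsTable("marts.order_counts"))
```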
## Core Configuration
At minimum, configure these parameters to enable Definity tracking:
| Parameter | Description | Example |
|---|---|---|
| `spark.jars` | URL to the Definity agent JAR | `definity-spark-agent-3.5_2.12-0.75.1.jar` |
| `spark.plugins` | Definity plugin (Spark 3.x) | `ai.definity.spark.plugin.DefinitySparkPlugin` |
| `spark.definity.server` | Definity server URL | `https://app.definity.run` |
| `spark.definity.api.token` | Authentication token for SaaS | (your token) |
| `spark.definity.pipeline.name` | Name of the pipeline | `my-pipeline` |
| `spark.definity.task.name` | Name of the task | `my-task` |
See the Configuration Reference for all available parameters.
## Example: PySpark
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("my_app")
    # Attach the Definity agent and enable its Spark plugin
    .config("spark.jars", "definity-spark-agent-3.5_2.12-0.75.1.jar")
    .config("spark.plugins", "ai.definity.spark.plugin.DefinitySparkPlugin")
    # Point the agent at the Definity server and identify this workload
    .config("spark.definity.server", "https://app.definity.run")
    .config("spark.definity.pipeline.name", "my-pipeline")
    .config("spark.definity.task.name", "my-task")
    .getOrCreate()
)
```
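With the session above in place, no further instrumentation is required: the agent observes standard Spark reads and writes automatically. A minimal job body might look like this (the input and output paths are hypothetical):

```python
# Plain Spark code; the agent captures lineage and metrics on its own.
df = spark.read.parquet("s3://my-bucket/input/events")  # hypothetical input
daily = df.groupBy("event_date").count()
daily.write.mode("overwrite").parquet("s3://my-bucket/output/daily_counts")
spark.stop()  # end the application cleanly
```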
## Learn More

### Setup & Configuration

- **Installation** - Agent JAR downloads and version compatibility
- **Configuration Reference** - Complete parameter documentation
### Features & Advanced Usage

- **Custom Metrics** - Report user-defined metrics
- **Tracking Modes** - Advanced multi-task tracking
### Platform Integrations

- **PySpark** - Standalone PySpark applications
- **Databricks** - Databricks notebooks and jobs
- **EMR** - Amazon EMR clusters
- **Dataproc** - Google Cloud Dataproc
- **Airflow** - Apache Airflow orchestration