Getting Started

Definity's Spark Agent provides comprehensive observability for your Apache Spark applications, capturing lineage, metrics, and execution details automatically.

Quick Start

To get started with the Spark Agent:

  1. Install - Download the appropriate agent JAR for your Spark version
  2. Configure - Add required Spark configuration parameters
  3. Integrate - Set up with PySpark, Airflow, Databricks, or other platforms

Key Features

  • Automatic Lineage Tracking - Capture data lineage across tables, files, and queries
  • Execution Metrics - Monitor performance, resource usage, and time-series metrics
  • Custom Metrics - Report domain-specific KPIs and data quality measurements
  • Multi-Task Support - Track shared Spark clusters and logical task boundaries (see the sketch after this list)
  • Skew Detection - Identify and diagnose data skew issues
  • Platform Integration - Native support for Databricks, EMR, and more
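When several logical tasks share one Spark session, the agent needs to know where each task's boundary lies. The snippet below is a minimal sketch of that idea: it assumes the spark.definity.task.name property (from the Core Configuration table below) can be updated at runtime to open a new task boundary, which is an assumption here, not the documented API; see Features & Advanced Usage for the supported mechanism. The input and output paths are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared_session_app").getOrCreate()

# Assumption: re-setting the task name at runtime tells the agent a new
# logical task has started; check the multi-task docs for the supported API.
spark.conf.set("spark.definity.task.name", "ingest-orders")
orders = spark.read.parquet("s3://my-bucket/raw/orders/")  # placeholder path

spark.conf.set("spark.definity.task.name", "aggregate-orders")
(orders.groupBy("customer_id").count()
       .write.mode("overwrite")
       .parquet("s3://my-bucket/agg/orders_by_customer/"))  # placeholder path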

Core Configuration

At minimum, configure these parameters to enable Definity tracking:

Parameter                       Description                             Example
spark.jars                      Path or URL to the Definity agent JAR   definity-spark-agent-3.5_2.12-0.75.1.jar
spark.plugins                   Definity plugin (Spark 3.x)             ai.definity.spark.plugin.DefinitySparkPlugin
spark.definity.server           Definity server URL                     https://app.definity.run
spark.definity.api.token        Authentication token for SaaS           (your token)
spark.definity.pipeline.name    Name of the pipeline                    my-pipeline
spark.definity.task.name        Name of the task                        my-task

See the Configuration Reference for all available parameters.

Example: PySpark

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("my_app")
    # Load the Definity agent JAR and register the plugin
    .config("spark.jars", "definity-spark-agent-3.5_2.12-0.75.1.jar")
    .config("spark.plugins", "ai.definity.spark.plugin.DefinitySparkPlugin")
    # Where to report, and how this run is identified
    .config("spark.definity.server", "https://app.definity.run")
    .config("spark.definity.pipeline.name", "my-pipeline")
    .config("spark.definity.task.name", "my-task")
    .getOrCreate()
)

Learn More

Setup & Configuration

Features & Advanced Usage

Platform Integrations

  • PySpark - Standalone PySpark applications
  • Databricks - Databricks notebooks and jobs
  • EMR - Amazon EMR clusters
  • Dataproc - Google Cloud Dataproc
  • Airflow - Apache Airflow orchestration (see the sketch below)
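With an orchestrator, the same parameters travel in the submit configuration rather than in application code. Below is a minimal sketch using Airflow's SparkSubmitOperator; the DAG id, application path, and schedule are placeholders, and the conf keys mirror the Core Configuration table above.

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(dag_id="my-pipeline", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    run_task = SparkSubmitOperator(
        task_id="my-task",
        application="/path/to/job.py",  # placeholder application
        # Ship the agent JAR and pass the same parameters as the table above
        jars="definity-spark-agent-3.5_2.12-0.75.1.jar",
        conf={
            "spark.plugins": "ai.definity.spark.plugin.DefinitySparkPlugin",
            "spark.definity.server": "https://app.definity.run",
            "spark.definity.pipeline.name": "my-pipeline",
            "spark.definity.task.name": "my-task",
        },
    )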