# Getting Started
Definity's Spark Agent provides comprehensive observability for your Apache Spark applications, capturing lineage, metrics, and execution details automatically.
## Quick Start
To get started with the Spark Agent:
1. **Install** - Download the agent JAR that matches your Spark version
2. **Configure** - Add the required Spark configuration parameters
3. **Integrate** - Set up with PySpark, Airflow, Databricks, or another platform (an Airflow sketch follows this list)
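As a concrete sketch of these three steps, the snippet below wires the same configuration into an Airflow DAG. It assumes the `apache-airflow-providers-apache-spark` package is installed; the DAG id, application path, and JAR location are illustrative only.

```python
# Sketch: submitting a Definity-instrumented Spark job from Airflow.
# Assumes the apache-airflow-providers-apache-spark provider is installed;
# the paths and ids below are placeholders.
import pendulum
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(dag_id="my_pipeline", start_date=pendulum.datetime(2024, 1, 1), schedule=None) as dag:
    run_job = SparkSubmitOperator(
        task_id="my_task",
        application="/jobs/my_app.py",  # hypothetical job script
        jars="definity-spark-agent-3.5_2.12-0.75.1.jar",
        conf={
            "spark.plugins": "ai.definity.spark.plugin.DefinitySparkPlugin",
            "spark.definity.server": "https://app.definity.run",
            "spark.definity.pipeline.name": "my-pipeline",
            "spark.definity.task.name": "my-task",
        },
    )
```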
## Key Features
- **Automatic Lineage Tracking** - Capture data lineage across tables, files, and queries
- **Execution Metrics** - Monitor performance, resource usage, and time-series metrics
- **Custom Metrics** - Report domain-specific KPIs and data quality measurements
- **Multi-Task Support** - Track shared Spark clusters and logical task boundaries (see the sketch after this list)
- **Skew Detection** - Identify and diagnose data skew issues
- **Platform Integration** - Native support for Databricks, EMR, and more
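For Multi-Task Support, the exact mechanism is documented in Tracking Modes. Purely as an illustration, the sketch below assumes that logical task boundaries on a shared session can be marked by updating `spark.definity.task.name` at runtime; that conf-key usage, and the table names and paths, are hypothetical.

```python
# Hypothetical sketch: two logical tasks sharing one SparkSession.
# Whether updating spark.definity.task.name at runtime marks a task
# boundary is an assumption; see Tracking Modes for the supported API.
spark.conf.set("spark.definity.task.name", "ingest-orders")
orders = spark.read.parquet("s3://my-bucket/raw/orders")  # hypothetical path
orders.write.mode("overwrite").saveAsTable("staging.orders")

spark.conf.set("spark.definity.task.name", "aggregate-orders")
(spark.table("staging.orders")
    .groupBy("customer_id").count()
    .write.mode("overwrite").saveAsTable("marts.order_counts"))
```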
## Core Configuration
At minimum, configure these parameters to enable Definity tracking:
| Parameter | Description | Example |
|---|---|---|
| `spark.jars` | URL to the Definity agent JAR | `definity-spark-agent-3.5_2.12-0.75.1.jar` |
| `spark.plugins` | Definity plugin (Spark 3.x) | `ai.definity.spark.plugin.DefinitySparkPlugin` |
| `spark.definity.server` | Definity server URL | `https://app.definity.run` |
| `spark.definity.api.token` | Authentication token for SaaS | (your token) |
| `spark.definity.pipeline.name` | Name of the pipeline | `my-pipeline` |
| `spark.definity.task.name` | Name of the task | `my-task` |
See the Configuration Reference for all available parameters.
## Example: PySpark
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("my_app")
    # Attach the Definity agent and enable its Spark plugin
    .config("spark.jars", "definity-spark-agent-3.5_2.12-0.75.1.jar")
    .config("spark.plugins", "ai.definity.spark.plugin.DefinitySparkPlugin")
    # Point the agent at the Definity server and identify this workload
    .config("spark.definity.server", "https://app.definity.run")
    .config("spark.definity.pipeline.name", "my-pipeline")
    .config("spark.definity.task.name", "my-task")
    .getOrCreate()
)
```
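With the session above in place, no further instrumentation is required: the agent observes standard Spark reads and writes automatically. A minimal job body might look like this (the input and output paths are hypothetical):

```python
# Plain Spark code; the agent captures lineage and metrics on its own.
df = spark.read.parquet("s3://my-bucket/input/events")  # hypothetical input
daily = df.groupBy("event_date").count()
daily.write.mode("overwrite").parquet("s3://my-bucket/output/daily_counts")
spark.stop()  # end the application cleanly
```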
## Learn More

### Setup & Configuration

- **Installation** - Agent JAR downloads and version compatibility
- **Configuration Reference** - Complete parameter documentation
### Features & Advanced Usage

- **Custom Metrics** - Report user-defined metrics
- **Tracking Modes** - Advanced multi-task tracking
### Platform Integrations

- **PySpark** - Standalone PySpark applications
- **Databricks** - Databricks notebooks and jobs
- **EMR** - Amazon EMR clusters
- **Dataproc** - Google Cloud Dataproc
- **Airflow** - Apache Airflow orchestration