Databricks
Supported Databricks Runtime versions: 12.2 - 16.4 (Scala 2.12)
❗ Note: Databricks Serverless is not supported by this instrumentation. For serverless workloads, you can use the dbt agent instead.
Compatibility matrix
| Databricks Release | Spark Version | Scala Version | Definity Agent |
|---|---|---|---|
| 16.4_LTS (Scala 2.12) | 3.5.2 | 2.12 | 3.5_2.12-latest |
| 15.4_LTS | 3.5.0 | 2.12 | 3.5_2.12-latest |
| 14.3_LTS | 3.5.0 | 2.12 | 3.5_2.12-latest |
| 13.3_LTS | 3.4.1 | 2.12 | 3.4_2.12-latest |
| 12.2_LTS | 3.3.2 | 2.12 | 3.3_2.12-latest |
Quick Setup
For standard Databricks setups, simply add an init script to your cluster. The default configuration will:
- Track each cluster as a compute entity in Definity
- Track each job running through the Jobs API as a pipeline in Definity
- Track each task within a job as a task in Definity
1. Create an Init Script
Create a script that downloads the Definity Spark agent, adds it to the cluster's classpath, and sets the default Definity parameters. Save this script in cloud storage (e.g., S3).
#!/bin/bash

# Download the Definity Spark agent into the Databricks jar directory.
JAR_DIR="/databricks/jars"
mkdir -p $JAR_DIR

# Replace user, token, and the repository host with the credentials and download URL
# provided by Definity, and [spark.version]/[agent.version] with the values from the
# compatibility matrix above.
DEFINITY_JAR_URL="https://user:token@<definity-repo-host>/java/definity-spark-agent-[spark.version]-[agent.version].jar"
curl -o $JAR_DIR/definity-spark-agent.jar $DEFINITY_JAR_URL
export CLASSPATH=$CLASSPATH:$JAR_DIR/definity-spark-agent.jar

# Write the default Definity Spark configuration for the driver.
cat > /databricks/driver/conf/00-definity.conf << EOF
spark.plugins=ai.definity.spark.plugin.DefinitySparkPlugin
spark.definity.server="https://app.definity.run"
spark.definity.api.token=YOUR_TOKEN
#spark.definity.env.name=YOUR_DEFAULT_ENV
EOF
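With the script saved locally (here as definity_init.sh), upload it to the bucket you will reference in the cluster configuration. For example, with the AWS CLI (bucket and path are placeholders matching the next step):
aws s3 cp definity_init.sh s3://your-s3-bucket/init-scripts-dir/definity_init.sh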
2. Attach the Init Script to Your Compute Cluster
In the Databricks UI:
- Go to Cluster configuration → Advanced options → Init Scripts.
- Add your script with:
  - Source: S3
  - File path: s3://your-s3-bucket/init-scripts-dir/definity_init.sh
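If you provision clusters through the Clusters API instead of the UI, the init script can be attached with the init_scripts field. A minimal sketch (workspace host, token, node type, and region are placeholder assumptions; include whatever other settings your cluster needs):
curl -X POST "https://<your-workspace>.cloud.databricks.com/api/2.0/clusters/create" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "cluster_name": "definity-tracked-cluster",
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
        "init_scripts": [
          { "s3": { "destination": "s3://your-s3-bucket/init-scripts-dir/definity_init.sh", "region": "us-east-1" } }
        ]
      }'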
3. Configure Cluster Name [Optional]
By default, the compute name in Definity is derived from the Databricks cluster name. To customize it, navigate to Cluster configuration → Advanced options → Spark and add:
spark.definity.compute.name my_cluster_name
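Alternatively, you can set it from the init script by appending the property to the conf file created in step 1 (a sketch; my_cluster_name is a placeholder):
# Append the compute name to the Definity driver conf written by the init script.
cat >> /databricks/driver/conf/00-definity.conf << EOF
spark.definity.compute.name=my_cluster_name
EOF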
Advanced Tracking Modes
The default Databricks integration tracks the compute cluster separately from workflows and automatically detects running workflow tasks. You may want to change this behavior in these scenarios:
Single-Task Cluster
If you have a dedicated cluster per task, disable shared cluster tracking mode and provide the Pipeline Tracking Parameters in the init script:
spark.definity.sharedCompute=false
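For example, the conf file written by the init script might then look like this (a sketch; the pipeline and task names are placeholders, reusing the tracking parameters shown in the job examples below):
# Dedicated cluster per task: disable shared-compute tracking and set the
# pipeline/task defaults directly in the init script.
cat > /databricks/driver/conf/00-definity.conf << EOF
spark.plugins=ai.definity.spark.plugin.DefinitySparkPlugin
spark.definity.server="https://app.definity.run"
spark.definity.api.token=YOUR_TOKEN
spark.definity.sharedCompute=false
spark.definity.pipeline.name=my_pipeline
spark.definity.task.name=my_task
EOF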
Manual Task Tracking
To manually control task scopes programmatically, disable Databricks automatic tracking:
spark.definity.databricks.automaticSessions.enabled=false
Then follow the Multi-Task Shared Spark App guide.
Job Configuration Overrides
By default, Definity auto-detects pipeline and task names from the Jobs API. To override them, use the base_parameters field (notebook tasks) or the parameters field (Python tasks), as in the examples below.
Example: Jobs API
{
  "tasks": [
    {
      "task_key": "task1",
      "notebook_task": {
        "notebook_path": "/Workspace/Users/user@org/task_notebook_1",
        "source": "WORKSPACE",
        "base_parameters": {
          "spark.definity.pipeline.name": "my_pipeline",
          "spark.definity.pipeline.pit": "2025-01-01 01:00:00",
          "spark.definity.task.name": "task1"
        }
      },
      "existing_cluster_id": "${DATABRICKS_CLUSTER}"
    },
    {
      "task_key": "task2",
      "notebook_task": {
        "notebook_path": "/Workspace/Users/user@org/task_notebook_2",
        "source": "WORKSPACE",
        "base_parameters": {
          "spark.definity.pipeline.name": "my_pipeline",
          "spark.definity.pipeline.pit": "2025-01-01 01:00:00",
          "spark.definity.task.name": "task2"
        }
      },
      "existing_cluster_id": "${DATABRICKS_CLUSTER}"
    },
    {
      "task_key": "python_task1",
      "spark_python_task": {
        "python_file": "s3://my-bucket/python_task.py",
        "parameters": [
          "yourArg1",
          "yourArg2",
          "spark.definity.task.name=python_task_1",
          "spark.definity.pipeline.name=my_pipeline",
          "spark.definity.pipeline.pit=2025-01-01 01:00:00"
        ]
      },
      "existing_cluster_id": "${DATABRICKS_CLUSTER}"
    }
  ],
  "format": "MULTI_TASK",
  "queue": {
    "enabled": true
  }
}
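Saved as job.json (with the cluster ID placeholder filled in), a spec like this can be registered through the Jobs API, for example (workspace host and token are placeholders):
curl -X POST "https://<your-workspace>.cloud.databricks.com/api/2.1/jobs/create" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d @job.json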
Example: Airflow Notebook Job
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

run_notebook = DatabricksSubmitRunOperator(
    task_id="run_notebook",
    json={
        "notebook_task": {
            "notebook_path": "/Users/user@org/my_notebook",
            "base_parameters": {
                "spark.definity.pipeline.name": "{{ dag_run.dag_id }}",
                "spark.definity.pipeline.pit": "{{ ts }}",
                "spark.definity.task.name": "{{ ti.task_id }}"
            },
        },
        "name": "notebook-job",
    }
)
Example: Airflow Python Job
run_python = DatabricksSubmitRunOperator(
    task_id="run_python_script",
    json={
        "spark_python_task": {
            "python_file": "dbfs:/path/to/job.py",
            "parameters": [
                "spark.definity.pipeline.name={{ dag_run.dag_id }}",
                "spark.definity.pipeline.pit={{ ts }}",
                "spark.definity.task.name={{ ti.task_id }}"
            ]
        },
        "name": "python-job",
    }
)