
Databricks

Supported Databricks Runtime versions: 12.2 - 16.4 (Scala 2.12)

Note: Databricks Serverless is not supported for this instrumentation. You may optionally use the DBT agent instead.

Compatibility matrix

Databricks Release       Spark Version   Scala Version   Definity Agent
16.4_LTS (Scala 2.12)    3.5.2           2.12            3.5_2.12-latest
15.4_LTS                 3.5.0           2.12            3.5_2.12-latest
14.3_LTS                 3.5.0           2.12            3.5_2.12-latest
13.3_LTS                 3.4.1           2.12            3.4_2.12-latest
12.2_LTS                 3.3.2           2.12            3.3_2.12-latest

Quick Setup

For standard Databricks setups, simply add an init script to your cluster. The default configuration will:

  • Track each cluster as a compute entity in Definity
  • Track each job running through the Jobs API as a pipeline in Definity
  • Track each task within a job as a task in Definity

1. Create an Init Script

Create a script that downloads the Definity Spark agent, adds it to the cluster's classpath, and sets the default Definity parameters. Save this script in cloud storage (e.g., S3).

definity_init.sh
#!/bin/bash

# Download the Definity Spark agent JAR and add it to the cluster's classpath.
JAR_DIR="/databricks/jars"
mkdir -p "$JAR_DIR"
DEFINITY_JAR_URL="https://user:[email protected]/java/definity-spark-agent-[spark.version]-[agent.version].jar"
curl -o "$JAR_DIR/definity-spark-agent.jar" "$DEFINITY_JAR_URL"
export CLASSPATH=$CLASSPATH:$JAR_DIR/definity-spark-agent.jar

# Write the default Definity Spark parameters to the driver conf
# (replace YOUR_TOKEN with your Definity API token).
cat > /databricks/driver/conf/00-definity.conf << EOF
spark.plugins=ai.definity.spark.plugin.DefinitySparkPlugin
spark.definity.server="https://app.definity.run"
spark.definity.api.token=YOUR_TOKEN
#spark.definity.env.name=YOUR_DEFAULT_ENV
EOF
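
If you prefer to script the upload instead of using the cloud console, here is a minimal sketch using boto3; the bucket and key are placeholders that match the example file path used in step 2 below.

# Upload the init script to S3 so the cluster can reference it.
# Bucket name and key are placeholders; use your own bucket.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="definity_init.sh",
    Bucket="your-s3-bucket",
    Key="init-scripts-dir/definity_init.sh",
)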

2. Attach the Init Script to Your Compute Cluster

In the Databricks UI:

  1. Go to Cluster configuration → Advanced options → Init Scripts.
  2. Add your script with:
    • Source: s3
    • File path: s3://your-s3-bucket/init-scripts-dir/definity_init.sh

3. Configure Cluster Name [Optional]

By default, the compute name in Definity is derived from the Databricks cluster name. To customize it, navigate to Cluster configuration → Advanced options → Spark and add:

spark.definity.compute.name      my_cluster_name
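
After the cluster restarts with the init script attached, you can sanity-check the configuration from a notebook attached to the cluster. This is a minimal sketch; spark is the session Databricks provides in notebooks, and the expected values come from the settings above.

# Confirm the Definity plugin and compute name were picked up from the Spark conf.
print(spark.conf.get("spark.plugins", None))                # expect ...DefinitySparkPlugin
print(spark.conf.get("spark.definity.compute.name", None))  # expect my_cluster_name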

Advanced Tracking Modes

The default Databricks integration tracks the compute cluster separately from workflows and automatically detects running workflow tasks. You may want to change this behavior in these scenarios:

Single-Task Cluster

If you have a dedicated cluster per task, disable shared cluster tracking mode and provide the Pipeline Tracking Parameters in the init script:

spark.definity.sharedCompute=false
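
For illustration only, the sketch below follows the style of the Airflow examples later on this page: a dedicated job cluster per task with shared-compute tracking disabled, assuming the same tracking keys can equivalently be passed as Spark conf on the cluster spec instead of in the init script. The new_cluster sizing values and job name are placeholders, and the init script from step 1 still installs the agent.

# Hedged sketch: one dedicated job cluster per task, with shared-compute tracking
# disabled and the pipeline tracking parameters set as Spark conf on the cluster.
# Cluster sizing values and the S3 path are placeholders.
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

run_dedicated = DatabricksSubmitRunOperator(
    task_id="run_dedicated_cluster_task",
    json={
        "new_cluster": {
            "spark_version": "15.4.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
            "init_scripts": [
                {"s3": {"destination": "s3://your-s3-bucket/init-scripts-dir/definity_init.sh"}}
            ],
            "spark_conf": {
                "spark.definity.sharedCompute": "false",
                "spark.definity.pipeline.name": "my_pipeline",
                "spark.definity.pipeline.pit": "2025-01-01 01:00:00",
                "spark.definity.task.name": "task1",
            },
        },
        "spark_python_task": {"python_file": "dbfs:/path/to/job.py"},
        "name": "dedicated-cluster-job",
    },
)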

Manual Task Tracking

To manually control task scopes programmatically, disable Databricks automatic tracking:

spark.definity.databricks.automaticSessions.enabled=false

Then follow the Multi-Task Shared Spark App guide.

Job Configuration Overrides

By default, Definity auto-detects pipeline and task names from the Jobs API. To override them, pass the Definity parameters through base_parameters (notebook tasks) or parameters (Spark Python tasks), depending on the task type.

Example: Jobs API

{
  "tasks": [
    {
      "task_key": "task1",
      "notebook_task": {
        "notebook_path": "/Workspace/Users/user@org/task_notebook_1",
        "source": "WORKSPACE",
        "base_parameters": {
          "spark.definity.pipeline.name": "my_pipeline",
          "spark.definity.pipeline.pit": "2025-01-01 01:00:00",
          "spark.definity.task.name": "task1"
        }
      },
      "existing_cluster_id": "${DATABRICKS_CLUSTER}"
    },
    {
      "task_key": "task2",
      "notebook_task": {
        "notebook_path": "/Workspace/Users/user@org/task_notebook_2",
        "source": "WORKSPACE",
        "base_parameters": {
          "spark.definity.pipeline.name": "my_pipeline",
          "spark.definity.pipeline.pit": "2025-01-01 01:00:00",
          "spark.definity.task.name": "task2"
        }
      },
      "existing_cluster_id": "${DATABRICKS_CLUSTER}"
    },
    {
      "task_key": "python_task1",
      "spark_python_task": {
        "python_file": "s3://my-bucket/python_task.py",
        "parameters": [
          "yourArg1",
          "yourArg2",
          "spark.definity.task.name=python_task_1",
          "spark.definity.pipeline.name=my_pipeline",
          "spark.definity.pipeline.pit=2025-01-01 01:00:00"
        ]
      },
      "existing_cluster_id": "${DATABRICKS_CLUSTER}"
    }
  ],
  "format": "MULTI_TASK",
  "queue": {
    "enabled": true
  }
}
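
To register this spec, post it to the Jobs API as usual. Below is a minimal sketch using the Python requests library against the Jobs 2.1 create endpoint; the environment variables, the job name, and the local file name are placeholders.

# Hedged sketch: create the job from the spec above via the Databricks Jobs API 2.1.
# DATABRICKS_HOST and DATABRICKS_TOKEN are placeholder environment variables for
# your workspace URL and access token; job_spec.json holds the document shown above.
import json
import os

import requests

with open("job_spec.json") as f:
    job_spec = json.load(f)

resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"name": "my_pipeline_job", **job_spec},
)
resp.raise_for_status()
print(resp.json())  # contains the new job_id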

Example: Airflow Notebook Job

from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

run_notebook = DatabricksSubmitRunOperator(
    task_id="run_notebook",
    json={
        "notebook_task": {
            "notebook_path": "/Users/[email protected]/my_notebook",
            "base_parameters": {
                "spark.definity.pipeline.name": "{{ dag_run.dag_id }}",
                "spark.definity.pipeline.pit": "{{ ts }}",
                "spark.definity.task.name": "{{ ti.task_id }}"
            },
        },
        "name": "notebook-job",
    }
)

Example: Airflow Python Job

run_python = DatabricksSubmitRunOperator(
    task_id="run_python_script",
    json={
        "spark_python_task": {
            "python_file": "dbfs:/path/to/job.py",
            "parameters": [
                "spark.definity.pipeline.name={{ dag_run.dag_id }}",
                "spark.definity.pipeline.pit={{ ts }}",
                "spark.definity.task.name={{ ti.task_id }}"
            ]
        },
        "name": "python-job",
    }
)