Databricks
Supported Databricks Runtime versions: 12.2 - 16.4
❗ Note: Databricks Serverless is not supported by this instrumentation. For Serverless workloads, you can use the dbt agent instead.
Compatibility matrix
| Databricks Release | Spark Version | Scala Version | Definity Agent |
|---|---|---|---|
| 16.4_LTS (scala 2.13) | 3.5.2 | 2.13 | 3.5_2.13-latest |
| 16.4_LTS (scala 2.12) | 3.5.2 | 2.12 | 3.5_2.12-latest |
| 15.4_LTS | 3.5.0 | 2.12 | 3.5_2.12-latest |
| 14.3_LTS | 3.5.0 | 2.12 | 3.5_2.12-latest |
| 13.3_LTS | 3.4.1 | 2.12 | 3.4_2.12-latest |
| 12.2_LTS | 3.3.2 | 2.12 | 3.3_2.12-latest |
Quick Setup
For standard Databricks setups, simply add an init script to your cluster. The default configuration will:
- Track each cluster as a compute entity in Definity
- Track each job running through the Jobs API as a pipeline in Definity
- Track each task within a job as a task in Definity
1. Create an Init Script
Create an init script to automatically download and configure definity's Spark agent. The script will:
- Automatically detect your Databricks Runtime version
- Download the appropriate definity Spark agent for your Spark version
- Configure the definity plugin with default settings
databricks_definity_init.sh
#!/bin/bash
# ============================================================================
# Tested Databricks Runtimes: 12.2 LTS - 16.4 LTS (Spark 3.3 - 3.5)
# ============================================================================
# ============================================================================
# CONFIGURATION
# ============================================================================
# Optional: Set a specific agent version (e.g. "0.75.1")
# Leave empty to use the latest version
DEFINITY_AGENT_VERSION=""
# IMPORTANT: For production use, upload the agent JAR to your own
# artifact repository (Artifactory, Nexus, S3, etc.) and update this URL.
# The definity.run URL shown here is for demonstration purposes only.
# Example: "https://your-artifactory.company.com/repository/libs-release/definity-spark-agent"
ARTIFACT_BASE_URL="https://user:[email protected]/java"
# ============================================================================
# AUTO-DETECTION AND INSTALLATION
# ============================================================================
JAR_DIR="/databricks/jars"
mkdir -p "$JAR_DIR"
# Extract Spark version from /databricks/spark/VERSION
FULL_SPARK_VERSION=$(cat /databricks/spark/VERSION)
SPARK_VERSION=$(echo "$FULL_SPARK_VERSION" | grep -oE '^[0-9]+\.[0-9]+')
echo "Detected Spark version: $SPARK_VERSION"
if [ -z "$SPARK_VERSION" ]; then
echo "Spark major.minor version is empty or not found. Will not proceed to install definity agent"
exit 0
fi
# Extract Scala version from /databricks/IMAGE_KEY
DBR_VERSION=$(cat /databricks/IMAGE_KEY)
SCALA_VERSION=$(echo "$DBR_VERSION" | grep -oE 'scala([0-9]+\.[0-9]+)' | sed 's/scala//')
echo "Detected Scala version: $SCALA_VERSION"
if [ -z "$SCALA_VERSION" ]; then
echo "Scala version is empty or not found. Will not proceed to install definity agent"
exit 0
fi
# Build agent version string with Spark and Scala versions
SPARK_AGENT_VERSION="${SPARK_VERSION}_${SCALA_VERSION}"
# Build the full agent version string
if [ -z "$DEFINITY_AGENT_VERSION" ]; then
# Use latest version
FULL_AGENT_VERSION="${SPARK_AGENT_VERSION}-latest"
else
# Use specific version
FULL_AGENT_VERSION="${SPARK_AGENT_VERSION}-${DEFINITY_AGENT_VERSION}"
fi
# Download the agent
DEFINITY_JAR_URL="${ARTIFACT_BASE_URL}/definity-spark-agent-${FULL_AGENT_VERSION}.jar"
echo "Downloading Definity Spark Agent ${FULL_AGENT_VERSION} for Spark ${SPARK_VERSION} (Scala ${SCALA_VERSION})..."
curl -f -o "$JAR_DIR/definity-spark-agent.jar" "$DEFINITY_JAR_URL"
if [ $? -eq 0 ]; then
  echo "Successfully downloaded Definity Spark Agent"
else
  echo "Failed to download Definity Spark Agent from: $DEFINITY_JAR_URL"
  echo "Cluster will start without Definity agent"
  exit 0
fi
# Configure Definity plugin
cat > /databricks/driver/conf/00-definity.conf << EOF
spark.plugins=ai.definity.spark.plugin.DefinitySparkPlugin
spark.definity.server="https://app.definity.run"
spark.definity.api.token=YOUR_TOKEN
#spark.definity.env.name=YOUR_DEFAULT_ENV
EOF
echo "Definity Spark Agent configured successfully"
For production use, upload the Definity agent JAR to your own artifact repository (Artifactory, Nexus, S3, etc.) and update the ARTIFACT_BASE_URL in the script. The definity.run URL is for demonstration purposes only.
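For example, a minimal sketch of hosting the agent in your own S3 bucket (bucket name and key prefix are placeholders, and the object must be reachable over HTTPS from the cluster, e.g. via a bucket policy or presigned URL; the JAR file name must match the definity-spark-agent-<spark>_<scala>-<version>.jar pattern the script builds):
# Upload the agent JAR to a bucket you control (placeholder names)
aws s3 cp definity-spark-agent-3.5_2.12-latest.jar \
  s3://your-artifacts-bucket/definity/definity-spark-agent-3.5_2.12-latest.jar
# Then point the init script at that location:
# ARTIFACT_BASE_URL="https://your-artifacts-bucket.s3.amazonaws.com/definity"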
2. Attach the Init Script to Your Compute Cluster
In the Databricks UI:
- Go to Cluster configuration → Advanced options → Init Scripts.
- Add your script with:
  - Source: s3
  - File path: s3://your-s3-bucket/init-scripts-dir/databricks_definity_init.sh
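If you manage clusters through the Clusters API or infrastructure-as-code rather than the UI, the same attachment can be expressed as an init_scripts entry in the cluster spec. A sketch, with placeholder bucket path and region:
{
  "init_scripts": [
    {
      "s3": {
        "destination": "s3://your-s3-bucket/init-scripts-dir/databricks_definity_init.sh",
        "region": "us-east-1"
      }
    }
  ]
}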
3. Configure Cluster Name [Optional]
By default, the compute name in Definity is derived from the Databricks cluster name. To customize it, navigate to Cluster configuration → Advanced options → Spark and add:
spark.definity.compute.name my_cluster_name
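Alternatively, the same property can be set by the init script itself, by appending it to the conf file the script writes (my_cluster_name is a placeholder):
echo "spark.definity.compute.name=my_cluster_name" >> /databricks/driver/conf/00-definity.conf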
Advanced Tracking Modes
The default Databricks integration tracks the compute cluster separately from workflows and automatically detects running workflow tasks. You may want to change this behavior in these scenarios:
Single-Task Cluster
If you have a dedicated cluster per task, disable shared cluster tracking mode and provide the Pipeline Tracking Parameters in the init script:
spark.definity.sharedCompute=false
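For example, the conf file written by the init script could then carry the tracking properties directly. A sketch using placeholder values and the same properties shown in the job examples below:
spark.plugins=ai.definity.spark.plugin.DefinitySparkPlugin
spark.definity.server="https://app.definity.run"
spark.definity.api.token=YOUR_TOKEN
spark.definity.sharedCompute=false
spark.definity.pipeline.name=my_pipeline
spark.definity.task.name=my_task
#spark.definity.pipeline.pit=2025-01-01 01:00:00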
Manual Task Tracking
To manually control task scopes programmatically, disable Databricks automatic tracking:
spark.definity.databricks.automaticSessions.enabled=false
Then follow the Multi-Task Shared Spark App guide.
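As a rough illustration only (the exact property names and call sequence are defined in that guide; spark.definity.session.name below is a hypothetical placeholder), manual scoping amounts to marking the start and end of each task on the shared SparkSession:
# Hypothetical sketch -- consult the Multi-Task Shared Spark App guide
# for the real property names and lifecycle.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Begin a manually tracked task scope (placeholder property name)
spark.conf.set("spark.definity.session.name", "my_pipeline.task1")
# ... run this task's Spark logic ...
# End the scope so the next task can open its own
spark.conf.unset("spark.definity.session.name")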
Job Configuration Overrides
By default, Definity auto-detects pipeline and task names from the Jobs API. To override them, use the base_parameters or parameters fields depending on the task type.
Example: Jobs API
{
  "tasks": [
    {
      "task_key": "task1",
      "notebook_task": {
        "notebook_path": "/Workspace/Users/user@org/task_notebook_1",
        "source": "WORKSPACE",
        "base_parameters": {
          "spark.definity.pipeline.name": "my_pipeline",
          "spark.definity.pipeline.pit": "2025-01-01 01:00:00",
          "spark.definity.task.name": "task1"
        }
      },
      "existing_cluster_id": "${DATABRICKS_CLUSTER}"
    },
    {
      "task_key": "task2",
      "notebook_task": {
        "notebook_path": "/Workspace/Users/user@org/task_notebook_2",
        "source": "WORKSPACE",
        "base_parameters": {
          "spark.definity.pipeline.name": "my_pipeline",
          "spark.definity.pipeline.pit": "2025-01-01 01:00:00",
          "spark.definity.task.name": "task2"
        }
      },
      "existing_cluster_id": "${DATABRICKS_CLUSTER}"
    },
    {
      "task_key": "python_task1",
      "spark_python_task": {
        "python_file": "s3://my-bucket/python_task.py",
        "parameters": [
          "yourArg1",
          "yourArg2",
          "spark.definity.task.name=python_task_1",
          "spark.definity.pipeline.name=my_pipeline",
          "spark.definity.pipeline.pit=2025-01-01 01:00:00"
        ]
      },
      "existing_cluster_id": "${DATABRICKS_CLUSTER}"
    }
  ],
  "format": "MULTI_TASK",
  "queue": {
    "enabled": true
  }
}
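To submit this spec with the Databricks CLI (newer CLI versions accept a JSON file via the --json flag; adjust to your CLI version):
databricks jobs create --json @job.json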
Example: Airflow Notebook Job
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

run_notebook = DatabricksSubmitRunOperator(
    task_id="run_notebook",
    json={
        "notebook_task": {
            "notebook_path": "/Users/user@org/my_notebook",
            "base_parameters": {
                "spark.definity.pipeline.name": "{{ dag_run.dag_id }}",
                "spark.definity.pipeline.pit": "{{ ts }}",
                "spark.definity.task.name": "{{ ti.task_id }}"
            },
        },
        "name": "notebook-job",
    }
)
Example: Airflow Python Job
run_python = DatabricksSubmitRunOperator(
    task_id="run_python_script",
    json={
        "spark_python_task": {
            "python_file": "dbfs:/path/to/job.py",
            "parameters": [
                "spark.definity.pipeline.name={{ dag_run.dag_id }}",
                "spark.definity.pipeline.pit={{ ts }}",
                "spark.definity.task.name={{ ti.task_id }}"
            ]
        },
        "name": "python-job",
    }
)