
Databricks

Supported Databricks Runtime versions: 12.2 - 16.4 (Scala 2.12)

Note: Databricks Serverless is not supported for this instrumentation. You may optionally use the DBT agent instead.

Compatibility matrix

Databricks Release       Spark Version   Scala Version   Definity Agent
16.4_LTS (Scala 2.12)    3.5.2           2.12            3.5_2.12-latest
15.4_LTS                 3.5.0           2.12            3.5_2.12-latest
14.3_LTS                 3.5.0           2.12            3.5_2.12-latest
13.3_LTS                 3.4.1           2.12            3.4_2.12-latest
12.2_LTS                 3.3.2           2.12            3.3_2.12-latest

Quick Setup

For standard Databricks setups, simply add an init script to your cluster. The default configuration will:

  • Track each cluster as a compute entity in Definity
  • Track each job running through the Jobs API as a pipeline in Definity
  • Track each task within a job as a task in Definity

1. Create an Init Script

Create a script that downloads the Definity Spark agent, adds it to the cluster's classpath, and sets the default Definity parameters. Save this script in cloud storage (e.g., S3).

definity_init.sh
#!/bin/bash

# Download the Definity Spark agent JAR and add it to the cluster's classpath.
JAR_DIR="/databricks/jars"
mkdir -p "$JAR_DIR"
DEFINITY_JAR_URL="https://user:[email protected]/java/definity-spark-agent-[spark.version]-[agent.version].jar"
curl -o "$JAR_DIR/definity-spark-agent.jar" "$DEFINITY_JAR_URL"
export CLASSPATH=$CLASSPATH:$JAR_DIR/definity-spark-agent.jar

# Write the default Definity Spark parameters to the driver conf
# (replace YOUR_TOKEN with your Definity API token).
cat > /databricks/driver/conf/00-definity.conf << EOF
spark.plugins=ai.definity.spark.plugin.DefinitySparkPlugin
spark.definity.server="https://app.definity.run"
spark.definity.api.token=YOUR_TOKEN
#spark.definity.env.name=YOUR_DEFAULT_ENV
EOF
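
If you prefer to script the upload instead of using the cloud console, here is a minimal sketch using boto3; the bucket and key are placeholders that match the example file path used in step 2 below.

# Upload the init script to S3 so the cluster can reference it.
# Bucket name and key are placeholders; use your own bucket.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="definity_init.sh",
    Bucket="your-s3-bucket",
    Key="init-scripts-dir/definity_init.sh",
)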

2. Attach the Init Script to Your Compute Cluster

In the Databricks UI:

  1. Go to Cluster configuration → Advanced options → Init Scripts.
  2. Add your script with:
    • Source: s3
    • File path: s3://your-s3-bucket/init-scripts-dir/definity_init.sh

3. Configure Cluster Name [Optional]

By default, the compute name in Definity is derived from the Databricks cluster name. To customize it, navigate to Cluster configuration → Advanced options → Spark and add:

spark.definity.compute.name      my_cluster_name
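
After the cluster restarts with the init script attached, you can sanity-check the configuration from a notebook attached to the cluster. This is a minimal sketch; spark is the session Databricks provides in notebooks, and the expected values come from the settings above.

# Confirm the Definity plugin and compute name were picked up from the Spark conf.
print(spark.conf.get("spark.plugins", None))                # expect ...DefinitySparkPlugin
print(spark.conf.get("spark.definity.compute.name", None))  # expect my_cluster_name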

Advanced Tracking Modes

The default Databricks integration tracks the compute cluster separately from workflows and automatically detects running workflow tasks. You may want to change this behavior in these scenarios:

Single-Task Cluster

If you have a dedicated cluster per task, disable shared cluster tracking mode and provide the Pipeline Tracking Parameters in the init script:

spark.definity.sharedCompute=false
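
For illustration only, the sketch below follows the style of the Airflow examples later on this page: a dedicated job cluster per task with shared-compute tracking disabled, assuming the same tracking keys can equivalently be passed as Spark conf on the cluster spec instead of in the init script. The new_cluster sizing values and job name are placeholders, and the init script from step 1 still installs the agent.

# Hedged sketch: one dedicated job cluster per task, with shared-compute tracking
# disabled and the pipeline tracking parameters set as Spark conf on the cluster.
# Cluster sizing values and the S3 path are placeholders.
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

run_dedicated = DatabricksSubmitRunOperator(
    task_id="run_dedicated_cluster_task",
    json={
        "new_cluster": {
            "spark_version": "15.4.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
            "init_scripts": [
                {"s3": {"destination": "s3://your-s3-bucket/init-scripts-dir/definity_init.sh"}}
            ],
            "spark_conf": {
                "spark.definity.sharedCompute": "false",
                "spark.definity.pipeline.name": "my_pipeline",
                "spark.definity.pipeline.pit": "2025-01-01 01:00:00",
                "spark.definity.task.name": "task1",
            },
        },
        "spark_python_task": {"python_file": "dbfs:/path/to/job.py"},
        "name": "dedicated-cluster-job",
    },
)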

Manual Task Tracking

To manually control task scopes programmatically, disable Databricks automatic tracking:

spark.definity.databricks.automaticSessions.enabled=false

Then follow the Multi-Task Shared Spark App guide.

Job Configuration Overrides

By default, Definity auto-detects pipeline and task names from the Jobs API. To override them, pass the Definity parameters through base_parameters (notebook tasks) or parameters (Spark Python tasks), depending on the task type.

Example: Jobs API

{
  "tasks": [
    {
      "task_key": "task1",
      "notebook_task": {
        "notebook_path": "/Workspace/Users/user@org/task_notebook_1",
        "source": "WORKSPACE",
        "base_parameters": {
          "spark.definity.pipeline.name": "my_pipeline",
          "spark.definity.pipeline.pit": "2025-01-01 01:00:00",
          "spark.definity.task.name": "task1"
        }
      },
      "existing_cluster_id": "${DATABRICKS_CLUSTER}"
    },
    {
      "task_key": "task2",
      "notebook_task": {
        "notebook_path": "/Workspace/Users/user@org/task_notebook_2",
        "source": "WORKSPACE",
        "base_parameters": {
          "spark.definity.pipeline.name": "my_pipeline",
          "spark.definity.pipeline.pit": "2025-01-01 01:00:00",
          "spark.definity.task.name": "task2"
        }
      },
      "existing_cluster_id": "${DATABRICKS_CLUSTER}"
    },
    {
      "task_key": "python_task1",
      "spark_python_task": {
        "python_file": "s3://my-bucket/python_task.py",
        "parameters": [
          "yourArg1",
          "yourArg2",
          "spark.definity.task.name=python_task_1",
          "spark.definity.pipeline.name=my_pipeline",
          "spark.definity.pipeline.pit=2025-01-01 01:00:00"
        ]
      },
      "existing_cluster_id": "${DATABRICKS_CLUSTER}"
    }
  ],
  "format": "MULTI_TASK",
  "queue": {
    "enabled": true
  }
}
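
To register this spec, post it to the Jobs API as usual. Below is a minimal sketch using the Python requests library against the Jobs 2.1 create endpoint; the environment variables, the job name, and the local file name are placeholders.

# Hedged sketch: create the job from the spec above via the Databricks Jobs API 2.1.
# DATABRICKS_HOST and DATABRICKS_TOKEN are placeholder environment variables for
# your workspace URL and access token; job_spec.json holds the document shown above.
import json
import os

import requests

with open("job_spec.json") as f:
    job_spec = json.load(f)

resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"name": "my_pipeline_job", **job_spec},
)
resp.raise_for_status()
print(resp.json())  # contains the new job_id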

Example: Airflow Notebook Job

from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

run_notebook = DatabricksSubmitRunOperator(
    task_id="run_notebook",
    json={
        "notebook_task": {
            "notebook_path": "/Users/[email protected]/my_notebook",
            "base_parameters": {
                "spark.definity.pipeline.name": "{{ dag_run.dag_id }}",
                "spark.definity.pipeline.pit": "{{ ts }}",
                "spark.definity.task.name": "{{ ti.task_id }}"
            },
        },
        "name": "notebook-job",
    }
)

Example: Airflow Python Job

run_python = DatabricksSubmitRunOperator(
    task_id="run_python_script",
    json={
        "spark_python_task": {
            "python_file": "dbfs:/path/to/job.py",
            "parameters": [
                "spark.definity.pipeline.name={{ dag_run.dag_id }}",
                "spark.definity.pipeline.pit={{ ts }}",
                "spark.definity.task.name={{ ti.task_id }}"
            ]
        },
        "name": "python-job",
    }
)