Skip to main content

Databricks

Supported Databricks Runtime versions: 12.217.3

Note: Databricks Serverless is not supported for this instrumentation. You may optionally use the DBT agent instead.

Setup

TL;DR

Five steps to get Definity running on a cluster. The init script will:

  • Automatically detect your Spark and Scala versions
  • Download the appropriate Definity Spark agent
  • Configure the Definity plugin with default settings
  • If configuration fails, the cluster will continue to start normally

1. Generate an agent token

In the Definity app, click your user avatar (top right) and select Generate Token.

User menu — Generate Token

Select the Agent token type, optionally add a label, and click Generate. Copy the token.

Generate Agent Token dialog

2. Store the token as a Databricks secret

databricks secrets create-scope definity.ai
databricks secrets put-secret --json '{"scope": "definity.ai", "key": "agent.token", "string_value": "<YOUR_AGENT_TOKEN>"}'
databricks secrets put-acl definity.ai users READ

Definity reads the token from scope definity.ai, key agent.token by default. To use a different scope or key, update the commands above and set spark.definity.databricks.tokenSecret.scope / spark.definity.databricks.tokenSecret.key in the init script (see step 3).

Quick Evaluation

For a quick test without setting up a secret, set DEFINITY_AGENT_TOKEN=<YOUR_TOKEN> directly in the cluster environment variables under Cluster configurationAdvanced optionsSparkEnvironment Variables, then skip to step 3.

3. Save the init script

Save the script below as databricks_definity_init.sh. Upload it to a Volume, cloud storage, or workspace files.

databricks_definity_init.sh
#!/bin/bash

# ============================================================================
# Definity Agent Configuration for Databricks
# Tested Databricks Runtimes: 12.2 LTS - 17.3 LTS (Spark 3.3 - 4.0)
# ============================================================================
# This script automatically detects your Spark and Scala versions and
# installs the appropriate Definity Spark Agent.
#
# The agent token is read automatically from the Databricks workspace secret.
# Default scope: definity.ai, key: agent.token. Override via spark.definity.databricks.tokenSecret.scope/key.
#
# If installation fails, the cluster will start normally without the agent.
# ============================================================================

# ============================================================================
# CONFIGURATION
# ============================================================================

# Base path to the agent JARs.
# The script auto-detects Spark/Scala and appends the JAR filename, e.g.:
# {base_path}/definity-spark-agent-3.5_2.12-0.80.2.jar
#
# For production use, upload the agent JAR to your own artifact repository
# and update this path. See "Hosting the agent JAR yourself" below.
# Volumes : "/Volumes/<catalog>/<schema>/<volume>"
# HTTP/HTTPS : "https://your-artifactory.company.com/repository/libs-release"
# S3 : "s3://your-bucket/definity"
ARTIFACT_BASE_PATH="https://user:[email protected]/java"

# Version of the Definity agent (e.g. "0.80.2")
DEFINITY_AGENT_VERSION="latest"

# Agent token — optional. Definity reads it automatically from the workspace secret
# (scope: definity.ai, key: agent.token). Uncomment to override on this cluster only.
# DEFINITY_AGENT_TOKEN="<YOUR_AGENT_TOKEN>"

# ============================================================================
# AUTO-DETECTION AND INSTALLATION
# ============================================================================

JAR_DIR="/databricks/jars"
mkdir -p "$JAR_DIR"

FULL_SPARK_VERSION=$(cat /databricks/spark/VERSION)
SPARK_VERSION=$(echo "$FULL_SPARK_VERSION" | grep -oE '^[0-9]+\.[0-9]+')
echo "Detected Spark version: $SPARK_VERSION"

if [ -z "$SPARK_VERSION" ]; then
echo "Spark major.minor version is empty or not found. Will not proceed to install definity agent"
exit 0
fi

DBR_VERSION=$(cat /databricks/IMAGE_KEY)
SCALA_VERSION=$(echo "$DBR_VERSION" | grep -oE 'scala([0-9]+\.[0-9]+)' | sed 's/scala//')
echo "Detected Scala version: $SCALA_VERSION"

if [ -z "$SCALA_VERSION" ]; then
echo "Scala version is empty or not found. Will not proceed to install definity agent"
exit 0
fi

SPARK_AGENT_VERSION="${SPARK_VERSION}_${SCALA_VERSION}"
FULL_AGENT_VERSION="${SPARK_AGENT_VERSION}-${DEFINITY_AGENT_VERSION}"

AGENT_JAR_NAME="definity-spark-agent-${FULL_AGENT_VERSION}.jar"
AGENT_JAR_SRC="${ARTIFACT_BASE_PATH}/${AGENT_JAR_NAME}"
echo "Fetching Definity Spark Agent ${FULL_AGENT_VERSION} from ${AGENT_JAR_SRC} ..."

if [[ "$ARTIFACT_BASE_PATH" == s3://* ]]; then
aws s3 cp "$AGENT_JAR_SRC" "$JAR_DIR/definity-spark-agent.jar"
elif [[ "$ARTIFACT_BASE_PATH" == /Volumes/* ]] || [[ "$ARTIFACT_BASE_PATH" == /dbfs/* ]]; then
cp "$AGENT_JAR_SRC" "$JAR_DIR/definity-spark-agent.jar"
else
curl -f -o "$JAR_DIR/definity-spark-agent.jar" "$AGENT_JAR_SRC"
fi

if [ $? -ne 0 ]; then
echo "Failed to fetch Definity Spark Agent from: $AGENT_JAR_SRC"
echo "Cluster will start without Definity agent"
exit 0
fi

echo "Successfully installed Definity Spark Agent"

cat > /databricks/driver/conf/00-definity.conf << EOF
spark.plugins=ai.definity.spark.plugin.DefinitySparkPlugin
spark.definity.server="https://app.definity.run"
# spark.definity.databricks.tokenSecret.scope="definity.ai"
# spark.definity.databricks.tokenSecret.key="agent.token"
EOF

echo "Definity Spark Agent configured successfully"

4. Attach the init script

Cluster configurationAdvanced optionsInit Scripts → add the path where you uploaded the script.

5. Start the cluster

Run any query. The cluster should appear in the Definity UI within ~30 seconds.

Configure cluster name [Optional]

By default, the cluster name is derived from the Databricks cluster name. To customize it, navigate to Cluster configurationAdvanced optionsSpark and add:

spark.definity.compute.name      my_cluster_name

Workspace-Wide Deployment

Cluster Policy

A cluster policy automates step 4 — instead of manually attaching the init script on every new cluster, the policy enforces the init script path automatically. Combined with the workspace secret from step 2, any user who selects the policy from the Policy dropdown gets a fully instrumented cluster with no extra configuration.

In ComputePolicies, create a policy with these entries:

{
"init_scripts.0.volumes.destination": {
"type": "fixed",
"value": "/Volumes/<catalog>/<schema>/<volume>/databricks_definity_init.sh"
}
}

In the policy's Permissions tab, grant CAN_USE to the relevant group.

Existing clusters aren't retroactively bound to a policy — edit each one once to switch it, or recreate them. Job clusters defined in job JSON only inherit a policy if the job sets policy_id explicitly.

Global Init Script

A global init script runs automatically on every cluster in the workspace — no per-cluster configuration needed. Combined with the workspace secret from step 2, any cluster that starts will be instrumented with Definity.

In SettingsComputeGlobal init scriptsManage, click + Add, paste the script from step 3, enable it, and click Add.

warning

Global init scripts only apply to clusters using Single User or No Isolation Shared access mode. They do not run on Standard (Unity Catalog) access mode clusters. Use a cluster policy for Standard mode clusters.


Hosting the Agent JAR Yourself

The default ARTIFACT_BASE_PATH in the script points at a Definity-hosted URL. For production, upload the JAR for your Spark/Scala combination (see Compatibility Matrix) to your own location and update ARTIFACT_BASE_PATH in the init script:

StorageARTIFACT_BASE_PATH exampleNotes
Unity Catalog Volume/Volumes/<catalog>/<schema>/definity_jarsInit script must also be stored in a Volume
HTTP/HTTPShttps://your-artifactory.com/repo/libs-releaseArtifactory, Nexus, or any HTTP server
Cloud storage (S3, GCS, ADLS)s3://your-bucket/definityCluster needs appropriate cloud credentials (instance profile, IAM role, or service principal)
note

When the JAR lives in a Volume, the init script must also live in a Volume. Volume credentials are only available to Volume-stored init scripts — see Databricks docs.


Compatibility Matrix

Databricks ReleaseSpark VersionScala VersionDefinity Agent
17.3_LTS (scala 2.13)4.0.02.134.0_2.13-latest
16.4_LTS (scala 2.13)3.5.22.133.5_2.13-latest
16.4_LTS (scala 2.12)3.5.22.123.5_2.12-latest
15.4_LTS3.5.02.123.5_2.12-latest
14.3_LTS3.5.02.123.5_2.12-latest
13.3_LTS3.4.12.123.4_2.12-latest
12.2_LTS3.3.22.123.3_2.12-latest

Entity Mapping

Definity follows the same pipeline hierarchy used by orchestrators like Apache Airflow. The table below shows how these entities map to their Databricks and Spark equivalents.

DefinityDatabricksSparkNotes
PipelineWorkflow (Job)A logical grouping of Tasks, typically orchestrated on a schedule.
Pipeline RunJob RunA single execution of a Pipeline. All Task Runs sharing the same pipeline.pit or pipeline.run.id belong to the same run.
TaskJob TaskApplication (SparkContext)The definition of a unit of compute work. In Databricks multi-task workflows, each job task maps to a Definity Task. In standalone Spark, one SparkSession = one Task.
Task RunJob Task RunApplication RunA single execution of a Task — surfaced as the Task Run page in the UI and the task_run REST API.
Transformation (TF)Notebook cell / SQL statementJob (action-triggered)A single logical operation within a Task Run — e.g. one SQL query, one DataFrame.write(), or one Spark action.
DatasetDelta Table / Unity Catalog table / file pathDataFrame / TableA data artifact read or written by a Transformation. Definity tracks both inputs and outputs for lineage.
PIT (Point In Time)Scheduled time / logical dateThe logical time of the data being processed. Equivalent to the execution date in Airflow or the partition date of a time-partitioned table.
ComputeCluster (Job Cluster / All-Purpose Cluster)SparkSession / SparkContextThe infrastructure running the tasks. Tracked separately in shared compute mode.

Example: Databricks Multi-Task Workflow

Databricks Workflow "payments-etl"          →  Definity Pipeline "payments-etl"
└─ Job Run #1234 (2025-05-01) → Pipeline Run (pit=2025-05-01)
├─ Job Task "ingest" → Task "ingest"
│ └─ Job Task Run → Task Run (pit=2025-05-01)
│ ├─ SQL: INSERT INTO payments… → Transformation
│ ├─ Reads: raw_payments → Dataset (input)
│ └─ Writes: stg_payments → Dataset (output)
└─ Job Task "transform" → Task "transform"
└─ Job Task Run → Task Run (pit=2025-05-01)
├─ SQL: CREATE TABLE … → Transformation
└─ Writes: payments_final → Dataset (output)

Advanced Tracking Modes

The default Databricks integration tracks the compute cluster separately from workflows and automatically detects running workflow tasks. You may want to change this behavior in these scenarios:

Single-Task Cluster

If you have a dedicated cluster per task, disable shared cluster tracking mode and provide the Pipeline Tracking Parameters in the init script:

spark.definity.sharedCompute=false

Manual Task Tracking

To manually control task scopes programmatically, disable Databricks automatic tracking:

spark.definity.databricks.automaticSessions.enabled=false

Then follow the Multi-Task Shared Spark App guide.