# Google Dataproc

## Overview

Google Cloud Dataproc supports two deployment models:

- **Dataproc Serverless**: batch workloads that run on Spark runtime versions with LTS (Long-Term Support)
- **Dataproc Clusters**: traditional cluster-based workloads that run on Dataproc image versions
## Compatibility Matrix

### Dataproc Serverless (Runtime LTS Versions)

| Runtime Version | Spark Version | Scala Version | Definity Agent |
|---|---|---|---|
| 2.2 LTS, 2.3 LTS | 3.5.x | 2.13 | 3.5_2.13-latest |
| 1.2 LTS | 3.5.x | 2.12 | 3.5_2.12-latest |
### Dataproc Clusters (Image Versions)

| Image Version | Spark Version | Scala Version | Definity Agent |
|---|---|---|---|
| 2.2.x, 2.3.x | 3.5.x | 2.12.18 | 3.5_2.12-latest |
| 2.1.x | 3.3.x | 2.12.18 | 3.3_2.12-latest |
| 2.0.x | 3.1.x | 2.12.14 | 3.1_2.12-latest |
| 1.5.x | 2.4.x | 2.12.10 | 2.4_2.12-latest |
| 1.4.x | 2.4.x | 2.11.12 | 2.4_2.11-latest |
| 1.3.x | 2.3.x | 2.11.8 | 2.3_2.11-latest |
## Configuration Methods (Dataproc Clusters)

There are two ways to configure the Definity agent on Dataproc clusters:

- **Initialization Action (Recommended)**: automatically configures the agent at cluster startup
- **Job Submission**: configures the agent per job; this is also the method used for serverless batches
### Method 1: Initialization Action (Init Script)

Use an initialization action to configure the Definity agent automatically when the Dataproc cluster starts. The script:

- automatically detects your Spark and Scala versions
- downloads the appropriate Definity Spark agent
- configures the Definity plugin with default settings
- lets the cluster start normally even if configuration fails

#### 1. Create an Init Script

Create an init script that downloads and configures the Definity Spark agent:
**definity_init.sh**

```bash
#!/bin/bash
# ============================================================================
# Definity Agent Configuration for AWS EMR & Google Cloud Dataproc
# ============================================================================
# This script automatically detects your Spark and Scala versions and
# configures the appropriate Definity Spark Agent.
#
# IMPORTANT: Replace YOUR_TOKEN below with your actual Definity API token
# before running this script.
#
# If configuration fails, the cluster will start normally without the agent.
# ============================================================================
# ============================================================================
# CONFIGURATION
# ============================================================================
# Optional: Set a specific agent version (e.g. "0.75.1")
# Leave empty to use the latest version
DEFINITY_AGENT_VERSION=""
DEFINITY_API_TOKEN="YOUR_TOKEN" # <<< REPLACE WITH YOUR ACTUAL TOKEN
# IMPORTANT: For production use, upload the agent JAR to your own
# artifact repository (Artifactory, Nexus, S3, etc.) and update this URL.
# The definity.run URL shown here is for demonstration purposes only.
# Example: "https://your-artifactory.company.com/repository/libs-release/definity-spark-agent"
ARTIFACT_BASE_URL="https://user:[email protected]/java"
# ============================================================================
echo "==============================================================="
echo "Definity Agent configuration"
echo "==============================================================="
# ============================================================================
# VERSION DETECTION
# ============================================================================
SPARK_VERSION=""
SCALA_VERSION=""
echo "Detecting Spark and Scala versions..."
# Method 1: RELEASE file
if [ -f /usr/lib/spark/RELEASE ]; then
  FULL_SPARK_VERSION=$(cat /usr/lib/spark/RELEASE)
  SPARK_VERSION=$(echo "$FULL_SPARK_VERSION" | grep -oE '[0-9]+\.[0-9]+' | head -n 1)
  SCALA_VERSION=$(echo "$FULL_SPARK_VERSION" | grep -oE 'scala-[0-9]+\.[0-9]+|_[0-9]+\.[0-9]+' | sed 's/scala-//;s/_//' | head -n 1)
fi
# Method 2: EMR version (fallback for EMR 6.3 and lower, or if Scala version missing)
if { [ -z "$SPARK_VERSION" ] || [ -z "$SCALA_VERSION" ]; } && [ -f /emr/instance-controller/lib/info/extraInstanceData.json ]; then
  EMR_RELEASE=$(jq -r '.releaseLabel' /emr/instance-controller/lib/info/extraInstanceData.json 2>/dev/null || echo "")
  if [ -n "$EMR_RELEASE" ]; then
    case "$EMR_RELEASE" in
      emr-7.*) SPARK_VERSION="3.5"; SCALA_VERSION="2.12" ;;
      emr-6.1[2-5].*) SPARK_VERSION="3.4"; SCALA_VERSION="2.12" ;;
      emr-6.[8-9].*|emr-6.1[01].*) SPARK_VERSION="3.3"; SCALA_VERSION="2.12" ;;
      emr-6.[67].*) SPARK_VERSION="3.2"; SCALA_VERSION="2.12" ;;
      emr-6.[3-5].*) SPARK_VERSION="3.1"; SCALA_VERSION="2.12" ;;
      emr-6.0.*) SPARK_VERSION="2.4"; SCALA_VERSION="2.12" ;;
    esac
  fi
fi
if [ -z "$SPARK_VERSION" ] || [ -z "$SCALA_VERSION" ]; then
  echo "Could not detect Spark or Scala version"
  echo "Cluster will start without Definity agent"
  exit 0
fi
echo "Detected: Spark $SPARK_VERSION, Scala $SCALA_VERSION"
# ============================================================================
# DOWNLOAD
# ============================================================================
JAR_TEMP_PATH="/tmp/definity-spark-agent.jar"

# Build the full agent version string
if [ -z "$DEFINITY_AGENT_VERSION" ]; then
  # Use the latest version
  AGENT_VERSION="${SPARK_VERSION}_${SCALA_VERSION}-latest"
else
  # Use a specific version
  AGENT_VERSION="${SPARK_VERSION}_${SCALA_VERSION}-${DEFINITY_AGENT_VERSION}"
fi

DEFINITY_JAR_URL="${ARTIFACT_BASE_URL}/definity-spark-agent-${AGENT_VERSION}.jar"
echo "Downloading Definity Spark Agent from ${DEFINITY_JAR_URL} ..."
if curl -f --connect-timeout 30 --max-time 120 -o "$JAR_TEMP_PATH" "$DEFINITY_JAR_URL"; then
  echo "Agent jar download completed"
else
  echo "Agent jar download failed - cluster will start without Definity agent"
  exit 0
fi

if [ -d /usr/lib/spark/jars ]; then
  sudo cp "$JAR_TEMP_PATH" /usr/lib/spark/jars/definity-spark-agent.jar
  echo "Agent was copied to spark jars directory /usr/lib/spark/jars"
fi
# ============================================================================
# BACKGROUND CONFIGURATION SCRIPT
# ============================================================================
cat > /tmp/definity_config.sh <<'SCRIPT_END'
#!/bin/bash
set -eu
TIMEOUT=300
START=$(date +%s)
LAST_LOG=0
JAR_TEMP_PATH="/tmp/definity-spark-agent.jar"
DEFINITY_API_TOKEN="${DEFINITY_API_TOKEN}"
check_timeout() {
  local CONTEXT="${1:-unknown}"
  local ELAPSED=$(($(date +%s) - START))

  # Check if the timeout has been exceeded
  if [ $ELAPSED -ge $TIMEOUT ]; then
    echo "ERROR: Timeout after ${TIMEOUT}s while waiting for: ${CONTEXT}"
    echo "Exiting the background configuration process without completing the full configuration"
    exit 1
  fi

  # Log progress every 30 seconds
  if [ $((ELAPSED - LAST_LOG)) -ge 30 ] && [ $ELAPSED -gt 0 ]; then
    echo "Still waiting for ${CONTEXT} (${ELAPSED}s elapsed)..."
    LAST_LOG=$ELAPSED
  fi
  return 0
}
echo ""
echo "Background configuration started"
# Wait for Spark directory
if [ ! -d /usr/lib/spark/jars ]; then
  echo "Waiting for Spark jars directory..."
  while [ ! -d /usr/lib/spark/jars ]; do
    check_timeout "Spark jars directory"
    sleep 5
  done
  echo "Spark jars directory found"
fi
if [ ! -f /usr/lib/spark/jars/definity-spark-agent.jar ]; then
  if [ -f "$JAR_TEMP_PATH" ]; then
    sudo cp "$JAR_TEMP_PATH" /usr/lib/spark/jars/definity-spark-agent.jar
    echo "Agent JAR copied to /usr/lib/spark/jars/"
  else
    echo "ERROR: JAR not found at $JAR_TEMP_PATH"
    exit 1
  fi
fi
# Wait for config file
if [ ! -f /etc/spark/conf/spark-defaults.conf ]; then
  echo "Waiting for spark-defaults.conf..."
  while [ ! -f /etc/spark/conf/spark-defaults.conf ]; do
    check_timeout "spark-defaults.conf"
    sleep 5
  done
  echo "spark-defaults.conf found"
fi
cat >> /etc/spark/conf/spark-defaults.conf <<DEFINITY_CONF
spark.plugins ai.definity.spark.plugin.DefinitySparkPlugin
spark.extraListeners ai.definity.spark.AppListener
spark.executor.plugins ai.definity.spark.plugin.executor.DefinityExecutorPlugin
spark.definity.server https://app.definity.run
spark.definity.api.token ${DEFINITY_API_TOKEN}
DEFINITY_CONF
echo "Definity properties were added to spark-defaults.conf"
echo "Background configuration is completed"
echo ""
echo "==============================================================="
echo "Definity Spark Agent configured successfully"
echo "==============================================================="
SCRIPT_END
chmod +x /tmp/definity_config.sh
export DEFINITY_API_TOKEN

# Run the config script in the background with nohup for SIGHUP immunity.
# Output still goes to the bootstrap logs for visibility.
nohup sudo -E /tmp/definity_config.sh &

exit 0
```
For production use, upload the Definity agent JAR to your own artifact repository (S3, Artifactory, Nexus, GCS, etc.) and update the `ARTIFACT_BASE_URL` in the script. Replace `YOUR_TOKEN` with your actual Definity API token, and consider using a secrets manager to manage the token securely.
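For example, here is a minimal sketch of reading the token from Google Secret Manager inside the init script instead of hardcoding it. The secret name `definity-api-token` is illustrative, and the cluster's service account is assumed to have the `roles/secretmanager.secretAccessor` role:

```bash
# Illustrative: read the Definity API token from Secret Manager.
# Assumes a secret named "definity-api-token" exists in this project.
DEFINITY_API_TOKEN=$(gcloud secrets versions access latest \
  --secret="definity-api-token")
```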
#### 2. Upload the Init Script to GCS

```bash
gsutil cp definity_init.sh gs://your-bucket/scripts/definity_init.sh
```
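On newer Cloud SDK versions, the equivalent `gcloud storage` command works as well:

```bash
gcloud storage cp definity_init.sh gs://your-bucket/scripts/definity_init.sh
```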
#### 3. Create the Cluster with the Initialization Action

```bash
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --initialization-actions=gs://your-bucket/scripts/definity_init.sh \
  --image-version=2.2
```
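If you prefer not to bake the token into the script, one possible pattern is to pass it as custom cluster metadata and read it back inside the init action. This is a sketch, not part of the script above; the `definity-api-token` metadata key is an assumption:

```bash
# Pass the token as custom cluster metadata at creation time (illustrative):
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --initialization-actions=gs://your-bucket/scripts/definity_init.sh \
  --image-version=2.2 \
  --metadata=definity-api-token=YOUR_TOKEN

# Inside definity_init.sh, read it back using the metadata helper that ships
# with Dataproc images:
DEFINITY_API_TOKEN=$(/usr/share/google/get_metadata_value attributes/definity-api-token)
```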
#### 4. Configure Additional Settings [Optional]

You can extend the spark-defaults.conf section of the init script with additional configuration parameters, as in the sketch below.
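For example, the heredoc in the background configuration script could be extended with the `spark.definity.env.name` and `spark.definity.pipeline.name` properties used in the job-submission examples below (the `demo` and `example_pipeline` values are placeholders):

```bash
cat >> /etc/spark/conf/spark-defaults.conf <<DEFINITY_CONF
spark.plugins ai.definity.spark.plugin.DefinitySparkPlugin
spark.extraListeners ai.definity.spark.AppListener
spark.executor.plugins ai.definity.spark.plugin.executor.DefinityExecutorPlugin
spark.definity.server https://app.definity.run
spark.definity.api.token ${DEFINITY_API_TOKEN}
spark.definity.env.name demo
spark.definity.pipeline.name example_pipeline
DEFINITY_CONF
```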
### Method 2: Job Submission

Alternatively, you can specify the Definity agent JAR and configuration parameters directly when submitting each job. This approach gives you more control over individual job configurations, but requires specifying the agent settings on every submission.

#### Cluster Job Submission

Submit a job to an existing Dataproc cluster:
```bash
gcloud dataproc jobs submit pyspark \
  gs://your-bucket/scripts/my_job.py \
  --cluster=my-cluster \
  --region=my-region \
  --jars gs://your-bucket/definity-spark-agent-X-X.jar \
  --properties spark.plugins=ai.definity.spark.plugin.DefinitySparkPlugin,spark.definity.server=https://app.definity.run,spark.definity.api.token=$DEFINITY_API_TOKEN,spark.definity.env.name=demo,spark.definity.pipeline.name=example_pipeline
```
#### Serverless Batch Submission

```bash
gcloud dataproc batches submit spark \
  --project your-project \
  --region=my-region \
  --version 2.3 \
  --subnet default \
  --class com.example.spark.MySparkJob \
  --jars gs://your-bucket/definity-spark-agent-X-X.jar \
  --properties spark.plugins=ai.definity.spark.plugin.DefinitySparkPlugin,spark.definity.server=https://app.definity.run,spark.definity.api.token=$DEFINITY_API_TOKEN,spark.definity.env.name=demo,spark.definity.pipeline.name=example_pipeline
```

In both commands, replace the `X-X` placeholder in the agent JAR name with the version that matches your Spark and Scala versions from the compatibility matrix above; for serverless runtime 2.3 (Spark 3.5.x, Scala 2.13), for example, that is `3.5_2.13-latest`.
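For a PySpark serverless batch, the submission looks similar (a sketch reusing the same properties; `my_job.py` is the placeholder script from the cluster example):

```bash
gcloud dataproc batches submit pyspark \
  gs://your-bucket/scripts/my_job.py \
  --project your-project \
  --region=my-region \
  --version 2.3 \
  --subnet default \
  --jars gs://your-bucket/definity-spark-agent-X-X.jar \
  --properties spark.plugins=ai.definity.spark.plugin.DefinitySparkPlugin,spark.definity.server=https://app.definity.run,spark.definity.api.token=$DEFINITY_API_TOKEN,spark.definity.env.name=demo,spark.definity.pipeline.name=example_pipeline
```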