# Google Dataproc

## Overview

Google Cloud Dataproc supports two deployment models:

- **Dataproc Serverless**: batch workloads that run on Spark runtime versions with LTS (Long-Term Support)
- **Dataproc Clusters**: traditional cluster-based workloads that run on Dataproc image versions
## Compatibility Matrix

### Dataproc Serverless (Runtime LTS Versions)

| Runtime Version | Spark Version | Scala Version | Definity Agent |
|---|---|---|---|
| 2.2 LTS, 2.3 LTS | 3.5.x | 2.13 | 3.5_2.13-latest |
| 1.2 LTS | 3.5.x | 2.12 | 3.5_2.12-latest |
### Dataproc Clusters (Image Versions)

| Image Version | Spark Version | Scala Version | Definity Agent |
|---|---|---|---|
| 2.2.x, 2.3.x | 3.5.x | 2.12.18 | 3.5_2.12-latest |
| 2.1.x | 3.3.x | 2.12.18 | 3.3_2.12-latest |
| 2.0.x | 3.1.x | 2.12.14 | 3.1_2.12-latest |
| 1.5.x | 2.4.x | 2.12.10 | 2.4_2.12-latest |
| 1.4.x | 2.4.x | 2.11.12 | 2.4_2.11-latest |
| 1.3.x | 2.3.x | 2.11.8 | 2.3_2.11-latest |
## Configuration Methods (Dataproc Clusters)

There are two ways to configure the Definity agent on Dataproc clusters:

- **Initialization Action (Recommended)**: automatically configures the agent at cluster startup
- **Job Submission**: configures the agent per job; this is also the method used for serverless batches
### Method 1: Initialization Action (Init Script)

Use an initialization action to configure the Definity agent automatically when the Dataproc cluster starts. The script:

- automatically detects your Spark and Scala versions
- downloads the appropriate Definity Spark agent
- configures the Definity plugin with default settings
- lets the cluster start normally even if configuration fails

#### 1. Create an Init Script

Create an init script that downloads and configures the Definity Spark agent:
**definity_init.sh**

```bash
#!/bin/bash
# ============================================================================
# Definity Agent Configuration for AWS EMR & Google Cloud Dataproc
# ============================================================================
# This script automatically detects your Spark and Scala versions and
# configures the appropriate Definity Spark Agent.
#
# IMPORTANT: Replace YOUR_TOKEN below with your actual Definity API token
# before running this script.
#
# If configuration fails, the cluster will start normally without the agent.
# ============================================================================
# ============================================================================
# CONFIGURATION
# ============================================================================
# Optional: Set a specific agent version (e.g. "0.75.1")
# Leave empty to use the latest version
DEFINITY_AGENT_VERSION=""
DEFINITY_API_TOKEN="YOUR_TOKEN" # <<< REPLACE WITH YOUR ACTUAL TOKEN
# IMPORTANT: For production use, upload the agent JAR to your own
# artifact repository (Artifactory, Nexus, S3, etc.) and update this URL.
# The definity.run URL shown here is for demonstration purposes only.
# Example: "https://your-artifactory.company.com/repository/libs-release/definity-spark-agent"
ARTIFACT_BASE_URL="https://user:[email protected]/java"
# ============================================================================
echo "==============================================================="
echo "Definity Agent configuration"
echo "==============================================================="
# ============================================================================
# VERSION DETECTION
# ============================================================================
SPARK_VERSION=""
SCALA_VERSION=""
echo "Detecting Spark and Scala versions..."
# Method 1: RELEASE file
if [ -f /usr/lib/spark/RELEASE ]; then
  FULL_SPARK_VERSION=$(cat /usr/lib/spark/RELEASE)
  SPARK_VERSION=$(echo "$FULL_SPARK_VERSION" | grep -oE '[0-9]+\.[0-9]+' | head -n 1)
  SCALA_VERSION=$(echo "$FULL_SPARK_VERSION" | grep -oE 'scala-[0-9]+\.[0-9]+|_[0-9]+\.[0-9]+' | sed 's/scala-//;s/_//' | head -n 1)
fi
# Method 2: EMR version (fallback for EMR 6.3 and lower, or if Scala version missing)
if { [ -z "$SPARK_VERSION" ] || [ -z "$SCALA_VERSION" ]; } && [ -f /emr/instance-controller/lib/info/extraInstanceData.json ]; then
  EMR_RELEASE=$(jq -r '.releaseLabel' /emr/instance-controller/lib/info/extraInstanceData.json 2>/dev/null || echo "")
  if [ -n "$EMR_RELEASE" ]; then
    case "$EMR_RELEASE" in
      emr-7.*) SPARK_VERSION="3.5"; SCALA_VERSION="2.12" ;;
      emr-6.1[2-5].*) SPARK_VERSION="3.4"; SCALA_VERSION="2.12" ;;
      emr-6.[8-9].*|emr-6.1[01].*) SPARK_VERSION="3.3"; SCALA_VERSION="2.12" ;;
      emr-6.[67].*) SPARK_VERSION="3.2"; SCALA_VERSION="2.12" ;;
      emr-6.[3-5].*) SPARK_VERSION="3.1"; SCALA_VERSION="2.12" ;;
      emr-6.0.*) SPARK_VERSION="2.4"; SCALA_VERSION="2.12" ;;
    esac
  fi
fi
if [ -z "$SPARK_VERSION" ] || [ -z "$SCALA_VERSION" ]; then
  echo "Could not detect Spark or Scala version"
  echo "Cluster will start without Definity agent"
  exit 0
fi
echo "Detected: Spark $SPARK_VERSION, Scala $SCALA_VERSION"
# ============================================================================
# DOWNLOAD
# ============================================================================
JAR_TEMP_PATH="/tmp/definity-spark-agent.jar"

# Build the full agent version string
if [ -z "$DEFINITY_AGENT_VERSION" ]; then
  # Use the latest version
  AGENT_VERSION="${SPARK_VERSION}_${SCALA_VERSION}-latest"
else
  # Use a specific version
  AGENT_VERSION="${SPARK_VERSION}_${SCALA_VERSION}-${DEFINITY_AGENT_VERSION}"
fi

DEFINITY_JAR_URL="${ARTIFACT_BASE_URL}/definity-spark-agent-${AGENT_VERSION}.jar"
echo "Downloading Definity Spark Agent from ${DEFINITY_JAR_URL} ..."
if curl -f --connect-timeout 30 --max-time 120 -o "$JAR_TEMP_PATH" "$DEFINITY_JAR_URL"; then
  echo "Agent jar download completed"
else
  echo "Agent jar download failed - cluster will start without Definity agent"
  exit 0
fi

if [ -d /usr/lib/spark/jars ]; then
  sudo cp "$JAR_TEMP_PATH" /usr/lib/spark/jars/definity-spark-agent.jar
  echo "Agent was copied to spark jars directory /usr/lib/spark/jars"
fi
# ============================================================================
# BACKGROUND CONFIGURATION SCRIPT
# ============================================================================
cat > /tmp/definity_config.sh <<'SCRIPT_END'
#!/bin/bash
set -eu
TIMEOUT=300
START=$(date +%s)
LAST_LOG=0
JAR_TEMP_PATH="/tmp/definity-spark-agent.jar"
DEFINITY_API_TOKEN="${DEFINITY_API_TOKEN}"
check_timeout() {
  local CONTEXT="${1:-unknown}"
  local ELAPSED=$(($(date +%s) - START))

  # Check if the timeout has been exceeded
  if [ $ELAPSED -ge $TIMEOUT ]; then
    echo "ERROR: Timeout after ${TIMEOUT}s while waiting for: ${CONTEXT}"
    echo "Exiting the background configuration process without completing the full configuration"
    exit 1
  fi

  # Log progress every 30 seconds
  if [ $((ELAPSED - LAST_LOG)) -ge 30 ] && [ $ELAPSED -gt 0 ]; then
    echo "Still waiting for ${CONTEXT} (${ELAPSED}s elapsed)..."
    LAST_LOG=$ELAPSED
  fi
  return 0
}
echo ""
echo "Background configuration started"
# Wait for Spark directory
if [ ! -d /usr/lib/spark/jars ]; then
  echo "Waiting for Spark jars directory..."
  while [ ! -d /usr/lib/spark/jars ]; do
    check_timeout "Spark jars directory"
    sleep 5
  done
  echo "Spark jars directory found"
fi
if [ ! -f /usr/lib/spark/jars/definity-spark-agent.jar ]; then
  if [ -f "$JAR_TEMP_PATH" ]; then
    sudo cp "$JAR_TEMP_PATH" /usr/lib/spark/jars/definity-spark-agent.jar
    echo "Agent JAR copied to /usr/lib/spark/jars/"
  else
    echo "ERROR: JAR not found at $JAR_TEMP_PATH"
    exit 1
  fi
fi
# Wait for config file
if [ ! -f /etc/spark/conf/spark-defaults.conf ]; then
  echo "Waiting for spark-defaults.conf..."
  while [ ! -f /etc/spark/conf/spark-defaults.conf ]; do
    check_timeout "spark-defaults.conf"
    sleep 5
  done
  echo "spark-defaults.conf found"
fi
cat >> /etc/spark/conf/spark-defaults.conf <<DEFINITY_CONF
spark.plugins ai.definity.spark.plugin.DefinitySparkPlugin
spark.extraListeners ai.definity.spark.AppListener
spark.executor.plugins ai.definity.spark.plugin.executor.DefinityExecutorPlugin
spark.definity.server https://app.definity.run
spark.definity.api.token ${DEFINITY_API_TOKEN}
DEFINITY_CONF
echo "Definity properties were added to spark-defaults.conf"
echo "Background configuration is completed"
echo ""
echo "==============================================================="
echo "Definity Spark Agent configured successfully"
echo "==============================================================="
SCRIPT_END
chmod +x /tmp/definity_config.sh
export DEFINITY_API_TOKEN

# Run the config script in the background with nohup for SIGHUP immunity.
# Output still goes to the bootstrap logs for visibility.
nohup sudo -E /tmp/definity_config.sh &

exit 0
```
For production use, upload the Definity agent JAR to your own artifact repository (S3, Artifactory, Nexus, GCS, etc.) and update the `ARTIFACT_BASE_URL` in the script. Replace `YOUR_TOKEN` with your actual Definity API token, and consider using a secrets manager to manage the token securely.
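For example, here is a minimal sketch of reading the token from Google Secret Manager inside the init script instead of hardcoding it. The secret name `definity-api-token` is illustrative, and the cluster's service account is assumed to have the `roles/secretmanager.secretAccessor` role:

```bash
# Illustrative: read the Definity API token from Secret Manager.
# Assumes a secret named "definity-api-token" exists in this project.
DEFINITY_API_TOKEN=$(gcloud secrets versions access latest \
  --secret="definity-api-token")
```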
#### 2. Upload the Init Script to GCS

```bash
gsutil cp definity_init.sh gs://your-bucket/scripts/definity_init.sh
```
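On newer Cloud SDK versions, the equivalent `gcloud storage` command works as well:

```bash
gcloud storage cp definity_init.sh gs://your-bucket/scripts/definity_init.sh
```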
#### 3. Create the Cluster with the Initialization Action

```bash
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --initialization-actions=gs://your-bucket/scripts/definity_init.sh \
  --image-version=2.2
```
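If you prefer not to bake the token into the script, one possible pattern is to pass it as custom cluster metadata and read it back inside the init action. This is a sketch, not part of the script above; the `definity-api-token` metadata key is an assumption:

```bash
# Pass the token as custom cluster metadata at creation time (illustrative):
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --initialization-actions=gs://your-bucket/scripts/definity_init.sh \
  --image-version=2.2 \
  --metadata=definity-api-token=YOUR_TOKEN

# Inside definity_init.sh, read it back using the metadata helper that ships
# with Dataproc images:
DEFINITY_API_TOKEN=$(/usr/share/google/get_metadata_value attributes/definity-api-token)
```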
#### 4. Configure Additional Settings [Optional]

You can extend the spark-defaults.conf section of the init script with additional configuration parameters, as in the sketch below.
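For example, the heredoc in the background configuration script could be extended with the `spark.definity.env.name` and `spark.definity.pipeline.name` properties used in the job-submission examples below (the `demo` and `example_pipeline` values are placeholders):

```bash
cat >> /etc/spark/conf/spark-defaults.conf <<DEFINITY_CONF
spark.plugins ai.definity.spark.plugin.DefinitySparkPlugin
spark.extraListeners ai.definity.spark.AppListener
spark.executor.plugins ai.definity.spark.plugin.executor.DefinityExecutorPlugin
spark.definity.server https://app.definity.run
spark.definity.api.token ${DEFINITY_API_TOKEN}
spark.definity.env.name demo
spark.definity.pipeline.name example_pipeline
DEFINITY_CONF
```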
### Method 2: Job Submission

Alternatively, you can specify the Definity agent JAR and configuration parameters directly when submitting each job. This approach gives you more control over individual job configurations, but requires specifying the agent settings on every submission.

#### Cluster Job Submission

Submit a job to an existing Dataproc cluster:
```bash
gcloud dataproc jobs submit pyspark \
  gs://your-bucket/scripts/my_job.py \
  --cluster=my-cluster \
  --region=my-region \
  --jars gs://your-bucket/definity-spark-agent-X-X.jar \
  --properties spark.plugins=ai.definity.spark.plugin.DefinitySparkPlugin,spark.definity.server=https://app.definity.run,spark.definity.api.token=$DEFINITY_API_TOKEN,spark.definity.env.name=demo,spark.definity.pipeline.name=example_pipeline
```
#### Serverless Batch Submission

```bash
gcloud dataproc batches submit spark \
  --project your-project \
  --region=my-region \
  --version 2.3 \
  --subnet default \
  --class com.example.spark.MySparkJob \
  --jars gs://your-bucket/definity-spark-agent-X-X.jar \
  --properties spark.plugins=ai.definity.spark.plugin.DefinitySparkPlugin,spark.definity.server=https://app.definity.run,spark.definity.api.token=$DEFINITY_API_TOKEN,spark.definity.env.name=demo,spark.definity.pipeline.name=example_pipeline
```

In both commands, replace the `X-X` placeholder in the agent JAR name with the version that matches your Spark and Scala versions from the compatibility matrix above; for serverless runtime 2.3 (Spark 3.5.x, Scala 2.13), for example, that is `3.5_2.13-latest`.
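For a PySpark serverless batch, the submission looks similar (a sketch reusing the same properties; `my_job.py` is the placeholder script from the cluster example):

```bash
gcloud dataproc batches submit pyspark \
  gs://your-bucket/scripts/my_job.py \
  --project your-project \
  --region=my-region \
  --version 2.3 \
  --subnet default \
  --jars gs://your-bucket/definity-spark-agent-X-X.jar \
  --properties spark.plugins=ai.definity.spark.plugin.DefinitySparkPlugin,spark.definity.server=https://app.definity.run,spark.definity.api.token=$DEFINITY_API_TOKEN,spark.definity.env.name=demo,spark.definity.pipeline.name=example_pipeline
```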