
Google Dataproc

Overview

Google Cloud Dataproc supports two deployment models:

Dataproc Serverless: Batch workloads that run on LTS (Long-Term Support) Spark runtime versions

Dataproc Clusters: Traditional cluster-based workloads using Dataproc image versions

Compatibility Matrix

Dataproc Serverless (Runtime LTS Versions)

| Runtime Version | Spark Version | Scala Version | Definity Agent |
| --- | --- | --- | --- |
| 2.2 LTS, 2.3 LTS | 3.5.x | 2.13 | 3.5_2.13-latest |
| 1.2 LTS | 3.5.x | 2.12 | 3.5_2.12-latest |

Dataproc Clusters (Image Version)

| Image Version | Spark Version | Scala Version | Definity Agent |
| --- | --- | --- | --- |
| 2.2.x, 2.3.x | 3.5.x | 2.12.18 | 3.5_2.12-latest |
| 2.1.x | 3.3.x | 2.12.18 | 3.3_2.12-latest |
| 2.0.x | 3.1.x | 2.12.14 | 3.1_2.12-latest |
| 1.5.x | 2.4.x | 2.12.10 | 2.4_2.12-latest |
| 1.4.x | 2.4.x | 2.11.12 | 2.4_2.11-latest |
| 1.3.x | 2.3.x | 2.11.8 | 2.3_2.11-latest |
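
To pick the right agent for an existing environment, it helps to confirm which Spark and Scala versions the runtime actually reports. The PySpark snippet below is a minimal sketch; the JVM access path used for the Scala version is an assumption about the Py4J gateway and may differ in your setup.

from pyspark.sql import SparkSession

# Start (or attach to) a session on the cluster or serverless runtime in question.
spark = SparkSession.builder.appName("definity-version-check").getOrCreate()

# Spark version, e.g. 3.5.x -> choose a 3.5_* Definity agent.
print("Spark version:", spark.version)

# Scala version of the runtime, e.g. 2.12.18 -> choose a *_2.12 agent.
# Assumption: Scala's Properties object is reachable through the Py4J gateway.
print("Scala version:", spark.sparkContext._jvm.scala.util.Properties.versionString())

spark.stop()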

Job Submission Examples

Cluster Job Submission

Submit a job to an existing Dataproc cluster:

gcloud dataproc jobs submit pyspark \
    gs://your-bucket/scripts/my_job.py \
    --cluster=my-cluster \
    --region=my-region \
    --jars=gs://your-bucket/definity-spark-agent-X-X.jar \
    --properties=spark.plugins=ai.definity.spark.plugin.DefinitySparkPlugin,spark.definity.server=https://app.definity.run,spark.definity.api.token=$DEFINITY_API_TOKEN,spark.definity.env.name=demo,spark.definity.pipeline.name=example_pipeline
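
The agent is attached entirely through --jars and --properties, so the submitted job needs no Definity-specific code. A minimal my_job.py along these lines (the paths and column name are illustrative, not taken from the Definity docs) is instrumented automatically:

from pyspark.sql import SparkSession


def main():
    # The Definity plugin is activated by the spark.plugins property passed at
    # submit time, so the application code itself stays unchanged.
    spark = SparkSession.builder.appName("example_pipeline").getOrCreate()

    # Illustrative workload: read, aggregate, and write a dataset.
    df = spark.read.parquet("gs://your-bucket/data/input/")
    (df.groupBy("some_column").count()
       .write.mode("overwrite")
       .parquet("gs://your-bucket/data/output/"))

    spark.stop()


if __name__ == "__main__":
    main()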

Serverless Batch Submission

Submit a batch workload to Dataproc Serverless:

gcloud dataproc batches submit spark \
    --project=your-project \
    --region=my-region \
    --version=2.3 \
    --subnet=default \
    --class=com.example.spark.MySparkJob \
    --jars=gs://your-bucket/definity-spark-agent-X-X.jar \
    --properties=spark.plugins=ai.definity.spark.plugin.DefinitySparkPlugin,spark.definity.server=https://app.definity.run,spark.definity.api.token=$DEFINITY_API_TOKEN,spark.definity.env.name=demo,spark.definity.pipeline.name=example_pipeline

When submitting with --class, also list the jar that contains your main class in --jars, alongside the Definity agent jar.
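
Since the Definity settings are ordinary Spark configuration properties, they can also be set when the session is built rather than on the command line; the agent jar itself must still be on the classpath (for example via --jars). A sketch under that assumption, reusing the property names shown above:

import os

from pyspark.sql import SparkSession

# spark.plugins must be set before the SparkContext is created, i.e. before
# getOrCreate(); the Definity agent jar still has to be supplied separately
# (e.g. via --jars) so the plugin class can be loaded.
spark = (
    SparkSession.builder
    .appName("example_pipeline")
    .config("spark.plugins", "ai.definity.spark.plugin.DefinitySparkPlugin")
    .config("spark.definity.server", "https://app.definity.run")
    .config("spark.definity.api.token", os.environ["DEFINITY_API_TOKEN"])
    .config("spark.definity.env.name", "demo")
    .config("spark.definity.pipeline.name", "example_pipeline")
    .getOrCreate()
)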