# Google Dataproc

## Overview
Google Cloud Dataproc supports two deployment models:

- **Dataproc Serverless**: batch workloads using Spark runtime versions with LTS (Long-Term Support)
- **Dataproc Clusters**: traditional cluster-based workloads using Dataproc image versions
## Compatibility Matrix
### Dataproc Serverless (Runtime LTS Versions)
| Runtime Version | Spark Version | Scala Version | Definity Agent |
|---|---|---|---|
| 2.2 LTS, 2.3 LTS | 3.5.x | 2.13 | 3.5_2.13-latest |
| 1.2 LTS | 3.5.x | 2.12 | 3.5_2.12-latest |
### Dataproc Clusters (Image Versions)
| Image Version | Spark Version | Scala Version | Definity Agent |
|---|---|---|---|
| 2.2.x, 2.3.x | 3.5.x | 2.12.18 | 3.5_2.12-latest |
| 2.1.x | 3.3.x | 2.12.18 | 3.3_2.12-latest |
| 2.0.x | 3.1.x | 2.12.14 | 3.1_2.12-latest |
| 1.5.x | 2.4.x | 2.12.10 | 2.4_2.12-latest |
| 1.4.x | 2.4.x | 2.11.12 | 2.4_2.11-latest |
| 1.3.x | 2.3.x | 2.11.8 | 2.3_2.11-latest |
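The agent tag in both tables follows one pattern: the Spark major.minor version, an underscore, the Scala major.minor version, then `-latest`. A minimal sketch of that rule (the helper name `definity_agent_tag` is hypothetical, not part of any Definity API):

```python
# Hypothetical helper encoding the compatibility tables above:
# the agent tag is "<spark major.minor>_<scala major.minor>-latest".
def definity_agent_tag(spark_version: str, scala_version: str) -> str:
    """Return the Definity agent tag for a Spark/Scala version pair."""
    spark = ".".join(spark_version.split(".")[:2])  # e.g. "3.5.1"   -> "3.5"
    scala = ".".join(scala_version.split(".")[:2])  # e.g. "2.12.18" -> "2.12"
    return f"{spark}_{scala}-latest"

# Examples matching the matrix rows:
print(definity_agent_tag("3.5.1", "2.12.18"))  # 3.5_2.12-latest
print(definity_agent_tag("2.4.8", "2.11.12"))  # 2.4_2.11-latest
```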
## Job Submission Examples

### Cluster Job Submission

Submit a job to an existing Dataproc cluster:
```shell
gcloud dataproc jobs submit pyspark \
  gs://your-bucket/scripts/my_job.py \
  --cluster=my-cluster \
  --region=my-region \
  --jars=gs://your-bucket/definity-spark-agent-X-X.jar \
  --properties=spark.plugins=ai.definity.spark.plugin.DefinitySparkPlugin,spark.definity.server=https://app.definity.run,spark.definity.api.token=$DEFINITY_API_TOKEN,spark.definity.env.name=demo,spark.definity.pipeline.name=example_pipeline
```
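The `--properties` flag takes a single comma-separated list of `key=value` pairs, which is easy to mistype when scripting submissions. A sketch of building it from a dict (the helper `format_properties` is hypothetical, not a gcloud or Definity utility):

```python
import os

# Hypothetical helper: assemble the value for gcloud's --properties flag
# from a dict, so long Spark property lists stay readable in scripts.
def format_properties(props: dict) -> str:
    return ",".join(f"{k}={v}" for k, v in props.items())

definity_props = {
    "spark.plugins": "ai.definity.spark.plugin.DefinitySparkPlugin",
    "spark.definity.server": "https://app.definity.run",
    "spark.definity.api.token": os.environ.get("DEFINITY_API_TOKEN", ""),
    "spark.definity.env.name": "demo",
    "spark.definity.pipeline.name": "example_pipeline",
}
print(format_properties(definity_props))
```

Note that if any property value itself contains a comma, gcloud needs an alternate delimiter (the `--properties=^#^key=a,b#key2=v` escaping syntax); the plain join above assumes comma-free values.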
### Serverless Batch Submission

Submit a batch workload to Dataproc Serverless:
```shell
gcloud dataproc batches submit spark \
  --project=your-project \
  --region=my-region \
  --version=2.3 \
  --subnet=default \
  --class=com.example.spark.MySparkJob \
  --jars=gs://your-bucket/definity-spark-agent-X-X.jar \
  --properties=spark.plugins=ai.definity.spark.plugin.DefinitySparkPlugin,spark.definity.server=https://app.definity.run,spark.definity.api.token=$DEFINITY_API_TOKEN,spark.definity.env.name=demo,spark.definity.pipeline.name=example_pipeline
```

Note that the workload type (`spark`) must come directly after `batches submit`, before any flags.