# Configuration Reference

## Core Configuration Parameters

| Name | Details |
|---|---|
| `spark.jars` | URL of `definity-spark-agent-X.X.jar` (and optionally `definity-spark-iceberg-1.2-X.X.jar`) |
| `spark.plugins` | Add `ai.definity.spark.plugin.DefinitySparkPlugin` (for Spark 3.x) |
| `spark.extraListeners` | Add `ai.definity.spark.AppListener` (for Spark 2.x) |
| `spark.executor.plugins` | Add `ai.definity.spark.plugin.executor.DefinityExecutorPlugin` (for Spark 2.x) |
| `spark.definity.server` | Definity server URL (e.g., `https://app.definity.run`) |
| `spark.definity.api.token` | Integration token (required for SaaS usage). Can also be read from the driver's environment variable `DEFINITY_API_TOKEN`. |
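
For example, here is a minimal PySpark sketch wiring these parameters together (the jar URL, app name, and token value are placeholders; the same settings can equally be passed as `--conf` flags to `spark-submit`):

```python
from pyspark.sql import SparkSession

# Minimal sketch for Spark 3.x; the jar URL and token below are placeholders.
spark = (
    SparkSession.builder
    .appName("my-app")
    .config("spark.jars", "https://example.com/jars/definity-spark-agent-X.X.jar")  # placeholder URL
    .config("spark.plugins", "ai.definity.spark.plugin.DefinitySparkPlugin")        # Spark 3.x plugin
    .config("spark.definity.server", "https://app.definity.run")
    .config("spark.definity.api.token", "<your-integration-token>")  # or set DEFINITY_API_TOKEN on the driver
    .getOrCreate()
)
```
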
## Pipeline Tracking Parameters

These parameters enable tracking and monitoring of your Spark application's execution over time. Consistent naming allows you to correlate metrics and logs across multiple runs.
To group multiple tasks into the same pipeline run, use either `pipeline.pit` or `pipeline.run.id` (but not both; they are mutually exclusive):

- `pipeline.pit` (recommended): Groups tasks by a shared logical point-in-time, which is also used as each task's `app_pit`.
- `pipeline.run.id`: Groups tasks by an arbitrary identifier. When using this, each task's `app_pit` defaults to the run time of the first task that shares the same `pipeline.run.id`.
| Name | Details |
|---|---|
| `spark.definity.env.name` | Defaults to `default`. |
| `spark.definity.pipeline.name` | Defaults to `spark.app.name`. |
| `spark.definity.pipeline.pit` | The logical point-in-time of a run; defaults to now (see supported formats below). |
| `spark.definity.pipeline.run.id` | Alternative to `pit` for grouping tasks; use when a logical time isn't available. |
| `spark.definity.task.name` | Defaults to `spark.app.name`. |
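
For example, two applications can be grouped into one pipeline run by sharing the same `pipeline.pit`; in this sketch the environment, pipeline, and task names are illustrative:

```python
from pyspark.sql import SparkSession

# Both applications share pipeline.name and pipeline.pit, so they are grouped
# into the same pipeline run; only task.name differs. Names are illustrative.
shared = {
    "spark.definity.env.name": "prod",
    "spark.definity.pipeline.name": "daily-etl",
    "spark.definity.pipeline.pit": "2020-05-03 13:15:00",
}

builder = SparkSession.builder.appName("daily-etl-ingest")
for key, value in shared.items():
    builder = builder.config(key, value)
spark = builder.config("spark.definity.task.name", "ingest").getOrCreate()
# A second application would reuse the same `shared` values with a different task.name.
```
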
### Supported PIT Formats

The `pipeline.pit` parameter accepts the following date/time formats:

| Format | Example |
|---|---|
| `YYYY-MM-DD HH:MM:SS` | `2020-05-03 13:15:00` |
| `YYYY-MM-DDTHH:MM:SS` | `2020-05-04T13:15:00` |
| `YYYY-MM-DD HH` | `2020-05-03 12` |
| `YYYY/MM/DD HH:MM` | `2020/05/03 14:10` |
| `YYYY_MM_DD` | `2020_05_03` |
| `YYYY-MM-DD_HH` | `2020-05-03_12` |
| Unix timestamp (seconds) | `1688289300` |
| Unix timestamp (milliseconds) | `1688289300000` |
| ISO 8601 with microseconds | `2023-07-18T19:17:41.286948` |
| ISO 8601 with timezone | `2023-07-18T19:17:41+00:00` |
| ISO 8601 with Z suffix | `2025-03-13T14:00:00Z` |
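
As an illustration, a PIT in one of the accepted formats can be derived from a run's logical date (how you obtain that date from your scheduler is up to you):

```python
from datetime import datetime, timezone

# Illustrative: derive a PIT string from a run's logical date.
logical_date = datetime(2020, 5, 3, 13, 15, tzinfo=timezone.utc)

pit_iso = logical_date.strftime("%Y-%m-%dT%H:%M:%S")  # "2020-05-03T13:15:00"
pit_unix = str(int(logical_date.timestamp()))         # "1588511700" (Unix seconds)

# Then pass one of them, e.g.:
#   --conf spark.definity.pipeline.pit=2020-05-03T13:15:00
```
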
## Advanced Configuration

| Name | Details |
|---|---|
| `spark.definity.` | Enables or disables functionality with options: `true`, `false`, or `opt-in` (default: `true`). For `opt-in`, users can toggle this in the pipeline settings page. |
| `spark.definity.` | User-defined task ID to show in the UI and notifications (e.g., YARN run ID); defaults to `spark.app.name`. |
| `spark.definity.` | Comma-separated tags; supports `key:value` format (e.g., `team:team-A`). |
| `spark.definity.` | Comma-separated list of notification recipient emails. |
| `spark.definity.` | Interval in seconds for sending heartbeats to the server; defaults to 60. |
| `spark.definity.` | Number of retries for server request errors; defaults to 1. |
| `spark.definity.` | Comma-separated list of tables to ignore. Names can be full (e.g., `db_a.table_a`) or partial (e.g., `table_a`), which applies to all databases. |
| `spark.definity.` | Regular expression to extract time partitions from file names. Defaults to `^.*?(?=/\d+/\|/[^/]*=[^/]*/)`. Set empty to disable. |
| `spark.definity.` | Enables Delta instrumentation; defaults to `true`. Set to `false` to opt out. |
| `spark.definity.` | Maximum number of allowed inputs per query; defaults to 100. |
| `spark.definity.` | Enables a default session for apps running multiple concurrent SparkSessions; defaults to `true`. Set to `false` to disable. |
| `spark.definity.` | Maximum duration in seconds for the default session before rotation; defaults to 3600. |
| `spark.definity.` | Enables in-flight data distribution metrics; defaults to `false`. |
| `spark.definity.` | Enables debug logs; defaults to `false`. |
| `spark.definity.` | Enables auto-detection of tasks in Databricks multi-task workflows; defaults to `false`. |
| `spark.definity.` | Enables reporting of events; defaults to `true`. |
| `spark.definity.` | Maximum number of events to report in one task; defaults to 5000. |
| `spark.definity.` | Threshold for deciding when execution planning is too slow and triggering an event; defaults to 60. |
| `spark.definity.` | Enables the executor-side plugin when the Definity plugin is configured; defaults to `true`. |
## Metrics Calculation

| Name | Details |
|---|---|
| `spark.definity.` | Number of threads for metrics calculation; defaults to 2. |
| `spark.definity.` | Timeout for metrics calculation, in seconds; defaults to 180. |
| `spark.definity.` | Maximum number of values for histogram distribution; defaults to 10. |
| `spark.definity.` | Whether to extract metrics from Spark's `ExecutorMetricsUpdate` event; defaults to `true`. |
| `spark.definity.` | Time-series metrics bucket size, in seconds; defaults to 60. |
| `spark.definity.` | Total container memory for the driver, in bytes (for client mode). |
| `spark.definity.` | Total heap memory for the driver, in bytes (for client mode). |
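
To make the bucket-size setting concrete, the sketch below (our illustration, not the agent's code) shows how a 60-second bucket aligns metric timestamps:

```python
# Illustration only: align an epoch timestamp to the start of its time-series bucket.
def bucket_start(epoch_seconds: int, bucket_size_seconds: int = 60) -> int:
    return epoch_seconds - (epoch_seconds % bucket_size_seconds)

assert bucket_start(1688289325) == 1688289300  # all samples in the same minute share one bucket
```
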
## Output Diversion (Testing & CI)

Useful for CI shadow-run flows.

| Name | Details |
|---|---|
| `spark.definity.` | Suffix to add to all output tables. |
| `spark.definity.` | Suffix to add to all output tables' database names. |
| `spark.definity.` | Base location for all created output databases. |
| `spark.definity.` | Base location for output files: either a full base path (e.g., `gs://my-tests-bucket`) to divert all files to a single location regardless of their original location, or a partial path (e.g., `my-tests-base-dir`) to keep each file in its own bucket but under a different base directory. |
| `spark.definity.` | BigQuery project ID override for all BigQuery output tables. |
| `spark.definity.` | BigQuery dataset name override for all BigQuery output tables. |
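
As a rough sketch of the effect (the helper and suffix values below are illustrative, not part of the agent's API), suffixing rewrites output names like so:

```python
# Illustration only: how database and table suffixes rewrite an output table name.
def divert(table: str, db_suffix: str = "_ci", table_suffix: str = "_shadow") -> str:
    db, name = table.split(".")
    return f"{db}{db_suffix}.{name}{table_suffix}"

assert divert("db_a.table_a") == "db_a_ci.table_a_shadow"
```
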
## Skew Detection Events

Skew events are calculated in the executors, using Spark's plugin mechanism.

| Name | Details |
|---|---|
| `spark.definity.` | Interval in milliseconds between consecutive polling requests from executor to driver when using the Definity plugin; defaults to 20000. |
| `spark.definity.` | Minimum difference in seconds between a suspected skewed task's duration and the average task duration in its stage; defaults to 60. |
| `spark.definity.` | Minimum ratio between a suspected skewed task's duration and the average task duration in its stage; defaults to 2. |
| `spark.definity.` | Sampling ratio of task rows (e.g., 0.01 equals 1% sampling); defaults to 0.01. |
| `spark.definity.` | Maximum number of sampled rows per task; defaults to 1000. |
| `spark.definity.` | Maximum number of reported keys per task; defaults to 10. |
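
The sketch below illustrates how the two duration thresholds interact, assuming (our reading of the parameters above) that a task must exceed both the minimum difference and the minimum ratio to be flagged:

```python
# Sketch of the duration thresholds above; assumes both must be exceeded.
def is_suspected_skew(task_secs: float, stage_avg_secs: float,
                      min_diff_secs: float = 60.0, min_ratio: float = 2.0) -> bool:
    return (task_secs - stage_avg_secs >= min_diff_secs
            and task_secs / stage_avg_secs >= min_ratio)

assert is_suspected_skew(300.0, 100.0)      # 200 s over the average and 3x the average
assert not is_suspected_skew(150.0, 100.0)  # neither 60 s over nor 2x the average
```
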