Configuration Reference

Core Configuration Parameters

| Name | Details |
| --- | --- |
| spark.jars | URL of definity-spark-agent-X.X.jar (and optionally definity-spark-iceberg-1.2-X.X.jar) |
| spark.plugins | Add ai.definity.spark.plugin.DefinitySparkPlugin (for Spark 3.x) |
| spark.extraListeners | Add ai.definity.spark.AppListener (for Spark 2.x) |
| spark.executor.plugins | Add ai.definity.spark.plugin.executor.DefinityExecutorPlugin (for Spark 2.x) |
| spark.definity.server | Definity server URL (e.g., https://app.definity.run) |
| spark.definity.api.token | Integration token (required for SaaS usage). Can also be read from the driver's DEFINITY_API_TOKEN environment variable |
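
For example, a minimal PySpark session wiring these core parameters together might look like the following sketch; the jar URL, app name, and token are placeholders:

```python
# A minimal sketch for Spark 3.x; jar URL, app name, and token are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("my-app")
    .config("spark.jars", "https://example.com/definity-spark-agent-X.X.jar")
    .config("spark.plugins", "ai.definity.spark.plugin.DefinitySparkPlugin")
    .config("spark.definity.server", "https://app.definity.run")
    # Alternatively, export DEFINITY_API_TOKEN in the driver's environment.
    .config("spark.definity.api.token", "<YOUR_TOKEN>")
    .getOrCreate()
)
```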

Pipeline Tracking Parameters

These parameters enable tracking and monitoring of your Spark application's execution over time. Consistent naming allows you to correlate metrics and logs across multiple runs.

To group multiple tasks into the same pipeline run, use either pipeline.pit or pipeline.run.id (but not both; they are mutually exclusive):

  • pipeline.pit (recommended): Groups tasks by a shared logical point-in-time, which is also used as each task's app_pit.
  • pipeline.run.id: Groups tasks by an arbitrary identifier. When using this, each task's app_pit defaults to the run time of the first task that shares the same pipeline.run.id.
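
For instance, two tasks sharing the same pipeline.pit are grouped into one pipeline run; the sketch below uses illustrative names and values:

```python
# Two tasks sharing one logical point-in-time land in the same pipeline run.
# Pipeline name, task names, and the PIT value are illustrative.
common = {
    "spark.definity.pipeline.name": "daily-etl",
    "spark.definity.pipeline.pit": "2020-05-03 13:15:00",  # also each task's app_pit
}
task_a = {**common, "spark.definity.task.name": "ingest"}
task_b = {**common, "spark.definity.task.name": "transform"}
```
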
| Name | Details |
| --- | --- |
| spark.definity.env.name | Defaults to default |
| spark.definity.pipeline.name | Defaults to spark.app.name |
| spark.definity.pipeline.pit | The logical point-in-time of a run; defaults to now (see supported formats below) |
| spark.definity.pipeline.run.id | Alternative to pit for grouping tasks; use when a logical time isn't available |
| spark.definity.task.name | Defaults to spark.app.name |

Supported PIT Formats

The pipeline.pit parameter accepts the following date/time formats:

| Format | Example |
| --- | --- |
| YYYY-MM-DD HH:MM:SS | 2020-05-03 13:15:00 |
| YYYY-MM-DDTHH:MM:SS | 2020-05-04T13:15:00 |
| YYYY-MM-DD HH | 2020-05-03 12 |
| YYYY/MM/DD HH:MM | 2020/05/03 14:10 |
| YYYY_MM_DD | 2020_05_03 |
| YYYY-MM-DD_HH | 2020-05-03_12 |
| Unix timestamp (seconds) | 1688289300 |
| Unix timestamp (milliseconds) | 1688289300000 |
| ISO 8601 with microseconds | 2023-07-18T19:17:41.286948 |
| ISO 8601 with timezone | 2023-07-18T19:17:41+00:00 |
| ISO 8601 with Z suffix | 2025-03-13T14:00:00Z |

Advanced Configuration

| Name | Details |
| --- | --- |
| spark.definity.enabled | Enables or disables the integration: true, false, or opt-in (default: true). With opt-in, users can toggle this on the pipeline settings page. |
| spark.definity.task.id | User-defined task ID shown in the UI and notifications (e.g., YARN run ID); defaults to spark.app.name. |
| spark.definity.tags | Comma-separated tags; supports key:value format (e.g., team:team-A). |
| spark.definity.email.to | Comma-separated list of notification recipient emails. |
| spark.definity.task.heartbeat.interval | Interval in seconds for sending heartbeats to the server; defaults to 60. |
| spark.definity.server.request.retry.count | Number of retries for server request errors; defaults to 1. |
| spark.definity.ignoredTables | Comma-separated list of tables to ignore. Names can be full (e.g., db_a.table_a) or partial (e.g., table_a), which applies across all databases. |
| spark.definity.files.sanitizedNamePattern | Regular expression to extract time partitions from file names. Defaults to ^.*?(?=/\d+/\|/[^/]*=[^/]*/). Set empty to disable. |
| spark.definity.delta.enabled | Enables Delta instrumentation; defaults to true. Set to false to opt out. |
| spark.definity.inputs.maxPerQuery | Maximum number of allowed inputs per query; defaults to 100. |
| spark.definity.default.session.enabled | Enables the default session for apps running multiple concurrent SparkSessions; defaults to true. Set to false to disable. |
| spark.definity.default.session.rotationSeconds | Maximum duration in seconds of the default session before rotation; defaults to 3600. |
| spark.definity.metrics.injection.enabled | Enables in-flight data distribution metrics; defaults to false. |
| spark.definity.debug | Enables debug logs; defaults to false. |
| spark.definity.databricks.automaticSessions.enabled | Enables auto-detection of tasks in Databricks multi-task workflows; defaults to true. |
| spark.definity.events.enabled | Enables reporting of events; defaults to true. |
| spark.definity.events.maxPerTaskRun | Maximum number of events to report per task run; defaults to 5000. |
| spark.definity.slowPlanning.thresholdSeconds | Threshold in seconds above which query planning is considered slow and an event is triggered; defaults to 60. |
| spark.definity.plugin.executor.enabled | Enables the executor-side plugin when the Definity plugin is configured; defaults to true. |
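
As an illustration, a few of these settings combined; all values below are examples:

```python
# Illustrative values only; note the key:value form accepted by tags.
advanced_conf = {
    "spark.definity.tags": "team:team-A,critical",
    "spark.definity.email.to": "oncall@example.com,data-eng@example.com",
    "spark.definity.ignoredTables": "db_a.table_a,tmp_table",  # full or partial names
    "spark.definity.task.heartbeat.interval": "30",  # seconds
}
```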

Metrics Calculation

| Name | Details |
| --- | --- |
| spark.definity.num.threads | Number of threads for metrics calculation; defaults to 2. |
| spark.definity.metrics.timeout | Timeout in seconds for metrics calculation; defaults to 180. |
| spark.definity.metrics.histogram.maxNumValues | Maximum number of values for histogram distribution; defaults to 10. |
| spark.definity.metrics.executorsMetrics.enabled | Whether to extract metrics from Spark's ExecutorMetricsUpdate event; defaults to true. |
| spark.definity.metrics.timeSeries.intervalSeconds | Time-series metrics bucket size in seconds; defaults to 60. |
| spark.definity.driver.containerMemory | Total container memory for the driver, in bytes (for client mode). |
| spark.definity.driver.heapMemory | Total heap memory for the driver, in bytes (for client mode). |
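
In client mode, the driver's memory figures can be supplied explicitly, as in this sketch; the byte values are illustrative:

```python
# A sketch of client-mode driver memory hints; byte values are illustrative.
metrics_conf = {
    "spark.definity.driver.containerMemory": str(8 * 1024**3),  # 8 GiB in bytes
    "spark.definity.driver.heapMemory": str(4 * 1024**3),       # 4 GiB in bytes
}
```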

Output Diversion (Testing & CI)

Useful for CI shadow-run flows.

| Name | Details |
| --- | --- |
| spark.definity.output.table.suffix | Suffix to add to all output table names. |
| spark.definity.output.database.suffix | Suffix to add to all output tables' database names. |
| spark.definity.output.database.baseLocation | Base location for all created output databases. |
| spark.definity.output.file.baseLocation | Base location for output files: either a full base path (e.g., gs://my-tests-bucket) to divert all files to a single location regardless of their original location, or a partial path (e.g., my-tests-base-dir) to keep each file in its own bucket but under a different base directory. |
| spark.definity.output.bigquery.project | BigQuery project ID override for all BigQuery output tables. |
| spark.definity.output.bigquery.dataset | BigQuery dataset name override for all BigQuery output tables. |
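
A sketch of a shadow-run configuration, assuming illustrative suffixes and locations:

```python
# Diverts writes away from production targets; all values are illustrative.
shadow_conf = {
    "spark.definity.output.table.suffix": "_ci",
    "spark.definity.output.database.suffix": "_tests",
    "spark.definity.output.file.baseLocation": "gs://my-tests-bucket",
}
# With these suffixes, a write to db_a.table_a would presumably land in
# db_a_tests.table_a_ci instead of the production table.
```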

Skew Detection Events

Skew events are calculated in the executors and use Spark's plugin mechanism.

| Name | Details |
| --- | --- |
| spark.definity.plugin.executor.driverPollingIntervalMs | Interval in milliseconds between consecutive polling requests from executor to driver when using the Definity plugin; defaults to 20000. |
| spark.definity.skewDetection.minTaskSkewTimeSeconds | Minimum difference in seconds between a suspected skewed task's duration and the average task duration in its stage; defaults to 60. |
| spark.definity.skewDetection.minTaskSkewFactor | Minimum ratio between a suspected skewed task's duration and the average task duration in its stage; defaults to 2. |
| spark.definity.skewDetection.samplingRatio | Sampling ratio of task rows (e.g., 0.01 equals 1% sampling); defaults to 0.01. |
| spark.definity.skewDetection.maxSampledRowsPerTask | Maximum number of sampled rows per task; defaults to 1000. |
| spark.definity.skewDetection.maxReportedKeysPerTask | Maximum number of reported keys per task; defaults to 10. |
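
The two duration thresholds plausibly combine as in the following hedged sketch (not the agent's actual code): a task is suspected as skewed only when it exceeds its stage's average by both the absolute minimum and the minimum factor.

```python
# Hypothetical illustration of the two skew thresholds; not the agent's implementation.
MIN_SKEW_TIME_SECONDS = 60.0  # spark.definity.skewDetection.minTaskSkewTimeSeconds
MIN_SKEW_FACTOR = 2.0         # spark.definity.skewDetection.minTaskSkewFactor

def is_suspected_skewed(task_seconds: float, stage_avg_seconds: float) -> bool:
    # Both the absolute gap and the ratio must clear their minimums.
    return (
        stage_avg_seconds > 0
        and task_seconds - stage_avg_seconds >= MIN_SKEW_TIME_SECONDS
        and task_seconds / stage_avg_seconds >= MIN_SKEW_FACTOR
    )
```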