# Configuration Reference

## Core Configuration Parameters

| Name | Details |
|---|---|
| `spark.jars` | URL of `definity-spark-agent-X.X.jar` (and optionally `definity-spark-iceberg-1.2-X.X.jar`) |
| `spark.plugins` | Add `ai.definity.spark.plugin.DefinitySparkPlugin` (for Spark 3.x) |
| `spark.extraListeners` | Add `ai.definity.spark.AppListener` (for Spark 2.x) |
| `spark.executor.plugins` | Add `ai.definity.spark.plugin.executor.DefinityExecutorPlugin` (for Spark 2.x) |
| `spark.definity.server` | Definity server URL (e.g., `https://app.definity.run`) |
| `spark.definity.api.token` | Integration token (required for SaaS usage). Can also be read from the driver's environment variable `DEFINITY_API_TOKEN`. |
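
For example, here is a minimal PySpark sketch wiring these parameters together (the jar URL, app name, and token value are placeholders; the same settings can equally be passed as `--conf` flags to `spark-submit`):

```python
from pyspark.sql import SparkSession

# Minimal sketch for Spark 3.x; the jar URL and token below are placeholders.
spark = (
    SparkSession.builder
    .appName("my-app")
    .config("spark.jars", "https://example.com/jars/definity-spark-agent-X.X.jar")  # placeholder URL
    .config("spark.plugins", "ai.definity.spark.plugin.DefinitySparkPlugin")        # Spark 3.x plugin
    .config("spark.definity.server", "https://app.definity.run")
    .config("spark.definity.api.token", "<your-integration-token>")  # or set DEFINITY_API_TOKEN on the driver
    .getOrCreate()
)
```
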
## Pipeline Tracking Parameters

These parameters enable tracking and monitoring of your Spark application's execution over time. Consistent naming allows you to correlate metrics and logs across multiple runs.
To group multiple tasks into the same pipeline run, use either `pipeline.pit` or `pipeline.run.id` (but not both; they are mutually exclusive):

- `pipeline.pit` (recommended): Groups tasks by a shared logical point-in-time, which is also used as each task's `app_pit`.
- `pipeline.run.id`: Groups tasks by an arbitrary identifier. When using this, each task's `app_pit` defaults to the run time of the first task that shares the same `pipeline.run.id`.
| Name | Details |
|---|---|
| `spark.definity.env.name` | Defaults to `default`. |
| `spark.definity.pipeline.name` | Defaults to `spark.app.name`. |
| `spark.definity.pipeline.pit` | The logical point-in-time of a run; defaults to now (see supported formats below). |
| `spark.definity.pipeline.run.id` | Alternative to `pit` for grouping tasks; use when a logical time isn't available. |
| `spark.definity.task.name` | Defaults to `spark.app.name`. |
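
For example, two applications can be grouped into one pipeline run by sharing the same `pipeline.pit`; in this sketch the environment, pipeline, and task names are illustrative:

```python
from pyspark.sql import SparkSession

# Both applications share pipeline.name and pipeline.pit, so they are grouped
# into the same pipeline run; only task.name differs. Names are illustrative.
shared = {
    "spark.definity.env.name": "prod",
    "spark.definity.pipeline.name": "daily-etl",
    "spark.definity.pipeline.pit": "2020-05-03 13:15:00",
}

builder = SparkSession.builder.appName("daily-etl-ingest")
for key, value in shared.items():
    builder = builder.config(key, value)
spark = builder.config("spark.definity.task.name", "ingest").getOrCreate()
# A second application would reuse the same `shared` values with a different task.name.
```
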
### Supported PIT Formats

The `pipeline.pit` parameter accepts the following date/time formats:

| Format | Example |
|---|---|
| `YYYY-MM-DD HH:MM:SS` | `2020-05-03 13:15:00` |
| `YYYY-MM-DDTHH:MM:SS` | `2020-05-04T13:15:00` |
| `YYYY-MM-DD HH` | `2020-05-03 12` |
| `YYYY/MM/DD HH:MM` | `2020/05/03 14:10` |
| `YYYY_MM_DD` | `2020_05_03` |
| `YYYY-MM-DD_HH` | `2020-05-03_12` |
| Unix timestamp (seconds) | `1688289300` |
| Unix timestamp (milliseconds) | `1688289300000` |
| ISO 8601 with microseconds | `2023-07-18T19:17:41.286948` |
| ISO 8601 with timezone | `2023-07-18T19:17:41+00:00` |
| ISO 8601 with Z suffix | `2025-03-13T14:00:00Z` |
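
As an illustration, a PIT in one of the accepted formats can be derived from a run's logical date (how you obtain that date from your scheduler is up to you):

```python
from datetime import datetime, timezone

# Illustrative: derive a PIT string from a run's logical date.
logical_date = datetime(2020, 5, 3, 13, 15, tzinfo=timezone.utc)

pit_iso = logical_date.strftime("%Y-%m-%dT%H:%M:%S")  # "2020-05-03T13:15:00"
pit_unix = str(int(logical_date.timestamp()))         # "1588511700" (Unix seconds)

# Then pass one of them, e.g.:
#   --conf spark.definity.pipeline.pit=2020-05-03T13:15:00
```
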
## Advanced Configuration

| Name | Details |
|---|---|
| `spark.definity.` | Enables or disables functionality with options: `true`, `false`, or `opt-in` (default: `true`). For `opt-in`, users can toggle this in the pipeline settings page. |
| `spark.definity.` | User-defined task ID to show in the UI and notifications (e.g., YARN run ID); defaults to `spark.app.name`. |
| `spark.definity.` | Comma-separated tags; supports `key:value` format (e.g., `team:team-A`). |
| `spark.definity.` | Comma-separated list of notification recipient emails. |
| `spark.definity.` | Interval in seconds for sending heartbeats to the server; defaults to 60. |
| `spark.definity.` | Number of retries for server request errors; defaults to 1. |
| `spark.definity.` | Comma-separated list of tables to ignore. Names can be full (e.g., `db_a.table_a`) or partial (e.g., `table_a`), which applies to all databases. |
| `spark.definity.` | Regular expression to extract time partitions from file names. Defaults to `^.*?(?=/\d+/\|/[^/]*=[^/]*/)`. Set empty to disable. |
| `spark.definity.` | Enables Delta instrumentation; defaults to `true`. Set to `false` to opt out. |
| `spark.definity.` | Maximum number of allowed inputs per query; defaults to 100. |
| `spark.definity.` | Enables a default session for apps running multiple concurrent SparkSessions; defaults to `true`. Set to `false` to disable. |
| `spark.definity.` | Maximum duration in seconds for the default session before rotation; defaults to 3600. |
| `spark.definity.` | Enables in-flight data distribution metrics; defaults to `false`. |
| `spark.definity.` | Enables debug logs; defaults to `false`. |
| `spark.definity.` | Enables auto-detection of tasks in Databricks multi-task workflows; defaults to `false`. |
| `spark.definity.` | Enables reporting of events; defaults to `true`. |
| `spark.definity.` | Maximum number of events to report in one task; defaults to 5000. |
| `spark.definity.` | Threshold for deciding when execution planning is too slow and triggering an event; defaults to 60. |
| `spark.definity.` | Enables the executor-side plugin when the Definity plugin is configured; defaults to `true`. |
## Metrics Calculation

| Name | Details |
|---|---|
| `spark.definity.` | Number of threads for metrics calculation; defaults to 2. |
| `spark.definity.` | Timeout for metrics calculation, in seconds; defaults to 180. |
| `spark.definity.` | Maximum number of values for histogram distribution; defaults to 10. |
| `spark.definity.` | Whether to extract metrics from Spark's `ExecutorMetricsUpdate` event; defaults to `true`. |
| `spark.definity.` | Time-series metrics bucket size, in seconds; defaults to 60. |
| `spark.definity.` | Total container memory for the driver, in bytes (for client mode). |
| `spark.definity.` | Total heap memory for the driver, in bytes (for client mode). |
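
To make the bucket-size setting concrete, the sketch below (our illustration, not the agent's code) shows how a 60-second bucket aligns metric timestamps:

```python
# Illustration only: align an epoch timestamp to the start of its time-series bucket.
def bucket_start(epoch_seconds: int, bucket_size_seconds: int = 60) -> int:
    return epoch_seconds - (epoch_seconds % bucket_size_seconds)

assert bucket_start(1688289325) == 1688289300  # all samples in the same minute share one bucket
```
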
## Output Diversion (Testing & CI)

Useful for CI shadow-run flows.

| Name | Details |
|---|---|
| `spark.definity.` | Suffix to add to all output tables. |
| `spark.definity.` | Suffix to add to all output tables' database names. |
| `spark.definity.` | Base location for all created output databases. |
| `spark.definity.` | Base location for output files: either a full base path (e.g., `gs://my-tests-bucket`) to divert all files to a single location regardless of their original location, or a partial path (e.g., `my-tests-base-dir`) to keep each file in its own bucket but under a different base directory. |
| `spark.definity.` | BigQuery project ID override for all BigQuery output tables. |
| `spark.definity.` | BigQuery dataset name override for all BigQuery output tables. |
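
As a rough sketch of the effect (the helper and suffix values below are illustrative, not part of the agent's API), suffixing rewrites output names like so:

```python
# Illustration only: how database and table suffixes rewrite an output table name.
def divert(table: str, db_suffix: str = "_ci", table_suffix: str = "_shadow") -> str:
    db, name = table.split(".")
    return f"{db}{db_suffix}.{name}{table_suffix}"

assert divert("db_a.table_a") == "db_a_ci.table_a_shadow"
```
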
## Skew Detection Events

Skew events are calculated in the executors, using Spark's plugin mechanism.

| Name | Details |
|---|---|
| `spark.definity.` | Interval in milliseconds between consecutive polling requests from executor to driver when using the Definity plugin; defaults to 20000. |
| `spark.definity.` | Minimum difference in seconds between a suspected skewed task's duration and the average task duration in its stage; defaults to 60. |
| `spark.definity.` | Minimum ratio between a suspected skewed task's duration and the average task duration in its stage; defaults to 2. |
| `spark.definity.` | Sampling ratio of task rows (e.g., 0.01 equals 1% sampling); defaults to 0.01. |
| `spark.definity.` | Maximum number of sampled rows per task; defaults to 1000. |
| `spark.definity.` | Maximum number of reported keys per task; defaults to 10. |
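
The sketch below illustrates how the two duration thresholds interact, assuming (our reading of the parameters above) that a task must exceed both the minimum difference and the minimum ratio to be flagged:

```python
# Sketch of the duration thresholds above; assumes both must be exceeded.
def is_suspected_skew(task_secs: float, stage_avg_secs: float,
                      min_diff_secs: float = 60.0, min_ratio: float = 2.0) -> bool:
    return (task_secs - stage_avg_secs >= min_diff_secs
            and task_secs / stage_avg_secs >= min_ratio)

assert is_suspected_skew(300.0, 100.0)      # 200 s over the average and 3x the average
assert not is_suspected_skew(150.0, 100.0)  # neither 60 s over nor 2x the average
```
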