Configuration Reference
Core Configuration Parameters
| Name | Details |
|---|---|
| spark.jars | URL of definity-spark-agent-X.X.jar (and optionally definity-spark-iceberg-1.2-X.X.jar). |
| spark.plugins | Add ai.definity.spark.plugin.DefinitySparkPlugin (for Spark 3.x). |
| spark.extraListeners | Add ai.definity.spark.AppListener (for Spark 2.x). |
| spark.executor.plugins | Add ai.definity.spark.plugin.executor.DefinityExecutorPlugin (for Spark 2.x). |
| spark.definity.server | Definity server URL (e.g., https://app.definity.run). |
| spark.definity.api.token | Integration token (required for SaaS usage). Can also be read from the driver's environment variable DEFINITY_API_TOKEN. |
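As a minimal sketch (assuming a Spark 3.x application; the jar path, application name, and token handling below are placeholders), these core parameters can be set when building the session:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch for a Spark 3.x application. The jar path, app name, and
// token handling are placeholders; adjust them to your deployment.
val spark = SparkSession.builder()
  .appName("my-app")
  .config("spark.jars", "/path/to/definity-spark-agent-X.X.jar")            // agent jar
  .config("spark.plugins", "ai.definity.spark.plugin.DefinitySparkPlugin")  // Spark 3.x plugin
  .config("spark.definity.server", "https://app.definity.run")              // Definity server URL
  .config("spark.definity.api.token", sys.env("DEFINITY_API_TOKEN"))        // or leave unset and rely on the env var
  .getOrCreate()
```

The same keys can equally be passed as --conf options to spark-submit.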
Pipeline Tracking Parameters
These parameters enable tracking and monitoring of your Spark application's execution over time. Consistent naming allows you to correlate metrics and logs across multiple runs, while a shared Point-in-Time (PIT) ensures all tasks in a pipeline reference the same logical point in time.
| Name | Details |
|---|---|
| spark.definity.env.name | Environment name; defaults to default. |
| spark.definity.pipeline.name | Pipeline name; defaults to spark.app.name. |
| spark.definity.pipeline.pit | The Point-in-Time (PIT) of a run; defaults to now. |
| spark.definity.task.name | Task name; defaults to spark.app.name. |
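For illustration only (the environment, pipeline, and task names and the PIT value below are made-up placeholders, and the expected PIT format is an assumption), a run could be labeled so that repeated executions line up under the same pipeline:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder values throughout; the PIT format shown is only an assumption.
val spark = SparkSession.builder()
  .appName("daily-sales-etl")
  .config("spark.definity.env.name", "prod")                      // environment the run belongs to
  .config("spark.definity.pipeline.name", "daily-sales-etl")      // stable name shared across runs
  .config("spark.definity.pipeline.pit", "2024-06-01T00:00:00Z")  // one Point-in-Time shared by all tasks in the run
  .config("spark.definity.task.name", "aggregate-sales")          // this task within the pipeline
  .getOrCreate()
```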
Advanced Configuration
| Name | Details |
|---|---|
| spark.definity. | Enables or disables functionality with options: true, false, or opt-in (default: true). For opt-in, users can toggle this in the pipeline settings page. |
| spark.definity. | Used for grouping tasks in the same run. |
| spark.definity. | User-defined task ID to show in the UI and notifications (e.g., YARN run ID); defaults to spark.app.name. |
| spark.definity. | Comma-separated tags, supports key:value format (e.g., team:team-A). |
| spark.definity. | Comma-separated list of notification recipient emails. |
| spark.definity. | Interval in seconds for sending heartbeats to the server; defaults to 60. |
| spark.definity. | Number of retries for server request errors; defaults to 1. |
| spark.definity. | Comma-separated list of tables to ignore. Names can be full (e.g., db_a.table_a) or partial (e.g., table_a), which applies to all databases. |
| spark.definity. | Regular expression to extract time partitions from file names. Defaults to ^.*?(?=/\d+/\|/[^/]*=[^/]*/). Set to an empty value to disable. |
| spark.definity. | Enables Delta instrumentation; defaults to true. Set to false to opt out. |
| spark.definity. | Maximum number of allowed inputs per query; defaults to 100. |
| spark.definity. | Enables a default session for applications with multiple concurrent SparkSessions; defaults to true. Set to false to disable. |
| spark.definity. | Maximum duration in seconds for the default session before rotation; defaults to 3600. |
| spark.definity. | Enables in-flight data distribution metrics; defaults to false. |
| spark.definity. | Enables debug logs; defaults to false. |
| spark.definity. | Enables auto-detection of tasks in Databricks multi-task workflows; defaults to false. |
| spark.definity. | Enables reporting of events; defaults to true. |
| spark.definity. | Maximum number of events to report per task; defaults to 5000. |
| spark.definity. | Threshold for deciding when execution planning is slow enough to trigger an event; defaults to 60. |
| spark.definity. | Enables the executor-side plugin when the Definity plugin is configured; defaults to true. |
Metrics Calculation
| Name | Details |
|---|---|
| spark.definity. | Number of threads for metrics calculation; defaults to 2. |
| spark.definity. | Timeout for metrics calculation, in seconds; defaults to 180. |
| spark.definity. | Maximum number of values for histogram distribution; defaults to 10. |
| spark.definity. | Specifies whether to extract metrics from Spark's ExecutorMetricsUpdate event; defaults to true. |
| spark.definity. | Time-series metrics bucket size in seconds; defaults to 60. |
| spark.definity. | Total container memory for the driver in bytes (for client mode). |
| spark.definity. | Total heap memory for the driver in bytes (for client mode). |
Output Diversion (Testing & CI)
Useful for CI shadow-run flows.
| Name | Details |
|---|---|
| spark.definity. | Suffix to add to all output tables. |
| spark.definity. | Suffix to add to all output tables' database names. |
| spark.definity. | Base location for all created output databases. |
| spark.definity. | Base location for output files. Either a full base path (e.g., gs://my-tests-bucket) to divert all files to a single location regardless of their original location, or a partial path (e.g., my-tests-base-dir) to keep each file in its original bucket but under a different base directory. |
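As a conceptual sketch only (this is not the definity API; the "_ci" and "_shadow" suffixes are made-up example values), the database and table suffixes above rewrite output identifiers roughly like this:

```scala
// Conceptual illustration, not definity code: how a database suffix and a table
// suffix would redirect an output identifier for a CI shadow run.
def divertedName(db: String, table: String, dbSuffix: String, tableSuffix: String): String =
  s"$db$dbSuffix.$table$tableSuffix"

divertedName("sales", "daily_agg", dbSuffix = "_ci", tableSuffix = "_shadow")
// => "sales_ci.daily_agg_shadow" -- the shadow run writes here instead of sales.daily_agg
```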
Skew Detection Events
Skew events are calculated in the executors and rely on Spark's plugin mechanism.
| Name | Details |
|---|---|
| spark.definity. | Interval in milliseconds between consecutive polling requests from executor to driver when using the Definity plugin; defaults to 20000. |
| spark.definity. | Minimum difference in seconds between a suspected skewed task's duration and the average task duration in its stage; defaults to 60. |
| spark.definity. | Minimum ratio between a suspected skewed task's duration and the average task duration in its stage; defaults to 2. |
| spark.definity. | Sampling ratio of task rows (e.g., 0.01 equals 1% sampling); defaults to 0.01. |
| spark.definity. | Maximum number of sampled rows per task; defaults to 1000. |
| spark.definity. | Maximum number of reported keys per task; defaults to 10. |
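To make the two duration thresholds concrete, here is a conceptual sketch (not definity's actual implementation) of the criterion they imply, using the default values:

```scala
// Conceptual sketch only: a task is suspected as skewed when it exceeds the
// stage's average task duration by both the minimum difference and the minimum ratio.
def isSuspectedSkew(taskSec: Double, stageAvgSec: Double,
                    minDiffSec: Double = 60.0, minRatio: Double = 2.0): Boolean =
  (taskSec - stageAvgSec) >= minDiffSec && (taskSec / stageAvgSec) >= minRatio

isSuspectedSkew(150.0, 50.0) // true: 100 s above the average and 3x the average
isSuspectedSkew(90.0, 50.0)  // false: only 40 s above the average and 1.8x the average
```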