Operational Patterns

Spark Performance Engineering

Apache Spark Optimization Patterns

Pattern

Adaptive Query Execution (AQE) — Always Enable

Static spark.sql.shuffle.partitions=200 causes OOM at high data volumes and excessive task scheduling overhead at low volumes. AQE dynamically coalesces shuffle partitions at runtime based on actual data volume, eliminating per-workload tuning.

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1")

⚠ Exception: streaming micro-batch jobs with strict latency targets where AQE re-optimization overhead may exceed the acceptable checkpoint interval.

Anti-Pattern

Sort-Merge Join on Large + Small Table

Default join strategy on a 500 GB fact table + 150 MB dimension table generates a full network shuffle on both sides — potentially terabytes of shuffle data for a dimension that fits in executor memory.

# Anti-pattern — triggers SortMergeJoin on both sides
result = large_df.join(small_df, "customer_id")

# Pattern — broadcasts small_df, eliminates shuffle entirely
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "customer_id")

# Also: raise the auto-broadcast threshold
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(256 * 1024 * 1024))

Verify the fix: run df.explain(mode="formatted") and confirm the physical plan shows BroadcastHashJoin, not SortMergeJoin.

Anti-Pattern

Using `.collect()` on Multi-GB DataFrames

All data is transferred to the Driver in a single operation. Driver OOM, job fails, Airflow retries begin. The fix: write results with .write. to Delta or S3. If the output must be in-memory, sample first and validate the sample is Driver-safe.

Delta Lake Maintenance

Delta Lake Operational Patterns

Pattern

Automated OPTIMIZE + ZORDER on a Weekly Schedule

Streaming and frequent-write workloads generate thousands of small files per day. Delta Lake performance degrades non-linearly as file count grows — acceptable for months, then a cliff. Run as a dedicated Automated Job Cluster (never on a shared cluster).

OPTIMIZE prod.customer_events
ZORDER BY (event_date, customer_id);

VACUUM prod.customer_events RETAIN 168 HOURS;  -- 7 days time-travel

Anti-Pattern

Running `VACUUM` with `RETAIN 0 HOURS` in Production

Destroys all time-travel capability instantly. If a downstream pipeline produced corrupted output 6 hours ago and you need to roll back, RETAIN 0 HOURS has made that impossible. Always retain a minimum of 168 hours (7 days) in production environments.

Pattern

Enable Auto-Optimize for Streaming Write Tables

For tables receiving frequent small writes (Kafka streaming, API event ingestion), enable auto-optimization to compact files continuously rather than requiring manual OPTIMIZE runs.

spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

Workload Isolation

Cluster Isolation Patterns

Pattern

Four-Lane Workload Isolation Architecture

All workloads compete for shared resources on a shared cluster by default. Separate by workload type, not by team, to eliminate noisy-neighbor failures and enable per-lane cost attribution.

LANE 1 — ETL

Automated Job Clusters via Cluster Pools. Ephemeral, Spot instances, 15-second cold start. 0 idle cost.

LANE 2 — STREAMING

Dedicated always-on cluster. Fixed size, no autoscaling. Predictable memory allocation for checkpoint management.

LANE 3 — ML TRAINING

On-demand high-memory or GPU cluster. Spot-first. Terminated on completion. No overnight idle cost.

LANE 4 — BI / AD-HOC

Serverless SQL Warehouse. Per-query billing. Auto-scaling concurrency. Zero management overhead.

Anti-Pattern

All-Purpose Clusters for Airflow-Scheduled ETL

All-Purpose Clusters carry a 2.5× DBU multiplier vs. Automated Job Clusters. Running production scheduled pipelines on All-Purpose Clusters is the most common source of Databricks cost overruns. Airflow DAGs should always point to Job Clusters — either dynamically provisioned or pool-backed for fast startup.

Orchestration Design

Airflow Reliability Patterns

Pattern

Dataset-Triggered DAGs (Airflow 2.4+)

ExternalTaskSensors consume Airflow worker slots continuously via polling — every 30-second poll by 200 sensors consumes meaningful worker capacity. Airflow 2.4 Dataset events replace polling with event-driven triggers: upstream DAGs emit Dataset updates on completion, downstream DAGs trigger automatically.

Anti-Pattern

Monolithic DAGs with All Dependencies in One File

A single failure in any task blocks all downstream tasks. Debugging requires examining hundreds of task states. A modification to one pipeline segment risks breaking unrelated tasks in the same file. Decompose into single-responsibility DAGs, each owning one logical pipeline phase, with Dataset event handoffs.

Pattern

Exponential Backoff with a Retry Budget

Fixed-interval retries at high concurrency convert a transient API rate limit into a retry storm. Define a maximum retry budget per task and use exponential backoff to spread retry pressure over time.

task = PythonOperator(
    task_id="call_llm_api",
    python_callable=call_llm,
    retries=3,                              # maximum 3 attempts
    retry_delay=timedelta(seconds=30),
    retry_exponential_backoff=True,         # 30s, 60s, 120s
    max_retry_delay=timedelta(minutes=10),  # cap at 10 minutes
)

Apache Spark Optimization Patterns

Adaptive Query Execution (AQE) — Always Enable

Sort-Merge Join on Large + Small Table

Using .collect() on Multi-GB DataFrames

Delta Lake Operational Patterns

Automated OPTIMIZE + ZORDER on a Weekly Schedule

Running VACUUM with RETAIN 0 HOURS in Production

Enable Auto-Optimize for Streaming Write Tables

Cluster Isolation Patterns

Four-Lane Workload Isolation Architecture

All-Purpose Clusters for Airflow-Scheduled ETL

Airflow Reliability Patterns

Dataset-Triggered DAGs (Airflow 2.4+)

Monolithic DAGs with All Dependencies in One File

Exponential Backoff with a Retry Budget

See These Patterns Applied in Context

Using `.collect()` on Multi-GB DataFrames

Running `VACUUM` with `RETAIN 0 HOURS` in Production