A reference library of engineering patterns, anti-patterns, and operational guidelines for enterprise data platforms. Written for engineers running production systems, not general audiences.
Static spark.sql.shuffle.partitions=200 causes OOM at high data volumes and excessive task scheduling overhead at low volumes. AQE dynamically coalesces shuffle partitions at runtime based on actual data volume, eliminating per-workload tuning.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1")
⚠ Exception: streaming micro-batch jobs with strict latency targets where AQE re-optimization overhead may exceed the acceptable checkpoint interval.
Default join strategy on a 500 GB fact table + 150 MB dimension table generates a full network shuffle on both sides — potentially terabytes of shuffle data for a dimension that fits in executor memory.
# Anti-pattern — triggers SortMergeJoin on both sides
result = large_df.join(small_df, "customer_id")
# Pattern — broadcasts small_df, eliminates shuffle entirely
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "customer_id")
# Also: raise the auto-broadcast threshold
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(256 * 1024 * 1024))
Verify the fix: run df.explain(mode="formatted") and confirm the physical plan shows BroadcastHashJoin, not SortMergeJoin.
.collect() on Multi-GB DataFramesAll data is transferred to the Driver in a single operation. Driver OOM, job fails, Airflow retries begin. The fix: write results with .write. to Delta or S3. If the output must be in-memory, sample first and validate the sample is Driver-safe.
Streaming and frequent-write workloads generate thousands of small files per day. Delta Lake performance degrades non-linearly as file count grows — acceptable for months, then a cliff. Run as a dedicated Automated Job Cluster (never on a shared cluster).
OPTIMIZE prod.customer_events
ZORDER BY (event_date, customer_id);
VACUUM prod.customer_events RETAIN 168 HOURS; -- 7 days time-travel
VACUUM with RETAIN 0 HOURS in ProductionDestroys all time-travel capability instantly. If a downstream pipeline produced corrupted output 6 hours ago and you need to roll back, RETAIN 0 HOURS has made that impossible. Always retain a minimum of 168 hours (7 days) in production environments.
For tables receiving frequent small writes (Kafka streaming, API event ingestion), enable auto-optimization to compact files continuously rather than requiring manual OPTIMIZE runs.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
All workloads compete for shared resources on a shared cluster by default. Separate by workload type, not by team, to eliminate noisy-neighbor failures and enable per-lane cost attribution.
Automated Job Clusters via Cluster Pools. Ephemeral, Spot instances, 15-second cold start. 0 idle cost.
Dedicated always-on cluster. Fixed size, no autoscaling. Predictable memory allocation for checkpoint management.
On-demand high-memory or GPU cluster. Spot-first. Terminated on completion. No overnight idle cost.
Serverless SQL Warehouse. Per-query billing. Auto-scaling concurrency. Zero management overhead.
All-Purpose Clusters carry a 2.5× DBU multiplier vs. Automated Job Clusters. Running production scheduled pipelines on All-Purpose Clusters is the most common source of Databricks cost overruns. Airflow DAGs should always point to Job Clusters — either dynamically provisioned or pool-backed for fast startup.
ExternalTaskSensors consume Airflow worker slots continuously via polling — every 30-second poll by 200 sensors consumes meaningful worker capacity. Airflow 2.4 Dataset events replace polling with event-driven triggers: upstream DAGs emit Dataset updates on completion, downstream DAGs trigger automatically.
A single failure in any task blocks all downstream tasks. Debugging requires examining hundreds of task states. A modification to one pipeline segment risks breaking unrelated tasks in the same file. Decompose into single-responsibility DAGs, each owning one logical pipeline phase, with Dataset event handoffs.
Fixed-interval retries at high concurrency convert a transient API rate limit into a retry storm. Define a maximum retry budget per task and use exponential backoff to spread retry pressure over time.
task = PythonOperator(
task_id="call_llm_api",
python_callable=call_llm,
retries=3, # maximum 3 attempts
retry_delay=timedelta(seconds=30),
retry_exponential_backoff=True, # 30s, 60s, 120s
max_retry_delay=timedelta(minutes=10), # cap at 10 minutes
)
Each of these patterns appears in one or more of the infrastructure teardowns — with before-and-after metrics showing the operational impact.