Documented root-cause analyses and remediation engagements. Each teardown follows the same structure: operational context, diagnostic process, root causes, interventions, and measured outcomes.
A high-growth AI startup was spending $50,000/month on Databricks. Despite the spend, their engineering team was in constant reactive firefighting mode. The problems were not unique — they were the cumulative result of default configurations, unchecked cluster provisioning habits, and no cost governance framework.
60% of production pipelines ran on All-Purpose Clusters — 2.5× the DBU cost of Automated Job Clusters. $12,000/month premium for convenience.
Streaming jobs checkpointing every 30 seconds created 2,880 file writes per table per day. Metadata I/O cost exceeded compute cost on downstream reads.
The "floor" of 10 workers kept $8,400/month of compute running overnight, processing nothing.
All Airflow-triggered pipelines migrated to ephemeral Job Clusters with cluster pools for 15-second warm start. DBU rate reduced ~50%.
Automated weekly maintenance job: 1.2M files → 1,190 files. Average file size: 2.8 MB → 880 MB. Read latency cut 62%.
Adaptive Query Execution eliminated OOM shuffle failures. Driver downsized from r5.4xlarge → r5.xlarge. Saved $3,500/month.
A scaling SaaS company's AI-powered features degraded progressively after launch. Their demos worked flawlessly. Six months into production, the platform was in near-constant incident mode. The engineering team believed they had a model problem. They had an infrastructure problem.
Nightly embedding refresh jobs timed out, writing into partially-updated Pinecone namespaces. RAG pipelines were serving contextually incoherent chunks — confident, grammatically correct, and factually wrong.
Airflow's immediate retry behavior bombarded the OpenAI API during rate limiting events, locking worker slots for 30 minutes and generating $4,200 in a single overnight billing event.
Inference pods spent 85% of request wall-time waiting for Pinecone network responses. GPU utilization: 32%. More GPUs would have done nothing.
Write to staging namespace → validate count + sample similarity → atomic swap of production alias. Zero partial states. Instant rollback capability.
After 5 consecutive API failures, circuit opens. No retries for 120-second cooling period. Retry budget enforced at 3 attempts maximum per task.
Retrieval logic (CPU, horizontal) separated from inference (GPU). GPU utilization jumped from 32% to 88%. P99 latency: 1,200ms → 250ms.
A Series B SaaS company's data platform was reliable at seed stage. By Series B, it required daily manual intervention. 15 TB/day ingestion, 70+ concurrent users, 800+ Airflow DAGs, and a monolithic shared cluster had compounded into a system that failed differently every morning.
Kafka streaming at 30-second intervals generated 2,880 file writes per table per day. After 90 days: Delta log read time exceeded 8 minutes per table.
Analyst GROUP BY queries on the shared cluster consumed all executor memory, OOM-killing concurrent ETL pipelines. CPU averaged 19% — I/O bound, not compute bound.
Hundreds of Airflow sensors polling every 30 seconds consumed 35% of worker slots, leaving insufficient capacity for actual task execution during peak hours.
ETL → Job Clusters. Streaming → dedicated always-on cluster. ML → isolated high-memory cluster. BI → Serverless SQL Warehouse. No shared compute between tiers.
Replaced ExternalTaskSensors with Airflow Dataset events. Worker slot waste from polling: 35% → under 5%.
Auto-broadcast threshold raised to 256 MB, eliminating sort-merge joins for large+small table combinations. 4 TB daily shuffle spill → 90 GB.
A FinTech organization with 450+ users across 12 isolated Databricks workspaces could not answer a compliance audit in under three weeks. Access was tied to cluster IAM roles, permissions were fragmented across 12 Hive Metastores, and there was no cross-workspace data lineage.
Data access was tied to cluster IAM roles, not user identities. Any user on a cluster inherited all data access that cluster's IAM role permitted — regardless of personal need.
To bypass 4-day IAM approval cycles, engineers broadened cluster profiles. Production cluster IAM role accumulated 40+ S3 path permissions over 3 years.
12 isolated Hive Metastores meant zero lineage visibility between teams. Schema changes in one workspace silently broke pipelines in another.
One Unity Catalog Metastore per region. All 12 workspaces attached. Access grants: GRANT SELECT ON TABLE prod.schema.table TO ad-group@company.com
All production grants defined in Terraform. New team onboarding: 2 hours via IaC module (was 2 weeks manual IAM + Hive ACL configuration).
dev / staging / prod separated by Unity Catalog Catalogs, not workspaces. Production: humans have SELECT only. Writes via CI/CD service principals only.
Rising costs not tied to business growth, unreliable pipelines, governance gaps, or performance that degrades as data volume scales — an infrastructure assessment is the right starting point.