Infrastructure Teardowns

Cost Engineering · Spark Performance

Reducing a $50K/Month Databricks Bill by 40%

A high-growth AI startup was spending $50,000/month on Databricks. Despite the spend, the engineering team was constantly firefighting. The problems were not unique: they were the cumulative result of default configurations, unchecked cluster provisioning habits, and the absence of any cost governance framework.

Failures Identified

Anti-Pattern

All-Purpose Clusters for Production ETL

60% of production pipelines ran on All-Purpose Clusters, which were billed at 2.5× the DBU cost of Automated Job Clusters: a $12,000/month premium paid for convenience.

Anti-Pattern

1.2 Million Delta Small Files

Streaming jobs checkpointing every 30 seconds created 2,880 file writes per table per day. Metadata I/O cost exceeded compute cost on downstream reads.
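The 2,880 figure follows directly from the trigger interval. A minimal sketch of the arithmetic, assuming one file per commit (in practice a micro-batch can write more, for example one file per partition); the function names are illustrative only:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def commits_per_day(trigger_interval_s: int) -> int:
    """Micro-batch commits per table per day at a fixed trigger interval."""
    if trigger_interval_s <= 0:
        raise ValueError("trigger interval must be positive")
    return SECONDS_PER_DAY // trigger_interval_s


def files_written(days: int, trigger_interval_s: int, files_per_commit: int = 1) -> int:
    """Files accumulated with no compaction; files_per_commit is an assumption."""
    return days * commits_per_day(trigger_interval_s) * files_per_commit


if __name__ == "__main__":
    print(commits_per_day(30))   # 2880, the figure quoted above
    print(commits_per_day(300))  # 288 with a 5-minute trigger
```

Lengthening the trigger interval reduces the write count linearly, which is why the interval is usually the first knob to check before scheduling compaction.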

Anti-Pattern

Autoscaling Minimum = 10 Workers

The "floor" of 10 workers kept $8,400/month of compute running overnight, processing nothing.

Interventions Applied

Pattern

Migrate to Automated Job Clusters

All Airflow-triggered pipelines migrated to ephemeral Job Clusters with cluster pools for 15-second warm start. DBU rate reduced ~50%.
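As a rough illustration of what an ephemeral, pool-backed job cluster request can look like, the sketch below builds a payload in the shape accepted by the Databricks Jobs runs-submit endpoint. The pool ID, notebook path, runtime version, and worker counts are placeholders, not values from this engagement.

```python
import json


def job_cluster_spec(pool_id: str, min_workers: int, max_workers: int,
                     spark_version: str = "13.3.x-scala2.12") -> dict:
    """Ephemeral cluster definition that draws its nodes from an instance pool."""
    if not 0 <= min_workers <= max_workers:
        raise ValueError("need 0 <= min_workers <= max_workers")
    return {
        "spark_version": spark_version,
        # With a pool, the node type comes from the pool definition.
        "instance_pool_id": pool_id,
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }


def submit_run_payload(run_name: str, notebook_path: str, cluster: dict) -> dict:
    """One-task run; the cluster exists only for the duration of the run."""
    return {
        "run_name": run_name,
        "tasks": [{
            "task_key": "main",
            "new_cluster": cluster,
            "notebook_task": {"notebook_path": notebook_path},
        }],
    }


if __name__ == "__main__":
    spec = job_cluster_spec("pool-PLACEHOLDER", min_workers=1, max_workers=8)
    print(json.dumps(submit_run_payload("nightly_etl", "/Repos/etl/main", spec), indent=2))
```

From Airflow, a dictionary like this would typically be passed as the run definition to the Databricks provider's submit-run operator; because the cluster is created per run, nothing keeps billing once the task finishes.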

Pattern

Delta OPTIMIZE + ZORDER Weekly

Automated weekly maintenance job: 1.2M files → 1,190 files. Average file size: 2.8 MB → 880 MB. Read latency cut 62%.
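A weekly maintenance job of this kind can be very small. The sketch below only builds and dispatches the Delta `OPTIMIZE ... ZORDER BY` statements; the table and column names are hypothetical, and `run_sql` is whatever executes SQL in your environment (on Databricks, `spark.sql` would be the usual choice).

```python
from typing import Callable, List, Mapping, Sequence


def optimize_statement(table: str, zorder_cols: Sequence[str] = ()) -> str:
    """Build a Delta OPTIMIZE statement, optionally with ZORDER BY."""
    stmt = f"OPTIMIZE {table}"
    if zorder_cols:
        stmt += " ZORDER BY (" + ", ".join(zorder_cols) + ")"
    return stmt


def run_weekly_maintenance(tables: Mapping[str, Sequence[str]],
                           run_sql: Callable[[str], object]) -> List[str]:
    """Compact every listed table and return the statements that were run."""
    executed = []
    for table, cols in tables.items():
        stmt = optimize_statement(table, cols)
        run_sql(stmt)
        executed.append(stmt)
    return executed


if __name__ == "__main__":
    # Hypothetical tables; print stands in for spark.sql.
    run_weekly_maintenance(
        {"prod.events": ["event_date", "user_id"], "prod.sessions": []},
        run_sql=print,
    )
```

Z-ordering only helps on columns that downstream queries actually filter on, so the column list per table is worth reviewing rather than copying.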

Pattern

Enable AQE + Resize Driver

Adaptive Query Execution removed nearly all OOM shuffle failures (OOM job failures fell 94% over the measurement window). Driver downsized from r5.4xlarge to r5.xlarge. Saved $3,500/month.
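For reference, these are the standard Spark SQL settings that control AQE. The exact set used in this engagement is not stated, so the selection below is an assumption; AQE is already on by default from Spark 3.2 onward, in which case the first key is a no-op.

```python
from typing import Callable, Dict, Mapping

# Standard Spark SQL configuration keys for Adaptive Query Execution.
AQE_SETTINGS: Dict[str, str] = {
    "spark.sql.adaptive.enabled": "true",
    # Merge small shuffle partitions after each stage.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Split oversized partitions in skewed sort-merge joins.
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def apply_settings(set_conf: Callable[[str, str], None],
                   settings: Mapping[str, str] = AQE_SETTINGS) -> None:
    """Apply each setting through the supplied setter, e.g. spark.conf.set."""
    for key, value in settings.items():
        set_conf(key, value)


if __name__ == "__main__":
    applied: Dict[str, str] = {}
    apply_settings(applied.__setitem__)  # on a cluster: apply_settings(spark.conf.set)
    print(applied)
```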

Measured Impact (90-day post-optimization)

Monthly Compute Cost: −40%

ETL Pipeline Runtime: −81%

OOM Job Failures: −94%

S3 Metadata API Cost: −93%

Technologies: Databricks (E2) · Apache Spark · Delta Lake · Airflow · AWS · Databricks Cluster Policies

AI Infrastructure Reliability · MLOps

Why AI Pipelines Fail in Production

A scaling SaaS company's AI-powered features degraded progressively after launch. Their demos worked flawlessly. Six months into production, the platform was in near-constant incident mode. The engineering team believed they had a model problem. They had an infrastructure problem.

Root Causes

Critical

Vector Index Inconsistency

Nightly embedding refresh jobs timed out mid-write, leaving Pinecone namespaces partially updated. RAG pipelines then retrieved contextually incoherent chunks, and the answers built on them were confident, grammatically correct, and factually wrong.

Critical

Retry Storm on LLM Rate Limits

Immediate Airflow task retries bombarded the OpenAI API during rate-limiting events, locking worker slots for 30 minutes and running up $4,200 in charges in a single overnight incident.

Critical

GPU Idling on Network I/O

Inference pods spent 85% of request wall-time waiting for Pinecone network responses. GPU utilization: 32%. More GPUs would have done nothing.

Remediation Applied

Pattern

Blue/Green Vector Index Refresh

Write to staging namespace → validate count + sample similarity → atomic swap of production alias. Zero partial states. Instant rollback capability.
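The write, validate, swap sequence can be sketched independently of any vector database. The example below uses an in-memory stand-in (the `AliasedIndex` class, the namespace names, and the 0.99 similarity threshold are all illustrative assumptions); with a real store, the alias would be whatever pointer your retrieval layer reads to choose a namespace.

```python
import math
from typing import Dict, Mapping, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class AliasedIndex:
    """In-memory stand-in: namespaces hold vectors, `live` names the one served."""

    def __init__(self) -> None:
        self.namespaces: Dict[str, Dict[str, Vector]] = {}
        self.live: Optional[str] = None
        self.previous: Optional[str] = None

    def write(self, namespace: str, vectors: Mapping[str, Vector]) -> None:
        self.namespaces[namespace] = dict(vectors)

    def swap(self, namespace: str) -> None:
        # One pointer update: readers see the old namespace or the new one, never a mix.
        self.previous, self.live = self.live, namespace

    def rollback(self) -> None:
        if self.previous is None:
            raise RuntimeError("no previous namespace to roll back to")
        self.live, self.previous = self.previous, self.live


def validate(staged: Mapping[str, Vector], expected_count: int,
             probes: Mapping[str, Vector], min_similarity: float = 0.99) -> bool:
    """Check the vector count, then compare sampled vectors against references."""
    if len(staged) != expected_count:
        return False
    return all(
        vec_id in staged and cosine(staged[vec_id], ref) >= min_similarity
        for vec_id, ref in probes.items()
    )


def refresh(index: AliasedIndex, staging_ns: str, vectors: Mapping[str, Vector],
            expected_count: int, probes: Mapping[str, Vector]) -> bool:
    """Write to staging, validate, and swap only if validation passes."""
    index.write(staging_ns, vectors)
    if not validate(index.namespaces[staging_ns], expected_count, probes):
        return False  # live alias untouched; production keeps serving the old data
    index.swap(staging_ns)
    return True
```

A refresh job that times out simply never reaches the swap, so the failure mode changes from serving a half-written index to serving yesterday's index.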

Pattern

Circuit Breakers + Exponential Backoff

After 5 consecutive API failures, circuit opens. No retries for 120-second cooling period. Retry budget enforced at 3 attempts maximum per task.
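A minimal sketch of how these three rules (open after 5 consecutive failures, 120-second cooling period, at most 3 attempts per task) can fit together with exponential backoff. The class and function names are illustrative, the 1-second base delay is an assumption, and the clock and sleep functions are injectable so the behaviour can be checked without waiting.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised instead of calling the API while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 120.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooling period over: close the circuit and start counting afresh.
            self.opened_at = None
            self.consecutive_failures = 0
            return True
        return False

    def record_success(self) -> None:
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = self.clock()


def call_with_retries(fn: Callable[[], T], breaker: CircuitBreaker,
                      max_attempts: int = 3, base_delay_s: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn with a bounded retry budget, backing off 1s, 2s, 4s, ..."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise CircuitOpenError("circuit open; skipping API call")
        try:
            result = fn()
        except Exception as exc:
            breaker.record_failure()
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay_s * 2 ** attempt)
        else:
            breaker.record_success()
            return result
    assert last_error is not None
    raise last_error
```

Sharing one breaker across all tasks that call the same API is what stops a retry storm: once it opens, every task fails fast instead of holding a worker slot while it hammers a rate-limited endpoint.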

Pattern

Decoupled Retrieval Service

Retrieval logic (CPU, horizontal) separated from inference (GPU). GPU utilization jumped from 32% to 88%. P99 latency: 1,200ms → 250ms.

Failed Airflow Tasks: −95%

P99 Latency: 250ms (was 1,200ms)

AI Infrastructure Cost: −38%

Model Rollback: 45min (was 3 days)

Technologies: Pinecone · MLflow · Apache Airflow · Kubernetes (EKS) · OpenAI API · Apache Spark · Great Expectations

Platform Scalability · Distributed Systems

Why Most Data Platforms Break at Scale

A Series B SaaS company's data platform was reliable at seed stage. By Series B, it required daily manual intervention. 15 TB/day ingestion, 70+ concurrent users, 800+ Airflow DAGs, and a monolithic shared cluster had compounded into a system that failed differently every morning.

Scale Failures

Anti-Pattern

812,000 Delta Files (Avg: 4 MB)

Kafka streaming at 30-second intervals generated 2,880 file writes per table per day. After 90 days: Delta log read time exceeded 8 minutes per table.

Anti-Pattern

Shared Cluster Concurrency Collapse

Analyst GROUP BY queries on the shared cluster consumed all executor memory, OOM-killing concurrent ETL pipelines. CPU averaged 19% — I/O bound, not compute bound.

Anti-Pattern

ExternalTaskSensor Polling Overload

Hundreds of Airflow sensors polling every 30 seconds consumed 35% of worker slots, leaving insufficient capacity for actual task execution during peak hours.

Architecture Changes

Pattern

Workload Isolation (4 compute lanes)

ETL → Job Clusters. Streaming → dedicated always-on cluster. ML → isolated high-memory cluster. BI → Serverless SQL Warehouse. No shared compute between tiers.

Pattern

Dataset-Triggered DAGs (Airflow 2.4+)

Replaced ExternalTaskSensors with Airflow Dataset events. Worker slot waste from polling: 35% → under 5%.

Pattern

Broadcast Joins + AQE

Auto-broadcast threshold raised to 256 MB, eliminating sort-merge joins for large+small table combinations. 4 TB daily shuffle spill → 90 GB.
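Spark's `spark.sql.autoBroadcastJoinThreshold` is expressed in bytes and defaults to 10 MB, so 256 MB has to be converted before it is set. A small sketch of that conversion and of the size check Spark applies to the estimated table size; the helper names are illustrative:

```python
MIB = 1024 * 1024
BROADCAST_THRESHOLD_BYTES = 256 * MIB  # 268,435,456


def broadcast_conf(threshold_bytes: int = BROADCAST_THRESHOLD_BYTES) -> dict:
    """Setting to pass to spark.conf.set; the value is a byte count."""
    return {"spark.sql.autoBroadcastJoinThreshold": str(threshold_bytes)}


def would_auto_broadcast(estimated_table_bytes: int,
                         threshold_bytes: int = BROADCAST_THRESHOLD_BYTES) -> bool:
    """True if a table of this estimated size qualifies for automatic broadcast."""
    return 0 <= estimated_table_bytes <= threshold_bytes


if __name__ == "__main__":
    print(broadcast_conf())
    print(would_auto_broadcast(100 * MIB))  # True
    print(would_auto_broadcast(300 * MIB))  # False
```

A broadcast table is collected to the driver and copied to every executor, so a threshold this high assumes the driver and executors have memory to spare.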

Compute Cost: −40%

OOM Failures: −95%

Morning ETL Completion: −2.5h

S3 API Metadata Cost: −93%

Technologies: Databricks · Apache Kafka · Apache Spark · dbt · Delta Lake · Apache Airflow · AWS

Enterprise Governance · Platform Migration

Migrating Multi-Team Databricks to Unity Catalog

A FinTech organization with 450+ users across 12 isolated Databricks workspaces could not answer a compliance audit in under three weeks. Access was tied to cluster IAM roles, permissions were fragmented across 12 Hive Metastores, and there was no cross-workspace data lineage.

Governance Failures

Anti-Pattern

Compute-Centric Access Control

Data access was tied to cluster IAM roles, not user identities. Any user on a cluster inherited all data access that cluster's IAM role permitted — regardless of personal need.

Anti-Pattern

Shadow Data Access via Broad S3 Policies

To bypass 4-day IAM approval cycles, engineers broadened cluster profiles. Production cluster IAM role accumulated 40+ S3 path permissions over 3 years.

Anti-Pattern

Zero Cross-Workspace Lineage

12 isolated Hive Metastores meant zero lineage visibility between teams. Schema changes in one workspace silently broke pipelines in another.

Migration Architecture

Pattern

Single Metastore, Identity-Centric Grants

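One Unity Catalog Metastore per region. All 12 workspaces attached. Access is granted to identities rather than clusters, for example: GRANT SELECT ON TABLE prod.schema.table TO `ad-group@company.com` (principal names containing special characters such as - or @ must be wrapped in backticks).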
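When grants are generated programmatically, quoting is the easiest thing to get wrong. A minimal sketch that renders a grant statement with every identifier and the principal backtick-quoted; the table name `prod.sales.orders` is hypothetical and the helper names are illustrative:

```python
def quote_ident(name: str) -> str:
    """Backtick-quote an identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def grant_select(table_fqn: str, principal: str) -> str:
    """Render GRANT SELECT for a three-level catalog.schema.table name."""
    parts = table_fqn.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected catalog.schema.table")
    table = ".".join(quote_ident(p) for p in parts)
    return f"GRANT SELECT ON TABLE {table} TO {quote_ident(principal)}"


if __name__ == "__main__":
    print(grant_select("prod.sales.orders", "ad-group@company.com"))
```

The resulting string can be run through any SQL entry point, or kept only as a readable record if the grants themselves are applied through infrastructure-as-code.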

Pattern

Terraform-Managed Access Provisioning

All production grants defined in Terraform. New team onboarding: 2 hours via IaC module (was 2 weeks manual IAM + Hive ACL configuration).

Pattern

Catalog-Level Environment Isolation

dev / staging / prod separated by Unity Catalog Catalogs, not workspaces. Production: humans have SELECT only. Writes via CI/CD service principals only.

Audit: 10min (was 3 weeks)

Access Provision: 5min (was 4 days)

IAM Profiles: 0 (was 47)

Tables with Owners: 100% (was 12%)

Technologies: Databricks Unity Catalog · Terraform · Azure Entra ID · Delta Lake · Apache Spark · SCIM

Recognise Any of These Symptoms?