Enterprise data platforms fail in two ways: technically and operationally. Technical failures are visible — jobs crash, dashboards go stale. Operational failures are invisible until they become audits, breaches, or incident investigations.
"Governance debt compounds faster than technical debt. Manual permission management that works at 10 users becomes a compliance liability at 500. The cleanup cost always exceeds the implementation cost."
Legacy Databricks environments built on workspace-isolated Hive Metastores cannot scale governance. Access is tied to compute, not identity. Lineage stops at the workspace boundary. Compliance audits require manual log correlation. Permission sprawl is invisible until it causes a breach.
Account-level SCIM federation (Microsoft Entra ID, formerly Azure AD, or Okta), Storage Credential design, External Location mapping from S3/ADLS paths. No data migration at this stage.
Cluster compatibility audit (No Isolation Shared clusters are not Unity Catalog compatible), Hive Metastore object inventory, IAM Instance Profile mapping, identification of hardcoded S3 paths in views.
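The hardcoded-path part of this inventory can be sketched as a regex scan over view definitions. This is a minimal illustration, assuming the view SQL text has already been exported from the Hive Metastore into a dictionary; the function name, the `views` sample, and the bucket name are hypothetical.

```python
import re
from typing import Dict, List

# Matches s3://, s3a://, and s3n:// URIs up to the next quote, backtick,
# closing parenthesis, or whitespace character.
S3_URI = re.compile(r"s3[an]?://[^\s'\"`)]+", re.IGNORECASE)

def find_hardcoded_s3_paths(view_definitions: Dict[str, str]) -> Dict[str, List[str]]:
    """Return {view_name: [s3 uris]} for every view whose SQL text embeds an S3 path."""
    hits = {}
    for name, sql_text in view_definitions.items():
        uris = S3_URI.findall(sql_text)
        if uris:
            hits[name] = sorted(set(uris))
    return hits

# Hypothetical exported view definitions
views = {
    "sales.daily_orders": "SELECT * FROM parquet.`s3://acme-raw/orders/2024/`",
    "sales.customers": "SELECT id, name FROM sales.customers_base",
}
flagged = find_hardcoded_s3_paths(views)
```

Each flagged view is a candidate for rewriting against a Unity Catalog table or External Location before migration.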
Migrate the analytics or BI workspace first: it carries the lowest risk and the highest visibility. Validate with dual-read capability, with the Unity Catalog schema and the Hive Metastore schema serving data in parallel during the validation window.
Migrate one schema at a time. Validate row counts, column types, and query results against the Hive Metastore baseline before cutting over. Never migrate entire catalogs in one batch.
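A per-schema validation gate along these lines might look like the following sketch. It assumes row counts and column types have already been collected from both metastores into simple snapshot objects; `TableSnapshot` and `compare_snapshots` are hypothetical names, and query-result comparison is left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class TableSnapshot:
    row_count: int
    columns: Dict[str, str]  # column name -> data type string

def compare_snapshots(baseline: Dict[str, TableSnapshot],
                      migrated: Dict[str, TableSnapshot]) -> List[str]:
    """Compare Hive Metastore baseline snapshots with migrated snapshots, table by table."""
    problems = []
    for table, base in baseline.items():
        new = migrated.get(table)
        if new is None:
            problems.append(f"{table}: missing after migration")
            continue
        if new.row_count != base.row_count:
            problems.append(f"{table}: row count {base.row_count} -> {new.row_count}")
        for col, dtype in base.columns.items():
            if new.columns.get(col) != dtype:
                problems.append(f"{table}.{col}: type {dtype} -> {new.columns.get(col)}")
    return problems

# Hypothetical snapshots for one schema
baseline = {"orders": TableSnapshot(1000, {"id": "bigint", "amount": "decimal(10,2)"})}
migrated = {"orders": TableSnapshot(998, {"id": "bigint", "amount": "double"})}
problems = compare_snapshots(baseline, migrated)
```

An empty list is the cutover signal for that schema; anything else blocks it.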
All access grants defined in Terraform, reviewed via pull request, applied via CI/CD. No human executes GRANT statements directly in production. Terraform state is the single source of truth for permissions.
Hive Metastore set to read-only mode. IAM Instance Profiles detached from clusters and removed. External Location permissions replace all S3 path access. Legacy access model fully retired.
The access control model that works for 20 users is operationally dangerous at 200. The three most common governance failures in growing Databricks environments are manual GRANT statements in production, access tied to cluster IAM roles, and permissions granted to individual users instead of AD groups.
Data access follows the user identity, not the compute resource. A cluster has no implicit data access. All access is tied to the authenticated principal via Unity Catalog grants.
All grants target Azure AD or Okta groups — never individual user accounts. Users inherit access via group membership, enabling instant onboarding and offboarding via group management without touching Databricks directly.
Every principal starts with zero access. Access is explicitly granted at the minimum scope required. Catalog-level grants are reserved for platform administrators only.
All production access is defined in Terraform. Every grant is reviewed via pull request. The permission model in Terraform state is the authoritative source of truth — not the UI, not ad-hoc SQL commands.
| Catalog | Human Access | Pipeline Access | Admin Access |
|---|---|---|---|
| prod | SELECT only (via AD group) | MODIFY via CI/CD service principal | Platform Admin group only |
| staging | SELECT + limited MODIFY | Full write for deployment validation | Platform Admin group |
| dev | Full read/write for engineers | Full write | N/A (no production data) |
# Onboard analytics team to production schema
resource "databricks_grants" "analytics_read" {
  schema = "prod.customer_analytics"

  grant {
    principal  = "analytics-team@company.com" # Azure AD Group
    privileges = ["USE_SCHEMA", "SELECT"]
  }
}

# Pipeline service principal: production write access
resource "databricks_grants" "pipeline_write" {
  schema = "prod.customer_analytics"

  grant {
    principal  = "svc-etl-pipeline"
    privileges = ["USE_SCHEMA", "MODIFY", "CREATE_TABLE"]
  }
}
Automated column-level lineage through Unity Catalog converts a 4-hour incident investigation into a 10-minute query. When a broken pipeline produces anomalous model outputs, "which feature table produced this column?" should take seconds, not days.
Unity Catalog tracks lineage from raw ingestion through transformed tables to ML model training runs automatically, with no instrumentation required. Lineage is available in the Unity Catalog UI and in the system.access.table_lineage and system.access.column_lineage system tables.
Workspace-isolated Hive Metastores have zero cross-workspace lineage. Unity Catalog's single Metastore model provides lineage visibility across all workspaces attached to the same Metastore — including upstream schema change impact detection.
The system.access.audit table records every data access event across all workspaces. Questions that previously required three weeks of manual log correlation are answered by a single query in under 5 seconds.
When lineage is visible, upstream teams can see exactly which downstream pipelines and models consume their schemas. Schema changes without coordination become rare — not because of policy, because of visibility.
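The "who consumes my schema" question can be sketched as a graph walk. This is a minimal illustration, assuming lineage edges have already been exported as (source, target) table pairs, for example from the system.access.table_lineage system table; the function name and the sample table names are hypothetical.

```python
from collections import defaultdict, deque
from typing import Iterable, Set, Tuple

def downstream_of(table: str, edges: Iterable[Tuple[str, str]]) -> Set[str]:
    """Breadth-first walk over (source, target) lineage edges starting at `table`."""
    children = defaultdict(list)
    for source, target in edges:
        children[source].append(target)

    seen, queue = set(), deque([table])
    while queue:
        for child in children[queue.popleft()]:
            if child not in seen:  # guards against revisiting shared or cyclic paths
                seen.add(child)
                queue.append(child)
    return seen

# Hypothetical lineage edges
edges = [
    ("raw.events", "silver.sessions"),
    ("silver.sessions", "gold.features"),
    ("gold.features", "ml.churn_model"),
    ("raw.billing", "gold.revenue"),
]
impacted = downstream_of("raw.events", edges)
```

Running this before a schema change gives the upstream team the list of tables and models to coordinate with.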
-- Previously: 3 weeks of manual CloudTrail + Databricks log correlation
-- Now: runs in under 5 seconds
-- action_name holds audit event names (such as getTable), not SQL privilege names.
-- Verify the event names and request_params keys against your own audit log.
SELECT
  user_identity.email,
  action_name,
  request_params.full_name_arg AS table_full_name,
  event_time
FROM system.access.audit
WHERE
  action_name IN ('getTable', 'createTable', 'deleteTable')
  AND request_params.full_name_arg LIKE 'prod.pii_%'
  AND event_time >= CURRENT_DATE - INTERVAL 30 DAYS
ORDER BY event_time DESC;
Infrastructure reliability cannot be achieved by configuration alone. Systems need defined Service Level Objectives that determine what "healthy" means before something breaks — not after. Without tiering, every pipeline is treated as equally critical, which means none receive appropriate operational attention.
| Tier | Description | Latency SLA | Failure Response | Cluster Strategy |
|---|---|---|---|---|
| Tier 1 | Revenue-critical or compliance-required pipelines | < 30 min delay | Immediate page — 24/7 on-call | Dedicated cluster, pool-backed |
| Tier 2 | Business reporting, daily BI dashboards | < 2 hour delay | Business hours alert, 4-hour SLA | Job cluster, Spot with fallback |
| Tier 3 | Experimental, low-frequency, non-customer pipelines | Best effort | Morning review, no page | Spot-only, auto-terminate |
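The tier table can be encoded as data so that alert routing follows the tier rather than ad-hoc judgment. The sketch below is one possible encoding, assuming delay is measured as a `timedelta` behind schedule; `TierPolicy`, `POLICIES`, and `required_response` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

@dataclass(frozen=True)
class TierPolicy:
    max_delay: Optional[timedelta]  # None means best effort, no latency SLA
    response: str

POLICIES = {
    1: TierPolicy(timedelta(minutes=30), "page on-call"),
    2: TierPolicy(timedelta(hours=2), "business-hours alert"),
    3: TierPolicy(None, "morning review"),
}

def required_response(tier: int, delay: timedelta) -> Optional[str]:
    """Return the response the tier calls for, or None while the pipeline is within its SLA."""
    policy = POLICIES[tier]
    if policy.max_delay is not None and delay < policy.max_delay:
        return None
    return policy.response
```

The same 45-minute delay pages on-call for a Tier 1 pipeline and triggers nothing for Tier 2, which is the point of tiering.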
GRANT ALL PRIVILEGES ON CATALOG prod gives the receiving principal unrestricted write, delete, and table-drop access across the entire catalog. Never use it in production under any circumstances.
Access granted directly to a user's email cannot be revoked automatically when they leave the organization. All production grants must target AD security groups, not individual accounts.
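A periodic check for this anti-pattern can be as simple as comparing current grants against an allowlist. The sketch below assumes the grants have been exported as dictionaries and that the set of approved groups and service principals is maintained separately; all names and sample values are hypothetical.

```python
from typing import Dict, List, Set

def grants_outside_groups(grants: List[Dict[str, str]],
                          approved: Set[str]) -> List[Dict[str, str]]:
    """Return every grant whose principal is not an approved group or service principal."""
    return [g for g in grants if g["principal"] not in approved]

# Hypothetical export of current grants on a production schema
grants = [
    {"principal": "analytics-team", "privilege": "SELECT"},
    {"principal": "jane.doe@company.com", "privilege": "SELECT"},
    {"principal": "svc-etl-pipeline", "privilege": "MODIFY"},
]
approved = {"analytics-team", "svc-etl-pipeline"}
violations = grants_outside_groups(grants, approved)
```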
Ad-hoc SQL GRANT commands executed directly in the Databricks SQL editor create permissions that exist in the Metastore but not in the Terraform configuration or state. The next Terraform apply will either revoke them or conflict with them.
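Drift of this kind reduces to a set difference between what is declared and what exists. The sketch below assumes both sides have been exported as (principal, privilege, securable) tuples; how they are exported is outside its scope, and the names are hypothetical.

```python
from typing import Dict, Set, Tuple

Grant = Tuple[str, str, str]  # (principal, privilege, securable)

def grant_drift(declared: Set[Grant], actual: Set[Grant]) -> Dict[str, Set[Grant]]:
    """Split differences into grants made outside Terraform and grants Terraform expects but are absent."""
    return {
        "unmanaged": actual - declared,
        "missing": declared - actual,
    }

# Hypothetical exports
declared = {("analytics-team", "SELECT", "prod.customer_analytics")}
actual = {
    ("analytics-team", "SELECT", "prod.customer_analytics"),
    ("jane.doe@company.com", "MODIFY", "prod.customer_analytics"),
}
drift = grant_drift(declared, actual)
```

A non-empty "unmanaged" set points at a grant someone issued by hand.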
When multiple teams' pipelines share a single service principal, auditing which pipeline caused a specific data mutation becomes impossible. Each team requires its own service principal with scoped, least-privilege grants.
Instance Profiles bypass Unity Catalog's access control entirely. A cluster with a broad IAM Instance Profile can read S3 data regardless of Unity Catalog grants. All S3 access must route through External Locations after UC migration.
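Finding clusters that still carry an Instance Profile can be sketched as a filter over cluster definitions. This assumes the definitions have been exported as dictionaries shaped like Clusters API responses, with the profile under `aws_attributes.instance_profile_arn`; treat that shape, the function name, and the sample ARN as assumptions to verify against your own export.

```python
from typing import Any, Dict, List

def clusters_with_instance_profiles(clusters: List[Dict[str, Any]]) -> Dict[str, str]:
    """Return {cluster_name: instance_profile_arn} for clusters that still attach a profile."""
    flagged = {}
    for cluster in clusters:
        arn = cluster.get("aws_attributes", {}).get("instance_profile_arn")
        if arn:
            flagged[cluster["cluster_name"]] = arn
    return flagged

# Hypothetical cluster export
clusters = [
    {"cluster_name": "etl-legacy",
     "aws_attributes": {"instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/broad-s3"}},
    {"cluster_name": "bi-shared", "aws_attributes": {}},
    {"cluster_name": "adhoc"},
]
remaining = clusters_with_instance_profiles(clusters)
```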
If your Databricks environment cannot produce a complete access audit in under 30 minutes, or if your compliance team needs more than a few days to answer access questions, a governance assessment is the right starting point.