Platform Governance

Governance & Platform Reliability

Enterprise data platforms fail in two ways: technically and operationally. Technical failures are visible — jobs crash, dashboards go stale. Operational failures are invisible until they become audits, breaches, or incident investigations.

"Governance debt compounds faster than technical debt. Manual permission management that works at 10 users becomes a compliance liability at 500. The cleanup cost always exceeds the implementation cost."

Unity Catalog Migration

From Fragmented Hive Metastores to Centralized Governance

Legacy Databricks environments built on workspace-isolated Hive Metastores cannot scale governance. Access is tied to compute, not identity. Lineage stops at the workspace boundary. Compliance audits require manual log correlation. Permission sprawl is invisible until it causes a breach.

Migration Phases

  1. Foundation: Identity & Storage

    Account-level SCIM federation (Microsoft Entra ID / Okta), Storage Credential design, External Location mapping from S3/ADLS paths. No data migration at this stage.

  2. Assessment: Cluster & Metastore Inventory

    Cluster compatibility audit (No Isolation Shared clusters are not Unity Catalog compatible), Hive Metastore object inventory, IAM Instance Profile mapping, identification of hardcoded S3 paths in views.

  3. Pilot Migration: Low-Risk Workspace

    Migrate the analytics or BI workspace first: lowest risk, highest visibility. Validate with dual-read capability, with the Unity Catalog schema and the Hive Metastore schema serving data in parallel during the validation window.

  4. Production Migration: Incremental Schema-by-Schema

    Migrate one schema at a time. Validate row counts, column types, and query results against the Hive Metastore baseline before cutting over. Never migrate entire catalogs in one batch.

  5. Governance Automation: IaC for All Grants

    All access grants defined in Terraform, reviewed via pull request, applied via CI/CD. No human executes GRANT statements directly in production. Terraform state is the single source of truth for permissions.

  6. Decommission: IAM Instance Profile Removal

    Hive Metastore set to read-only mode. IAM Instance Profiles detached from clusters and removed. External Location permissions replace all S3 path access. Legacy access model fully retired.
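
The validation gate in the production migration phase can be sketched as a comparison of two table summaries. This is a minimal illustration, not a migration tool: `TableSnapshot` and `validate_cutover` are hypothetical names, and the snapshots are assumed to be collected separately from the Hive Metastore and Unity Catalog copies (for example with a row count query and a table description). Comparing query results is left out of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableSnapshot:
    """Hypothetical summary of one table, collected outside this sketch."""
    name: str
    row_count: int
    columns: dict[str, str]  # column name -> data type


def validate_cutover(baseline: TableSnapshot, migrated: TableSnapshot) -> list[str]:
    """Return a list of mismatches; an empty list means the table may cut over."""
    problems = []
    if baseline.row_count != migrated.row_count:
        problems.append(
            f"{baseline.name}: row count {baseline.row_count} != {migrated.row_count}"
        )
    # Every baseline column must exist with the same type after migration.
    for col, dtype in baseline.columns.items():
        if col not in migrated.columns:
            problems.append(f"{baseline.name}: column {col} missing after migration")
        elif migrated.columns[col] != dtype:
            problems.append(
                f"{baseline.name}: column {col} type {dtype} != {migrated.columns[col]}"
            )
    # Columns that appear only on the migrated side are also reported.
    for col in sorted(migrated.columns.keys() - baseline.columns.keys()):
        problems.append(f"{baseline.name}: unexpected column {col} after migration")
    return problems
```

Running this per table and blocking the schema cutover on any non-empty result keeps the "one schema at a time" rule enforceable in CI rather than by convention.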

Access Control Architecture

Identity-Centric, Least-Privilege, Infrastructure-as-Code

The access control model that works for 20 users is operationally dangerous at 200. The three most common governance failures in growing Databricks environments are manual GRANT statements in production, access tied to cluster IAM roles, and individual user permissions instead of AD group memberships.

Identity-Centric Access

Data access follows the user identity, not the compute resource. A cluster has no implicit data access. All access is tied to the authenticated principal via Unity Catalog grants.

AD Group as the Unit of Access

All grants target Microsoft Entra ID (formerly Azure AD) or Okta groups, never individual user accounts. Users inherit access via group membership, so onboarding and offboarding happen through group management in the identity provider without touching Databricks directly.

Least Privilege by Default

Every principal starts with zero access. Access is explicitly granted at the minimum scope required. Catalog-level grants are reserved for platform administrators only.

IaC-Managed Grants

All production access is defined in Terraform. Every grant is reviewed via pull request. The permission model in Terraform state is the authoritative source of truth — not the UI, not ad-hoc SQL commands.
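
One way to check that Terraform remains the source of truth is to diff the declared grants against what the Metastore actually holds. The sketch below assumes both sides have already been exported into simple tuples (for example from the Terraform plan and from a grants listing); `Grant` and `grant_drift` are hypothetical names, not part of any Databricks or Terraform API.

```python
from typing import NamedTuple


class Grant(NamedTuple):
    securable: str   # e.g. "prod.customer_analytics"
    principal: str   # group or service principal name
    privilege: str   # e.g. "SELECT"


def grant_drift(declared: set[Grant], actual: set[Grant]) -> dict[str, list[Grant]]:
    """Compare declared (IaC) grants with grants found in the Metastore."""
    return {
        # Present in the Metastore but absent from code: likely a manual GRANT.
        "unmanaged": sorted(actual - declared),
        # Present in code but absent from the Metastore: apply failed or was reverted.
        "missing": sorted(declared - actual),
    }
```

A scheduled job that fails when `unmanaged` is non-empty surfaces ad-hoc grants before the next apply silently removes them.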

Environment Isolation Model

Catalog | Human Access                  | Pipeline Access                      | Admin Access
prod    | SELECT only (via AD group)    | MODIFY via CI/CD service principal   | Platform Admin group only
staging | SELECT + limited MODIFY       | Full write for deployment validation | Platform Admin group
dev     | Full read/write for engineers | Full write                           | N/A (no production data)

Standard Access Grant (Terraform)

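# databricks_grants is authoritative per securable: all principals for one
# schema belong in a single resource. Two resources targeting the same schema
# would overwrite each other on apply.
# Both principals also need USE_CATALOG on the prod catalog, granted separately.
resource "databricks_grants" "customer_analytics" {
  schema = "prod.customer_analytics"

  # Onboard analytics team to production schema (read-only)
  grant {
    principal  = "analytics-team@company.com"  # Azure AD Group
    privileges = ["USE_SCHEMA", "SELECT"]
  }

  # Pipeline service principal: production write access
  grant {
    principal  = "svc-etl-pipeline"  # for a service principal, typically its application ID
    privileges = ["USE_SCHEMA", "MODIFY", "CREATE_TABLE"]
  }
}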

Data Lineage & Observability

Lineage Is a Reliability Tool, Not a Compliance Checkbox

Automated column-level lineage through Unity Catalog converts a 4-hour incident investigation into a 10-minute query. When a broken pipeline produces anomalous model outputs, "which feature table produced this column?" should take seconds, not days.

Column-Level Lineage (Automatic)

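Unity Catalog tracks lineage from raw ingestion through transformed tables to ML model training runs automatically, with no instrumentation required. Lineage is available in the Unity Catalog UI and in the lineage system tables system.access.table_lineage and system.access.column_lineage (the system.access.audit table records access events, not lineage).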

Cross-Workspace Visibility

Workspace-isolated Hive Metastores have zero cross-workspace lineage. Unity Catalog's single Metastore model provides lineage visibility across all workspaces attached to the same Metastore — including upstream schema change impact detection.

Real-Time Access Audit

The system.access.audit table records every data access event across all workspaces. Queries that previously required 3-week manual log correlation are answered in under 5 seconds.

Upstream Accountability

When lineage is visible, upstream teams can see exactly which downstream pipelines and models consume their schemas. Schema changes without coordination become rare — not because of policy, because of visibility.
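
The "who consumes my schema?" question reduces to a graph walk over lineage edges. The sketch below assumes the edges have already been exported as (source, target) table-name pairs, for example from the lineage system tables; `downstream_consumers` is a hypothetical helper and the table names in the usage note are made up.

```python
from collections import defaultdict, deque


def downstream_consumers(edges, start):
    """Return every table reachable downstream of `start`, sorted by name."""
    children = defaultdict(set)
    for source, target in edges:
        children[source].add(target)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # Breadth-first walk; `seen` also guards against cycles in the lineage graph.
        for child in children.get(node, ()):
            if child != start and child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)
```

Given edges such as `("raw.events", "silver.sessions")` and `("silver.sessions", "gold.features")`, calling `downstream_consumers(edges, "raw.events")` lists both downstream tables, which is the set of owners to notify before a schema change.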

30-Day PII Access Report — Single Query

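-- Previously: 3 weeks of manual CloudTrail + Databricks log correlation
-- Now: runs in under 5 seconds
-- Audit rows record API action names (getTable, createTable, deleteTable),
-- not privilege names such as SELECT or MODIFY, and the table name is carried
-- in request_params.full_name_arg. Confirm the action names against the audit
-- log reference for your workspace before relying on this report.
SELECT
    user_identity.email,
    action_name,
    request_params.full_name_arg AS table_full_name,
    event_time
FROM system.access.audit
WHERE
    service_name = 'unityCatalog'
    AND action_name IN ('getTable', 'createTable', 'deleteTable')
    AND request_params.full_name_arg LIKE 'prod.pii_%'
    AND event_time >= CURRENT_DATE - INTERVAL 30 DAYS
ORDER BY event_time DESC;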

Platform Reliability

SLA Tiering: Reliability Requires Defined Standards

Infrastructure reliability cannot be achieved by configuration alone. Systems need defined Service Level Objectives that determine what "healthy" means before something breaks — not after. Without tiering, every pipeline is treated as equally critical, which means none receive appropriate operational attention.

Tier   | Description                                         | Latency SLA      | Failure Response                 | Cluster Strategy
Tier 1 | Revenue-critical or compliance-required pipelines   | < 30 min delay   | Immediate page, 24/7 on-call     | Dedicated cluster, pool-backed
Tier 2 | Business reporting, daily BI dashboards             | < 2 hour delay   | Business hours alert, 4-hour SLA | Job cluster, Spot with fallback
Tier 3 | Experimental, low-frequency, non-customer pipelines | Best effort      | Morning review, no page          | Spot-only, auto-terminate

Tiering Determines

  - Cluster sizing and instance type
  - Retry budget and backoff strategy
  - Alerting threshold and escalation path
  - On-call rotation eligibility
  - Delta OPTIMIZE priority
  - Cost attribution category
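
Encoding the tiers as data makes the alerting decision mechanical. The following is a minimal sketch using the latency thresholds from the tier table; `TierPolicy`, `TIER_POLICIES`, and `is_sla_breached` are hypothetical names, and the response strings are shorthand for the table's Failure Response column.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierPolicy:
    max_delay_minutes: Optional[int]  # None means best effort, no latency SLA
    response: str


TIER_POLICIES = {
    1: TierPolicy(30, "immediate page, 24/7 on-call"),
    2: TierPolicy(120, "business hours alert"),
    3: TierPolicy(None, "morning review, no page"),
}


def is_sla_breached(tier: int, delay_minutes: int) -> bool:
    """A pipeline breaches its SLA once its delay reaches the tier's limit."""
    limit = TIER_POLICIES[tier].max_delay_minutes
    return limit is not None and delay_minutes >= limit
```

The same 45-minute delay breaches a Tier 1 pipeline but not a Tier 2 one, which is the point of tiering: the alert router reads the policy instead of each pipeline carrying its own thresholds.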

Governance Anti-Patterns

Patterns That Are Prohibited in Production

Wildcard Grants on Production Catalogs

GRANT ALL PRIVILEGES ON CATALOG prod gives the receiving principal unrestricted write, delete, and table-drop access across every schema in the catalog. Never in production under any circumstances.

Individual User Grants in Production

Access granted directly to an individual user is not covered by group-based offboarding: each such grant has to be found and revoked separately when the user changes role or leaves the organization. All production grants must target AD security groups, not individual accounts.

Manual GRANT Statements in Production

Ad-hoc SQL GRANT commands executed directly in the Databricks SQL editor create permissions that exist in the Metastore but not in Terraform state. The next Terraform apply will overwrite or conflict with them.

Shared Service Principals Across Teams

When multiple teams' pipelines share a single service principal, auditing which pipeline caused a specific data mutation becomes impossible. Each team requires its own service principal with scoped, least-privilege grants.

IAM Instance Profile Access in a Unity Catalog Environment

Instance Profiles bypass Unity Catalog's access control entirely. A cluster with a broad IAM Instance Profile can read S3 data regardless of Unity Catalog grants. All S3 access must route through External Locations after UC migration.
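
Two of the anti-patterns above, wildcard catalog grants and grants to individuals, can be caught by a small lint step in the pull request pipeline. This is a sketch under stated assumptions: grants arrive as (securable, principal, privilege) tuples, the lists of approved groups and service principals are maintained elsewhere, and `lint_grants` is a hypothetical helper rather than an existing tool.

```python
def lint_grants(grants, approved_groups, service_principals):
    """Flag production grants that match known anti-patterns.

    grants: iterable of (securable, principal, privilege) tuples.
    Returns a list of (securable, principal, reason) tuples.
    """
    allowed = set(approved_groups) | set(service_principals)
    violations = []
    for securable, principal, privilege in grants:
        if securable.split(".")[0] != "prod":
            continue  # only the production catalog is linted in this sketch
        # Accept both the SQL spelling and the underscore spelling.
        normalized = privilege.strip().upper().replace(" ", "_")
        if securable == "prod" and normalized == "ALL_PRIVILEGES":
            violations.append((securable, principal, "wildcard grant on production catalog"))
        if principal not in allowed:
            violations.append((securable, principal, "principal is not an approved group or service principal"))
    return violations
```

An allow-list is used instead of pattern-matching on the principal name because group names can themselves look like email addresses.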

Governance Assessment

If your Databricks environment cannot produce a complete access audit in under 30 minutes, or if your compliance team requires more than a few days to answer access questions — a governance assessment is the right starting point.