Understanding the Databricks Terminology & Hierarchy

A Mental Model for Data & ML Engineers


When I first started exploring Databricks, I thought I understood “where” my data lived — until I didn’t.
Clusters, catalogs, schemas… everything sounded familiar, yet nothing quite clicked.
It took a few rounds of trial and error (and a couple of broken pipelines) to finally build a clear mental model.

This post is about that clarity: understanding how Databricks actually organizes entities, data, and compute.

Flowchart showing the Databricks management hierarchy: "Tenant/Account" at the top, leading to "Workspace," then "Catalog," followed by "Schema," and finally "Table." Side notes include details on "Clusters" and "Catalog vs Unity Catalog" explanations.

Fig: Databricks Hierarchy Map


The Landscape

Databricks isn’t just a Spark notebook with some cloud storage behind it. It’s a layered ecosystem — designed to handle both governance and computation at scale.

At its core, there are five key entities:

Tenant / Account → Workspace → Catalog → Schema → Table

Let’s unpack these, not as definitions, but as relationships — how one gives rise to the next.


Tenant / Account — The Organizational Umbrella

Everything begins at the account level.
This is the topmost boundary — your organization’s home inside Databricks. It handles identity, billing, access, and the Unity Catalog setup.

If your company uses multiple Databricks workspaces (for dev, staging, production, or separate teams), the tenant ties them all together under one governance layer.

It’s invisible most days — until you need to understand why a workspace can or cannot access certain data.


Workspace — Where the Work Happens

A workspace is the human layer. It’s where engineers, analysts, and scientists live day-to-day.
You write notebooks, schedule jobs, explore tables, launch clusters. Every workspace is like a lab — independent, yet linked to the same organizational DNA.

Before Unity Catalog, each workspace had its own isolated metastore — a common pain point when teams tried to share data.
Unity Catalog changed that, introducing a shared governance model that lives above individual workspaces.


Cluster — The Compute Engine

A cluster is the heartbeat of execution.
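When you run a notebook or job, the cluster starts a Spark driver and its workers and executes your transformations. Job clusters shut down when the run finishes; all-purpose clusters keep running until someone terminates them or the auto-termination timeout kicks in.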

The key mental shift: clusters don’t own data.
They only process it. The data itself lives in cloud object storage — S3, ADLS, or GCS — wrapped in metadata managed by Databricks.

I’ve seen many beginners (my past self included) try to persist data inside clusters. It’s like writing a book on a whiteboard — it disappears once you clean up.


Catalog — The Governance Root

The catalog is where Databricks starts to feel like a real data platform.

A catalog is the highest-level namespace for your data assets. It defines who can access what, and where that data physically resides.

Now, a quick distinction:

  • A catalog is the object itself — a logical grouping of schemas and tables.
  • Unity Catalog is the governance framework that manages those catalogs across workspaces.

Think of Unity Catalog as the constitution, and each catalog as a state under it — self-contained, but ruled by shared governance.
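
In day-to-day code, this namespace shows up as the three-part name catalog.schema.table. Here is a minimal sketch of how a shortened name could be expanded into a full one. The parse_table_name helper and the "main" and "default" fallbacks are my own assumptions for illustration, and the sketch ignores backtick-quoted identifiers.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    catalog: str
    schema: str
    table: str

    def qualified(self) -> str:
        """Return the full three-part name."""
        return f"{self.catalog}.{self.schema}.{self.table}"


def parse_table_name(
    name: str,
    default_catalog: str = "main",    # assumed fallback, not taken from the post
    default_schema: str = "default",  # assumed fallback, not taken from the post
) -> TableName:
    """Expand 'table', 'schema.table', or 'catalog.schema.table'."""
    parts = name.split(".")
    if len(parts) > 3 or not all(parts):
        raise ValueError(f"expected one to three non-empty parts, got {name!r}")
    if len(parts) == 3:
        return TableName(*parts)
    if len(parts) == 2:
        return TableName(default_catalog, *parts)
    return TableName(default_catalog, default_schema, parts[0])


# Hypothetical names, used only to show the expansion.
print(parse_table_name("sales.orders").qualified())
print(parse_table_name("analytics.sales.orders").qualified())
```

Writing the full three-part name in pipelines removes any dependence on whatever catalog and schema happen to be current in a given session.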


Schema — The Logical Grouping

A schema (also known as a database) lives inside a catalog.
It’s where you organize related tables — sales, marketing, operations, etc.
A schema can define its own managed storage location (otherwise it inherits the one set on the catalog or metastore), and you can manage permissions here independently.

In practice, schemas are how teams draw boundaries between domains.
If you’ve ever opened someone else’s workspace and wondered why all their tables live under “default,” that’s the telltale sign they skipped this step.
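
To make the domain boundaries concrete, here is a small sketch that builds the CREATE statements for one catalog and its domain schemas. The schema_ddl helper and the "analytics" catalog name are hypothetical; the schema names reuse the examples above. It only produces SQL strings, which in a Databricks notebook you would pass to spark.sql.

```python
import re

# Accept only plain identifiers so names can be interpolated into SQL safely.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def schema_ddl(catalog: str, domains: list[str]) -> list[str]:
    """Build CREATE statements for one catalog and its domain schemas."""
    for name in [catalog, *domains]:
        if not _IDENTIFIER.match(name):
            raise ValueError(f"not a plain identifier: {name!r}")
    statements = [f"CREATE CATALOG IF NOT EXISTS {catalog}"]
    statements += [f"CREATE SCHEMA IF NOT EXISTS {catalog}.{d}" for d in domains]
    return statements


# "analytics" is a hypothetical catalog name.
for statement in schema_ddl("analytics", ["sales", "marketing", "operations"]):
    print(statement)  # in a Databricks notebook: spark.sql(statement)
```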


Table — The Actual Data

Finally, the table — the most familiar part, yet only the tip of the hierarchy.

Tables store your structured data in formats like Delta, Parquet, or CSV.
In Databricks, Delta tables dominate because they offer ACID guarantees, table versioning (time travel), and schema evolution, all essentials for production-grade pipelines.
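
As a rough illustration of what those Delta features look like in SQL, the sketch below builds a CREATE TABLE ... USING DELTA statement plus the two statements I reach for most when something goes wrong: reading an older version and restoring to it. The helper names, the table name, and the columns are all hypothetical; the functions only return strings for spark.sql to run.

```python
def create_delta_table_sql(table: str, columns: dict[str, str]) -> str:
    """Build a CREATE TABLE statement for a Delta table."""
    column_list = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return f"CREATE TABLE IF NOT EXISTS {table} ({column_list}) USING DELTA"


def _check_version(version: int) -> None:
    # Delta table versions start at 0.
    if version < 0:
        raise ValueError("version must be 0 or greater")


def read_version_sql(table: str, version: int) -> str:
    """Time travel: query the table as it was at an earlier version."""
    _check_version(version)
    return f"SELECT * FROM {table} VERSION AS OF {version}"


def restore_version_sql(table: str, version: int) -> str:
    """Roll the table back to an earlier version."""
    _check_version(version)
    return f"RESTORE TABLE {table} TO VERSION AS OF {version}"


# Hypothetical table and columns.
table = "analytics.sales.orders"
print(create_delta_table_sql(table, {"order_id": "BIGINT", "order_date": "DATE"}))
print(read_version_sql(table, 3))
print(restore_version_sql(table, 3))
```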

Tables are where your code meets reality.
Everything above — tenant, workspace, catalog, schema — exists to ensure that these tables are queryable, governable, and reproducible.


Pulling It All Together

Visualize it like this:

Tenant / Account
   → Workspace
      → Catalog
         → Schema
            → Table

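Each layer scopes the one below it.
Each one adds context, control, or compute, but not all of them store data. Clusters sit beside this chain: they run inside a workspace and process tables, but hold no data themselves.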
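
One way to hold this map in your head is as a small tree. The toy model below is my own simplification, not a Databricks API: it hangs catalogs off the account (reflecting the idea that Unity Catalog lives above individual workspaces) and treats workspaces as entry points. All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Schema:
    name: str
    tables: list[str] = field(default_factory=list)


@dataclass
class Catalog:
    name: str
    schemas: list[Schema] = field(default_factory=list)


@dataclass
class Account:
    name: str
    workspaces: list[str] = field(default_factory=list)   # entry points, hold no data
    catalogs: list[Catalog] = field(default_factory=list)  # shared across workspaces


def qualified_tables(account: Account) -> list[str]:
    """Walk catalog -> schema -> table and return three-part names."""
    return [
        f"{catalog.name}.{schema.name}.{table}"
        for catalog in account.catalogs
        for schema in catalog.schemas
        for table in schema.tables
    ]


# Hypothetical organization.
acme = Account(
    name="acme",
    workspaces=["dev", "prod"],
    catalogs=[Catalog("analytics", [Schema("sales", ["orders", "refunds"])])],
)
print(qualified_tables(acme))
```

Only the leaves of this tree correspond to stored data; every level above them is naming and scoping.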

Once you internalize this structure, Databricks feels less like a collection of abstract terms and more like a city with its own urban plan.

You know which building to enter for what kind of work.

Reflections

What clicked for me was realizing Databricks isn’t about where data lives; it’s about how it’s organized and governed.
Once I started thinking in terms of hierarchy rather than storage, the architecture made sense.

That understanding also changed how I teach newcomers — I no longer start with clusters or Spark code. I start with the map.


Lessons Learned

  • Treat Unity Catalog as the source of truth for governance; treat clusters as disposable compute.
  • If teams can’t see each other’s data, check account/workspace scoping and catalog-level privileges first.
  • Default to Delta tables for production; ACID guarantees and time travel prevent painful rollback stories.
  • Avoid the “everything in default” anti-pattern — create domain-aligned schemas from day one.
  • Separate dev/stage/prod workspaces, but share data through account-scoped catalogs.
  • Document the catalog → schema → table map; it becomes the backbone of onboarding.
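
For the "teams can't see each other's data" lesson, it helps to remember that reading a Unity Catalog table takes three privileges: USE CATALOG on the catalog, USE SCHEMA on the schema, and SELECT on the table. The sketch below is a deliberately simplified checker: it ignores ownership and privilege inheritance, and the missing_read_grants helper, the `analysts` group, and the table name are hypothetical.

```python
def missing_read_grants(
    table_name: str,
    principal: str,
    held: set[tuple[str, str]],
) -> list[str]:
    """Return the GRANT statements still needed to read a three-part table name.

    `held` contains (privilege, object_name) pairs the principal already has.
    Raises ValueError if table_name does not have exactly three parts.
    """
    catalog, schema, table = table_name.split(".")
    required = [
        ("USE CATALOG", "CATALOG", catalog),
        ("USE SCHEMA", "SCHEMA", f"{catalog}.{schema}"),
        ("SELECT", "TABLE", f"{catalog}.{schema}.{table}"),
    ]
    return [
        f"GRANT {privilege} ON {kind} {name} TO `{principal}`"
        for privilege, kind, name in required
        if (privilege, name) not in held
    ]


# Hypothetical situation: SELECT was granted, but the schema-level grant was forgotten.
held = {("USE CATALOG", "analytics"), ("SELECT", "analytics.sales.orders")}
for statement in missing_read_grants("analytics.sales.orders", "analysts", held):
    print(statement)
```

In my experience, the forgotten USE SCHEMA or USE CATALOG grant is a common reason a table stays invisible even though SELECT was granted.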

Practical Checklist

✅ Enable Unity Catalog and define at least one account-level metastore.
✅ Create catalogs by domain or data sensitivity (e.g., sales, finance, ml_features).
✅ Inside each catalog, create schemas that mirror team or sub-domain boundaries.
✅ Standardize table defaults: USING DELTA, managed paths, retention, and naming conventions.
✅ Lock down catalog/schema/table permissions with the principle of least privilege.
✅ Set up separate clusters for interactive exploration vs. scheduled jobs.
✅ Version notebooks and pipelines; keep infrastructure-as-code for catalogs and grants.
✅ Monitor table health (VACUUM, OPTIMIZE, Z-ordering where appropriate) and run data quality checks.
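
The table-health item can be scripted. Below is a minimal sketch that builds OPTIMIZE and VACUUM statements for a Delta table. The maintenance_sql helper, the table name, and the Z-order column are hypothetical, and refusing retention windows below seven days is this sketch's own guard, chosen to mirror Delta's default retention threshold.

```python
from typing import Sequence

DEFAULT_RETENTION_HOURS = 7 * 24  # 168 hours, Delta's default retention threshold


def maintenance_sql(
    table: str,
    zorder_columns: Sequence[str] = (),
    retain_hours: int = DEFAULT_RETENTION_HOURS,
) -> list[str]:
    """Build OPTIMIZE and VACUUM statements for one Delta table."""
    if retain_hours < DEFAULT_RETENTION_HOURS:
        # Guard chosen for this sketch: shorter windows can break time travel
        # and readers that are still using older files.
        raise ValueError("retain_hours below the 7-day default is refused here")
    optimize = f"OPTIMIZE {table}"
    if zorder_columns:
        optimize += f" ZORDER BY ({', '.join(zorder_columns)})"
    return [optimize, f"VACUUM {table} RETAIN {retain_hours} HOURS"]


# Hypothetical table and Z-order column.
for statement in maintenance_sql("analytics.sales.orders", ["order_date"]):
    print(statement)  # in a Databricks notebook or job: spark.sql(statement)
```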



Author’s Note

I write about AI, data systems, machine learning, and the thinking frameworks behind building reliable pipelines.
If this post clarified something for you, you might enjoy my other essays on data architecture and model deployment at abhiwrites.hashnode.dev.