The semantic layer: what it is and how to build it
What a semantic layer is in practice, how Iceberg and a catalog fit together, how DataHub and OpenLineage unify lineage, and a simple phased rollout for platform teams.
A single source of truth (SSOT) for analytics is rarely one database. It is a contract across storage, compute, definitions, and metadata. The semantic layer is where those definitions become executable: named metrics, dimensions, grain, and allowed joins—published once and consumed by SQL engines, BI tools, APIs, and agents. This note is for platform owners: what that layer is, how it connects to a lakehouse-style stack, and a concrete pattern using Apache Iceberg, a unified catalog and lineage plane, and DataHub as the metadata graph.
This follows our earlier note on AI agents, memory, and the shared-knowledge gap—why chat context is not organizational memory, and why definitions must be published before agents or dashboards scale. Here we focus on implementation.
Definition: what counts as a semantic layer
In implementation terms, a semantic layer is a logical analytics model that sits above physical tables. It declares measures (aggregated facts, often with non-trivial expressions), dimensions (attributes you slice by), entities or fact tables at a defined grain, and the relationships the business allows between them. Consumers do not ask “what column did someone name revenue five years ago?” They ask for the measure revenue_net under the rules bound to that name: filters, currency, time zone, and population.
That model may be authored in YAML (for example dbt metrics and semantic models), in a vendor semantic product, or in a headless metrics engine with a compiled graph. Whatever the authoring format, the requirement for a data platform team is the same: the layer is versioned, reviewable, and mapped to physical relations—Iceberg tables, views, or federated sources—not to ad hoc SQL strings scattered across notebooks.
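To make the shape concrete, here is a minimal, tool-agnostic sketch of what such a model declares, written as hypothetical Python dataclasses rather than any real product's schema; the class names, the gold.fct_orders relation, and the revenue_net expression are illustrative assumptions.

```python
# Hypothetical sketch of a semantic model's shape; not any specific tool's API.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Measure:
    name: str          # e.g. "revenue_net"
    expression: str    # the aggregation bound to the name, not left to each consumer
    description: str = ""


@dataclass
class Dimension:
    name: str          # e.g. "order_date", "billing_country"
    column: str        # physical column or expression it resolves to


@dataclass
class SemanticModel:
    name: str
    physical_relation: str            # stable catalog identifier, not a file path
    grain: str                        # what one row means
    measures: List[Measure] = field(default_factory=list)
    dimensions: List[Dimension] = field(default_factory=list)


orders = SemanticModel(
    name="orders",
    physical_relation="gold.fct_orders",
    grain="one row per order line",
    measures=[
        Measure("revenue_net",
                "sum(amount_usd) filter (where status = 'settled')",
                "Net revenue in USD, settled orders only"),
    ],
    dimensions=[
        Dimension("order_date", "cast(order_ts as date)"),
        Dimension("billing_country", "billing_country"),
    ],
)
```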
Physical foundation: Iceberg and the catalog
On a lakehouse path, Apache Iceberg is the open table format that turns object storage into dependable tables: ACID commits, snapshot isolation, partition evolution without rewriting history, and predictable schema evolution. For SSOT at the storage layer, that matters. You can align “what the table means over time” with how partitions and columns change, instead of silently forking copies in different buckets.
Iceberg tables still need a catalog—for example AWS Glue Data Catalog, Hive Metastore, or a cloud-native catalog such as Databricks Unity Catalog in managed environments—that registers table identifiers, pointers to metadata, and permissions boundaries. Query engines (Trino, Spark, Flink, DuckDB with extensions, etc.) resolve tables through that catalog. The semantic layer does not replace the catalog; it references stable table or view names the catalog already governs.
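As a sketch of that division of labor, the PySpark session below registers an Iceberg catalog and creates a table through it, so the semantic layer can later reference the stable identifier lake.gold.fct_orders. The catalog name, Glue implementation, warehouse path, and columns are assumptions; exact catalog properties depend on your catalog type, Iceberg version, and the runtime jars on the classpath.

```python
# Minimal PySpark sketch: resolve Iceberg tables through a named catalog.
# Requires the iceberg-spark-runtime (and, for Glue, iceberg-aws) jars.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-catalog-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")   # swap for a Hive/REST catalog as needed
    .config("spark.sql.catalog.lake.warehouse", "s3://example-bucket/warehouse")  # assumed path
    .getOrCreate()
)

# The semantic layer will reference this identifier, never a raw object-store path.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.gold.fct_orders (
        order_id        BIGINT,
        order_ts        TIMESTAMP,
        billing_country STRING,
        amount_usd      DECIMAL(18, 2),
        status          STRING
    )
    USING iceberg
    PARTITIONED BY (days(order_ts))
""")
```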
A common pattern is bronze / silver / gold (or similar) Iceberg tables: ingest raw events or extracts, conform and deduplicate into cleaned entities, then publish subject-area marts or wide tables that your semantic model maps to. The semantic definitions attach to the gold layer (or to trusted silver entities) so metrics are not defined on volatile landing tables.
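A sketch of one such publish step, reusing the Spark session configured above: deduplicate raw bronze events and MERGE them into a conformed silver Iceberg table so reruns stay idempotent. The bronze and silver table names, keys, and the ingested_at ordering column are assumptions.

```python
# Idempotent bronze -> silver step: keep the latest record per order_id and
# MERGE into the conformed table (assumes lake.bronze.orders_raw and
# lake.silver.orders already exist with compatible schemas).
spark.sql("""
    MERGE INTO lake.silver.orders AS t
    USING (
        SELECT order_id, order_ts, billing_country, amount_usd, status
        FROM (
            SELECT *,
                   row_number() OVER (PARTITION BY order_id
                                      ORDER BY ingested_at DESC) AS rn
            FROM lake.bronze.orders_raw
        ) AS ranked
        WHERE rn = 1
    ) AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```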
Unified lineage and metadata: DataHub and OpenLineage
DataHub is an open metadata platform: a graph of datasets, jobs, dashboards, users, and glossary terms, with APIs and UI for discovery, ownership, and impact analysis. For a platform team, it is where you answer “which pipelines and dashboards break if we change this Iceberg table or this metric?”—if you feed it the right edges.
OpenLineage is a standard for job and dataset lineage events. Orchestrators and engines (Airflow with providers, Dagster, Spark, dbt with adapters, Flink, etc.) can emit OpenLineage metadata to a backend; DataHub can ingest those events so lineage is not a separate, hand-drawn diagram. The result is a unified lineage plane: source systems and files → transformation jobs → Iceberg datasets → semantic projects and downstream consumers.
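What "emitting OpenLineage from Spark" can look like, as a sketch assuming the openlineage-spark integration is on the job's classpath; the spark.openlineage.* keys shown follow recent releases, and the endpoint URL and namespace are assumptions that depend on whether you send events straight to DataHub or to another OpenLineage consumer.

```python
# Sketch: attach the OpenLineage listener so this job's reads and writes are
# reported as lineage events (exact config keys vary by OpenLineage version).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orders-conform")
    .config("spark.extraListeners",
            "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url",
            "http://lineage-backend.internal:5000")   # assumed OpenLineage endpoint
    .config("spark.openlineage.namespace", "prod.lakehouse")
    .getOrCreate()
)

# From here on, any table this session reads or writes shows up as an edge
# in whatever backend ingests the events (DataHub, Marquez, etc.).
```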
Practically, you also ingest technical metadata from the catalog (Glue, Unity, HMS) into DataHub so columns, partitions, and owners appear on dataset pages. Link glossary terms to columns and to business definitions of metrics so “revenue_net” in the semantic project traces to both a human definition and a physical column path. That is how SSOT becomes navigable for engineers, not only asserted in a slide deck.
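A sketch of that linkage with the DataHub Python emitter: attach a glossary term carrying the business definition of revenue_net to the gold Iceberg dataset, so the term and the physical table are connected in the graph. The GMS address, platform name, term name, and dataset path are assumptions.

```python
# Sketch: link a glossary term to a dataset in DataHub via the REST emitter.
# Note: emitting GlossaryTerms as shown replaces any terms already on the
# dataset; in practice you may prefer the SDK's patch/graph helpers.
from datahub.emitter.mce_builder import make_dataset_urn, make_term_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    GlossaryTermAssociationClass,
    GlossaryTermsClass,
)

emitter = DatahubRestEmitter(gms_server="http://datahub-gms.internal:8080")  # assumed address

dataset_urn = make_dataset_urn(platform="iceberg", name="gold.fct_orders", env="PROD")
term_urn = make_term_urn("Finance.RevenueNet")  # assumed glossary term

terms_aspect = GlossaryTermsClass(
    terms=[GlossaryTermAssociationClass(urn=term_urn)],
    auditStamp=AuditStampClass(time=0, actor="urn:li:corpuser:platform-team"),
)

emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=terms_aspect))
```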
How to implement: a phased path for platform teams
1. Freeze scope for v1. Pick one subject area (for example finance revenue and orders) and a handful of measures with named owners. Ingest sources into Iceberg with clear partition keys and idempotent writes; register tables in one catalog; enforce IAM or catalog-level access consistent with your security model.
2. Wire lineage before you scale usage. Turn on OpenLineage (or equivalent) from your orchestrator and Spark jobs; connect ingestion to DataHub. Without lineage, a semantic layer is a black box when something drifts—you will not know the blast radius.
3. Author the semantic model against trusted tables. Use a toolchain your team can operate: for example dbt with metrics and semantic models compiled to the warehouse/lake query engine, or a dedicated metrics server that exposes SQL or REST over Iceberg-backed views. Expose one interface for interactive analytics (often SQL through Trino or Spark SQL) and register the semantic project as a DataHub "container" or linked documentation so consumers discover the official entry points (a consumption sketch follows this list).
4. Register semantic artifacts in metadata. Model metrics and dimensions as first-class entities where DataHub supports them, or document them with stable URIs and link to glossary terms and underlying Iceberg datasets. The goal is end-to-end trace: metric definition → semantic node → view/table → column lineage → source.
5. Add APIs and guardrails for agents and apps. Read-only SQL or REST over approved semantic endpoints; row- and column-level policies enforced at the engine or catalog; audit logs for who queried what definition. Treat the semantic layer as infrastructure, with CI checks when models change (breaking changes detected against downstream tests).
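The consumption sketch referenced in step 3, assuming the Trino Python client: a read-only user queries a published gold view rather than re-deriving the metric in ad hoc SQL. Host, user, catalog, schema, and the view name are assumptions.

```python
# Sketch: query the approved entry point through Trino; the view is assumed to
# encode the official revenue_net definition so callers do not reimplement it.
import trino

conn = trino.dbapi.connect(
    host="trino.internal",   # assumed coordinator
    port=8080,
    user="analytics-agent",  # read-only principal
    catalog="lake",
    schema="gold",
)

cur = conn.cursor()
cur.execute("""
    SELECT order_date, billing_country, revenue_net
    FROM vw_revenue_net_by_country_daily
    ORDER BY order_date DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```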
Tradeoffs and hard truths
Iceberg solves many table-format problems; it does not resolve organizational ownership. DataHub centralizes metadata; it still needs curation and ingestion ownership or it becomes stale. OpenLineage gives you edges; coverage depends on every material job emitting events. The semantic layer is where definitions meet politics—without executive backing on “official” measures, engineering will ship multiple parallel models no matter how good the tools are.
A workable SSOT for analytics is layered: Iceberg and the catalog for physical truth, OpenLineage and DataHub for operational truth about dependencies, and a semantic project for definitional truth in code. Ship those layers in that order of dependency, measure adoption by query paths and lineage coverage, and expand scope only when v1 stays stable under change.