dbt and Iceberg Tables: Building the Open Lakehouse on Snowflake

The promise of the “open lakehouse” has always been seductive: store data in open formats, query it from anywhere, and escape vendor lock-in. The reality, until recently, was a mess of manual metadata management, inconsistent tooling, and “it works in theory” architectures.

That’s changed. With dbt 1.9+ and Snowflake’s expanded Iceberg support, building production-grade open lakehouses is not just possible—it’s practical.

Features change from time to time with new features being added regularly, it is recommended that you review the documentation ↗ for the latest on what specific features are included with any of the Editions.

The Evolution of Iceberg on Snowflake

Apache Iceberg tables became generally available on Snowflake in June 2024. Since then, the ecosystem has matured rapidly:

July 2025: Write support for externally managed Iceberg tables and catalog-linked databases reached preview
dbt 1.9: Native support for Iceberg table materialisation
dbt 1.11: Support for Glue catalog integration via catalog-linked databases

This means you can now use dbt to create and transform Iceberg tables that register metadata in external catalogs—not just Snowflake’s internal catalog.

Why This Matters

The Old Way: Snowflake as a Silo

graph LR A[Source Data] --> B[Fivetran] --> C[Snowflake Native Tables] --> D[dbt] --> E[Snowflake Native Tables] C --> F[Export to S3 for Spark]

Data lives in Snowflake’s proprietary format. To use it elsewhere (Spark ML jobs, Databricks notebooks), you need to export it.

The New Way: Snowflake as a Lakehouse Citizen

graph TD A[Source Data] --> B[Ingestion] --> C[(S3 Iceberg Tables)] D[Snowflake + dbt] --> C E[Spark / Databricks / Trino] --> C

The same Iceberg table is readable and writable by multiple engines. No export pipelines. No data copies.

Configuring dbt for Iceberg

Basic Iceberg Table Materialisation

In your dbt_project.yml:

models:
  my_project:
    marts:
      +materialized: table
      +table_format: iceberg
      +external_volume: my_s3_volume
      +base_location: warehouse/marts/

yaml

Or at the model level in schema.yml:

models:
  - name: dim_customer
    config:
      materialized: table
      table_format: iceberg
      external_volume: my_s3_volume
      base_location: warehouse/dims/customers/

yaml

Incremental Iceberg Models

Iceberg tables support incremental materialisation with merge operations:

-- models/marts/fct_orders.sql
{{
    config(
        materialized='incremental',
        table_format='iceberg',
        external_volume='my_s3_volume',
        base_location='warehouse/facts/orders/',
        unique_key='order_id',
        incremental_strategy='merge'
    )
}}

SELECT
    order_id,
    customer_id,
    order_date,
    order_total,
    updated_at
FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}

sql

Key Point: The merge strategy leverages Iceberg’s ACID transactions, ensuring data consistency even when multiple processes access the table.

External Catalog Integration

The real power emerges when you connect dbt to external Iceberg catalogs.

Option 1: Snowflake Open Catalog (Polaris)

Snowflake Open Catalog (formerly Polaris) is an open-source Iceberg REST catalog. Configure it in Snowflake:

-- Create catalog integration
CREATE OR REPLACE CATALOG INTEGRATION polaris_catalog
    CATALOG_SOURCE = POLARIS
    TABLE_FORMAT = ICEBERG
    CATALOG_NAMESPACE = 'my_namespace'
    REST_CONFIG = (
        CATALOG_URI = 'https://my-polaris-instance.example.com/api/catalog'
        WAREHOUSE = 'my_warehouse'
    )
    REST_AUTHENTICATION = (
        TYPE = OAUTH
        OAUTH_CLIENT_ID = 'my_client_id'
        OAUTH_CLIENT_SECRET = 'my_secret'
        OAUTH_ALLOWED_SCOPES = ('PRINCIPAL_ROLE:ALL')
    )
    ENABLED = TRUE;

sql

Option 2: AWS Glue Catalog

For AWS-centric architectures, link to Glue:

-- Create catalog-linked database
CREATE DATABASE glue_catalog_db
    FROM CATALOG INTEGRATION glue_catalog;

sql

In dbt, reference tables from the linked database:

-- models/staging/stg_external_events.sql
SELECT *
FROM {{ source('glue_catalog_db', 'raw_events') }}
WHERE event_date >= CURRENT_DATE - 7

sql

Note: As of dbt 1.11, incremental materialisation is not yet supported for Glue catalog tables. Use full table refreshes or manage incremental logic externally.

Practical Architecture: Multi-Engine Analytics

Here’s a production pattern I’ve seen work well:

The Setup

graph TD subgraph Lake[S3 Data Lake] raw[raw/ Iceberg] staging[staging/ Iceberg] marts[marts/ Iceberg] end Spark[Spark<br/>Ingest] --> raw raw --> SF[Snowflake + dbt] SF --> staging SF --> marts marts --> Trino[Trino<br/>Ad-hoc]

Workflow

Spark ingests raw data into Iceberg tables (high-volume, cost-effective)
Snowflake + dbt transforms raw → staging → marts (SQL-native, governed)
Trino/Databricks runs ad-hoc ML workloads against the same marts (no data movement)

The dbt Side

# dbt_project.yml
models:
  lakehouse_project:
    staging:
      +materialized: view  # Views for staging (reference raw Iceberg)
    marts:
      +materialized: table
      +table_format: iceberg
      +external_volume: prod_lake_volume

yaml

Performance Considerations

Compaction

Iceberg tables accumulate small files over time. Snowflake automatically compacts Snowflake-managed Iceberg tables, but for externally managed tables, you need to handle this yourself (via Spark or a scheduled job).

Clustering

For large Iceberg tables, define clustering keys:

{{
    config(
        materialized='table',
        table_format='iceberg',
        cluster_by=['order_date', 'region']
    )
}}

sql

Query Pruning

Iceberg’s manifest files contain min/max statistics for each data file. Snowflake uses these for partition pruning. Ensure your queries filter on columns with good selectivity.

Common Pitfalls

Pitfall 1: Forgetting External Volume Permissions

Problem: dbt run fails with “Access denied to external volume.”

Solution: Grant the Snowflake role access to the external volume:

GRANT USAGE ON EXTERNAL VOLUME my_s3_volume TO ROLE transformer_role;
GRANT READ, WRITE ON EXTERNAL VOLUME my_s3_volume TO ROLE transformer_role;

sql

Pitfall 2: Mixing Managed and Unmanaged Tables

Problem: You create a table with CATALOG = 'SNOWFLAKE' but try to write to it from Spark.

Solution: Decide upfront who “owns” the table metadata:

Snowflake-managed: Only Snowflake writes; others read via Iceberg REST API
Externally-managed: Snowflake reads; writes via Spark/external engine

Pitfall 3: Schema Evolution Conflicts

Problem: dbt adds a column, but Spark doesn’t see it.

Solution: Ensure all engines refresh their metadata cache. In Snowflake:

ALTER ICEBERG TABLE my_table REFRESH;

sql

When to Use Iceberg Tables

Good fit for:

Multi-engine architectures (Snowflake + Spark + Databricks)
Cost-sensitive storage (S3 is cheaper than Snowflake native)
Regulatory requirements for data portability
ML/AI workloads that need direct file access

Not recommended for:

Small tables (< 100GB) with Snowflake-only access
Real-time streaming (use native tables with Snowpipe Streaming)
Workloads requiring Snowflake-specific features (Time Travel > 7 days, etc.)

Conclusion

The combination of dbt 1.9+ and Snowflake’s Iceberg support makes the open lakehouse a practical reality. You no longer have to choose between SQL-native development (dbt) and open data formats (Iceberg)—you can have both.

Key Takeaways:

dbt supports Iceberg tables as first-class citizens via table_format: iceberg
External catalog integration enables true multi-engine architectures
Snowflake’s Iceberg performance is now within 5-10% of native tables
Plan your managed vs. unmanaged strategy upfront to avoid conflicts

The Evolution of Iceberg on Snowflake

Why This Matters

The Old Way: Snowflake as a Silo

The New Way: Snowflake as a Lakehouse Citizen

Configuring dbt for Iceberg

Basic Iceberg Table Materialisation

Incremental Iceberg Models

External Catalog Integration

Option 1: Snowflake Open Catalog (Polaris)

Option 2: AWS Glue Catalog

Practical Architecture: Multi-Engine Analytics

The Setup

Workflow

The dbt Side

Performance Considerations

Compaction

Clustering

Query Pruning

Common Pitfalls

Pitfall 1: Forgetting External Volume Permissions

Pitfall 2: Mixing Managed and Unmanaged Tables

Pitfall 3: Schema Evolution Conflicts

When to Use Iceberg Tables

Conclusion

Further Reading

Disclaimer

dbt and Iceberg Tables: Building the Open Lakehouse on Snowflake

The Evolution of Iceberg on Snowflake#

Why This Matters#

The Old Way: Snowflake as a Silo#

The New Way: Snowflake as a Lakehouse Citizen#

Configuring dbt for Iceberg#

Basic Iceberg Table Materialisation#

Incremental Iceberg Models#

External Catalog Integration#

Option 1: Snowflake Open Catalog (Polaris)#

Option 2: AWS Glue Catalog#

Practical Architecture: Multi-Engine Analytics#

The Setup#

Workflow#

The dbt Side#

Performance Considerations#

Compaction#

Clustering#

Query Pruning#

Common Pitfalls#

Pitfall 1: Forgetting External Volume Permissions#

Pitfall 2: Mixing Managed and Unmanaged Tables#

Pitfall 3: Schema Evolution Conflicts#

When to Use Iceberg Tables#

Conclusion#

Further Reading#

Disclaimer

The Evolution of Iceberg on Snowflake

Why This Matters

The Old Way: Snowflake as a Silo

The New Way: Snowflake as a Lakehouse Citizen

Configuring dbt for Iceberg

Basic Iceberg Table Materialisation

Incremental Iceberg Models

External Catalog Integration

Option 1: Snowflake Open Catalog (Polaris)

Option 2: AWS Glue Catalog

Practical Architecture: Multi-Engine Analytics

The Setup

Workflow

The dbt Side

Performance Considerations

Compaction

Clustering

Query Pruning

Common Pitfalls

Pitfall 1: Forgetting External Volume Permissions

Pitfall 2: Mixing Managed and Unmanaged Tables

Pitfall 3: Schema Evolution Conflicts

When to Use Iceberg Tables

Conclusion

Further Reading