dbt and Iceberg Tables: Building the Open Lakehouse on Snowflake
How dbt 1.9+ transforms Iceberg table development on Snowflake. From external catalogs to bi-directional writes, the open lakehouse is now production-ready.
The promise of the “open lakehouse” has always been seductive: store data in open formats, query it from anywhere, and escape vendor lock-in. The reality, until recently, was a mess of manual metadata management, inconsistent tooling, and “it works in theory” architectures.
That’s changed. With dbt 1.9+ and Snowflake’s expanded Iceberg support, building production-grade open lakehouses is not just possible—it’s practical.
Features change from time to time with new features being added regularly, it is recommended that you review the documentation ↗ for the latest on what specific features are included with any of the Editions.
The Evolution of Iceberg on Snowflake
Apache Iceberg tables became generally available on Snowflake in June 2024. Since then, the ecosystem has matured rapidly:
- July 2025: Write support for externally managed Iceberg tables and catalog-linked databases reached preview
- dbt 1.9: Native support for Iceberg table materialisation
- dbt 1.11: Support for Glue catalog integration via catalog-linked databases
This means you can now use dbt to create and transform Iceberg tables that register metadata in external catalogs—not just Snowflake’s internal catalog.
Why This Matters
The Old Way: Snowflake as a Silo
Data lives in Snowflake’s proprietary format. To use it elsewhere (Spark ML jobs, Databricks notebooks), you need to export it.
The New Way: Snowflake as a Lakehouse Citizen
The same Iceberg table is readable and writable by multiple engines. No export pipelines. No data copies.
Configuring dbt for Iceberg
Basic Iceberg Table Materialisation
In your dbt_project.yml:
models:
my_project:
marts:
+materialized: table
+table_format: iceberg
+external_volume: my_s3_volume
+base_location: warehouse/marts/yamlOr at the model level in schema.yml:
models:
- name: dim_customer
config:
materialized: table
table_format: iceberg
external_volume: my_s3_volume
base_location: warehouse/dims/customers/yamlIncremental Iceberg Models
Iceberg tables support incremental materialisation with merge operations:
-- models/marts/fct_orders.sql
{{
config(
materialized='incremental',
table_format='iceberg',
external_volume='my_s3_volume',
base_location='warehouse/facts/orders/',
unique_key='order_id',
incremental_strategy='merge'
)
}}
SELECT
order_id,
customer_id,
order_date,
order_total,
updated_at
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}sqlKey Point: The merge strategy leverages Iceberg’s ACID transactions, ensuring data consistency even when multiple processes access the table.
External Catalog Integration
The real power emerges when you connect dbt to external Iceberg catalogs.
Option 1: Snowflake Open Catalog (Polaris)
Snowflake Open Catalog (formerly Polaris) is an open-source Iceberg REST catalog. Configure it in Snowflake:
-- Create catalog integration
CREATE OR REPLACE CATALOG INTEGRATION polaris_catalog
CATALOG_SOURCE = POLARIS
TABLE_FORMAT = ICEBERG
CATALOG_NAMESPACE = 'my_namespace'
REST_CONFIG = (
CATALOG_URI = 'https://my-polaris-instance.example.com/api/catalog'
WAREHOUSE = 'my_warehouse'
)
REST_AUTHENTICATION = (
TYPE = OAUTH
OAUTH_CLIENT_ID = 'my_client_id'
OAUTH_CLIENT_SECRET = 'my_secret'
OAUTH_ALLOWED_SCOPES = ('PRINCIPAL_ROLE:ALL')
)
ENABLED = TRUE;sqlOption 2: AWS Glue Catalog
For AWS-centric architectures, link to Glue:
-- Create catalog-linked database
CREATE DATABASE glue_catalog_db
FROM CATALOG INTEGRATION glue_catalog;sqlIn dbt, reference tables from the linked database:
-- models/staging/stg_external_events.sql
SELECT *
FROM {{ source('glue_catalog_db', 'raw_events') }}
WHERE event_date >= CURRENT_DATE - 7sqlNote: As of dbt 1.11, incremental materialisation is not yet supported for Glue catalog tables. Use full table refreshes or manage incremental logic externally.
Practical Architecture: Multi-Engine Analytics
Here’s a production pattern I’ve seen work well:
The Setup
Workflow
- Spark ingests raw data into Iceberg tables (high-volume, cost-effective)
- Snowflake + dbt transforms raw → staging → marts (SQL-native, governed)
- Trino/Databricks runs ad-hoc ML workloads against the same marts (no data movement)
The dbt Side
# dbt_project.yml
models:
lakehouse_project:
staging:
+materialized: view # Views for staging (reference raw Iceberg)
marts:
+materialized: table
+table_format: iceberg
+external_volume: prod_lake_volumeyamlPerformance Considerations
Compaction
Iceberg tables accumulate small files over time. Snowflake automatically compacts Snowflake-managed Iceberg tables, but for externally managed tables, you need to handle this yourself (via Spark or a scheduled job).
Clustering
For large Iceberg tables, define clustering keys:
{{
config(
materialized='table',
table_format='iceberg',
cluster_by=['order_date', 'region']
)
}}sqlQuery Pruning
Iceberg’s manifest files contain min/max statistics for each data file. Snowflake uses these for partition pruning. Ensure your queries filter on columns with good selectivity.
Common Pitfalls
Pitfall 1: Forgetting External Volume Permissions
Problem: dbt run fails with “Access denied to external volume.”
Solution: Grant the Snowflake role access to the external volume:
GRANT USAGE ON EXTERNAL VOLUME my_s3_volume TO ROLE transformer_role;
GRANT READ, WRITE ON EXTERNAL VOLUME my_s3_volume TO ROLE transformer_role;sqlPitfall 2: Mixing Managed and Unmanaged Tables
Problem: You create a table with CATALOG = 'SNOWFLAKE' but try to write to it from Spark.
Solution: Decide upfront who “owns” the table metadata:
- Snowflake-managed: Only Snowflake writes; others read via Iceberg REST API
- Externally-managed: Snowflake reads; writes via Spark/external engine
Pitfall 3: Schema Evolution Conflicts
Problem: dbt adds a column, but Spark doesn’t see it.
Solution: Ensure all engines refresh their metadata cache. In Snowflake:
ALTER ICEBERG TABLE my_table REFRESH;sqlWhen to Use Iceberg Tables
Good fit for:
- Multi-engine architectures (Snowflake + Spark + Databricks)
- Cost-sensitive storage (S3 is cheaper than Snowflake native)
- Regulatory requirements for data portability
- ML/AI workloads that need direct file access
Not recommended for:
- Small tables (< 100GB) with Snowflake-only access
- Real-time streaming (use native tables with Snowpipe Streaming)
- Workloads requiring Snowflake-specific features (Time Travel > 7 days, etc.)
Conclusion
The combination of dbt 1.9+ and Snowflake’s Iceberg support makes the open lakehouse a practical reality. You no longer have to choose between SQL-native development (dbt) and open data formats (Iceberg)—you can have both.
Key Takeaways:
- dbt supports Iceberg tables as first-class citizens via
table_format: iceberg - External catalog integration enables true multi-engine architectures
- Snowflake’s Iceberg performance is now within 5-10% of native tables
- Plan your managed vs. unmanaged strategy upfront to avoid conflicts
Features change from time to time with new features being added regularly, it is recommended that you review the documentation ↗ for the latest on what specific features are included with any of the Editions.