
How to Get Started with Apache Iceberg Using DuckDB

Arkzero Research | Apr 23, 2026 | 7 min read

Last updated Apr 23, 2026

Apache Iceberg is an open table format that adds ACID transactions, schema evolution, and time travel to files stored on object storage or local disk. DuckDB's Iceberg extension lets analysts query and write Iceberg tables with a single pip install and no Spark cluster required. This guide walks through installing the extension, creating your first table, running queries, and understanding the v3 features that changed how deletions and semi-structured data are handled.


Apache Iceberg solves a problem that raw file storage never could: you cannot run reliable transactions, evolve schemas safely, or travel back in time on a folder of Parquet files sitting in S3. Iceberg adds a metadata layer on top of those files that tracks every change, every schema version, and every snapshot of the table. The result is a table format that makes a collection of files behave like a database table without requiring a database server.

DuckDB makes Iceberg accessible to analysts who have no interest in running a Spark cluster. Install it with pip, load one extension, and you can read and write Iceberg tables from your laptop. As of late 2025, DuckDB can both read and write Iceberg tables, and the v3 spec adds features, such as deletion vectors and a VARIANT type, that narrow the gap with proprietary formats.

What Apache Iceberg Is and Why It Matters

Object storage (S3, GCS, Azure Blob) is cheap and durable, but it has no concept of a transaction. If two writers update the same dataset simultaneously, you get corrupted or inconsistent data. If you change a column name, every downstream query breaks. There is no built-in way to roll back a bad write.

Iceberg fixes each of these problems. It stores data as ordinary Parquet or ORC files but maintains a tree of metadata files that track the table state. A metadata.json file records the schema and the list of snapshots; each snapshot points to a manifest list, whose manifest files in turn list the data files that make up that version of the table. Writers create a new snapshot atomically; readers see a consistent view at any point in the table's history. Schema changes are recorded in the metadata and do not require rewriting existing data.
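To make the snapshot idea concrete, here is a toy in-memory model in Python. It is a sketch only: the names Snapshot, TableMetadata, commit, and files are hypothetical and do not mirror the real Iceberg metadata layout or any library API. It shows the one property that matters: a commit never mutates an existing snapshot, it adds a new one and moves the current pointer.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: the data files visible at one commit."""
    snapshot_id: int
    data_files: Tuple[str, ...]


@dataclass
class TableMetadata:
    """Toy stand-in for table metadata: all snapshots plus a current pointer."""
    snapshots: Dict[int, Snapshot] = field(default_factory=dict)
    current_id: Optional[int] = None

    def commit(self, added: Iterable[str], removed: Iterable[str] = ()) -> int:
        """Build a new snapshot from the current one and make it current."""
        base = () if self.current_id is None else self.snapshots[self.current_id].data_files
        gone = set(removed)
        files = tuple(f for f in base if f not in gone) + tuple(added)
        new_id = len(self.snapshots) + 1
        self.snapshots[new_id] = Snapshot(new_id, files)
        self.current_id = new_id  # the single pointer swap readers observe
        return new_id

    def files(self, snapshot_id: Optional[int] = None) -> Tuple[str, ...]:
        """Data files for the current snapshot, or for an older one by ID."""
        sid = self.current_id if snapshot_id is None else snapshot_id
        return self.snapshots[sid].data_files


table = TableMetadata()
first = table.commit(added=["data/a.parquet"])
second = table.commit(added=["data/b.parquet"])
third = table.commit(added=["data/c.parquet"], removed=["data/a.parquet"])
```

After the third commit, table.files() no longer lists data/a.parquet, while table.files(first) still does: older snapshots stay readable because nothing overwrote them.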

The format is query-engine agnostic. Spark, Trino, Flink, Snowflake, Databricks, and DuckDB can all read the same Iceberg table. That interoperability is why Iceberg became the standard choice for teams building open data lakehouses rather than locking data into a single vendor's format.

Why DuckDB Is the Easiest Starting Point

Most Iceberg tutorials open with a Docker Compose file that spins up a Spark cluster, a Hive metastore, and a REST catalog. For an analyst who wants to learn the format, that is too much infrastructure. DuckDB collapses all of that into a single in-process binary.

Recent DuckDB 1.x releases provide an Iceberg extension that can scan, create, and write Iceberg tables on local disk or S3-compatible storage. For path-based access there is no separate metastore process; the table's metadata lives in the same directory as its data. For learning purposes, or for small to medium datasets, that is entirely sufficient.

Installing DuckDB and the Iceberg Extension

Install DuckDB via pip:

pip install duckdb

Then open a Python session or the DuckDB CLI and install the extension:

INSTALL iceberg;
LOAD iceberg;

On first use, DuckDB downloads the extension binary from its official extension repository. After that, LOAD iceberg; is all that is needed at the start of each session.

Creating Your First Iceberg Table

DuckDB can create an Iceberg table from any query result using COPY ... TO:

-- Create sample data
CREATE TABLE orders AS
  SELECT
    range AS order_id,
    'customer_' || (range % 100)::VARCHAR AS customer,
    round(random() * 500, 2) AS amount,
    current_date - (range % 365)::INTEGER AS order_date
  FROM range(10000);

-- Write as an Iceberg table to local disk
COPY orders TO 'my_iceberg_table' (FORMAT ICEBERG, COMPRESSION SNAPPY);

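DuckDB writes a valid Iceberg directory structure: a data/ folder of Parquet files and a metadata/ folder containing the snapshot, manifest, and table metadata JSON files. Any other Iceberg-compatible engine can now read this table. Write support depends on the extension version: if your build does not accept FORMAT ICEBERG in COPY, check the extension documentation, since some releases write only through a catalog attached with ATTACH.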

Reading and Querying Iceberg Tables

Use iceberg_scan() to query an Iceberg table:

SELECT customer, sum(amount) AS total
FROM iceberg_scan('my_iceberg_table')
GROUP BY customer
ORDER BY total DESC
LIMIT 10;

The iceberg_scan() function reads the current snapshot by default. To query a previous snapshot (time travel), pass the snapshot ID:

-- List available snapshots
SELECT * FROM iceberg_snapshots('my_iceberg_table');

-- Query a specific snapshot
SELECT count(*) FROM iceberg_scan('my_iceberg_table', snapshot_from_id = 3776207238926525440);

This is one of Iceberg's most practical features for analysts. If a pipeline runs a bad update at 2 AM, you can query what the data looked like the night before without restoring from backup.
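In practice you usually know when the bad write happened, not which snapshot ID preceded it. The Python sketch below shows the selection logic under simple assumptions: the snapshot listing has been loaded as (snapshot ID, commit time) pairs, and the function name snapshot_as_of and all sample values are invented for illustration.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# (snapshot_id, commit time) pairs; the values below are made up.
SnapshotRow = Tuple[int, datetime]


def snapshot_as_of(snapshots: List[SnapshotRow], when: datetime) -> Optional[int]:
    """Return the ID of the latest snapshot committed at or before `when`."""
    eligible = [row for row in snapshots if row[1] <= when]
    if not eligible:
        return None  # no snapshot existed yet at that time
    return max(eligible, key=lambda row: row[1])[0]


history = [
    (101, datetime(2026, 4, 20, 23, 0)),
    (102, datetime(2026, 4, 21, 23, 0)),
    (103, datetime(2026, 4, 22, 2, 0)),  # the hypothetical bad 2 AM update
]

# One minute before the bad update: the previous night's snapshot is chosen.
good_id = snapshot_as_of(history, datetime(2026, 4, 22, 1, 59))
```

The chosen ID is what you would pass to the scan in place of the literal snapshot ID.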

What Changed in Iceberg v3

Apache Iceberg v3 entered public preview on Snowflake in March 2026 and on Databricks shortly after. Three features stand out for the audiences that use Iceberg most.

Deletion vectors change how row-level deletes work. In earlier versions, deleting a row meant either rewriting the Parquet file that contained it (copy-on-write) or writing separate delete files that readers had to merge at query time (merge-on-read). With v3 deletion vectors, the engine instead writes a compact bitmap per data file marking which row positions are logically deleted, without touching the original data file. Databricks reports this makes data manipulation operations up to 10x faster on large tables. The trade-off is a small read overhead, which goes away once compaction rewrites the affected files.
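The difference between the two delete strategies can be sketched in a few lines of Python. This is a conceptual illustration, not the on-disk format: a Python set stands in for the compressed bitmap a real engine would write, and the function names are hypothetical.

```python
from typing import Dict, List, Set

# One immutable "data file": rows are addressed by their position in the file.
data_file: List[Dict[str, float]] = [
    {"order_id": 0, "amount": 10.0},
    {"order_id": 1, "amount": 25.5},
    {"order_id": 2, "amount": 7.25},
    {"order_id": 3, "amount": 99.0},
]


def delete_copy_on_write(rows: List[dict], order_ids: Set[int]) -> List[dict]:
    """Copy-on-write: produce a whole new file without the deleted rows."""
    return [row for row in rows if row["order_id"] not in order_ids]


def delete_with_vector(rows: List[dict], order_ids: Set[int], vector: Set[int]) -> Set[int]:
    """Deletion vector: record row positions only; the data file is untouched."""
    return vector | {pos for pos, row in enumerate(rows) if row["order_id"] in order_ids}


def read(rows: List[dict], vector: Set[int]) -> List[dict]:
    """Readers skip every position marked in the deletion vector."""
    return [row for pos, row in enumerate(rows) if pos not in vector]


rewritten = delete_copy_on_write(data_file, {1, 3})
vector = delete_with_vector(data_file, {1, 3}, set())
visible = read(data_file, vector)
```

Both paths expose the same rows to a reader, but the deletion-vector path wrote only two integers, while copy-on-write produced a new copy of every surviving row.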

Row lineage assigns every row a permanent ID and a sequence number reflecting when it was last modified. For teams running CDC pipelines or auditing data changes, this eliminates a common gap: you can now identify exactly which rows changed between two table versions without scanning the full dataset.
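As a rough illustration of how those two fields support change detection, the Python sketch below filters rows by the sequence number of the commit that last touched them. The Row type, its field names, and the sample data are hypothetical; a real engine exposes lineage as metadata columns and can use file-level statistics to skip unchanged files, which this toy version does not attempt.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    row_id: int            # stable for the life of the row
    last_updated_seq: int  # sequence number of the commit that last changed it
    amount: float


def changed_since(rows: List[Row], since_seq: int) -> List[int]:
    """Return IDs of rows inserted or updated after commit `since_seq`."""
    return sorted(r.row_id for r in rows if r.last_updated_seq > since_seq)


# Table state after three commits: commit 1 inserted rows 0-2,
# commit 2 updated row 1, and commit 3 inserted row 3.
current = [
    Row(0, 1, 10.0),
    Row(1, 2, 30.0),
    Row(2, 1, 7.25),
    Row(3, 3, 99.0),
]

changed = changed_since(current, since_seq=1)
```

Here changed contains rows 1 and 3, the only ones touched after the first commit. Deleted rows are out of scope for this sketch.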

The VARIANT type adds a native column type for semi-structured data. Before v3, storing JSON in Iceberg meant either flattening the structure into typed columns or keeping the raw payload in a string or binary column. VARIANT stores the payload in a binary encoding that preserves its nested structure, and engines that support it let you extract fields in SQL, typically with dot-notation or bracket syntax. For teams ingesting raw API responses or event streams, this removes an entire normalization step.
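To show what path extraction over an unflattened payload looks like, here is a small Python sketch that follows a dot-and-bracket path through a JSON document. It is an analogy for the query experience, not the VARIANT encoding itself; the extract function, its path grammar, and the sample event are assumptions made for illustration.

```python
import json
import re
from typing import Any

# Matches either a field name (user, tags) or a bracketed index ([0]).
_TOKEN = re.compile(r"([A-Za-z_]\w*)|\[(\d+)\]")


def extract(payload: str, path: str) -> Any:
    """Follow a path like 'user.tags[0]' through a JSON document.

    Returns None when a step is missing, much as SQL would yield NULL.
    """
    value = json.loads(payload)
    for key, index in _TOKEN.findall(path):
        try:
            value = value[key] if key else value[int(index)]
        except (KeyError, IndexError, TypeError):
            return None
    return value


event = '{"type": "click", "user": {"id": 42, "tags": ["beta", "eu"]}}'

user_id = extract(event, "user.id")
second_tag = extract(event, "user.tags[1]")
missing = extract(event, "user.email")
```

The payload is never reshaped into columns; each query names the path it needs, and absent fields come back empty instead of failing.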

Connecting to Cloud Storage

For production use, point DuckDB at an S3-compatible bucket. Set the region and your credentials in the session and use an s3:// path:

SET s3_region='us-east-1';
SET s3_access_key_id=... ;
SET s3_secret_access_key=... ;

COPY orders TO 's3://my-bucket/warehouse/orders' (FORMAT ICEBERG);

SELECT count(*) FROM iceberg_scan('s3://my-bucket/warehouse/orders');

For teams that need catalog-level access control and multi-engine sharing, an Iceberg REST catalog (Apache Polaris, Lakekeeper, or Nessie) can be attached via DuckDB's ATTACH command. That step adds a running catalog service to the stack, but queries remain ordinary SQL, addressed to catalog-qualified table names instead of paths.

If your team is already pulling data from a connected source and wants to skip the local pipeline setup, VSLZ lets you upload a file or connect a database and query it directly without configuring storage or extension installs.

Practical Next Steps

Apache Iceberg with DuckDB is a reasonable starting point for any team that wants open-format lakehouse semantics without a large infrastructure investment. The local setup takes under five minutes. The same Iceberg files can later be read by Spark, Trino, or Snowflake without conversion, which protects the investment as data volumes grow.

The v3 features arriving in 2026 are worth enabling on new tables: deletion vectors reduce operational cost on tables with frequent updates, and the VARIANT type opens Iceberg to event and API data that previously needed a separate store.

FAQ

What is Apache Iceberg used for?

Apache Iceberg is an open table format used for storing large analytic datasets on object storage (S3, GCS, Azure Blob) with database-grade features: ACID transactions, schema evolution, time travel, and safe concurrent writes. It is widely used to build data lakehouses where multiple query engines (Spark, Trino, DuckDB, Snowflake) need to read and write the same data reliably.

Can I use Apache Iceberg without Spark?

Yes. DuckDB's Iceberg extension supports reading and writing Iceberg tables with no Spark or cluster infrastructure required. Install DuckDB via pip, run INSTALL iceberg and LOAD iceberg, then use iceberg_scan() and COPY ... TO with FORMAT ICEBERG. Trino, Snowflake, and Databricks also support Iceberg natively.

What is new in Apache Iceberg v3?

Apache Iceberg v3 (in public preview on Snowflake and Databricks as of 2026) adds three major features: deletion vectors (row-level deletes without rewriting data files, which Databricks reports as up to 10x faster), row lineage (permanent row IDs that make change tracking and CDC pipelines more precise), and the VARIANT column type (native support for semi-structured JSON data without requiring a separate flattening step).

How does Iceberg time travel work?

Iceberg stores an immutable snapshot for every table version. Each snapshot records exactly which data files were part of the table at that point. To query historical data, you pass a snapshot ID or a timestamp to iceberg_scan() in DuckDB, or use TIMESTAMP AS OF in Spark SQL and FOR TIMESTAMP AS OF in Trino. Snapshots accumulate until a snapshot-expiration job removes old ones.

What is the difference between Apache Iceberg and Delta Lake?

Both are open table formats that add transactions and schema evolution to Parquet files on object storage. Delta Lake originated at Databricks and has tighter integration with the Databricks platform; Iceberg was designed from the start to be engine-agnostic and is a common choice when data must be read by multiple engines from different vendors. As of 2026, Iceberg v3 adds several features, such as deletion vectors, that were previously Delta Lake advantages.