How to Set Up DuckLake for Your Data Lake
Last updated Apr 25, 2026

What DuckLake Solves
Most data lakes have a metadata problem. Apache Iceberg stores table metadata as a hierarchy of JSON and Avro files on object storage. A single streaming ingest run can generate more than 300 metadata files before a row of business data is queryable. Those files require periodic compaction, a separate catalog service to track them, and additional infrastructure to keep them consistent.
DuckLake v1.0, released on April 13, 2026, takes a different approach. All table metadata lives in a standard SQL database: PostgreSQL, MySQL, SQLite, or DuckDB itself. The actual data files remain Parquet on object storage, the same as any other lakehouse format. The catalog state that tells DuckDB what those files represent is stored as database rows, not as JSON files on S3.
The practical result: consistent reads, atomic writes, schema evolution, and time travel built on infrastructure most engineering teams already operate.
How the Architecture Works
DuckLake separates three concerns. Data storage holds the Parquet files. For production this is an S3 bucket. For local experiments it is a directory on disk. Metadata storage holds the catalog: table definitions, schema versions, snapshots, branch state, and transaction history. In production this is PostgreSQL or MySQL. For single-user work it is a .ducklake file backed by DuckDB. Compute is DuckDB v1.5.2 or later, using the ducklake extension to bridge the catalog and the Parquet data.
This separation allows multiple users to connect to the same data lake through a shared catalog. When one analyst inserts rows and another runs a SELECT, the PostgreSQL catalog handles concurrency and isolation natively, without file-level locking on object storage.
Prerequisites
Before running any commands, confirm that DuckDB v1.5.2 or later is installed; run duckdb --version to check. You also need an S3 bucket with read/write access, or a local directory for testing. If using S3, set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION in your environment. For the production setup you also need a PostgreSQL 14 or later instance reachable from your machine.
Install DuckDB on Linux:
curl -L https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip -o duckdb.zip
unzip duckdb.zip && sudo mv duckdb /usr/local/bin/duckdb
On macOS: brew install duckdb
Step 1: Local Setup
The fastest way to try DuckLake requires no external dependencies. Open a DuckDB shell and run:
INSTALL ducklake;
LOAD ducklake;
ATTACH 'ducklake:my_catalog.ducklake' AS lake;
USE lake;
The ATTACH command creates a .ducklake file that stores all catalog metadata locally. Data files are written to a directory next to it; with no DATA_PATH option this defaults to my_catalog.ducklake.files/. Standard SQL works from this point:
CREATE TABLE lake.orders (
order_id INTEGER,
customer VARCHAR,
amount DECIMAL(10, 2),
order_date DATE
);
INSERT INTO lake.orders VALUES
(1, 'Acme Corp', 4500.00, '2026-04-01'),
(2, 'Beta LLC', 1200.50, '2026-04-03');
SELECT customer, SUM(amount) AS total
FROM lake.orders
GROUP BY customer;
DuckLake writes each batch as a Parquet file and records the transaction snapshot in the metadata file. Run SELECT * FROM ducklake_snapshots('lake'); to view the history.
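Because the catalog is an ordinary database, the bookkeeping behind that statement can be inspected with plain SQL. The sketch below rests on two assumptions that may not hold for every release: that the extension exposes the metadata database under the name __ducklake_metadata_lake (the attached name with a __ducklake_metadata_ prefix), and that the table and column names match the published DuckLake specification. Treat the queries as illustrative rather than as a stable interface.

```sql
-- Assumption: the metadata catalog is reachable as
-- '__ducklake_metadata_' || <attached catalog name>.

-- One row per committed snapshot.
SELECT snapshot_id, snapshot_time, schema_version
FROM __ducklake_metadata_lake.ducklake_snapshot
ORDER BY snapshot_id;

-- One row per Parquet data file, with the snapshot that added it.
SELECT t.table_name,
       f.path,
       f.record_count,
       f.file_size_bytes,
       f.begin_snapshot
FROM __ducklake_metadata_lake.ducklake_data_file AS f
JOIN __ducklake_metadata_lake.ducklake_table AS t
  ON t.table_id = f.table_id
ORDER BY f.begin_snapshot;
```

The ducklake_snapshots('lake') function is the supported way to read the same history; the raw tables are shown only to make the "metadata as rows" idea concrete.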
Step 2: Production Setup with PostgreSQL and S3
For multi-user or production workloads, move the metadata to PostgreSQL and the data to S3.
Install the required extensions in DuckDB:
INSTALL ducklake; INSTALL postgres; INSTALL httpfs; INSTALL aws;
LOAD ducklake; LOAD postgres; LOAD httpfs; LOAD aws;
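A quick sanity check, not part of the original steps, can confirm the DuckDB version and the extension state before continuing. It relies on DuckDB's built-in version() function and duckdb_extensions() table function; note that the postgres extension may be listed under its full name, postgres_scanner, so both spellings are included in the filter.

```sql
-- Report the running DuckDB version.
SELECT version() AS duckdb_version;

-- Show whether each required extension is installed and loaded.
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name IN ('ducklake', 'httpfs', 'aws',
                         'postgres', 'postgres_scanner')
ORDER BY extension_name;
```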
Configure S3 credentials if not already in your environment:
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'YOUR_KEY';
SET s3_secret_access_key = 'YOUR_SECRET';
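As an alternative to the individual SET options, DuckDB's Secrets Manager can hold the same credentials. The sketch below shows two variants; the secret name lake_s3 and the placeholder values are illustrative. The credential_chain provider comes from the aws extension and resolves credentials through the standard AWS lookup chain (environment variables, shared config files, and so on), which is one way to make use of the AWS_* variables from the prerequisites.

```sql
-- Variant 1: explicit keys, mirroring the SET statements above.
CREATE OR REPLACE SECRET lake_s3 (
    TYPE s3,
    KEY_ID 'YOUR_KEY',
    SECRET 'YOUR_SECRET',
    REGION 'us-east-1'
);

-- Variant 2: let the aws extension resolve credentials from the
-- AWS credential chain instead of hard-coding them.
CREATE OR REPLACE SECRET lake_s3 (
    TYPE s3,
    PROVIDER credential_chain
);

-- List the secrets DuckDB currently knows about (values are redacted).
SELECT name, type, provider FROM duckdb_secrets();
```

Only one of the two variants is needed; running both simply leaves the second definition in place.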
Attach the production catalog:
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=your-pg-host user=your-user password=your-password'
AS prod_lake (DATA_PATH 's3://your-bucket/ducklake/');
USE prod_lake;
On first ATTACH, DuckLake creates the catalog schema in PostgreSQL automatically. Every subsequent CREATE TABLE, INSERT, UPDATE, DELETE, and ALTER TABLE writes any new Parquet data to S3 and commits the matching metadata to Postgres in one database transaction, so a change becomes visible to readers all at once or not at all. Any DuckDB client that attaches with the same Postgres connection string sees the same consistent lake state.
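To keep the password out of the ATTACH string, the connection details can be moved into a DuckDB secret. This is a sketch under the assumption that an unnamed postgres secret is picked up as the default for connection strings that omit host, user, and password; the host, user, and password values are the same placeholders used above. It replaces the earlier ATTACH statement rather than following it, since prod_lake can only be attached once per session.

```sql
-- Assumption: the unnamed (default) postgres secret fills in any
-- connection parameters missing from the ATTACH string.
CREATE SECRET (
    TYPE postgres,
    HOST 'your-pg-host',
    PORT 5432,
    DATABASE 'ducklake_catalog',
    USER 'your-user',
    PASSWORD 'your-password'
);

-- Same catalog and data path as before, without inline credentials.
ATTACH 'ducklake:postgres:dbname=ducklake_catalog'
    AS prod_lake (DATA_PATH 's3://your-bucket/ducklake/');
USE prod_lake;
```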
Step 3: Schema Evolution and Time Travel
Schema changes are a first-class operation. Adding a column does not rewrite existing Parquet files:
ALTER TABLE prod_lake.orders ADD COLUMN channel VARCHAR DEFAULT 'online';
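DuckLake records the change as a new snapshot. Rows written before the change are not rewritten; when queried, they return the column's default ('online' here), or NULL if the column was added without a DEFAULT. To query a prior state, use the AT clause with a snapshot ID: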
-- List available snapshots
SELECT * FROM ducklake_snapshots('prod_lake');
-- Query table as it was at snapshot 1
SELECT * FROM prod_lake.orders AT (VERSION => 1);
The time travel state is stored entirely in the PostgreSQL catalog. No separate log replay or compaction job is needed.
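Snapshots can also be addressed by time rather than by ID. The sketch below assumes the orders table already existed at the chosen point in time and at snapshot 1; the one-day offset is arbitrary, and a timestamp or version that predates the table will produce an error rather than an empty result.

```sql
-- Table contents as they were 24 hours ago.
SELECT *
FROM prod_lake.orders AT (TIMESTAMP => now() - INTERVAL 1 DAY);

-- Compare the row count at snapshot 1 with the current row count.
SELECT
    (SELECT count(*) FROM prod_lake.orders AT (VERSION => 1)) AS rows_at_v1,
    (SELECT count(*) FROM prod_lake.orders)                   AS rows_now;
```

Pairing the timestamp form with the snapshot_time column returned by ducklake_snapshots('prod_lake') makes it easier to pick a point in time that actually falls inside the table's history.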
Step 4: Branching for Safe Experimentation
DuckLake v1.0 ships branching as a first-class feature. Create a branch to test changes in isolation before committing:
CREATE BRANCH staging FROM main;
USE BRANCH staging;
DELETE FROM prod_lake.orders WHERE amount < 100;
SELECT COUNT(*) FROM prod_lake.orders;
-- Merge if the result is correct
MERGE BRANCH staging INTO main;
-- Or discard without touching main
DROP BRANCH staging;
Branches use copy-on-write semantics on the underlying Parquet files, so branching does not duplicate storage. The PostgreSQL catalog tracks lineage per branch with no additional tooling.
When DuckLake Fits and When It Does Not
DuckLake is the right choice when your team is DuckDB-centric and wants minimal catalog infrastructure. It eliminates the catalog service layer that Iceberg requires (Lakekeeper, Polaris, Nessie, Unity Catalog) and the Spark dependency that Delta Lake defaults toward. A streaming DuckLake ingest creates Parquet data files plus metadata rows in PostgreSQL. The equivalent Iceberg setup generates more than 300 metadata files on S3 for the same workload, all requiring compaction and a running catalog service to stay queryable.
Use Iceberg or Delta Lake if you run multi-engine workloads where Spark, Trino, and Flink all need to read the same tables. DuckLake's primary implementation is the DuckDB extension. Other engines can query DuckLake catalogs, but the native tooling is DuckDB-first.
Practical Summary
DuckLake v1.0 moves lakehouse metadata from object storage files into a relational database. For teams that run SQL workflows, the setup reduces operational overhead to a PostgreSQL instance and an S3 bucket. Time travel, schema evolution, branching, and multi-user access all work on infrastructure you already have. The full setup takes four extension installs in a DuckDB shell and a single ATTACH command.
FAQ
What is DuckLake and how does it differ from other lakehouse formats?
DuckLake is an open table format that stores lakehouse metadata in a standard SQL database rather than as files on object storage. Apache Iceberg stores metadata as JSON and Avro files, and Delta Lake as a JSON transaction log with Parquet checkpoints; both approaches create operational overhead. A streaming ingest in Iceberg generates more than 300 metadata files that need compaction and a running catalog service. DuckLake stores the same metadata as rows in PostgreSQL, MySQL, SQLite, or a local DuckDB file, eliminating the separate catalog layer.
Do I need PostgreSQL to use DuckLake?
No. For single-user or local testing, DuckLake uses a local .ducklake file backed by DuckDB as the metadata store with no external database required. PostgreSQL or MySQL is recommended for production and multi-user workloads where multiple DuckDB clients need consistent access to the same catalog. Switching from local to PostgreSQL-backed metadata is a single change to the ATTACH connection string.
Does DuckLake support AWS S3 for data storage?
Yes. DuckLake stores data as Parquet files on object storage. For S3, install the httpfs and aws DuckDB extensions, set your AWS credentials as DuckDB SET variables, and specify the S3 path in the DATA_PATH parameter of your ATTACH command. Any S3-compatible object store works, including Google Cloud Storage, Cloudflare R2, and MinIO.
How does DuckLake time travel work?
Each INSERT, UPDATE, DELETE, or schema change creates a new snapshot recorded in the metadata catalog. List snapshots with SELECT * FROM ducklake_snapshots('your_catalog') and query any prior state with SELECT * FROM your_table AT (VERSION => snapshot_id). Because snapshots are catalog records in a database, there are no additional manifest files to compact compared to Iceberg's snapshot approach.
Which version of DuckDB is required for DuckLake?
DuckLake v1.0 requires DuckDB v1.5.2 or later. The ducklake extension installs from DuckDB's official extension repository with INSTALL ducklake; LOAD ducklake; from any DuckDB shell. Check your installed version with duckdb --version before running setup. DuckLake v1.0 ships with guaranteed backward compatibility, so catalogs created with this version remain readable in future DuckDB releases.


