How to Set Up DuckLake for Local Analytics

Arkzero Research · Apr 26, 2026 · 7 min read

Last updated Apr 26, 2026

DuckLake is a lakehouse format released to production in April 2026 that stores all metadata in a SQL database instead of scattered JSON files on object storage. Unlike Delta Lake or Apache Iceberg, it runs on a local SQLite or PostgreSQL catalog with no separate catalog server required. You can spin up a production-ready lakehouse in under five minutes using the ducklake extension for DuckDB v1.5.2, with ACID transactions, time travel, and schema evolution included out of the box.

DuckLake is a lakehouse format that stores all metadata in a SQL database instead of thousands of small files on object storage. Released as version 1.0 on April 13, 2026, it ships as a DuckDB extension and requires no catalog server to run. Install DuckDB v1.5.2, load the ducklake extension, and you have a production-ready lakehouse with ACID transactions, time travel, and schema evolution in under five minutes.

What DuckLake Actually Is

Most lakehouse formats store metadata as files. Delta Lake writes JSON transaction logs to object storage. Apache Iceberg writes manifest files and snapshot metadata. Both typically rely on a separate catalog service (Unity Catalog for Delta; Lakekeeper or Polaris for Iceberg) to coordinate access across multiple clients.

DuckLake takes a different approach. All metadata lives in a standard SQL database called the catalog. That catalog can be a local SQLite file, a shared PostgreSQL instance, or a DuckDB database. No server is needed for a local setup, and no third-party catalog service is needed for most team setups.

The performance difference is measurable. COUNT(*) queries run 8x to 258x faster than scanning Parquet files directly because the row count sits in the catalog rather than inside the data files. In streaming workloads with frequent small writes, DuckDB Labs benchmarks show 900x faster reads and 100x faster writes compared to Apache Iceberg, driven primarily by the data inlining feature covered below.

Prerequisites

You need DuckDB v1.5.2 or later. The ducklake extension shipped with this release. Check your version:

duckdb --version

If you are below v1.5.2, upgrade via your package manager:

# macOS
brew upgrade duckdb

# Python
pip install duckdb --upgrade

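No other dependencies are required for a local SQLite-backed lakehouse beyond DuckDB's own extensions. The SQLite catalog backend relies on DuckDB's sqlite extension; if it is not loaded automatically, run INSTALL sqlite; once.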
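If you script your setup, a version check along these lines can gate the remaining steps. This is a minimal sketch: parse_version and meets_minimum are hypothetical helpers, and the code assumes the output of duckdb --version contains a plain major.minor.patch number such as v1.5.2.

```python
import re

MIN_VERSION = (1, 5, 2)  # minimum DuckDB version stated in this guide


def parse_version(text):
    """Extract the first major.minor.patch triple from text such as 'v1.5.2 ...'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(text, minimum=MIN_VERSION):
    """Return True when the version found in text is at least `minimum`."""
    return parse_version(text) >= minimum


# Sample strings only; in practice, pass in the output of `duckdb --version`.
for sample in ("v1.5.2", "v1.4.0", "v1.10.0"):
    print(sample, meets_minimum(sample))
```

Comparing integer tuples rather than raw strings matters once a minor version reaches two digits: as text, "1.10.0" sorts before "1.5.2".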

Step 1: Install the ducklake Extension

Open a DuckDB shell or a Python session with import duckdb. Install the extension:

INSTALL ducklake;
LOAD ducklake;

Installation downloads the extension from DuckDB's extension repository and requires an internet connection once. After that, the extension loads offline.

Step 2: Attach a DuckLake

The ATTACH command creates a new lakehouse. The ducklake: prefix tells DuckDB to use the ducklake extension. For a local setup, use SQLite as the catalog backend:

ATTACH 'ducklake:sqlite:my_catalog.db' AS lake (DATA_PATH 'my_data/');

This creates my_catalog.db (the catalog, stored as SQLite) and my_data/ (a directory where Parquet files will eventually land). Both are created automatically if they do not exist.
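Because the catalog is an ordinary SQLite file, you can look inside it with nothing but Python's standard library. The sketch below lists the tables in a SQLite database. list_tables is a hypothetical helper, not part of DuckLake, and it opens the file read-only so that inspecting the catalog cannot alter it.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in a SQLite file, sorted alphabetically."""
    # mode=ro opens the file read-only; a missing file raises an error
    # instead of being silently created.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling list_tables('my_catalog.db') after the ATTACH above should return DuckLake's metadata tables. Treat their exact names and layout as a detail of the format version you are running rather than something to depend on.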

For a shared environment where multiple users or processes need to read and write simultaneously, use PostgreSQL:

ATTACH 'ducklake:postgres:dbname=my_catalog host=localhost' AS lake
  (DATA_PATH 's3://my-bucket/data/');

For a fully local setup with DuckDB as its own catalog:

ATTACH 'ducklake:duckdb:catalog.db' AS lake (DATA_PATH 'data/');

Step 3: Create Tables and Load Data

Once attached, write standard SQL against tables in the lake catalog:

CREATE TABLE lake.orders (
    order_id   INTEGER,
    customer   VARCHAR,
    amount     DECIMAL(10, 2),
    created_at TIMESTAMP
);

INSERT INTO lake.orders VALUES
  (1, 'Acme Corp', 1200.00, '2026-04-01 09:00:00'),
  (2, 'Beta Ltd',   450.00, '2026-04-01 10:30:00'),
  (3, 'Gamma Inc', 8750.00, '2026-04-02 14:15:00');

To load from a CSV or Parquet file:

INSERT INTO lake.orders SELECT * FROM read_csv('raw_orders.csv');
INSERT INTO lake.orders SELECT * FROM 'orders_backup.parquet';

Query the table exactly as you would any DuckDB table:

SELECT customer, SUM(amount) AS total
FROM lake.orders
GROUP BY customer
ORDER BY total DESC;

Step 4: How Data Inlining Works

DuckLake solves the small file problem at write time rather than after the fact. By default, write operations touching 10 rows or fewer land directly in the catalog database, not in new Parquet files on storage. This is called data inlining.

To confirm that small writes stay inlined:

FROM ducklake_list_files('lake', 'orders');
-- returns empty after small inserts; data is in the catalog

To flush all inlined data to Parquet files on storage, run a checkpoint:

CHECKPOINT;

For streaming ingestion where you receive a constant flow of small batches, raise the threshold:

SET ducklake_data_inlining_row_limit = 100;

This is the feature behind the 900x read speedup in DuckDB Labs benchmarks against Iceberg. Iceberg creates a new data file for every small write and requires scheduled compaction to clean up. DuckLake absorbs those writes into the catalog and writes Parquet files only when a single write exceeds the threshold or CHECKPOINT is called.
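To make the routing rule concrete, here is a toy model in Python. It is not DuckLake's implementation: ToyLake and its fields are hypothetical, and it assumes the row limit applies to each write operation separately, as described above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLake:
    """Toy model of data inlining; not DuckLake's actual implementation."""
    inline_row_limit: int = 10                         # default stated in this guide
    inlined_rows: list = field(default_factory=list)   # rows held in the catalog
    files: list = field(default_factory=list)          # each entry models one Parquet file

    def insert(self, rows):
        """Route a write: small batches stay in the catalog, large ones become a file."""
        if len(rows) <= self.inline_row_limit:
            self.inlined_rows.extend(rows)
            return "inlined"
        self.files.append(list(rows))
        return "file"

    def checkpoint(self):
        """Flush all inlined rows into a single new file."""
        if self.inlined_rows:
            self.files.append(self.inlined_rows)
            self.inlined_rows = []


lake = ToyLake()
for batch_start in range(0, 50, 5):                  # ten small writes of five rows each
    lake.insert(list(range(batch_start, batch_start + 5)))

print(len(lake.files), len(lake.inlined_rows))       # 0 50
lake.checkpoint()
print(len(lake.files), len(lake.inlined_rows))       # 1 0
```

Ten small writes produce no files at all until the checkpoint, which then emits one file of 50 rows. A format that writes one file per commit would have produced ten.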

Step 5: Sorted Tables for Faster Reads

If queries frequently filter by a particular column, declare a sort order on the table. DuckLake will pre-sort new inserts and skip irrelevant files at read time:

ALTER TABLE lake.orders SET SORTED BY (created_at ASC);

Inserts after this statement are pre-sorted automatically. Existing data is not retroactively sorted. To sort existing data, re-insert from a sorted query or use the migration scripts in the DuckDB documentation.

Sort expressions support arbitrary SQL, which means you can sort by a computed expression:

ALTER TABLE lake.events SET SORTED BY (date_trunc('day', occurred_at) ASC);
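The reason sorting helps is that each file then covers a narrow, non-overlapping range of values, so a filter can rule out most files from their minimum and maximum alone. The following toy model illustrates that idea under simplifying assumptions (fixed-size files, one integer column). The function names are hypothetical and this is not how DuckLake is implemented internally.

```python
import random


def write_files(values, rows_per_file):
    """Split values into files in arrival order and record each file's min and max."""
    files = []
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        files.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return files


def files_to_scan(files, low, high):
    """Keep only files whose [min, max] range overlaps the filter range [low, high]."""
    return [f for f in files if f["max"] >= low and f["min"] <= high]


random.seed(0)
values = list(range(1000))
random.shuffle(values)                               # unsorted arrival order

unsorted_files = write_files(values, 100)
sorted_files = write_files(sorted(values), 100)      # what a declared sort order gives you

# Filter: value BETWEEN 250 AND 260
unsorted_hits = files_to_scan(unsorted_files, 250, 260)
sorted_hits = files_to_scan(sorted_files, 250, 260)
print("unsorted files scanned:", len(unsorted_hits))
print("sorted files scanned:", len(sorted_hits))
```

With sorted input, only the file covering 200 to 299 overlaps the filter. With shuffled input, nearly every file spans most of the value range and has to be read.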

Step 6: Bucket Partitioning

For high-cardinality columns you filter on frequently, bucket partitioning distributes data across a fixed number of buckets using a murmur3 hash. This matches the bucket transform in the Apache Iceberg specification, so bucket assignments line up with those of Iceberg-compatible engines:

ALTER TABLE lake.orders SET PARTITIONED BY (bucket(8, customer));

A query with WHERE customer = 'Acme Corp' now reads only the files in one of eight buckets rather than the full table. If customers hash evenly, that is roughly one eighth of the data, so the absolute time saved grows with table size.
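The pruning logic can be sketched in a few lines of Python. Two caveats: the helper names are hypothetical, and CRC32 from the standard library stands in for murmur3 purely to keep the example dependency-free, so the bucket numbers here will not match what DuckLake or Iceberg would assign.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 8


def bucket_of(value, num_buckets=NUM_BUCKETS):
    """Map a string to a bucket. CRC32 is a stand-in hash here, not murmur3."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def partition(rows, key):
    """Group rows by the bucket of their partition key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[key])].append(row)
    return buckets


def lookup(buckets, key, value):
    """Equality filter: hash the literal, read one bucket, then filter inside it."""
    candidate_rows = buckets.get(bucket_of(value), [])
    return [row for row in candidate_rows if row[key] == value]


rows = [{"order_id": i, "customer": f"customer_{i % 40}"} for i in range(400)]
buckets = partition(rows, "customer")
matches = lookup(buckets, "customer", "customer_7")
print(len(matches))   # 10, because i % 40 == 7 for ten values of i below 400
```

The equality filter never touches the other seven buckets. Note that a bucket still holds several customers, so the rows inside it are filtered again after the bucket is chosen.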

Time Travel

DuckLake records every transaction as a snapshot. Query the table as it appeared at any past point:

SELECT * FROM lake.orders AT (TIMESTAMP => NOW() - INTERVAL '1 day');

List all available snapshots in the catalog:

FROM ducklake_snapshots('lake');

Time travel is useful for debugging bad writes, auditing changes, and rolling back to a known-good state without maintaining separate backup copies.
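Conceptually, a timestamp query resolves to a single snapshot. The sketch below assumes the natural rule, namely that the latest snapshot committed at or before the requested time is used. That rule is an assumption for illustration, snapshot_at is a hypothetical helper, and DuckLake's actual resolution logic may differ in details.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def snapshot_at(snapshot_times, as_of):
    """Return the index of the latest snapshot committed at or before `as_of`.

    `snapshot_times` must be sorted ascending. Raises LookupError when
    `as_of` predates the first snapshot.
    """
    position = bisect_right(snapshot_times, as_of)
    if position == 0:
        raise LookupError("no snapshot exists at or before the requested time")
    return position - 1


base = datetime(2026, 4, 1, 9, 0, 0)
snapshot_times = [
    base,                                            # snapshot 0
    base + timedelta(hours=1, minutes=30),           # snapshot 1
    base + timedelta(days=1, hours=5, minutes=15),   # snapshot 2
]

print(snapshot_at(snapshot_times, datetime(2026, 4, 1, 12, 0, 0)))   # 1
```

Under this rule, asking for a time before the first snapshot has no answer. That is worth keeping in mind if you try the one-day-ago query on a lake you created a few minutes earlier.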

When DuckLake Makes Sense

DuckLake suits local and single-team analytics that need lakehouse guarantees (ACID transactions, time travel, schema evolution) without running a catalog server. It works well in three situations: when data arrives in many small batches and compaction would otherwise become a recurring maintenance job; when teams share a PostgreSQL instance and need multiple DuckDB clients to coordinate safely; and when the data is too large for in-memory DuckDB but the workload does not justify a full Databricks or Snowflake contract.

For enterprises already running workloads on Spark or Flink, DuckLake is not a replacement. It covers the local and mid-scale range where those platforms are overkill.

Next Steps

After your first DuckLake is running, explore schema evolution with ALTER TABLE ... ADD COLUMN, the VARIANT type for semi-structured event data where you need field-level filter pushdown, and the migration guide for moving an existing DuckDB database into DuckLake format. DuckLake v2.0 is not on the near-term roadmap. The DuckDB team has stated the focus through 2026 is maturing the current feature set and guaranteeing backward compatibility of the v1.0 specification.

FAQ

What databases can DuckLake use as a catalog?

DuckLake supports three catalog backends: SQLite (local file, no server needed), PostgreSQL (recommended for shared or multi-user setups), and DuckDB itself. You specify the backend in the ATTACH command using the ducklake: prefix followed by the backend type and connection string.

How is DuckLake different from Apache Iceberg?

Iceberg stores all metadata as files on object storage and requires a separate catalog service such as Lakekeeper or Polaris. DuckLake stores metadata in a SQL database, which eliminates the catalog server requirement for most setups. DuckLake also includes data inlining, which absorbs small writes into the catalog rather than creating individual Parquet files, avoiding the small file problem that affects Iceberg in streaming workloads. DuckDB Labs benchmarks show 900x faster reads and 100x faster writes than Iceberg in streaming scenarios.

Does DuckLake require object storage like S3?

No. For local development and single-machine analytics, DuckLake works entirely with a local directory as the data path. Object storage such as S3, GCS, or Azure Blob is supported for team and production setups, but is not required. You set the data path when attaching the lakehouse using the DATA_PATH parameter.

What is data inlining in DuckLake?

Data inlining is a feature that stages small write operations directly in the catalog database rather than writing new Parquet files on storage. The default threshold is 10 rows. When a write operation touches that many rows or fewer, no new file is created. Writes above the threshold go straight to Parquet files, and inlined data is flushed to storage files when you run CHECKPOINT. This avoids the small file accumulation problem that affects Delta Lake and Iceberg in high-frequency write scenarios. The threshold is configurable via SET ducklake_data_inlining_row_limit.

Which version of DuckDB is required for DuckLake?

DuckLake v1.0 requires DuckDB v1.5.2 or later. Both were released on April 13, 2026. You can check your current version with duckdb --version and upgrade via brew upgrade duckdb on macOS or pip install duckdb --upgrade in Python. The ducklake extension is installed via INSTALL ducklake; LOAD ducklake; inside a DuckDB session.
