Guides

How to Analyze CSV Files with DuckDB

Arkzero ResearchApr 25, 20267 min read

Last updated Apr 25, 2026

DuckDB is a free, open-source analytical database that runs entirely in-process with no server required. It lets analysts query CSV, Parquet, and JSON files directly with standard SQL in seconds. Version 1.5.2, released April 13, 2026, brings min and max query speedups of 6 to 18 times over prior releases. This guide walks through installation, first queries on CSV files, aggregations, multi-file joins, and exporting results.

DuckDB analytical database for CSV file analysis

What DuckDB Is and Why Analysts Use It

DuckDB is an in-process analytical database released under the MIT license. Unlike PostgreSQL or MySQL, it does not run as a separate service you connect to over a network. It runs inside your terminal session, your Python script, or your command line shell. Close the session and the engine stops. Open a new one and it starts again in milliseconds.

Version 1.5.2 shipped on April 13, 2026, and is the current stable release. It adds production support for the DuckLake lakehouse format and brings performance improvements that make min and max queries 6 to 18 times faster than in prior releases, because DuckDB can now read per-chunk statistics stored alongside data blocks instead of scanning every row.

The practical value for an analyst or ops manager is direct. You have a CSV file. You want to know total revenue by region, or the 10 customers who placed the most orders last quarter. In a standard BI tool setup, that means uploading the file, waiting for it to index, clicking through a dashboard builder, and hoping the filter logic matches what you intended. With DuckDB, the same question takes three lines of SQL against the file on your hard drive.

Installing DuckDB

The fastest path is the official installer script. On macOS or Linux, run this in your terminal:

curl https://install.duckdb.org | sh

On Windows, use winget:

winget install DuckDB.cli

Verify the installation:

duckdb --version

You should see v1.5.2 or later. That is the complete setup. No Docker container, no configuration files, no background service to keep running.

If you prefer to work inside Python, the package installs the same engine as a library:

pip install duckdb

Both the CLI and the Python library use identical SQL syntax. For one-off file exploration, the CLI is faster to reach. For scripts that produce repeatable reports, the Python library fits naturally into existing pipelines.

Your First CSV Query

Start the interactive CLI by running duckdb with no arguments. This opens a prompt. To query a CSV file, reference the file path as if it were a table name:

SELECT * FROM 'sales.csv' LIMIT 10;

DuckDB reads the file, detects the delimiter (comma, tab, or pipe), identifies the header row, and infers column types automatically. The first ten rows appear in under a second for files up to several hundred megabytes.

To inspect the inferred schema before querying:

DESCRIBE SELECT * FROM 'sales.csv';

This lists column names, inferred types (VARCHAR, INTEGER, DATE, DOUBLE), and nullability. If a column that should be a date is showing as VARCHAR, override the type inference explicitly:

SELECT * FROM read_csv('sales.csv',
    columns = {'order_date': 'DATE', 'revenue': 'DOUBLE'}
);

The read_csv function accepts over 20 parameters for edge cases including custom delimiters, quoted fields, and files that lack header rows. For standard exports from Excel or Google Sheets, auto-detection handles it without any overrides.

Running Aggregations

The most common analytical question involves counting or summing something grouped by a category. DuckDB handles this with standard SQL:

SELECT
    region,
    COUNT(*) AS order_count,
    SUM(revenue) AS total_revenue,
    AVG(revenue) AS avg_order_value
FROM 'sales.csv'
GROUP BY region
ORDER BY total_revenue DESC;

This runs against a CSV file on disk. No upload step, no database connection, no waiting for indexing.

For date-based filtering, DuckDB supports standard date comparison syntax. If your CSV has an order_date column that auto-detected as a DATE type:

SELECT
    DATE_TRUNC('month', order_date) AS month,
    SUM(revenue) AS monthly_revenue
FROM 'sales.csv'
WHERE order_date >= '2025-01-01'
GROUP BY 1
ORDER BY 1;

DuckDB accepts positional GROUP BY references (GROUP BY 1 means "group by the first selected column"), which shortens repetitive queries on calculated columns.

Joining Multiple CSV Files

A frequent scenario for ops teams is combining two files: a list of orders and a separate list of customer records, each as its own CSV. DuckDB treats both file references as tables in the same query:

SELECT
    c.customer_name,
    c.region,
    SUM(o.revenue) AS total_spent
FROM 'orders.csv' o
JOIN 'customers.csv' c ON o.customer_id = c.id
GROUP BY c.customer_name, c.region
ORDER BY total_spent DESC
LIMIT 20;

Both files are read and joined in memory without any preprocessing. For files up to several gigabytes, this runs in seconds on a standard laptop. The DuckDB 1.5 release notes document min and max aggregations running 6 to 18 times faster than prior versions, because the engine checks statistics stored per data chunk rather than scanning every value individually.

Saving Results

Once you have the aggregation you need, write the result to a new CSV:

COPY (
    SELECT region, SUM(revenue) AS total_revenue
    FROM 'sales.csv'
    GROUP BY region
) TO 'revenue_by_region.csv' (HEADER, DELIMITER ',');

Or export to Parquet for faster re-use in follow-up queries:

COPY (SELECT * FROM 'sales.csv') TO 'sales.parquet' (FORMAT PARQUET);

Parquet files compress significantly compared to CSVs and load several times faster in subsequent DuckDB sessions, which matters when you return to the same dataset repeatedly.

To persist a table across sessions, open DuckDB with a named database file instead of the default in-memory mode:

duckdb mydb.duckdb

Then create a persistent table inside that session:

CREATE TABLE sales AS SELECT * FROM 'sales.csv';

The table survives after you close the session. On the next open of mydb.duckdb, the table is queryable immediately.

Reading JSON and Parquet

DuckDB reads JSON and Parquet files with the same direct syntax as CSV:

SELECT * FROM 'events.json' LIMIT 5;
SELECT * FROM 'metrics.parquet' WHERE date = '2026-04-01';

The 1.5 release series introduced a VARIANT type for semi-structured JSON data. With VARIANT, DuckDB shreds nested JSON objects into typed columns during query execution. According to the DuckDB engineering team, this makes JSON analysis up to 100 times faster compared to prior releases, which is relevant for API response logs or webhook payloads stored as JSON files.

When DuckDB Reaches Its Limits

DuckDB runs on a single machine. It does not distribute queries across a cluster. For files measured in low gigabytes analyzed on a modern laptop, performance is consistent. For multi-terabyte datasets or queries that need to run concurrently across a shared team, a cloud data warehouse or a distributed engine is more appropriate.

DuckDB also has no built-in visualization layer. Query output is a results table in the terminal or a DataFrame in Python. If you want charts without writing additional code, results need to move into a separate tool. For teams who want to skip both the SQL and the charting steps, VSLZ handles CSV analysis from a plain-language question with no queries or setup required.

Summary

DuckDB 1.5.2 gives analysts a fast, server-free SQL engine that reads CSV, JSON, and Parquet files without loading them into a database first. Installation takes one command. The first query against a CSV file takes three lines. Aggregations, joins, and exports follow standard SQL. The engine runs on any laptop and handles files up to several gigabytes with no configuration overhead.

FAQ

Does DuckDB require installing a database server?

No. DuckDB runs in-process, meaning it executes inside your terminal session or Python script with no separate server to install, configure, or keep running. You install the CLI with one command and start querying CSV files immediately.

What file formats can DuckDB read directly?

DuckDB reads CSV, TSV, JSON, Parquet, and Arrow files directly from disk using SQL. It also reads files from remote storage like Amazon S3 and Azure Blob Storage. No import step is needed; the file path appears in the FROM clause.

How large a CSV file can DuckDB handle on a laptop?

DuckDB handles files in the range of several gigabytes reliably on a standard laptop with 8 to 16 GB of RAM. Its columnar engine and disk-spill capability allow it to process datasets larger than available memory by spilling intermediate results to disk, though performance slows at that point.

Can DuckDB connect to a PostgreSQL or MySQL database?

Yes. DuckDB includes a PostgreSQL scanner extension that allows querying a live PostgreSQL database from within DuckDB. You can join a local CSV file against a remote PostgreSQL table in a single SQL query. MySQL and SQLite scanners are available through the extension system.

Is DuckDB free to use commercially?

Yes. DuckDB is released under the MIT license, which permits free use, modification, and distribution in commercial projects. There are no licensing fees or usage limits. MotherDuck offers a managed cloud service built on DuckDB for teams that want hosted infrastructure.