How to Analyze CSV Files with DuckDB

Arkzero Research · Apr 2, 2026 · 6 min read

Last updated Apr 2, 2026

DuckDB is a free, in-process SQL database that lets you query CSV files directly from your terminal, with no setup beyond installing a single binary. You write standard SQL against your files as if they were database tables, making it one of the fastest ways for analysts to filter, aggregate, and join local data without loading it into Excel, Python, or a cloud warehouse.

What DuckDB Does and Why It Matters for CSV Analysis

DuckDB is an open source, in-process analytical database. Unlike PostgreSQL or MySQL, it has no server. You download one file, run it, and start querying. The reason it has taken off among data practitioners is simple: it reads CSV, Parquet, and JSON files natively, treating them as tables without any import step.

For analysts who spend hours opening large CSVs in Excel only to watch it freeze, or who get lost setting up Python environments just to run a groupby, DuckDB offers a direct path. You write SQL, one of the most widely known data languages, and it runs against your files on your own machine. No uploads, no cloud accounts, no dependencies.

As of early 2026, DuckDB has passed 30 million downloads and regularly appears in discussions on Reddit and Hacker News as the go-to tool for local data analysis. Its latest stable release (v1.2) introduced improved CSV sniffing, faster aggregations, and better memory management for files that exceed available RAM.

Installing DuckDB

Installation takes under a minute on any operating system.

On macOS with Homebrew, run: brew install duckdb. On Linux, download the binary from the DuckDB GitHub releases page, unzip it, and place it in your PATH. On Windows, download the zip from the same releases page, extract duckdb.exe, and run it from the command prompt.

There are no background services to configure. No config files to edit. The entire tool is a single binary.

To verify the install, open a terminal and type duckdb. You should see the DuckDB shell prompt. Type .exit to leave.

Querying a CSV File in One Line

The core feature that makes DuckDB valuable for CSV work is the read_csv function. Suppose you have a file called sales.csv in your current directory. Open the DuckDB shell and run:

SELECT * FROM read_csv('sales.csv') LIMIT 10;

DuckDB automatically detects headers, column types, and delimiters. No schema definition needed. If the file uses semicolons instead of commas, DuckDB's sniffer handles that too.
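If the sniffer ever guesses wrong, read_csv also accepts explicit options, so you can pin down the delimiter and header handling yourself (the file name here is illustrative):

SELECT * FROM read_csv('sales.csv', delim = ';', header = true) LIMIT 10;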

For a quick row count:

SELECT count(*) FROM read_csv('sales.csv');

You can also query files without entering the shell at all, directly from your terminal:

duckdb -c "SELECT count(*) FROM read_csv('sales.csv')"

This makes it easy to embed DuckDB queries in shell scripts or cron jobs.
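As a sketch of that pattern, the CLI's -csv flag switches the output to comma-separated values, so a script can redirect results straight into a file or pipe them to another tool (file names are illustrative):

duckdb -csv -c "SELECT category, sum(revenue) FROM read_csv('sales.csv') GROUP BY category" > category_totals.csv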

Filtering, Grouping, and Aggregating

Standard SQL works exactly as you would expect. To find total revenue by product category:

SELECT category, sum(revenue) AS total_revenue
FROM read_csv('sales.csv')
GROUP BY category
ORDER BY total_revenue DESC;

To filter rows before aggregating:

SELECT region, count(*) AS order_count
FROM read_csv('sales.csv')
WHERE order_date >= '2026-01-01'
GROUP BY region;

DuckDB parses date strings automatically in most common formats (YYYY-MM-DD, MM/DD/YYYY, and others). If your dates use an unusual format, you can cast them explicitly with strptime.
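For example, if a date column arrives as text in day.month.year form, one approach is to read the column as VARCHAR and parse it with strptime; the column name and format string below are assumptions for illustration:

SELECT strptime(order_date, '%d.%m.%Y') AS parsed_date
FROM read_csv('sales.csv', types = {'order_date': 'VARCHAR'});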

Joining Multiple CSV Files

Analysts often need to combine data from separate exports. DuckDB handles this with standard JOIN syntax:

SELECT o.order_id, o.product, c.company_name
FROM read_csv('orders.csv') AS o
JOIN read_csv('customers.csv') AS c
  ON o.customer_id = c.id;

You can also query all CSV files in a directory at once using glob patterns:

SELECT * FROM read_csv('monthly_reports/*.csv');

DuckDB unions the files together, assuming they share the same schema. This is useful when you receive data split by month or region across separate files.
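When data is split this way, it often helps to know which file each row came from. read_csv's filename option adds the source path as an extra column:

SELECT filename, count(*) AS row_count
FROM read_csv('monthly_reports/*.csv', filename = true)
GROUP BY filename;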

Exporting Results

After running your analysis, you probably want to save the output. DuckDB supports exporting to CSV, Parquet, and JSON:

COPY (
  SELECT category, sum(revenue) AS total
  FROM read_csv('sales.csv')
  GROUP BY category
) TO 'summary.csv' (HEADER, DELIMITER ',');

For better compression and faster re-reads, export to Parquet:

COPY (SELECT * FROM read_csv('sales.csv') WHERE region = 'EMEA')
TO 'emea_sales.parquet' (FORMAT PARQUET);

You can then reopen that Parquet file in DuckDB, Python, or any tool that supports the format.

Handling Large Files

One of DuckDB's advantages over Excel and even pandas is how it handles files that exceed available memory. DuckDB uses a streaming execution engine that processes data in chunks, so a query over a 10 GB CSV on a laptop with 8 GB of RAM will still complete, as long as the final result set fits in memory.

For very large aggregations, DuckDB automatically spills intermediate results to disk. No configuration required.
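If you do want to tune this behavior, DuckDB exposes settings for the memory ceiling and the spill location. The values below are examples, not recommendations:

SET memory_limit = '4GB';
SET temp_directory = '/tmp/duckdb_spill';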

If you want to monitor query performance, use the .timer on command in the shell. This prints execution time after each query, helping you identify slow steps and optimize your SQL.

Practical Tips for Analysts

First, keep your CSV files in a dedicated project folder and run duckdb from that folder. This keeps file paths short and manageable.

Second, use CREATE TABLE AS to cache intermediate results during an analysis session:

CREATE TABLE clean_sales AS
SELECT * FROM read_csv('sales.csv')
WHERE revenue > 0 AND product IS NOT NULL;

Subsequent queries against clean_sales will be faster because DuckDB stores the data in its columnar format in memory.

Third, use the .mode command to change output formatting. .mode markdown renders results as a markdown table, which is useful if you are pasting into a report. .mode csv outputs comma-separated values for piping into other tools.
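A typical session might switch modes just before the final query; here the query is the earlier revenue summary:

.mode markdown
SELECT category, sum(revenue) AS total_revenue
FROM read_csv('sales.csv')
GROUP BY category;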

If SQL is not something you want to learn, or you need to go from a raw file to polished charts and statistical summaries without writing queries, tools like VSLZ handle that from a file upload using plain English prompts.

What DuckDB Does Not Do

DuckDB is not a visualization tool. It returns tables of numbers. For charts, you will need to export results and open them in a tool that renders graphics.

It also does not support concurrent writes from multiple users. It is designed for single-user analytical workloads, not as a production database serving a web application.

Finally, DuckDB runs locally. If you need to share a live dashboard with your team, you will need to pair it with a reporting layer or move to a cloud warehouse.

Summary

DuckDB fills a gap between spreadsheets and full data engineering stacks. For analysts who receive CSV exports and need answers fast, it removes the friction of loading data into another system. Install the binary, point it at your files, write SQL, and get results. The entire workflow stays on your machine, runs in seconds, and costs nothing.

FAQ

Can DuckDB open Excel XLSX files directly?

DuckDB does not read XLSX files natively. You need to export your Excel data as CSV first, then query the CSV with read_csv. Alternatively, the DuckDB spatial extension and community extensions add support for some formats, but CSV remains the most reliable path for spreadsheet data.

How large of a CSV file can DuckDB handle?

DuckDB can process CSV files larger than your available RAM because it uses a streaming execution engine that processes data in chunks and spills to disk when needed. Users have reported querying files over 100 GB on standard laptops. Performance depends on the complexity of your query and available disk space for temporary storage.

Is DuckDB faster than pandas for CSV analysis?

For most analytical queries on medium to large CSV files, DuckDB outperforms pandas significantly. Benchmarks show DuckDB completing aggregation queries 5 to 50 times faster than equivalent pandas operations, depending on file size and query complexity. DuckDB also uses less memory because it processes data in a columnar, streaming fashion rather than loading the entire file into memory at once.

Can I use DuckDB without knowing SQL?

DuckDB requires SQL to query data. However, SQL for basic analysis, including SELECT, WHERE, GROUP BY, and ORDER BY, can be learned in a few hours. If you prefer to skip SQL entirely, tools like VSLZ let you analyze CSV files using plain English prompts instead of writing queries.

Does DuckDB work with remote files stored in S3 or Google Cloud Storage?

Yes. DuckDB supports querying files stored in Amazon S3, Google Cloud Storage, and HTTP endpoints directly using the httpfs extension. You pass the URL instead of a local file path in the read_csv function. This lets you analyze cloud-stored data without downloading the file first.
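A minimal sketch, assuming the httpfs extension is available and using a placeholder URL:

INSTALL httpfs;
LOAD httpfs;
SELECT count(*) FROM read_csv('https://example.com/data/sales.csv');

For private S3 buckets you would additionally need to configure credentials before querying, for example via DuckDB's secrets mechanism.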
