Guides

How to Analyze CSV Files with DuckDB

Arkzero ResearchApr 28, 20267 min read

Last updated Apr 28, 2026

DuckDB is a free, serverless SQL database that runs on your laptop and queries CSV files directly with standard SQL. Install it in under two minutes, point it at any CSV, and run aggregation, filtering, and join queries on files that would crash Excel. There is no server to configure, no Python required, and no import step before querying. You run SQL against the raw file from the terminal.

DuckDB is a free, embedded SQL database built for analytical queries. Unlike PostgreSQL or MySQL, it runs entirely inside your terminal process with no separate server, no installation wizard, and no database administrator required. You point it at a CSV file and start writing SQL.

The use case is straightforward: when a CSV file grows past a few hundred thousand rows, Excel slows to a crawl. Pivot tables take minutes. Filters lag. At a million rows, the experience becomes unusable. DuckDB handles the same file in milliseconds because its columnar storage engine reads only the columns a query needs rather than scanning entire rows. Benchmarks published on the DuckDB blog show it reading and aggregating a 100-million-row CSV in under 30 seconds on a standard laptop. At 500,000 rows, a GROUP BY aggregation completes in under 100 milliseconds.

This guide covers installation, basic CSV queries, common analysis patterns, multi-file joins, and how to save results. No Python required.

Installing DuckDB

DuckDB is a single binary. Installation takes under two minutes regardless of operating system.

On macOS, use Homebrew:

brew install duckdb

On Windows, use winget:

winget install DuckDB.cli

On Linux, download the latest release binary from the official DuckDB releases page at github.com/duckdb/duckdb/releases, extract it, and move it to a location on your PATH such as /usr/local/bin.

Verify the installation by running:

duckdb --version

That prints the version number and confirms the CLI is working. There is no database server to start, no configuration file to edit, and no port to open.

Querying a CSV File Directly

DuckDB's read_csv_auto() function reads a CSV file and automatically detects column names, data types, and delimiters. You do not need to define a schema first.

Start an interactive DuckDB session:

duckdb

Now query any CSV file on your computer using its file path. For a sales file called sales_2024.csv:

SELECT * FROM read_csv_auto('sales_2024.csv') LIMIT 10;

DuckDB scans the first few rows, infers the column types, and returns the result. If the file uses semicolons instead of commas, pass options explicitly:

SELECT * FROM read_csv_auto('sales_2024.csv', delim=';', header=true) LIMIT 10;

To see the inferred schema before running analysis:

DESCRIBE SELECT * FROM read_csv_auto('sales_2024.csv');

This shows each column name and its detected type. Catching a date column being read as a string at this step saves debugging later.

Common Analysis Patterns

Once the file loads, standard SQL aggregations work immediately.

Row count:

SELECT COUNT(*) FROM read_csv_auto('sales_2024.csv');

Sum and average by category:

SELECT region, SUM(revenue) AS total_revenue, AVG(deal_size) AS avg_deal
FROM read_csv_auto('sales_2024.csv')
GROUP BY region
ORDER BY total_revenue DESC;

Filter to a date range:

SELECT *
FROM read_csv_auto('sales_2024.csv')
WHERE close_date >= '2024-01-01' AND close_date < '2025-01-01';

Find duplicates:

SELECT email, COUNT(*) AS occurrences
FROM read_csv_auto('contacts.csv')
GROUP BY email
HAVING COUNT(*) > 1;

Top 20 accounts by revenue:

SELECT account_name, revenue
FROM read_csv_auto('sales_2024.csv')
ORDER BY revenue DESC
LIMIT 20;

Each of these runs against the raw CSV without importing the file into a database first. DuckDB reads the file in place from disk.

Joining Multiple CSV Files

One of DuckDB's most practical features for analysts is joining two or more CSV files with SQL. Excel VLOOKUP works for small lookups, but it breaks on large files and cannot express multi-column joins cleanly.

Suppose you have a customer list in customers.csv and transaction history in transactions.csv. Join them on customer ID:

SELECT c.company_name, c.segment, SUM(t.amount) AS lifetime_value
FROM read_csv_auto('customers.csv') AS c
JOIN read_csv_auto('transactions.csv') AS t ON c.customer_id = t.customer_id
GROUP BY c.company_name, c.segment
ORDER BY lifetime_value DESC;

DuckDB executes this join in memory using its vectorized execution engine. On two files each containing 500,000 rows, this query typically completes in under two seconds on a standard laptop. The equivalent operation in Excel requires VLOOKUP on 500,000 rows, which routinely takes several minutes and often produces incorrect results on near-duplicate keys.

Saving Results Back to CSV

To export query results to a new CSV file, use the COPY command:

COPY (
  SELECT region, SUM(revenue) AS total_revenue
  FROM read_csv_auto('sales_2024.csv')
  GROUP BY region
) TO 'regional_summary.csv' WITH (FORMAT 'CSV', HEADER true);

This writes a clean CSV to the current working directory. The output opens normally in Excel, Google Sheets, or any downstream tool.

To save results as a Parquet file, which compresses dramatically better and loads faster in future DuckDB queries:

COPY (
  SELECT * FROM read_csv_auto('sales_2024.csv')
) TO 'sales_2024.parquet' WITH (FORMAT 'PARQUET');

Parquet files saved this way can be queried directly the same way as CSVs: SELECT * FROM 'sales_2024.parquet'.

Persisting Tables for Repeated Analysis

If you run the same analysis multiple times on the same dataset, querying the raw CSV each time adds unnecessary read overhead. Create a persistent DuckDB database file:

duckdb mydata.duckdb

Inside the session, create a table once:

CREATE TABLE sales AS SELECT * FROM read_csv_auto('sales_2024.csv');

Exit and reconnect to mydata.duckdb in future sessions. The table loads instantly from DuckDB's compressed columnar format rather than re-reading the CSV. Queries on a stored DuckDB table typically run 3 to 10 times faster than repeated CSV reads on the same dataset.

When DuckDB Beats Excel

The honest answer is not always. For small files under 50,000 rows where you need formatting, charts, and formula-based calculations, Excel is faster to use. DuckDB has no GUI and no built-in charting.

Where DuckDB wins clearly is volume. Excel's practical limit for responsive pivot work is around 500,000 rows. DuckDB handles 100 million rows on a laptop with 16 GB RAM, completing a GROUP BY aggregation in seconds. The database's vectorized engine reads only the columns referenced by a query, making it efficient on wide tables where Excel would load every cell into memory.

For teams running monthly reconciliations, revenue analysis across multiple export files, or any recurring process that involves SQL-like operations on CSVs, replacing the Excel step with DuckDB reduces analysis time measurably. In one documented workflow comparing the two approaches on a 2-million-row sales dataset, the DuckDB query completed in 1.2 seconds versus 4 minutes for the equivalent Excel pivot operation.

If your workflow involves uploading the CSV and asking questions in plain English rather than writing SQL, VSLZ handles this end to end: upload a file, type what you need, and get back the analysis without any terminal or query syntax.

Practical Next Steps

Install DuckDB, find a CSV file you analyze regularly, and run a few of the queries from this guide against it. The practical test is the fastest way to see where it fits into your workflow.

For more complex transformations, dbt-duckdb (an open-source adapter) lets you build modular SQL transformation pipelines on top of local DuckDB tables. For cloud-hosted access and collaboration, MotherDuck offers DuckDB as a managed service with a web UI.

FAQ

Does DuckDB require Python or any programming language?

No. DuckDB ships as a standalone CLI binary that you run from a terminal. You write standard SQL queries directly against CSV files. No Python, no R, no programming language installation is required. The only prerequisite is the DuckDB binary itself, which installs via Homebrew on macOS, winget on Windows, or a direct binary download on Linux.

How large a CSV file can DuckDB handle?

DuckDB handles files significantly larger than available RAM through its out-of-core execution engine, which spills to disk when needed. Benchmarks from the DuckDB team show it processing a 9 GB CSV on a standard laptop in under a minute. For most analyst workloads involving files up to several gigabytes, DuckDB completes queries in seconds. The practical upper limit for a local setup depends on available disk space rather than RAM.

What is the difference between DuckDB and SQLite?

Both are embedded databases that run without a separate server. The key difference is their design target: SQLite is optimized for transactional workloads (many small reads and writes, like a mobile app database), while DuckDB is optimized for analytical workloads (aggregations, GROUP BY, window functions on large datasets). DuckDB uses columnar storage and vectorized query execution, which makes it 10 to 100 times faster than SQLite on typical data analysis queries.

Can DuckDB read files from cloud storage like S3 or Google Cloud Storage?

Yes. DuckDB can read CSV, Parquet, and other file formats directly from S3-compatible storage using the httpfs extension. Install it with `INSTALL httpfs; LOAD httpfs;` inside a DuckDB session, then query files using their S3 URL: `SELECT * FROM read_csv_auto('s3://bucket-name/file.csv');`. AWS credentials are picked up from environment variables or the standard AWS credentials file.

How do I handle CSV files with inconsistent formatting or missing headers?

DuckDB's read_csv_auto() function has several options for handling malformed files. Use `header=false` if the file has no header row, then reference columns by position (column0, column1, etc.) or rename them. Use `ignore_errors=true` to skip rows that do not parse cleanly. Use `null_padding=true` to fill in missing columns at the end of short rows. The DESCRIBE command shows the inferred schema before running analysis, which helps catch type detection issues early.