Guides

How to Analyze CSV and Excel Files with DuckDB

Arkzero ResearchApr 5, 20267 min read

Last updated Apr 5, 2026

DuckDB is a free, open-source analytical database that runs as a single binary on your laptop. It lets you query CSV and Excel files using standard SQL without installing a server, writing Python, or uploading data to a cloud tool. Setup takes under two minutes on Mac, Windows, or Linux. This guide covers installation, querying CSV files, reading Excel spreadsheets, joining multiple data files, and exporting results.

DuckDB is a free analytical database that runs as a single binary on your laptop. You point it at a CSV or Excel file and run SQL queries against it directly, without importing data, spinning up a server, or writing Python. Installation takes under two minutes. This guide covers setup, CSV and Excel querying, joining multiple files, and exporting results.

What DuckDB Is and Why Analysts Use It

DuckDB is an in-process OLAP database. It runs as a local process rather than as a background service you start and stop. When you open the DuckDB shell, the database lives in memory or in a local file you specify. When you close it, the process ends.

For analysts who work with CSV exports, Excel files, and data dumps, this design has practical advantages:

No database server to configure or maintain
No Python or pandas installation required
Files stay on your machine and are never uploaded anywhere
Query speeds on files under 1 GB are typically under one second

According to benchmark comparisons published in April 2026, DuckDB completes a GROUP BY aggregation on 100 million rows in under 30 seconds, while pandas often takes over 100 seconds or crashes on the same dataset. For files most analysts work with (under 10 million rows), DuckDB returns results in under two seconds.

DuckDB is not a replacement for a data warehouse or a BI tool. It is the fastest path from a raw local file to a SQL answer.

Installation

DuckDB ships as a standalone binary. The install process is a single command.

Mac (Homebrew):

brew install duckdb

Windows (winget):

winget install DuckDB.cli

Linux (direct download):

curl -LO https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
mv duckdb /usr/local/bin/

After installation, confirm it works:

duckdb --version

You should see something like v1.2.x. To open an interactive SQL session, run duckdb. To exit, type .quit or press Ctrl+D.

Querying CSV Files

DuckDB reads CSV files with the read_csv() function. No import step, no table creation. You reference the file path directly in a query.

Preview the first rows:

SELECT * FROM read_csv('sales.csv') LIMIT 5;

Check column names and inferred data types:

DESCRIBE SELECT * FROM read_csv('sales.csv');

DuckDB automatically detects delimiters, quote characters, and column types. For most CSV exports from Excel or Google Sheets, auto-detection works without any flags.

Aggregate by a category column:

SELECT region, COUNT(*) AS total_orders, SUM(revenue) AS total_revenue
FROM read_csv('sales.csv')
GROUP BY region
ORDER BY total_revenue DESC;

Filter rows by value:

SELECT *
FROM read_csv('sales.csv')
WHERE revenue > 10000 AND status = 'Closed'
ORDER BY revenue DESC;

Count missing values across columns:

SELECT
  COUNT(*) - COUNT(customer_id) AS missing_customer_id,
  COUNT(*) - COUNT(revenue) AS missing_revenue
FROM read_csv('sales.csv');

The missing-value query is useful before sharing a file with a stakeholder. It surfaces data quality issues immediately.

Querying Excel Files

DuckDB includes an official Excel extension that reads .xlsx files directly. Install and load it once per session:

INSTALL excel;
LOAD excel;

Query an Excel file the same way as a CSV:

SELECT * FROM 'quarterly_report.xlsx' LIMIT 5;

Read a specific sheet by name:

SELECT * FROM read_xlsx('quarterly_report.xlsx', sheet = 'Q1_Data');

Read a specific data range within a sheet. This is useful when headers start below row 1 or when the spreadsheet has surrounding labels:

SELECT *
FROM read_xlsx('quarterly_report.xlsx', sheet = 'SalesData', range = 'B5:F200', header = true);

When loading spreadsheets with inconsistent column types, read everything as strings first and cast afterward:

SELECT
  customer_id,
  CAST(revenue AS DOUBLE) AS revenue,
  order_date
FROM read_xlsx('messy_export.xlsx', all_varchar = true, sheet = 'Sheet1')
WHERE revenue IS NOT NULL AND revenue != '';

This pattern catches type errors before they silently corrupt aggregations.

Joining Multiple Files

One of DuckDB's most practical capabilities is joining data across separate files without loading them into any database first.

Suppose you have a transactions file and a separate customer metadata file:

SELECT
  t.order_id,
  t.revenue,
  c.segment,
  c.account_manager
FROM read_csv('transactions.csv') AS t
JOIN read_csv('customers.csv') AS c
  ON t.customer_id = c.customer_id
WHERE t.status = 'Closed'
ORDER BY t.revenue DESC
LIMIT 20;

This joins two CSV files and filters in a single query. No intermediate tables, no pandas merge, no VLOOKUP.

You can also read all CSV files in a directory at once using glob syntax:

SELECT * FROM read_csv('monthly_exports/*.csv');

DuckDB stacks all matching files into one result set. This is useful when you have a directory of monthly or quarterly export files you want to analyze as a single dataset.

Exporting Results

After running a query, export the result back to CSV or Excel using COPY:

Export to CSV:

COPY (
  SELECT region, SUM(revenue) AS total
  FROM read_csv('sales.csv')
  GROUP BY region
) TO 'summary_by_region.csv' (HEADER true);

Export to Excel:

COPY (
  SELECT region, SUM(revenue) AS total
  FROM read_csv('sales.csv')
  GROUP BY region
) TO 'summary_by_region.xlsx' (FORMAT 'excel', HEADER true);

The Excel export requires the excel extension to be loaded. The output file will open correctly in Excel or Google Sheets.

Persisting a Database File

If you run queries against the same file repeatedly, create a persistent DuckDB database to avoid re-reading the CSV each time:

duckdb mydata.db

Inside the session, create a table from your CSV:

CREATE TABLE sales AS SELECT * FROM read_csv('sales.csv');

Future sessions that open mydata.db can query the sales table directly. For files under 500 MB, this cuts subsequent query times to under 100 milliseconds.

When to Use DuckDB vs Other Options

DuckDB fits well when:

You have a CSV or Excel file and want SQL without setting up a database
Your file has more rows than Excel handles reliably (over 100,000 rows)
You want to join two export files without writing Python
You need a quick data quality check before sharing a report

It is the wrong tool when:

You need live charts or a shareable dashboard (use a BI tool)
Multiple people need to query the same data simultaneously (use a data warehouse)
You want to ask questions in plain English without writing SQL at all

For that last case, tools like VSLZ let you upload a CSV or Excel file and ask questions in plain English, returning charts and summaries without any SQL.

Summary

DuckDB turns your terminal into a SQL engine for local files. Install it with one command, load your CSV or Excel file, and start querying. The Excel extension handles multi-sheet workbooks and non-standard layouts. Glob patterns let you analyze entire directories of files as one dataset. For most analyst workflows involving local exports, DuckDB is faster to set up and faster to run than any alternative that requires a server or a language runtime.

FAQ

How do I install DuckDB on a Mac?

The fastest way is Homebrew. Run `brew install duckdb` in your terminal. After installation, run `duckdb --version` to confirm it works. You can also download the binary directly from the DuckDB GitHub releases page if you prefer not to use Homebrew.

Can DuckDB read Excel xlsx files directly?

Yes. DuckDB has an official Excel extension. Load it with `INSTALL excel; LOAD excel;` and then query Excel files with `SELECT * FROM 'yourfile.xlsx'` or `read_xlsx('yourfile.xlsx', sheet = 'SheetName')`. You can target specific sheets, row ranges, and handle mixed data types.

How do I query multiple CSV files at once in DuckDB?

Use glob syntax: `SELECT * FROM read_csv('folder/*.csv')`. DuckDB stacks all matching files into a single result set automatically. This is useful for monthly export files or split datasets that share the same schema.

Is DuckDB faster than pandas for data analysis?

For most aggregation and filtering tasks, yes. Benchmarks published in April 2026 show DuckDB completing GROUP BY operations on 100 million rows in under 30 seconds, while pandas often takes over 100 seconds or crashes on the same dataset. For typical analyst files under 10 million rows, both are fast enough that the difference is less important than ease of use.

How do I save DuckDB query results to a new CSV file?

Use the COPY command: `COPY (SELECT ... FROM read_csv('input.csv') WHERE ...) TO 'output.csv' (HEADER true);`. For Excel output, add `(FORMAT 'excel', HEADER true)` instead, after loading the Excel extension with `INSTALL excel; LOAD excel;`.