Guides

How to Analyze CSV and Excel Files with DuckDB

Arkzero ResearchApr 5, 20267 min read

Last updated Apr 5, 2026

DuckDB is a free, open-source analytical database that runs as a single binary on your laptop. It lets you query CSV and Excel files using standard SQL without installing a server, writing Python, or uploading data to a cloud tool. Setup takes under two minutes on Mac, Windows, or Linux. This guide covers installation, querying CSV files, reading Excel spreadsheets, joining multiple data files, and exporting results.
DuckDB logo on a clean professional background

DuckDB is a free analytical database that runs as a single binary on your laptop. You point it at a CSV or Excel file and run SQL queries against it directly, without importing data, spinning up a server, or writing Python. Installation takes under two minutes. This guide covers setup, CSV and Excel querying, joining multiple files, and exporting results.

What DuckDB Is and Why Analysts Use It

DuckDB is an in-process OLAP database. It runs as a local process rather than as a background service you start and stop. When you open the DuckDB shell, the database lives in memory or in a local file you specify. When you close it, the process ends.

For analysts who work with CSV exports, Excel files, and data dumps, this design has practical advantages:

  • No database server to configure or maintain
  • No Python or pandas installation required
  • Files stay on your machine and are never uploaded anywhere
  • Query speeds on files under 1 GB are typically under one second

According to benchmark comparisons published in April 2026, DuckDB completes a GROUP BY aggregation on 100 million rows in under 30 seconds, while pandas often takes over 100 seconds or crashes on the same dataset. For files most analysts work with (under 10 million rows), DuckDB returns results in under two seconds.

DuckDB is not a replacement for a data warehouse or a BI tool. It is the fastest path from a raw local file to a SQL answer.

Installation

DuckDB ships as a standalone binary. The install process is a single command.

Mac (Homebrew):

brew install duckdb

Windows (winget):

winget install DuckDB.cli

Linux (direct download):

curl -LO https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
mv duckdb /usr/local/bin/

After installation, confirm it works:

duckdb --version

You should see something like v1.2.x. To open an interactive SQL session, run duckdb. To exit, type .quit or press Ctrl+D.

Querying CSV Files

DuckDB reads CSV files with the read_csv() function. No import step, no table creation. You reference the file path directly in a query.

Preview the first rows:

SELECT * FROM read_csv('sales.csv') LIMIT 5;

Check column names and inferred data types:

DESCRIBE SELECT * FROM read_csv('sales.csv');

DuckDB automatically detects delimiters, quote characters, and column types. For most CSV exports from Excel or Google Sheets, auto-detection works without any flags.

Aggregate by a category column:

SELECT region, COUNT(*) AS total_orders, SUM(revenue) AS total_revenue
FROM read_csv('sales.csv')
GROUP BY region
ORDER BY total_revenue DESC;

Filter rows by value:

SELECT *
FROM read_csv('sales.csv')
WHERE revenue > 10000 AND status = 'Closed'
ORDER BY revenue DESC;

Count missing values across columns:

SELECT
  COUNT(*) - COUNT(customer_id) AS missing_customer_id,
  COUNT(*) - COUNT(revenue) AS missing_revenue
FROM read_csv('sales.csv');

The missing-value query is useful before sharing a file with a stakeholder. It surfaces data quality issues immediately.

Querying Excel Files

DuckDB includes an official Excel extension that reads .xlsx files directly. Install and load it once per session:

INSTALL excel;
LOAD excel;

Query an Excel file the same way as a CSV:

SELECT * FROM 'quarterly_report.xlsx' LIMIT 5;

Read a specific sheet by name:

SELECT * FROM read_xlsx('quarterly_report.xlsx', sheet = 'Q1_Data');

Read a specific data range within a sheet. This is useful when headers start below row 1 or when the spreadsheet has surrounding labels:

SELECT *
FROM read_xlsx('quarterly_report.xlsx', sheet = 'SalesData', range = 'B5:F200', header = true);

When loading spreadsheets with inconsistent column types, read everything as strings first and cast afterward:

SELECT
  customer_id,
  CAST(revenue AS DOUBLE) AS revenue,
  order_date
FROM read_xlsx('messy_export.xlsx', all_varchar = true, sheet = 'Sheet1')
WHERE revenue IS NOT NULL AND revenue != '';

This pattern catches type errors before they silently corrupt aggregations.

Joining Multiple Files

One of DuckDB's most practical capabilities is joining data across separate files without loading them into any database first.

Suppose you have a transactions file and a separate customer metadata file:

SELECT
  t.order_id,
  t.revenue,
  c.segment,
  c.account_manager
FROM read_csv('transactions.csv') AS t
JOIN read_csv('customers.csv') AS c
  ON t.customer_id = c.customer_id
WHERE t.status = 'Closed'
ORDER BY t.revenue DESC
LIMIT 20;

This joins two CSV files and filters in a single query. No intermediate tables, no pandas merge, no VLOOKUP.

You can also read all CSV files in a directory at once using glob syntax:

SELECT * FROM read_csv('monthly_exports/*.csv');

DuckDB stacks all matching files into one result set. This is useful when you have a directory of monthly or quarterly export files you want to analyze as a single dataset.

Exporting Results

After running a query, export the result back to CSV or Excel using COPY:

Export to CSV:

COPY (
  SELECT region, SUM(revenue) AS total
  FROM read_csv('sales.csv')
  GROUP BY region
) TO 'summary_by_region.csv' (HEADER true);

Export to Excel:

COPY (
  SELECT region, SUM(revenue) AS total
  FROM read_csv('sales.csv')
  GROUP BY region
) TO 'summary_by_region.xlsx' (FORMAT 'excel', HEADER true);

The Excel export requires the excel extension to be loaded. The output file will open correctly in Excel or Google Sheets.

Persisting a Database File

If you run queries against the same file repeatedly, create a persistent DuckDB database to avoid re-reading the CSV each time:

duckdb mydata.db

Inside the session, create a table from your CSV:

CREATE TABLE sales AS SELECT * FROM read_csv('sales.csv');

Future sessions that open mydata.db can query the sales table directly. For files under 500 MB, this cuts subsequent query times to under 100 milliseconds.

When to Use DuckDB vs Other Options

DuckDB fits well when:

  • You have a CSV or Excel file and want SQL without setting up a database
  • Your file has more rows than Excel handles reliably (over 100,000 rows)
  • You want to join two export files without writing Python
  • You need a quick data quality check before sharing a report

It is the wrong tool when:

  • You need live charts or a shareable dashboard (use a BI tool)
  • Multiple people need to query the same data simultaneously (use a data warehouse)
  • You want to ask questions in plain English without writing SQL at all

For that last case, tools like VSLZ let you upload a CSV or Excel file and ask questions in plain English, returning charts and summaries without any SQL.

Summary

DuckDB turns your terminal into a SQL engine for local files. Install it with one command, load your CSV or Excel file, and start querying. The Excel extension handles multi-sheet workbooks and non-standard layouts. Glob patterns let you analyze entire directories of files as one dataset. For most analyst workflows involving local exports, DuckDB is faster to set up and faster to run than any alternative that requires a server or a language runtime.

FAQ

How do I install DuckDB on a Mac?

The fastest way is Homebrew. Run `brew install duckdb` in your terminal. After installation, run `duckdb --version` to confirm it works. You can also download the binary directly from the DuckDB GitHub releases page if you prefer not to use Homebrew.

Can DuckDB read Excel xlsx files directly?

Yes. DuckDB has an official Excel extension. Load it with `INSTALL excel; LOAD excel;` and then query Excel files with `SELECT * FROM 'yourfile.xlsx'` or `read_xlsx('yourfile.xlsx', sheet = 'SheetName')`. You can target specific sheets, row ranges, and handle mixed data types.

How do I query multiple CSV files at once in DuckDB?

Use glob syntax: `SELECT * FROM read_csv('folder/*.csv')`. DuckDB stacks all matching files into a single result set automatically. This is useful for monthly export files or split datasets that share the same schema.

Is DuckDB faster than pandas for data analysis?

For most aggregation and filtering tasks, yes. Benchmarks published in April 2026 show DuckDB completing GROUP BY operations on 100 million rows in under 30 seconds, while pandas often takes over 100 seconds or crashes on the same dataset. For typical analyst files under 10 million rows, both are fast enough that the difference is less important than ease of use.

How do I save DuckDB query results to a new CSV file?

Use the COPY command: `COPY (SELECT ... FROM read_csv('input.csv') WHERE ...) TO 'output.csv' (HEADER true);`. For Excel output, add `(FORMAT 'excel', HEADER true)` instead, after loading the Excel extension with `INSTALL excel; LOAD excel;`.

Related

Python code editor displaying a Polars DataFrame analytics workflow
Guides

How to Get Started with Polars for Data Analysis

Polars is a Python DataFrame library built on a Rust engine with lazy evaluation and multi-core execution. Install it with pip install polars, read CSV or Parquet files with pl.read_csv() or pl.scan_csv(), and chain filter, group-by, and aggregation expressions to analyze data. On a 1 GB CSV file with 10 million rows, Polars loads data in 1.6 seconds and uses roughly 87 percent less memory than pandas on the same task.

Arkzero Research · Jun 4, 2026
How to Use Julius AI for Data Analysis - hero image
Guides

How to Use Julius AI for Data Analysis

Julius AI is a conversational data analysis platform that lets you upload a spreadsheet or CSV, ask questions in plain English, and receive charts, summaries, and statistical outputs in seconds with no SQL or code required. It runs Python in the background, handles messy real-world files automatically, and maintains session context so you can refine results conversationally. Free accounts are capped at 15 messages per month; real analysis work requires Plus at $35 per month or higher.

Arkzero Research · May 28, 2026
OpenMetadata data catalog interface showing database schema discovery
Guides

How to Set Up OpenMetadata for Data Discovery

OpenMetadata is an open-source data catalog that gives teams a single place to discover, document, and govern their data assets. Setting it up takes under 30 minutes using Docker: spin up the containers, log into the UI at localhost:8585, then connect your first data source using one of 90+ pre-built connectors. Once ingestion runs, every table, column, and owner is searchable and lineage-linked across your entire stack.

Arkzero Research · Apr 29, 2026