How to Analyze CSV Files with DuckDB

Arkzero Research · Apr 28, 2026 · 6 min read

Last updated Apr 28, 2026

DuckDB is an open-source analytical database that runs entirely on your machine with no server setup required. Analysts can install it in under two minutes and immediately run SQL queries against local CSV, Excel, and Parquet files. Version 1.5.2, released April 2026, brought the DuckLake extension to production status. This guide covers installation, querying local files, joining datasets, filtering large files, and exporting results using DuckDB's built-in CLI.

DuckDB lets you run fast SQL queries directly against CSV, Parquet, and Excel files sitting on your laptop. No database server, no cloud account, no configuration files. Install it once with a single command, point it at a file, and query. This setup guide walks through installation, reading local data, joining multiple files, and exporting results using DuckDB's built-in CLI.

Why Analysts Are Switching to DuckDB

Traditional analytical workflows involve exporting data from a source system, loading it into a database or data warehouse, writing queries, and exporting again. For ad-hoc work on local files, this cycle can add 30 to 60 minutes of setup before any actual analysis happens.

DuckDB eliminates that setup. It reads CSV and Parquet files natively, treating each file as a table you can query directly. A benchmark published by MarkTechPost in April 2026 found that DuckDB processes group-by aggregations on 10-million-row CSV files roughly 12 times faster than equivalent Pandas operations, a result consistent with its columnar, vectorized execution engine.

The tool ships as a single binary with no dependencies. Version 1.5.2, released April 2026, brought the DuckLake extension to production status. DuckLake adds lightweight catalog management to DuckDB without requiring a separate metadata service.

Installing DuckDB

On macOS, install via Homebrew:

brew install duckdb

On Windows, download the standalone CLI binary from duckdb.org under the Releases section. Unzip and run duckdb.exe.

On Linux:

curl -L https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip -o duckdb.zip
unzip duckdb.zip
chmod +x duckdb
./duckdb

After installation, open a terminal and run duckdb (or ./duckdb from the directory where you unzipped it, if the binary is not on your PATH). You will see a D prompt. No connection string, no credentials, no port. Type .quit to exit.
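
A quick way to confirm the install worked is to run a trivial query at the prompt. This is a minimal sketch; the values returned depend on the version and extensions present on your machine:

```sql
-- Report the version of the running DuckDB CLI.
SELECT version() AS duckdb_version;

-- List known extensions and whether each is installed or loaded.
SELECT extension_name, installed, loaded
FROM duckdb_extensions();
```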

Querying a CSV File Directly

DuckDB reads CSV files without importing or loading. Point read_csv_auto at any file and query it like a table:

SELECT * FROM read_csv_auto('sales_q1.csv') LIMIT 10;

DuckDB infers column types automatically. If you need to override a type, pass schema hints:

SELECT * FROM read_csv_auto('sales_q1.csv', types={'order_date': 'DATE'}) LIMIT 10;
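
Before overriding anything, it can help to check what DuckDB actually inferred. A small sketch using the same sales_q1.csv file; the output depends entirely on your data:

```sql
-- Show each column name and the type DuckDB inferred for it.
DESCRIBE SELECT * FROM read_csv_auto('sales_q1.csv');

-- Per-column summary statistics (min, max, null percentage, and so on).
SUMMARIZE SELECT * FROM read_csv_auto('sales_q1.csv');
```

If a column shows up as VARCHAR when you expected a date or number, that is usually the one to pass in types.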

For files you query repeatedly, create a view so you can reference it by name without repeating the file path:

CREATE VIEW sales AS SELECT * FROM read_csv_auto('sales_q1.csv');
SELECT region, SUM(revenue) AS total FROM sales GROUP BY region ORDER BY total DESC;

This view lives only in the current session. If you want it to persist across sessions, open DuckDB against a named database file:

duckdb my_analysis.db

Any views or tables you create in my_analysis.db survive between sessions.
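
One design choice to be aware of: a persisted view stores only the query, while a table stores the rows. The sketch below assumes a session started with duckdb my_analysis.db; the name sales_snapshot is illustrative, not something the workflow requires:

```sql
-- A view saves the query text only. The CSV is re-read on every use,
-- so it must stay at the same path.
CREATE OR REPLACE VIEW sales AS
SELECT * FROM read_csv_auto('sales_q1.csv');

-- A table copies the rows into the database file, so later queries
-- no longer depend on the CSV being present.
CREATE OR REPLACE TABLE sales_snapshot AS
SELECT * FROM read_csv_auto('sales_q1.csv');

-- List what is defined in this database.
SHOW TABLES;
```

A view suits files that are refreshed in place; a table suits a fixed snapshot you want to keep.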

Running Aggregations and Filters

DuckDB supports a rich SQL dialect that closely follows PostgreSQL conventions, including window functions, CTEs, and regex matching. A few patterns analysts reach for most often:

Group by and count:

SELECT status, COUNT(*) AS n, AVG(order_value) AS avg_value
FROM read_csv_auto('orders.csv')
GROUP BY status
ORDER BY n DESC;

Filter by date range:

SELECT * FROM read_csv_auto('orders.csv', types={'order_date': 'DATE'})
WHERE order_date BETWEEN '2025-01-01' AND '2025-03-31';

String pattern matching:

SELECT * FROM read_csv_auto('contacts.csv')
WHERE email LIKE '%@gmail.com';
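
The window-function, CTE, and regex features mentioned above can be sketched as follows. This assumes orders.csv has the status, order_date, and order_value columns used in the earlier queries; adjust the names to your file:

```sql
-- CTE: name the file scan once, then build on it.
WITH orders AS (
  SELECT *
  FROM read_csv_auto('orders.csv', types={'order_date': 'DATE'})
),
ranked AS (
  -- Window function: number orders by value within each status.
  SELECT
    status,
    order_date,
    order_value,
    ROW_NUMBER() OVER (PARTITION BY status ORDER BY order_value DESC) AS rn
  FROM orders
)
-- Keep the three largest orders per status.
SELECT status, order_date, order_value
FROM ranked
WHERE rn <= 3
ORDER BY status, order_value DESC;

-- Regex matching: same intent as the LIKE filter, with an anchored pattern.
SELECT *
FROM read_csv_auto('contacts.csv')
WHERE regexp_matches(email, '@gmail\.com$');
```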

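DuckDB applies filters while it scans the file, so filtering a 5-million-row CSV by date streams through the data and keeps only the matching rows rather than loading the full file into memory. Because a CSV carries no index or statistics, every row still has to be read and parsed. Skipping irrelevant chunks entirely is possible with Parquet, which stores min/max statistics for each row group.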
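
To see how DuckDB plans a filtered scan like the date-range query above, you can ask for the query plan. A minimal sketch; the exact plan text and operator names vary by version:

```sql
-- Print the query plan without running the query.
EXPLAIN
SELECT *
FROM read_csv_auto('orders.csv', types={'order_date': 'DATE'})
WHERE order_date BETWEEN '2025-01-01' AND '2025-03-31';

-- Run the query and report per-operator timings and row counts.
EXPLAIN ANALYZE
SELECT COUNT(*)
FROM read_csv_auto('orders.csv', types={'order_date': 'DATE'})
WHERE order_date BETWEEN '2025-01-01' AND '2025-03-31';
```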

Joining Multiple CSV Files

One of DuckDB's most practical features for analysts is joining across files without loading either into a database first:

SELECT
  o.order_id,
  o.revenue,
  c.company_name,
  c.segment
FROM read_csv_auto('orders.csv') o
JOIN read_csv_auto('customers.csv') c ON o.customer_id = c.id
WHERE c.segment = 'Enterprise'
ORDER BY o.revenue DESC;

This works for Parquet files too. You can join a CSV against a Parquet file in the same query:

SELECT a.*, b.category
FROM read_csv_auto('transactions.csv') a
JOIN 'product_catalog.parquet' b ON a.product_id = b.id;

DuckDB handles the format differences internally. No conversion step needed.
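
Because an inner join silently drops rows with no match, it is often worth checking for orphaned keys first. A sketch using the same orders.csv and customers.csv files and the join columns from the example above:

```sql
-- Orders whose customer_id has no matching row in customers.csv.
SELECT o.order_id, o.customer_id
FROM read_csv_auto('orders.csv') o
LEFT JOIN read_csv_auto('customers.csv') c ON o.customer_id = c.id
WHERE c.id IS NULL;
```

An empty result means every order has a customer record and the inner join loses nothing.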

Reading Excel Files

DuckDB reads Excel files using the excel extension. Install it once from inside the DuckDB CLI, then load it in each new session:

INSTALL excel;
LOAD excel;
SELECT * FROM read_xlsx('report.xlsx') LIMIT 5;

If the workbook has multiple sheets, specify the sheet by name:

SELECT * FROM read_xlsx('report.xlsx', sheet='Q2 Data');

This is particularly useful for ops teams who receive Excel exports from ERP or CRM systems and need to join them against transactional data sitting in CSV format.
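
A join between an Excel sheet and a CSV might look like the sketch below. It assumes the Q2 Data sheet has an order_id column that matches orders.csv; that column is a hypothetical chosen for illustration:

```sql
-- Make the extension available in this session.
LOAD excel;

-- Enrich rows from the Excel sheet with revenue from the CSV.
SELECT x.*, o.revenue
FROM read_xlsx('report.xlsx', sheet='Q2 Data') x
JOIN read_csv_auto('orders.csv') o ON x.order_id = o.order_id;
```

Excel stores numbers as floating point, so if the key columns do not line up, checking their types with DESCRIBE is a reasonable first step.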

Exporting Query Results

DuckDB exports to CSV, Parquet, and JSON with a single COPY statement:

COPY (
  SELECT region, SUM(revenue) AS total
  FROM read_csv_auto('sales.csv')
  GROUP BY region
) TO 'regional_totals.csv' (HEADER, DELIMITER ',');

To export as Parquet, which often comes out 3 to 5 times smaller than the equivalent CSV (the ratio depends on the data) and loads faster in downstream tools:

COPY (SELECT * FROM read_csv_auto('sales.csv')) TO 'sales.parquet' (FORMAT PARQUET);
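
The JSON export mentioned above follows the same shape. A sketch reusing the regional totals query; the output file name is illustrative:

```sql
-- Write the aggregate as a single JSON array instead of one object per line.
COPY (
  SELECT region, SUM(revenue) AS total
  FROM read_csv_auto('sales.csv')
  GROUP BY region
) TO 'regional_totals.json' (FORMAT JSON, ARRAY true);

-- Sanity-check an export by reading it straight back.
-- Assumes sales.parquet was written by the COPY statement above.
SELECT COUNT(*) AS n_rows FROM 'sales.parquet';
```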

Querying Folders of Files

Analysts frequently receive data split across monthly or weekly files. DuckDB reads an entire folder in one query using a glob pattern:

SELECT month, SUM(revenue) FROM read_csv_auto('exports/*.csv') GROUP BY month;

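DuckDB scans all matching files and unions them automatically. By default, columns are matched by position, so this works cleanly when every file shares the same column layout. If schemas differ between files (extra, missing, or reordered columns), pass union_by_name=true to align columns by name and fill missing ones with NULL: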

SELECT * FROM read_csv_auto('exports/*.csv', union_by_name=true);
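
When combining many files, it helps to know which file each row came from. A sketch using the filename option on the same exports folder:

```sql
-- filename=true adds a column holding the source path of each row.
SELECT filename, COUNT(*) AS n_rows
FROM read_csv_auto('exports/*.csv', union_by_name=true, filename=true)
GROUP BY filename
ORDER BY filename;
```

A file with an unexpectedly low or high row count is usually the first sign of a truncated or duplicated export.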

This pattern replaces manual concatenation workflows that previously required Python or Power Query.

When SQL Is Not an Option

DuckDB assumes you are comfortable writing SQL. For teams where most members work in plain English rather than query syntax, platforms like VSLZ handle CSV analysis through natural language prompts, returning charts, aggregations, and statistical summaries without writing a single query.

Practical Summary

DuckDB installs in under two minutes and queries CSV and Excel files directly from the command line with no server required. Use read_csv_auto for single files, glob patterns for folders of files, and COPY ... TO for exporting results. Version 1.5.2 brought the DuckLake extension to production status for lightweight catalog management, making it practical for teams that need to share persistent views across sessions. For one-off analysis on local exports, DuckDB replaces the setup cost of spinning up a database while delivering query speeds that outperform Pandas on most aggregation workloads.

FAQ

Does DuckDB require Python to run?

No. DuckDB ships as a standalone CLI binary. Download it, add it to your PATH, and run SQL queries directly from your terminal. Python bindings are available for those who want to use DuckDB inside notebooks or scripts, but they are not required for CSV analysis.

How large a CSV file can DuckDB handle?

DuckDB can query files larger than available RAM by streaming data in chunks and spilling intermediate results to disk (out-of-core processing). In practice, files up to several hundred gigabytes can work on a standard laptop, though performance depends on the query, disk speed, and free disk space for spill files. For files in the terabyte range, MotherDuck, a managed cloud service built on DuckDB, moves the work to cloud compute.
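
For large files, a few settings control how much memory DuckDB uses and where it spills. The values and path below are illustrative assumptions, not recommendations:

```sql
-- Cap the memory DuckDB may use before it starts spilling to disk.
SET memory_limit = '4GB';

-- Directory for temporary spill files (hypothetical path; needs free space).
SET temp_directory = '/tmp/duckdb_spill';

-- Allowing results in any order can reduce memory pressure on big scans.
SET preserve_insertion_order = false;

-- Confirm the active limit.
SELECT current_setting('memory_limit') AS memory_limit;
```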

Can DuckDB join CSV files from different sources?

Yes. DuckDB treats each CSV or Parquet file as a virtual table and joins them with standard SQL syntax. You can join a local CSV against a Parquet file or even an S3-hosted file (through the httpfs extension) in the same query, without a separate download or import step.
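
A join against an S3-hosted file might look like the sketch below. The bucket name, object path, region, and credentials are all placeholders, and the secret shown is one of several ways to supply credentials:

```sql
-- Remote file access goes through the httpfs extension.
INSTALL httpfs;
LOAD httpfs;

-- Placeholder credentials; replace with your own.
CREATE SECRET my_s3 (
  TYPE s3,
  KEY_ID 'YOUR_KEY_ID',
  SECRET 'YOUR_SECRET_KEY',
  REGION 'us-east-1'
);

-- Join a local CSV against a Parquet file in a hypothetical bucket.
SELECT o.order_id, o.revenue, c.segment
FROM read_csv_auto('orders.csv') o
JOIN read_parquet('s3://example-bucket/customers.parquet') c
  ON o.customer_id = c.id;
```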

Does DuckDB read Excel files?

Yes, with the excel extension, which you install from inside DuckDB using INSTALL excel; LOAD excel;. After that, use read_xlsx() to query any .xlsx file, optionally selecting a sheet by name, the same way you would query a CSV.

How does DuckDB compare to Pandas for CSV analysis?

DuckDB uses columnar storage and vectorized execution, which gives it a significant performance advantage over Pandas on aggregation-heavy queries. A benchmark from MarkTechPost in April 2026 found DuckDB roughly 12 times faster on group-by operations over 10 million rows. Pandas remains more flexible for row-by-row transformations and machine-learning pipelines, but for SQL-style analysis DuckDB is faster and uses less memory.
