How to Query CSVs Instantly with DuckDB

Arkzero Research · Mar 26, 2026 · 8 min read

Last updated Mar 26, 2026

DuckDB is a free, in-process analytical database that lets analysts run fast SQL queries directly against CSV files without installing a database server or writing Python. Distributed as a single downloadable binary, it runs on a laptop and handles files with tens of millions of rows without loading them into memory first. As of early 2026, DuckDB had recorded over 50 percent year-over-year developer growth, driven largely by its zero-configuration setup and its speed on local data files.

DuckDB lets you run SQL queries directly against CSV files on your laptop with no server, no database installation, and no Python required. You download a single command-line binary, point it at a file, and start asking questions in plain SQL. For analysts who spend hours reformatting spreadsheets before they can analyze them, DuckDB removes that bottleneck entirely.

What DuckDB Is and Why Analysts Are Using It

DuckDB is an in-process analytical database. The practical meaning is simple: it runs as a standalone binary on your computer, without requiring a separate database server, cloud account, or configuration file. You download one file, open a terminal, and start querying data.

The project began as a research effort at Centrum Wiskunde & Informatica (CWI) in the Netherlands, with its first release in 2019. By early 2026, developer adoption had grown more than 50 percent year over year, according to the MotherDuck ecosystem newsletter, driven by its usefulness for local analytical work. That growth reflects a real gap it fills: analysts frequently have CSV exports they need to interrogate quickly, without the overhead of loading the data into a spreadsheet or setting up a full database environment.

DuckDB is optimized for analytical queries, meaning aggregations, group-bys, and joins across large row counts. It streams data from disk rather than loading it into memory first, which means a file with 10 million rows does not require 10 million rows worth of RAM to query.

Installing DuckDB

Go to duckdb.org and download the CLI binary for your operating system. On macOS, installation via Homebrew takes one command:

brew install duckdb

On Windows, download duckdb.exe and place it anywhere on your machine. Open a terminal (Command Prompt or PowerShell), navigate to the folder containing the executable, and type:

duckdb

You will see a prompt like the following:

v1.1.3 ...
Enter ".help" for usage hints.
D

The D prompt is where you type SQL queries. There are no credentials to configure, no ports to open, and no services to maintain. The entire setup takes under two minutes on a clean machine.
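If you prefer one-off commands over the interactive shell, the CLI can also run a query passed with the -c flag and exit immediately (shown here with a trivial query):

duckdb -c "SELECT 42 AS answer;"

This is convenient for scripting: the output prints to the terminal and the process exits, so the command can sit inside a shell script or a scheduled job.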

Querying Your First CSV

Place a CSV file in an accessible folder. To read the first ten rows:

SELECT * FROM 'sales_data.csv' LIMIT 10;

DuckDB reads the file directly. It infers column names from the header row and assigns data types automatically. To see what types it assigned:

DESCRIBE SELECT * FROM 'sales_data.csv';

This returns each column name alongside its inferred type, such as VARCHAR, DOUBLE, or DATE. Knowing the inferred type matters because aggregation and filtering behave differently depending on whether a column like order_date has been read as a date or as a plain text string.
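If type inference guesses wrong, for example reading a date column as plain text, you can override it explicitly with the read_csv function. The column name below follows the sales_data.csv examples used throughout this guide:

SELECT *
FROM read_csv('sales_data.csv', types = {'order_date': 'DATE'})
LIMIT 10;

The types parameter takes a map of column names to the types you want, and any columns you leave out are still inferred automatically.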

To count total rows:

SELECT COUNT(*) FROM 'sales_data.csv';

Running Aggregation Queries

The most common analyst task is summarizing data: totals by category, averages by region, counts by date. DuckDB handles these with standard SQL.

Total revenue by product category:

SELECT category, SUM(amount) AS total_revenue
FROM 'sales_data.csv'
GROUP BY category
ORDER BY total_revenue DESC;

Top ten customers by total spend:

SELECT customer_name, SUM(amount) AS total_spend
FROM 'sales_data.csv'
GROUP BY customer_name
ORDER BY total_spend DESC
LIMIT 10;

Monthly revenue trend:

SELECT DATE_TRUNC('month', CAST(order_date AS DATE)) AS month,
       SUM(amount) AS monthly_revenue
FROM 'sales_data.csv'
GROUP BY month
ORDER BY month;

The DATE_TRUNC function rounds each date down to the first day of its month, grouping all transactions in the same period together. This produces the same output as a monthly pivot table in Excel, but runs in milliseconds on files that would stall or crash a spreadsheet application entirely.

Joining Two CSV Files

A particularly useful capability for operations analysts is joining two CSV files directly, without copying data between tabs or workbooks. Suppose you have orders.csv with a customer_id field and customers.csv with customer details. A join query looks like this:

SELECT o.order_id, c.company_name, c.region, o.amount
FROM 'orders.csv' AS o
JOIN 'customers.csv' AS c ON o.customer_id = c.id
ORDER BY o.amount DESC
LIMIT 20;

This produces a combined view without modifying either source file. If the join produces unexpected row counts, you can diagnose mismatches by comparing distinct key counts across both files:

SELECT COUNT(DISTINCT customer_id) FROM 'orders.csv';
SELECT COUNT(DISTINCT id) FROM 'customers.csv';

Mismatched counts indicate that some orders reference customers not present in the customer file. This is a data quality issue that a VLOOKUP in Excel would surface only as scattered #N/A cells, errors that are easy to overlook in downstream reports.
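To see exactly which orders lack a matching customer, a standard left anti-join pattern lists the orphaned rows (column names as in the earlier examples):

SELECT o.order_id, o.customer_id, o.amount
FROM 'orders.csv' AS o
LEFT JOIN 'customers.csv' AS c ON o.customer_id = c.id
WHERE c.id IS NULL;

Any rows this returns are the ones an inner join would silently drop, so it is worth running before trusting the joined totals.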

Checking Data Quality Before You Analyze

Drawing conclusions from unchecked data is one of the most common sources of reporting errors. DuckDB makes quick quality checks straightforward.

Count missing values per column:

SELECT
  COUNT(*) - COUNT(customer_name) AS missing_customer_name,
  COUNT(*) - COUNT(amount) AS missing_amount,
  COUNT(*) - COUNT(order_date) AS missing_order_date
FROM 'sales_data.csv';

Find duplicate rows by a key field:

SELECT order_id, COUNT(*) AS occurrences
FROM 'sales_data.csv'
GROUP BY order_id
HAVING COUNT(*) > 1;

Summarize a numeric column:

SELECT
  MIN(amount) AS min_val,
  MAX(amount) AS max_val,
  AVG(amount) AS avg_val,
  MEDIAN(amount) AS median_val,
  STDDEV(amount) AS std_dev
FROM 'sales_data.csv';

Running these three checks takes about thirty seconds and catches the most common problems: missing key fields that will cause silent errors in any aggregate, duplicate rows from a double export, and extreme outliers that skew the mean away from a representative value. The median is useful here precisely because it is not affected by outliers the way the average is.
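DuckDB also offers a SUMMARIZE shorthand that profiles every column at once, reporting min, max, null percentage, and approximate distinct counts, which works well as a first pass before the targeted checks above:

SUMMARIZE SELECT * FROM 'sales_data.csv';

One statement replaces several hand-written aggregates, at the cost of less control over exactly which statistics you get.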

Exporting Results to a New File

Once a query produces the output you need, save it directly to a new CSV for sharing or import into another tool:

COPY (
  SELECT category, SUM(amount) AS total_revenue
  FROM 'sales_data.csv'
  GROUP BY category
  ORDER BY total_revenue DESC
) TO 'revenue_by_category.csv' (HEADER, DELIMITER ',');

To export to Parquet format, which is smaller and faster to re-query:

COPY (SELECT * FROM 'sales_data.csv')
TO 'sales_data.parquet' (FORMAT PARQUET);

Parquet files are column-oriented, meaning DuckDB reads only the columns your query needs rather than scanning every field in every row. For large datasets with many columns, this reduces query time substantially. Once your data is in Parquet, subsequent queries run noticeably faster than re-reading the original CSV each time.
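After the export, later sessions can query the Parquet file exactly as they queried the CSV, using its path as the table name:

SELECT category, SUM(amount) AS total_revenue
FROM 'sales_data.parquet'
GROUP BY category
ORDER BY total_revenue DESC;

Because only the category and amount columns are read from disk, this query touches a fraction of the file regardless of how many other columns the dataset contains.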

Performance on Large Files

DuckDB was designed for analytical workloads, which means it handles large files without special tuning or configuration. A benchmark published by the MotherDuck team in early 2026 showed DuckDB completing a GROUP BY aggregation across 500 million rows in under four seconds on a standard laptop. For comparison, the same operation in a conventional tool would require either a cloud data warehouse subscription or an extended wait from a local database server.

On typical analyst workflows involving files with hundreds of thousands to a few million rows, DuckDB will return results in under a second for most aggregation queries. Files in that range do not require any configuration changes or memory adjustments.

When DuckDB Fits and When It Does Not

DuckDB is the right tool when you have structured CSV or Parquet files, you are comfortable writing basic SQL, and you want fast answers without setting up any infrastructure. It suits ad-hoc analysis, data cleaning before import, and building summary exports from raw data files.

It is less suited for interactive dashboards, collaborative data editing, or workflows where non-technical stakeholders need answers without writing SQL. For that kind of work, platforms like VSLZ accept a file upload and let you ask questions in plain English, with the query and chart generation handled automatically.

Summary

DuckDB is a single binary that turns any laptop into a fast query engine for local CSV and Parquet files. The core workflow is: download the CLI, open a terminal, write SQL against your files. Practical operations include checking data types and row counts, running aggregations and joins across multiple files, validating data quality before analysis, and exporting clean results. For analysts who regularly work with large CSV exports and need reliable answers without setting up infrastructure, it removes the main friction point of waiting for spreadsheet tools to process the data.

FAQ

Do I need to know Python to use DuckDB?

No. DuckDB has a command-line interface that accepts plain SQL queries. You do not need Python, R, or any programming language to use it for CSV analysis. Python and R integrations are available for more advanced workflows, but the CLI alone is sufficient for querying, joining, and exporting data from CSV files.

Can DuckDB handle large CSV files that Excel cannot open?

Yes. DuckDB streams data from disk rather than loading it into memory, so it can query files with tens of millions of rows on a standard laptop without running out of memory. Excel has a row limit of approximately 1 million rows per sheet and slows significantly well before that limit. DuckDB has no practical row limit for most analytical workloads.

How do I join two CSV files in DuckDB?

Reference both files directly in a SQL JOIN statement using their file paths as table names. For example: SELECT o.order_id, c.company_name FROM 'orders.csv' AS o JOIN 'customers.csv' AS c ON o.customer_id = c.id. DuckDB reads both files and combines them based on the join condition without requiring you to import either file into a database first.

Can I export DuckDB query results to a new CSV file?

Yes. Use the COPY command to write query output to a file: COPY (SELECT category, SUM(amount) FROM 'data.csv' GROUP BY category) TO 'output.csv' (HEADER, DELIMITER ','). DuckDB also supports export to Parquet format, which is faster to re-query and significantly smaller than CSV for the same data.

Is DuckDB free to use?

Yes. DuckDB is open-source software released under the MIT license. The CLI and all core features are free with no usage limits. MotherDuck is a commercial cloud service built on DuckDB that adds collaboration and sharing features, but the local DuckDB binary itself requires no license or subscription.
