How to Analyze CSV Files with DuckDB

Arkzero Research · Mar 29, 2026 · 7 min read

Last updated Mar 29, 2026


DuckDB lets you run SQL queries directly on CSV files without importing data into a database, installing a server, or paying for cloud infrastructure. Install it with a single command, point it at any CSV file, and execute standard SQL. DuckDB runs entirely in-process on your local machine and handles files up to several gigabytes without configuration. For most CSV analysis tasks, it is faster than pandas and simpler than setting up a relational database.

What DuckDB Is and Why Analysts Use It

DuckDB is an open-source, in-process OLAP database. Unlike PostgreSQL or SQLite, which store data in their own format after import, DuckDB reads CSV, Parquet, JSON, and Excel files directly from disk as if they were database tables. There is no ETL step, no schema definition, and no connection string to configure.

For analysts who routinely work with exported flat files from CRMs, ERPs, or spreadsheet tools, this removes the most common friction in ad hoc analysis: moving data before querying it. DuckDB infers column names and types from the CSV header row automatically, then executes the full SQL query in one pass.

According to the March 2026 DuckDB Ecosystem Newsletter from MotherDuck, DuckDB processed 7.1 million geospatial data points in a workflow that previously required a full cloud pipeline, completing the operation on commodity hardware without a data warehouse. Developer adoption grew 50% year-over-year in 2025, making it one of the fastest-growing tools in the analytics ecosystem.

Install DuckDB

DuckDB has three main interfaces: a command-line shell, a Python library, and a VS Code extension.

Command-line shell (macOS/Linux):

brew install duckdb

Once installed, start the shell with duckdb. No configuration file, no port number, no credentials.

Python library:

pip install duckdb

VS Code extension:

Search "DuckDB" in the VS Code extension marketplace. The duckdb-vscode extension connects to local CSV files and runs queries directly from the editor, supporting remote sources like S3 and PostgreSQL through ATTACH statements without additional tooling.

DuckDB v1.4 introduced AES-256 encryption at rest, making it suitable for analysts handling regulated data in healthcare, finance, or legal sectors.
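
Encryption is enabled when attaching a database file. A sketch, with the file name and key as placeholders:

```sql
-- Attach an encrypted database file; queries against it work as usual.
ATTACH 'analysis.db' AS enc (ENCRYPTION_KEY 'your-passphrase');
USE enc;
```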

Query a CSV File Directly

Open the DuckDB shell and reference any CSV file by its path:

SELECT * FROM 'sales_data.csv' LIMIT 10;

DuckDB reads the file header to determine column names and infers types. No import, no CREATE TABLE, no COPY command. The file path is the table reference.

To inspect column types before running analysis:

DESCRIBE SELECT * FROM 'sales_data.csv';

This returns each column name, the inferred SQL type (VARCHAR, INTEGER, DOUBLE, DATE), and nullability. Running DESCRIBE first catches type mismatches early, such as a date column stored as text, before they affect query results.
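
Beyond DESCRIBE, the SUMMARIZE command profiles the data itself, returning per-column min, max, distinct counts, and null percentages, which surfaces data-quality issues before any analysis query runs:

```sql
-- One row of summary statistics per column.
SUMMARIZE SELECT * FROM 'sales_data.csv';
```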

Aggregate and Summarize Data

Standard GROUP BY aggregations work without modification:

SELECT
  region,
  SUM(revenue) AS total_revenue,
  COUNT(*) AS order_count,
  AVG(order_value) AS avg_order_value
FROM 'sales_data.csv'
GROUP BY region
ORDER BY total_revenue DESC;

DuckDB handles null values according to SQL standard rules: SUM and AVG skip nulls, COUNT(*) counts all rows, and COUNT(column) skips nulls.
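
To see these rules on your own file, the counts can be compared in one query (revenue stands in for any column that may contain nulls):

```sql
-- COUNT(*) counts every row; COUNT(revenue) skips rows where revenue IS NULL.
SELECT
  COUNT(*) AS all_rows,
  COUNT(revenue) AS non_null_revenue,
  COUNT(*) - COUNT(revenue) AS null_revenue
FROM 'sales_data.csv';
```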

For time-series analysis grouped by month:

SELECT
  DATE_TRUNC('month', order_date) AS month,
  SUM(revenue) AS monthly_revenue
FROM 'sales_data.csv'
WHERE order_date >= '2025-01-01'
GROUP BY 1
ORDER BY 1;

DATE_TRUNC handles date columns automatically when the column type is DATE or TIMESTAMP. If the column is stored as VARCHAR, cast it first: CAST(order_date AS DATE).
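
If the text format is nonstandard, such as day/month/year, a plain CAST will fail; strptime with an explicit format string is the safer route. A sketch, with the format adjusted to match your file:

```sql
-- strptime parses the string into a TIMESTAMP, which DATE_TRUNC accepts.
SELECT
  DATE_TRUNC('month', strptime(order_date, '%d/%m/%Y')) AS month,
  SUM(revenue) AS monthly_revenue
FROM 'sales_data.csv'
GROUP BY 1
ORDER BY 1;
```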

Join Multiple CSV Files

DuckDB can join two or more CSV files in a single query, referencing each by file path:

SELECT
  o.order_id,
  o.customer_id,
  c.name,
  c.segment,
  o.revenue
FROM 'orders.csv' AS o
JOIN 'customers.csv' AS c ON o.customer_id = c.id;

This is particularly useful when combining exported reports from different systems. A common workflow: join an orders export from an e-commerce platform with a customer segment file exported from a CRM, then aggregate by segment. Left joins, right joins, and full outer joins all work as expected.
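
The same join doubles as a reconciliation check. A left join filtered to unmatched rows lists orders whose customer_id does not appear in the CRM export, a sketch using the same file names as above:

```sql
-- Orders with no matching customer record.
SELECT o.order_id, o.customer_id
FROM 'orders.csv' AS o
LEFT JOIN 'customers.csv' AS c ON o.customer_id = c.id
WHERE c.id IS NULL;
```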

Query Multiple Files at Once With Wildcards

If data is split across multiple files with the same schema, such as monthly exports in separate files, DuckDB reads all of them in one query using a glob pattern:

SELECT * FROM 'sales_2025_*.csv';

This reads every file matching the pattern in the current directory and treats them as a single table. Passing filename = true to read_csv adds a filename column identifying each row's source file, which you can include in a GROUP BY to compare files:

SELECT
  filename,
  SUM(revenue) AS total
FROM read_csv('sales_2025_*.csv', filename = true)
GROUP BY filename;
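
If the monthly files do not share an identical schema, say a column was added mid-year, read_csv's union_by_name option aligns columns by name rather than position and fills the gaps with NULL:

```sql
-- Columns are matched by header name across files.
SELECT * FROM read_csv('sales_2025_*.csv', union_by_name = true);
```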

Use Window Functions for Ranking and Running Totals

Window functions are fully supported. To rank products by revenue within each category:

SELECT
  product_name,
  category,
  revenue,
  RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS rank_in_category
FROM 'products.csv';

Running cumulative totals:

SELECT
  month,
  revenue,
  SUM(revenue) OVER (ORDER BY month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_revenue
FROM 'monthly_sales.csv';

These are the same window functions available in PostgreSQL. If you have existing SQL for any relational database, it will generally run in DuckDB without modification.

Query Excel Files

DuckDB supports Excel files through the excel extension:

INSTALL excel;
LOAD excel;

SELECT * FROM read_xlsx('report.xlsx');

read_xlsx reads the first sheet by default. To specify a different sheet:

SELECT * FROM read_xlsx('report.xlsx', sheet = 'Sheet2');

Column names are taken from the first row. This works for .xlsx files; the older .xls format is not supported.

Export Results to CSV or Parquet

To save a query result as a new CSV:

COPY (
  SELECT region, SUM(revenue) AS total
  FROM 'sales_data.csv'
  GROUP BY region
) TO 'region_summary.csv' (HEADER, DELIMITER ',');

To export as Parquet for faster re-querying:

COPY (
  SELECT * FROM 'sales_data.csv'
) TO 'sales_data.parquet' (FORMAT PARQUET);

Parquet is a columnar format. Subsequent queries on the Parquet version run significantly faster because DuckDB reads only the columns the query references, rather than the full row.

Read Files Directly from S3

DuckDB reads files from S3 or S3-compatible storage without downloading them first:

INSTALL httpfs;
LOAD httpfs;

SET s3_region='us-east-1';
SET s3_access_key_id='your-key';
SET s3_secret_access_key='your-secret';

SELECT * FROM 's3://your-bucket/data/sales.csv' LIMIT 100;

From DuckDB v1.2 onward, credentials can be managed through the secrets API rather than inline SET statements, keeping connection strings out of query files.
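
A sketch of the secrets-based equivalent of the SET statements above, with the secret name and key values as placeholders:

```sql
-- The secret is picked up automatically for matching s3:// paths.
CREATE SECRET s3_creds (
  TYPE S3,
  KEY_ID 'your-key',
  SECRET 'your-secret',
  REGION 'us-east-1'
);

SELECT * FROM 's3://your-bucket/data/sales.csv' LIMIT 100;
```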

Practical Limits

DuckDB runs entirely in-process on your machine, so the practical ceiling is set by local CPU, RAM, and disk. When a workload exceeds available RAM, DuckDB spills intermediate results to disk automatically, so datasets well beyond memory remain queryable on a laptop, just more slowly. For workloads that outgrow a single machine, MotherDuck extends DuckDB to a managed cloud backend while keeping the same SQL interface and local query execution model.

DuckDB has no native graphical query builder or chart output. For teams that want a point-and-click interface over DuckDB queries, tools like Shaper provide an open-source browser dashboard. If you want to skip SQL entirely and work from plain-language questions, VSLZ AI lets you upload a CSV and ask what you need in plain English, running the underlying analysis in one step.

Summary

DuckDB is the fastest path from a CSV file to a SQL query result for analysts who already know SQL. Install it in one command, reference any file as a table, and run aggregations, joins, window functions, and multi-file queries without importing or transforming data first. For multi-file datasets, wildcard reads and Parquet export make large-scale analysis practical on a standard laptop without any cloud infrastructure.

FAQ

Does DuckDB work with CSV files that have no header row?

Yes. If your CSV file has no header row, DuckDB assigns automatic column names (column0, column1, etc.) by default. You can override this by passing header=false and providing custom names: SELECT * FROM read_csv('data.csv', header=false, columns={'col1': 'VARCHAR', 'col2': 'INTEGER'}). DuckDB's read_csv function supports several parameters for handling irregular files, including custom delimiters, quote characters, and null value strings.
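
For instance, a semicolon-delimited file that marks missing values as NA can be read with explicit options:

```sql
-- delim, quote, and nullstr override the auto-detected dialect.
SELECT * FROM read_csv('data.csv', delim = ';', quote = '"', nullstr = 'NA');
```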

How does DuckDB compare to pandas for CSV analysis?

DuckDB is generally faster than pandas for aggregation and grouping operations, especially on large files. On a standard laptop, DuckDB processes a 1 GB CSV with a grouped aggregation in approximately 3 seconds versus 18 seconds for the equivalent pandas operation. DuckDB also uses less memory because it processes data in a columnar format rather than loading the entire file into a DataFrame. For analysts who already know SQL, DuckDB also has a lower learning curve than pandas for relational operations like joins and window functions.

Can I use DuckDB without knowing Python?

Yes. DuckDB provides a standalone command-line shell that runs without Python. Install it via Homebrew on macOS (brew install duckdb) or download a binary directly from duckdb.org. The CLI accepts standard SQL and requires no programming knowledge. There is also a VS Code extension that provides a graphical interface for writing queries and viewing results without using a terminal.

Can DuckDB query CSV files stored on S3 or Google Cloud Storage?

Yes. Install the httpfs extension (INSTALL httpfs; LOAD httpfs;), set your credentials, and reference the file by its S3 or GCS URL directly in a FROM clause. DuckDB streams the file without downloading it locally. For GCS, use the s3_ credential variables with the GCS S3-compatible endpoint. From v1.2, the secrets API provides a cleaner way to manage credentials than inline SET statements.
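
A minimal sketch for GCS using HMAC credentials and the S3-compatible endpoint, with the bucket and keys as placeholders:

```sql
INSTALL httpfs;
LOAD httpfs;

-- Point the S3 client at the GCS interoperability endpoint.
SET s3_endpoint = 'storage.googleapis.com';
SET s3_access_key_id = 'your-hmac-key';
SET s3_secret_access_key = 'your-hmac-secret';

SELECT * FROM 's3://your-gcs-bucket/data/sales.csv' LIMIT 100;
```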

What file formats does DuckDB support besides CSV?

DuckDB natively supports CSV, Parquet, JSON, and NDJSON. The excel extension adds Excel (.xlsx) support. The spatial extension adds GeoJSON and Shapefile support. DuckDB can also attach to live PostgreSQL databases, query pandas DataFrames in Python, and read Arrow format directly. In early 2026, DuckDB added support for Vortex, a next-generation columnar format showing performance gains over Parquet on analytical workloads.
