
How to Analyze Large CSV Files with DuckDB

Arkzero Research · Apr 6, 2026 · 7 min read

Last updated Apr 6, 2026

DuckDB is a free, in-process SQL database that runs directly on your laptop with no server setup required. It reads multi-gigabyte CSV files in seconds, runs standard SQL queries, and exports results back to CSV. Analysts and operations teams use it to query files that crash Excel, without needing Python or cloud infrastructure. Installation takes under five minutes on Mac or Windows.

When a CSV file has more than 100,000 rows, Excel struggles. Pivot tables slow down. Filters take seconds to respond. Saving the file can trigger a crash. And past Excel's hard limit of 1,048,576 rows, the file will not fully load at all. This is not a machine performance problem. It is an architectural one: the spreadsheet format was built for interactive editing, not bulk data querying.

DuckDB solves this with a different approach. It is a database engine that runs entirely inside a single process on your laptop, with no server, no login, and no configuration. You point it at a CSV file, write a SQL query, and get an answer in milliseconds. A benchmark demonstrated by the DuckDB community in early 2026 showed querying 1.3 million rows across eight CSV files in a single statement, faster than Excel can even open a file that size.

DuckDB is free, open source, and installs in under five minutes. You do not need to know Python. You do not need to set up a database server. You do not need cloud storage. You just need a terminal and a few SQL commands you can copy from this guide.

Install DuckDB on Mac or Windows

On a Mac:

Open Terminal (press Command + Space, type "Terminal", press Enter). Run this command:

brew install duckdb

If you do not have Homebrew installed, run this first:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Then re-run the brew command.

On Windows:

Open Command Prompt (press Windows key, type "cmd", press Enter). Run:

winget install DuckDB.cli

If winget is not available, download the DuckDB CLI binary directly from duckdb.org/docs/installation and place it in a folder on your PATH.

Launch DuckDB with a graphical interface:

DuckDB ships with a built-in web UI as of version 1.2. After installing, run:

duckdb -ui

This opens a browser-based interface at localhost:4213. You can write and run queries in a built-in editor, view results in a table, and export without touching the terminal again. It is the recommended starting point for analysts who prefer not to work in a command line.

Verify the install:

In the terminal, run duckdb --version. You should see something like v1.2.0. If you get "command not found," close and reopen the terminal and try again.

Load Your CSV File

You do not need to import or convert the file. DuckDB reads CSV files directly.

Place your CSV file in a folder you can find easily. For this guide, the file is called sales.csv and it sits in your Downloads folder.

In the DuckDB shell (launched by typing duckdb in the terminal), run:

SELECT * FROM read_csv('~/Downloads/sales.csv') LIMIT 5;

This previews the first five rows. DuckDB automatically detects column names and data types from the file header, with no schema definition needed.
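Auto-detection usually gets types right, but if a column is misread (a zip code inferred as a number, for example), you can override the detected type with the types option. The column name here is illustrative; substitute your own:

SELECT * FROM read_csv('~/Downloads/sales.csv', types = {'zip_code': 'VARCHAR'}) LIMIT 5;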

To inspect the full column list and types:

DESCRIBE SELECT * FROM read_csv('~/Downloads/sales.csv');

To count the total number of rows:

SELECT COUNT(*) FROM read_csv('~/Downloads/sales.csv');

If your file has more than 500,000 rows, save it as a DuckDB table first:

CREATE TABLE sales AS SELECT * FROM read_csv('~/Downloads/sales.csv');

Now every query runs against the in-memory table instead of re-reading the file from disk, which is considerably faster.
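The in-memory table disappears when you close the shell. To keep it between sessions, launch DuckDB with a database file instead of on its own (the filename here is just an example):

duckdb ~/Downloads/sales.duckdb

Run the same CREATE TABLE statement once inside that session. The next time you open the same file, the sales table is already there, with no reload needed.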

Run Business Queries Without Python

The four queries below cover the most common analysis tasks for operations and finance teams. Each uses plain SQL. The syntax reads almost like English: SELECT which columns, FROM which table, WHERE some condition is true, GROUP BY a category, ORDER BY a value.

Query 1: Find the top products or categories by revenue

SELECT category, SUM(revenue) AS total_revenue
FROM sales
GROUP BY category
ORDER BY total_revenue DESC
LIMIT 10;

This returns the top ten categories sorted by total revenue. Replace category and revenue with your actual column names.

Query 2: Filter rows by date or amount

SELECT *
FROM sales
WHERE purchase_date >= '2025-01-01'
  AND purchase_amount > 500
ORDER BY purchase_date DESC;

This returns all transactions from 2025 onward where the amount exceeded 500. Adjust the date and amount to match your data.

Query 3: Summarize by month

SELECT
  strftime(CAST(purchase_date AS DATE), '%Y-%m') AS month,
  COUNT(*) AS total_orders,
  SUM(purchase_amount) AS total_sales
FROM sales
GROUP BY month
ORDER BY month;

The strftime function formats the date into year-month (e.g. 2025-03). This produces a clean monthly summary without needing a pivot table.
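If purchase_date is already stored as a DATE, date_trunc is an alternative that keeps the month as a real date value (which sorts and charts cleanly) rather than a text string:

SELECT
  date_trunc('month', purchase_date) AS month,
  COUNT(*) AS total_orders,
  SUM(purchase_amount) AS total_sales
FROM sales
GROUP BY month
ORDER BY month;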

Query 4: Combine multiple CSV files in one query

If your data is split across multiple files (one per month, one per region), DuckDB reads them all at once:

SELECT * FROM read_csv('~/Downloads/sales_*.csv', union_by_name=true);

The * is a wildcard that matches any filename starting with sales_. The union_by_name=true option handles minor column differences between files without crashing. In a documented test using WhoTracksMe data, this approach handled 1.3 million rows across eight CSV files in a single command on a standard laptop.
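When combining files, it often helps to know which file each row came from. The filename option adds a column containing the source path:

SELECT * FROM read_csv('~/Downloads/sales_*.csv', union_by_name=true, filename=true);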

Export Your Results

Once you have the query result you want, export it to a new CSV file:

COPY (
  SELECT category, SUM(revenue) AS total_revenue
  FROM sales
  GROUP BY category
  ORDER BY total_revenue DESC
) TO '~/Downloads/revenue_by_category.csv' (HEADER, DELIMITER ',');

This writes the query result directly to a new CSV file. You can open it in Excel, share it by email, or import it into any reporting tool.

To export to Parquet instead (smaller and faster for large datasets):

COPY (SELECT ...) TO '~/Downloads/output.parquet' (FORMAT 'PARQUET');
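DuckDB reads Parquet as easily as it writes it, so a file exported this way can be queried again later without converting it back to CSV:

SELECT * FROM read_parquet('~/Downloads/output.parquet') LIMIT 5;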

When DuckDB Beats Excel (and When to Stay)

DuckDB is the right tool when:

  • Your CSV has more than 100,000 rows and Excel is slow or crashing
  • You need to combine multiple files from different sources or time periods
  • You want to apply the same analysis repeatedly without rebuilding a pivot table every time
  • You are working with JSON exports, Parquet files, or data that Excel cannot open cleanly

Excel is still the right tool when:

  • You are editing data by hand (correcting values, adding rows, formatting cells)
  • You need to share a formatted report with stakeholders who expect a spreadsheet
  • Your dataset fits comfortably in memory and a pivot table answers your question in under a minute

The useful comparison is not DuckDB vs. Excel on the same task. It is which tool makes the task take the least time. For data exploration and aggregation on files over 100,000 rows, DuckDB consistently wins. For final formatting and sharing, most teams still use Excel or Google Sheets as the last step.

If you want to skip the SQL setup entirely, VSLZ AI lets you upload a CSV and ask questions in plain English, returning charts and summaries from a single prompt with no SQL or terminal needed.

Where to Go Next

The fastest way to start is with the DuckDB UI. Install DuckDB, run duckdb -ui, upload your CSV, and run the four queries above. A working analysis takes under 15 minutes.

For files you return to regularly, save them as named DuckDB tables so queries stay fast. For teams that need results in a shared format, use the COPY command to export back to CSV after each session.

DuckDB does not replace a full data warehouse for teams managing millions of events per day. But for the analyst or operations manager who needs to answer a business question from a large file today, it is the fastest path from raw data to answer without cloud setup or engineering support.

FAQ

Does DuckDB work with CSV files without installing Python?

Yes. DuckDB is a standalone binary that reads CSV files natively. You install it via Homebrew on Mac or winget on Windows, then point it directly at any CSV file. No Python, no pandas, no additional libraries required.

How large a CSV file can DuckDB handle?

DuckDB has no hard row or file size limit. In practice, it handles files up to several gigabytes on a standard laptop. For very large files (10GB+), using a persistent database file rather than in-memory mode is recommended. The community has documented single-query analysis of 1.3 million rows across eight files on a consumer laptop.

Can I use DuckDB without knowing SQL?

You need basic SQL to write queries, but DuckDB's SQL is standard and readable. The four queries in this guide cover the most common analysis tasks: filtering rows, aggregating by category, summarizing by date, and combining files. Copy-pasting and adapting these four queries handles most real-world business questions. The DuckDB UI also provides autocomplete that helps non-technical users.

How is DuckDB different from SQLite?

Both run locally without a server, but they are designed for different tasks. SQLite is optimized for transactional workloads (many small reads and writes), making it common in mobile apps and embedded systems. DuckDB is optimized for analytical queries (aggregations, filtering, grouping over large datasets). For CSV analysis and reporting, DuckDB is substantially faster than SQLite.

Can DuckDB connect to cloud storage like Google Drive or S3?

DuckDB supports reading files from AWS S3, Google Cloud Storage, Azure Blob Storage, and Cloudflare R2 via extensions. It cannot connect to Google Drive directly, but you can download files from Drive and point DuckDB at the local path. The httpfs extension also lets DuckDB read CSV or Parquet files from any public HTTP URL without downloading them first.
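As a sketch of the httpfs workflow (the URL below is a placeholder, not a real dataset):

INSTALL httpfs;
LOAD httpfs;
SELECT COUNT(*) FROM read_csv('https://example.com/data.csv');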

Related

Guides

How to Analyze Business Data with Claude AI

Claude can analyze business data from a CSV or Excel file without any coding. Upload a file to Claude.ai, describe what you want to know, and Claude reads the actual rows and calculates results using its built-in analysis tool. This guide covers the exact workflow, prompt templates that produce reliable output, and the single instruction that prevents Claude from returning estimated numbers instead of real ones.

Arkzero Research · Apr 7, 2026
Guides

How to Set Up Power BI Copilot

Power BI Copilot is a generative AI assistant built into Microsoft Power BI and Fabric that lets analysts generate reports, write DAX measures, and ask questions about data in plain English. It requires a paid Microsoft Fabric capacity (F2 or higher) or Power BI Premium (P1 or higher). Once a Fabric admin enables it in the tenant settings, the feature appears inside reports and as a standalone chat experience across all accessible semantic models.

Arkzero Research · Apr 7, 2026
Guides

How to Get Started with Polars for Data Analysis

Polars is a Python DataFrame library built on Rust that processes large datasets far faster than pandas. On 100 million rows, a GroupBy in Polars completes in under 30 seconds; pandas often crashes on the same task. It installs with one command, requires no server, and reads CSV, Parquet, and JSON directly. For analysts who hit pandas memory limits or slow runtimes, Polars is a practical alternative that requires minimal code changes.

Arkzero Research · Apr 6, 2026