Guides

How to Set Up Ibis for Portable Python Analytics

Arkzero ResearchApr 29, 20266 min read

Last updated Apr 29, 2026

Ibis is a portable Python analytics library that lets you write dataframe code once and run it against more than 24 backends -- including DuckDB, BigQuery, Snowflake, and Postgres -- without rewriting your analysis. Version 12.0.0, released in 2026, ships with DuckDB as a zero-config default, making it the fastest path from a local CSV file to production-grade analytical queries without touching a SQL string directly.

Ibis Python portable analytics library setup guide

Ibis is an open source Python library that lets you analyze data using a familiar, Pandas-like API while executing queries through whatever database engine you choose. You write expressions once; Ibis compiles them into the native query language of the backend you connect to. The result is a single script that works on a local CSV file today and runs against a company data warehouse tomorrow, changed by one line.

Ibis 12.0.0, the current release as of April 2026, supports 24 backends: Athena, BigQuery, ClickHouse, Databricks, DataFusion, DeltaLake, Druid, DuckDB, Exasol, Flink, Impala, Materialize, MSSQL, MySQL, Oracle, Polars, Postgres, PySpark, RisingWave, SingleStoreDB, Snowflake, SQLite, and Trino. It defaults to DuckDB for local work because DuckDB requires no server and reads CSV and Parquet files directly from disk.

Why Portability Matters for Data Teams

Most analytics code is not actually portable. An analyst prototyping on a local file writes Pandas; an engineer deploying the same logic to BigQuery rewrites it in SQL. That translation gap costs time and introduces bugs. Ibis closes it. According to benchmarks from the DuckDB team, OLAP-style queries on files larger than 1 GB run roughly 90 percent faster through DuckDB than through native Pandas, and Ibis routes all of that through the same API you would use on a 10-row test file.

The practical payoff is clearest in teams that prototype locally and run nightly jobs in a cloud warehouse. Analysts write the transformation logic once. The production engineer changes the connection string. No rewrite, no SQL dialect mismatch, no handoff friction.

Installation

Ibis installs from PyPI. Backends are extras, so you only pull in dependencies you need.

For local analysis with DuckDB:

pip install 'ibis-framework[duckdb]'

To add BigQuery support later:

pip install 'ibis-framework[bigquery]'

Python 3.10 or higher is required. Verify the install:

import ibis
print(ibis.__version__)  # 12.0.0

Your First Analysis with DuckDB

Ibis defaults to DuckDB as its local backend. Connecting requires no credentials and no server.

import ibis

# In-memory DuckDB -- zero config
con = ibis.duckdb.connect()

# Read a CSV directly into an Ibis table expression
sales = con.read_csv("sales_q1_2026.csv")

# Inspect the schema
print(sales.schema())

Every operation on sales is lazy. Ibis builds an expression tree and translates it to SQL only when you call .execute(). Nothing runs in Python until that point, which means complex chains of filters, joins, and aggregations are sent to the backend as a single optimized query.

# Group by region, sum revenue, sort descending
by_region = (
    sales
    .group_by("region")
    .aggregate(total_revenue=sales.revenue.sum())
    .order_by(ibis.desc("total_revenue"))
)

# Fire the query
result = by_region.execute()
print(result)

The return value is a standard Pandas DataFrame. It drops straight into matplotlib, seaborn, or any visualization library you already use.

To inspect the SQL Ibis will send to the backend before running it:

print(ibis.to_sql(by_region))

This is useful for debugging, for sharing with SQL-native colleagues, and for checking query costs before sending a large job to a cloud warehouse.

Switching Backends in One Line

Suppose your team runs the same analysis on BigQuery in production. The backend swap is a single connection change:

# Development: local DuckDB
con = ibis.duckdb.connect()

# Production: BigQuery
con = ibis.bigquery.connect(project_id="your-gcp-project", dataset_id="analytics")

The by_region expression above runs unchanged on either backend. Ibis compiles it to BigQuery Standard SQL automatically. The logic is yours; the SQL dialect is not your problem.

Connecting to an Existing Database

If your company already has data in Postgres or MySQL, Ibis connects directly without needing to export anything to a file first.

# Postgres
con = ibis.postgres.connect(
    host="db.example.com",
    port=5432,
    database="analytics",
    user="analyst",
    password="..."
)

# List available tables
print(con.list_tables())

# Reference an existing table
orders = con.table("orders")

From here, all the same Ibis expressions apply. Filters, joins, window functions, and aggregations compile to Postgres SQL. If you later migrate to Snowflake, you change the connection; the expressions stay the same.

Common Operations Reference

Filtering rows:

high_value = sales.filter(sales.revenue > 10000)

Selecting and renaming columns:

trimmed = sales.select(
    region=sales.region,
    rev=sales.revenue,
    month=sales.order_date.month()
)

Window functions:

ranked = sales.mutate(
    rank=ibis.row_number().over(
        ibis.window(group_by="region", order_by=ibis.desc("revenue"))
    )
)

Joins:

products = con.read_csv("products.csv")
enriched = sales.join(products, sales.product_id == products.id)

All of these compile to the backend's native SQL. You can chain them freely; Ibis builds a single query rather than materializing intermediate results in Python memory.

Reading Parquet Files and Remote Storage

DuckDB reads Parquet files natively, and Ibis exposes this without any conversion step.

orders = con.read_parquet("orders_2025.parquet")

For remote files on S3, set credentials on the DuckDB connection once and Ibis reads them identically to local files:

con = ibis.duckdb.connect()
con.raw_sql("SET s3_region='us-east-1'")
con.raw_sql("SET s3_access_key_id='AKIA...'")
con.raw_sql("SET s3_secret_access_key='...'")

remote = con.read_parquet("s3://your-bucket/data/*.parquet")

The *.parquet glob queries an entire S3 prefix as a single table, which is the standard pattern for partitioned data lakes. DuckDB streams the data rather than loading the full dataset into memory.

When to Use Ibis Instead of Pandas

Ibis is the right choice when any of these conditions apply. Your data does not fit comfortably in RAM: Ibis pushes computation to the backend, so you process a 20 GB file without loading it into a Python process. You need the same logic to run in multiple environments: if your team uses DuckDB locally and Snowflake or BigQuery in production, Ibis eliminates the translation cost at handoff. You want to inspect the generated SQL before running it: the .to_sql() method is useful for auditing query costs on metered cloud warehouses.

Pandas remains the better choice for highly custom row-level transformations, tight integration with scikit-learn or PyTorch, or datasets small enough to process comfortably in memory. Ibis and Pandas are designed to work together. Ibis handles heavy filtering and aggregation; the Pandas DataFrame that .execute() returns is ready for any downstream processing you already do.

If your goal is ad hoc analysis on a file you just received -- without configuring a backend or writing any code -- VSLZ lets you upload the file and describe what you need in plain English, returning charts, cleaned data, and statistical summaries in a single step.

Summary

Ibis 12.0.0 installs with pip install 'ibis-framework[duckdb]', connects to a local DuckDB instance with ibis.duckdb.connect(), and compiles lazy expressions to optimized SQL at execution time. Write the analysis once; switch the connection to run it on BigQuery, Snowflake, Postgres, or any of the 24 supported backends. Use ibis.to_sql() to inspect generated queries before sending them to a metered cloud warehouse. The library is particularly well suited for teams that prototype locally and deploy to a cloud data warehouse, or for any analyst who wants database-speed performance without learning a new API from scratch.

FAQ

What backends does Ibis support in 2026?

Ibis 12.0.0 supports 24 backends: Athena, BigQuery, ClickHouse, Databricks, DataFusion, DeltaLake, Druid, DuckDB, Exasol, Flink, Impala, Materialize, MSSQL, MySQL, Oracle, Polars, Postgres, PySpark, RisingWave, SingleStoreDB, Snowflake, SQLite, and Trino. DuckDB is the default for local development and requires no server or credentials.

How do I install Ibis with DuckDB?

Run `pip install 'ibis-framework[duckdb]'` in your terminal. Python 3.10 or higher is required. After installation, import ibis and connect with `con = ibis.duckdb.connect()` to start an in-memory DuckDB session immediately. To add cloud backends later, install the relevant extra, for example `pip install 'ibis-framework[bigquery]'`.

Can I switch from DuckDB to BigQuery without rewriting my analysis?

Yes. Ibis expressions are backend-agnostic. Change the connection from `ibis.duckdb.connect()` to `ibis.bigquery.connect(project_id='...', dataset_id='...')` and all downstream expressions compile to BigQuery Standard SQL automatically. The analysis logic stays the same; only the connection string changes.

Is Ibis faster than Pandas for large files?

For OLAP-style queries -- aggregations, group-bys, large joins -- Ibis routed through DuckDB is significantly faster than native Pandas. DuckDB benchmarks show roughly 90 percent latency reduction versus Pandas on datasets larger than 1 GB. Ibis with DuckDB processes these queries without loading the full dataset into Python memory. For small datasets or custom row-level transformations tightly coupled to Python logic, Pandas is often the simpler choice.

Does Ibis work with Parquet files and S3?

Yes. When using the DuckDB backend, Ibis reads Parquet files natively with `con.read_parquet('file.parquet')`. For S3, set the AWS credentials on the DuckDB connection using `con.raw_sql()` and then pass an S3 path including glob patterns. DuckDB streams S3 data without loading it all into memory, which makes it practical for large partitioned data lakes.