How to Set Up PandasAI for Data Analysis

Arkzero Research · Apr 8, 2026 · 6 min read

Last updated Apr 8, 2026

PandasAI is a Python library that lets you query, visualize, and clean data using plain English instead of code. Version 2 supports more than 100 LLM providers through LiteLLM, including OpenAI, Anthropic, and Google Gemini, and works with CSV files, Excel sheets, SQL databases, and cloud data warehouses. This guide walks through installation, LLM configuration, and running your first real queries on a dataset in under 15 minutes.

PandasAI is a Python library that converts plain-English questions into Pandas operations and runs them against your data. You ask what you want; it writes the code. This guide covers installation, LLM configuration, and practical patterns for getting reliable results from CSV files, Excel sheets, and SQL databases.

How PandasAI Works

PandasAI sits between your data and a large language model. When you ask a question, the library translates it into Python code using Pandas operations, executes that code against your DataFrame, and returns the result as a number, a table, or a chart. You do not write Pandas yourself.
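The loop described above (question, generated code, execution, result) can be sketched in a few lines. This is a toy illustration, not PandasAI's internals: `fake_llm` is a hypothetical stand-in for the model call, and plain Python lists replace the DataFrame so the sketch has no dependencies.

```python
# Toy illustration of the question -> code -> execution -> result loop.
# fake_llm is a hypothetical stand-in; a real LLM call would go here.

def fake_llm(question, columns):
    """Pretend to be the model: return Python source that answers the question."""
    # A real model would read the question and the schema; this stub is hardcoded.
    return "result = sum(row['revenue'] for row in rows)"

def answer(question, rows):
    columns = list(rows[0].keys()) if rows else []
    code = fake_llm(question, columns)   # 1. translate the question into code
    namespace = {"rows": rows}
    exec(code, namespace)                # 2. execute the generated code on the data
    return namespace["result"]           # 3. hand the result back to the caller

rows = [
    {"region": "EU", "revenue": 120},
    {"region": "US", "revenue": 200},
]
total = answer("What is the total revenue?", rows)
print(total)
```

The `exec` call is the step that makes sandboxing relevant: whatever the model writes runs with the same privileges as your script.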

Version 2, released in 2025, rewrote the LLM integration layer using LiteLLM, which means you can connect to OpenAI, Anthropic, Google Gemini, Azure OpenAI, or any compatible model without changing your application code. It also added a Docker sandbox option for running generated code in isolation, which matters when working with sensitive data.

The GitHub repository has over 14,000 stars, and PyPI records show consistent weekly downloads in the hundreds of thousands, placing it among the most-used AI data libraries in the Python ecosystem.

Installing PandasAI

You need Python 3.8 or later. To confirm your version:

python --version
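If you prefer to enforce the version from inside a script, a small check like the following works. The `MIN_VERSION` constant and the function name are illustrative, and the 3.8 floor is simply the figure stated in this guide.

```python
import sys

MIN_VERSION = (3, 8)  # minimum version stated in this guide

def python_is_supported(version_info=sys.version_info):
    """Return True when the major.minor version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_VERSION

if python_is_supported():
    print("Python version OK:", sys.version.split()[0])
else:
    print("Python 3.8 or later is required, found:", sys.version.split()[0])
```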

Create a virtual environment first to keep dependencies isolated:

python -m venv pandasai-env
source pandasai-env/bin/activate  # on Windows: pandasai-env\Scripts\activate

Install PandasAI with your preferred LLM provider:

pip install "pandasai[openai]"    # for OpenAI (GPT-4o, GPT-4o-mini)
pip install "pandasai[anthropic]" # for Anthropic Claude
pip install "pandasai[all]"       # all providers at once

Configuring Your LLM

PandasAI requires a working LLM to generate code. Set your API key as an environment variable rather than hardcoding it in scripts.

Using OpenAI:

In your shell:

export OPENAI_API_KEY="your-key-here"

Then in Python:

from pandasai import Agent
from pandasai.llm import OpenAI

llm = OpenAI(model="gpt-4o-mini")  # cheaper and fast enough for most queries

Using Anthropic Claude:

In your shell:

export ANTHROPIC_API_KEY="your-key-here"

Then in Python:

from pandasai.llm import Anthropic

llm = Anthropic(model="claude-3-haiku-20240307")
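A missing key usually surfaces later as a confusing authentication error, so it can help to fail fast at startup. The helper below is a minimal sketch using only the standard library. `require_api_key` and `DEMO_API_KEY` are hypothetical names, and the placeholder value is set here only so the snippet runs on its own.

```python
import os

def require_api_key(var_name):
    """Return the key stored in an environment variable, or fail with a clear message."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before running this script")
    return key

# Demo only: a placeholder variable so the sketch is runnable standalone.
os.environ.setdefault("DEMO_API_KEY", "your-key-here")
api_key = require_api_key("DEMO_API_KEY")
print("key loaded, length:", len(api_key))
```

In a real script you would call `require_api_key("OPENAI_API_KEY")` or `require_api_key("ANTHROPIC_API_KEY")` before constructing the LLM object.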

Cost note: gpt-4o-mini costs roughly $0.00015 per 1K input tokens. A typical PandasAI query runs 500 to 1,000 tokens, putting the cost at under $0.001 per query. High-volume workflows benefit from a cheaper model; complex multi-step analysis may need gpt-4o or claude-sonnet.
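The arithmetic behind that estimate, as a quick sketch. It uses only the input-token price quoted above, so output tokens are not included, and the function and constant names are illustrative.

```python
PRICE_PER_1K_INPUT_TOKENS = 0.00015  # USD, the figure quoted in this guide
QUERY_COST_CEILING = 0.001           # USD, the "under $0.001 per query" bound

def input_cost(tokens, price_per_1k=PRICE_PER_1K_INPUT_TOKENS):
    """Cost in USD of sending `tokens` input tokens."""
    return tokens / 1000 * price_per_1k

for tokens in (500, 1000):
    print(f"{tokens} input tokens -> ${input_cost(tokens):.6f}")

# Upper bound for a day of 100 queries, using the per-query ceiling.
daily_ceiling = 100 * QUERY_COST_CEILING
print(f"100 queries/day -> at most ${daily_ceiling:.2f}")
```

Both token counts land well under the per-query ceiling, which leaves headroom for the model's response.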

Loading Your Data

PandasAI works with Pandas DataFrames, meaning it handles any format Pandas can read: CSV, Excel, Parquet, JSON, and SQL query results.

From a CSV file:

import pandas as pd
from pandasai import Agent

df = pd.read_csv("sales_data.csv")
agent = Agent(df, config={"llm": llm})

From Excel:

df = pd.read_excel("quarterly_report.xlsx", sheet_name="Q1")
agent = Agent(df, config={"llm": llm})

From multiple DataFrames:

If you have related tables such as orders and customers, pass them as a list. PandasAI will join them when needed:

orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")

agent = Agent([orders, customers], config={"llm": llm})
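When you want to check a cross-table answer by hand, the equivalent Pandas code is a merge followed by an aggregation. The sketch below uses small in-memory frames in place of the CSV files, and the shared `customer_id` key and the column names are assumptions for illustration.

```python
import pandas as pd

# In-memory stand-ins for orders.csv and customers.csv (hypothetical columns).
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 11, 10],
    "revenue_usd": [50.0, 20.0, 30.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 11],
    "customer_name": ["Ada", "Grace"],
})

# Left join keeps every order, even if a customer record is missing.
joined = orders.merge(customers, on="customer_id", how="left")
revenue_by_customer = joined.groupby("customer_name")["revenue_usd"].sum()
print(revenue_by_customer)
```

If the two tables have no obviously matching column names, naming the join key in your question tends to be safer than leaving it to inference.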

Running Your First Queries

The .chat() method is the main interface. Ask a question in plain English:

response = agent.chat("What was the total revenue last quarter?")
print(response)

PandasAI translates this into a groupby or similar Pandas operation, executes it, and returns the result. You can follow up in the same session since the agent keeps prior context:

agent.chat("Which product category drove the most of that revenue?")
agent.chat("Break it down by month")

Asking for charts:

agent.chat("Plot monthly revenue as a bar chart")

This generates and saves a chart to the current directory by default, or returns a matplotlib figure depending on your configuration.

Cleaning data:

agent.chat("How many rows have missing values in the 'email' column?")
agent.chat("Fill missing prices with the column median")

Handling Common Errors

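The generated code fails to run. PandasAI v2 includes a retry mechanism: if code execution fails, it resends the error to the LLM and asks for a corrected version. Retries are triggered by execution errors, so code that runs cleanly but computes the wrong thing still needs a manual check. Set retry attempts in configuration: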

agent = Agent(df, config={"llm": llm, "max_retries": 3})
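The retry idea can be sketched generically: run the generated code, and on failure pass the error text back to the generator for another attempt. This is not PandasAI's implementation. `fake_generator` is a hypothetical stand-in for the LLM, and treating `max_retries` as attempts after the first try is an assumption.

```python
def run_with_retries(generate_code, max_retries=3):
    """Run generated code, feeding each failure back to the generator.

    generate_code(last_error) must return Python source that sets `result`.
    Returns (result, number_of_retries_used).
    """
    last_error = None
    for attempt in range(1 + max_retries):  # first try plus retries (assumption)
        code = generate_code(last_error)
        namespace = {}
        try:
            exec(code, namespace)
            return namespace["result"], attempt
        except Exception as exc:
            # Keep the error text so the next attempt can correct it.
            last_error = f"{type(exc).__name__}: {exc}"
    raise RuntimeError(f"gave up after {max_retries} retries; last error: {last_error}")

def fake_generator(last_error):
    """Stand-in for the LLM: first answer has a typo, the second is fixed."""
    if last_error is None:
        return "result = totl / 2"           # NameError: 'totl' is undefined
    return "total = 10\nresult = total / 2"  # corrected after seeing the error

result, retries_used = run_with_retries(fake_generator)
print(result, retries_used)
```

Note that the loop only reacts to exceptions, which is why a higher retry count does nothing for answers that are wrong but executable.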

The model hallucinates column names. This happens when column names are ambiguous or abbreviated. Add a plain-English description of your data:

agent = Agent(
    df,
    config={"llm": llm},
    description="Sales records with columns: order_id, customer_name, product_sku, revenue_usd, order_date"
)
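For wider tables, writing the description by hand gets tedious. A small helper can assemble it from a mapping of column names to meanings. The helper name and the meanings below are illustrative; only the column names come from the example above.

```python
def build_description(summary, columns):
    """Build a one-line dataset description from {column_name: meaning}."""
    parts = [f"{name} ({meaning})" for name, meaning in columns.items()]
    return f"{summary} with columns: " + ", ".join(parts)

description = build_description(
    "Sales records",
    {
        "order_id": "unique order identifier",
        "customer_name": "full name of the buyer",
        "product_sku": "stock keeping unit of the product",
        "revenue_usd": "order revenue in US dollars",
        "order_date": "date the order was placed",
    },
)
print(description)
```

The resulting string can be passed as the `description` argument shown above.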

Results are unreliable on complex statistical questions. LLMs get common aggregations right but struggle with percentiles, rolling averages, and regression. For those, write the Pandas code manually and use PandasAI for the plain-English parts you can verify easily.
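As an example of the manual route, percentiles and rolling averages are one-liners in Pandas. The data below is made up for illustration, and the column names are assumptions.

```python
import pandas as pd

# Six days of made-up revenue figures.
sales = pd.DataFrame({
    "order_date": pd.date_range("2026-01-01", periods=6, freq="D"),
    "revenue_usd": [100.0, 120.0, 80.0, 200.0, 150.0, 90.0],
})

# 90th percentile of revenue (linear interpolation, the Pandas default).
p90 = sales["revenue_usd"].quantile(0.90)

# 3-day rolling average; the first two rows are NaN because the window is incomplete.
sales["revenue_3d_avg"] = sales["revenue_usd"].rolling(window=3).mean()

print(p90)
print(sales)
```

Running the same question through PandasAI and comparing it against this kind of hand-written baseline is a cheap way to find out where the model can be trusted.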

Connecting to SQL Databases

PandasAI v2 supports direct SQL connections so you are not limited to flat files.

PostgreSQL:

from pandasai.connectors import PostgreSQLConnector

connector = PostgreSQLConnector({
    "host": "localhost",
    "port": 5432,
    "database": "analytics",
    "username": "user",
    "password": "pass",
    "table": "sales"
})

agent = Agent(connector, config={"llm": llm})
agent.chat("What is the average order value this month?")

BigQuery, MySQL, Snowflake, and Databricks use the same pattern with their respective connector classes.
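The connector example above hardcodes credentials for brevity. In practice it is safer to build the same dictionary from environment variables. The sketch below does only that and does not touch PandasAI. The variable names follow the usual PostgreSQL conventions (PGHOST, PGPORT, PGDATABASE, PGUSER, PGPASSWORD), the function name is hypothetical, and `demo_env` stands in for the real environment so the snippet runs on its own.

```python
import os

def postgres_config_from_env(table, env=os.environ):
    """Build a connector config dict from environment variables."""
    return {
        "host": env.get("PGHOST", "localhost"),
        "port": int(env.get("PGPORT", "5432")),
        "database": env["PGDATABASE"],  # required: raises KeyError if missing
        "username": env["PGUSER"],
        "password": env["PGPASSWORD"],
        "table": table,
    }

# Demo values only; in real use, omit `env` to read from os.environ.
demo_env = {"PGDATABASE": "analytics", "PGUSER": "user", "PGPASSWORD": "pass"}
config = postgres_config_from_env("sales", env=demo_env)

# Mask the password before printing or logging the config.
print({**config, "password": "***"})
```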

Production Best Practices

A few patterns matter when moving beyond experimentation.

Cache query results. PandasAI ships with an optional cache layer that stores query-response pairs. For dashboards or reports that run the same questions repeatedly, caching cuts LLM costs significantly:

agent = Agent(df, config={"llm": llm, "enable_cache": True})
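To see why caching saves money, here is a toy cache keyed on the normalized question plus a fingerprint of the data. It is a sketch of the general idea, not PandasAI's cache. `QueryCache`, `data_fingerprint`, and `fake_llm_answer` are hypothetical names.

```python
import hashlib

class QueryCache:
    """Toy question-to-answer cache; identical questions on identical data hit the cache."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(question, data_fingerprint):
        # Normalize case and whitespace so trivial variations share one entry.
        normalized = " ".join(question.lower().split())
        raw = f"{data_fingerprint}|{normalized}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get_or_compute(self, question, data_fingerprint, compute):
        key = self._key(question, data_fingerprint)
        if key not in self._store:
            self._store[key] = compute(question)  # the expensive LLM call
        return self._store[key]

calls = []

def fake_llm_answer(question):
    """Stand-in for a paid LLM call; records every invocation."""
    calls.append(question)
    return 42

cache = QueryCache()
first = cache.get_or_compute("Total revenue?", "sales_v1", fake_llm_answer)
second = cache.get_or_compute("  total   REVENUE? ", "sales_v1", fake_llm_answer)
third = cache.get_or_compute("Total revenue?", "sales_v2", fake_llm_answer)
print(len(calls))
```

Including a data fingerprint in the key matters: a cached answer is only valid while the underlying data is unchanged.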

Use the Docker sandbox for untrusted data. If user-uploaded files are in your pipeline, generated code running directly in your Python process is a security risk. The Docker sandbox runs each query in an isolated container:

agent = Agent(df, config={"llm": llm, "use_docker_sandbox": True})

This requires Docker to be running on the host.

Log generated code for debugging. Logging the code PandasAI generates helps you audit accuracy and catch errors early:

agent = Agent(df, config={"llm": llm, "save_logs": True})

When PandasAI Fits Best

PandasAI works best for ad hoc analysis on structured tabular data where the questions are varied and hard to anticipate in advance. Practical use cases include a sales team that needs different cuts of a report each week, a founder running one-off checks on operational data, or an analyst prototyping questions before writing permanent SQL queries.

It is less suited for real-time streaming data, highly nested JSON, or analysis that requires complex statistical methods that LLMs tend to get wrong without correction.

If you want to run similar queries without Python setup, tools like VSLZ let you upload a file and ask questions directly in a browser with no configuration needed.

Summary

Install PandasAI with pip install "pandasai[openai]", set your API key as an environment variable, load a DataFrame, and call .chat() with a plain-English question. Use multi-DataFrame support for joined queries, add a description parameter when column names are ambiguous, and enable caching for any production workload. For sensitive data pipelines, the Docker sandbox runs generated code in isolation.

FAQ

Does PandasAI work without an OpenAI API key?

Yes. PandasAI v2 uses LiteLLM, which supports more than 100 LLM providers including Anthropic Claude, Google Gemini, Azure OpenAI, and locally hosted models via Ollama. You can set up any of these instead of OpenAI. The installation command changes based on your provider: use pip install 'pandasai[anthropic]' for Claude, or pip install 'pandasai[google]' for Gemini.

What Python version does PandasAI require?

PandasAI requires Python 3.8 or later. Python 3.10 or 3.11 is recommended for best compatibility with the LiteLLM integration and optional dependencies. You can check your version with python --version before installing.

Can PandasAI connect directly to SQL databases?

Yes. PandasAI v2 includes native connectors for PostgreSQL, MySQL, BigQuery, Snowflake, and Databricks. You pass a connector object instead of a DataFrame when initializing the Agent. The connector handles query execution against the database; PandasAI translates your natural language question into SQL and runs it. Flat file formats like CSV, Excel, Parquet, and JSON are also supported directly through Pandas.

How accurate are PandasAI query results?

Accuracy depends on the LLM you choose, the clarity of column names in your data, and the complexity of the question. For common aggregations like sums, averages, counts, and group-bys, gpt-4o-mini and Claude Haiku are reliable. Complex statistical operations such as percentiles, rolling windows, or regression are less reliable and should be verified. Adding a plain-English description of your dataset significantly improves accuracy for ambiguous column names. Setting max_retries to 2 or 3 helps recover from code generation errors automatically.

Is PandasAI free to use?

The PandasAI library itself is open source under the MIT license and free to install. You pay for LLM API usage, which is separate from PandasAI. With gpt-4o-mini, a typical query costs under $0.001, so 100 queries per day cost at most about $0.10. Claude Haiku pricing is in the same range. Local models via Ollama eliminate API costs but require hardware capable of running inference.
