How to Analyze Data with PandasAI
Last updated Apr 24, 2026

PandasAI turns your dataframes into something you can talk to. Instead of writing groupby statements or pivot table logic, you type a question in plain English and PandasAI generates Python code, executes it against your data, and returns the answer directly. It works with CSV files, SQL databases, Excel sheets, and standard pandas DataFrames. The library has accumulated over 16,500 GitHub stars and is used by data teams at companies ranging from early-stage startups to large enterprises. This guide walks through installation, a basic CSV workflow, chart generation, multi-file queries, and the most common points of failure.
What You Need Before Starting
Before installing PandasAI, confirm you have:
- Python 3.8 or later (check with python --version)
- pip package manager
- An OpenAI API key with billing enabled
PandasAI also works with Anthropic Claude, Google Gemini, and local models via Ollama. OpenAI GPT-4o is the most tested option and the one used in the examples below. API costs for typical analysis sessions are low: a short session of a few questions against a mid-size CSV of 10,000 rows runs well under $0.05 at GPT-4o's current pricing of $2.50 per million input tokens and $10 per million output tokens.
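To see how the per-token prices translate into a session cost, the arithmetic can be sketched as below. The prices are the ones quoted above; the token counts per question are assumptions chosen for illustration, not measurements of what PandasAI actually sends.

```python
# Prices quoted in the text above, in US dollars per million tokens.
INPUT_PRICE_PER_MILLION = 2.50
OUTPUT_PRICE_PER_MILLION = 10.00


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in dollars for the given token counts."""
    input_cost = input_tokens * INPUT_PRICE_PER_MILLION / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_MILLION / 1_000_000
    return input_cost + output_cost


# Hypothetical session: 5 questions, each assumed to use about
# 800 input tokens (schema, sample rows, prompt) and 250 output tokens.
questions = 5
session_cost = estimate_cost(questions * 800, questions * 250)
print(f"Estimated session cost: ${session_cost:.4f}")
```

Under these assumptions the session comes to roughly two cents. Real usage varies with the number of columns, the sample size, and how long the generated code is.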
Installing PandasAI
Install using pip:
pip install pandasai
If you plan to generate charts, also install matplotlib:
pip install matplotlib
PandasAI version 3 changed the core API significantly. Older tutorials using PandasAI(llm) and pandas_ai.run(df, "...") are v1 patterns that no longer work with current builds. This guide uses the v3 interface throughout.
Configuring Your API Key
PandasAI reads LLM credentials from environment variables. Set your OpenAI key before running any analysis:
export OPENAI_API_KEY="sk-..."
Or set it at the top of your Python script:
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
To use Anthropic Claude instead, set ANTHROPIC_API_KEY and change the model string in the config step below.
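A missing key usually surfaces as an authentication error deep inside the first query. One way to fail earlier is to check the environment at the top of the script, as in this sketch. The helper name require_env_var is hypothetical and not part of PandasAI.

```python
import os


def require_env_var(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell before running the script."
        )
    return value


# Demonstration only: fall back to a placeholder so the sketch runs anywhere.
# In a real script, drop this line and rely on the key exported in the shell.
os.environ.setdefault("OPENAI_API_KEY", "sk-placeholder")

require_env_var("OPENAI_API_KEY")
print("OPENAI_API_KEY is set")
```

The same check works for ANTHROPIC_API_KEY by passing that name instead.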
Analyzing a CSV File
This example uses a sales CSV with columns for date, product, region, units, and revenue. The steps apply to any tabular CSV regardless of domain.
import pandasai as pai
# point to your model
pai.config.set({
    "llm": "openai/gpt-4o"
})
# load the data
df = pai.read_csv("sales_data.csv")
# ask a question
result = df.chat("What is total revenue by region, ranked highest to lowest?")
print(result)
PandasAI generates Python code internally, executes it against the loaded dataframe, and returns the result. Running a second df.chat() call in the same session keeps context alive, so you can follow up without reloading the file:
result2 = df.chat("Which region had the largest month-over-month growth in March?")
The library sends column names, data types, and a small sample of rows to the model on each call. It does not send your full dataset to the API, which matters for privacy with large or sensitive files.
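To get a feel for what "column names, data types, and a small sample" amounts to, the sketch below builds that kind of summary from a CSV using only the standard library. It is an illustration of the idea, not PandasAI's actual payload or serialization format. The type guesser is deliberately crude and the sales rows are made up.

```python
import csv
import io


def infer_type(values):
    """Guess a simple type name for a column from its sampled string values."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "str"


def preview(csv_text: str, sample_rows: int = 3) -> dict:
    """Collect column names, guessed types, and the first few rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    sample = rows[:sample_rows]
    columns = list(sample[0].keys()) if sample else []
    types = {col: infer_type([row[col] for row in sample]) for col in columns}
    return {"columns": columns, "types": types, "sample": sample}


# Made-up rows matching the sales columns described earlier.
SALES_CSV = """date,product,region,units,revenue
2026-01-05,Widget,East,12,239.88
2026-01-06,Gadget,West,3,149.97
2026-01-06,Widget,North,7,139.93
2026-01-07,Gadget,East,5,249.95
"""

summary = preview(SALES_CSV)
print(summary["columns"])
print(summary["types"])
print(len(summary["sample"]), "sample rows")
```

Reviewing a summary like this before the first query is also a quick way to spot columns you would rather drop or mask before any rows leave your machine.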
Reading Excel Files
Excel files work identically to CSVs:
df = pai.read_excel("q1_report.xlsx")
result = df.chat("Summarize the key trends from this data")
For Excel workbooks with multiple sheets, load the specific sheet by name:
import pandas as pd
raw = pd.read_excel("workbook.xlsx", sheet_name="Revenue")
df = pai.DataFrame(raw)
result = df.chat("What is the average deal size by sales rep?")
The pai.DataFrame() wrapper gives any existing pandas DataFrame the .chat() interface. Use this pattern whenever you have already loaded or transformed data before handing it to PandasAI.
Connecting to a SQL Database
For database sources, load with pandas and wrap the result:
import pandas as pd
import sqlalchemy
import pandasai as pai
pai.config.set({"llm": "openai/gpt-4o"})
engine = sqlalchemy.create_engine("postgresql://user:pass@host/dbname")
raw = pd.read_sql("SELECT * FROM orders WHERE created_at > '2026-01-01'", engine)
df = pai.DataFrame(raw)
result = df.chat("What percentage of orders were refunded, broken out by product category?")
print(result)
This pattern works with any SQLAlchemy-compatible database including PostgreSQL, MySQL, SQLite, and Snowflake (via snowflake-sqlalchemy).
Generating Charts
Ask PandasAI to plot directly in the same prompt:
df.chat("Generate a bar chart showing monthly revenue for each region")
The chart is saved as a PNG to a charts/ folder in your working directory. To specify a different path:
pai.config.set({"save_charts_path": "/output/charts"})
Standard bar, line, pie, and scatter plots work reliably. If you need a dual-axis chart or a specific color scheme, include that detail in the prompt: "Generate a line chart with two y-axes, one for revenue and one for unit volume, with revenue in blue."
In headless environments (servers, CI pipelines), select a non-interactive matplotlib backend before importing matplotlib.pyplot or pandasai to prevent display errors:
import matplotlib
matplotlib.use('Agg')
Querying Multiple Dataframes Together
PandasAI handles cross-file questions without explicit join syntax. If you have a customers file and an orders file, pass both to pai.chat():
customers = pai.read_csv("customers.csv")
orders = pai.read_csv("orders.csv")
result = pai.chat(
    "How many orders did each customer segment place in Q1?",
    customers,
    orders
)
PandasAI infers join conditions from column names. It handles straightforward foreign keys like customer_id automatically. For ambiguous schemas, add a short description to the prompt: "customers.id matches orders.customer_id." For more than two dataframes, the same pattern extends by adding more arguments.
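Before relying on inferred joins, it can help to check whether two schemas have an obvious key in common. The sketch below is a hypothetical pre-check based on naming conventions only. It is not how PandasAI infers joins, and the column lists are invented for the example.

```python
def candidate_join_keys(left_name, left_cols, right_cols):
    """Suggest (left_column, right_column) pairs that look like join keys.

    Hypothetical heuristic for pre-checking a schema; not PandasAI's logic.
    """
    pairs = []
    right = set(right_cols)
    for col in left_cols:
        # Same id-like column name on both sides, e.g. customer_id.
        if col in right and col.endswith("id"):
            pairs.append((col, col))
    # Primary key "id" on the left matching "<singular>_id" on the right.
    singular = left_name[:-1] if left_name.endswith("s") else left_name
    foreign_key = f"{singular}_id"
    if "id" in left_cols and foreign_key in right:
        pairs.append(("id", foreign_key))
    return pairs


customer_cols = ["id", "name", "segment"]
order_cols = ["order_id", "customer_id", "order_date", "total"]
print(candidate_join_keys("customers", customer_cols, order_cols))
```

If the helper returns an empty list, that is a hint the schema is ambiguous and the relationship should be spelled out in the prompt.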
Common Failure Modes and Fixes
Incorrect aggregations on complex questions. PandasAI generates code probabilistically. For multi-step calculations, split the question into two simpler prompts and chain the results. For example, instead of asking for a period-over-period growth rate in a single prompt, first ask for totals per period, then ask for the growth calculation.
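The two-step split described above mirrors how the calculation decomposes by hand. The plain-Python sketch below shows that decomposition, with made-up monthly figures. Running the same arithmetic on a small slice of your data is one way to sanity-check the answers PandasAI returns.

```python
def period_totals(rows):
    """Step 1: sum revenue per period (here, per month)."""
    totals = {}
    for period, revenue in rows:
        totals[period] = totals.get(period, 0.0) + revenue
    return totals


def growth_rates(totals):
    """Step 2: growth of each period relative to the one before it."""
    periods = sorted(totals)
    rates = {}
    for previous, current in zip(periods, periods[1:]):
        rates[current] = (totals[current] - totals[previous]) / totals[previous]
    return rates


# Made-up (month, revenue) rows for illustration only.
rows = [
    ("2026-01", 400.0), ("2026-01", 600.0),
    ("2026-02", 1200.0),
    ("2026-03", 900.0),
]
totals = period_totals(rows)
rates = growth_rates(totals)
for month, rate in rates.items():
    print(f"{month}: {rate:+.1%}")
```

The sketch assumes every period has a non-zero total. A period with zero revenue would need special handling before dividing.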
API rate limit errors. If you loop through many questions programmatically, add time.sleep(1) between calls to stay within OpenAI's requests-per-minute limits.
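A fixed pause can be combined with a retry that backs off when a call fails. The sketch below assumes a callable ask, which would be something like df.chat in a real script. Here it is replaced by a stand-in that fails once so the example runs without an API key. The function name ask_all and its parameters are hypothetical.

```python
import time


def ask_all(ask, questions, pause=1.0, max_retries=3, sleep=time.sleep):
    """Call ask(question) for each question, pausing between calls.

    Retries a failed call with exponential backoff before giving up.
    """
    answers = []
    for index, question in enumerate(questions):
        if index > 0:
            sleep(pause)  # fixed gap between consecutive requests
        for attempt in range(max_retries + 1):
            try:
                answers.append(ask(question))
                break
            except Exception:
                if attempt == max_retries:
                    raise
                sleep(pause * 2 ** attempt)  # pause, 2x pause, 4x pause, ...
    return answers


calls = {"count": 0}


def flaky_ask(question):
    """Stand-in for df.chat that fails on its first call."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("simulated rate limit")
    return f"answer to: {question}"


waits = []  # record the requested sleeps instead of actually sleeping
answers = ask_all(flaky_ask, ["q1", "q2"], sleep=waits.append)
print(answers)
print(waits)
```

Catching every Exception keeps the sketch short. In practice it is safer to catch only the rate-limit error raised by your LLM client, so that genuine bugs are not retried.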
Column not found errors. PandasAI reads column names from the dataframe schema. If your CSV has inconsistent casing or spaces in its column names, normalize them before loading:
import pandas as pd
import pandasai as pai
raw = pd.read_csv("data.csv")
# lowercase every column name and replace spaces with underscores
raw.columns = raw.columns.str.lower().str.replace(" ", "_")
df = pai.DataFrame(raw)
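The normalization step above handles casing and spaces. Headers with stray whitespace or punctuation need a slightly stricter rule, which could look like the following sketch. The helper normalize_column is hypothetical and the sample headers are invented.

```python
import re


def normalize_column(name: str) -> str:
    """Lowercase a column name and collapse non-alphanumeric runs to '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


headers = ["Order Date", " Region ", "Unit-Price ($)", "revenue"]
print([normalize_column(h) for h in headers])
```

With pandas, the same function can be applied through raw.rename(columns=normalize_column). Two different headers can collapse to the same name under this rule, so it is worth checking for duplicates afterwards.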
Chart not saved. Confirm matplotlib is installed (pip install matplotlib) and the output directory is writable.
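Both conditions can be checked from Python before asking for a chart. The sketch below uses only the standard library. The helper names are hypothetical, and the demonstration writes to a throwaway temporary directory rather than a real charts folder.

```python
import importlib.util
import tempfile
from pathlib import Path


def matplotlib_available() -> bool:
    """Report whether matplotlib can be imported in this environment."""
    return importlib.util.find_spec("matplotlib") is not None


def ensure_writable_dir(path) -> Path:
    """Create the directory if needed and confirm a file can be written to it."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    # Writing a temporary file is a more direct check than permission bits;
    # it raises an OSError subclass if the directory is not writable.
    with tempfile.NamedTemporaryFile(dir=directory):
        pass
    return directory


print("matplotlib installed:", matplotlib_available())

# Demonstration against a throwaway location.
with tempfile.TemporaryDirectory() as base:
    charts_dir = ensure_writable_dir(Path(base) / "charts")
    print("charts directory ready:", charts_dir.is_dir())
```

In a real script, pass the same path you configured for chart output.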
Reducing API Costs
Each df.chat() call sends the column schema and a data sample. For large files, load only the columns your questions will reference:
raw = pd.read_csv("large_file.csv", usecols=["date", "revenue", "region", "product"])
For fully offline use, PandasAI supports local models via Ollama. Replace the config line with:
pai.config.set({"llm": "ollama/llama3"})
This eliminates API costs entirely. Response quality and code generation accuracy are lower than GPT-4o for complex analytical questions, but local models handle straightforward aggregations and summaries well.
When Setup Is a Barrier
The steps above assume you are comfortable opening a terminal, installing Python packages, and handling environment variables. If that is not you, or you want to go straight to analysis without any configuration, VSLZ lets you upload a CSV and start asking questions immediately with no local setup required.
Summary
PandasAI v3 uses pai.read_csv(), pai.DataFrame(), and df.chat() as its core interface. It supports CSV, Excel, SQL databases, and multi-file joins. Charts generate to PNG automatically with matplotlib installed. Typical sessions cost under $0.05 using GPT-4o. For complex aggregations, decompose questions into smaller steps for more reliable results. For offline or zero-cost use, swap to a local Ollama model.
FAQ
Does PandasAI send my full dataset to OpenAI?
No. PandasAI sends column names, data types, and a small sample of rows to the language model, not the full dataset. The generated Python code runs locally on your machine against the complete data. For highly sensitive data, you can also run PandasAI with a local model through Ollama, keeping all data entirely offline.
Which LLMs does PandasAI support?
PandasAI supports OpenAI models (GPT-4o, GPT-4, GPT-3.5), Anthropic Claude, Google Gemini, and local models through Ollama, and it can route requests to other providers through LiteLLM. You switch models by changing the string in pai.config.set(). GPT-4o is the most widely tested and produces the most reliable code for analytical queries.
What is the difference between PandasAI v1 and v3?
In PandasAI v1, you instantiated a PandasAI object with an LLM and called pandas_ai.run(df, question). In v3, you load data with pai.read_csv() or pai.DataFrame() and call df.chat(question) directly. The v3 API also adds native multi-dataframe support via pai.chat(question, df1, df2) and a centralized config system. Most tutorials published before 2025 use the v1 pattern.
How much does it cost to use PandasAI with OpenAI?
Costs depend on question complexity and dataset size. A typical analysis session with 10 to 20 questions on a CSV of 10,000 rows uses well under $0.10 of GPT-4o tokens at current OpenAI pricing ($2.50 per million input tokens, $10 per million output tokens as of 2026). PandasAI samples your data rather than sending the full file, which keeps token usage low.
Can PandasAI generate charts automatically?
Yes. Ask for a chart in the prompt (for example, 'Generate a bar chart showing revenue by region') and PandasAI generates Python code that produces and saves a PNG to a charts/ directory in your working directory. Matplotlib must be installed separately with pip install matplotlib. You can configure a different output path via pai.config.set({'save_charts_path': '/your/path'}).


