How to Set Up Great Expectations for Data Quality

Arkzero Research · Apr 28, 2026 · 7 min read

Last updated Apr 28, 2026

Great Expectations (GX Core) is a Python library that validates data against a set of declarative rules before it reaches your pipeline or reports. Install it with pip, connect it to a CSV, SQL table, or cloud file, define Expectations describing what valid data looks like, then run a Checkpoint to validate on a schedule and receive alerts on failure. Setup takes under 15 minutes and works with pandas, SQL databases, Spark, and S3.

Why Data Quality Testing Matters

Gartner estimates that poor data quality costs organizations $12.9 million per year on average. For analytics teams and ops managers, the problem is concrete: a single duplicate record or unexpected null in a monthly report can produce decisions built on bad numbers.

The standard workaround is manual checks before reports run. Manual checks do not scale. When 12 tables feed one dashboard, no one remembers to validate all 12 before the Monday morning meeting.

Great Expectations solves this by letting you write the checks once, version them alongside your data pipeline, and run them automatically on every data load. With more than 300 built-in Expectations covering nullability, value ranges, uniqueness, regex patterns, and statistical distributions, most validation rules require no custom code.

GX Core (version 1.16.1 as of April 2026) is the open-source Python library that powers the Great Expectations platform. It supports pandas DataFrames, SQL databases via SQLAlchemy, Apache Spark, and file systems including S3, Azure Blob, and GCS.

Install GX Core

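GX Core 1.x does not support Python 3.8. The 1.0 release required Python 3.9 or higher, and later releases may raise the minimum, so check the supported range for your version on PyPI. Install it from PyPI: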

pip install great_expectations

Verify the installation:

python -c "import great_expectations; print(great_expectations.__version__)"

You should see 1.16.1 or the current release printed to your terminal.
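The same check can run from a pipeline's startup code, so a job fails fast when the library is missing. This is a minimal sketch using only the standard library. The function name installed_gx_version is illustrative and not part of GX.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_gx_version():
    """Return the installed great_expectations version string, or None if absent."""
    try:
        return version("great_expectations")
    except PackageNotFoundError:
        return None


gx_version = installed_gx_version()
if gx_version is None:
    print("great_expectations is not installed in this environment")
else:
    print(f"great_expectations {gx_version} is installed")
```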

Step 1: Create a Data Context

A Data Context is the entry point for every GX workflow. It manages your data sources, expectation suites, and validation results.

For local development, create a file-based context that persists your configuration to disk:

import great_expectations as gx

context = gx.get_context(mode="file", project_root_dir="./gx")

The first run creates a ./gx/ directory containing configuration files and a local Data Docs site you can open in a browser.

For quick exploration without writing anything to disk, use an ephemeral context:

context = gx.get_context()

Step 2: Connect to Data

GX Core uses a three-level hierarchy to connect to data: Data Source represents where your data lives, Data Asset is a specific table or file within that source, and Batch Definition tells GX how to retrieve a slice of that asset for validation.

To connect to a local CSV file:

data_source = context.data_sources.add_pandas_filesystem(
    name="my_data_source",
    base_directory="./data"
)

data_asset = data_source.add_csv_asset(name="sales_data")

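# A file-based asset needs the file's path relative to base_directory.
# "sales.csv" is an assumed file name; substitute your own.
batch_definition = data_asset.add_batch_definition_path(
    name="full_file",
    path="sales.csv"
)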

To connect to a SQL database (PostgreSQL, MySQL, Snowflake, BigQuery):

data_source = context.data_sources.add_sql(
    name="postgres_source",
    connection_string="postgresql+psycopg2://user:password@host/dbname"
)

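data_asset = data_source.add_table_asset(
    name="orders_table",
    table_name="orders"
)

# Validate the whole table in one batch
batch_definition = data_asset.add_batch_definition_whole_table("full_table")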
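The connection string above embeds a password in source code. One way to avoid that is to assemble the URL from environment variables at runtime. The sketch below rests on assumptions: the variable names PG_USER and PG_PASSWORD and the helper build_postgres_url are hypothetical, not GX conventions.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(host, dbname, env=None):
    """Build a SQLAlchemy-style PostgreSQL URL from environment variables."""
    env = os.environ if env is None else env
    # quote_plus escapes characters such as "@" or "/" that would otherwise
    # break URL parsing when they appear in a username or password.
    user = quote_plus(env["PG_USER"])
    password = quote_plus(env["PG_PASSWORD"])
    return f"postgresql+psycopg2://{user}:{password}@{host}/{dbname}"


# Demo with an explicit mapping instead of the real environment.
demo_url = build_postgres_url(
    "db.internal", "sales", env={"PG_USER": "gx", "PG_PASSWORD": "p@ss/word"}
)
print(demo_url)
```

The returned string can be passed as the connection_string argument when adding the SQL data source.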

Step 3: Define Expectations

An Expectation is a verifiable assertion about your data. Create an Expectation Suite and add rules to it:

suite = context.suites.add(
    gx.core.expectation_suite.ExpectationSuite(name="sales_quality_suite")
)

# Column must exist
suite.add_expectation(gx.expectations.ExpectColumnToExist(column="order_id"))

# No null values
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="order_id")
)

# Status must be one of a known set
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeInSet(
        column="status",
        value_set=["pending", "shipped", "delivered", "cancelled"]
    )
)

# Revenue must be non-negative
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(column="revenue", min_value=0)
)

# Order IDs must be unique
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="order_id")
)

Earlier GX releases (the 0.x line) included profilers and Data Assistants that could autogenerate a starting suite from a sample of your data. These tools were removed in GX Core 1.0, so on a 1.x install you write the initial suite by hand. If you generate a baseline on an older version, treat it as a draft and edit it to match your actual business rules.

Step 4: Run a Validation

A Validation Definition links a Batch Definition to an Expectation Suite. Running it returns a Validation Result with a pass/fail for each Expectation and sample failing rows for any that did not pass.

validation_def = context.validation_definitions.add(
    gx.core.validation_definition.ValidationDefinition(
        name="sales_validation",
        data=batch_definition,
        suite=suite
    )
)

results = validation_def.run()
print(results)

The printed output includes a top-level success flag, summary statistics showing the percentage of Expectations that passed, and, for each failed Expectation, the column and a sample of the unexpected values. Row-level detail such as row indices depends on the result format you request.

Step 5: Set Up a Checkpoint for Recurring Validation

For production use, a Checkpoint wraps one or more Validation Definitions and fires Actions when results come in. This is the standard pattern for integrating GX into a pipeline: run the Checkpoint after every data load, and alerts fire automatically on failure.

checkpoint = context.checkpoints.add(
    gx.checkpoint.Checkpoint(
        name="daily_sales_check",
        validation_definitions=[validation_def],
        actions=[
            gx.checkpoint.SlackNotificationAction(
                name="slack_alert",
                slack_webhook="https://hooks.slack.com/services/YOUR/WEBHOOK",
                notify_on="failure"
            )
        ]
    )
)

checkpoint_result = checkpoint.run()

Swapping the Slack action for EmailAction sends email notifications. A Microsoft Teams webhook follows the same pattern with MicrosoftTeamsNotificationAction. The notify_on parameter accepts failure, success, or all.

Step 6: View Results in Data Docs

GX Core generates a local HTML site called Data Docs that displays your Expectation Suites and Validation Results in a human-readable format. Open it with:

context.open_data_docs()

This opens a browser showing each Expectation, its definition, and the most recent validation result. For teams, publish the site to S3 or GCS so all members can review results without running Python locally.

Integrating Into a Production Pipeline

In a data pipeline built with dbt, Prefect, Airflow, or Dagster, add a checkpoint.run() call after each data load step. If the Checkpoint returns a failure result, the pipeline can halt before downstream consumers receive bad data, or route the failing records to a quarantine table for review.

A typical pattern for an ops team running daily reports: load raw data from the source system, run the GX Checkpoint, proceed to transformation and reporting only if all Expectations pass. According to IBM's data quality research, organizations that automate validation in their pipelines reduce downstream data incidents by 73 percent compared to teams relying on manual spot checks.
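The halt-on-failure pattern can be sketched as a small wrapper that raises when the Checkpoint result is unsuccessful, which most orchestrators treat as a failed task. The names DataQualityError and run_quality_gate are hypothetical. The stand-in object below only mimics a Checkpoint so the example runs without a GX project.

```python
from types import SimpleNamespace


class DataQualityError(RuntimeError):
    """Raised when a Checkpoint reports failed Expectations."""


def run_quality_gate(checkpoint, **run_kwargs):
    """Run a Checkpoint and stop the pipeline if validation fails.

    `checkpoint` is any object with a run() method whose result exposes a
    boolean `success` attribute, as GX Checkpoint results do.
    """
    result = checkpoint.run(**run_kwargs)
    if not result.success:
        raise DataQualityError(
            "Validation failed; downstream steps were not started"
        )
    return result


# Stand-in for a real Checkpoint: its run() always reports failure.
failing_checkpoint = SimpleNamespace(
    run=lambda **kwargs: SimpleNamespace(success=False)
)

try:
    run_quality_gate(failing_checkpoint)
except DataQualityError as exc:
    print(f"Pipeline halted: {exc}")
```

In a real pipeline you would pass the Checkpoint created in Step 5 and let the exception propagate so the orchestrator marks the run as failed.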

If the next step after validation is turning a clean dataset into charts or summaries for stakeholders, VSLZ lets you upload the validated CSV and generate statistical analysis and visualizations from a single prompt without configuring a separate BI tool.

Common Pitfalls

Running GX on a full table with millions of rows can be slow. Use time-based Batch partitioning to validate only a rolling window (the last 7 days, for example) rather than scanning the entire table on each run.
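A rolling window can be expressed as one set of batch parameters per day. This sketch assumes a daily Batch Definition partitioned on a date column that accepts year, month, and day keys. To my knowledge GX Core 1.x provides add_batch_definition_daily on SQL table assets for this, but verify the call and key names against the documentation for your version. The helper name rolling_window_batch_parameters is hypothetical.

```python
from datetime import date, timedelta


def rolling_window_batch_parameters(end_date, days=7):
    """Return one batch-parameter dict per day, oldest first, ending at end_date."""
    window = [
        end_date - timedelta(days=offset) for offset in range(days - 1, -1, -1)
    ]
    return [{"year": d.year, "month": d.month, "day": d.day} for d in window]


params = rolling_window_batch_parameters(date(2026, 4, 28), days=7)
print(params[0], params[-1])
```

Each dict can then be passed as batch_parameters when running the Validation Definition, so only the last seven daily slices are scanned.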

Start with a small, focused Expectation Suite. An overly strict suite with too many regex patterns or tight statistical bounds produces frequent false failures and alert fatigue. Build the suite incrementally as you learn how your data behaves in production.
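One way to loosen a rule without dropping it is the mostly argument that many column-level Expectations accept: the Expectation passes when at least that fraction of values conform. The pure-Python sketch below illustrates the idea only. It is not GX's implementation, and the function name and sample values are invented.

```python
def passes_with_mostly(values, predicate, mostly=1.0):
    """Return True if at least `mostly` of the non-null values satisfy predicate."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return True  # nothing to evaluate
    passing = sum(1 for v in non_null if predicate(v))
    return passing / len(non_null) >= mostly


# Ten non-null revenues, one of them negative.
revenues = [120.0, 80.5, None, -3.0, 42.0, 15.0, 60.0, 9.9, 33.0, 71.0, 18.0]

strict = passes_with_mostly(revenues, lambda v: v >= 0)
tolerant = passes_with_mostly(revenues, lambda v: v >= 0, mostly=0.9)
print(strict, tolerant)
```

In GX the equivalent would be passing mostly=0.9 to an Expectation such as ExpectColumnValuesToBeBetween, which tolerates a small share of outliers instead of alerting on every one.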

Summary

GX Core gives you a repeatable, versionable way to validate data at any point in a pipeline. The core workflow is: create a Data Context, connect to a data source, define an Expectation Suite, and run a Checkpoint. Alerts fire on failure, Data Docs publish results for the team, and the entire setup integrates cleanly into any Python-based pipeline without a cloud dependency or SaaS account.

FAQ

What is Great Expectations GX Core?

GX Core is the open-source Python library at the center of the Great Expectations data quality platform. It lets you write Expectations (declarative rules about your data), run them against any data source, and review results in auto-generated HTML reports called Data Docs. The current stable release is version 1.16.1.

How do I install Great Expectations?

Run pip install great_expectations in a supported Python environment (GX Core 1.x dropped Python 3.8; check PyPI for the current range). Verify with python -c "import great_expectations; print(great_expectations.__version__)". No separate database or server is required for local use.

What data sources does GX Core support?

GX Core connects to pandas DataFrames, CSV and Parquet files stored locally or in S3/Azure Blob/GCS, SQL databases via SQLAlchemy (PostgreSQL, MySQL, SQLite, Snowflake, BigQuery, Redshift), and Apache Spark DataFrames.

What is a Checkpoint in Great Expectations?

A Checkpoint is a named object that runs one or more Validation Definitions and fires Actions based on results. Actions include Slack alerts, email notifications, and Microsoft Teams webhooks. Checkpoints are the standard way to integrate GX into a production pipeline.

How is Great Expectations different from Soda or dbt tests?

dbt tests run inside the dbt transformation layer. Soda uses YAML scan configuration and connects through Soda Cloud. GX Core is a Python-first library that integrates directly into any Python pipeline without a cloud dependency, offers 300+ built-in Expectation types, and generates standalone Data Docs without a SaaS account.
