How to Set Up Data Quality Checks with Soda

Arkzero Research · Apr 25, 2026 · 7 min read

Last updated Apr 25, 2026

Soda is an open-source data quality platform that scans databases, warehouses, and CSV files using a YAML-based language called SodaCL. Install the library with a connector for your data source, write a checks file defining row count, freshness, and validity rules, then run soda scan from your terminal. As of February 2026, the library ships an AI agent called soda ai that generates checks automatically from your table schema or dbt models.

Soda runs data quality checks directly against your data source using a YAML-based format called SodaCL. Install the library, connect it to your database or warehouse, define your checks, and run soda scan from the terminal. Soda compiles each check into optimized SQL and reports pass or fail with exact violation counts. In February 2026, Soda released an AI agent built into the CLI that generates complete checks files from table definitions, removing the syntax barrier for teams new to the tool.

What Soda Does and Why Teams Use It

Soda catches data quality issues at the pipeline level, between ingestion and transformation or between transformation and serving. Instead of discovering that a dashboard is wrong after someone notices it, you block bad data from moving forward.

The platform works by running scans against live data sources and comparing results against rules you define in SodaCL. Unlike schema-level checks in dbt or Great Expectations, SodaCL focuses on runtime data properties: freshness, row counts, null rates, value validity, and duplicate detection. A widely cited Gartner estimate puts the average cost of poor data quality at $12.9 million per year for large organizations. For smaller teams, the cost is more direct: a weekly ops report built on a table with an unnoticed schema change shows wrong numbers before anyone checks the underlying data.

Soda 4.0, released in late 2025, introduced data contracts as a first-class concept. A contract is a formal schema-and-quality agreement attached to a dataset. Downstream consumers sign the contract; upstream producers are responsible for meeting it. The AI CLI, released February 23, 2026, generates these contracts automatically.

Installing Soda

The base package is soda-core. Each data source has a separate connector.

For PostgreSQL:

pip install soda-postgres

For BigQuery:

pip install soda-bigquery

For Snowflake:

pip install soda-snowflake

For local CSV files or Pandas DataFrames:

pip install soda-pandas

After installation, verify:

soda --version

Version 4.x confirms you are on the current major release. If you see an older version, upgrade with pip install --upgrade soda-core plus the connector package.

Connecting to Your Data Source

Soda reads connection details from a YAML configuration file. Here is an example for PostgreSQL:

data_sources:
  my_postgres:
    type: postgres
    host: localhost
    port: 5432
    username: ${POSTGRES_USER}
    password: ${POSTGRES_PASSWORD}
    database: analytics
    schema: public

Reference environment variables with ${VAR_NAME} to keep credentials out of version control. Soda interpolates them at scan time.
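To make the interpolation step concrete, here is a minimal Python sketch of how a ${VAR_NAME} placeholder can be resolved from the environment. It is not Soda's implementation. The function name interpolate_env and the placeholder pattern are assumptions made for illustration.

```python
import os
import re

# Matches ${VAR_NAME} placeholders. The exact pattern is an assumption
# for this sketch, not taken from Soda's source code.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate_env(text, env=None):
    """Replace each ${VAR_NAME} in text with its value from env.

    Raises KeyError for an unset variable so that a missing credential
    fails loudly instead of reaching the database as a literal string.
    """
    env = os.environ if env is None else env

    def _lookup(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(_lookup, text)


# Demo with an explicit mapping so the example does not depend on your shell.
print(interpolate_env("username: ${POSTGRES_USER}", {"POSTGRES_USER": "analytics_ro"}))
# prints "username: analytics_ro"
```

Failing on an unset variable is a design choice for this sketch. It surfaces a missing secret at load time rather than as an authentication error later.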

For BigQuery:

data_sources:
  my_bigquery:
    type: bigquery
    account_info_json_path: /path/to/service-account.json
    dataset: my_dataset

For Snowflake, the configuration takes account, user, password, database, warehouse, and schema fields. The full reference is in the Soda documentation at docs.soda.io.

Writing Your First Checks File

SodaCL checks live in a YAML file, typically named after the table being tested. Here is a checks file for an orders table:

checks for orders:
  - row_count > 0
  - missing_count(customer_id) = 0
  - invalid_count(status) = 0:
      valid values: [pending, processing, shipped, delivered, cancelled]
  - duplicate_count(order_id) = 0
  - freshness(created_at) < 1d

Breaking down what each check does:

row_count > 0 fails if the table is empty. This is the simplest upstream pipeline failure signal. An empty table that should have data means ingestion broke silently.

missing_count(customer_id) = 0 fails if any row has a NULL customer ID. This catches missing foreign keys before joins silently drop those rows from downstream reports.

invalid_count(status) = 0 with a valid values list checks that every status value in the column belongs to a defined set. A new status added without coordination fails this check, surfacing the discrepancy before it reaches BI tools.

duplicate_count(order_id) = 0 catches deduplication failures in ETL pipelines. A non-zero count tells you exactly how many duplicate order IDs exist.

freshness(created_at) < 1d fails if the most recent record is more than one day old, catching ingestion delays before stale data is served to dashboards.
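The five checks above can be approximated by hand to see what each metric measures. The sketch below runs equivalent queries against an in-memory SQLite table using only the Python standard library. The SQL that Soda generates for your warehouse will differ, and the duplicate count here (number of order_id values that appear more than once) is one reasonable interpretation rather than a statement of Soda's exact definition. The helper names and demo rows are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

VALID_STATUSES = ("pending", "processing", "shipped", "delivered", "cancelled")


def build_demo_orders(now):
    """Create an in-memory orders table with a few deliberately bad rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, "
        "status TEXT, created_at TEXT)"
    )
    rows = [
        (1, 10, "pending", now - timedelta(hours=2)),
        (2, None, "shipped", now - timedelta(hours=5)),   # NULL customer_id
        (2, 11, "refunded", now - timedelta(days=3)),     # duplicate id, invalid status
    ]
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?)",
        [(o, c, s, t.isoformat()) for o, c, s, t in rows],
    )
    return conn


def compute_metrics(conn, now):
    """Return hand-rolled equivalents of the five metrics in the checks file."""
    def scalar(sql, params=()):
        return conn.execute(sql, params).fetchone()[0]

    marks = ", ".join("?" for _ in VALID_STATUSES)
    newest = datetime.fromisoformat(scalar("SELECT MAX(created_at) FROM orders"))
    return {
        "row_count": scalar("SELECT COUNT(*) FROM orders"),
        "missing_count(customer_id)": scalar(
            "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL"
        ),
        "invalid_count(status)": scalar(
            "SELECT COUNT(*) FROM orders "
            f"WHERE status IS NOT NULL AND status NOT IN ({marks})",
            VALID_STATUSES,
        ),
        "duplicate_count(order_id)": scalar(
            "SELECT COUNT(*) FROM (SELECT order_id FROM orders "
            "WHERE order_id IS NOT NULL GROUP BY order_id "
            "HAVING COUNT(*) > 1) AS dupes"
        ),
        # Age of the newest record, compared against the 1d limit by the caller.
        "freshness(created_at)": now - newest,
    }


now = datetime(2026, 4, 25, 12, 0, tzinfo=timezone.utc)
for name, value in compute_metrics(build_demo_orders(now), now).items():
    print(name, "=", value)
```

Against the demo rows, the null, validity, and duplicate metrics each come out at 1, so three of the five checks would fail while row count and freshness pass.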

Running a Scan

With a configuration file and a checks file ready, run:

soda scan -d my_postgres -c configuration.yml checks_orders.yml

The -d flag specifies the data source name from your configuration, and the -c flag points to the configuration file; the final argument is the checks file. Soda compiles each check into SQL, executes it, and prints results to the terminal:

Soda 4.x.x
Scan summary:
5/5 checks PASSED

On a failure:

Scan summary:
4/5 checks PASSED
1/5 checks FAILED: duplicate_count(order_id) = 0 [duplicate_count=23]

The violation count appears inline. You see the scale of the problem without opening a separate tool.

Using Soda AI CLI to Generate Checks

Running soda ai opens an interactive agent session in your terminal. No extra package is needed and no API key is required. The agent is part of soda-core from version 4.x onwards.

soda ai

Paste a CREATE TABLE statement, a dbt model, or a CSV schema into the session and ask the agent to write a checks file. Example prompt:

> Here is my orders table schema: [paste schema]
> Write a SodaCL checks file covering freshness, nulls, and key integrity.

The agent drafts a complete SodaCL file with checks calibrated to the column names and types it detects. You review the output, edit as needed, and save it as a checks YAML file.

The AI CLI is particularly useful when migrating from ad-hoc SQL scripts to structured data contracts. You can paste an existing validation query and ask the agent to convert it to SodaCL. The agent knows the full SodaCL syntax and common data quality patterns, so it handles cases like multi-column duplicate checks or conditional nullability rules that are tedious to write from scratch.

Integrating Soda into a Pipeline

Soda works best embedded in orchestration rather than run manually. For Airflow, install the apache-airflow-providers-soda package and use SodaSQLOperator as a task within a DAG. A scan failure raises an exception, blocking downstream tasks from running on bad data.

For dbt, Soda ingests test results through the manifest.json and run_results.json files dbt generates after a run. This lets you view dbt test outcomes and Soda scan results in one place, useful for teams running both frameworks.
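If you want to inspect the same dbt artifact yourself before wiring up the ingestion, a short script can summarize it. The sketch below assumes that run_results.json has a top-level results list whose entries carry status and unique_id fields. Check those field names against the artifact your dbt version writes. The function name summarize_dbt_results and the default path are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_dbt_results(path):
    """Count result statuses and list the nodes that failed or errored.

    Assumes a top-level "results" list with "status" and "unique_id"
    keys on each entry; verify this against your dbt version.
    """
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    results = payload.get("results", [])
    counts = Counter(r.get("status", "unknown") for r in results)
    failed = [
        r.get("unique_id", "<unknown>")
        for r in results
        if r.get("status") in {"fail", "error"}
    ]
    return counts, failed


if __name__ == "__main__":
    # "target/run_results.json" is the usual dbt output location, used here
    # as an assumed default.
    artifact = Path("target/run_results.json")
    if artifact.exists():
        counts, failed = summarize_dbt_results(artifact)
        print(dict(counts))
        print(failed)
```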

For GitHub Actions, add a scan step that runs against a staging environment before a pull request merges:

- name: Run Soda Scan
  run: soda scan -d staging -c configuration.yml checks_orders.yml

A failed scan fails the CI check and blocks the PR. This is the most direct way to enforce data contracts across a team without manual review.
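The same gating idea works in any Python-based orchestrator: run the scan as a subprocess and raise when it exits with a non-zero status. The sketch below assumes the scan command signals failure through its exit code, which is what lets it fail a CI step. The names run_gate and ScanFailed are hypothetical, and the demo uses a stand-in command so it runs without Soda installed.

```python
import subprocess
import sys


class ScanFailed(RuntimeError):
    """Raised when the gated command exits with a non-zero status."""


def run_gate(command):
    """Run command and return its stdout, raising ScanFailed on failure."""
    completed = subprocess.run(command, capture_output=True, text=True)
    if completed.returncode != 0:
        raise ScanFailed(
            f"command exited with status {completed.returncode}:\n"
            f"{completed.stdout}{completed.stderr}"
        )
    return completed.stdout


# In a pipeline the command would be the scan from this guide, for example:
#   run_gate(["soda", "scan", "-d", "staging",
#             "-c", "configuration.yml", "checks_orders.yml"])
# The stand-in below only demonstrates the success path.
print(run_gate([sys.executable, "-c", "print('5/5 checks PASSED')"]).strip())
```

Raising an exception instead of returning a boolean means a caller cannot accidentally ignore a failed scan, and downstream steps stop by default.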

Common Mistakes to Avoid

Setting thresholds without historical data is the most common error. A row_count > 50000 check that fails on a slow Monday creates noise and trains teams to ignore alerts. Start with structural checks (nulls, schema, freshness), run a few weeks of scans, then add volume thresholds calibrated to observed minimums.
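One way to turn "calibrated to observed minimums" into a number is to take the lowest row count seen during the observation period and apply a safety margin. The sketch below is one possible heuristic, not a Soda feature. The function name, the 80 percent default, and the sample history are all assumptions.

```python
def calibrated_min_rows(daily_counts, margin_percent=80):
    """Suggest a row_count floor from observed daily row counts.

    Uses the smallest observed count scaled by margin_percent, with
    integer arithmetic to avoid floating-point rounding surprises.
    """
    if not daily_counts:
        raise ValueError("need at least one observed row count")
    if not 0 < margin_percent <= 100:
        raise ValueError("margin_percent must be in the range 1..100")
    return min(daily_counts) * margin_percent // 100


# Hypothetical row counts collected from a few weeks of scans.
history = [61200, 58400, 47900, 63100, 52300]
print(f"row_count > {calibrated_min_rows(history)}")
# prints "row_count > 38320"
```

A slow Monday with 47,900 rows would still pass this threshold, while a truncated load of a few thousand rows would not.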

Storing credentials in configuration files defeats the purpose of centralized secrets management. Always use environment variable interpolation.

Not versioning checks files in git means you lose the ability to audit when a check was added, who changed a threshold, or why a specific validation exists. Checks files are code and belong in version control alongside the models they protect.

Running a first scan without reviewing the output creates a false baseline. The first scan shows you the current state of your data, which may already contain issues. Treat the first scan as an audit, not a gate. Review and adjust checks before treating failures as blocking.

Next Steps

Connect Soda Cloud to centralize scan history, track dataset health over time, configure alerting, and give non-technical stakeholders a web interface to view results. Soda Cloud is the managed layer on top of the open-source library.

For cross-table checks, SodaCL supports custom SQL checks that run arbitrary queries and evaluate their results against a threshold. This covers referential integrity between tables and business-logic validations that built-in metrics cannot express.
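The pattern behind a custom SQL check is simple: run a query that returns a single number, then compare it to a threshold. The sketch below shows that pattern for a referential integrity rule using SQLite from the Python standard library. It illustrates the idea only and does not use SodaCL syntax. The customers table, the query, and the function name are hypothetical.

```python
import sqlite3

# Counts orders whose customer_id has no matching row in customers.
ORPHAN_SQL = """
SELECT COUNT(*)
FROM orders AS o
LEFT JOIN customers AS c ON c.customer_id = o.customer_id
WHERE o.customer_id IS NOT NULL
  AND c.customer_id IS NULL
"""


def check_query_against_threshold(conn, sql, max_allowed=0):
    """Run a single-value query and report (value, passed)."""
    value = conn.execute(sql).fetchone()[0]
    return value, value <= max_allowed


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?)", [(10,), (11,)])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10), (2, 11), (3, 99)],  # customer 99 does not exist
)

orphans, passed = check_query_against_threshold(conn, ORPHAN_SQL)
print(f"orphaned orders: {orphans}, passed: {passed}")
# prints "orphaned orders: 1, passed: False"
```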

The full SodaCL reference at docs.soda.io covers schema evolution checks, freshness for partitioned tables, and integration guides for Spark and Kubernetes-based pipelines.

FAQ

What is SodaCL and how does it differ from dbt tests?

SodaCL (Soda Checks Language) is a YAML-based domain-specific language for defining data quality checks. It runs against live data at scan time and evaluates properties like freshness, row counts, null rates, and value validity. dbt tests run as part of a dbt build and are tightly coupled to the dbt transformation layer. SodaCL runs independently of any transformation tool, which makes it usable across raw, transformed, and serving layers. Teams that use both typically run dbt tests on transformation logic and Soda checks on the data as it is served to downstream consumers.

Is Soda Core free to use?

Soda Core is open-source and free. The soda-core package and all connector packages (soda-postgres, soda-bigquery, soda-snowflake, etc.) are available on PyPI under an open-source license. Soda Cloud, the managed web interface for centralized scan results, alerting, and team collaboration, is a paid product with a free tier available. The soda ai CLI agent is part of soda-core and requires no additional license or API key as of the February 2026 release.

How do I run Soda scans automatically on a schedule?

The most common approach is to embed soda scan commands in your existing orchestration tool. In Airflow, use the apache-airflow-providers-soda package, which provides a SodaSQLOperator that runs as a task within a DAG. You can schedule scans after ingestion, after dbt runs, or before data is served to BI tools. For simpler setups, a cron job running the soda scan command on a schedule works without any orchestration dependency. GitHub Actions can also run scans on a schedule using the on: schedule trigger.

What does the Soda AI CLI actually do?

The soda ai command opens an interactive terminal session with an AI agent that knows the SodaCL syntax and data quality best practices. You provide it a table schema, a dbt model, or a pipeline script, and it generates a complete SodaCL checks file. The agent can also convert existing SQL validation queries into SodaCL checks and help migrate from an older checks format to data contracts introduced in Soda 4.0. No API key is required. It ships as part of soda-core and was released in open beta on February 23, 2026.

How does Soda handle freshness checks on partitioned tables?

For partitioned tables, Soda's freshness check evaluates the maximum value of a timestamp column across the entire table by default. For tables partitioned by date where the most recent partition may be partial or delayed, you can scope the freshness check to a specific partition using a filter. The SodaCL syntax supports a where clause on any check, letting you write freshness(created_at) < 1d where partition_date = current_date - 1 to check yesterday's partition explicitly. The full syntax reference is at docs.soda.io.
