How to Get Started with Dagster
Last updated Apr 24, 2026

What Is Dagster and Why Teams Are Switching
Dagster is an open-source data orchestration platform built around software-defined assets (SDAs). Instead of defining a workflow as a sequence of tasks — the Airflow model — Dagster asks you to define the data objects your pipeline produces: tables, files, ML model artifacts, API responses. The scheduler works backward from those assets to determine what needs to run and when.
The distinction matters more than it sounds. In a task-based system, you observe whether jobs succeeded or failed. In an asset-based system, you observe whether your data is fresh, stale, or missing. That shift reduces debugging time because the question "why is this dashboard stale?" becomes directly answerable from the UI rather than requiring you to trace task logs.
Dagster's GitHub repository has passed 13,000 stars as of April 2026. With Apache Airflow 2 reaching end of life, many teams that built their pipelines in 2021-2022 are evaluating alternatives. According to Dagster's own benchmarking data, engineers building in Dagster are 2x more productive than teams on Airflow, primarily because lineage and staleness are first-class concepts rather than bolted-on plugins.
Prerequisites
Before you start:
- Python 3.10 or higher
- pip or uv for package management
- A terminal and a text editor
No database setup. No Docker. No separate service. For local development, Dagster runs as ordinary processes on your own machine.
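The checklist above can be verified with a short script before installing anything. This is a minimal sketch using only the standard library; the function name check_prerequisites is a hypothetical helper, not part of Dagster.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the prerequisites above


def check_prerequisites() -> dict:
    """Report whether the interpreter and a package installer are available."""
    return {
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        "pip_found": shutil.which("pip") is not None,
        "uv_found": shutil.which("uv") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {ok}")
```

Only one of pip or uv needs to be found; the script reports both so you can pick the matching install command in Step 1.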
Step 1: Install Dagster
pip install dagster dagster-webserver
If you use uv, which installs packages roughly 10-20x faster than pip:
uv pip install dagster dagster-webserver
Verify the install:
dagster --version
You should see output like dagster, version 1.10.x.
Step 2: Scaffold a New Project
Dagster includes a scaffold command that creates the correct directory structure:
dagster project scaffold --name my_pipeline
cd my_pipeline
pip install -e ".[dev]"
The scaffold creates:
my_pipeline/
    my_pipeline/
        __init__.py
        assets.py
        definitions.py
    my_pipeline_tests/
    setup.py
    pyproject.toml
The definitions.py file is the entry point Dagster loads. It wires assets, jobs, schedules, and sensors into a single Definitions object.
Step 3: Define Your First Assets
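Open assets.py and replace its contents. The example imports pandas, which Dagster does not install for you, so run pip install pandas first if it is missing: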
import pandas as pd
from dagster import asset, AssetExecutionContext


@asset
def raw_sales():
    """Load raw sales records from a remote CSV."""
    # The public Titanic dataset stands in for sales data in this walkthrough.
    return pd.read_csv(
        "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
    )


@asset
def sales_summary(context: AssetExecutionContext, raw_sales):
    """Summarize conversion rate by customer segment."""
    # Pclass plays the role of segment; the mean of Survived plays the role of rate.
    summary = raw_sales.groupby("Pclass")["Survived"].mean().reset_index()
    summary.columns = ["segment", "rate"]
    context.log.info(f"Summarized {len(raw_sales)} rows into {len(summary)} segments")
    return summary
Two things to notice. First, sales_summary takes raw_sales as a parameter — Dagster reads that argument name and resolves the dependency automatically. No YAML graph definitions, no explicit set_upstream calls. Second, context.log.info writes to Dagster's structured event log, which surfaces in the UI next to run metadata.
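To make the name-based dependency resolution concrete, here is a toy model of the idea in plain Python. It is a sketch of the general technique, not Dagster's implementation: the functions and the run_all helper are hypothetical, and it ignores special parameters such as context.

```python
import inspect
from graphlib import TopologicalSorter


def raw_numbers():
    return [3, 1, 2]


def sorted_numbers(raw_numbers):
    return sorted(raw_numbers)


def largest(sorted_numbers):
    return sorted_numbers[-1]


def run_all(functions):
    """Run each function after the functions named by its parameters."""
    by_name = {fn.__name__: fn for fn in functions}
    # Map each function name to the set of parameter names it depends on.
    graph = {
        name: set(inspect.signature(fn).parameters)
        for name, fn in by_name.items()
    }
    results = {}
    # static_order() yields every node after all of its dependencies.
    for name in TopologicalSorter(graph).static_order():
        inputs = {dep: results[dep] for dep in graph[name]}
        results[name] = by_name[name](**inputs)
    return results


results = run_all([largest, sorted_numbers, raw_numbers])
print(results["largest"])
```

The functions are passed in the "wrong" order on purpose: the execution order comes from the parameter names, not from the order of the list.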
Step 4: Wire Everything into Definitions
Open definitions.py:
from dagster import Definitions, load_assets_from_modules

from . import assets

defs = Definitions(
    assets=load_assets_from_modules([assets]),
)
load_assets_from_modules scans the module and collects every @asset-decorated function. You can also list assets explicitly if you want finer control over what gets loaded.
Step 5: Launch the Local UI
From the project root:
dagster dev
Open http://localhost:3000. The Dagster UI shows an Asset Graph with two nodes — raw_sales and sales_summary — connected by a directed edge.
Click Materialize all to run the full pipeline. Dagster executes raw_sales first, passes its result to sales_summary, and records both outputs as asset materializations. The event log on the right of the run view shows each step's start time, duration, and logged metadata.
The UI also tracks the freshness status of each asset. If an upstream asset was materialized three hours ago and a downstream asset has not been re-run since, Dagster marks the downstream asset as stale. This visibility is the core practical difference between asset-based and task-based orchestrators.
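The staleness rule described above can be modeled in a few lines. This is a simplified sketch with hypothetical timestamps and a hypothetical status helper; Dagster's own staleness logic considers more than timestamps.

```python
from datetime import datetime, timezone

# Hypothetical materialization times: the upstream ran three hours after the downstream.
last_materialized = {
    "raw_sales": datetime(2026, 4, 24, 9, 0, tzinfo=timezone.utc),
    "sales_summary": datetime(2026, 4, 24, 6, 0, tzinfo=timezone.utc),
}
upstream_of = {"raw_sales": [], "sales_summary": ["raw_sales"], "forecast": ["sales_summary"]}


def status(asset_name):
    """Classify an asset as 'missing', 'stale', or 'fresh'."""
    own_time = last_materialized.get(asset_name)
    if own_time is None:
        return "missing"  # never materialized
    newer_upstream = any(
        last_materialized[up] > own_time
        for up in upstream_of[asset_name]
        if up in last_materialized
    )
    return "stale" if newer_upstream else "fresh"


for name in upstream_of:
    print(name, status(name))
```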
Step 6: Add a Recurring Schedule
Edit definitions.py:
from dagster import (
    Definitions,
    load_assets_from_modules,
    define_asset_job,
    ScheduleDefinition,
)

from . import assets

all_assets_job = define_asset_job("all_assets_job")

morning_schedule = ScheduleDefinition(
    job=all_assets_job,
    cron_schedule="0 6 * * *",  # 6 AM UTC daily
)

defs = Definitions(
    assets=load_assets_from_modules([assets]),
    jobs=[all_assets_job],
    schedules=[morning_schedule],
)
Restart dagster dev. Navigate to Automation in the sidebar and toggle the schedule on. Dagster's internal scheduler manages execution without any external cron job or additional infrastructure.
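If the cron string is unfamiliar, the sketch below shows what "0 6 * * *" means in practice: fire at minute 0 of hour 6, every day. It handles only this daily shape and is not a general cron parser; next_daily_run is a hypothetical helper, not a Dagster function.

```python
from datetime import datetime, timedelta, timezone


def next_daily_run(now, hour=6):
    """Return the next occurrence of hour:00 strictly after now (same timezone)."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed (or is exactly now), so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


now = datetime(2026, 4, 24, 5, 30, tzinfo=timezone.utc)
print(next_daily_run(now).isoformat())
```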
Step 7: Add Persistent Storage with an I/O Manager
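By default, Dagster pickles each asset's output to the local filesystem. When DAGSTER_HOME is not set, dagster dev writes to a temporary directory that is removed when the process exits, so outputs do not persist between sessions. To keep them in a fixed location, configure the filesystem I/O manager explicitly: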
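# Changes to the definitions.py from Step 6: one new import and a resources argument.
from dagster import FilesystemIOManager

defs = Definitions(
    assets=load_assets_from_modules([assets]),
    jobs=[all_assets_job],
    schedules=[morning_schedule],
    resources={
        # The "io_manager" key replaces the default I/O manager for every asset.
        "io_manager": FilesystemIOManager(base_dir="/tmp/dagster_storage")
    },
)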
For production, Dagster supports S3, GCS, BigQuery, Snowflake, and Delta Lake through official integrations. Switching storage backends does not require changing your asset code: you swap the I/O manager in Definitions and the asset functions stay identical.
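The reason asset code survives a storage swap is that the functions never touch storage themselves. The toy below illustrates that separation with two hypothetical store classes. The method names handle_output and load_input echo Dagster's I/O manager interface, but the signatures here are simplified assumptions, and this is not how you would write a real Dagster I/O manager.

```python
import pickle
import tempfile
from pathlib import Path


class ToyMemoryStore:
    """Keeps outputs in a dict; everything is lost when the process ends."""

    def __init__(self):
        self._values = {}

    def handle_output(self, asset_name, value):
        self._values[asset_name] = value

    def load_input(self, asset_name):
        return self._values[asset_name]


class ToyFilesystemStore:
    """Pickles each output to one file per asset under base_dir."""

    def __init__(self, base_dir):
        self.base_dir = Path(base_dir)
        self.base_dir.mkdir(parents=True, exist_ok=True)

    def handle_output(self, asset_name, value):
        with open(self.base_dir / asset_name, "wb") as f:
            pickle.dump(value, f)

    def load_input(self, asset_name):
        with open(self.base_dir / asset_name, "rb") as f:
            return pickle.load(f)


def raw_numbers():
    return [1, 2, 3]


def total(raw_numbers):
    return sum(raw_numbers)


def run_pipeline(store):
    """Run both 'assets', routing every value through the given store."""
    store.handle_output("raw_numbers", raw_numbers())
    store.handle_output("total", total(store.load_input("raw_numbers")))
    return store.load_input("total")


with tempfile.TemporaryDirectory() as tmp:
    print(run_pipeline(ToyMemoryStore()))
    print(run_pipeline(ToyFilesystemStore(tmp)))
```

Both calls run the same raw_numbers and total functions; only the store passed to run_pipeline changes.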
Step 8: Connect dbt Models as Assets
If your stack includes dbt, the dagster-dbt integration maps each dbt model to a Dagster asset automatically, giving you end-to-end lineage from raw ingestion through SQL transformations in a single graph.
pip install dagster-dbt
Then in definitions.py:
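from pathlib import Path

from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets

DBT_PROJECT_DIR = Path("/path/to/your/dbt/project")

# manifest.json must already exist (for example, after running dbt parse).
@dbt_assets(manifest=DBT_PROJECT_DIR / "target/manifest.json")
def my_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    # Run dbt build and stream each result back to Dagster as events.
    yield from dbt.cli(["build"], context=context).stream()

# In Definitions, add my_dbt_assets to assets and register
# resources={"dbt": DbtCliResource(project_dir=str(DBT_PROJECT_DIR))}.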
Each dbt model becomes a node in the same Dagster asset graph alongside your Python ingestion assets. The combined lineage view is the clearest picture most data teams will have ever had of where their data comes from.
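The lineage comes from dbt's manifest.json, which records each node and the nodes it depends on. The sketch below reads a tiny hand-written stand-in for that file to show the kind of information available. The project name shop and the model names are hypothetical, real manifests carry many more fields, and model_dependencies is an illustrative helper rather than part of dagster-dbt.

```python
import json

# A hand-written stand-in for target/manifest.json (hypothetical project "shop").
sample_manifest = json.loads("""
{
  "nodes": {
    "model.shop.stg_orders": {
      "resource_type": "model",
      "name": "stg_orders",
      "depends_on": {"nodes": ["source.shop.raw.orders"]}
    },
    "model.shop.orders_daily": {
      "resource_type": "model",
      "name": "orders_daily",
      "depends_on": {"nodes": ["model.shop.stg_orders"]}
    },
    "test.shop.not_null_orders_id": {
      "resource_type": "test",
      "name": "not_null_orders_id",
      "depends_on": {"nodes": ["model.shop.stg_orders"]}
    }
  }
}
""")


def model_dependencies(manifest):
    """Map each dbt model name to the unique IDs of the nodes it depends on."""
    return {
        node["name"]: node["depends_on"]["nodes"]
        for node in manifest["nodes"].values()
        if node["resource_type"] == "model"  # skip tests, seeds, snapshots
    }


for model, deps in model_dependencies(sample_manifest).items():
    print(model, "<-", deps)
```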
Deploying to Production
Dagster runs in two modes. Self-hosted requires running the daemon, webserver, and a PostgreSQL database yourself — the right choice for teams with data residency requirements or existing Kubernetes infrastructure.
Dagster Cloud Serverless (the hosted product is now branded Dagster+) is a managed option that activates in under five minutes with no infrastructure configuration. The free tier supports small teams and is a practical way to run a Dagster pipeline in production before committing to server management. Dagster Cloud also supports branch deployments, which are isolated pipeline environments per Git branch; they reduce the risk of shipping pipeline changes to production without a staging run.
What Dagster Does Better Than Airflow
When a Dagster pipeline fails midway, the UI shows exactly which asset materialization failed and which downstream assets are now stale. Airflow's task model shows you that a task failed — it cannot show you which data objects are affected without additional tooling.
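Working out which assets a failure affects is a graph traversal. The sketch below shows the idea on a hypothetical four-asset graph; affected_by_failure is an illustrative helper, not a Dagster API.

```python
from collections import deque

# Hypothetical asset graph: each key lists the assets that read from it.
downstream_of = {
    "raw_sales": ["sales_summary", "sales_by_region"],
    "sales_summary": ["exec_dashboard"],
    "sales_by_region": ["exec_dashboard"],
    "exec_dashboard": [],
}


def affected_by_failure(failed_asset):
    """Return every asset downstream of failed_asset, in breadth-first order."""
    affected = []
    queue = deque(downstream_of[failed_asset])
    while queue:
        name = queue.popleft()
        if name not in affected:  # an asset can be reachable by several paths
            affected.append(name)
            queue.extend(downstream_of[name])
    return affected


print(affected_by_failure("raw_sales"))
```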
Dagster also ships with built-in support for partitioned assets (daily, monthly, or custom key-based). Incremental loading patterns that require external operators in Airflow are first-class primitives in Dagster.
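A daily partitioned asset is, conceptually, one asset with one slice per date, and an incremental load means running only the slices that are missing. The following sketch models that bookkeeping in plain Python. Both helpers are hypothetical, and the YYYY-MM-DD key format is chosen to match the usual form of daily partition keys.

```python
from datetime import date, timedelta


def daily_partition_keys(start, end):
    """List one ISO-date key per day from start up to, but not including, end."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days)]


def missing_partitions(all_keys, materialized):
    """Return the keys that still need to run, preserving date order."""
    return [key for key in all_keys if key not in materialized]


keys = daily_partition_keys(date(2026, 4, 20), date(2026, 4, 24))
already_done = {"2026-04-20", "2026-04-22"}
print(missing_partitions(keys, already_done))
```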
For teams running Airflow 2 pipelines that need a migration path in 2026, Dagster provides official tooling (the dagster-airlift package) to incrementally migrate DAGs. You can observe existing Airflow DAGs from within Dagster without changing a line of Airflow code, then migrate assets one at a time.
If your team works primarily with uploaded files or ad hoc CSV exports rather than scheduled pipelines, VSLZ handles on-demand analysis from a file upload in plain English without any pipeline setup.
Practical Summary
A working Dagster pipeline with a daily schedule takes under 20 minutes to set up from scratch. The asset graph becomes genuinely useful once you have more than three interdependent assets — that is when freshness tracking and lineage visibility start saving real debugging time. After getting the local setup running, the recommended path is Dagster Cloud Serverless for your first production deployment, followed by connecting one dbt project using dagster-dbt to see unified lineage across ingestion and transformation in a single view.
FAQ
What is Dagster used for?
Dagster is an open-source data orchestration platform used to build, schedule, and monitor data pipelines. Unlike Airflow, which models pipelines as task graphs, Dagster models them as software-defined assets — tables, files, ML models — and tracks whether each asset is fresh or stale. It is commonly used for ETL pipelines, dbt orchestration, ML workflow management, and data platform engineering.
How is Dagster different from Apache Airflow?
The core difference is the execution model. Airflow schedules tasks and reports whether they succeeded or failed. Dagster schedules asset materializations and reports whether the resulting data objects are current, stale, or missing. This means Dagster gives you data lineage and freshness visibility out of the box, without additional plugins. Dagster's API is also function-first: assets are decorated Python functions whose dependencies come from their parameter names, with no operator classes to wire together. It also has built-in support for partitioned assets and branch deployments.
How do I install Dagster?
Run pip install dagster dagster-webserver in a Python 3.10+ environment. Then use dagster project scaffold --name my_project to create a project with the correct directory structure, install it with pip install -e '.[dev]', and launch the local UI with dagster dev. The full setup from zero to a running local pipeline takes under 15 minutes.
Is Dagster free?
Yes. Dagster is open-source (Apache 2.0 license) and free to self-host. Dagster Cloud, the managed SaaS version, offers a free Serverless tier for small teams. Paid Dagster Cloud plans add features like SSO, role-based access control, and higher execution limits. Self-hosting requires running the Dagster daemon, webserver, and a PostgreSQL database, but there are no licensing costs.
Can Dagster replace Airflow for existing pipelines?
In many cases, yes. Dagster provides official tooling to incrementally migrate Airflow DAGs — you can observe existing Airflow pipelines from within Dagster without rewriting them, then migrate assets one at a time. With Apache Airflow 2 reaching end of life in 2026, many teams are using the migration tooling to transition to Dagster over several sprints rather than doing a big-bang rewrite.


