How to Set Up Datadog Experiments for A/B Testing

Arkzero Research · Apr 10, 2026 · 7 min read

Last updated Apr 10, 2026

Datadog Experiments is a new product that lets teams run A/B tests inside the Datadog platform, connecting product changes directly to business metrics stored in their data warehouse. It launched in general availability on April 2, 2026, built on technology from Datadog's acquisition of Eppo. This guide walks through the full setup process, from configuring data sources to launching your first experiment with feature flags and guardrail metrics.

What Datadog Experiments Does

Datadog Experiments lets product and engineering teams run randomized A/B tests directly inside the Datadog platform. Instead of stitching together a feature flag tool, an analytics platform, and a statistics engine, you get all three in one place. The product connects to your existing data warehouse (Snowflake, BigQuery, Redshift, or Databricks) so experiment results are measured against your actual business metrics, not proxies.

The product launched on April 2, 2026, and is built on technology from Datadog's 2025 acquisition of Eppo, a company that specialized in statistical experimentation engines. It ships with variance reduction techniques that let teams detect meaningful differences with smaller sample sizes, which means faster experiment cycles.

Prerequisites

Before starting, you need an active Datadog account with at least one of these data sources configured:

Real User Monitoring (RUM) for client-side performance signals. This captures page load times, user interactions, and frontend errors that your experiments might affect.

Product Analytics for tracking user behavior and journey metrics. This gives you the event data that feeds into experiment metric definitions.

A connected data warehouse if you want to measure experiments against business metrics like revenue, conversion rates, or retention. Datadog supports Snowflake, BigQuery, Redshift, and Databricks as warehouse sources.

You also need the Datadog Feature Flags product enabled, since experiments rely on feature flags to split traffic between control and variant groups.

Step 1: Connect Your Data Sources

Navigate to Organization Settings in Datadog and open the Integrations tab. If you are using a data warehouse, select your provider and follow the connection wizard. For Snowflake, you will need your account identifier, a service user with read access to the relevant tables, and the warehouse name. BigQuery requires a service account JSON key with BigQuery Data Viewer permissions.

For RUM, ensure the RUM SDK is installed in your application. The SDK automatically captures page views, user actions, and resource timings. Product Analytics requires event instrumentation, meaning your application needs to send custom events for the user actions you want to measure (signups, purchases, feature usage).
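
As a rough sketch of what event instrumentation can look like, the helper below records a purchase-style custom action through the RUM Browser SDK's addAction method. The helper name, the event name checkout_completed, and the context fields are assumptions for illustration. The rum argument is expected to be the datadogRum object from '@datadog/browser-rum'; a stand-in is used here so the snippet runs outside a browser.

```javascript
// Hypothetical helper: `rum` should be the `datadogRum` object from
// '@datadog/browser-rum'. It is passed in so the function is easy to test.
function trackCheckoutCompleted(rum, order) {
  // addAction records a custom user action with optional context attributes.
  rum.addAction('checkout_completed', {
    orderId: order.id,
    totalUsd: order.totalUsd,
  });
}

// Stand-in for the SDK so this sketch runs without a browser.
const recorded = [];
const fakeRum = {
  addAction: (name, context) => recorded.push({ name, context }),
};

trackCheckoutCompleted(fakeRum, { id: 'order-123', totalUsd: 49.5 });
console.log(recorded);
```

Passing the SDK object in as a parameter keeps the tracking call in one place, which makes it easier to keep event names consistent with the metric definitions you create later.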

Datadog recommends connecting at least two data sources. RUM plus a data warehouse gives you the broadest experiment coverage: you can measure both user experience metrics (page speed, error rates) and business outcomes (revenue per user, trial conversion) in the same experiment.

Step 2: Define Your Experiment Metrics

Go to the Experiments section under Product Analytics in the Datadog dashboard. Before creating an experiment, define the metrics you want to measure. Click "Metrics" in the left sidebar, then "New Metric."

Datadog supports several metric types: counts (number of signups), sums (total revenue), averages (mean session duration), ratios (conversion rate), and percentiles (P95 page load time). Each metric definition includes the event source, any filters, and the aggregation method.
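
To make the aggregation methods concrete, here is a small sketch of count, sum, average, and a nearest-rank percentile over a made-up list of page load times. It illustrates the arithmetic only; the sample values are invented, and Datadog may compute percentiles with a different method.

```javascript
// Hypothetical page load times in milliseconds.
const pageLoadMs = [120, 340, 95, 410, 230, 180, 150, 900, 260, 310];

const count = (values) => values.length;
const sum = (values) => values.reduce((total, v) => total + v, 0);
const average = (values) => sum(values) / count(values);

// Nearest-rank percentile: the smallest value that has at least
// p percent of the data at or below it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

console.log(count(pageLoadMs));          // 10
console.log(sum(pageLoadMs));            // 2995
console.log(average(pageLoadMs));        // 299.5
console.log(percentile(pageLoadMs, 95)); // 900
```

Note how the single slow load (900 ms) sets the P95 but shifts the average far less, which is why percentile metrics are often a better fit for latency.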

For a typical conversion experiment, you might define a primary metric like "checkout completion rate" as a ratio of checkout_completed events to session_started events, filtered to users who viewed the product page.
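
The ratio above can be sketched in a few lines over raw event rows. The row shape (userId, name) and the product_page_viewed event name are assumptions for this example; in Datadog you would express the same filter and aggregation in the metric definition rather than in application code.

```javascript
// Hypothetical event rows; field names are assumptions for this sketch.
const events = [
  { userId: 'u1', name: 'session_started' },
  { userId: 'u1', name: 'product_page_viewed' },
  { userId: 'u1', name: 'checkout_completed' },
  { userId: 'u2', name: 'session_started' },
  { userId: 'u2', name: 'product_page_viewed' },
  { userId: 'u3', name: 'session_started' }, // never viewed the product page
];

function checkoutCompletionRate(rows) {
  // Filter: keep only users who viewed the product page.
  const viewers = new Set(
    rows.filter((e) => e.name === 'product_page_viewed').map((e) => e.userId)
  );
  const eligible = rows.filter((e) => viewers.has(e.userId));
  const countOf = (name) => eligible.filter((e) => e.name === name).length;

  // Ratio: completed checkouts divided by started sessions.
  const started = countOf('session_started');
  return started === 0 ? null : countOf('checkout_completed') / started;
}

console.log(checkoutCompletionRate(events)); // 0.5
```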

Create guardrail metrics too. These are metrics that should not degrade during the experiment. Common guardrails include error rate, page load time, and overall session count. If a guardrail metric shows statistically significant degradation, Datadog flags the experiment for review.
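
For intuition about what "statistically significant degradation" means on a guardrail like error rate, the sketch below runs a one-sided two-proportion z-test. This is a textbook fixed-sample test, not Datadog's engine; the function names, the input shape, and the 0.05 threshold are assumptions.

```javascript
// Standard normal CDF using the Abramowitz-Stegun erf approximation.
function normalCdf(z) {
  const t = 1 / (1 + (0.3275911 * Math.abs(z)) / Math.SQRT2);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-(z * z) / 2);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// One-sided test: is the treatment error rate higher than control?
function guardrailDegraded(control, treatment, alpha = 0.05) {
  const p1 = control.errors / control.sessions;
  const p2 = treatment.errors / treatment.sessions;
  const pooled =
    (control.errors + treatment.errors) /
    (control.sessions + treatment.sessions);
  const se = Math.sqrt(
    pooled * (1 - pooled) * (1 / control.sessions + 1 / treatment.sessions)
  );
  const z = (p2 - p1) / se;
  const pValue = 1 - normalCdf(z);
  return { z, pValue, degraded: pValue < alpha };
}

// Made-up counts: 1.0% vs 1.5% error rate on 10,000 sessions each.
console.log(
  guardrailDegraded(
    { errors: 100, sessions: 10000 },
    { errors: 150, sessions: 10000 }
  )
);
```

The sketch does not handle the edge case where both groups have zero errors (the standard error would be zero), so treat it as an illustration rather than production code.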

Step 3: Create a Feature Flag

Navigate to Feature Flags in the Datadog sidebar and click "New Flag." Name it something descriptive that matches your experiment, like "checkout-redesign-v2." Set the flag type to boolean or multivariate depending on how many variants you plan to test.

Configure the targeting rules. For a simple A/B test, create two variants: control (the existing experience) and treatment (the new experience). Set the traffic allocation; a 50/50 split is standard for most tests, but you can use 90/10 if you want to limit exposure during early validation.
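
If you are curious how a traffic split can stay stable per user, one common approach is to hash the user ID into a bucket between 0 and 1 and compare it to the treatment share. The sketch below uses an FNV-1a hash for that. This is a general illustration, not a description of how Datadog assigns users; the function names and the key format are assumptions.

```javascript
// 32-bit FNV-1a hash of a string, returned as an unsigned integer.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Deterministic assignment: the same user and flag always map to the same variant.
function assignVariant(userId, flagKey, treatmentShare = 0.5) {
  const bucket = fnv1a(`${userId}:${flagKey}`) / 0x100000000; // in [0, 1)
  return bucket < treatmentShare ? 'treatment' : 'control';
}

// A 90/10 split: roughly one in ten users should land in treatment.
let treated = 0;
for (let i = 0; i < 10000; i++) {
  if (assignVariant(`user-${i}`, 'checkout-redesign-v2', 0.1) === 'treatment') {
    treated++;
  }
}
console.log(`treatment users out of 10000: ${treated}`);
```

Including the flag key in the hashed string means a user who lands in treatment for one experiment is not automatically in treatment for every other experiment.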

Implement the flag in your application code using the Datadog SDK. In a JavaScript application, the check looks like this:

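import { datadogRum } from '@datadog/browser-rum';

// Read the variant assigned to this user for the experiment flag.
// Confirm the exact evaluation call against the SDK version you install.
const variant = datadogRum.getFeatureFlag('checkout-redesign-v2');

// Only the treatment group sees the redesign; any other value keeps the current checkout.
if (variant === 'treatment') {
  renderNewCheckout();
} else {
  renderCurrentCheckout();
}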

Deploy the code with the flag defaulting to "control" so no users see the new experience until you launch the experiment.
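
One way to make the "default to control" behavior explicit in code is a small wrapper that falls back to control whenever flag evaluation fails or returns something unexpected. The wrapper below is a hypothetical helper, not part of any Datadog SDK; evaluateFlag stands in for whatever evaluation call your SDK provides.

```javascript
// Hypothetical wrapper: `evaluateFlag` is any function that takes a flag key
// and returns the assigned variant name.
function resolveVariant(evaluateFlag, flagKey, fallback = 'control') {
  try {
    const value = evaluateFlag(flagKey);
    // Treat anything other than a known variant as "not assigned".
    return value === 'treatment' || value === 'control' ? value : fallback;
  } catch (error) {
    // SDK not loaded, network failure, and similar cases: keep the existing experience.
    return fallback;
  }
}

console.log(resolveVariant(() => 'treatment', 'checkout-redesign-v2')); // treatment
console.log(resolveVariant(() => undefined, 'checkout-redesign-v2'));   // control
console.log(
  resolveVariant(() => { throw new Error('SDK not ready'); }, 'checkout-redesign-v2')
); // control
```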

Step 4: Create and Configure the Experiment

Go back to Experiments and click "Create Experiment." Fill in the experiment name, a hypothesis (for example, "The simplified checkout flow will increase completion rate by 5% without increasing error rates"), and select the feature flag you created.

Assign your primary metric. This is the metric the experiment is designed to move. Add your guardrail metrics. Datadog runs statistical tests on all assigned metrics, but only the primary metric determines whether the experiment "wins."

Set the minimum sample size or minimum runtime. Datadog's statistical engine uses sequential testing, which means it can detect significant results before the experiment reaches a fixed sample size. However, setting a minimum runtime (usually 7 to 14 days) prevents decisions based on day-of-week effects or other cyclical patterns.
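
For a rough sense of scale before you fill in these fields, the sketch below uses the classic fixed-sample formula for comparing two conversion rates, then applies a minimum runtime floor. It assumes a two-sided 5% significance level and 80% power, and the baseline rate and daily traffic are made-up inputs. Datadog's sequential engine will not use exactly this calculation, so treat the output as a planning estimate only.

```javascript
// Standard normal quantiles for a two-sided alpha of 0.05 and 80% power.
const Z_ALPHA = 1.959964;
const Z_POWER = 0.841621;

// Users needed in each group to detect a relative lift over the baseline rate.
function usersPerGroup(baselineRate, relativeLift) {
  const treatmentRate = baselineRate * (1 + relativeLift);
  const variance =
    baselineRate * (1 - baselineRate) + treatmentRate * (1 - treatmentRate);
  const delta = treatmentRate - baselineRate;
  return Math.ceil(((Z_ALPHA + Z_POWER) ** 2 * variance) / delta ** 2);
}

// Runtime is the larger of the sample-driven estimate and the minimum runtime.
function plannedRuntimeDays(perGroup, dailyUsers, minDays = 7) {
  const daysForSample = Math.ceil((2 * perGroup) / dailyUsers);
  return Math.max(minDays, daysForSample);
}

// Hypothetical inputs: 40% baseline completion rate, 5% relative lift.
const perGroup = usersPerGroup(0.4, 0.05);
console.log(perGroup);
console.log(plannedRuntimeDays(perGroup, 5000));
```

With high traffic the sample-driven estimate can fall below a week, which is exactly the case where the minimum runtime floor matters.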

Review the experiment configuration summary, then click "Launch." Datadog will start splitting traffic according to your feature flag rules and collecting metric data.

Step 5: Monitor and Analyze Results

Once the experiment is running, the Experiments dashboard shows real-time results. You will see the current metric values for control and treatment groups, the statistical significance level, and the confidence interval for the treatment effect.

Datadog uses a combination of frequentist and Bayesian methods depending on the metric type. For binary metrics like conversion rate, it applies a sequential probability ratio test. For continuous metrics like revenue per user, it uses CUPED (Controlled-experiment Using Pre-Experiment Data) for variance reduction, which typically reduces the required sample size by 20 to 40 percent.
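
The core of CUPED fits in a few lines: estimate how strongly the pre-experiment value predicts the in-experiment value, then subtract that predictable part. The sketch below shows the standard adjustment on made-up numbers. It is a simplified illustration, not Datadog's implementation; in practice the coefficient is estimated on pooled data from both groups.

```javascript
const mean = (xs) => xs.reduce((s, x) => s + x, 0) / xs.length;

// Sample covariance; covariance(xs, xs) is the sample variance.
function covariance(xs, ys) {
  const mx = mean(xs);
  const my = mean(ys);
  return xs.reduce((s, x, i) => s + (x - mx) * (ys[i] - my), 0) / (xs.length - 1);
}

// CUPED: adjusted_i = post_i - theta * (pre_i - mean(pre)),
// with theta = cov(pre, post) / var(pre).
function cupedAdjust(pre, post) {
  const theta = covariance(pre, post) / covariance(pre, pre);
  const preMean = mean(pre);
  return post.map((y, i) => y - theta * (pre[i] - preMean));
}

// Hypothetical revenue per user before and during the experiment.
const preRevenue = [10, 20, 30, 40, 50];
const postRevenue = [12, 19, 33, 41, 52];
const adjusted = cupedAdjust(preRevenue, postRevenue);

console.log('mean before/after adjustment:', mean(postRevenue), mean(adjusted));
console.log(
  'variance before/after adjustment:',
  covariance(postRevenue, postRevenue),
  covariance(adjusted, adjusted)
);
```

The adjustment leaves the mean unchanged and shrinks the variance, which is why the same effect can be detected with fewer users when the pre-experiment data is a good predictor.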

The guardrail panel shows a green, yellow, or red status for each guardrail metric. Green means no significant change. Yellow means the confidence interval overlaps zero but trends negative. Red means a statistically significant degradation, and the system recommends pausing the experiment.
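
The three statuses can be read as a simple rule on the confidence interval. The sketch below encodes that reading; it assumes the effect is signed so that negative values mean the guardrail got worse, and the function name and input shape are hypothetical.

```javascript
// Assumed convention: negative effect values mean the guardrail got worse.
function guardrailStatus({ estimate, lower, upper }) {
  // Entire interval below zero: statistically significant degradation.
  if (upper < 0) return 'red';
  // Interval overlaps zero, but the point estimate trends negative.
  if (lower <= 0 && estimate < 0) return 'yellow';
  // No significant change (or an improvement).
  return 'green';
}

console.log(guardrailStatus({ estimate: -0.04, lower: -0.07, upper: -0.01 })); // red
console.log(guardrailStatus({ estimate: -0.01, lower: -0.03, upper: 0.01 }));  // yellow
console.log(guardrailStatus({ estimate: 0.005, lower: -0.01, upper: 0.02 }));  // green
```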

When results reach significance, Datadog surfaces a recommendation: roll out the winner, roll back, or continue collecting data. Click the recommendation to automatically update the feature flag so that it serves the winning variant to 100% of traffic.

Common Pitfalls

Running experiments for less than one full business cycle (typically 7 days) often produces misleading results. Weekend traffic patterns differ from weekday patterns in most products, and stopping an experiment on day 3 of a positive trend can lead to false positives.
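
A quick sanity check before reading results is whether the experiment window covers whole weeks, so every weekday is represented equally. The helper below is a hypothetical sketch of that check using UTC dates; the function name and the example dates are illustrative.

```javascript
// Returns how many days the window spans and whether it covers whole weeks.
function runtimeCoverage(startDate, endDate) {
  const msPerDay = 24 * 60 * 60 * 1000;
  const days = Math.floor((endDate - startDate) / msPerDay);
  return {
    days,
    fullWeeks: Math.floor(days / 7),
    // Balanced means each weekday appears the same number of times.
    balanced: days >= 7 && days % 7 === 0,
  };
}

const start = new Date('2026-04-02T00:00:00Z');
console.log(runtimeCoverage(start, new Date('2026-04-05T00:00:00Z')));
// { days: 3, fullWeeks: 0, balanced: false }
console.log(runtimeCoverage(start, new Date('2026-04-16T00:00:00Z')));
// { days: 14, fullWeeks: 2, balanced: true }
```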

Defining too many primary metrics dilutes statistical power. Pick one metric that directly represents the outcome you care about. Use guardrails for everything else.
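
The arithmetic behind this is easy to check. If each of several independent primary metrics is tested at the same significance level, the chance of at least one false positive grows quickly; correcting for it (here with a Bonferroni adjustment, used as one common example) makes each individual test stricter, which costs power. The sketch assumes independent metrics and a 0.05 level.

```javascript
// Chance of at least one false positive across m independent tests at level alpha.
function familyWiseErrorRate(alpha, m) {
  return 1 - (1 - alpha) ** m;
}

// Bonferroni correction: the per-metric threshold needed to keep the overall rate near alpha.
function bonferroniAlpha(alpha, m) {
  return alpha / m;
}

for (const m of [1, 3, 5]) {
  console.log(
    `${m} primary metric(s): false-positive risk ${familyWiseErrorRate(0.05, m).toFixed(3)}, ` +
      `corrected threshold ${bonferroniAlpha(0.05, m).toFixed(4)}`
  );
}
```

With five primary metrics the uncorrected false-positive risk is about 23%, and the corrected per-metric threshold drops to 0.01, so each metric needs a larger effect or more data to reach significance.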

Not accounting for network effects is another common mistake. If your product involves user-to-user interactions (messaging, marketplaces), standard A/B tests can leak treatment effects into the control group. Datadog does not currently offer cluster-randomized experiments, so be cautious with social features.

According to a 2025 analysis by Eppo (now part of Datadog), teams that use variance reduction techniques like CUPED reach statistically significant results 30% faster on average. Enabling this in Datadog Experiments is a single toggle in the experiment settings. If you have pre-experiment data on the metric you are measuring, turn it on.

Practical Summary

The full setup takes about two hours for a team that already has RUM or Product Analytics instrumented. The longest step is usually connecting the data warehouse and verifying that metric definitions match your source-of-truth tables. Once the first experiment is running, subsequent experiments reuse the same metrics and warehouse connections, bringing setup time down to minutes. If you prefer to skip infrastructure setup entirely and run analysis from uploaded data files with plain-English prompts, tools like VSLZ handle the statistical layer without requiring any platform configuration.

FAQ

What data sources does Datadog Experiments support?

Datadog Experiments supports three types of data sources: Real User Monitoring (RUM) for client-side performance metrics, Product Analytics for user behavior tracking, and data warehouse connections including Snowflake, BigQuery, Redshift, and Databricks. You need at least one configured, but Datadog recommends two for broader experiment coverage.

How long should I run a Datadog Experiment before making a decision?

Datadog recommends a minimum runtime of 7 to 14 days to account for day-of-week effects and cyclical traffic patterns. The platform uses sequential testing that can detect significant results before reaching a fixed sample size, but stopping too early (under 7 days) risks false positives from short-term traffic fluctuations.

Does Datadog Experiments require a separate feature flag tool?

No. Datadog Experiments includes built-in feature flag functionality through Datadog Feature Flags. You create flags directly in the Datadog dashboard and implement them using the Datadog SDK. The experiment and flag management are integrated, so you do not need a third-party tool like LaunchDarkly or Split.

What is CUPED and should I enable it in Datadog Experiments?

CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance reduction technique that uses historical data about your metric to reduce noise in experiment results. It typically reduces the required sample size by 20 to 40 percent, meaning experiments reach statistical significance faster. You should enable it whenever you have pre-experiment data for the metric you are measuring. It is a single toggle in the experiment settings.

Can I use Datadog Experiments for server-side A/B tests?

Yes. While client-side experiments use the RUM SDK, server-side experiments can use the Datadog SDK for your backend language. The feature flag evaluation happens server-side, and metrics can be tracked through Product Analytics events or data warehouse queries. The setup process is the same: create a flag, define metrics, and launch the experiment.