About Firebase A/B tests

To help you maximize the relevance and usefulness of your test results, this page provides detailed information about how Firebase A/B Testing works.

Sample size

Firebase A/B Testing inference does not require the identification of a minimum sample size prior to starting an experiment. In general, you should pick the largest experiment exposure level that you feel comfortable with. Larger sample sizes increase the chances of finding a statistically significant result, especially when performance differences between variants are small. You may also find it useful to consult an online sample size calculator to find the recommended sample size based on the characteristics of your experiment.
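
For background, the calculation such sample size calculators typically implement is the standard two-proportion power formula shown below. This is textbook statistics, not something A/B Testing computes for you. With baseline and variant conversion rates p_1 and p_2, a one-tailed significance level \alpha, and power 1 - \beta:

  n \approx \frac{(z_{1-\alpha} + z_{1-\beta})^2 \left[ p_1(1-p_1) + p_2(1-p_2) \right]}{(p_1 - p_2)^2}

For example, detecting a lift from p_1 = 0.20 to p_2 = 0.22 at \alpha = 0.05 (one-tailed, matching the tests described under Interpret test results) with 80% power gives z_{0.95} = 1.645, z_{0.80} = 0.84, and n \approx 5{,}100 users per variant.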

Edit experiments

You can edit selected parameters of running experiments, including:

  • Experiment name
  • Description
  • Targeting conditions
  • Variant values

To edit an experiment:

  1. Open the results page for the experiment you want to modify.
  2. From the More menu, select Edit running experiment.
  3. Make your changes, then click Publish.

Note that changing the app's behavior during a running experiment may impact results.

Remote Config variant assignment logic

Users who match all experiment targeting conditions (including the percentage exposure condition) are assigned to experiment variants according to variant weights and a hash of the experiment ID and the user's Firebase installation ID.
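
Firebase's actual assignment hash is internal, but the mechanics of deterministic, hash-based bucketing can be sketched in BigQuery SQL. In this illustration, FARM_FINGERPRINT is purely a stand-in for the internal hash, the installation IDs are made up, and experiment ID 25 is borrowed from the example later on this page:

  -- Illustrative sketch only: Firebase's real assignment hash is internal.
  SELECT
    installation_id,
    bucket,
    -- With 50/50 variant weights, buckets 0-49 map to Baseline and 50-99 to the variant.
    IF(bucket < 50, 'Baseline', 'Variant A') AS assigned_variant
  FROM (
    SELECT
      installation_id,
      MOD(ABS(FARM_FINGERPRINT(CONCAT('25', ':', installation_id))), 100) AS bucket
    FROM UNNEST(['fid_aaa', 'fid_bbb', 'fid_ccc']) AS installation_id
  )

Because the bucket depends only on the experiment ID and the installation ID, the same user always lands in the same variant for a given experiment.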

Google Analytics Audiences are subject to latency and are not immediately available when a user initially meets the audience criteria:

  • When you create a new audience, it may take 24-48 hours to accumulate new users.
  • New users are typically enrolled into qualifying audiences 24-48 hours after they become eligible.

For time-sensitive targeting, consider using Google Analytics user properties or built-in targeting options such as country or region, language, and app version.

Once a user has entered an experiment, they are persistently assigned to their experiment variant and receive parameter values from the experiment as long as the experiment remains active, even if their user properties change and they no longer meet the experiment targeting criteria.

Activation events

Experiment activation events limit experiment measurement to app users who trigger the activation event. The experiment activation event does not have any impact on the experiment parameters that are fetched by the app; all users who meet the experiment targeting criteria will receive experiment parameters. Consequently, it is important to choose an activation event that occurs after the experiment parameters have been fetched and activated, but before the experiment parameters have been used to modify the app's behavior.

Variant weights

During experiment creation, it is possible to change the default variant weights to place a larger percentage of experiment users into a variant.

Interpret test results

Firebase A/B Testing uses frequentist inference to help you understand the likelihood that your experiment results could have occurred solely due to random chance. This likelihood is represented by a probability value, or p-value: a value between 0 and 1 that gives the probability that a difference in performance this large, or larger, between two variants could have occurred due to random chance if there were actually no effect. A/B Testing uses a significance level of 0.05 so that:

  • A p-value less than 0.05 indicates that if the true difference were zero, there is a less than 5% chance that an observed difference this extreme could occur randomly. Because 0.05 is the threshold, any p-value less than 0.05 indicates a statistically significant difference between variants.
  • A p-value greater than 0.05 indicates that the difference between variants is not statistically significant.
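
In notation: if H_0 denotes the null hypothesis that there is no true difference between variant and baseline, and t_{\mathrm{obs}} is the observed test statistic, the one-tailed p-value reported here is

  p = P\left(T \geq t_{\mathrm{obs}} \mid H_0\right)

and a result is flagged as statistically significant when p < 0.05.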

Experiment data is refreshed once a day, and the last update time appears at the top of the experiment results page.

The experiment results graph displays the cumulative average values of the selected metric. For example, if you're tracking Ad revenue per user as a metric, the graph displays observed revenue per user; if you're tracking Crash-free users, it displays the percentage of users who have not encountered a crash. This data is cumulative from the beginning of the experiment.

Results are split into Observed data and Inference data. Observed data is calculated directly from Google Analytics data, and inference data provides p-values and confidence intervals to help you evaluate the statistical significance of the observed data.

For each metric, the following statistics are displayed:

Observed data

  • Total value for the tracked metric (number of retained users, number of users who crashed, total revenue)
  • Metric-specific rate (retention rate, conversion rate, revenue per user)
  • Percent difference (lift) between the variant and baseline

Inference data

  • 95% CI (Difference in means) displays an interval that contains the "true" value of the tracked metric with 95% confidence. For example, if your experiment results in a 95% CI for estimated total revenue between $5 and $10, there is a 95% chance that the true difference in means is between $5 and $10. If the CI range includes 0, a statistically significant difference between the variant and baseline was not detected.

    Confidence interval values appear in the format that matches the tracked metric. For example, Time (in HH:MM:SS) for user retention, USD for ad revenue per user, and percentage for conversion rate.

  • P-value, which represents the probability of observing data as extreme as the results obtained in the experiment, given that there is no true difference between the variant and baseline. The lower the p-value, the higher the confidence that the observed performance would hold if the experiment were repeated. A value of 0.05 or lower indicates a significant difference and a low likelihood that results were due to chance. P-values are based on a one-tailed test, where the Variant value is greater than the Baseline value. Firebase uses an unequal variance t-test for continuous variables (numeric values, like revenue) and a z-test of proportions for conversion data (binary values, like user retention, crash-free users, and users who trigger a Google Analytics event).
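
For reference, the conventional textbook forms of these two test statistics are shown below. Firebase doesn't document its exact implementation (for example, the degrees-of-freedom correction it applies), so treat these as standard definitions rather than the product's internals. With sample means \bar{x}, standard deviations s, sample sizes n, and conversion counts x for the variant (v) and baseline (b):

  t = \frac{\bar{x}_v - \bar{x}_b}{\sqrt{s_v^2/n_v + s_b^2/n_b}}
  \qquad
  z = \frac{\hat{p}_v - \hat{p}_b}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_v} + \frac{1}{n_b}\right)}},
  \quad \hat{p} = \frac{x_v + x_b}{n_v + n_b}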

The experiment results provide important insights for each experiment variant, including:

  • How much higher or lower each experiment metric is compared to the baseline, as directly measured (that is, the actual observed data)
  • The likelihood that the observed difference between the variant and the baseline could have occurred due to random chance (p-value)
  • A range that is likely to contain the "true" performance difference between the variant and the baseline for each experiment metric, a way to understand the "best case" and "worst case" performance scenarios

Interpret results for experiments powered by Google Optimize

Firebase A/B Testing results for experiments started before October 23, 2023 were powered by Google Optimize. Google Optimize used Bayesian inference to generate insightful statistics from your experiment data.

Results are split into "observed data" and "modeled data." Observed data wascalculated directly from analytics data, and modeled data was derived from theapplication of our Bayesian model to the observed data.

For each metric, the following statistics are displayed:

Observed data

  • Total value (sum of metric for all users in the variant)
  • Average value (average value of metric for users in the variant)
  • % difference from baseline

Modeled data

  • Probability to beat baseline: how likely it is that the metric is higher for this variant compared to the baseline
  • Percent difference from baseline: based on the median model estimates of the metric for the variant and the baseline
  • Metric ranges: the ranges where the value of the metric is most likely to be found, with 50% and 95% certainty

Overall, the experiment results give us three important insights for each variant in the experiment:

  1. How much higher or lower each experiment metric is compared to the baseline, as directly measured (i.e., the actual observed data)
  2. How likely it is that each experiment metric is higher than the baseline / best overall, based on Bayesian inference (probability to be better / best, respectively)
  3. The plausible ranges for each experiment metric based on Bayesian inference: the "best case" and "worst case" scenarios (credible intervals)

Leader determination

For experiments using frequentist inference, Firebase declares that a variant is leading if there is a statistically significant performance difference between the variant and the baseline on the goal metric. If multiple variants meet this criterion, the variant with the lowest p-value is chosen.

For experiments that used Google Optimize, Firebase declared a variant a "clear leader" if it had a greater than 95% chance of being better than the baseline variant on the primary metric. If multiple variants met the "clear leader" criteria, only the best-performing variant overall was labeled the "clear leader."

Since leader determination is based on the primary goal only, you should consider all relevant factors and review the results of secondary metrics before deciding whether or not to roll out a leading variant. You may want to consider the expected upside of making the change, the downside risk (such as the lower end of the confidence interval for improvement), and the impact on metrics other than the primary goal.

For example, if your primary metric is Crash-free users, and Variant A is a clear leader over the baseline, but Variant A user retention metrics trail baseline user retention, you may want to investigate further before rolling out Variant A more widely.

You can roll out any variant, not just a leading variant, based on your overall evaluation of performance across both primary and secondary metrics.

Experiment duration

Firebase recommends that an experiment continue to run until the following conditions are met:

  1. The experiment has accrued enough data to provide a useful result. Experiments and result data are updated once daily. You may want to consult an online sample size calculator to evaluate the recommended sample size for your experiment.
  2. The experiment has run long enough to ensure a representative sample of your users and measure longer-term performance. Two weeks is the recommended minimum runtime for a typical Remote Config experiment.

Experiment data is processed for a maximum of 90 days after experiment start. After 90 days, the experiment is automatically stopped. Experiment results are no longer updated in the Firebase console and the experiment stops sending experiment-specific parameter values. At this point, clients begin fetching parameter values based on the conditions set in the Remote Config template. Historical experiment data is retained until you delete the experiment.

BigQuery schema

In addition to viewing A/B Testing experiment data in the Firebase console, you can inspect and analyze experiment data in BigQuery. While A/B Testing does not have a separate BigQuery table, experiment and variant memberships are stored on every Google Analytics event within the Analytics event tables.

The user properties that contain experiment information are of the form userProperty.key like "firebase_exp_%" or userProperty.key = "firebase_exp_01", where 01 is the experiment ID, and userProperty.value.string_value contains the (zero-based) index of the experiment variant.

You can use these experiment user properties to extract experiment data. This gives you the power to slice your experiment results in many different ways and independently verify the results of A/B Testing.
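
For example, a query along the following lines counts the distinct users in each variant of every experiment; PROJECT_NAME, the Analytics property ID, and the YYYYMMDD date placeholders are values you substitute with your own (the next sections explain how to find them):

  SELECT
    userProperty.key AS experimentId,
    userProperty.value.string_value AS variantIndex,
    COUNT(DISTINCT user_pseudo_id) AS users
  FROM
    `PROJECT_NAME.analytics_000000000.events_*`,
    UNNEST(user_properties) AS userProperty
  WHERE
    userProperty.key LIKE 'firebase_exp_%'
    AND _TABLE_SUFFIX BETWEEN 'YYYYMMDD' AND 'YYYYMMDD'
  GROUP BY 1, 2
  ORDER BY 1, 2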

To get started, complete the following as described in this guide:

  1. Enable BigQuery export for Google Analytics in the Firebase console
  2. Access A/B Testing data using BigQuery
  3. Explore example queries

Enable BigQuery export for Google Analytics in the Firebase console

If you're on the Spark plan, you can use the BigQuery sandbox to access BigQuery at no cost, subject to sandbox limits. See Pricing and the BigQuery sandbox for more information.

First, make sure that you're exporting your Analytics data to BigQuery:

  1. Open the Integrations tab, which you can access from Project settings in the Firebase console.
  2. If you're already using BigQuery with other Firebase services, click Manage. Otherwise, click Link.
  3. Review About Linking Firebase to BigQuery, then click Next.
  4. In the Configure integration section, enable the Google Analytics toggle.
  5. Select a region and choose export settings.

    Note: For more information about Google Analytics for Firebase settings, see Data collection.
  6. Click Link to BigQuery.

Depending on how you chose to export data, it may take up to a day for the tables to become available. For more information about exporting project data to BigQuery, see Export project data to BigQuery.

Access A/B Testing data in BigQuery

Before querying for data for a specific experiment, you'll want to obtain some or all of the following to use in your query:

  • Experiment ID: You can obtain this from the URL of the Experiment overview page. For example, if your URL looks like https://console.firebase.google.com/project/my_firebase_project/config/experiment/results/25, the experiment ID is 25.
  • Google Analytics property ID: This is your 9-digit Google Analytics property ID. You can find this within Google Analytics; it also appears in BigQuery when you expand your project name to show the name of your Google Analytics event table (project_name.analytics_000000000.events).
  • Experiment date: To compose a faster and more efficient query, it's good practice to limit your queries to the Google Analytics daily event table partitions that contain your experiment data, that is, tables identified with a YYYYMMDD suffix. So, if your experiment ran from February 2, 2024 through May 2, 2024, you'd specify a _TABLE_SUFFIX between '20240202' AND '20240502'. For an example, see Select a specific experiment's values.
  • Event names: Typically, these correspond with the goal metrics that you configured in the experiment. For example, in_app_purchase events, ad_impression, or user_retention events.
Tip: If you're on the Blaze plan, Firebase can generate a sample query to extract the experiment name, variant, event name, and number of events for the experiment you select. Learn more at Query experiment data using the Firebase console's auto-generated query.

After you gather the information you need to generate your query:

  1. Open BigQuery in the Google Cloud console.
  2. Select your project, then select Create SQL query.
  3. Add your query. For example queries to run, see Explore example queries.
  4. Click Run.
Tip: While these steps describe using the Google Cloud console, you can also use the CLI or client libraries to query BigQuery. Find out more in the BigQuery documentation.

Query experiment data using the Firebase console's auto-generated query

If you're using the Blaze plan, the Experiment overview page provides a sample query that returns the experiment name, variants, event names, and the number of events for the experiment you're viewing.

To obtain and run the auto-generated query:

  1. From the Firebase console, open A/B Testing and select the A/B Testing experiment you want to query to open the Experiment overview.
  2. From the Options menu, beneath BigQuery integration, select Query experiment data. This opens your project in BigQuery within the Google Cloud console and provides a basic query you can use to query your experiment data.

The following example shows a generated query for an experiment with three variants (including the baseline) named "Winter welcome experiment." It returns the active experiment name, variant name, unique event, and event count for each event. Note that the query builder doesn't specify your project name in the table name, as it opens directly within your project.

  /*
    This query is auto-generated by Firebase A/B Testing for your
    experiment "Winter welcome experiment".
    It demonstrates how you can get event counts for all Analytics
    events logged by each variant of this experiment's population.
  */
  SELECT
    'Winter welcome experiment' AS experimentName,
    CASE userProperty.value.string_value
      WHEN '0' THEN 'Baseline'
      WHEN '1' THEN 'Welcome message (1)'
      WHEN '2' THEN 'Welcome message (2)'
    END AS experimentVariant,
    event_name AS eventName,
    COUNT(*) AS count
  FROM
    `analytics_000000000.events_*`,
    UNNEST(user_properties) AS userProperty
  WHERE
    (_TABLE_SUFFIX BETWEEN '20240202' AND '20240502')
    AND userProperty.key = 'firebase_exp_25'
  GROUP BY
    experimentVariant, eventName

For additional query examples, proceed to Explore example queries.

Explore example queries

The following sections provide examples of queries you can use to extract A/B Testing experiment data from Google Analytics event tables.

Extract purchase and experiment standard deviation values from all experiments

You can use experiment results data to independently verify Firebase A/B Testing results. The following BigQuery SQL statement extracts experiment variants, the number of unique users in each variant, the total revenue summed from in_app_purchase and ecommerce_purchase events, and standard deviations, for all experiments within the time range specified as the _TABLE_SUFFIX begin and end dates. You can use the data you obtain from this query with a statistical significance calculator for one-tailed t-tests to verify that the results Firebase provides match your own analysis.

For more information about how A/B Testing calculates inference, see Interpret test results.

  /*
    This query returns all experiment variants, number of unique users,
    the average USD spent per user, and the standard deviation for all
    experiments within the date range specified for _TABLE_SUFFIX.
  */
  SELECT
    experimentNumber,
    experimentVariant,
    COUNT(*) AS unique_users,
    AVG(usd_value) AS usd_value_per_user,
    STDDEV(usd_value) AS std_dev
  FROM (
    SELECT
      userProperty.key AS experimentNumber,
      userProperty.value.string_value AS experimentVariant,
      user_pseudo_id,
      SUM(
        CASE
          WHEN event_name IN ('in_app_purchase', 'ecommerce_purchase')
            THEN event_value_in_usd
          ELSE 0
        END) AS usd_value
    FROM
      `PROJECT_NAME.analytics_ANALYTICS_ID.events_*`
      CROSS JOIN UNNEST(user_properties) AS userProperty
    WHERE
      userProperty.key LIKE 'firebase_exp_%'
      AND event_name IN ('in_app_purchase', 'ecommerce_purchase')
      AND (_TABLE_SUFFIX BETWEEN 'YYYYMMDD' AND 'YYYYMMDD')
    GROUP BY 1, 2, 3)
  GROUP BY 1, 2
  ORDER BY 1, 2;
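
If you'd rather verify the t-test by hand than with an online calculator, the query's output columns map directly onto the unequal variance t statistic shown under Interpret test results (\bar{x} = usd_value_per_user, s = std_dev, n = unique_users); the degrees of freedom then come from the standard Welch-Satterthwaite approximation used with unequal variance t-tests:

  \nu \approx \frac{\left( s_v^2/n_v + s_b^2/n_b \right)^2}{\frac{(s_v^2/n_v)^2}{n_v - 1} + \frac{(s_b^2/n_b)^2}{n_b - 1}}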

Select a specific experiment's values

The following example query illustrates how to obtain data for a specific experiment in BigQuery. This sample query returns the experiment name, variant names (including Baseline), event names, and event counts.

  SELECT
    'EXPERIMENT_NAME' AS experimentName,
    CASE userProperty.value.string_value
      WHEN '0' THEN 'Baseline'
      WHEN '1' THEN 'VARIANT_1_NAME'
      WHEN '2' THEN 'VARIANT_2_NAME'
    END AS experimentVariant,
    event_name AS eventName,
    COUNT(*) AS count
  FROM
    `analytics_ANALYTICS_PROPERTY.events_*`,
    UNNEST(user_properties) AS userProperty
  WHERE
    (_TABLE_SUFFIX BETWEEN 'YYYYMMDD' AND 'YYYYMMDD')
    AND userProperty.key = 'firebase_exp_EXPERIMENT_NUMBER'
  GROUP BY
    experimentVariant, eventName
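
A small variation of the same query counts distinct users per variant, which is a quick way to check the realized split against the variant weights you configured (the placeholders are the same as above):

  SELECT
    CASE userProperty.value.string_value
      WHEN '0' THEN 'Baseline'
      WHEN '1' THEN 'VARIANT_1_NAME'
      WHEN '2' THEN 'VARIANT_2_NAME'
    END AS experimentVariant,
    COUNT(DISTINCT user_pseudo_id) AS users
  FROM
    `analytics_ANALYTICS_PROPERTY.events_*`,
    UNNEST(user_properties) AS userProperty
  WHERE
    (_TABLE_SUFFIX BETWEEN 'YYYYMMDD' AND 'YYYYMMDD')
    AND userProperty.key = 'firebase_exp_EXPERIMENT_NUMBER'
  GROUP BY experimentVariant
  ORDER BY experimentVariant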

Limits

A/B Testing is limited to 300 total experiments, 24 running experiments, and 24 draft experiments. These limits are shared with Remote Config rollouts. For example, if you have two running rollouts and three running experiments, you can have up to 19 additional rollouts or experiments.

  • If you reach the 300 total experiment limit or the 24 draft experiment limit, you must delete an existing experiment before creating a new one.

  • If you reach the 24 running experiment and rollout limit, you must stop a running experiment or rollout before starting a new one.

An experiment can have a maximum of 8 variants (including the baseline) and up to 25 parameters for each variant. An experiment can have a size of up to around 200 KiB. This includes variant names, variant parameters, and other configuration metadata.
