Datacenter switchovers are a standard response to certain types of situations, and involve changing which data center serves as the primary data center. Technology organizations regularly practice datacenter switchovers to ensure that tooling and hardware will respond appropriately in case of an emergency. Moreover, switching between datacenters makes room for potentially disruptive maintenance work on the inactive servers, such as database upgrades/changes, hardware replacement, etc. In other words, while we serve traffic from the active datacentre, we do our regular upkeep work on the inactive one to maintain its efficiency and reliability.
At Wikimedia, a datacentre switchover means switching over different components between our two main datacentres: eqiad and codfw.
We perform two datacenter switchovers annually, during the week of the solar equinox:
See #Upcoming Switches for the next switchover dates, and Switch Datacenter/Switchover Dates for a pre-calculated list of switchover dates through 2050.
Our switchover process is broken down into stages, some of which can progress independently, while others need to progress in lockstep. This page documents all the steps needed for this work, broken down by component. SRE/Service_Operations drives the process and maintains the software necessary to run the switchover, with a little help from their friends.
The impact of a switchover is expected to be 2-3 minutes of read-only for MediaWiki, including extensions. Any other services/features/infrastructure not participating directly in the switchover will continue to work as usual. However, anything relying on MediaWiki indirectly (e.g. via some data pipeline) may experience some minor impact, for example a delay in receiving events. This is expected.
Read-only is a two-step process: we first set MediaWiki itself read-only and then the MediaWiki databases. We allow some time between the two steps so that the last in-flight edits can land safely. All read-only functionality will continue to work as usual.
During read-only, any writes reaching our MediaWiki databases (UPDATE, DELETE, INSERT in SQL terms) will be denied. Additionally, any features ignoring the global MediaWiki read-only configuration will not function during this time window. This scheduled read-only period adds about 0.001% MediaWiki edit unavailability per year.
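As a rough sanity check on that figure (assuming two switchovers per year at about 3 minutes of read-only each): 2 × 3 minutes ≈ 6 minutes out of the roughly 525,600 minutes in a year, i.e. about 0.001% of the year.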
Note: Non-MediaWiki databases are not part of the switchover.
Datacenter switchovers take place twice a year. In March we move northward to eqiad, and in September we move southward to codfw. This coincides with the solar equinox, where we assume that the northward solar equinox happens on March 21st, and the southward solar equinox on September 21st. (It is on purpose that this does not match the astronomical event exactly.)
Disruptive operations such as the MediaWiki switchover (see below) will target 14:00 UTC as their start time. However, SRE Service Operations reserves the right to adjust this by up to +/- 2h with sufficient prior notification.
A controlled switchover occurs over a span of 8 days:
The non-read-only parts of the switchover always take place on Tuesday. This process is non-disruptive and lower-risk; it may be scheduled @ 14:00 UTC, but that is not necessary.
On Tuesday, caching and services are switched from the source datacentre (dc_from) to the destination (dc_to). The MediaWiki switchover (read-only) always takes place on the Wednesday of the above mentioned week. During read-only (2-3 minutes) no wikis will be editable, and editors will see a warning message asking them to try again later. Read-only starts @ 14:00 UTC (subject to change with sufficient prior notification, as noted above). Readers should experience no changes for the entirety of the event.
Note: For the next 7 calendar days after the MW read-only phase, traffic will flow solely to one datacentre (the destination), rendering the other datacenter effectively inactive.
At your convenience, after coordinating with deployers, you may switch the deployment server.
A week later, we activate caching and services in the inactive/secondary datacenter again. With traffic flowing to both DCs, we are back in normal Multi-DC mode. This period may be extended, depending on how maintenance work progresses at the inactive DC.
Note: As of September 2023, we are running each datacenter as primary for half of the year. The two datacentres are considered coequal, alternating roles every 6 months.
See Switch Datacenter/Coordination to coordinate dates and the communication plan with the involved groups.
Run a "live test" of the MediaWiki cookbook and a dry-run for everything.
Depending on what changes have occurred in our infrastructure/production since the previous switchover, code changes in the cookbooks are to be expected. The purpose of the live test and the dry run is to exercise most of the existing and updated code paths, and to identify potential issues there.
Note: Always use the --dry-run flag when running cookbooks for testing purposes.
A dry-run is available for both cookbooks we use during a switchover: sre.switchdc.mediawiki and sre.discovery.datacenter. During a dry-run the direction is the one we have announced.
For example, if we are currently on codfw and switching over to eqiad, a dry-run's direction would be codfw→eqiad, as follows:
cumin1002:~# cookbook --dry-run sre.switchdc.mediawiki codfw eqiad
<entering cookbook menu>
> 00-disable-puppet
> 00-reduce-ttl
cumin1002:~# cookbook --dry-run sre.discovery.datacenter depool codfw \
    --all --reason "Datacenter services switchover dry-run" \
    --task-id T357547
Note:
A live test (the --live-test flag) will skip actions that could harm the primary DC, or perform them on the secondary DC instead; it is available only for the sre.switchdc.mediawiki cookbook. What we should be careful about is that we "switch" from the currently secondary DC to the currently primary DC. While the live-test process will log your actions to SAL, please remember to also announce in #wikimedia-sre and #wikimedia-operations that you will be running this test. Unless something goes really badly, this is a non-disruptive test.
For example, if our primary DC is currently codfw and we will be switching to eqiad in the upcoming switchover, the direction for a live test is eqiad→codfw:
cumin1002:~# cookbook sre.switchdc.mediawiki --live-test --task-id TXXXXXX \
    --ro-reason "Datacenter MediaWiki switchover live-test" eqiad codfw
<entering cookbook menu>
> 00-disable-puppet
> 00-reduce-ttl
Note:
Parts of the live test interact directly with the databases (via mysql commands). Consider reaching out to Data Persistence to confirm that this will not conflict with any ongoing maintenance operations during the planned test window.
The "Check that all core primaries in DC_TO are in sync with the core primaries in DC_FROM" step of the 03-set-db-readonly cookbook will fail, but the error is suppressed in --live-test mode. Consider checking with Data Persistence about whether this is expected to fail.
Data Persistence checklist:
Run the sre.switchdc.databases.prepare cookbook.
Service Operations checklist:
Some deployments, such as mw-web and mw-api-ext, may need additional upsizes before Day 1 to accommodate all traffic in the destination datacentre. See task T371273 for an example analysis to determine appropriate upsizes.
GeoDNS (User-facing) Routing:
Use the sre.dns.admin cookbook to depool all GeoDNS services from the source DC. See Change GeoDNS / Depool a Site for details. Example run: if you are depooling eqiad,
cookbook sre.dns.admin depool eqiad
(authdns-update is not required, you only need to run the cookbook)
The TTL for dyna.wikimedia.org and upload.wikimedia.org is 3 minutes (180 seconds) as of May 2025.
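To confirm the TTL currently being served for these records, a plain DNS query is enough. A quick check (assuming dig is available on the host you run it from; the second column of each answer is the remaining TTL in seconds):
# Check the remaining TTL on the user-facing records
dig +noall +answer dyna.wikimedia.org A
dig +noall +answer upload.wikimedia.org A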
Same procedure as above, just using the pool command rather than depool.
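For example, by analogy with the depool run above (substitute the site you are repooling):
cookbook sre.dns.admin pool eqiad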
For a global switchover we use the sre.discovery.datacenter cookbook to depool all services from a DC:
However, there are a few services we completely exclude from this process. These are hardcoded in the sre.discovery.datacenter cookbook (see EXCLUDED_SERVICES).
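If you want to double-check the current exclusion list, you can grep for the constant in a checkout of the operations/cookbooks repository (the file path below is an assumption based on the cookbook's name; adjust it if the module lives elsewhere):
# In a local operations/cookbooks checkout
grep -n "EXCLUDED_SERVICES" cookbooks/sre/discovery/datacenter.py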
What the cookbook does is, for each service:
Before depooling any service, do not forget to review (and copy/paste) the current status of all services by running:
cookbook sre.discovery.datacenter status all
The following command will depool all active/active services from a DC, and will prompt to move or skip the active/passive ones.
# Switch all services to eqiad
$ sudo cookbook sre.discovery.datacenter depool codfw --all --reason "Datacenter Switchover" --task-id T12345
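After the run completes, it can be worth re-running the status command shown earlier to confirm that nothing unexpected is still pooled in the source DC:
cookbook sre.discovery.datacenter status all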
The following command will repool all active/active services to a DC, and will prompt to move or skip the active/passive ones.
# Repool codfw
$ sudo cookbook sre.discovery.datacenter pool codfw --reason "Datacenter switch to Multi-DC" --task-id T12345
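To spot-check where an individual service currently resolves, you can also query its discovery record directly. A sketch, using swift as an arbitrary example service name (substitute any entry from the service catalog):
# From a production host: shows the IP (and therefore the DC) the discovery record points to
dig +short swift.discovery.wmnet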
We divide the process into logical phases that should be executed sequentially. Within any phase, top-level tasks can be executed in parallel to each other, while subtasks are to be executed sequentially. The phase number is referred to in the names of the tasks in the operations/cookbooks repository, in the cookbooks/sre/switchdc/mediawiki/ path.
sudo cookbook sre.switchdc.mediawiki --ro-reason 'DC switchover (TXXXXXX)' codfw eqiad
and proceed through the steps
Before starting, lock deployments: scap lock --all "Datacenter Switchover - T12345"
00-disable-puppet: Disables puppet on maintenance hosts in both eqiad and codfw.
00-reduce-ttl: Reduces TTL for various DNS discovery entries. Make sure that at least 5 minutes (the old TTL) have passed before moving to Phase 1. The cookbook should force you to wait anyway.
00-optional-warmup-caches: Warms up shared (e.g., Memcache) and local (e.g., APCu) caches in DC_TO using the mediawiki-cache-warmup tool. The warmup queries will repeat automatically until the response times stabilize, and cover mw-web, mw-api-ext, and mw-api-int.
00-downtime-db-readonly-checks: Sets downtime for the read-only checks on the MariaDB masters changed in Phase 3, so they don't page.
01-stop-maintenance: Stops maintenance jobs and kills all the periodic jobs (systemd timers) on maintenance hosts in both datacenters. Keep in mind there is a chance a manual job is running; check again with your peers. Usually the way forward is to kill the job by force, run sudo systemctl reset-failed <failed unit> on the maintenance host, and then re-run the cookbook.
02-set-readonly: Sets read-only mode by changing the ReadOnly conftool value.
03-set-db-readonly: Puts the origin DC's (DC_FROM) core DB masters (shards: s1-s8, x1, es4-es5) in read-only mode and waits for the destination DC's (DC_TO) databases to catch up with replication.
04-switch-mediawiki: Switches the discovery records and the MediaWiki active datacenter: sets MEDIAWIKI_SERVICES to pooled=true in the destination DC, switches WMFMasterDatacenter from DC_FROM to DC_TO, and sets MEDIAWIKI_SERVICES to pooled=false in the source DC. After this, DNS will be changed for the source DC and internal applications (except MediaWiki) will start hitting the new DC.
06-set-db-readwrite: Sets the destination DC's core DB masters (shards: s1-s8, x1, es4-es5) in read-write mode.
07-set-readwrite: Goes back to read-write mode by changing the ReadOnly conftool value. Take a breath, smile!
08-restart-envoy-on-jobrunners: Restarts pods on the (now) inactive jobrunners, triggering changeprop to re-resolve the DNS name and connect to the destination DC.
08-start-maintenance: Starts maintenance on the destination DC.
09-restore-ttl: Sets the TTL for the DNS records back to 300 seconds.
Phase 9 (manual): Update the DNS records for the new database masters, run authdns-update, and !log it ("Phase 9: Update DNS records for new database masters").
09-run-puppet-on-db-masters: Runs Puppet on the database masters in both DCs, to update the expected read-only state.
Some useful checks once the switchover is done:
curl -s -H 'Accept: application/json' https://stream.wikimedia.org/v2/stream/recentchange | jq .
mx-in1001:~$ sudo watch qshape
mx-out1001:~$ sudo watch qshape
CirrusSearch talks by default to the local datacenter ($wmgDatacenter). No special actions are required when disabling a datacenter.
Manually switching CirrusSearch to a specific datacenter can always be done. Point CirrusSearch to codfw by editing wgCirrusSearchDefaultCluster in ext-CirrusSearch.php.
To ensure coherence in case of lost updates, a reindex of the pages modified during the switch can be done by following Recovering from an Elasticsearch outage / interruption in updates.
If you need to exclude services, using the old sre.switchdc.services cookbook is still necessary until exclusion is implemented.
# Switch all services to codfw, excluding parsoid and cxserver
$ sudo cookbook sre.switchdc.services --exclude parsoid cxserver -- eqiad codfw
If you are switching only one service, using the old sre.switchdc.services cookbook is still necessary.
# Switch the service "parsoid" to codfw-only$sudocookbooksre.switchdc.services--servicesparsoid--eqiadcodfwapt.wikimedia.org needs aPuppet change.
These services require manual changes to be switched over and have not yet been included in service::catalog
These services should be switched by Collaboration Services:
Main document: MariaDB/Switch Datacenter
Once we're confident that the switchover will not be rolled back, run the sre.switchdc.databases.finalize cookbook.
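The exact arguments for the sre.switchdc.databases.prepare and sre.switchdc.databases.finalize cookbooks are not spelled out on this page; a safe first step is to check their help output on a cumin host before running them (a sketch, assuming the standard cookbook runner):
sudo cookbook sre.switchdc.databases.prepare -h
sudo cookbook sre.switchdc.databases.finalize -h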
These references are here to aid discovery and to serve as a reminder, so as to avoid ambiguity.
A few months after the switchback of 2023, and following a feedback-gathering process, a proposal to move to a predictable set of switchover dates while also increasing the switchover duration to 6 months was adopted and turned into a process. The document can be found in the link below:
Recurring, Equinox-based, Data Center Switchovers
See Switch Datacenter/Switchover Dates for a pre-calculated list up to 2050.
September
March
March
September
September
February
April (switching back):
Schedule:
Reports:
Switching back:
Reports:
Schedule:
Reports:
Switching back:
Schedule:
Reports:
Switching back:
Reports:
Schedule:
Reports:
Switching back:
Reports:
Schedule:
Reports:
Switching back:
Aggregated list of interesting dashboards