Datacenter switchovers are a standard response to certain types of situations, and involve changing which data center serves as the primary data center. Technology organizations regularly practice datacenter switchovers to ensure that tooling and hardware will respond appropriately in case of an emergency. Moreover, switching between datacenters makes room for potentially disruptive maintenance work on the inactive servers, such as database upgrades/changes, hardware replacement, etc. In other words, while we serve traffic from the active datacentre, we do our regular upkeep work on the inactive one to maintain its efficiency and reliability.
At Wikimedia, a datacentre switchover means switching over different components between our two main datacentres: eqiad and codfw.
We perform two datacenter switchovers annually, during the week of the solar equinox:
See #Upcoming Switches for the next switchover dates, and Switch Datacenter/Switchover Dates for a pre-calculated list of switchover dates through 2050.
Our switchover process is broken down into stages, some of which can progress independently, while others need to progress in lockstep. This page documents all the steps needed for this work, broken down by component. SRE/Service_Operations drives the process and maintains the software necessary to run the switchover, with a little help from their friends.
The impact of a switchover is expected to be 2-3 minutes of read-only for MediaWiki, including extensions. Any other services/features/infrastructure not participating directly in the switchover will continue to work as usual. However, anything relying on MediaWiki indirectly (e.g. via some data pipeline) may experience some minor impact, for example a delay in receiving events. This is expected.
Read-only is a two-step process: we first set MediaWiki itself read-only and then the MediaWiki databases. We allow some time between the two steps so that the last in-flight edits can land safely. All read-only functionality will continue to work as usual.
During read-only, any writes reaching our MediaWiki databases (UPDATE, DELETE, INSERT in SQL terms) will be denied. Additionally, any features ignoring the global MediaWiki read-only configuration will not function during this time window. This scheduled read-only period adds about 0.001% MediaWiki edit unavailability per year.
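As a rough sanity check on that figure (assuming two switchovers per year at about 3 minutes of read-only each): 2 × 3 minutes ≈ 6 minutes out of the roughly 525,600 minutes in a year, i.e. about 0.001% of the year.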
Note: Non-MediaWiki databases are not part of the switchover.
Datacenter switchovers take place twice a year. In March we move northward to eqiad, and in September we move southward to codfw. This coincides with the solar equinox, where we assume that the northward solar equinox happens on March 21st, and the southward solar equinox on September 21st. (It is on purpose that this does not match the astronomical event exactly.)
Disruptive operations such as the MediaWiki switchover (see below) will target 14:00 UTC as their start time. However, SRE Service Operations reserves the right to adjust this by up to +/- 2h with sufficient prior notification.
A controlled switchover occurs over a span of 8 days:
The non-read-only parts of the switchover always take place on Tuesday. This process is non-disruptive and lower-risk; it may be scheduled @ 14:00 UTC, but that is not necessary.
On Tuesday, caching and services are switched from the source datacentre (dc_from) to the destination (dc_to). The MediaWiki switchover (read-only) always takes place on the Wednesday of the above mentioned week. During read-only (2-3 minutes) no wikis will be editable, and editors will see a warning message asking them to try again later. Read-only starts @ 14:00 UTC (subject to change with sufficient prior notification, as noted above). Readers should experience no changes for the entirety of the event.
Note: For the next 7 calendar days after the MW read-only phase, traffic will flow solely to one datacentre (the destination), rendering the other datacenter effectively inactive.
At your convenience, after coordinating with deployers, you may switch the deployment server.
A week later, we activate caching and services in the inactive/secondary datacenter again. With traffic flowing to both DCs, we are back in normal Multi-DC mode. This period may be extended, depending on how maintenance work progresses at the inactive DC.
Note: As of September 2023, we are running each datacenter as primary for half of the year. The two datacentres are considered coequal, alternating roles every 6 months.
See Switch Datacenter/Coordination to coordinate dates and the communication plan with the involved groups.
Run a "live test" of the MediaWiki cookbook and a dry-run for everything.
Depending on what changes have occurred in our infrastructure/production since the previous switchover, code changes in the cookbooks are to be expected. The purpose of the live test and the dry run is to exercise most of the existing and updated code paths, and to identify potential issues there.
Note: Always use the --dry-run flag when running cookbooks for testing purposes.
A dry-run is available for both cookbooks we use during a switchover: sre.switchdc.mediawiki and sre.discovery.datacenter. During a dry-run the direction is the one we have announced.
For example, if we are currently on codfw and switching over to eqiad, a dry-run's direction would be codfw→eqiad, as follows:
cumin1002:~# cookbook --dry-run sre.switchdc.mediawiki codfw eqiad
<entering cookbook menu>
> 00-disable-puppet
> 00-reduce-ttl
cumin1002:~# cookbook --dry-run sre.discovery.datacenter depool codfw \
    --all --reason "Datacenter services switchover dry-run" \
    --task-id T357547
Note:
A live test (the --live-test flag) will skip actions that could harm the primary DC, or perform them on the secondary DC instead; it is available only for the sre.switchdc.mediawiki cookbook. What we should be careful about is that we "switch" from the currently secondary DC to the currently primary DC. While the live-test process will log your actions to SAL, please remember to also announce in #wikimedia-sre and #wikimedia-operations that you will be running this test. Unless something goes really badly, this is a non-disruptive test.
For example, if our primary DC is currently codfw and we will be switching to eqiad in the upcoming switchover, the direction for a live test is eqiad→codfw:
cumin1002:~# cookbook sre.switchdc.mediawiki --live-test --task-id TXXXXXX \
    --ro-reason "Datacenter MediaWiki switchover live-test" eqiad codfw
<entering cookbook menu>
> 00-disable-puppet
> 00-reduce-ttl
Note:
Parts of the live test interact directly with the databases (via mysql commands). Consider reaching out to Data Persistence to confirm that this will not conflict with any ongoing maintenance operations during the planned test window.
The "Check that all core primaries in DC_TO are in sync with the core primaries in DC_FROM" step of the 03-set-db-readonly cookbook will fail, but the error is suppressed in --live-test mode. Consider checking with Data Persistence about whether this is expected to fail.
Data Persistence checklist:
Run the sre.switchdc.databases.prepare cookbook.
Service Operations checklist:
Some deployments, such as mw-web and mw-api-ext, may need additional upsizes before Day 1 to accommodate all traffic in the destination datacentre. See task T371273 for an example analysis to determine appropriate upsizes.
GeoDNS (User-facing) Routing:
Use the sre.dns.admin cookbook to depool all GeoDNS services from the source DC. See Change GeoDNS / Depool a Site for details. Example run: if you are depooling eqiad,
cookbook sre.dns.admin depool eqiad
(authdns-update is not required, you only need to run the cookbook)
The TTL for dyna.wikimedia.org and upload.wikimedia.org is 3 minutes (180 seconds) as of May 2025.
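To confirm the TTL currently being served for these records, a plain DNS query is enough. A quick check (assuming dig is available on the host you run it from; the second column of each answer is the remaining TTL in seconds):
# Check the remaining TTL on the user-facing records
dig +noall +answer dyna.wikimedia.org A
dig +noall +answer upload.wikimedia.org A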
Same procedure as above, just using the pool command rather than depool.
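For example, by analogy with the depool run above (substitute the site you are repooling):
cookbook sre.dns.admin pool eqiad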
For a global switchover we use the sre.discovery.datacenter cookbook to depool all services from a DC:
However, there are a few services we completely exclude from this process. These are hardcoded in the sre.discovery.datacenter cookbook (see EXCLUDED_SERVICES).
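If you want to double-check the current exclusion list, you can grep for the constant in a checkout of the operations/cookbooks repository (the file path below is an assumption based on the cookbook's name; adjust it if the module lives elsewhere):
# In a local operations/cookbooks checkout
grep -n "EXCLUDED_SERVICES" cookbooks/sre/discovery/datacenter.py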
What the cookbook does is, for each service:
Before depooling any service, do not forget to review (and copy/paste) the current status of all services by running:
cookbook sre.discovery.datacenter status all
The following command will depool all active/active services from a DC, and will prompt to move or skip the active/passive ones.
# Switch all services to eqiad
$ sudo cookbook sre.discovery.datacenter depool codfw --all --reason "Datacenter Switchover" --task-id T12345
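After the run completes, it can be worth re-running the status command shown earlier to confirm that nothing unexpected is still pooled in the source DC:
cookbook sre.discovery.datacenter status all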
The following command will repool all active/active services to a DC, and will prompt to move or skip the active/passive ones.
# Repool codfw
$ sudo cookbook sre.discovery.datacenter pool codfw --reason "Datacenter switch to Multi-DC" --task-id T12345
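To spot-check where an individual service currently resolves, you can also query its discovery record directly. A sketch, using swift as an arbitrary example service name (substitute any entry from the service catalog):
# From a production host: shows the IP (and therefore the DC) the discovery record points to
dig +short swift.discovery.wmnet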
We divide the process into logical phases that should be executed sequentially. Within any phase, top-level tasks can be executed in parallel to each other, while subtasks are to be executed sequentially. The phase number is referred to in the names of the tasks in the operations/cookbooks repository, in the cookbooks/sre/switchdc/mediawiki/ path.
sudo cookbook sre.switchdc.mediawiki --ro-reason 'DC switchover (TXXXXXX)' codfw eqiad
and proceed through the steps
Before starting, lock deployments: scap lock --all "Datacenter Switchover - T12345"
00-disable-puppet: Disables puppet on maintenance hosts in both eqiad and codfw.
00-reduce-ttl: Reduces TTL for various DNS discovery entries. Make sure that at least 5 minutes (the old TTL) have passed before moving to Phase 1. The cookbook should force you to wait anyway.
00-optional-warmup-caches: Warms up shared (e.g., Memcache) and local (e.g., APCu) caches in DC_TO using the mediawiki-cache-warmup tool. The warmup queries will repeat automatically until the response times stabilize, and cover mw-web, mw-api-ext, and mw-api-int.
00-downtime-db-readonly-checks: Sets downtime for the read-only checks on the MariaDB masters changed in Phase 3, so they don't page.
01-stop-maintenance: Stops maintenance jobs and kills all the periodic jobs (systemd timers) on maintenance hosts in both datacenters. Keep in mind there is a chance a manual job is running; check again with your peers. Usually the way forward is to kill the job by force, run sudo systemctl reset-failed <failed unit> on the maintenance host, and then re-run the cookbook.
02-set-readonly: Sets read-only mode by changing the ReadOnly conftool value.
03-set-db-readonly: Puts the origin DC's (DC_FROM) core DB masters (shards: s1-s8, x1, es4-es5) in read-only mode and waits for the destination DC's (DC_TO) databases to catch up with replication.
04-switch-mediawiki: Switches the discovery records and the MediaWiki active datacenter: sets MEDIAWIKI_SERVICES to pooled=true in the destination DC, switches WMFMasterDatacenter from DC_FROM to DC_TO, and sets MEDIAWIKI_SERVICES to pooled=false in the source DC. After this, DNS will be changed for the source DC and internal applications (except MediaWiki) will start hitting the new DC.
06-set-db-readwrite: Sets the destination DC's core DB masters (shards: s1-s8, x1, es4-es5) in read-write mode.
07-set-readwrite: Goes back to read-write mode by changing the ReadOnly conftool value. Take a breath, smile!
08-restart-envoy-on-jobrunners: Restarts pods on the (now) inactive jobrunners, triggering changeprop to re-resolve the DNS name and connect to the destination DC.
08-start-maintenance: Starts maintenance on the destination DC.
09-restore-ttl: Sets the TTL for the DNS records back to 300 seconds.
Phase 9 (manual): Update the DNS records for the new database masters, run authdns-update, and !log it ("Phase 9: Update DNS records for new database masters").
09-run-puppet-on-db-masters: Runs Puppet on the database masters in both DCs, to update the expected read-only state.
Some useful checks once the switchover is done:
curl -s -H 'Accept: application/json' https://stream.wikimedia.org/v2/stream/recentchange | jq .
mx-in1001:~$ sudo watch qshape
mx-out1001:~$ sudo watch qshape
CirrusSearch talks by default to the local datacenter ($wmgDatacenter). No special actions are required when disabling a datacenter.
Manually switching CirrusSearch to a specific datacenter can always be done. Point CirrusSearch to codfw by editing wgCirrusSearchDefaultCluster in ext-CirrusSearch.php.
To ensure coherence in case of lost updates, a reindex of the pages modified during the switch can be done by following Recovering from an Elasticsearch outage / interruption in updates.
If you need to exclude services, using the old sre.switchdc.services cookbook is still necessary until exclusion is implemented.
# Switch all services to codfw, excluding parsoid and cxserver
$ sudo cookbook sre.switchdc.services --exclude parsoid cxserver -- eqiad codfw
If you are switching only one service, using the old sre.switchdc.services cookbook is still necessary.
# Switch the service "parsoid" to codfw-only$sudocookbooksre.switchdc.services--servicesparsoid--eqiadcodfwapt.wikimedia.org needs aPuppet change.
These services require manual changes to be switched over and have not yet been included in service::catalog
These services should be switched by Collaboration Services:
Main document: MariaDB/Switch Datacenter
Once we're confident that the switchover will not be rolled back, run the sre.switchdc.databases.finalize cookbook.
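The exact arguments for the sre.switchdc.databases.prepare and sre.switchdc.databases.finalize cookbooks are not spelled out on this page; a safe first step is to check their help output on a cumin host before running them (a sketch, assuming the standard cookbook runner):
sudo cookbook sre.switchdc.databases.prepare -h
sudo cookbook sre.switchdc.databases.finalize -h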
These references are here to aid discovery and to serve as a reminder, so as to avoid ambiguity.
A few months after the switchback of 2023, and following a feedback-gathering process, a proposal to move to a predictable set of switchover dates while also increasing the switchover duration to 6 months was adopted and turned into a process. The document can be found in the link below:
Recurring, Equinox-based, Data Center Switchovers
See Switch Datacenter/Switchover Dates for a pre-calculated list up to 2050.
September
March
March
September
September
February
April (switching back):
Schedule:
Reports:
Switching back:
Reports:
Schedule:
Reports:
Switching back:
Schedule:
Reports:
Switching back:
Reports:
Schedule:
Reports:
Switching back:
Reports:
Schedule:
Reports:
Switching back:
Aggregated list of interesting dashboards