Downtime

From Wikipedia, the free encyclopedia

Period when a system is unavailable or unable to provide or perform its primary function

For other uses, see Downtime (disambiguation).

In computing and telecommunications, downtime (also (system) outage or (system) drought colloquially) is a period when a system is unavailable. The unavailability is the proportion of a time span that a system is unavailable or offline. This is usually a result of the system failing to function because of an unplanned event, or because of routine maintenance (a planned event).

The terms are commonly applied to networks and servers. The common reasons for unplanned outages are system failures (such as a crash) or communications failures (commonly known as a network outage, or colloquially a network drought). For outages due to issues with general computer systems, the term computer outage (also IT outage or IT drought) can be used.

The term is also commonly applied in industrial environments in relation to failures in industrial production equipment. Some facilities measure the downtime incurred during a work shift, or during a 12- or 24-hour period. Another common practice is to identify each downtime event as having an operational, electrical or mechanical origin.

The opposite of downtime is uptime.

Types

Industry standards for the terms "Outage Duration" and "Maintenance Duration" can have different points of initiation and completion, so the following definitions should be used to avoid conflicts in contract execution:

"Turnkey": This is the most encompassing of all outage types. The outage or maintenance starts with the operator of the plant or equipment pressing the shutdown or stop button to initiate a halt in operation. Unless otherwise noted, the outage or maintenance is considered complete when the plant or equipment is back in normal operation and ready to begin manufacturing, ready to be synchronized with the system or grid, or ready to perform its duties as a pump or compressor.
"Breaker to Breaker": This outage or maintenance starts with the operator of the plant or equipment removing the power circuit (main power breaker at "off", "disengaged" or "On-Cooldown"), but not the control circuit, from operation. This still allows the equipment to be cooled down or brought to ambient temperature so that outage or maintenance work can be prepared or initiated. Depending on the equipment type, a "Breaker to Breaker" outage can be advantageous when contracting out controls-related maintenance, as this type of work can be performed while the main equipment is still on cool-down or on stand-by. Unless otherwise noted, this type of outage is considered complete when the power circuit is re-energized by engaging the power breaker.
"Completion of Lock-out/Tag-out": This outage or maintenance (sometimes mistaken for "Off-Cooldown", but not the same) starts with the operator of the plant or equipment removing the power circuit, disengaging the control circuit and neutralizing other potential power and hazard sources (typically called Lock-Out, Tag-Out, or "LOTO"). This point is typically the last phase of the outage initiation stage before actual work starts on the facility, plant or equipment. A safety briefing should always follow the LOTO activity, before any work is conducted. Unless otherwise noted, this type of outage is considered complete when the equipment has reached mechanical completion and is ready to be placed on slow-roll (for many types of heavy rotating equipment), bump-tested or rotation-checked (for motors), etc., but must follow return or work permit per LOTO procedures.

Any on-line testing, performance testing and tuning required should not count towards the outage duration, as these activities are typically conducted after the completion of the outage or maintenance event and are outside the control of most maintenance contractors.

Characteristics

Unplanned downtime may be the result of an equipment malfunction, etc.

Telecommunication outage classifications

Downtime can be caused by failure in hardware (physical equipment), software (logic controlling equipment), interconnecting equipment (such as cables, facilities, routers, ...), transmission (wireless, microwave, satellite), and/or capacity (system limits).

The failures can occur because of damage, failure, design, procedural (improper use by humans), engineering (how to use and deployment), overload (traffic or system resources stressed beyond designed limits), environment (support systems like power and HVAC), scheduled (outages designed into the system for a purpose such as software upgrades and equipment growth), other (none of the above but known), or unknown.

The failures can be the responsibility of customer/service provider, vendor/supplier, utility, government, contractor, end customer, public individual, act of nature, other (none of the above but known), or unknown.

Impact

Outages caused by system failures can have a serious impact on the users of computer/network systems, in particular those industries that rely on a nearly 24-hour service:

Medical informatics
Nuclear power and other infrastructure
Banks and other financial institutions
Aeronautics, airlines
News reporting
E-commerce and online transaction processing
Persistent online games

Also affected can be the users of an ISP and other customers of a telecommunication network.

Corporations can lose business due to a network outage, or they may default on a contract, resulting in financial losses. According to Veeam's 2019 cloud data management report, organizations encounter unplanned downtime, on average, 5 to 10 times per year, with the average cost of one hour of downtime being $102,450.^[1]
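
As a rough illustration of how such a per-hour figure translates into the cost of a single incident, the following sketch multiplies an outage duration by an hourly cost. The function name and the default value (taken from the reported average above) are illustrative assumptions; real costs vary widely between organizations.

```python
# Reported average cost of one hour of downtime (USD), used here only
# as an illustrative default.
AVERAGE_COST_PER_HOUR = 102_450.0


def estimated_downtime_cost(hours, cost_per_hour=AVERAGE_COST_PER_HOUR):
    """Return a linear estimate of the cost of an outage lasting `hours`."""
    if hours < 0:
        raise ValueError("outage duration cannot be negative")
    return hours * cost_per_hour
```

A linear model is the simplest possible assumption; it ignores effects such as peak-hour timing, which the sensitivities described below suggest can matter just as much as duration.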

Those people or organizations that are affected by downtime can be more sensitive to particular aspects:

some are more affected by the length of an outage - it matters to them how much time it takes to recover from a problem
others are sensitive to the timing of an outage - outages during peak hours affect them the most

The most demanding users are those that require high availability.

Famous outages

On Mother's Day, Sunday, May 8, 1988, a fire broke out in the main switching room of the Hinsdale Central Office of the Illinois Bell telephone company. One of the largest switching systems in the state, the facility processed more than 3.5 million calls each day while serving 38,000 customers, including numerous businesses, hospitals, and Chicago's O'Hare and Midway Airports.^[2]

Virtually the entire AT&T network of 4ESS toll tandem switches went in and out of service over and over again on January 15, 1990, disrupting long-distance service for the entire United States. The problem dissipated by itself when traffic slowed down. A software bug was later found to be the cause.^[3]

AT&T lost its Frame Relay network for 26 hours on April 13, 1998.^[4] This affected many thousands of customers, and bank transactions were one casualty. AT&T failed to meet the service level agreement on their contracts with customers and had to refund^[5] 6,600 customer accounts, costing millions of dollars.

Xbox Live had intermittent downtime during the 2007–2008 holiday season which lasted thirteen days.^[6] Increased demand from Xbox 360 purchasers (the largest number of new user sign-ups in the history of Xbox Live) was given as the reason for the downtime; in order to make amends for the service issues, Microsoft offered their users the opportunity to receive a free game.^[7]

Sony's PlayStation Network outage of April 2011 began on April 20, 2011, and service was gradually restored from May 14, 2011, starting in the United States. This outage is the longest period the PSN has been offline since its inception in 2006. Sony stated that the problem was caused by an external intrusion which resulted in the theft of personal information. Sony reported on April 26, 2011, that a large amount of user data had been obtained by the same hack that resulted in the downtime.^[8]

Telstra's Ryde switch failed in late 2011 after water entered the electrical switchboard during continuing wet weather. The Ryde switch is one of the largest switches by area in Australia, and the failure affected more than 720,000 services.^{[citation needed]}

The Miami datacenter of ServerAxis went offline unannounced on February 29, 2016, and was never restored. This impacted multiple providers and hundreds of websites. The outage impacted coverage of the 2016 NCAA Division I women's basketball tournament, as WBBState, one of the affected sites, was by far the most comprehensive provider of women's basketball statistics available.^[9]

The game platform Roblox had an outage around October 2021, during their Chipotle Event. Many users attributed the outage to the event, which attracted a great deal of attention because users could get a free Chipotle burrito during it. The outage was Roblox's longest downtime, lasting 3 days.^[10]^[11]^[12]

On July 8, 2022, Rogers suffered a major nationwide outage in Canada. This simultaneously affected cell phone and internet access, causing 911 calls and interbank transactions to fail and disrupting government services.

On July 19, 2024, CrowdStrike issued a faulty device driver update for their Falcon software, causing Windows PCs, servers, and virtual machines to crash and boot loop. The incident affected approximately 8.5 million Windows machines worldwide, including critical infrastructure such as 911 services in various states. It is considered to be the largest outage in the history of information technology.^[13]^[14]

Service levels

In service level agreements, it is common to specify a percentage value (per month or per year) that is calculated by dividing the sum of all downtime periods by the total length of a reference time span (e.g. a month). 0% downtime means that the server was available all the time.
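
The calculation described above can be sketched as follows. This is a minimal illustration, not a definition taken from any particular agreement: the function name is hypothetical, outages are assumed not to overlap one another, and each outage is clipped to the reference period so that only the portion inside it is counted.

```python
from datetime import datetime, timedelta


def downtime_percentage(outages, period_start, period_end):
    """Return total downtime as a percentage of the reference period.

    `outages` is an iterable of (start, end) datetime pairs that are
    assumed not to overlap each other.
    """
    period_seconds = (period_end - period_start).total_seconds()
    if period_seconds <= 0:
        raise ValueError("reference period must have positive length")

    down_seconds = 0.0
    for start, end in outages:
        # Count only the part of the outage inside the reference period.
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down_seconds += (clipped_end - clipped_start).total_seconds()

    return 100.0 * down_seconds / period_seconds
```

For example, a single outage of 7 hours 12 minutes in a 30-day month (720 hours) corresponds to 1% downtime.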

For Internet servers, downtime above 1% per year can be regarded as unacceptable, as this means a downtime of more than 3 days per year. For e-commerce and other industrial use, any value above 0.1% is usually considered unacceptable.^[15]
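
To make these thresholds concrete, the sketch below converts a yearly downtime percentage into hours. The function name is illustrative, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(downtime_percent):
    """Convert a yearly downtime percentage into hours of downtime."""
    if not 0 <= downtime_percent <= 100:
        raise ValueError("percentage must be between 0 and 100")
    return HOURS_PER_YEAR * downtime_percent / 100.0
```

Under this assumption, 1% corresponds to 87.6 hours (3.65 days) per year, and 0.1% to 8.76 hours per year.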

Response and reduction of impact

It is the duty of the network designer to make sure that a network outage does not happen. When it does happen, a well-designed system will further reduce the effects of an outage by keeping outages localized, so that they can be detected and fixed as soon as possible.

A process needs to be in place to detect a malfunction (network monitoring) and to restore the network to a working condition. This generally involves a help desk team composed of trained engineers who can troubleshoot a problem; a separate help desk team is usually necessary in order to field user input, which can be particularly demanding during a downtime.

A network management system can be used to detect faulty or degrading components prior to customer complaints, with proactive fault rectification.

Risk management techniques can be used to determine the impact of network outages on an organisation and what actions may be required to minimise risk. Risk may be minimised by using reliable components, by performing maintenance, such as upgrades, by using redundant systems or by having a contingency plan or business continuity plan. Technical means can reduce errors with error correcting codes, retransmission, checksums, or diversity schemes.
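
As a small illustration of two of these technical means working together, the following sketch attaches a CRC-32 checksum to each message and retransmits when the receiver detects corruption. The function names, the frame layout and the `channel` callable (which stands in for an unreliable link) are assumptions made for this example and do not describe any specific protocol.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify_frame(frame: bytes):
    """Return the payload if the checksum matches, otherwise None."""
    if len(frame) < 4:
        return None
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received else None


def deliver(payload: bytes, channel, max_attempts: int = 3):
    """Send the payload over `channel`, retransmitting on checksum failure.

    `channel` is a hypothetical callable that takes the sent frame and
    returns the (possibly corrupted) frame seen by the receiver.
    Returns the verified payload and the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        result = verify_frame(channel(make_frame(payload)))
        if result is not None:
            return result, attempt
    raise ConnectionError(f"delivery failed after {max_attempts} attempts")
```

A checksum only detects corruption; recovery here comes from retransmission, whereas an error correcting code would allow the receiver to repair some errors without a resend.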

One of the biggest causes of downtime is misconfiguration, where a planned change goes wrong. Typically organisations rely on manual effort to manage the process of configuration backups, but this requires highly skilled engineers with the time to manage the process across a multi-vendor network. Automation tools are available to manage backups, but there are very few solutions that handle configuration recovery which is needed to minimize the overall impact of the outage.^[16]

Planning

A planned outage is the result of a planned activity by the system owner and/or by a service provider. These outages, often scheduled during the maintenance window, can be used to perform tasks including the following:

Deferred maintenance, e.g., a deferred hardware repair or a deferred restart to clean up a garbled memory
Diagnostics to isolate a detected fault
Hardware fault repair
Fixing an error or omission in a configuration database or in a recent configuration database change
Fixing an error in an application database or in a recent application database change
Software patching/software updates to fix a software fault

Outages can also be planned as a result of a predictable natural event, such as a Sun outage.

Maintenance downtimes have to be carefully scheduled in industries that rely on computer systems. In many cases, system-wide downtimes can be averted using what is called a "rolling upgrade" - the process of incrementally taking down parts of the system for upgrade, without affecting the overall functionality.
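
The idea of a rolling upgrade can be sketched as follows. This is a simplified illustration under stated assumptions: `upgrade` and `is_healthy` are hypothetical callables supplied by the operator, and the remaining nodes are assumed to be able to carry the load while one batch is being upgraded.

```python
def rolling_upgrade(nodes, upgrade, is_healthy, batch_size=1):
    """Upgrade nodes one batch at a time, halting on the first failure.

    Only the nodes in the current batch are taken out of service, so
    the rest of the system keeps running. If any node in a batch is
    unhealthy after its upgrade, the rollout stops and the remaining
    nodes are left untouched.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    upgraded = []
    for i in range(0, len(nodes), batch_size):
        batch = nodes[i:i + batch_size]
        for node in batch:
            upgrade(node)
        failed = [node for node in batch if not is_healthy(node)]
        if failed:
            raise RuntimeError(f"rollout halted; unhealthy nodes: {failed}")
        upgraded.extend(batch)
    return upgraded
```

A smaller batch size limits how much capacity is offline at any moment, at the cost of a longer overall maintenance period.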

Avoidance

For most websites, website monitoring is available. Website monitoring (synthetic or passive) is a service that "monitors" downtime and users on the site.
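
A very simple form of synthetic monitoring is to request a page periodically and record whether it responds. The sketch below shows a single round of such checks using Python's standard library; the function names are illustrative, and commercial monitoring services do considerably more (multiple locations, alerting, response-time tracking).

```python
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, DNS failures, refused
        # connections and timeouts.
        return False


def check_sites(urls, probe=http_probe):
    """Probe each URL once and return a mapping of URL to up/down."""
    return {url: probe(url) for url in urls}
```

Passing the probe as a parameter makes it possible to substitute a different check (or a test double) without changing the surrounding logic.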

Other usage

Downtime can also refer to time when human capital or other assets go down. For instance, if employees are in meetings or unable to perform their work due to another constraint, they are down. This can be equally expensive, and can be the result of another asset (e.g. computers or systems) being down. This is also commonly known as "dead time".

Downtime is also generalized in a personal sense, being used to refer to a period of sleep or recreation.^[17]^[18]^[19]

This term is also used in factories and other industrial settings. See total productive maintenance (TPM).

Measuring downtime

There are many external services which can be used to monitor the uptime and downtime as well as availability of a service or a host.

A notable example is Downdetector, a website owned by Ookla which tracks routine downtime and major outages using outage reports that users submit through each service's page on Downdetector and through Twitter.^[20] It is currently available in 45 countries (with a different site in each country), and tracks 12,000 services internationally.^[21]^[22]