CROSS-REFERENCE TO RELATED APPLICATION This is a continuation of U.S. patent application Ser. No. 09/706,960, entitled “Recovering a System that has Experienced a Fault,” filed Nov. 6, 2000, now U.S. Pat. No. 7,089,449, which is hereby incorporated by reference.
TECHNICAL FIELD The invention relates to recovery of systems that have experienced faults.
BACKGROUND Improvements in technology have provided users with a wide variety of devices to perform various tasks. Examples of such devices include desktop computer systems, portable computer systems, personal digital assistants (PDAs), mobile telephones, and so forth. The devices are relatively sophisticated devices that include processing elements (e.g., microprocessors or microcontrollers) and storage devices (e.g., hard disk drives, dynamic random access memories or DRAMs, and so forth).
A typical device includes an operating system (e.g., a WINDOWS® operating system, a UNIX operating system, a LINUX operating system, etc.) that is loaded when the device is started. Application software is also loaded into the device to provide useful functions for users. Example applications include word processing applications, electronic mail applications, web browsing applications, calendar and address book applications, and so forth.
Despite improvements in technology, failures in various components of a device remain a persistent problem. When a component of a device, such as a hard disk drive, fails, the user may be left with an inoperable device. One option for the user is to take the device to a repair shop where an attempt may be made to recover the failed component, such as the failed hard disk drive. In some cases, data on the hard disk drive may be recovered so that loss of data is minimized. However, in many other cases, the data stored on the hard disk drive is lost, unless the user has diligently backed up the data.
Conventionally, recovery of the failed component such as the hard disk drive is an arduous process that often is frustrating for the user. A need thus exists for an improved method and apparatus for recovering a device to an operational state after a failure has occurred.
SUMMARY In general, according to one embodiment, a system comprises an interface to a network and a first operational element to perform one or more tasks in the system. A storage element contains a flag to indicate if a fault has occurred with the first operational element. A backup device enables access to the network through the interface in response to the flag indicating failure of the first operational element.
In general, according to another embodiment, a system comprises a main storage device, a backup storage device, and a routine executable to boot from the backup storage device in case of a system fault. The backup storage device enables access over a network to retrieve data from a network node to recover the system.
Other features and embodiments will become apparent from the following description, from the claims, and from the drawings.
BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is an embodiment of a network system including a network, various nodes coupled to the network, and a backup storage system.
FIG. 2 is a block diagram of components of a node of FIG. 1, in accordance with an embodiment.
FIG. 3 is a flow diagram of tasks performed for a failure recovery in the node of FIG. 2, in accordance with an embodiment.
DETAILED DESCRIPTION In the following description, numerous details are set forth to provide an understanding of the present invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these details and that numerous variations or modifications from the described embodiments may be possible.
Referring to FIG. 1, a network system 10 includes a network 12 that is coupled to network nodes 14, 16, and 18. Examples of the nodes 14, 16, and 18 include desktop computer systems, portable computer systems, and other types of systems having access to the network 12 (over either wired or wireless connections). Examples of the network 12 include local area networks (LANs), wide area networks (WANs), the Internet, and so forth.
A backup storage system 20 accessible over the network 12 stores data to be used to recover nodes 14, 16, and 18 in case of a fault (such as a component experiencing an error or failure) occurring in the nodes. The data stored in the backup storage system 20 includes user data, such as user-created documents or files, electronic mail messages, calendar and address book files, and so forth. The data stored in the backup storage system also includes software, such as operating system and application software that are stored and executed in each of the nodes. In one embodiment, the user data and software are stored as image data 30, 32, and 34 that correspond to nodes 14, 16, and 18, respectively. Thus, in case of a fault in node 14, the image data 30 is retrieved from the backup storage system 20 and communicated to the node 14, with the image data used to recover the node 14. Similarly, image data 32 and 34 are used to recover nodes 16 and 18, respectively.
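The per-node correspondence between image data and nodes can be pictured with a minimal Python sketch. The class names `ImageData` and `BackupStorageSystem`, their fields, and the use of the reference numerals as dictionary keys are illustrative assumptions and are not taken from the described embodiments.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ImageData:
    """Hypothetical recovery image: software plus user data for one node."""
    software: Tuple[str, ...]
    user_data: Tuple[str, ...]


class BackupStorageSystem:
    """Hypothetical model of a store that keeps one image per node."""

    def __init__(self) -> None:
        self._images: Dict[int, ImageData] = {}  # node identifier -> image

    def store(self, node_id: int, image: ImageData) -> None:
        self._images[node_id] = image

    def retrieve(self, node_id: int) -> ImageData:
        # Raises KeyError if no image was stored for this node.
        return self._images[node_id]


# Example: an image kept for node 14 is the one returned when node 14 faults.
backup = BackupStorageSystem()
backup.store(14, ImageData(software=("operating system",), user_data=("mail",)))
image_for_node_14 = backup.retrieve(14)
```

Keying the store by node keeps retrieval independent of what each image contains, which matches the one-image-per-node arrangement described above.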
As illustrated, the node 18 includes a main hard disk drive 24, a backup storage device 22, and a backup routine 26 executable in the node 18. The backup routine 26 is initially stored on the backup storage device 22 and is executable to enable the node 18 to access the backup storage system 20 over the network 12 in case one of several predetermined faults occurs in the node 18. Examples of such predetermined faults include failure of the hard disk drive, an unrecoverable error occurring on the hard disk drive, corrupted software and files associated with the software (e.g., library files, etc.), and so forth. The backup routine 26 and the backup storage device 22 may collectively be referred to as the “backup device 25.” In the illustrated embodiment, the backup routine 26 is a software routine loaded from the backup storage device 22 for execution on a processing element in the node 18. Alternatively, the backup device is a hardware component that performs backup tasks in response to detection of certain types of faults.
More generally, the node 18 includes a main operational portion, which in one embodiment contains the main hard disk drive 24 (or some other type of storage element). The main operational portion controls operation when the node 18 functions normally. The main hard disk drive 24 stores the operating system and application software, which are loaded into the node 18 to perform useful tasks. In case of some predetermined faults, the backup device 25 is used to enable access over the network 12 to the backup storage system 20 to retrieve data to recover the main operational portion of the node 18.
The backup storage device 22 can be implemented in a number of different ways. For example, the backup storage device 22 can be a bootable mini-drive that is mounted inside the chassis of or on a motherboard in the node. The mini-drive can be a hard disk drive having a relatively small storage capacity for reduced cost. Alternatively, the mini-drive can be other types of non-volatile memory, such as flash memory, electrically erasable and programmable read-only memory (EEPROM) devices, and so forth. Instead of a separate component in the chassis of each node, the mini-drive can also be integrated onto the motherboard of the node if its size permits. Alternatively, the backup storage device 22 can be a full form factor drive.
The backup storage device 22 can also include a compact disk (CD) or digital video disk or digital versatile disk (DVD) drive in which a CD or DVD is loaded. The CD or DVD contains the necessary software to enable the node 18 to access the network 12. Alternatively, the backup storage device 22 includes a partition on the main hard disk drive 24. It is likely that only one part of the hard disk drive 24 is corrupted while another portion is not corrupted. The backup storage device 22 can also include other bootable cartridges or drives.
An example of the backup routine 26 is a browser that is capable of executing on a processor in each node to gain access to the network 12. To avoid having to load a large operating system such as the WINDOWS® operating system, the browser can be a reduced version browser that does not need standard full-scale computer operating systems to run. Examples of such “mini-browsers” include browsers that run in PDAs and other handheld devices. Alternatively, mini-browsers can be designed to operate in a DOS operating system, a WINDOWS® CE operating system, or other “lite” operating systems.
Referring to FIG. 2, an example of the node 18 (which has an arrangement similar to that of nodes 14 and 16) is illustrated. The node 18 includes a central processing unit (CPU) 100 that forms the processing core of the node 18. A host bridge 102 is connected over a host bus to the CPU 100. The host bridge 102 is also connected to a system bus 104, such as a Peripheral Component Interconnect (PCI) bus. Additionally, the host bridge 102 contains control elements to interface a main memory 103 and a video controller 116 that controls presentation of images on a display 114. The system bus 104 is connected to a network interface 112 that manages communications to the network 12 through a port 110.
Other components of the node 18 include a south bridge 123 coupled to the system bus 104. The south bridge 123 is in turn coupled to a disk controller 124 that is connected to the main disk drive 24. The disk controller 124 can also manage communications with a CD and/or DVD drive 126. An input/output (I/O) controller 118, which is connected to a floppy disk drive 120 and to a mini-drive 122, is also coupled to the south bridge 123.
When the node 18 first starts up, a basic input/output system (BIOS) routine 108 is loaded to perform boot and initialization tasks. The BIOS routine 108 is stored in a non-volatile memory 106, which can be a flash memory, an EEPROM, or another like memory device. Access to the non-volatile memory 106 is provided through the south bridge 123.
The backup storage device 22 of FIG. 1 can be one or more of the following elements in the node 18: the mini-drive 122, the CD or DVD drive 126, the floppy drive 120, the backup partition 130 in the main hard disk drive 24, or an additional drive like the main drive 24.
Although not shown, the node also includes various layers and stacks to enable communications over the network 12. For example, a network stack can include a TCP/IP (Transmission Control Protocol/Internet Protocol) or a UDP/IP (User Datagram Protocol/Internet Protocol) stack. TCP is described in Request for Comments (RFC) 793, entitled “Transmission Control Protocol,” dated September 1981; and UDP is described in RFC 768, entitled “User Datagram Protocol,” dated August 1980. One version of IP is described in RFC 791, entitled “Internet Protocol,” dated September 1981; and another version of IP is described in RFC 2460, entitled “Internet Protocol, Version 6 (IPv6) Specification,” dated December 1998. TCP and UDP are transport layers for managing connections over an IP network.
Also, various services enable the communication of requests over the network 12, such as requests between a node and the backup storage system 20. One such service is the Hypertext Transfer Protocol (HTTP) service, which enables requests to be sent from one network element to another and responses to be returned from the destination network element to the requesting network element.
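As a hedged illustration of such a request, the Python sketch below builds (but does not send) an HTTP GET request that a node might direct to the backup storage system 20. The host name `backup.example`, the `/images` path, and the `node` query parameter are hypothetical; the description does not specify any URL layout.

```python
import urllib.parse
import urllib.request


def build_image_request(node_id: int, host: str = "backup.example") -> urllib.request.Request:
    """Build an HTTP GET request for a node's image data.

    The host, path, and query parameter are assumptions for illustration only.
    """
    query = urllib.parse.urlencode({"node": node_id})
    url = f"http://{host}/images?{query}"
    return urllib.request.Request(url, method="GET")


# Example: the request a node identified as 18 might issue.
request = build_image_request(18)
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send the request and return the response from the destination network element; that step is omitted here because the endpoint is fictitious.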
Referring to FIG. 3, the failure recovery process performed in one of the nodes 14, 16, and 18 is illustrated. The operating system 134 determines (at 202) if the node has experienced a fault. If so, the operating system 134 sets (at 204) a fail flag 132 (in the main hard disk drive 24) to an active state. Alternatively, the fail flag can be stored in the non-volatile memory 106, the mini-drive 122, or another memory storage element in the node.
Next, either in response to a user request to restart or automatically upon detection of the fault, the node is rebooted (at 206). When the node starts up, the BIOS routine 108 is loaded to perform boot tasks. One of the tasks performed by the BIOS routine 108 is to determine if the fail flag 132 has been set (at 208). If not, a normal boot process is performed (at 210) by the BIOS routine 108. If the fail flag 132 is set, then the BIOS routine 108 accesses (at 212) the backup storage device 22. Alternatively, instead of automatically checking for the fail flag 132, the boot from the backup storage device 22 can be performed manually by a user through the BIOS (such as by selecting the boot drive). Software on the storage device 22, including the backup routine 26, is loaded (at 214) into the node for execution on the CPU 100. As noted above, the backup routine 26 can be a mini-browser that enables communications over the network 12.
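The flag-driven boot decision of steps 202 through 214 can be sketched in Python as follows. This is a simplified model under stated assumptions, not BIOS code: the `Node` class, its method names, and the string log entries are hypothetical, and the sketch does not address when the fail flag is cleared because the description does not say.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical model of a node that carries a fail flag."""
    fail_flag: bool = False
    log: List[str] = field(default_factory=list)

    def check_for_fault(self, fault_detected: bool) -> None:
        # Steps 202 and 204: set the fail flag to an active state on a fault.
        if fault_detected:
            self.fail_flag = True

    def boot(self, user_selects_backup: bool = False) -> str:
        # Step 208: test the fail flag; a user may also pick the backup device manually.
        if self.fail_flag or user_selects_backup:
            self.log.append("access backup storage device 22")  # step 212
            self.log.append("load backup routine 26")           # step 214
            return "backup"
        self.log.append("normal boot")                          # step 210
        return "normal"


# Example: a fault-free boot, then a fault followed by a reboot (step 206).
node = Node()
first = node.boot()
node.check_for_fault(fault_detected=True)
second = node.boot()
```

Here `first` is `"normal"` and `second` is `"backup"`, mirroring the two branches taken at step 208.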
The backup routine 26 presents an indication of the fault (at 216), such as displaying a warning on the display 114. The backup routine 26 then waits (at 218) for a user request to recover. If a request to recover the node is received, then the backup routine 26 accesses (at 220) the remote backup system 20 over the network 12. Image data (30, 32, or 34) is retrieved from the backup storage system 20 and downloaded (at 222) into the node, where the image data is used to recover the node. A scan disk operation may be performed to determine portions of the hard disk drive that are defective. The image data can then be copied to the remaining portions of the hard disk drive 24 to enable normal operation of the node.
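The final two operations, scanning for defective portions and copying the image data to the remaining portions, can be illustrated with a minimal Python sketch. The block-level disk model, the `is_defective` probe, and the function names are assumptions made for illustration; the description does not specify how the scan or the copy is implemented.

```python
from typing import Callable, Dict, List, Set


def scan_disk(disk_size: int, is_defective: Callable[[int], bool]) -> Set[int]:
    """Return the block numbers that the (hypothetical) probe reports as defective."""
    return {block for block in range(disk_size) if is_defective(block)}


def restore_image(image_blocks: List[bytes], disk_size: int, defective: Set[int]) -> Dict[int, bytes]:
    """Map image blocks onto the non-defective disk blocks in ascending order."""
    usable = [block for block in range(disk_size) if block not in defective]
    if len(image_blocks) > len(usable):
        raise ValueError("not enough non-defective blocks to hold the image")
    return dict(zip(usable, image_blocks))


# Example: a five-block disk whose blocks 1 and 3 are defective.
bad_blocks = scan_disk(5, lambda block: block in (1, 3))
layout = restore_image([b"os", b"apps", b"user"], 5, bad_blocks)
```

In this example the three image blocks land on disk blocks 0, 2, and 4, skipping the defective ones. Raising an error when the image does not fit is a design choice of the sketch, not behavior stated in the description.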
The various software routines or modules described herein may be executable on various processing elements. Such processing elements include microprocessors, microcontrollers, processor cards (including one or more microprocessors or microcontrollers), or other control or computing devices. As used here, a “controller” can refer to either hardware or software or a combination of the two.
The storage units include one or more machine-readable storage media for storing data and instructions. The storage media include different forms of memory including semiconductor memory devices such as dynamic or static random access memories (DRAMs or SRAMs), erasable and programmable read-only memories (EPROMs), electrically erasable and programmable read-only memories (EEPROMs), and flash memories; magnetic disks such as fixed, floppy and removable disks; other magnetic media including tape; or optical media such as CDs or DVDs. Instructions that make up the various software routines or modules when executed by a respective processing element cause the corresponding node to perform programmed acts.
The instructions of the software routines or programs are loaded or transported into the node in one of many different ways. For example, code segments including instructions stored on floppy disks, CD or DVD media, a hard disk, or transported through a network interface card, modem, or other interface device are loaded into the system and executed as corresponding software routines or modules. In the loading or transport process, data signals that are embodied in carrier waves (transmitted over telephone lines, network lines, wireless links, cables, and the like) communicate the code segments, including instructions, to the node. Such carrier waves may be in the form of electrical, optical, acoustical, electromagnetic, or other types of signals.
While the invention has been disclosed with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of the invention.