CROSS-REFERENCE TO RELATED APPLICATIONS
This application is based upon and claims the benefit of priority of the prior Japanese Application No. 2012-189684 filed on Aug. 30, 2012 in Japan, the entire contents of which are hereby incorporated by reference.
FIELD
The embodiments discussed herein are directed to an information processing apparatus and a fault processing method for an information processing apparatus.
BACKGROUND
An OS (Operating System) operating in a server issues an I/O (Input/Output) instruction to a peripheral apparatus such as an I/O device through a serial or parallel internal bus. If no response to the I/O instruction is received upon polling through the internal bus and a timeout is then detected, it is recognized that a fault has occurred in an I/O device, a bus bridge connected to the I/O device or the like. In this instance, since a suspect location cannot be identified, the entire location is replaced as maintenance work, including any I/O device, bus bridge and so forth in which no fault has occurred.
In order to identify a suspect location that is a location to be replaced in maintenance work, it is necessary to acquire detailed fault information (error information) in the I/O device, bus bridge or the like. Therefore, it seems advisable for the server to extract detailed fault information and so forth from the I/O device, bus bridge or the like through the internal bus. However, for example, if a fault occurs in a path of the internal bus, then there is the possibility that fault information and so forth may not be read out. Therefore, a countermeasure is taken in which a notification of fault information and so forth of an apparatus connected to the bus bridge is issued to a maintenance diagnosis apparatus through a path (diagnosis bus or the like) different from the internal bus.
[Patent Document 1] Japanese Laid-Open Patent Publication No. 2009-223584
[Patent Document 2] Japanese Laid-Open Patent Publication No. 2009-217435
[Patent Document 3] Japanese Laid-Open Patent Publication No. Hei 11-259383
[Patent Document 4] Japanese Laid-Open Patent Publication No. Hei 10-254736
However, even when a notification of fault information and so forth is issued to the maintenance diagnosis apparatus through a path different from the internal bus, if the different path is configured from a low-speed bus such as, for example, an I2C (Inter-Integrated Circuit) bus, then there is the possibility that, when a plurality of faults occur or in a like case, transmission of fault information may fail and the fault information may be lost. If the fault information is lost in this manner, then when maintenance work is performed, a suspect location cannot be identified and it becomes necessary to replace the entire location, including any I/O device, bus bridge and so forth in which no fault has occurred.
SUMMARY
In one scheme, an information processing apparatus includes a processing apparatus, a bus bridge connected to the processing apparatus through a first bus and connecting to a peripheral apparatus, a nonvolatile storage apparatus that stores information relating to a fault occurring in the peripheral apparatus or the bus bridge, a monitoring apparatus connected to the nonvolatile storage apparatus through a second bus different from the first bus and monitoring a system including the processing apparatus, and a fault notification unit that stores, when the fault occurs in the peripheral apparatus or the bus bridge, the information relating to the occurring fault into the nonvolatile storage apparatus and issues a notification of an error to the monitoring apparatus through the second bus.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram depicting a general configuration of an information processing apparatus according to a present embodiment;
FIG. 2 is a block diagram depicting a detailed configuration of a PCI box in the information processing apparatus depicted in FIG. 1;
FIG. 3 is a flow chart illustrating operation of a server in the information processing apparatus depicted in FIG. 1;
FIG. 4 is a flow chart illustrating operation of an I2C controller (fault notification unit) in the PCI box depicted in FIG. 2;
FIG. 5 is a flow chart illustrating operation of a system controlling apparatus (monitoring apparatus) in the information processing apparatus depicted in FIG. 1; and
FIGS. 6 to 12 are flow charts illustrating a particular maintenance work procedure using the information processing apparatus according to the present embodiment.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
In the following, embodiments are described with reference to the drawings.
Configuration of the Information Processing Apparatus of the Present Embodiment
First, a configuration of the information processing apparatus 1 of the present embodiment is described with reference to FIGS. 1 and 2. Here, FIG. 1 is a block diagram depicting a general configuration of the information processing apparatus 1 of the present embodiment, and FIG. 2 is a block diagram depicting a detailed configuration of a PCI (Peripheral Components Interconnect) box 20 in the information processing apparatus 1 depicted in FIG. 1. As depicted in FIG. 1, the information processing apparatus 1 includes a server 10, a PCI box 20, a device 30 and a system controlling apparatus 40.
[1-1] Configuration of the Server (Processing Apparatus)
The server (processing apparatus) 10 is a general-purpose computer configured such that a CPU (Central Processing Unit) 11, a memory 12, a PCI-ex (PCI-express) controller 13, an I2C controller 14 and a LAN (Local Area Network) interface unit 15 are communicably connected to each other through a bus 16.
The CPU 11 reads out and executes programs stored in the memory 12 to perform various functions hereinafter described.
The memory 12 is, for example, a RAM (Random Access Memory), a ROM (Read Only Memory), an HDD (Hard Disk Drive), an SSD (Solid State Drive) or the like provided in an apparatus main body of the server 10.
The PCI-ex controller 13 functions as an interface to a PCI-ex bus (internal bus; first bus) 50 and is connected for communication, through the PCI-ex bus 50, to the PCI box 20 hereinafter described, which has a housing different from that of the server 10.
The I2C controller 14 functions as an interface to an I2C bus (system controlling bus; second bus) 70 and is connected for communication to the system controlling apparatus 40 hereinafter described through the I2C bus 70.
The LAN interface unit 15 functions as an interface to a LAN 80 and is connected for communication to the system controlling apparatus 40 hereinafter described through the LAN 80.
An OS that operates in the CPU 11 (server 10) has a function of issuing an I/O instruction for a peripheral apparatus (device 30 hereinafter described) such as an I/O device through the PCI-ex controller 13 and the PCI-ex bus 50.
If an error response (second response) or an interrupt (second interrupt) indicating that a fault has occurred on the PCI box 20 side hereinafter described is received through the PCI-ex bus 50 when an I/O access to the peripheral apparatus (device 30 hereinafter described) is performed, then the CPU 11 (OS) performs such functions as described below. In particular, the CPU 11 (OS) performs a function of performing a fault analysis (second fault analysis; identification of a suspect location in which a fault has occurred) based on information (fault information, error information) included in the error response or the interrupt. Then, the CPU 11 performs a function of notifying the system controlling apparatus 40 hereinafter described of a result of the second fault analysis through the LAN interface unit 15 and the LAN 80 and logging the result of the second fault analysis. The logging is performed not only into the memory 12 in the server 10 but also into a memory 42 (hereinafter described) in the system controlling apparatus 40.
Further, when no response is received from the PCI-ex bus 50 and a timeout occurs upon the I/O access to the peripheral apparatus (device 30 hereinafter described), the CPU 11 (OS) performs such functions as described below. In particular, the CPU 11 (OS) performs a function of recognizing an error of the PCI box 20 (all elements included in the PCI box 20) hereinafter described. Then, the CPU 11 performs a function of notifying the system controlling apparatus 40 hereinafter described of a result of the recognition through the LAN interface unit 15 and the LAN 80 and logging the result of the recognition. The logging is performed not only into the memory 12 in the server 10 but also into the memory 42 (hereinafter described) in the system controlling apparatus 40.
[1-2] Configuration of the PCI Box
The PCI box 20 has a housing different from that of the server 10 and is connected to the server 10 through the PCI-ex bus 50. The PCI box 20 includes a PCI-ex bridge 21, a PCI-ex card slot 22 and an I2C controller 23.
The PCI-ex bridge (bus bridge) 21 is connected to the server 10 through the PCI-ex bus 50 and is coupled with the PCI-ex card 31 by the PCI-ex card slot 22. The PCI box 20 has a plurality of PCI-ex card slots 22 configured such that a PCI-ex card 31 can be inserted into the individual PCI-ex card slots 22. By inserting the PCI-ex card 31 into each of the PCI-ex card slots 22, the PCI-ex card 31 is housed in the PCI box 20. The PCI-ex card 31 is connected to the device (peripheral apparatus) 30 such as an HDD, a LAN switch or a hub through a cable 32. Consequently, the server 10 can issue an I/O access to the device 30 through the PCI-ex bus 50, PCI-ex bridge 21, PCI-ex card slot 22, PCI-ex card 31 and cable 32.
The PCI-ex bridge 21 and the PCI-ex card 31 (device 30) individually have a function of issuing, when a fault occurs, a notification of an error response (first response) or an interrupt (first interrupt) indicating that a fault has occurred to the I2C controller 23 through I2C buses 24 and 25.
The I2C controller (fault notification unit)23 performs transmission and reception (error notification, collection of error information (fault information), control relating to power supply and so forth) of information relating to system control between thesystem controlling apparatus40 hereinafter described and thePCI box20. Therefore, theI2C controller23 is connected to thesystem controlling apparatus40 hereinafter described through an I2C bus (second bus)60 different from the PCI-ex bus (first bus)50. Further, theI2C controller23 is connected to the PCI-ex bridge21 through the I2C bus24 and is connected to the PCI-ex card31 (device30) inserted in the PCI-ex card slot22 through the I2C bus25 and the PCI-ex card slot22. Here, the I2C is communication means that can be utilized with a low cost although the speed is low in comparison with the PCI.
Further, as depicted in FIG. 2, the I2C controller 23 includes a processor 231, a memory 232 and a nonvolatile memory 233.
The processor 231 reads out and executes a program stored in the memory 232 and functions as a fault notification unit hereinafter described. The memory 232 is, for example, a RAM, a ROM, an HDD, an SSD or the like.
The nonvolatile memory (nonvolatile storage apparatus; flash memory)233 is controlled by theprocessor231 and stores information (hereinafter referred to as “fault information” or “error information”) relating to a fault occurring in any of the components of thePCI box20. Here, the components of thePCI box20 include the PCI-ex bridge21, PCI-ex card31 anddevice30 described above. Further, the fault information (error information) is retained as registration information in registers of the PCI-ex bridge21, PCI-ex card31 anddevice30 and includes information such as a part identifier, an error state and so forth. The fault information (error information) is used for an error analysis by thesystem controlling apparatus40.
It is to be noted that thenonvolatile memory233 is removably attached to the PCI box20 (I2C controller23). Accordingly, thenonvolatile memory233 can be removed from thePCI box20 and attached to a different processing apparatus as occasion demands so that fault information accumulated in thenonvolatile memory233 can be used for a fault analysis by the different processing apparatus.
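As an illustration only, the accumulation of fault information in a removable nonvolatile store might be sketched as follows. The embodiment does not define a record layout or storage format; the `FaultRecord` fields, the `NonvolatileFaultLog` class and the JSON-lines file that stands in for the nonvolatile memory 233 are all assumptions made for this sketch.

```python
import json
import os
from dataclasses import dataclass, asdict


@dataclass
class FaultRecord:
    """Hypothetical layout: the text only says the register information
    includes a part identifier, an error state and so forth."""
    part_id: str      # e.g. which component reported the fault
    error_state: int  # raw error-state register value (assumed integer)


class NonvolatileFaultLog:
    """Append-only file standing in for the nonvolatile memory (assumption)."""

    def __init__(self, path):
        self.path = path

    def accumulate(self, record):
        # Append one record and force it to stable storage so that it
        # survives a later failure of the notification path.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(record)) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def read_all(self):
        # Return every accumulated record, oldest first.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [FaultRecord(**json.loads(line)) for line in f if line.strip()]
```

Because the records live in the store rather than in the controller's working memory, a second `NonvolatileFaultLog` opened on the same path, which here plays the role of the different processing apparatus, reads back everything that was accumulated.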
The processor (fault notification unit)231 performs a function of reading out, when an error response (first response) or an interrupt (first interrupt) is received from a component in which a fault has occurred through the I2C buses24 and25, register information (fault information) from the component in which the fault has occurred through the I2C buses24 and25 and accumulating the read out information into thenonvolatile memory233. Further, theprocessor231 performs a function of accumulating the fault information into thenonvolatile memory233 and issuing a notification of an error to thesystem controlling apparatus40 through the I2C bus (second bus)60.
Further, the processor (fault notification unit)231 performs a function of transmitting, where a readout request of the fault information of thenonvolatile memory233 is received from thesystem controlling apparatus40 through the I2C bus60, the fault information stored in thenonvolatile memory233 to thesystem controlling apparatus40 through the I2C bus60.
Further, the processor (fault notification unit)231 performs a function of transmitting, where access (hereinafter described) for an alive check is received from thesystem controlling apparatus40, register information (error information where a fault occurs) indicating a state of theI2C controller23 and so forth to thesystem controlling apparatus40 through the I2C bus60.
[1-3] Configuration of System Controlling Apparatus (Monitoring Apparatus)
The system controlling apparatus 40 is an SVP (SerVice Processor) for performing monitoring of the system including the server 10 and the PCI box 20 and is connected to the server 10 and the PCI box 20 through the I2C buses 70 and 60 as system controlling buses, respectively.
Further, as depicted in FIG. 1, the system controlling apparatus 40 is configured by connecting a CPU 41, the memory 42, an I2C controller 43 and a LAN interface unit 44 to each other for communication through a bus 45.
The CPU 41 reads out and executes a program stored in the memory 42 to perform various functions hereinafter described. The memory 42 is, for example, a RAM, a ROM, an HDD, an SSD or the like.
The I2C controller 43 functions as an interface to the I2C buses 70 and 60 and is connected for communication to the server 10 (I2C controller 14) and the PCI box 20 (I2C controller 23) through the I2C buses 70 and 60, respectively.
The LAN interface unit 44 functions as an interface to the LAN 80 and is connected for communication to the server 10 (LAN interface unit 15) through the LAN 80.
The CPU41 (system controlling apparatus40) performs such functions as described below.
If a notification of an error is received from theI2C controller23 of thePCI box20, then theCPU41 reads out fault information stored in thenonvolatile memory233 through the I2C bus60 and performs a fault analysis (first fault analysis; identification of a suspect location in which a fault has occurred) based on the read-out fault information. Then, theCPU41 performs a function of issuing a notification of a result of the first fault analysis to the operator and performing logging of the result of the first fault analysis into thememory42.
It is to be noted that the notification of a result of the first fault analysis is performed to the operator using a monitor or the like in thesystem controlling apparatus40, and the operator who refers to the notification would perform maintenance work such as part replacement for a suspect location as hereinafter described.
At this time, when both of a result of the first fault analysis obtained based on the fault information of thenonvolatile memory233 of thePCI box20 and a result of the second fault analysis received as a notification from theserver10 through theLAN80 are obtained, theCPU41 issues a notification of a result of the first fault analysis in priority to the operator.
If no response is received from the PCI-ex bus50 when theserver10 performs an I/O access to thedevice30, then theCPU41 reads out fault information stored in thenonvolatile memory233 through the I2C bus60 and performs a fault analysis (first fault analysis; identification of a suspect location in which a fault has occurred) based on the read-out fault information. Then, theCPU41 performs a function of issuing a notification of a result of the first fault analysis to the operator and logging the result of the first fault analysis into thememory42.
TheCPU41 has a function of periodically or non-periodically performing an access for an alive check to theI2C controller23 of thePCI box20 in order to monitor thePCI box20. The alive check is a check process performed for checking whether or not theI2C controller23 is operating normally. It is to be noted that, while theCPU41 performs an access for an alive check also to theI2C controller14 of theserver10 in order to monitor theserver10, detailed description of the access is omitted here.
If error information indicating that a fault has occurred is received from theI2C controller23 when an access to theI2C controller23 of thePCI box20 is performed, then theCPU41 performs a fault analysis (third fault analysis) based on the received error information. Then, theCPU41 performs a function of issuing a notification of a result of the third fault analysis to the operator and logging the result of the third fault analysis into thememory42.
If no response is received from theI2C controller23 when an access to theI2C controller23 of the PCI box is performed and timeout occurs, then theCPU41 recognizes that a fault has occurred in theI2C controller23. In particular, theCPU41 performs a function of recognizing all elements included in theI2C controller23 as suspect locations and then issuing a notification of the fact to the operator and logging the fact into thememory42.
If the fault is resolved by replacing theI2C controller23 with a new one after the notification of the fact that a fault has occurred in theI2C controller23, then theCPU41 performs a function of determining theI2C controller23 as a suspect location and then issuing a notification of the fact to the operator and logging the fact into thememory42.
On the other hand, if no fault is resolved even if theI2C controller23 is replaced after the notification of the fact that a fault has occurred in theI2C controller23, theCPU41 recognizes the components connected to theI2C controller23 as suspect locations. In particular, theCPU41 performs a function of recognizing all of the components on thePCI box20 side except for theI2C controller23 as suspect locations and then issuing a notification of the fact to the operator and logging the fact into thememory42.
[2] Operation of the Information Processing Apparatus of the Present Embodiment
Now, operation of theserver10, operation of the I2C controller23 (fault notification unit231) of thePCI box20 and operation of the system controlling apparatus40 (CPU41) in the information processing apparatus of the present embodiment configured in such a manner as described above are described with reference toFIGS. 3 to 5.
[2-1] Operation of the Server
Operation of the server10 (CPU11) in the information processing apparatus1 depicted inFIG. 1 is described with reference to the flow chart (steps S11 to S18) depicted inFIG. 3.
If an I/O access to the device 30 is issued (YES route at step S11), then the CPU 11 decides whether or not a normal response to the issued I/O access is received (step S12). If a normal response to the I/O access is received (YES route at step S12), then the CPU 11 returns the processing to step S11 to wait for issuance of an I/O access.
On the other hand, if no normal response to the I/O access is received (NO route at step S12), then the CPU 11 decides whether or not an error response or an interrupt indicating that a fault has occurred on the PCI box 20 side is received through the PCI-ex bus 50 (step S13). If an error response or an interrupt is received (YES route at step S13), then the CPU 11 performs a fault analysis (second fault analysis) based on fault information included in the error response or the interrupt to identify a suspect location in which a fault has occurred (step S14). Then, the CPU 11 issues a notification of a result of the fault analysis to the system controlling apparatus 40 through the LAN interface unit 15 and the LAN 80 and performs logging of the fault analysis result (step S15), and then returns the processing to step S11.
Further, if neither a normal response nor an error response/interrupt to the I/O access is received (NO route at step S13), the CPU 11 decides whether or not a timeout (lapse of a predetermined time) has occurred (step S16). If a timeout does not occur (NO route at step S16), then the CPU 11 returns the processing to step S12. On the other hand, if a timeout occurs (YES route at step S16), then the CPU 11 recognizes all elements included in the PCI box 20 as suspect locations (step S17). Then, the CPU 11 issues a notification of a result of the recognition to the system controlling apparatus 40 through the LAN interface unit 15 and the LAN 80 and performs logging of the recognition result (step S18), and then returns the processing to step S11.
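The branching of steps S12 to S18 can be pictured with the following minimal sketch. It is not the OS implementation of the embodiment: the callback names `poll`, `analyze` and `notify` are hypothetical, and the predetermined timeout is modeled simply as a maximum number of polls.

```python
from enum import Enum


class Response(Enum):
    NORMAL = "normal"
    ERROR = "error"  # error response or interrupt received over the first bus


# Placeholder for "all elements included in the PCI box" (step S17).
ALL_PCI_BOX_ELEMENTS = ["all elements in PCI box 20"]


def handle_io_access(poll, max_polls, analyze, notify):
    """Handle one I/O access.

    poll()      -> (Response, payload), or None if nothing has arrived yet
    max_polls   -> stand-in for the predetermined timeout (assumption)
    analyze(p)  -> list of suspect locations (second fault analysis)
    notify(s)   -> report and log the suspects
    """
    for _ in range(max_polls):
        result = poll()
        if result is None:
            continue                     # S16, NO route: keep waiting
        kind, payload = result
        if kind is Response.NORMAL:
            return "ok"                  # S12, YES route
        suspects = analyze(payload)      # S14: second fault analysis
        notify(suspects)                 # S15: notify and log
        return "fault"
    notify(ALL_PCI_BOX_ELEMENTS)         # S17 and S18: timeout
    return "timeout"
```

The timeout branch is the costly one for maintenance: with no fault information available on the server side, every element of the PCI box has to be reported as suspect.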
[2-2] Operation of the Fault Notification Unit
Operation of the I2C controller23 (fault notification unit231) in thePCI box20 depicted inFIG. 2 is described with reference to the flow chart (steps S21 to S29) depicted inFIG. 4.
Thefault notification unit231 decides whether or not an error response or an interrupt indicating that a fault has occurred is received from the PCI-ex bridge21 or the PCI-ex card31 (device30), which is a component of thePCI box20, through the I2C buses24 and25 (step S21). If an error response or an interrupt is received (YES route at step S21), then thefault notification unit231 reads out register information (fault information) from the component, in which a fault has occurred, through the I2C buses24 and25 and accumulates the read out information into the nonvolatile memory233 (steps S22 and S23). Then, thefault notification unit231 issues a notification of the error to thesystem controlling apparatus40 through the I2C bus60 (step S24), and returns the processing to step S21.
On the other hand, if an error response or an interrupt is not received (NO route at step S21), then the fault notification unit 231 decides whether or not a readout request for fault information is received from the system controlling apparatus 40 through the I2C bus 60 (step S25). Here, the readout request for fault information is issued from the system controlling apparatus 40 (CPU 41) in response to a notification of an error issued from the fault notification unit 231. If the readout request for fault information in the nonvolatile memory 233 is received from the system controlling apparatus 40 through the I2C bus 60 (YES route at step S25), then the fault notification unit 231 reads out and transmits the fault information stored in the nonvolatile memory 233 to the system controlling apparatus 40 through the I2C bus 60 (steps S26 and S27), and returns the processing to step S21.
If a readout request for fault information in thenonvolatile memory233 is not received (NO route at step S25), then thefault notification unit231 decides whether or not an access for an alive check from thesystem controlling apparatus40 is received (step S28). If an access for an alive check from thesystem controlling apparatus40 is received (YES route at step S28), then thefault notification unit231 transmits register information (error information) indicating a state of theI2C controller23 and so forth to thesystem controlling apparatus40 through the I2C bus60 (step S29), and returns the processing to step S21. It is to be noted that, if an access for an alive check from thesystem controlling apparatus40 is not received (NO route at step S28), then thefault notification unit231 returns the processing to step S21.
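One pass through the loop of steps S21 to S29 might be sketched as below. The event dictionary, the callback names and the use of a Python list in place of the nonvolatile memory are assumptions for illustration; the point of the sketch is the ordering, in which the fault information is stored before the error notification is sent over the low-speed bus.

```python
def fault_notification_step(event, read_registers, store, notify_error,
                            send, controller_registers):
    """Process one event in the fault notification unit (hypothetical API).

    event                -> dict with a "kind" key, or {} if nothing arrived
    read_registers(src)  -> register (fault) information of component src
    store                -> list standing in for the nonvolatile memory
    notify_error()       -> error notification to the monitoring apparatus
    send(data)           -> reply to the monitoring apparatus
    controller_registers -> state reported in response to an alive check
    """
    kind = event.get("kind")
    if kind == "fault":                          # S21, YES route
        info = read_registers(event["source"])   # S22
        store.append(info)                       # S23: store first
        notify_error()                           # S24: then notify
    elif kind == "readout_request":              # S25, YES route
        send(list(store))                        # S26 and S27
    elif kind == "alive_check":                  # S28, YES route
        send(controller_registers)               # S29
    # Any other case: nothing to do, processing returns to S21.
```

Because the notification carries no fault details, losing or delaying it does not lose the stored information; the monitoring apparatus can still read it out later with a readout request.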
[2-3] Operation of the System Controlling Apparatus (Monitoring Apparatus)
Operation of the system controlling apparatus (CPU41) in the information processing apparatus1 depicted inFIG. 1 is described with reference to the flow chart (steps S31 to S52) depicted inFIG. 5.
TheCPU41 decides whether or not a notification of an error is received from theI2C controller23 of thePCI box20 through the I2C bus60 (step S31). If a notification of an error is received from theI2C controller23 of the PCI box20 (YES route at step S31), then theCPU41 issues a readout request for fault information stored in thenonvolatile memory233 through the I2C bus60 (step S32). If fault information from thenonvolatile memory233 is received after a readout request is issued (step S33), then theCPU41 performs a fault analysis (first fault analysis) based on the read out fault information to identify a suspect location in which a fault has occurred (step S34). Then, theCPU41 issues a notification of a result of the first fault analysis to the operator and logs the result of the first fault analysis into the memory (step S35), and then returns the processing to step S31.
If a notification of an error is not received from theI2C controller23 of the PCI box20 (NO route at step S31), then theCPU41 decides whether or not a result of a second fault analysis is received from theserver10 through the LAN80 (step S36). If a result of a second fault analysis is received from the server10 (YES route at step S36), then theCPU41 decides whether or not a result of a first fault analysis corresponding to the second fault analysis is acquired by the CPU41 (step S37). If a result of a first fault analysis corresponding to the second fault analysis is acquired (YES route at step S37), then theCPU41 issues a notification of the result of the first fault analysis in priority to the operator and logs the result of the first fault analysis into the memory42 (step S38), and then returns the processing to step S31. On the other hand, if a result of the first fault analysis corresponding to the second fault analysis is not acquired (NO route at step S37), then theCPU41 issues a notification of the result of the second fault analysis in priority to the operator and logs the result of the second fault analysis into the memory42 (step S39), and then returns the processing to step S31. It is to be noted that a result of the first fault analysis is obtained by theCPU41 performing a fault analysis based on the fault information in thenonvolatile memory233 of thePCI box20. Further, the result of the second fault analysis is a result of the fault analysis performed by theserver10 and issued as a notification from theserver10 through theLAN80 as described above.
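The priority rule of steps S37 to S39 reduces to a small selection function. The sketch below is illustrative only; the function name and the use of `None` for "no result acquired" are assumptions.

```python
def select_report(first_result, second_result):
    """Choose which fault analysis result is reported to the operator.

    first_result  -> analysis by the monitoring apparatus based on the
                     fault information in the nonvolatile memory, or None
    second_result -> analysis notified by the server over the LAN, or None
    """
    if first_result is not None:
        return ("first", first_result)    # S38: first analysis has priority
    if second_result is not None:
        return ("second", second_result)  # S39: fall back to the server's result
    return None                           # nothing to report
```

For example, `select_report("PCI-ex card 31", "PCI box 20")` returns `("first", "PCI-ex card 31")`, while `select_report(None, "PCI box 20")` returns `("second", "PCI box 20")`.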
If a result of the second fault analysis is not received from the server 10 (NO route at step S36), then the CPU 41 decides whether or not an access for an alive check is issued to the I2C controller 23 of the PCI box 20 (step S40). If an access for an alive check is not issued (NO route at step S40), then the CPU 41 returns the processing to step S31.
If an access for an alive check is issued to the PCI box20 (YES route at step S40), then theCPU41 decides whether or not register information is received from theI2C controller23 through the I2C bus60 in response to the access (step S41). If the register information is received (YES route at step S41), then theCPU41 decides whether or not the received register information is error information (step S42). Then, if the received register information is not error information (NO route at step S42), then the processing returns to step S31. On the other hand, if the received register information is error information (YES route at step S42), then theCPU41 performs a fault analysis (third fault analysis) based on the error information to identify a suspect location in which a fault has occurred (step S43). Then, theCPU41 issues a notification of a result of the third fault analysis to the operator and logs the result of the third fault analysis into the memory42 (step S44), and returns the processing to step S31.
If the register information is not received (NO route at step S41), then theCPU41 decides whether or not timeout (lapse of a predetermined time period) occurs without receiving a response from the I2C controller23 (step S45). If timeout does not occur (NO route at step S45), then theCPU41 returns the processing to step S41. On the other hand, if timeout occurs (YES route at step S45), then theCPU41 recognizes all elements included in theI2C controller23 of thePCI box20 as suspect locations (step S46). Then, theCPU41 issues a notification of the result of the recognition to the operator and logs the recognition result into the memory42 (step S47).
Thereafter, theCPU41 decides whether or not the fault is resolved by replacing theI2C controller23 with a different one after a notification that a fault has occurred in theI2C controller23 is issued (step S48). If the fault is resolved (YES route at step S48), then theCPU41 determines theI2C controller23 as a suspect location (step S49). Then, theCPU41 issues a notification of the fact to the operator and logs the fact into the memory42 (step S50), and then returns the processing to step S31. On the other hand, if the fault is not resolved (NO route at step S48), then theCPU41 recognizes all components on thePCI box20 side except for theI2C controller23 as suspect locations (step S51). Then, theCPU41 issues a result of the recognition to the operator and logs the recognition result into the memory (step S52), and then returns the processing to step S31.
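The narrowing of suspect locations after an alive-check timeout (steps S46 to S52) can be summarized as follows. This is a sketch under assumptions: the function name and the three-valued `resolved_by_replacement` argument are hypothetical, and the component list is taken from the components of the PCI box named in the description.

```python
def suspects_after_alive_timeout(resolved_by_replacement=None):
    """Return suspect locations after the alive check has timed out.

    resolved_by_replacement:
        None  -> the I2C controller has not been replaced yet (S46)
        True  -> the fault disappeared after replacement (S49)
        False -> the fault persists after replacement (S51)
    """
    if resolved_by_replacement is None:
        return ["I2C controller 23 (all elements)"]
    if resolved_by_replacement:
        return ["I2C controller 23"]
    # All components on the PCI box side except the I2C controller.
    return ["PCI-ex bridge 21", "PCI-ex card 31", "device 30"]
```

The replacement step thus works as a one-bit experiment: its outcome either confirms the controller as the faulty part or shifts suspicion to the components connected to it.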
[3] Particular Maintenance Work Procedure using the Information Processing Apparatus of Present Embodiment
Now, a particular maintenance work procedure using the information processing apparatus1 of the present embodiment is described with reference toFIGS. 6 to 12. It is to be noted thatFIGS. 6 to 12 are flow charts illustrating a particular maintenance work procedure using the information processing apparatus1 of the present embodiment.
[3-1] First, a particular maintenance work procedure when an error response or an interrupt is returned from thePCI box20 when theserver10 performs an I/O access and a fault occurring location (suspect location) is the PCI-ex card31 (or thedevice30 connected to the PCI-ex card31) is described with reference toFIGS. 6 and 7.
FIG. 6 is a flow chart illustrating operation/procedure (steps A11 to A16) relating to theserver10, and illustrates operation/procedure when a result of a fault analysis performed based on fault information in thenonvolatile memory233 is not acquired but another result of a fault analysis by theserver10 is acquired by thesystem controlling apparatus40 side.
Step A11: If an OS operating in the server 10 (CPU 11) issues an I/O access, then an I/O access command is issued through the PCI-ex bus 50 in accordance with the issuance of the I/O access.
Step A12: Since a fault has occurred in the PCI-ex card 31, an error response from the PCI-ex card 31 arrives at the PCI-ex bridge 21 at which the I/O access command has arrived.
Step A13: An error response or an interrupt is returned from the PCI-ex bridge 21 to the server 10 through the PCI-ex bus 50.
Step A14: A fault analysis (error analysis) is performed by the OS of the server 10 and a notification of a result of the fault analysis is issued to the system controlling apparatus 40 through the LAN 80 [corresponding to steps S14 and S15 of FIG. 3].
Step A15: By the system controlling apparatus 40, a notification of the fault analysis result issued from the server 10 and indicating that a fault has occurred in the PCI-ex card 31 is issued to the operator and logging of the fault analysis result into the memory 42 is performed [corresponding to step S15 of FIG. 3].
Step A16: The person in charge of maintenance (operator) would refer to the fault analysis result issued from the system controlling apparatus 40 or the log stored in the memory 42 to identify and replace the PCI-ex card 31 (or the device 30) in which a fault has occurred.
In this manner, when a fault occurs in the PCI-ex card31, there is the possibility that the fault may be detected also by thesystem controlling apparatus40 side. In the present embodiment, when a fault is detected by thesystem controlling apparatus40 side, a result of the fault analysis obtained on thesystem controlling apparatus40 side is used in priority to another result of the fault analysis obtained by theserver10 side and error reporting to the operator is performed.FIG. 7 is a flowchart illustrating operation/procedure (steps A21 to A26) relating to theI2C controller23 and thesystem controlling apparatus40 in such a case as just described.
Step A21: An interrupt from the PCI-ex card31 to theI2C controller23 occurs together with occurrence of a fault in the PCI-ex card31. Thefault notification unit231 extracts register information (error information) of the PCI-ex card31 through the I2C bus25 in response to the interrupt and accumulates the extracted information into the nonvolatile memory233 [corresponding to steps S22 and S23 ofFIG. 4].
Step A22: Thefault notification unit231 issues a notification of an error to thesystem controlling apparatus40 through the I2C bus (system controlling bus)60 [corresponding to step S24 ofFIG. 4].
Step A23: The system controlling apparatus40 (CPU41) extracts error information stored in thenonvolatile memory233 through the I2C bus60 in response to the error notification [corresponding to step S33 ofFIG. 5].
Step A24: Thesystem controlling apparatus40 performs a fault analysis (error analysis) based on the extracted error information [corresponding to step S34 ofFIG. 5].
Step A25: Thesystem controlling apparatus40 issues a notification of a result of the fault analysis to the operator and performs logging of the fault analysis result into the memory42 [corresponding to step S35 ofFIG. 5].
Step A26: The person in charge of maintenance (operator) would refer to the fault analysis result issued from the system controlling apparatus 40 or the log stored in the memory 42 to identify and replace the PCI-ex card 31 (or the device 30) in which a fault has occurred.
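The flow of steps A21 to A24 can be sketched as follows. This is a minimal illustration only: the class names, the record layout and the `analyze` function are hypothetical and are not part of the embodiment. The sketch assumes that the nonvolatile memory 233 behaves as an append-only log and that the fault analysis simply names the component each record came from.

```python
from dataclasses import dataclass, field


@dataclass
class NonvolatileLog:
    """Hypothetical stand-in for the nonvolatile memory 233."""
    entries: list = field(default_factory=list)

    def append(self, record: dict) -> None:
        self.entries.append(record)

    def read_all(self) -> list:
        # Reading does not erase the accumulated records.
        return list(self.entries)


@dataclass
class I2CController:
    """Hypothetical stand-in for the I2C controller 23 and its fault notification unit 231."""
    log: NonvolatileLog = field(default_factory=NonvolatileLog)
    pending_notifications: int = 0

    def on_fault_interrupt(self, source: str, registers: dict) -> None:
        # Step A21: extract register information and accumulate it.
        self.log.append({"source": source, "registers": registers})
        # Step A22: raise an error notification toward the system controlling apparatus.
        self.pending_notifications += 1


def analyze(records: list) -> list:
    """Steps A23 and A24 (assumed logic): list each faulty component once, in order."""
    suspects = []
    for record in records:
        if record["source"] not in suspects:
            suspects.append(record["source"])
    return suspects


controller = I2CController()
controller.on_fault_interrupt("PCI-ex card 31", {"status": 0x8000})
suspects = analyze(controller.log.read_all())
print(suspects)
```

Because the records stay in the log until they are read out, the readout over the low-speed bus can happen later than the interrupt that produced them.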
[3-2] Now, a particular maintenance work procedure where an error response or an interrupt is returned from thePCI box20 side when theserver10 performs an I/O access and a fault occurring location (suspect location) is the PCI-ex bridge21 is described with reference toFIGS. 8 and 9.
FIG. 8 is a flow chart illustrating operation/procedure (steps A31 to A35) relating to theserver10, and illustrates operation/procedure when a result of a fault analysis performed based on fault information in thenonvolatile memory233 is not acquired but a result of another fault analysis in theserver10 is acquired on thesystem controlling apparatus40 side.
Step A31: If the OS operating in theserver10 issues an I/O access, then an I/O access command is issued through the PCI-ex bus50 in accordance with the issuance of the I/O access.
Step A32: Since a fault has occurred in the PCI-ex bridge 21, an error is recognized in the PCI-ex bridge 21 at which the I/O access command arrives. In accordance with this, an error response or an interrupt is returned from the PCI-ex bridge 21 to the server 10 through the PCI-ex bus 50.
Step A33: Fault analysis (error analysis) is performed by the OS of theserver10 and a notification of a result of the fault analysis is issued to thesystem controlling apparatus40 through the LAN80 [corresponding to steps S14 and S15 ofFIG. 3].
Step A34: The system controlling apparatus 40 issues, to the operator, a notification of the fault analysis result received from the server 10, which indicates that a fault has occurred in the PCI-ex bridge 21, and logs the fault analysis result into the memory 42 [corresponding to step S15 of FIG. 3].
Step A35: The person in charge of maintenance (operator) would refer to the fault analysis result issued from thesystem controlling apparatus40 or the log stored in thememory42 to decide and replace the PCI-ex bridge21 in which a fault occurs.
In this manner, where a fault occurs in the PCI-ex bridge 21, there is the possibility that the fault may be detected also on the system controlling apparatus 40 side. In the present embodiment, where a fault is detected on the system controlling apparatus 40 side, the result of the fault analysis obtained on the system controlling apparatus 40 side is used in preference to the result of the fault analysis obtained on the server 10 side, and error reporting to the operator is performed. FIG. 9 is a flow chart illustrating operation/procedure (steps A41 to A46) relating to the I2C controller 23 and the system controlling apparatus 40 in such a case.
Step A41: An interrupt from the PCI-ex bridge 21 to the I2C controller 23 occurs together with occurrence of a fault in the PCI-ex bridge 21. The fault notification unit 231 extracts register information (error information) of the PCI-ex bridge 21 through the I2C bus 24 in response to the interrupt and accumulates the extracted information into the nonvolatile memory 233 [corresponding to steps S22 and S23 of FIG. 4].
Step A42: Thefault notification unit231 issues a notification of an error to thesystem controlling apparatus40 through the I2C bus (system controlling bus)60 [corresponding to step S24 ofFIG. 4].
Step A43: The system controlling apparatus40 (CPU41) extracts the error information stored in thenonvolatile memory233 through the I2C bus60 in response to the error notification [corresponding to step S33 ofFIG. 5].
Step A44: Thesystem controlling apparatus40 performs a fault analysis based on the extracted error information [corresponding to step S34 ofFIG. 5].
Step A45: Thesystem controlling apparatus40 issues a notification of a result of the fault analysis to the operator and logs the fault analysis result into the memory42 [corresponding to step S35 ofFIG. 5].
Step A46: The person in charge of maintenance (operator) would refer to the fault analysis result issued from thesystem controlling apparatus40 or the log stored in thememory42 to decide and replace the PCI-ex bridge21 in which a fault has occurred.
[3-3] Now, a particular maintenance work procedure for the case where no response is received from the PCI box 20 side and timeout occurs when the server 10 performs an I/O access and the fault occurring location (suspect location) is the PCI-ex card 31 is described with reference to FIGS. 10 and 7. FIG. 10 is a flow chart illustrating operation/procedure (steps A51 to A54) relating to the server 10 in such a case.
Step A51: If an OS operating in theserver10 issues an I/O access, then an I/O access command is issued through the PCI-ex bus50 in accordance with the issuance of the I/O access.
Step A52: No response is received from thePCI box20 side and timeout occurs.
Step A53: All components included in thePCI box20 are recognized as suspect locations by the OS of theserver10 and a notification of a result of the recognition is issued to thesystem controlling apparatus40 through the LAN80 [corresponding to step S17 ofFIG. 3].
Step A54: The system controlling apparatus 40 issues, to the operator, a notification of the recognition result received from the server 10 and logs the recognition result into the memory 42 [corresponding to step S18 of FIG. 3].
The person in charge of maintenance (operator) who refers to such a recognition result would replace the entire PCI box 20 with a new one, although the fault has actually occurred in the PCI-ex card 31 in the PCI box 20 and only the faulty PCI-ex card 31 needs to be replaced.
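The difference between the server-side results of FIG. 6 and FIG. 10 can be sketched as follows. The function name and the component list are hypothetical. The sketch assumes that an error response names the component that reported it, whereas a timeout carries no such information.

```python
from typing import Optional

# Illustrative list of components in the PCI box 20 (an assumption for this sketch).
PCI_BOX_COMPONENTS = ["PCI-ex bridge 21", "I2C controller 23", "PCI-ex card 31"]


def server_side_suspects(error_source: Optional[str]) -> list:
    """Return the suspect locations as seen by the OS of the server 10.

    error_source is the component named by an error response or interrupt,
    or None when the I/O access ended in a timeout.
    """
    if error_source is None:
        # Timeout (steps A52 and A53): nothing narrows the fault down,
        # so every component of the PCI box is a suspect.
        return list(PCI_BOX_COMPONENTS)
    # Error response (steps A12 to A14): the reporting component is the suspect.
    return [error_source]
```

With a timeout, `server_side_suspects(None)` returns the whole list, which is why the operator would end up replacing the entire PCI box 20.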
Detailed fault information (error information) is required in order to identify a suspect location. Therefore, in the present embodiment, when a fault is detected by thesystem controlling apparatus40 side, error reporting to the operator is performed giving priority to the result of the fault analysis obtained by thesystem controlling apparatus40 rather than the result of the fault analysis obtained by theserver10. At this time, operation/procedure (steps A21 to A26) similar to those depicted inFIG. 7 are executed.
Step A21: An interrupt from the PCI-ex card31 to theI2C controller23 occurs together with occurrence of a fault in the PCI-ex card31. Thefault notification unit231 extracts register information (error information) of the PCI-ex card31 through the I2C bus25 in response to the interrupt and accumulates the extracted information into the nonvolatile memory233 [corresponding to steps S22 and S23 ofFIG. 4].
Step A22: The fault notification unit 231 issues a notification of an error to the system controlling apparatus 40 through the I2C bus (system controlling bus) 60 [corresponding to step S24 of FIG. 4].
Step A23: The system controlling apparatus40 (CPU41) extracts error information stored in thenonvolatile memory233 through the I2C bus60 in response to the error notification [corresponding to step S33 ofFIG. 5].
Step A24: Thesystem controlling apparatus40 performs a fault analysis based on the extracted error information [corresponding to step S34 ofFIG. 5].
Step A25: Thesystem controlling apparatus40 issues a notification of a result of the fault analysis to the operator and performs logging of the fault analysis result into the memory42 [corresponding to step S35 ofFIG. 5].
Step A26: The person in charge of maintenance (operator) would refer to the fault analysis result issued from thesystem controlling apparatus40 or the log stored in thememory42 to decide and replace the PCI-ex card31 in which a fault has occurred.
[3-4] Now, a particular maintenance work procedure when no response is received from thePCI box20 side and timeout occurs when theserver10 performs an I/O access and the fault occurring location (fault location) is the PCI-ex bridge21 is described with reference toFIGS. 10 and 9. Also in this instance, operation/procedure (steps A51 to A54) similar to those depicted inFIG. 10 are executed in theserver10.
Step A51: If an OS operating in theserver10 issues an I/O access, then an I/O access command is issued through the PCI-ex bus50 in accordance with the issuance of the I/O access.
Step A52: No response is received from thePCI box20 side and timeout occurs.
Step A53: All components included in thePCI box20 are recognized as suspect locations by the OS of theserver10 and a notification of a result of the recognition is issued to thesystem controlling apparatus40 through the LAN80 [corresponding to step S17 ofFIG. 3].
Step A54: By thesystem controlling apparatus40, a notification of the recognition result issued from theserver10 is issued to the operator and logging of the recognition result into thememory42 is performed [corresponding to step S18 ofFIG. 3].
The person in charge of maintenance (operator) who refers to such a recognition result would replace the entire PCI box 20, although the fault has actually occurred in the PCI-ex bridge 21 in the PCI box 20 and only the faulty PCI-ex bridge 21 needs to be replaced.
Detailed fault information (error information) is required in order to identify a suspect location. Therefore, in the present embodiment, when a fault is detected by thesystem controlling apparatus40 side, error reporting to the operator is performed giving priority to the result of the fault analysis obtained by thesystem controlling apparatus40 rather than the result of the fault analysis obtained by theserver10. At this time, operation/procedure (steps A41 to A46) similar to those depicted inFIG. 9 are executed.
Step A41: An interrupt from the PCI-ex bridge 21 to the I2C controller 23 occurs together with occurrence of a fault in the PCI-ex bridge 21. The fault notification unit 231 extracts register information (error information) of the PCI-ex bridge 21 through the I2C bus 24 in response to the interrupt and accumulates the extracted information into the nonvolatile memory 233 [corresponding to steps S22 and S23 of FIG. 4].
Step A42: Thefault notification unit231 issues a notification of an error to thesystem controlling apparatus40 through the I2C bus (system controlling bus)60 [corresponding to step S24 ofFIG. 4].
Step A43: The system controlling apparatus40 (CPU41) extracts error information stored in thenonvolatile memory233 through the I2C bus60 in response to the error notification [corresponding to step S33 ofFIG. 5].
Step A44: Thesystem controlling apparatus40 performs a fault analysis (error analysis) based on the extracted error information [corresponding to step S34 ofFIG. 5].
Step A45: Thesystem controlling apparatus40 issues a notification of a result of the fault analysis to the operator and performs logging of the fault analysis result into the memory42 [corresponding to step S35 ofFIG. 5].
Step A46: The person in charge of maintenance (operator) would refer to the fault analysis result issued from thesystem controlling apparatus40 or the log stored in thememory42 to decide and replace the PCI-ex bridge21 in which a fault has occurred.
[3-5] A particular maintenance work procedure when an error response or an interrupt is returned from theI2C controller23 when thesystem controlling apparatus40 performs an access for an alive check to theI2C controller23 of thePCI box20 is described with reference toFIG. 11.FIG. 11 is a flow chart illustrating operation/procedure (steps A61 to A65) relating to thesystem controlling apparatus40 and theI2C controller23 in such a case as just described.
Step A61: The system controlling apparatus40 (CPU41) issues an access for an alive check to theI2C controller23 of thePCI box20 through the I2C bus60.
Step A62: TheI2C controller23 transmits, in response to the access for an alive check, an error response or an interrupt including register information (error information) to thesystem controlling apparatus40 through the I2C bus60 [corresponding to step S29 ofFIG. 4].
Step A63: If the error information is received, then thesystem controlling apparatus40 performs a fault analysis based on the received error information [corresponding to step S43 ofFIG. 5].
Step A64: Thesystem controlling apparatus40 issues a notification of a result of the fault analysis to the operator and performs logging of the fault analysis result into the memory42 [corresponding to step S44 ofFIG. 5].
Step A65: The person in charge of maintenance (operator) would refer to the fault analysis result issued from thesystem controlling apparatus40 or the log stored in thememory42 to decide and replace theI2C controller23 in which a fault has occurred.
[3-6] A particular maintenance work procedure when no response is received from theI2C controller23 side and timeout occurs when thesystem controlling apparatus40 performs an access for an alive check to theI2C controller23 of thePCI box20 is described with reference toFIG. 12.FIG. 12 is a flowchart illustrating operation/procedure (steps A71 to A82) relating to thesystem controlling apparatus40 in such a case as just described.
Step A71: The system controlling apparatus40 (CPU41) issues an access for an alive check to theI2C controller23 of thePCI box20 through the I2C bus60.
Step A72: No response is received from theI2C controller23 side of thePCI box20 and timeout occurs.
Step A73: Thesystem controlling apparatus40 recognizes all components included in theI2C controller23 of thePCI box20 as suspect locations [corresponding to step S46 ofFIG. 5].
Step A74: The system controlling apparatus 40 issues a notification of a result of the recognition to the operator and logs the recognition result into the memory 42 [corresponding to step S47 of FIG. 5].
Step A75: The person in charge of maintenance (operator) would refer to the recognition result issued from thesystem controlling apparatus40 or the log stored in thememory42 to decide and replace theI2C controller23 in which a fault has occurred.
Step A76: Thesystem controlling apparatus40 or the person in charge of maintenance decides whether or not the fault is resolved by the replacement at step A75 [corresponding to step S48 ofFIG. 5].
Step A77: If the fault is resolved (YES route at step A76), then the system controlling apparatus 40 determines the I2C controller 23 as the suspect location, issues a notification of the fact to the person in charge of maintenance and logs the fact into the memory 42. Thereafter, the processing is ended. The maintenance work by the person in charge of maintenance is also completed [corresponding to steps S49 and S50 of FIG. 5].
Step A78: If the fault is not resolved (NO route at step A76), then the system controlling apparatus 40 recognizes all components on the PCI box 20 side except for the I2C controller 23 as suspect locations, issues a notification of a result of the recognition to the person in charge of maintenance and logs the recognition result into the memory 42 [corresponding to steps S51 and S52 of FIG. 5].
Step A79: The person in charge of maintenance who refers to the substance of the notification or the log would confirm whether or not isolation work of the components configuring thePCI box20 is permitted while thePCI box20 remains connected to the system (server10).
Step A80: If the isolation work is permitted (YES route at step A79), then the person in charge of maintenance would replace the components configuring thePCI box20 one by one and confirm whether or not the fault is resolved by the replacement thereby to identify a suspect location. If a suspect location is identified by such work as just described and the fault is resolved by replacement of the element of the suspect location, then the maintenance work by the person in charge of maintenance is completed.
Step A81: The isolation work may not be permitted depending on the circumstances of the customer. In this case (NO route at step A79), the person in charge of maintenance would replace all components of the PCI box 20 except for the I2C controller 23 with a new PCI box 20.
Step A82: After the replacement of the PCI box 20, the person in charge of maintenance would send the PCI box 20 for which identification of a suspect location has failed to a factory, where a fault reproduction experiment is performed on that PCI box 20. At this time, the fault information accumulated in the nonvolatile memory 233 included in the I2C controller 23 is read out and a suspect location in the PCI box 20 is identified based on the read out fault information. Then, the part (element) of the identified suspect location is replaced with a new part. If the fault is resolved by the replacement work, then the maintenance work by the person in charge of maintenance is completed.
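The branching of steps A75 to A82 can be summarized in a short sketch. The function name, the two boolean inputs and the action strings are hypothetical labels chosen for illustration. In practice the decisions at steps A76 and A79 are made by the system controlling apparatus 40 or by the person in charge of maintenance.

```python
def alive_check_timeout_procedure(fixed_by_controller_swap: bool,
                                  isolation_permitted: bool) -> list:
    """Return the ordered maintenance actions after an alive-check timeout."""
    # Step A75: the I2C controller 23 is replaced first.
    actions = ["replace I2C controller 23"]
    if fixed_by_controller_swap:
        # Steps A76 (YES route) and A77: the controller was the suspect location.
        actions.append("record I2C controller 23 as the suspect location")
        return actions
    # Step A78: the remaining components of the PCI box 20 become suspects.
    actions.append("mark remaining PCI box 20 components as suspect locations")
    if isolation_permitted:
        # Steps A79 (YES route) and A80: isolate by replacing parts one by one.
        actions.append("replace components one by one until the fault is resolved")
    else:
        # Steps A81 and A82: swap the box, then analyze the removed box at a factory.
        actions.append("replace PCI box 20 except for I2C controller 23")
        actions.append("send removed PCI box 20 to a factory for fault reproduction")
    return actions
```

For example, `alive_check_timeout_procedure(False, True)` yields the three actions of the isolation route, while `alive_check_timeout_procedure(True, False)` stops after the controller has been identified as the suspect location.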
[4] Effect of the Information Processing Apparatus of the Embodiment
In the existing technique, when a notification of fault information or the like is issued to the system controlling apparatus 40, which corresponds to a maintenance diagnosis apparatus, through a path different from the PCI-ex bus 50, and the different path is a low-speed bus such as an I2C bus, there is the possibility that, when a plurality of faults occur, part of the fault information may be lost without being fully transmitted.
On the other hand, with the information processing apparatus 1 of the present embodiment, since details of fault information are accumulated into the nonvolatile memory 233 when a fault occurs, the fault information is stored with certainty into the nonvolatile memory 233 without being lost, irrespective of an on/off state of the power supply. Then, if an error notification is issued to the system controlling apparatus 40 through the I2C bus (second bus) 60, the system controlling apparatus 40 successively reads out the fault information from the nonvolatile memory 233.
Accordingly, it is possible to acquire fault information of the PCI-ex bridge21 or a PCI-ex card31 (device30) in thePCI box20 with certainty, identify a suspect location with high accuracy and perform replacement with a new part to resolve the fault. Consequently, in the maintenance work, replacement of theentire PCI box20 can be avoided as far as possible, and accurate maintenance by identification of a suspect location (suspect part) can be achieved. Thus, effective maintenance work and reduction of a maintenance and part cost can be implemented.
Further, since the I2C bus60 is a low-speed path, there is the possibility that, if thesystem controlling apparatus40 tries to collect error information from the PCI-ex card31 through the I2C bus60, then the maintenance work may not be completed within an actual execution time period. On the other hand, in the present embodiment, since error information is accumulated and stored into thenonvolatile memory233 also in a case in which the maintenance work cannot be performed within an actual execution time period, a fault analysis can be performed with certainty to identify a suspect location and then a notification of the identified suspect location can be issued.
Further, by accumulating fault information into thenonvolatile memory233, a collection process of fault information and a notification process of the fault information to thesystem controlling apparatus40 can be performed separately from each other, and also increase of the speed of the process can be implemented.
On the other hand, the I2C bus (second bus)60 which is an access path different from the PCI-ex bus50 is provided and is used as a path for collection of fault information from thePCI box20 to thesystem controlling apparatus40. In such a case as just described, if the I2C bus60 or theI2C controller23 fails, then there is the possibility that fault information may not be transmitted from theI2C controller23 to thesystem controlling apparatus40 and a suspect location may not be able to be identified. In contrast, in the present embodiment, by the maintenance work procedure described above with reference toFIGS. 11 and 12, a fault occurrence location in theI2C controller23 can be identified to perform maintenance.
Further, in the present embodiment, when a fault is detected by thesystem controlling apparatus40 side, priority is given to a fault analysis result obtained by thesystem controlling apparatus40 side rather than to a fault analysis result obtained by theserver10 side to perform error reporting to the operator. Consequently, the operator can refer to the fault analysis result, in which a suspect location is identified based on the detailed fault information, obtained by thesystem controlling apparatus40 side to perform maintenance work. In short, replacement only of a part corresponding to the suspect location can be performed without replacing theentire PCI box20, and efficient maintenance work and reduction of the maintenance and part cost can be implemented.
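The priority rule described above amounts to a simple selection, sketched here with a hypothetical function name and with `None` standing for "no result available from that side":

```python
from typing import Optional


def select_report(controller_result: Optional[str],
                  server_result: Optional[str]) -> Optional[str]:
    """Pick the fault analysis result that is reported to the operator.

    The result obtained by the system controlling apparatus 40 side is
    preferred; the result from the server 10 side is used only when the
    system controlling apparatus side has detected no fault.
    """
    if controller_result is not None:
        return controller_result
    return server_result
```

For a timeout case, `select_report("PCI-ex card 31", "all components of PCI box 20")` returns `"PCI-ex card 31"`, so the operator sees the narrower suspect location.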
[5] Others
Although the preferred embodiment of the present invention is described in detail above, the present invention is not limited to the particular embodiment but can be carried out in various modified or altered forms without departing from the subject matter of the present invention.
In the embodiment described above, the PCI-ex bus is used as the first bus, and the I2C bus is used as the second bus (system controlling bus). However, the present invention is not limited to this, and some other buses may be used. For example, an SM (System Management) bus may be used as the second bus.
According to the embodiment, fault information of a peripheral apparatus and a bus bridge is acquired with certainty.
All examples and conditional language recited herein are intended for pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are to be construed as being without limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present inventions have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.