CN120256224A


Info

Publication number: CN120256224A
Application number: CN202510365271.5A
Authority: CN
Inventors: 万梦佳; 张锐华; 韦森; 占义成; 陈志列
Original assignee: Shenzhen Yiyike Data Equipment Technology Co ltd
Current assignee: Shenzhen Yiyike Data Equipment Technology Co ltd
Priority date: 2025-03-26
Filing date: 2025-03-26
Publication date: 2025-07-04

Abstract

The application relates to a server fault positioning system and method. The system comprises a debugging card and a main board, the debugging card being electrically connected with the main board. The debugging card is used for acquiring a startup self-checking code of the main board, matching the startup self-checking code with a pre-stored fault code library, determining the fault code corresponding to the startup self-checking code, and determining the fault module in a target server corresponding to the fault code. The main board is used for calling and starting BIOS debugging firmware stored in the debugging card, entering a debugging mode, and acquiring enumeration information of the fault module. The debugging card is further used for analyzing the enumeration information, determining the state code corresponding to each unit included in the fault module, matching each state code with a pre-stored state code library, determining the fault state codes that cannot be matched with any correct state code in the state code library, and determining the fault units corresponding to the fault state codes. The application solves the problem of low server fault locating efficiency in the related technology.

Description

Server fault positioning system and method

Technical Field

The application belongs to the technical field of server management, and particularly relates to a server fault positioning system and method.

Background

The server is used as a core facility for data storage, processing and transmission, and the stability and reliability of the server are important. Along with the continuous development of the server industry and related technologies, the functions of the server are continuously enriched, the design is more and more complex, the integration level is higher and higher, and the possibility of the server failure is increased while the design difficulty is increased. These faults may be caused by a variety of factors, such as hardware, software, or networks.

When a server issues alarm information because the state of a register in a central processing unit (Central Processing Unit, CPU) has changed, the alarm information is usually too general to show directly which peripheral or component caused the fault, which makes fault positioning very challenging for operation and maintenance personnel. The conventional fault locating method generally comprises manually disassembling the machine, performing minimum system verification, powering up the external devices one by one, and the like. Such fault localization methods tend to be time consuming and inefficient. Particularly when key components such as a power supply, a CPU or a memory are involved, inaccurate fault positioning can reduce maintenance efficiency and affect service continuity.

At present, no effective solution is proposed for the problem of low fault location efficiency of a server in the related art.

Disclosure of Invention

The embodiment of the application provides a server fault positioning system and a server fault positioning method, which are used for at least solving the problem of lower fault positioning efficiency of a server in the related technology.

The embodiment of the application provides a server fault positioning system, which comprises a debugging card and a main board. The main board is arranged in a target server, and the debugging card is electrically connected with the main board. The debugging card is used for acquiring a startup self-checking code of the main board, matching the startup self-checking code with a preset fault code library stored in the debugging card, determining the fault code corresponding to the startup self-checking code, and determining the fault module in the target server corresponding to the fault code. The main board is used for calling and starting BIOS debugging firmware stored in the debugging card, entering a debugging mode, and obtaining enumeration information of the fault module. The debugging card is further used for analyzing the enumeration information, determining the state code corresponding to each unit included in the fault module, matching each state code with a preset state code library stored in the debugging card, determining the fault state codes that cannot be matched with any correct state code in the state code library, and determining the fault units corresponding to the fault state codes.

In some embodiments, the debug card comprises a processor, a memory, an interface module and a communication module, the main board comprises a controller, a PCH chip and a CPLD chip, the debug card establishes communication with the controller, the PCH chip and the CPLD chip through the communication module, the debug card is electrically connected with the main board through the interface module, the memory stores the fault code library and the state code library, wherein the fault code library comprises a plurality of fault codes and fault module information corresponding to each fault code, and the state code library comprises a plurality of correct state codes and unit information corresponding to each correct state code.

In some embodiments, the processor includes a first set of GPIO pins and a second set of GPIO pins, the CPLD chip includes a third set of GPIO pins, the PCH chip includes a fourth set of GPIO pins, wherein the first set of GPIO pins are connected to the third set of GPIO pins through the interface module, the second set of GPIO pins are connected to the fourth set of GPIO pins through the interface module, and the processor is configured to obtain the boot self-test code of the motherboard from the PCH chip through the second set of GPIO pins, and send the fault module information of the fault module to the CPLD chip through the first set of GPIO pins.

In some embodiments, the processor further comprises a first set of PG pins, a first set of EN pins and a debugging pin, and the CPLD chip further comprises a second set of PG pins and a second set of EN pins. The first set of PG pins are connected with the second set of PG pins through the interface module, and the first set of EN pins are connected with the second set of EN pins through the interface module. The processor is further used for setting the debugging pin to 1 after being electrically connected with the main board and powered on for the first time, and for obtaining the PG signal level state of each standby power supply of the main board from the CPLD chip through the first set of PG pins. The processor determines that a standby power supply whose PG signal level state is abnormal is an abnormal standby power supply, and sends an EN signal to the abnormal standby power supply through the first set of EN pins so as to force-enable it. When the PG signal level state of the abnormal standby power supply changes from abnormal to normal, the processor determines that the abnormal standby power supply is the fault unit.

In some embodiments, the CPLD chip further comprises a fifth group of GPIO pins, wherein the fifth group of GPIO pins are used for controlling power shielding of a peripheral module connected with the target server, the processor is further used for controlling the CPLD chip through the first group of PG pins under the condition that the PG signal level state of each standby power supply of the main board is normal, so that the CPLD chip shields EN signals of the peripheral module through the fifth group of GPIO pins and enables the target server to execute a minimum system starting-up action, and the processor determines that the peripheral module comprises the fault module under the condition that the minimum system starting-up action is successfully executed by the target server.

In some embodiments, the processor is further configured to control the CPLD chip through the first set of PG pins to enable the CPLD chip to recover EN signals of the peripheral module through the fifth set of GPIO pins and enable the target server to restart normally when the target server executes a minimum system boot operation successfully, and further configured to obtain the boot self-test code of the motherboard from the PCH chip through the second set of GPIO pins after the target server restarts normally, match the boot self-test code with the fault code library stored in the memory, determine the fault code corresponding to the boot self-test code, determine the fault module corresponding to the fault code in the target server, and send the fault module information to the CPLD chip through the first set of GPIO pins.

In some embodiments, the processor is further configured to set the debug pin to 0 after sending the fault module information to the CPLD chip, the CPLD chip is configured to control the PCH chip to call and start the BIOS debug firmware to enable the PCH chip to enter the debug mode, the PCH chip is configured to send the enumeration information of the fault module to the processor through the fourth set of GPIO pins, the processor is further configured to parse the enumeration information, determine the state code corresponding to each unit included in the fault module, match each state code with the state code library stored in the memory, determine a fault state code that cannot be matched with a correct state code in the state code library, and determine the fault unit corresponding to the fault state code, and send the unit information of the fault unit to the CPLD chip through the first set of GPIO pins.

In some embodiments, the fifth set of GPIO pins is further used for controlling power shielding of the fault module and of the associated modules in the target server that communicate with the fault module. After sending the fault module information to the CPLD chip, the processor is further used for controlling the CPLD chip through the first set of PG pins, so that the CPLD chip shields the EN signal of the fault module through the fifth set of GPIO pins and the target server restarts. If the target server restarts normally, the processor records the fault module information. If the target server does not restart normally, the processor sends information on each associated module to the CPLD chip through the first set of GPIO pins, and the CPLD chip shields the EN signals of the associated modules one by one through the fifth set of GPIO pins, restarting the target server each time, until the target server restarts normally; the processor records the fault module information in the case where the target server does not restart normally.
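The shield-and-restart procedure in the preceding paragraph can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the firmware of the application: the function name `isolate_by_shielding`, the callbacks `shield` and `restart_ok`, and the choice to return the list of modules shielded at the moment of a normal restart are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence


def isolate_by_shielding(
    fault_module: str,
    associated: Sequence[str],
    shield: Callable[[str], None],
    restart_ok: Callable[[], bool],
) -> Optional[List[str]]:
    """Shield the fault module, then associated modules one by one.

    Returns the modules that were shielded when the server first restarted
    normally, or None if it never did. `shield` stands in for the CPLD
    masking an EN signal; `restart_ok` stands in for a restart attempt.
    """
    shielded = [fault_module]
    shield(fault_module)
    if restart_ok():
        # Shielding the fault module alone was sufficient.
        return shielded

    for module in associated:
        # Mask one more associated module, then try restarting again.
        shield(module)
        shielded.append(module)
        if restart_ok():
            return shielded

    return None
```

In a test harness, `shield` and `restart_ok` could be replaced with stubs that record which modules were masked, which keeps the control flow checkable without hardware.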

In some embodiments, the CPLD chip is further configured to send the unit information of the failed unit to the controller, the controller configured to generate an alert log based on the unit information of the failed unit.

The embodiment of the application provides a server fault positioning method, which is applied to the server fault positioning system according to any embodiment of the first aspect. The method comprises: the debugging card obtaining a startup self-checking code of the main board, matching the startup self-checking code with a preset fault code library stored in the debugging card, determining the fault code corresponding to the startup self-checking code, and determining the fault module in a target server corresponding to the fault code; the main board calling and starting BIOS debugging firmware stored in the debugging card, entering a debugging mode, and obtaining enumeration information of the fault module; and the debugging card analyzing the enumeration information, determining the state code corresponding to each unit included in the fault module, matching each state code with a preset state code library stored in the debugging card, determining the fault state codes that cannot be matched with any correct state code in the state code library, and determining the fault units corresponding to the fault state codes.

Compared with the related art, the server fault positioning system and method provided by the embodiment of the application have the advantages that when a target server fails, the debug card is connected with the main board of the target server to obtain the startup self-checking code of the main board, the startup self-checking code is matched with the fault code library prestored in the debug card, so that the fault code corresponding to the startup self-checking code and the fault module corresponding to the fault code are determined, the main board can call and start the BIOS debug firmware prestored in the debug card to enter a debug mode to obtain enumeration information of the fault module, the debug card can analyze the enumeration information, determine the state codes of a plurality of units contained in the fault module, and match the state codes with the state code library prestored in the debug card to determine the fault state code and the fault unit corresponding to the fault state code, thereby realizing accurate positioning of the server fault and improving the operation and maintenance efficiency of the server. The application solves the problem of lower fault locating efficiency of the server in the related technology, and achieves the technical effect of improving the fault locating efficiency of the server.

The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the other features, objects, and advantages of the application.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of a server fault location system according to one embodiment of the application;

FIG. 2 is a schematic diagram of the structure of a debug card according to one embodiment of the present application;

FIG. 3 is a flow chart of a server fault location method according to one embodiment of the application.

Wherein, each reference sign in the figure is:

10-server fault location system, 100-debugging card, 101-processor, 102-memory, 103-debugging Flash memory, 110-main board, 111-controller, 112-PCH chip, 113-CPLD chip and 114-Flash memory.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.

As used in the present description and the appended claims, the term "if" may be interpreted as "when" or "once" or "in response to a determination" or "in response to detection", depending on the context. Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted, depending on the context, as "upon determination" or "in response to determination" or "upon detection of a [described condition or event]" or "in response to detection of a [described condition or event]".

Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.

Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.

When the server issues alarm information because the state of a register in the CPU has changed, the alarm information is usually too general to show directly which peripheral or component caused the fault, which makes fault positioning very challenging for operation and maintenance personnel. The conventional fault locating method generally comprises manually disassembling the machine, performing minimum system verification, powering up the external devices one by one, and the like. Such fault localization methods tend to be time consuming and inefficient. Particularly when key components such as a power supply, a CPU or a memory are involved, inaccurate fault positioning can reduce maintenance efficiency and affect service continuity.

There are server troubleshooting tools in the related art that typically provide some basic troubleshooting functionality, such as log analysis, performance monitoring, and the like. However, these tools only reduce the fault range, and still require manual disassembly by service personnel, and use replacement methods to locate specific fault positions and remove faults.

In view of the above, when a target server fails, the server fault positioning system provided by the embodiment of the application acquires a startup self-checking code of the main board by connecting the debug card with the main board of the target server, and matches the startup self-checking code with a fault code library pre-stored in the debug card, so as to determine a fault code corresponding to the startup self-checking code and a fault module corresponding to the fault code, the main board can call and start a BIOS debug firmware pre-stored in the debug card to enter a debug mode to acquire enumeration information of the fault module, the debug card can analyze the enumeration information, determine state codes of a plurality of units contained in the fault module, and match the state codes with a state code library pre-stored in the debug card, so as to determine the fault state code and a fault unit corresponding to the fault state code, thereby realizing accurate positioning of a server fault, eliminating the need of manually dismantling a machine by an operation and maintenance person to judge the fault position, improving the operation and maintenance efficiency of the server, and reducing the maintenance cost of the server. The application solves the problem of lower fault locating efficiency of the server in the related technology, and achieves the technical effect of improving the fault locating efficiency of the server.

Before the embodiments of the present application are described in further detail, the terms involved in the embodiments are explained as follows.

1) BMC, short for Baseboard Management Controller, is the management and control core of the server and is responsible for its various monitoring, control and configuration functions.

2) CPLD, short for Complex Programmable Logic Device, is a digital logic circuit whose logic functions the user defines by programming.

3) PCH, short for Platform Controller Hub, is a core chip on the motherboard that is mainly responsible for managing input/output functions, peripheral connections, and communication with the CPU.

4) MCU, short for Microcontroller Unit, is equivalent to a microcomputer: it integrates functional modules such as a CPU, a memory and input/output interfaces, and can implement various complex control functions through programming.

5) Debug Flash (also called debug Flash memory) stores the BIOS debug firmware. When a specific fault needs to be located, the PCH switches to the BIOS debug firmware, which provides the debug card with the underlying hardware environment for system fault detection and hardware diagnosis.

Referring to FIG. 1, FIG. 1 is a schematic structural diagram of a server fault location system according to an embodiment of the present application. The server fault location system includes a debug card and a motherboard. As shown in FIG. 1, the motherboard is disposed in a target server, and the debug card is electrically connected with the motherboard. The debug card is configured to obtain a boot self-checking code of the motherboard, match the boot self-checking code with a preset fault code library stored in the debug card, determine the fault code corresponding to the boot self-checking code, and determine the fault module in the target server corresponding to the fault code. The motherboard is configured to call and start the BIOS debug firmware stored in the debug card, enter a debug mode, and obtain enumeration information of the fault module. The debug card is further configured to parse the enumeration information, determine the status code corresponding to each unit included in the fault module, match each status code with a preset status code library stored in the debug card, determine the fault status codes that cannot be matched with any correct status code in the status code library, and determine the fault units corresponding to the fault status codes.

In this embodiment, when the target server fails (for example, the motherboard cannot be powered on, or the host hangs in the startup stage and cannot start normally), operation and maintenance personnel may externally connect the debug card provided by the embodiment of the present application, so that the debug card is electrically connected with the motherboard through the connector.

After the debugging card is connected with the main board, a program stored in the debugging card can be automatically operated, a fault module is determined by acquiring a startup self-checking code (postcode) of the main board, and fault module information of the fault module is sent to a CPLD chip in the main board, so that the CPLD chip executes power supply shielding operation on the fault module; after the power supply of the fault module is shielded, if the target server is restarted normally, it can be determined that the fault module causes the target server to fail.
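The first step above, matching the startup self-checking code against the fault code library, amounts to a table lookup. The following is a minimal sketch under stated assumptions: the library contents, the `FaultModule` type and the function name `locate_fault_module` are hypothetical, and the POST code values are invented for illustration rather than taken from any real BIOS.

```python
from typing import Dict, NamedTuple, Optional


class FaultModule(NamedTuple):
    """Fault module information (module category and module ID)."""
    category: str
    module_id: int


# Hypothetical fault code library: POST code -> fault module information.
# Real codes depend on the BIOS vendor; these values are illustrative only.
FAULT_CODE_LIBRARY: Dict[int, FaultModule] = {
    0x2B: FaultModule("DIMM", 0),
    0x92: FaultModule("PCIe", 1),
}


def locate_fault_module(post_code: int) -> Optional[FaultModule]:
    """Return the fault module matched by the POST code, or None if unmatched."""
    return FAULT_CODE_LIBRARY.get(post_code)
```

A caller would pass the code read from the PCH chip and forward any non-`None` result to the CPLD chip as the fault module information.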

After determining that the fault module causes the fault of the target server, the main board can call and start the BIOS debugging firmware stored in the debugging card, so that the PCH chip in the main board is switched to the BIOS debugging firmware for starting. After entering the debug mode, the PCH chip may obtain debug information of the failed module, extract enumeration information about a plurality of units contained in the failed module from the debug information, and send the enumeration information to the debug card.

The debug card can analyze the enumeration information, determine the state code corresponding to each unit contained in the fault module, match each state code with a preset state code library, determine the fault unit, send the unit information of the fault unit to the CPLD chip, so that the CPLD chip can execute power supply shielding operation on the fault unit, and after the power supply of the fault unit is shielded, if the target server is restarted normally, the fault unit can be determined to cause the fault of the target server.
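The status code matching described above can be sketched as a filter over the parsed enumeration information. This is an illustration only: it assumes the enumeration information has already been parsed into a mapping from unit name to reported status code, and the names `find_fault_units`, `enumeration` and `status_library`, as well as the sample codes, are hypothetical.

```python
from typing import Dict, List


def find_fault_units(
    enumeration: Dict[str, int],
    status_library: Dict[int, str],
) -> List[str]:
    """Return the units whose status code matches no correct status code.

    `enumeration` maps each unit of the fault module to its reported status
    code. `status_library` maps each correct status code to its unit
    information. A unit is treated as a fault unit when its code is absent
    from the library.
    """
    return [
        unit
        for unit, status_code in enumeration.items()
        if status_code not in status_library
    ]


# Illustrative data only: two DIMM units, one reporting an unknown code.
example_library = {0x00: "unit present and trained"}
example_enumeration = {"DIMM_A0": 0x00, "DIMM_A1": 0x7F}
fault_units = find_fault_units(example_enumeration, example_library)
```

With the sample data, only `DIMM_A1` is reported, since `0x7F` is not among the correct status codes.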

In this way, through the external debugging card, the accurate and rapid positioning and debugging of the fault unit of the target server can be realized under the condition that operation and maintenance personnel do not need to manually disassemble, troubleshoot and the like on the target server, so that the operation and maintenance efficiency of the target server is improved, and through storing BIOS debugging firmware on the debugging card, the bottom hardware environment can be provided for the debugging card, so that the debugging card can perform specific analysis on the debugging information (namely enumeration information) of the fault module, thereby performing the troubleshooting and accurately and rapidly positioning the fault unit. In addition, the independent debugging card design can avoid occupying the originally crowded space of the main board, and interface resources of corresponding hardware equipment can not be occupied when the target server has no fault.

As shown in fig. 1, in one embodiment, the debug card includes a processor, a memory, an interface module (not shown in fig. 1), and a communication module (e.g., an I2C module in fig. 1), the motherboard includes a controller (BMC), a PCH chip, and a CPLD chip, the debug card establishes communication with the controller, the PCH chip, and the CPLD chip through the communication module, the debug card is electrically connected with the motherboard through the interface module, and the memory stores a fault code library and a status code library, wherein the fault code library includes a plurality of fault codes and fault module information (e.g., module category, module ID, etc.) corresponding to each of the fault codes, and the status code library includes a plurality of correct status codes and unit information corresponding to each of the correct status codes.

In this embodiment, the debug card may be designed as a compact circuit board. The processor included in the debug card may be a microprocessor such as an MCU, and the memory in the debug card may be a nonvolatile memory, for example an EEPROM (Electrically Erasable Programmable Read-Only Memory). The interface module in the debug card may be the interface circuit needed to establish an electrical connection with the motherboard. The communication module in the debug card may be an I2C (Inter-Integrated Circuit, two-wire serial bus) module, an SPI (Serial Peripheral Interface) module, a UART (Universal Asynchronous Receiver/Transmitter) module, or another module capable of establishing a communication connection with the motherboard. As shown in FIG. 1, the processor may communicate with the memory through the I2C module (I2C data link) to obtain the fault code library or the status code library stored in the memory.

The debug card may communicate with the controller, the PCH chip and the CPLD chip in the motherboard via circuitry (e.g., the interface circuit and communication module described above) to complete fault localization of the target server. The debug card may establish an electrical connection with the motherboard through a reserved interface on the target server, including but not limited to at least one of a PCIe (Peripheral Component Interconnect Express, a high-speed serial computer expansion bus standard) socket, a dedicated debug interface, or an expansion slot.

As shown in FIG. 1, in one embodiment, the processor includes a first set of PG pins, a first set of EN pins, and a debug pin, and the CPLD chip includes a second set of PG pins and a second set of EN pins. The first set of PG pins are connected with the second set of PG pins through the interface module, and the first set of EN pins are connected with the second set of EN pins through the interface module. The processor is further configured to set the debug pin to 1 after being electrically connected to the motherboard and powered on for the first time, and to acquire the PG signal level state of each standby power supply of the motherboard from the CPLD chip through the first set of PG pins. The processor determines that a standby power supply whose PG signal level state is abnormal is an abnormal standby power supply, sends an EN signal to the abnormal standby power supply through the first set of EN pins to force-enable it, and determines that the abnormal standby power supply is a fault unit when its PG signal level state changes from abnormal to normal.
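The standby power check above can be sketched as follows. This is a minimal illustration under stated assumptions: a PG level of 1 is taken to mean "normal" and 0 "abnormal", the rail names are invented, and `force_enable` is a hypothetical callback that drives the EN signal for one rail and returns the PG level read back afterwards.

```python
from typing import Callable, Dict, List


def check_standby_rails(
    pg_levels: Dict[str, int],
    force_enable: Callable[[str], int],
) -> List[str]:
    """Return the standby power supplies judged to be fault units.

    A rail whose PG level is abnormal (0) is force-enabled through its EN
    signal. If its PG level then reads normal (1), the rail is recorded as
    a fault unit, following the rule stated in the text above.
    """
    fault_units = []
    for rail, level in pg_levels.items():
        if level == 1:
            continue  # PG normal: nothing to do for this rail
        if force_enable(rail) == 1:
            # PG changed from abnormal to normal after forced enabling.
            fault_units.append(rail)
    return fault_units
```

Rails that stay abnormal after forced enabling are not classified by this sketch, because the text does not state how they are handled.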

In this embodiment, the debug card further includes a plurality of LEDs (light-emitting diodes), each LED is electrically connected to one PG pin in the first group of PG pins, and the LEDs can be turned on or turned off according to the PG signal level state of the standby power supply of the motherboard obtained by the PG pin. For example, if the PG signal level state of the standby power supply of the motherboard corresponding to a certain PG pin is abnormal, the PG pin is set to 0, and the LED corresponding to the PG pin is turned off, and if the PG signal level state of the standby power supply of the motherboard corresponding to a certain PG pin is normal, the PG pin is set to 1, and the LED corresponding to the PG pin is turned on. In this way, the operation and maintenance personnel can determine the abnormal/normal condition of the standby power supply by observing the on/off state of each LED on the debug card, so that the fault troubleshooting operation is accurately and rapidly performed on the abnormal standby power supply, and the fault locating efficiency of the target server is improved.

As shown in fig. 1, in one embodiment, the processor includes a first set of GPIO pins and a second set of GPIO pins, the CPLD chip includes a third set of GPIO pins, the PCH chip includes a fourth set of GPIO pins, wherein the first set of GPIO pins are connected with the third set of GPIO pins through an interface module, the second set of GPIO pins are connected with the fourth set of GPIO pins through the interface module, and the processor is configured to obtain a boot self-test code of the motherboard from the PCH chip through the second set of GPIO pins, and send fault module information of the fault module to the CPLD chip through the first set of GPIO pins.

Referring to FIG. 2, FIG. 2 is a schematic structural diagram of a debug card according to an embodiment of the present application. As shown in FIG. 2, the processor may use 4 GPIO pins as the first set of GPIO pins, connected with the third set of GPIO pins of the CPLD chip, for transmitting data between the processor and the CPLD chip. The 4 GPIO pins may be selected from gpio_0 through gpio_10 of the processor. As an example, gpio_0, gpio_1, gpio_2, and gpio_3 of the processor may be taken as the first set of GPIO pins, constituting the GPIO [3:0] combination pins.

In addition, the processor may use 6 GPIO pins as the second set of GPIO pins, connected with the fourth set of GPIO pins of the PCH chip, for transmitting data information between the processor and the PCH chip. As shown in fig. 2, the 6 GPIO pins may be selected from gpio_0, gpio_1, ..., gpio_10 of the processor. As an example, gpio_5, gpio_6, gpio_7, gpio_8, gpio_9, and gpio_10 of the processor may be taken as the second set of GPIO pins, constituting the GPIO [10:5] combination pins.
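As a rough illustration of the two pin groups, the sketch below treats gpio_0 to gpio_10 as bits 0 to 10 of a single integer and reads or writes the GPIO [3:0] and GPIO [10:5] fields with masks. The bit-per-pin representation and the helpers `extract_field` and `insert_field` are assumptions made for illustration; the application does not specify how data is framed on these pins.

```python
# Hypothetical sketch: gpio_0..gpio_10 modeled as bits 0..10 of one integer.
FIRST_GROUP = (3, 0)    # GPIO [3:0], processor <-> CPLD chip
SECOND_GROUP = (10, 5)  # GPIO [10:5], processor <-> PCH chip


def extract_field(port_value, hi, lo):
    """Return bits hi..lo (inclusive) of port_value as an unsigned integer."""
    width = hi - lo + 1
    return (port_value >> lo) & ((1 << width) - 1)


def insert_field(port_value, hi, lo, field):
    """Return port_value with bits hi..lo replaced by the low bits of field."""
    width = hi - lo + 1
    mask = ((1 << width) - 1) << lo
    return (port_value & ~mask) | ((field << lo) & mask)


port = 0b10101101001  # example snapshot of gpio_10..gpio_0
print(bin(extract_field(port, *FIRST_GROUP)))   # 0b1001
print(bin(extract_field(port, *SECOND_GROUP)))  # 0b101011
```

Keeping the two groups as separate bit fields means that traffic to the CPLD chip never disturbs the pins used for the PCH chip, and vice versa.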

As shown in FIG. 1, in one embodiment, the CPLD chip further comprises a fifth group of GPIO pins for controlling power shielding of a peripheral module connected with the target server, and the processor is further configured to control the CPLD chip through the first group of PG pins when the PG signal level state of each standby power supply of the main board is normal, so that the CPLD chip shields the EN signal of the peripheral module through the fifth group of GPIO pins and the target server executes a minimum system start-up action, and determine that the peripheral module comprises a fault module when the target server executes the minimum system start-up action successfully.

As shown in fig. 1, the fifth set of GPIO pins may be connected to the pcie_en, dimm_en, and pch_io_en pins, for controlling power shielding of various modules (e.g., modules connected to the target server through PCIE slots, dual in-line memory modules (DIMMs), modules connected to the target server through IO interfaces of the PCH chip, etc.).

In this embodiment, if the LEDs in the debug card are all in an on state, it may be determined that the PG signal level state of each standby power supply of the motherboard is normal, and if the CPLD chip of the motherboard identifies that the debug card is connected, EN signals of all peripheral modules of the target server may be shielded by the fifth set of GPIO pins, so that the target server executes a minimum system startup action. If the target server performs the minimum system boot action successfully, the processor may determine that the peripheral module includes a failure module that causes the target server to fail. Since the peripheral modules may include a variety of modules, subsequent positioning operations are also required for specific failed ones of the peripheral modules.

In one embodiment, the processor is further configured to control the CPLD chip through the first set of PG pins under the condition that the target server successfully executes the minimum system boot operation, so that the CPLD chip recovers EN signals of the peripheral module through the fifth set of GPIO pins and enables the target server to restart normally, and the processor is further configured to obtain a boot self-test code of the motherboard from the PCH chip through the second set of GPIO pins after the target server restarts normally, match the boot self-test code with a fault code library stored in the memory, determine a fault code corresponding to the boot self-test code, determine a fault module in the target server corresponding to the fault code, and send fault module information to the CPLD chip through the first set of GPIO pins.

In this embodiment, the processor may obtain the postcode of the motherboard from the PCH chip through the GPIO [10:5] combination pins (i.e., the second set of GPIO pins). After obtaining the postcode, the processor may start a parsing program to match the postcode with a fault code library pre-stored in the memory (e.g., an EEPROM), where the fault code library includes a plurality of fault codes (postcode IDs), the fault type corresponding to each fault code, the fault module information corresponding to each fault code, and so on. For example, if the last byte of the postcode obtained by the processor is 0x53, the parsing program may query the fault code library for the postcode ID corresponding to 0x53; the fault type corresponding to that postcode ID is MEM INIT ERROR, and the corresponding fault module is MEM (memory), so that the processor can quickly identify the memory as the fault module.

The processor may then send the parsed data (fault module information) to the CPLD chip via the GPIO [3:0] combination pins (i.e., the first set of GPIO pins).
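The lookup in the fault code library can be sketched as a simple table match. Only the 0x53 entry (MEM INIT ERROR, MEM) comes from the description; the `FaultEntry` type, the dictionary layout, and the choice to match on the low byte of the postcode are assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class FaultEntry(NamedTuple):
    fault_type: str
    fault_module: str


# Only the 0x53 entry is taken from the description; a real library holds many more.
FAULT_CODE_LIBRARY = {
    0x53: FaultEntry("MEM INIT ERROR", "MEM"),
}


def locate_fault_module(postcode: int) -> Optional[FaultEntry]:
    """Match the low byte of a postcode against the fault code library (assumed keying)."""
    return FAULT_CODE_LIBRARY.get(postcode & 0xFF)


entry = locate_fault_module(0x53)
print(entry.fault_type, "->", entry.fault_module)  # MEM INIT ERROR -> MEM
```

A postcode with no entry returns `None`, which corresponds to the case where the fault module cannot be determined from the library alone.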

In one embodiment, after the CPLD chip restores the EN signals of the peripheral modules through the fifth set of GPIO pins and restarts the target server, if the target server cannot start normally, the processor checks whether the postcode can be obtained. If the postcode can be obtained, the fault module is located according to the above steps; if the postcode cannot be obtained, operation and maintenance personnel need to disassemble the target server to check whether the CPU, the memory, and other critical modules are installed correctly.

In this embodiment, the processor is further configured to set the debug pin to 0 after sending the fault module information to the CPLD chip, where the LED on the debug card may start to flash to indicate that the debug mode is turned on.

In one embodiment, the fifth group of GPIO pins is further used for controlling power shielding of the fault module and of associated modules in the target server that are in communication association with the fault module. After sending the fault module information to the CPLD chip, the processor is further used for controlling the CPLD chip through the first group of PG pins, so that the CPLD chip shields the EN signal of the fault module through the fifth group of GPIO pins and restarts the target server. If the target server restarts normally, the processor records the fault module information. If the target server cannot restart normally, the processor sends information on each associated module to the CPLD chip through the first group of GPIO pins, and the CPLD chip sequentially shields the EN signals of the associated modules through the fifth group of GPIO pins and restarts the target server until the target server restarts normally, whereupon the processor records the fault module information.

In this embodiment, the method of analyzing postcode and controlling the power supply shielding of the fault module may implement the troubleshooting of the fault module of the target server, and confirm the fault range caused by the fault module. In addition, through the fault positioning mode of controlling the EN signal of the fault module or the associated module in communication association with the fault module to carry out power shielding on the fault module or the associated module, the risk of equipment damage caused by manual disassembly and assembly of an operation and maintenance personnel on a target server in the conventional fault positioning method can be reduced, and the maintenance cost is reduced.
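The shield-and-restart loop can be sketched as follows. This is an illustrative reading, not the application's implementation: the function `isolate_fault`, the `reboot_ok` callback, and the module names are hypothetical, and the sketch assumes that shielding is cumulative (each newly shielded associated module stays shielded), which the description does not state explicitly.

```python
def isolate_fault(fault_module, associated_modules, reboot_ok):
    """Shield EN signals one module at a time until the server restarts normally.

    reboot_ok(masked) is a caller-supplied (hypothetical) callback that restarts
    the server with the listed modules shielded and returns True on a normal
    restart. Returns the list of shielded modules, or None if no restart succeeded.
    """
    masked = [fault_module]
    if reboot_ok(masked):
        return masked
    for module in associated_modules:
        masked.append(module)  # assumption: earlier modules remain shielded
        if reboot_ok(masked):
            return masked
    return None


# Simulated server that only boots once both MEM and PCIE0 are shielded.
def fake_reboot(masked):
    return {"MEM", "PCIE0"} <= set(masked)


print(isolate_fault("MEM", ["PCH_IO", "PCIE0"], fake_reboot))
# ['MEM', 'PCH_IO', 'PCIE0']
```

The returned list bounds the fault range: everything in it had to be shielded before the target server could restart normally.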

In one embodiment, the CPLD chip is used for controlling the PCH chip to call and start the BIOS debugging firmware so as to enable the PCH chip to enter a debugging mode, the PCH chip is used for sending enumeration information of the fault module to the processor through a fourth group of GPIO pins, the processor is further used for analyzing the enumeration information to determine state codes corresponding to all units included in the fault module, each state code is respectively matched with a state code library stored in the memory, fault state codes which cannot be matched with the correct state codes in the state code library are determined, fault units corresponding to the fault state codes are determined, and unit information of the fault units is sent to the CPLD chip through a first group of GPIO pins.

As shown in fig. 1, in this embodiment, the PCH chip may form an SPI communication circuit with a Flash memory in the motherboard and a debug Flash memory in the debug card, for guiding the BIOS firmware program to start the target server. The CS signal of the MUX selector can perform alternative operation on the BIOS firmware stored in the Flash memory and the debugging BIOS firmware stored in the debugging Flash memory. For example, the CPLD chip may switch the CS signal of the MUX selector through the GPIO pin, and switch the firmware required for starting the target server from the BIOS firmware stored in the Flash memory to the debug BIOS firmware stored in the debug Flash memory, or vice versa.
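The either-or selection performed by the MUX selector can be modeled with a small sketch. The class `FlashMux`, the enum `BootSource`, and the 0/1 values are assumptions for illustration only; the application does not give the polarity of the CS signal.

```python
from enum import Enum


class BootSource(Enum):
    MAIN_FLASH = 0   # BIOS firmware in the motherboard Flash memory
    DEBUG_FLASH = 1  # debug BIOS firmware in the debug card's debug Flash memory


class FlashMux:
    """Two-way selector: exactly one SPI Flash is chip-selected at a time."""

    def __init__(self):
        # Assumed default: boot from the motherboard Flash memory.
        self.selected = BootSource.MAIN_FLASH

    def toggle(self):
        """Switch to the other firmware image and return the new selection."""
        if self.selected is BootSource.MAIN_FLASH:
            self.selected = BootSource.DEBUG_FLASH
        else:
            self.selected = BootSource.MAIN_FLASH
        return self.selected


mux = FlashMux()
print(mux.toggle())  # BootSource.DEBUG_FLASH
print(mux.toggle())  # BootSource.MAIN_FLASH
```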

In this embodiment, the enumeration information includes unit information of a plurality of units included in the fault module and the state code corresponding to each unit. The state code library stores information on all units of each functional module of the target server, including the unit information corresponding to each unit (for example, a unit type, a unit ID, or the like) and the correct state code. By comparing the state code corresponding to each unit included in the fault module with the correct state code in the state code library, the processor can determine the fault state codes that cannot be matched with the correct state codes in the state code library, and can locate the fault unit by acquiring the unit information corresponding to each fault state code.

Furthermore, a fault unit may result from an installation error, a line fault, or the like, in which case its information does not appear in the enumeration information. Thus, the processor may determine a fault unit that was not enumerated by comparing the unit information (e.g., unit ID) of all units in the enumeration information with the unit information (e.g., unit ID) in the state code library. Then, the processor may send the unit information of the non-enumerated fault unit, or the unit information corresponding to the fault state code, to the CPLD chip through the first set of GPIO pins.
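Both checks, mismatched state codes and units missing from the enumeration, can be sketched with a dictionary comparison. The unit IDs, the code values, and the function `find_fault_units` are hypothetical; the sketch also assumes that the library passed in has already been restricted to the units of the fault module.

```python
# Hypothetical state code library for one fault module: unit ID -> correct state code.
STATE_CODE_LIBRARY = {"DIMM_A0": 0x01, "DIMM_A1": 0x01, "DIMM_B0": 0x01}


def find_fault_units(enumeration, library):
    """Return (mismatched, not_enumerated) unit IDs, each sorted.

    enumeration maps each enumerated unit ID to the state code it reported.
    A unit is mismatched when its code differs from the library's correct code
    (or when the unit is unknown to the library).
    """
    mismatched = sorted(
        unit for unit, code in enumeration.items() if library.get(unit) != code
    )
    not_enumerated = sorted(set(library) - set(enumeration))
    return mismatched, not_enumerated


enumerated = {"DIMM_A0": 0x01, "DIMM_A1": 0x7F}
print(find_fault_units(enumerated, STATE_CODE_LIBRARY))
# (['DIMM_A1'], ['DIMM_B0'])
```

In this example DIMM_A1 reports a code that does not match the library, while DIMM_B0 never appears in the enumeration at all; both would be reported as fault units.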

In one embodiment, the fifth set of GPIO pins is further used for controlling power shielding of a faulty unit, the processor is further used for controlling the CPLD chip through the first set of PG pins after transmitting unit information of the faulty unit to the CPLD chip, so that the CPLD chip shields EN signals of the faulty unit through the fifth set of GPIO pins and enables the target server to restart, and the processor records the faulty unit under the condition that the target server is restarted normally.

In this embodiment, after the processor records the unit information of the fault unit, the debug pin may be set to 1 to exit the debug mode, and the LED of the debug card returns to being steadily lit. The CPLD chip may be configured to send the unit information of the fault unit to a controller, which is configured to generate an alarm log based on the unit information of the fault unit. Specifically, the CPLD chip may communicate with the controller (BMC) through the I2C module, so as to send the unit information of the located fault unit to the controller, and the operation and maintenance personnel may perform subsequent analysis and processing through the remote management function of the controller. The operation and maintenance personnel can confirm the fault unit through the alarm log of the controller, replace the corresponding fault unit, and then confirm whether the target server operates normally after the replacement. If the target server operates normally, the debugging work is complete and the debug card is removed. Otherwise, the fault locating operation may be performed again using the debug card.

In this way, the debug card records the unit information of the fault unit, and the CPLD chip sends the unit information of the fault unit to the controller through the I2C module, so that subsequent analysis and processing by operation and maintenance personnel can be facilitated, and the stability and reliability of the server are improved.

The server fault locating method according to an embodiment of the present application will be described below with reference to the accompanying drawings, and the server fault locating method may be applied to the server fault locating system described in the above embodiment. Referring to fig. 3, fig. 3 is a flowchart of a server fault location method according to an embodiment of the present application, as shown in fig. 3, the method includes steps S301 to S303:

Step S301, the debug card obtains the boot self-checking code of the main board, matches the boot self-checking code with a preset fault code library stored in the debug card, determines the fault code corresponding to the boot self-checking code, and determines a fault module corresponding to the fault code in the target server.

Step S302, the main board calls and starts BIOS debugging firmware stored in the debugging card, enters a debugging mode and acquires enumeration information of a fault module.

Step S303, the debug card analyzes the enumeration information to determine the state code corresponding to each unit included in the fault module, matches each state code with a preset state code library stored in the debug card, determines the fault state codes that cannot be matched with the correct state codes in the state code library, and determines the fault unit corresponding to each fault state code.

It should be noted that, because the details of each step are based on the same concept as the system embodiment of the present application, specific functions and technical effects thereof may be referred to in the system embodiment section, and will not be described herein.

The above-mentioned server fault locating method will be described below with reference to a specific example, and the server fault locating method may be divided into six steps:

Step 1, when the target server fails (including but not limited to problems such as the main board failing to power on, or the host hanging in the startup stage and failing to start normally), the debug card is connected to a connector of the main board, and the program built into the debug card is loaded and executed. After being powered on for the first time, the debug card is in a non-debugging mode by default, and the debug pin is set to 1 by default. The debug card first judges whether the failure of the target server is caused by a fault in the main board circuit: the processor of the debug card acquires the PG signal level state of each standby power supply of the main board from the CPLD chip through the first group of PG pins. The PG signal level states correspond one-to-one to the LEDs on the debug card, and each LED is turned on or off according to the PG signal level state of the standby power supply acquired by its PG pin. If all the LEDs on the debug card are on, the standby power supplies are powered on normally, and step 2 is entered. Otherwise, a standby power supply is powered on abnormally, and the LED corresponding to the abnormal standby power supply is not lit. In this case, the processor may send an EN signal to the abnormal standby power supply through the first group of EN pins to forcibly enable it. When the PG signal level state of the abnormal standby power supply changes from abnormal to normal, the processor determines the abnormal standby power supply as a fault unit; if the power supply still does not become normal, operation and maintenance personnel are required to check the corresponding power supply line and its impedance.
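The standby power check of step 1 can be sketched as a classification over the PG levels. The function `check_standby_rails`, the `force_enable` callback, and the rail names are hypothetical; the callback stands in for asserting the EN signal on a rail and reading its PG level back.

```python
def check_standby_rails(pg_levels, force_enable):
    """Classify standby power supplies following step 1.

    pg_levels maps rail name -> PG level (1 normal, 0 abnormal).
    force_enable(rail) is a hypothetical callback that forcibly enables the rail
    and returns the PG level read back afterwards.
    Returns (fault_units, needs_manual_check).
    """
    fault_units, needs_manual_check = [], []
    for rail, level in pg_levels.items():
        if level == 1:
            continue  # normal rail, LED on
        if force_enable(rail) == 1:
            fault_units.append(rail)  # PG became normal after forced enable
        else:
            needs_manual_check.append(rail)  # inspect supply line and impedance
    return fault_units, needs_manual_check


levels = {"P3V3_STBY": 1, "P5V_STBY": 0, "P1V8_STBY": 0}
result = check_standby_rails(levels, lambda rail: 1 if rail == "P5V_STBY" else 0)
print(result)  # (['P5V_STBY'], ['P1V8_STBY'])
```

When both returned lists are empty, every standby power supply is normal and the flow proceeds to step 2.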

Step 2, when all LEDs of the debug card are on and the CPLD chip of the main board recognizes that the debug card is connected, the processor controls the CPLD chip through the first group of PG pins so that the CPLD chip shields the EN signals of the peripheral modules through the fifth group of GPIO pins and the target server executes the minimum system startup action. If the target server starts normally, it can be determined that the peripheral modules include the fault module. After the target server is restarted normally, the processor acquires the startup self-checking code of the main board from the PCH chip through the second group of GPIO pins, matches the startup self-checking code with the fault code library stored in the memory, determines the fault code corresponding to the startup self-checking code, determines the fault module corresponding to the fault code in the target server, and transmits the fault module information to the CPLD chip through the first group of GPIO pins. At this time, the debug pin of the debug card is set to 0, the LED on the debug card begins to flash to indicate that the debug mode is turned on, and step 3 is entered. In addition, after the CPLD chip restores the EN signals of the peripheral modules through the fifth group of GPIO pins and restarts the target server, if the target server cannot start normally, the processor checks whether the postcode can be acquired. If the postcode can be acquired, the fault module is located according to the above steps; if the postcode cannot be acquired, operation and maintenance personnel need to disassemble the target server to check whether the CPU, the memory, and other critical modules are installed correctly.

Step 3, after transmitting the fault module information to the CPLD chip, the processor controls the CPLD chip through the first group of PG pins so that the CPLD chip shields the EN signal of the fault module through the fifth group of GPIO pins and restarts the target server. If the target server restarts normally, the processor records the fault module information. If the target server cannot restart normally, the processor transmits information on each associated module to the CPLD chip through the first group of GPIO pins, and the CPLD chip sequentially shields the EN signals of the associated modules through the fifth group of GPIO pins and restarts the target server until the target server restarts normally, whereupon the processor records the fault module information. Step 4 is then entered.

Step 4, the PCH chip, the Flash memory in the main board, and the debug Flash memory in the debug card can form an SPI communication circuit for booting the BIOS firmware program to start the target server. The CS signal of the MUX selector selects either the BIOS firmware stored in the Flash memory or the debug BIOS firmware stored in the debug Flash memory. The CPLD chip can control the PCH chip to call and start the BIOS debug firmware so that the PCH chip enters the debug mode. The PCH chip sends the enumeration information of the fault module to the processor through the fourth group of GPIO pins. The processor analyzes the enumeration information to determine the state codes corresponding to all units included in the fault module, matches each state code with the state code library stored in the memory, determines the fault state codes that cannot be matched with the correct state codes in the state code library, and determines the fault units corresponding to the fault state codes. The processor can also determine non-enumerated fault units by comparing the unit information of all units in the enumeration information with the unit information in the state code library. Then, the processor may send the unit information of the non-enumerated fault unit, or the unit information corresponding to the fault state code, to the CPLD chip through the first group of GPIO pins. Step 5 is then entered.

Step 5, after transmitting the unit information of the fault unit to the CPLD chip, the processor controls the CPLD chip through the first group of PG pins so that the CPLD chip shields the EN signal of the fault unit through the fifth group of GPIO pins and restarts the target server. If the target server restarts normally, the processor records the unit information of the fault unit. Step 6 is then entered.

Step 6, after the processor records the unit information of the fault unit, the debug pin can be set to 1 to exit the debug mode, and the LED of the debug card returns to being steadily lit. The CPLD chip may be used to send the unit information of the fault unit to the controller, which generates an alarm log based on the unit information of the fault unit. Specifically, the CPLD chip may communicate with the controller (BMC) through the I2C module, so as to send the unit information of the located fault unit to the controller, and the operation and maintenance personnel may perform subsequent analysis and processing through the remote management function of the controller. The operation and maintenance personnel can confirm the fault unit through the alarm log of the controller, replace the corresponding fault unit, and then confirm whether the target server operates normally after the replacement. If the target server operates normally, the debugging work is complete and the debug card is removed. Otherwise, the fault locating operation may be performed again using the debug card.

Through the steps 1 to 6, when a fault occurs in the target server, the debug card is connected with the main board of the target server to obtain a startup self-checking code of the main board, the startup self-checking code is matched with a fault code library pre-stored in the debug card, so that a fault code corresponding to the startup self-checking code and a fault module corresponding to the fault code are determined, the main board can call and start BIOS debug firmware pre-stored in the debug card to enter a debug mode to obtain enumeration information of the fault module, the debug card can analyze the enumeration information, determine state codes of a plurality of units contained in the fault module, and match the state codes with a state code library pre-stored in the debug card to determine the fault state code and a fault unit corresponding to the fault state code, thereby realizing accurate positioning of the fault of the server and improving the operation and maintenance efficiency of the server. The application solves the problem of lower fault locating efficiency of the server in the related technology, and achieves the technical effect of improving the fault locating efficiency of the server.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.

In the foregoing embodiments, each embodiment is described with its own emphasis. For parts that are not described or detailed in a particular embodiment, reference may be made to the related descriptions of other embodiments.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other manners. For example, the apparatus/network device embodiments described above are merely illustrative, e.g., the division of modules or elements is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

The foregoing embodiments are merely for illustrating the technical solution of the present application, but not for limiting the same, and although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that the technical solution described in the foregoing embodiments may be modified or substituted for some of the technical features thereof, and that these modifications or substitutions should not depart from the spirit and scope of the technical solution of the embodiments of the present application and should be included in the protection scope of the present application.

Claims

1. A server fault positioning system, characterized by comprising a debug card and a main board, wherein the main board is arranged in a target server, and the debug card is electrically connected with the main board;

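The debug card is used for acquiring a startup self-checking code of the main board, matching the startup self-checking code with a preset fault code library stored in the debug card, determining a fault code corresponding to the startup self-checking code, and determining a fault module corresponding to the fault code in the target server;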

the main board is used for calling and starting BIOS debug firmware stored in the debug card, entering a debug mode and acquiring enumeration information of the fault module;

The debug card is also used for analyzing the enumeration information to determine the state codes corresponding to all units included in the fault module, matching each state code with a preset state code library stored in the debug card, determining fault state codes which cannot be matched with the correct state codes in the state code library, and determining the fault units corresponding to the fault state codes.

2. The server fault location system of claim 1, wherein the debug card comprises a processor, a memory, an interface module, and a communication module, the motherboard comprising a controller, a PCH chip, and a CPLD chip;

The debug card establishes communication with the controller, the PCH chip and the CPLD chip through the communication module, the debug card is electrically connected with the main board through the interface module, and the memory stores the fault code library and the state code library, wherein the fault code library comprises a plurality of fault codes and fault module information corresponding to each fault code, and the state code library comprises a plurality of correct state codes and unit information corresponding to each correct state code.

3. The server fault location system of claim 2, wherein the processor comprises a first set of GPIO pins and a second set of GPIO pins, the CPLD chip comprises a third set of GPIO pins, the PCH chip comprises a fourth set of GPIO pins, wherein the first set of GPIO pins is connected to the third set of GPIO pins through the interface module, and the second set of GPIO pins is connected to the fourth set of GPIO pins through the interface module;

The processor is configured to obtain the boot self-test code of the motherboard from the PCH chip through the second set of GPIO pins, and send the fault module information of the fault module to the CPLD chip through the first set of GPIO pins.

4. The server fault location system of claim 3, wherein the processor further comprises a first set of PG pins, a first set of EN pins, and a debug pin, the CPLD chip further comprises a second set of PG pins and a second set of EN pins, the first set of PG pins is connected to the second set of PG pins through the interface module, the first set of EN pins is connected to the second set of EN pins through the interface module;

The processor is further configured to set the debug pin to 1 after the processor is electrically connected to the motherboard through the interface module and is powered on for the first time, and obtain a PG signal level state of each standby power supply of the motherboard from the CPLD chip through the first set of PG pins;

the processor determines that the standby power supply with the PG signal level state being abnormal is an abnormal standby power supply, and sends an EN signal to the abnormal standby power supply through the first group of EN pins so as to forcedly enable the abnormal standby power supply;

in the event that the PG signal level state of the abnormal standby power supply transitions from abnormal to normal, the processor determines the abnormal standby power supply as a faulty unit.

5. The server fault location system of claim 4, wherein the CPLD chip further comprises a fifth set of GPIO pins for controlling power shielding of peripheral modules connected to the target server;

The processor is further configured to control, when the PG signal level state of each standby power supply of the motherboard is normal, the CPLD chip through the first set of PG pins, so that the CPLD chip shields EN signals of the peripheral module through the fifth set of GPIO pins, and causes the target server to execute a minimum system startup action;

And under the condition that the target server successfully executes the minimum system starting action, the processor determines that the peripheral module comprises the fault module.

6. The server fault location system of claim 5, wherein the processor is further configured to control the CPLD chip through the first set of PG pins to cause the CPLD chip to restore EN signals of the peripheral module through the fifth set of GPIO pins and to cause the target server to restart normally if the target server performs a minimum system boot action successfully;

the processor is further configured to obtain the boot self-test code of the motherboard from the PCH chip through the second set of GPIO pins after the target server is restarted normally, match the boot self-test code with the fault code library stored in the memory, determine the fault code corresponding to the boot self-test code, determine the fault module in the target server corresponding to the fault code, and send the fault module information to the CPLD chip through the first set of GPIO pins.

7. The server fault location system of claim 6, wherein the processor is further configured to set the debug pin to 0 after sending the fault module information to the CPLD chip;

the CPLD chip is configured to control the PCH chip to call and start the BIOS debugging firmware, so that the PCH chip enters the debugging mode;

the PCH chip is configured to send the enumeration information of the fault module to the processor through the fourth set of GPIO pins;

the processor is further configured to parse the enumeration information, determine the status code corresponding to each unit included in the fault module, match each status code with the status code library stored in the memory, determine the fault status codes that cannot be matched with the correct status codes in the status code library, determine the faulty unit corresponding to each fault status code, and send the unit information of the faulty unit to the CPLD chip through the first set of GPIO pins.
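
As a non-limiting illustration, parsing the enumeration information and matching the status codes can be sketched as follows. The `unit:code` text format, the unit names, and the function names are assumptions made for this example. The application does not specify the format of the enumeration information.

```python
from typing import Dict, List, Set


def parse_enumeration(raw: str) -> Dict[str, int]:
    """Parse hypothetical 'unit:hex_code' lines into a unit -> status code map."""
    status: Dict[str, int] = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        unit, _, code = line.partition(":")
        status[unit.strip()] = int(code.strip(), 16)
    return status


def find_faulty_units(
    unit_status: Dict[str, int],
    correct_codes: Dict[str, Set[int]],
) -> List[str]:
    """Return the units whose status code matches no correct code in the library."""
    return [
        unit
        for unit, code in unit_status.items()
        if code not in correct_codes.get(unit, set())
    ]
```

With the placeholder input `"DIMM0:0x01\nDIMM1:0x0E\n"` and a library in which `0x01` is the only correct code for both units, `find_faulty_units` reports `DIMM1` as the faulty unit.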

8. The server fault location system of claim 5 or 6, wherein the fifth set of GPIO pins is further configured to control power shielding of the fault module and of associated modules of the target server that are communicatively associated with the fault module;

the processor is further configured to, after the fault module information is sent to the CPLD chip, control the CPLD chip through the first set of PG pins, so that the CPLD chip shields the EN signals of the fault module through the fifth set of GPIO pins and restarts the target server;

in the case that the target server restarts normally, the processor records the fault module information;

and in the case that the target server cannot restart normally, the processor sends information on each associated module to the CPLD chip through the first set of GPIO pins, the CPLD chip sequentially shields the EN signals of the associated modules through the fifth set of GPIO pins and restarts the target server until the target server restarts normally, and the processor records the fault module information.
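
As a non-limiting illustration, the sequential shielding recited above can be sketched as a loop. The function and parameter names are hypothetical. The sketch assumes that shielded modules stay shielded while further associated modules are tried, and it returns a flag when no restart succeeds, a case the claim does not address.

```python
from typing import Callable, List, Sequence, Tuple


def isolate_by_shielding(
    fault_module: str,
    associated_modules: Sequence[str],
    shield_en: Callable[[str], None],
    restart_ok: Callable[[], bool],
) -> Tuple[List[str], bool]:
    """Shield the fault module, then associated modules one by one, until a normal restart.

    shield_en: stands in for the CPLD shielding the EN signal of one module.
    restart_ok: stands in for restarting the target server; True on a normal restart.
    Returns the list of shielded modules and whether a normal restart was reached.
    """
    shielded: List[str] = [fault_module]
    shield_en(fault_module)
    if restart_ok():
        return shielded, True

    # The fault module alone was not enough: shield associated modules in turn.
    for module in associated_modules:
        shield_en(module)
        shielded.append(module)
        if restart_ok():
            return shielded, True

    # Assumption: report failure when every candidate has been shielded without success.
    return shielded, False
```

In this sketch, the processor would record the fault module information once the second element of the returned pair is `True`.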

9. The server fault location system according to any one of claims 2 to 7, wherein the CPLD chip is further configured to send the unit information of the faulty unit to the controller, the controller being configured to generate an alert log based on the unit information of the faulty unit.

10. A server fault location method, characterized in that it is applied to the server fault location system according to any one of claims 1 to 9, the method comprising:

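the debug card acquires a boot self-test code of the motherboard, matches the boot self-test code with a preset fault code library stored in the debug card, determines the fault code corresponding to the boot self-test code, and determines the fault module in the target server corresponding to the fault code;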

the motherboard calls and starts the BIOS debugging firmware stored in the debug card, enters a debugging mode, and acquires enumeration information of the fault module;

the debug card parses the enumeration information to determine the status code corresponding to each unit included in the fault module, matches each status code with a preset status code library stored in the debug card, determines the fault status codes that cannot be matched with the correct status codes in the status code library, and determines the faulty units corresponding to the fault status codes.