Disclosure of Invention
To overcome the problems in the related art, the present specification provides a method and an apparatus for locating a memory fault.
According to a first aspect of embodiments of the present disclosure, there is provided a memory fault locating method, applied to a BMC, including:
when a fault is determined to have occurred according to a register in a logic chip on a main board, obtaining fault information recorded in internal registers of a processor on the main board;
determining a system address of the faulty memory according to the fault information;
performing address translation on the system address according to a stored mapping relationship between system addresses and physical addresses, to obtain a physical address of the faulty memory; and
recording the physical address in a system event log.
Optionally, before determining that a fault has occurred according to the register in the logic chip on the main board, the method further includes:
receiving and storing the mapping relationship between the system addresses and the physical addresses of the memory.
Further, determining the system address of the faulty memory according to the fault information includes:
determining a fault type according to first fault information obtained from a first register;
if the fault type is a fault of the processor itself, determining a fault module according to second fault information obtained from a second register; and
obtaining a system address corresponding to the fault module from a third register corresponding to the fault module.
Further, after determining the fault type according to the first fault information obtained from the first register, the method further includes:
if it is determined according to the fault type that the fault is not a fault of the processor itself, stopping the locating of the memory fault.
Further, the fault module includes an IMC, an IIO, an MLC, and a DCU;
obtaining the system address corresponding to the fault module from the third register corresponding to the fault module includes:
if the fault module is the IIO and the system address cannot be resolved, obtaining the system address from the third register corresponding to the IMC.
According to a second aspect of embodiments of the present disclosure, there is provided a memory fault locating device, applied to a BMC, including:
an acquisition unit, configured to obtain fault information recorded in internal registers of a processor on a main board when a fault is determined to have occurred according to a register in a logic chip on the main board;
an address unit, configured to determine a system address of the faulty memory according to the fault information;
a conversion unit, configured to perform address translation on the system address according to a stored mapping relationship between system addresses and physical addresses, to obtain a physical address of the faulty memory; and
a recording unit, configured to record the physical address in a system event log.
Optionally, the device further includes:
a storage unit, configured to receive and store the mapping relationship between the system addresses and the physical addresses of the memory.
Further, the address unit includes:
a type determining subunit, configured to determine a fault type according to first fault information obtained from a first register;
a module determining subunit, configured to determine, if the fault type is a fault of the processor itself, a fault module according to second fault information obtained from a second register; and
an address determining subunit, configured to obtain a system address corresponding to the fault module from a third register corresponding to the fault module.
Optionally, the device further includes:
a termination unit, configured to stop locating the memory fault if it is determined according to the fault type that the fault is not a fault of the processor itself.
Further, the fault module includes an IMC, an IIO, an MLC, and a DCU;
the address unit is further configured to obtain the system address from the third register corresponding to the IMC if the fault module is the IIO and the system address cannot be resolved.
The technical solutions provided by the embodiments of this specification may have the following beneficial effects:
In the embodiments of this specification, after a fault is determined to have occurred on the main board, the fault information recorded in the internal registers of the processor is obtained, the system address of the faulty memory is determined according to the fault information, the system address is translated into a physical address, and the physical address is recorded in the system event log. This avoids the situation in which an internal error hangs the system and the memory fault cannot be located, and thereby improves the maintainability of the server.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present specification.
The present application provides a memory fault locating method applied to a BMC (baseboard management controller). As shown in FIG. 1, the method includes the following steps:
S100, when a memory fault is determined to have occurred according to a register in the logic chip on the main board, obtaining the fault information recorded in the internal registers of the processor on the main board.
As shown in FIG. 2, a main board for management is provided in the server, and a processor, a logic chip, and a BMC (baseboard management controller) are provided on the main board. The BMC may monitor the devices and modules within the server and record events in a SEL (system event log). The processor may be a CPU or another device that performs computation, and a plurality of CPUs may be disposed on the main board; FIG. 2 shows two CPUs. The logic chip may be a CPLD (complex programmable logic device). The present application is described below with reference to the structure of FIG. 2. The BMC may be connected to the CPU via PECI (Platform Environment Control Interface) and to the CPLD via a CATERR/MSMI pin, and the CPU may be connected to the CPLD via a CATERR/MSMI pin.
When the CPU learns from a hardware interrupt that a fault has occurred, it sends a fault notification to the CPLD through the CATERR/MSMI pin, and the CPLD records the fault in an internal register according to the notification. In the scenario shown in FIG. 2, CPU1 is connected to the CPLD via the CATERR1/MSMI pin and CPU2 is connected to the CPLD via the CATERR2/MSMI pin.
During normal operation of the server, the BMC may periodically (for example, once per second) read the value of the internal register of the CPLD through the CATERR3/MSMI pin to determine whether a CPU has reported a fault. If the value indicates that no fault exists, the BMC keeps timing and reads the register again in the next period. If the value indicates that a fault exists, the BMC is triggered to access the internal registers of the CPU through PECI and obtain the fault information recorded there. When two CPUs are disposed on the main board, the BMC may obtain and record the fault information in the internal registers of each CPU separately.
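The polling behavior described above can be sketched as follows. This is a minimal illustration, not the BMC firmware itself: the callables read_cpld and read_cpu stand in for the CPLD register read and the PECI access, the convention that a CPLD value of 0 means "no fault" is an assumption, and max_polls exists only so the sketch can terminate.

```python
import time

def poll_cpld_once(read_cpld, read_cpu, cpu_ids):
    """Run one polling period; return {cpu_id: fault_info}, empty if no fault."""
    # Assumption: the CPLD internal register reads 0 when no fault is recorded.
    if read_cpld() == 0:
        return {}
    # A fault is flagged: collect the fault information of every CPU separately.
    return {cpu: read_cpu(cpu) for cpu in cpu_ids}

def monitor(read_cpld, read_cpu, cpu_ids, period_s=1.0, max_polls=None,
            sleep=time.sleep):
    """Poll the CPLD once per period until a fault is reported."""
    polls = 0
    while max_polls is None or polls < max_polls:
        info = poll_cpld_once(read_cpld, read_cpu, cpu_ids)
        if info:
            return info
        polls += 1
        sleep(period_s)  # wait for the next period before reading again
    return {}
```

Passing the hardware accessors in as callables keeps the loop testable without a CPLD or a PECI bus.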
S101, determining the system address of the faulty memory according to the fault information.
The CPU contains a plurality of internal registers that store fault-related information, and the BMC needs to determine the system address of the faulty memory according to the values of these registers. The internal registers used are described below.
A first register (e.g., mca_err_src_log) stores the fault type, which indicates whether the fault is a fault of the present CPU or an external fault (i.e., a fault that occurred on another CPU), and whether the fault was an IERR or an MCERR. Although a memory fault may be regarded as an IERR, it is also recorded in the MCERR-related registers, which record more information about memory-related faults.
Based on the first fault information stored in the first register (e.g., bits [28:27] and bits [19:18]), it can be determined whether the fault is a fault of the present CPU. If it is not, the other internal registers of the CPU need not be analyzed further, and memory fault locating for this CPU stops. If it is, analysis of the other internal registers of the CPU continues.
A second register (e.g., MCerr_logging_reg) stores a module identifier (i.e., the second fault information) of the faulty module, which is typically a module associated with the memory. As shown in FIG. 3, IMCs are disposed in the processor, each IMC has a plurality of channels, and each channel may correspond to a plurality of memories. The module may therefore be an IMC, and the second fault information distinguishes whether the fault lies in the IMC itself or in one of the channels associated with the IMC. A memory may be inserted into each channel. In FIG. 3, two IMCs are provided in one processor, three channels are associated with each IMC, and two memories can be plugged into each channel.
For example, bits [7:0] in MCerr_logging_reg may be used to determine whether the fault originated from the IMC itself or from a channel associated with the IMC. When bits [7:0] confirm that the fault originated from a channel, bits [13:10] are examined further to determine which channel it is.
Specifically, bits [7:0] = 0100000 represents IMC0 and bits [7:0] = 01100100 represents IMC1. After the IMC is determined, the channel is determined according to bits [13:10]: bits [13:10] = 0101 represents channel 0, bits [13:10] = 0111 represents channel 1, and bits [13:10] = 1001 represents channel 2. The specific bit values for the other channels are not listed one by one.
From bit 2 it can be determined that the fault originated from the IMC itself, and bits [7:0] then also identify the IMC in which the fault lies: bits [7:0] = 0100010 represents IMC0, and bits [7:0] = 0100110 represents IMC1.
Through the above process, a specific fault module can be determined from the second fault information acquired from the second register.
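As an illustration of the decoding just described, the following sketch extracts bits [7:0] and bits [13:10] from a register value and looks them up. The bit patterns are copied from the text as written and should be checked against the processor's register documentation before use; the function names and the return convention (a tuple of IMC name and channel, with None as the channel for the IMC itself) are assumptions made for this example.

```python
def bits(value, hi, lo):
    """Return the bit field value[hi:lo], inclusive on both ends."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)

# Patterns for bits [7:0], transcribed from the text as written.
CHANNEL_SOURCE = {0b0100000: "IMC0", 0b01100100: "IMC1"}   # fault in a channel
IMC_SELF_SOURCE = {0b0100010: "IMC0", 0b0100110: "IMC1"}   # fault in the IMC itself
# Patterns for bits [13:10]; only the three channels listed in the text.
CHANNEL_INDEX = {0b0101: 0, 0b0111: 1, 0b1001: 2}

def decode_fault_module(reg):
    """Return (imc, channel) or None if the value is not recognized.

    channel is None when the fault originated from the IMC itself.
    """
    source = bits(reg, 7, 0)
    if source in IMC_SELF_SOURCE:
        return (IMC_SELF_SOURCE[source], None)
    if source in CHANNEL_SOURCE:
        channel = CHANNEL_INDEX.get(bits(reg, 13, 10))
        if channel is None:
            return None  # channel code not among those listed in the text
        return (CHANNEL_SOURCE[source], channel)
    return None
```

For example, a register value whose bits [7:0] are 0100000 and whose bits [13:10] are 0111 decodes to channel 1 of IMC0.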
A corresponding register, which may be referred to as a module-specific register, is provided in the CPU for each module to store the state of that module. For example, for IMC0 itself and IMC1 itself, mc7_bank and mc8_bank may be provided to store the state information and the system address of the module. For the six channels associated with IMC0 and IMC1, the six registers mc13_bank to mc18_bank may be provided to store the state information and the system address of each channel.
Each of mc7_bank, mc8_bank, and mc13_bank to mc18_bank contains a plurality of sub-registers, including a status register (mc_status), an address register (mc_addr), a miscellaneous register (mc_misc), and a control register (mc_ctl). A valid flag is set in both the status register and the address register. When obtaining the system address, the BMC first checks whether the valid flag in the status register is set. If so, it checks the address register, again verifying the valid flag there, and if that flag is also set, it takes the address stored in the address register as the system address.
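The two-stage validity check can be sketched as below. The text does not give the positions of the valid flags, so the masks are supplied by the caller; the Bank class, the function name, and the choice to clear the flag bits from the returned address are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Bank:
    """Sub-registers of one module-specific register bank (hypothetical model)."""
    mc_status: int
    mc_addr: int

def read_system_address(bank: Bank, status_valid_mask: int,
                        addr_valid_mask: int) -> Optional[int]:
    """Return the system address, or None if either valid flag is clear."""
    # Step 1: the status register must be marked valid.
    if not bank.mc_status & status_valid_mask:
        return None
    # Step 2: the address register must be marked valid as well.
    if not bank.mc_addr & addr_valid_mask:
        return None
    # Assumption: the flag bits are not part of the address, so clear them.
    return bank.mc_addr & ~addr_valid_mask
```

With a hypothetical flag at bit 63 in both registers, `read_system_address(Bank(1 << 63, (1 << 63) | 0x1234000), 1 << 63, 1 << 63)` yields 0x1234000, and clearing either flag yields None.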
In addition, because a memory fault may propagate, it may also be logged in other modules, such as the IIO, the MLC, and the DCU. A corresponding register mc3_bank is provided for the MLC, mc1_bank for the DCU, and mc6_bank for the IIO.
When the fault module is the IIO, whether a PCIe (Peripheral Component Interconnect Express, a high-speed serial computer expansion bus) fault exists may be determined according to a sub-register (mc6_misc) in mc6_bank, that is, whether the system address of the PCIe device associated with the IIO can be resolved. If the system address of the PCIe device cannot be resolved, the fault may be considered to be caused by a memory fault. In this case, the system address of the faulty memory may be obtained from the register corresponding to the IMC.
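The bank selection, including the IIO fallback, might look like the following sketch. The module-to-bank mapping is taken from the text; the function name, the boolean parameter, and the decision to return both IMC banks as candidates in the fallback case are assumptions, since the text does not say which IMC register is consulted first. Channel banks are omitted because the text does not state which channel maps to which of mc13_bank to mc18_bank.

```python
# Module-to-bank mapping as stated in the text.
MODULE_BANK = {
    "IMC0": "mc7_bank",
    "IMC1": "mc8_bank",
    "MLC": "mc3_bank",
    "DCU": "mc1_bank",
    "IIO": "mc6_bank",
}
IMC_BANKS = ["mc7_bank", "mc8_bank"]

def candidate_banks(module, iio_address_resolvable=True):
    """Return the banks to read the system address from, in order."""
    # Fallback: an IIO fault whose PCIe address cannot be resolved is treated
    # as caused by a memory fault, so the IMC banks are consulted instead.
    if module == "IIO" and not iio_address_resolvable:
        return list(IMC_BANKS)
    return [MODULE_BANK[module]]
```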
S102, performing address translation on the system address according to the stored mapping relationship between system addresses and physical addresses, to obtain the physical address of the faulty memory.
The BMC stores the mapping relationship between system addresses and physical addresses. After the BMC obtains the system address of the faulty memory, it can translate the system address into the actual physical address based on the mapping relationship. The physical address indicates the position in which the memory is inserted and may be used to locate the memory within the server.
The mapping relationship may be stored in the BMC as a fixed relationship, or may be sent to the BMC as a file through a data channel between the BIOS and the BMC before step S100, i.e., during the BIOS power-on process. The BMC stores the file in nonvolatile memory so that the mapping relationship is available when a memory fault is detected.
The specific manner of translating between system addresses and physical addresses is similar to existing approaches and is not described again here.
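Purely as an illustration of a table-driven lookup, the sketch below assumes the mapping is a list of non-overlapping, half-open system address ranges, each tagged with a location label. The text does not specify the mapping format, and real platforms may interleave addresses across channels, so the range table, the label strings, and the function names are all hypothetical.

```python
import bisect

def build_index(ranges):
    """Sort (start, end, location) ranges and precompute their start addresses."""
    ordered = sorted(ranges)
    starts = [r[0] for r in ordered]
    return starts, ordered

def translate(system_address, index):
    """Return the location label whose range [start, end) holds the address."""
    starts, ordered = index
    # Find the last range that starts at or below the address.
    i = bisect.bisect_right(starts, system_address) - 1
    if i < 0:
        return None
    start, end, location = ordered[i]
    return location if system_address < end else None

# Hypothetical mapping, e.g. as received from the BIOS at power-on.
index = build_index([
    (0x0000_0000, 0x4000_0000, "CPU1/IMC0/CH0/DIMM0"),
    (0x4000_0000, 0x8000_0000, "CPU1/IMC0/CH0/DIMM1"),
])
```

Binary search keeps the lookup fast even when the server holds many memories, at the cost of requiring the ranges to be non-overlapping.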
S103, recording the physical address in the system event log.
After the BMC determines the physical address of the faulty memory, it may record the physical address in the SEL (system event log) of the BMC and may present it on a display device such as a monitor.
After the physical address of the faulty memory is displayed, maintenance personnel can locate the memory in the server according to the physical address and replace the faulty memory.
In the embodiments of this specification, after a fault is determined to have occurred on the main board, the fault information recorded in the internal registers of the processor is obtained, the system address of the faulty memory is determined according to the fault information, the system address is translated into a physical address, and the physical address is recorded in the system event log. This avoids the situation in which an internal error hangs the system and the memory fault cannot be located, and thereby improves the maintainability of the server.
A large server may contain a large number of memories, and in this way the faulty memory can be located among them.
Correspondingly, the present application also provides a memory fault locating device applied to a BMC. As shown in FIG. 4, the device includes:
an acquisition unit, configured to obtain fault information recorded in internal registers of a processor on a main board when a fault is determined to have occurred according to a register in a logic chip on the main board;
an address unit, configured to determine a system address of the faulty memory according to the fault information;
a conversion unit, configured to perform address translation on the system address according to a stored mapping relationship between system addresses and physical addresses, to obtain a physical address of the faulty memory; and
a recording unit, configured to record the physical address in a system event log.
Optionally, the device further includes:
a storage unit, configured to receive and store the mapping relationship between the system addresses and the physical addresses of the memory.
Further, the address unit includes:
a type determining subunit, configured to determine a fault type according to first fault information obtained from a first register;
a module determining subunit, configured to determine, if the fault type is a fault of the processor itself, a fault module according to second fault information obtained from a second register; and
an address determining subunit, configured to obtain a system address corresponding to the fault module from a third register corresponding to the fault module.
Optionally, the device further includes:
a termination unit, configured to stop locating the memory fault if it is determined according to the fault type that the fault is not a fault of the processor itself.
Further, the fault module includes an IMC, an IIO, an MLC, and a DCU;
the address unit is further configured to obtain the system address from the third register corresponding to the IMC if the fault module is the IIO and the system address cannot be resolved.
The technical solutions provided by the embodiments of this specification may have the following beneficial effects:
In the embodiments of this specification, after a fault is determined to have occurred on the main board, the fault information recorded in the internal registers of the processor is obtained, the system address of the faulty memory is determined according to the fault information, the system address is translated into a physical address, and the physical address is recorded in the system event log. This avoids the situation in which an internal error hangs the system and the memory fault cannot be located, and thereby improves the maintainability of the server.
It is to be understood that the present description is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be made without departing from the scope thereof.
The foregoing description of the preferred embodiments is provided for the purpose of illustration only, and is not intended to limit the scope of the disclosure, since any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the disclosure are intended to be included within the scope of the disclosure.