CN112650612B - A memory fault location method and device - Google Patents

A memory fault location method and device

Info

Publication number
CN112650612B
Authority
CN
China
Prior art keywords
fault
register
address
memory
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011550974.9A
Other languages
Chinese (zh)
Other versions
CN112650612A (en)
Inventor
赵俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New H3C Cloud Technologies Co Ltd
Original Assignee
New H3C Cloud Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-12-24
Filing date
2020-12-24
Publication date
2025-03-04
Application filed by New H3C Cloud Technologies Co Ltd
Priority to CN202011550974.9A
Publication of CN112650612A
Application granted
Publication of CN112650612B
Legal status: Active
Anticipated expiration


Abstract

Translated from Chinese

This specification provides a memory fault location method and device, relating to the field of communication technology. The memory fault location method is applied to a BMC and includes: when a fault is determined to have occurred according to a register in a logic chip on the mainboard, obtaining fault information held in internal registers from the processor on the mainboard; determining the system address of the faulty memory according to the fault information; performing address conversion on the system address according to a stored mapping relationship between system addresses and physical addresses, to obtain the physical address of the faulty memory; and recording the physical address in the system event log. The method improves the maintainability of the server.

Description

Memory fault location method and device
Technical Field
The present disclosure relates to the field of communications technologies, and in particular to a memory fault location method and device.
Background
With the rapid development of information technology, servers are widely used across industries, and their reliable operation is receiving increasing attention. Among server faults, hard disk faults and memory faults are the two most common. Memory faults can directly affect system operation and interrupt customer services, so locating them quickly has become an important research direction. Memory faults are mainly divided into correctable errors and uncorrectable errors. Correctable errors do not affect system operation, whereas uncorrectable errors can cause the system to hang and stop working completely. Uncorrectable errors are therefore of greater concern.
A system hang falls into one of two cases: an MCERR (Machine Check Error) caused by system software, or an IERR (Internal Error) caused by system hardware. In the MCERR case, the server's BIOS (basic input/output system) can still work and can detect and report the error. In the IERR case, however, the BIOS also stops running and cannot locate or report the error. How to locate and report a memory fault when an IERR occurs is therefore a problem to be solved.
Disclosure of Invention
To overcome the problems in the related art, this specification provides a memory fault location method and device.
According to a first aspect of embodiments of the present disclosure, there is provided a memory fault location method, applied to a BMC, including:
when a fault is determined to have occurred according to a register in a logic chip on the mainboard, obtaining fault information held in internal registers from a processor on the mainboard;
determining the system address of the faulty memory according to the fault information;
performing address conversion on the system address according to a stored mapping relationship between system addresses and physical addresses, to obtain the physical address of the faulty memory; and
recording the physical address in a system event log.
Optionally, before determining that a fault has occurred according to the register in the logic chip on the mainboard, the method further includes:
receiving and storing the mapping relationship between system addresses and physical addresses of the memory.
Further, determining the system address of the faulty memory according to the fault information includes:
determining a fault type according to first fault information obtained from a first register;
if the fault type is a fault of the processor itself, determining a fault module according to second fault information obtained from a second register; and
obtaining the system address corresponding to the fault module from a third register corresponding to the fault module.
Further, after determining the fault type according to the first fault information obtained from the first register, the method further includes:
if the fault type indicates that the fault is not a fault of the processor itself, stopping the memory fault location.
Further, the fault module includes an IMC, an IIO, an MLC, and a DCU;
obtaining the system address corresponding to the fault module from the third register corresponding to the fault module includes:
if the fault module is the IIO and the system address cannot be resolved, obtaining the system address from the third register corresponding to the IMC.
According to a second aspect of embodiments of the present disclosure, there is provided a memory fault location device, applied to a BMC, including:
an acquisition unit, configured to obtain fault information held in internal registers from a processor on the mainboard when a fault is determined to have occurred according to a register in a logic chip on the mainboard;
an address unit, configured to determine the system address of the faulty memory according to the fault information;
a conversion unit, configured to perform address conversion on the system address according to a stored mapping relationship between system addresses and physical addresses, to obtain the physical address of the faulty memory; and
a recording unit, configured to record the physical address in a system event log.
Optionally, the device further includes:
a storage unit, configured to receive and store the mapping relationship between system addresses and physical addresses of the memory.
Further, the address unit includes:
a type determining subunit, configured to determine a fault type according to first fault information obtained from a first register;
a module determining subunit, configured to determine, if the fault type is a fault of the processor itself, a fault module according to second fault information obtained from a second register; and
an address determining subunit, configured to obtain the system address corresponding to the fault module from a third register corresponding to the fault module.
Optionally, the device further includes:
a termination unit, configured to stop the memory fault location if the fault type indicates that the fault is not a fault of the processor itself.
Further, the fault module includes an IMC, an IIO, an MLC, and a DCU;
the address unit is further configured to obtain the system address from the third register corresponding to the IMC if the fault module is the IIO and the system address cannot be resolved.
The technical solutions provided by the embodiments of this specification can have the following beneficial effects:
In the embodiments of this specification, after a fault is detected on the mainboard, the fault information held in the processor's internal registers is obtained, the system address of the faulty memory is determined from that information, the address is converted to a physical address, and the physical address is recorded in the system event log. This avoids the situation in which an internal error hangs the system and the memory fault cannot be located, and so improves the maintainability of the server.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the specification and together with the description, serve to explain the principles of the specification.
FIG. 1 is a flow chart of a memory fault location method according to the present application;
FIG. 2 is a schematic structural diagram of a mainboard to which the memory fault location method according to the present application is applied;
FIG. 3 is a schematic diagram of the relationship between a processor and memory according to the present application;
FIG. 4 is a schematic structural diagram of a memory fault location device according to the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present specification.
The application provides a memory fault location method, applied to a BMC (baseboard management controller). As shown in FIG. 1, the method includes the following steps:
S100, when a fault is determined to have occurred according to a register in the logic chip on the mainboard, obtaining fault information held in internal registers from the processor on the mainboard.
As shown in FIG. 2, the server contains a mainboard on which a processor, a logic chip, and a BMC (baseboard management controller) are provided. The BMC can monitor the devices and modules within the server and record events in the SEL (system event log). The processor may be a CPU or another device that performs computation, and multiple CPUs may be placed on the mainboard; FIG. 2 shows two. The logic chip may be a CPLD (complex programmable logic device). The present application is described with reference to the structure of FIG. 2. The BMC may be connected to the CPU via PECI (Platform Environment Control Interface) and to the CPLD via a CATERR/MSMI pin.
When the CPU learns from a hardware interrupt that a fault has occurred, it sends a fault notification to the CPLD through its CATERR/MSMI pin, and the CPLD records the notification in an internal register. In the scenario shown in FIG. 2, CPU1 is connected to the CPLD via the CATERR1/MSMI pin and CPU2 via the CATERR2/MSMI pin.
During normal operation of the server, the BMC periodically (for example, once per second) reads the value of the CPLD's internal register through the CATERR3/MSMI pin to determine whether a CPU has reported a fault. If the value indicates no fault, the BMC waits until the next period and reads again. If the value indicates a fault, the BMC accesses the CPU's internal registers through PECI and obtains the fault information recorded there. When two CPUs are placed on the mainboard, the BMC obtains and records the fault information from the internal registers of each CPU separately.
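The polling behaviour of S100 can be sketched as follows. This is a minimal illustration only: the two callables stand in for the CPLD register read and the PECI register read, and their names, the `max_polls` limit, and the injectable `sleep` are all assumptions made for the example rather than anything the specification defines.

```python
import time
from typing import Callable, Dict, List


def poll_for_fault(
    read_cpld_fault_flag: Callable[[], bool],
    read_cpu_fault_registers: Callable[[int], Dict[str, int]],
    cpu_ids: List[int],
    period_s: float = 1.0,
    max_polls: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[int, Dict[str, int]]:
    """Poll the CPLD flag; on a fault, collect a register dump from every CPU."""
    for _ in range(max_polls):
        if read_cpld_fault_flag():
            # Fault reported: read each CPU's internal registers separately
            # (the text does this over PECI; here it is a hypothetical callable).
            return {cpu: read_cpu_fault_registers(cpu) for cpu in cpu_ids}
        sleep(period_s)  # no fault: wait for the next period and read again
    return {}


# Simulated run: the CPLD reports a fault on the third read.
flags = iter([False, False, True])
dumps = poll_for_fault(
    lambda: next(flags),
    lambda cpu: {"first_register": cpu},
    cpu_ids=[1, 2],
    sleep=lambda s: None,  # skip real waiting in the demonstration
)
```

Passing the hardware accessors in as callables keeps the loop testable without a CPLD or PECI bus.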
S101, determining the system address of the fault memory according to the fault information.
The CPU contains multiple internal registers that store fault-related information, and the BMC determines the system address of the faulty memory from the values of these registers. The registers used are described below.
A first register (e.g., mca_err_src_log) stores the fault type: whether the fault belongs to this CPU or is external (i.e., it occurred on another CPU), and whether it was an IERR or an MCERR. Although a memory fault may be treated as an IERR, it is also recorded in the MCERR-related registers, which hold more information about memory-related faults.
From the first fault information stored in the first register (e.g., bits [28:27] and bits [19:18]), it can be determined whether the fault is a fault of this CPU. If it is not, the CPU's other internal registers need not be analyzed and memory fault location for this CPU stops. If it is, analysis of the CPU's other internal registers continues.
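The text names the bit ranges of the first register but not how their values are encoded, so the sketch below only extracts the two fields and leaves the interpretation to the caller. The helper names `bit_field` and `first_register_fields` are hypothetical.

```python
from typing import Dict


def bit_field(value: int, high: int, low: int) -> int:
    """Return bits [high:low] of value (inclusive) as an unsigned integer."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def first_register_fields(reg: int) -> Dict[str, int]:
    """Extract the two fields the text mentions; their meaning is not decoded here."""
    return {
        "bits_28_27": bit_field(reg, 28, 27),
        "bits_19_18": bit_field(reg, 19, 18),
    }
```

For example, a register value with bits [28:27] set to binary 10 and bits [19:18] set to binary 01 yields the field values 2 and 1.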
A second register (e.g., MCerr_logging_reg) stores the module identifier (i.e., the second fault information) of the faulty module, which is typically a memory-related module. As shown in FIG. 3, IMCs are provided in the processor, each IMC has multiple channels, and each channel may serve multiple memory modules. The fault module may therefore be an IMC, and the second fault information distinguishes whether the fault lies in the IMC itself or in one of its channels. In FIG. 3, one processor has two IMCs, each IMC has three channels, and two memory modules can be plugged into each channel.
For example, bits [7:0] of MCerr_logging_reg can be used to determine whether the fault originated from the IMC itself or from one of its channels. When bits [7:0] indicate that the fault originated from a channel, bits [13:10] must also be examined to determine which channel.
Specifically, bits [7:0] = 0100000 represents IMC0 and bits [7:0] = 01100100 represents IMC1. After the IMC is determined, the channel is determined from bits [13:10]: 0101 represents channel 0, 0111 represents channel 1, and 1001 represents channel 2. The register values for the other channels are not listed one by one.
When it is determined from bit 2 that the fault originated from the IMC itself, bits [7:0] also identify which IMC is faulty: bits [7:0] = 0100010 represents IMC0 and bits [7:0] = 0100110 represents IMC1.
Through the above process, the specific fault module can be determined from the second fault information obtained from the second register.
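The channel lookup on bits [13:10] can be sketched as a small table. Only the three codes listed above are included; the function name and the choice to return `None` for any unlisted code are assumptions of this example, and the bits [7:0] decoding is deliberately left out because the text does not give a complete set of values.

```python
from typing import Optional

# Codes for bits [13:10] as listed in the description; other channels are not listed there.
CHANNEL_BY_CODE = {0b0101: 0, 0b0111: 1, 0b1001: 2}


def decode_channel(second_register: int) -> Optional[int]:
    """Return the channel number encoded in bits [13:10], or None if the code is unlisted."""
    code = (second_register >> 10) & 0xF
    return CHANNEL_BY_CODE.get(code)
```

Returning `None` rather than guessing keeps an unrecognized code visible to the caller.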
For each module, the CPU provides a corresponding register that stores the module's state; this register may be called a module-specific register. For example, for IMC0 itself and IMC1 itself, mc7_bank and mc8_bank may be set to store the module's state information and the system address. For the 6 channels associated with IMC0 and IMC1, the 6 registers mc13_bank to mc18_bank may be set to store each channel's state information and system address.
Each of mc7_bank, mc8_bank, and mc13_bank to mc18_bank contains several sub-registers: a status register (mc_status), an address register (mc_addr), a miscellaneous register (mc_misc), and a control register (mc_ctl). A valid flag is set in both the status register and the address register. To obtain the system address, the BMC first checks whether the valid flag in the status register is set. If it is, the BMC examines the address register and checks its valid flag as well; if that flag is also set, the address stored in the address register is taken as the system address.
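The two-stage validity check can be sketched as below. The `BankSnapshot` type is a hypothetical container for values the BMC has already read; the specification does not give the bit positions of the valid flags, so they are modelled here as plain booleans.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BankSnapshot:
    """Hypothetical view of one bank after the BMC has read its sub-registers."""
    status_valid: bool  # valid flag of the status register
    addr_valid: bool    # valid flag of the address register
    address: int        # contents of the address register


def system_address_from_bank(bank: BankSnapshot) -> Optional[int]:
    """Return the system address only if both valid flags are set, else None."""
    if not bank.status_valid:
        return None  # status register not valid: do not look at the address
    if not bank.addr_valid:
        return None  # address register not valid: its contents are not trusted
    return bank.address
```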
Because a memory fault can propagate, it may also be logged against other modules, such as the IIO, MLC, and DCU. A corresponding register is provided for each: mc3_bank for the MLC, mc1_bank for the DCU, and mc6_bank for the IIO.
When the fault module is the IIO, the sub-register mc6_misc in mc6_bank can be used to determine whether a PCIE (Peripheral Component Interconnect Express) fault exists, that is, whether the system address of the PCIE device associated with the IIO can be resolved. If that system address cannot be resolved, the fault may be considered to be caused by a memory fault. In that case, the system address of the faulty memory is obtained from the register corresponding to the IMC.
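The IIO fallback can be sketched as a lookup with one special case. The dictionary of per-module addresses and the function name are assumptions for illustration; `None` stands for "could not be resolved".

```python
from typing import Dict, Optional


def resolve_system_address(
    fault_module: str, addresses: Dict[str, Optional[int]]
) -> Optional[int]:
    """Pick the system address for the fault module, falling back from IIO to IMC.

    `addresses` maps a module name to the system address read from that
    module's register, or None when no address could be resolved.
    """
    address = addresses.get(fault_module)
    if address is None and fault_module == "IIO":
        # Unresolved IIO address: treat as a memory fault and use the IMC's register.
        return addresses.get("IMC")
    return address
```

The fallback is applied only to the IIO, matching the text; other modules with no resolvable address simply yield `None`.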
S102, performing address conversion on the system address according to the stored mapping relationship between system addresses and physical addresses, to obtain the physical address of the faulty memory.
The BMC stores the mapping relationship between system addresses and physical addresses. After the BMC obtains the system address of the faulty memory, it converts the system address into the actual physical address based on this mapping. The physical address indicates the slot in which the memory module is installed and can be used to locate the memory within the server.
The mapping relationship may be fixed in the BMC, or it may be sent to the BMC as a file over the data channel between the BIOS and the BMC before step S100, i.e., while the BIOS is powering on. The BMC stores the file in nonvolatile memory so that the mapping is available when a memory fault is detected.
The specific way in which system addresses are converted to physical addresses is similar to existing approaches and is not described again here.
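Since the conversion itself is not spelled out, the following is only one hypothetical form the stored mapping could take: a sorted list of non-overlapping system address ranges, each tied to a slot label. The `Region` type, the range layout, and the labels are all assumptions; a real mapping may be interleaved across channels and would need a different structure.

```python
import bisect
from typing import List, NamedTuple, Optional


class Region(NamedTuple):
    start: int      # first system address in the range (inclusive)
    end: int        # one past the last system address (exclusive)
    location: str   # hypothetical slot label, e.g. "CPU1 IMC0 channel 1 DIMM 0"


def translate(regions: List[Region], system_address: int) -> Optional[str]:
    """Map a system address to a slot label; regions must be sorted by start."""
    starts = [r.start for r in regions]
    i = bisect.bisect_right(starts, system_address) - 1
    if i >= 0 and system_address < regions[i].end:
        return regions[i].location
    return None  # address falls outside every known range
```

Binary search keeps the lookup cheap even when the server holds many memory modules.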
S103, recording the physical address in the system event log.
After the BMC determines the physical address of the faulty memory, it can record the address in the BMC's SEL (system event log) and display it on a display device such as a monitor.
Once the physical address of the faulty memory is displayed, a technician can locate the memory in the server by that address and replace it.
In the embodiments of this specification, after a fault is detected on the mainboard, the fault information held in the processor's internal registers is obtained, the system address of the faulty memory is determined from that information, the address is converted to a physical address, and the physical address is recorded in the system event log. This avoids the situation in which an internal error hangs the system and the memory fault cannot be located, and so improves the maintainability of the server.
A large server may contain a large number of memory modules, and this method can locate the faulty one among them.
Correspondingly, the application also provides a memory fault location device, as shown in FIG. 4, applied to a BMC, including:
an acquisition unit, configured to obtain fault information held in internal registers from a processor on the mainboard when a fault is determined to have occurred according to a register in a logic chip on the mainboard;
an address unit, configured to determine the system address of the faulty memory according to the fault information;
a conversion unit, configured to perform address conversion on the system address according to a stored mapping relationship between system addresses and physical addresses, to obtain the physical address of the faulty memory; and
a recording unit, configured to record the physical address in a system event log.
Optionally, the device further includes:
a storage unit, configured to receive and store the mapping relationship between system addresses and physical addresses of the memory.
Further, the address unit includes:
a type determining subunit, configured to determine a fault type according to first fault information obtained from a first register;
a module determining subunit, configured to determine, if the fault type is a fault of the processor itself, a fault module according to second fault information obtained from a second register; and
an address determining subunit, configured to obtain the system address corresponding to the fault module from a third register corresponding to the fault module.
Optionally, the device further includes:
a termination unit, configured to stop the memory fault location if the fault type indicates that the fault is not a fault of the processor itself.
Further, the fault module includes an IMC, an IIO, an MLC, and a DCU;
the address unit is further configured to obtain the system address from the third register corresponding to the IMC if the fault module is the IIO and the system address cannot be resolved.
The technical solutions provided by the embodiments of this specification can have the following beneficial effects:
In the embodiments of this specification, after a fault is detected on the mainboard, the fault information held in the processor's internal registers is obtained, the system address of the faulty memory is determined from that information, the address is converted to a physical address, and the physical address is recorded in the system event log. This avoids the situation in which an internal error hangs the system and the memory fault cannot be located, and so improves the maintainability of the server.
It is to be understood that the present description is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be made without departing from the scope thereof.
The foregoing description of the preferred embodiments is provided for the purpose of illustration only, and is not intended to limit the scope of the disclosure, since any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the disclosure are intended to be included within the scope of the disclosure.

Claims (8)

Application CN202011550974.9A; Priority date: 2020-12-24; Filing date: 2020-12-24; Title: A memory fault location method and device; Status: Active; Publication: CN112650612B (en)

Priority Applications (1)

Application Number: CN202011550974.9A; Publication: CN112650612B (en); Priority Date: 2020-12-24; Filing Date: 2020-12-24; Title: A memory fault location method and device

Applications Claiming Priority (1)

Application Number: CN202011550974.9A; Publication: CN112650612B (en); Priority Date: 2020-12-24; Filing Date: 2020-12-24; Title: A memory fault location method and device

Publications (2)

Publication Number: CN112650612A (en); Publication Date: 2021-04-13
Publication Number: CN112650612B (en); Publication Date: 2025-03-04

Family

ID=75359960

Family Applications (1)

Application Number: CN202011550974.9A; Title: A memory fault location method and device; Status: Active; Publication: CN112650612B (en); Priority Date: 2020-12-24; Filing Date: 2020-12-24

Country Status (1)

Country: CN; Link: CN112650612B (en)


Legal Events

Code: Title
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
