WO2024239866A1


Info

Publication number: WO2024239866A1
Application number: PCT/CN2024/088676
Authority: WO
Inventors: 林沐晖
Original assignee: Hangzhou AliCloud Feitian Information Technology Co Ltd
Current assignee: Hangzhou AliCloud Feitian Information Technology Co Ltd
Priority date: 2023-05-24
Filing date: 2024-04-18
Publication date: 2024-11-28
Anticipated expiration: 2025-11-24
Also published as: CN116610493A

Abstract

Provided in the embodiments of the present application are a checkpoint-based application dump method, a checkpoint-based application restart method, a device and a storage medium. The present application innovatively proposes integrating state description information under a target application for a target specific device assembled at a computing node, and adding the state description information into a checkpoint file created for the target application; on this basis, during the process of restarting the target application on the basis of the checkpoint, the state description information can be read from the checkpoint file and, on the basis of the state description information, the device state of the target specific device under the target application is restarted to a specified checkpoint, which can provide a correct device state for the target application, thus ensuring the normal restart of the target application. Therefore, by reforming dump processes and restart processes, the embodiments of the present application can ensure that target applications can still be normally restarted when using stateful specific devices.

Description

Translated from Chinese

Checkpoint-based application dump and recovery method, device, and storage medium

This application claims priority to Chinese patent application No. 202310608335.0, filed with the China Patent Office on May 24, 2023 and entitled “A checkpoint-based application dump and recovery method, device and storage medium”, the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the field of cloud computing technology, and in particular to a checkpoint-based application dump and recovery method, device, and storage medium.

Background Art

Public clouds serve multi-tenant environments and supply elastic computing capacity for customers' bursty demands. Many HPC (high performance computing) applications are heavy-load applications that place great pressure on computing resources, and many of them require multi-node parallel computing. In a cloud environment, supporting elastic use of computing resources by HPC applications is therefore an important means of reducing customers' total cost of ownership.

At present, the structure of computing nodes in cloud environments is increasingly diverse, as is the underlying logic that supports running applications on those nodes. As a result, when the memory data of an HPC application is simply copied according to the traditional checkpoint and restart (CR) scheme, the HPC application often cannot be recovered, which makes it impossible to support elastic use of computing resources by HPC applications.

Summary of the Invention

Multiple aspects of the present application provide a checkpoint-based application dump and recovery method, device, and storage medium to better support the dumping and recovery of applications.

An embodiment of the present application provides a checkpoint-based application dump method, applicable to a computing node equipped with a target specific device. The method includes: in response to a checkpoint creation instruction for a target application, obtaining state description information of the target specific device under the target application, where the state description information is used to support restoring the device state of the target specific device under the target application to the current checkpoint; adding the state description information to a checkpoint file constructed for the target application; and dumping the checkpoint file, so that when the target application is restored to the current checkpoint, the device state of the target specific device is restored based on the state description information.

An embodiment of the present application further provides a checkpoint-based application recovery method, applicable to a computing node equipped with a target specific device. The method includes: in response to a recovery instruction to restore a target application to a specified checkpoint, obtaining the checkpoint file of the target application corresponding to the specified checkpoint; reading, from the checkpoint file, state description information of the target specific device under the target application; and, according to the state description information, restoring the device state of the target specific device under the target application to the specified checkpoint, so as to restore the target application to the specified checkpoint.

An embodiment of the present application further provides a computing node, including a memory, a processor, and a communication component. The memory is configured to store one or more computer instructions; the processor is coupled to the memory and the communication component and is configured to execute the one or more computer instructions so as to perform the aforementioned checkpoint-based application dump method or the aforementioned checkpoint-based application recovery method.

An embodiment of the present application further provides a computer-readable storage medium storing computer instructions which, when executed by one or more processors, cause the one or more processors to perform the aforementioned checkpoint-based application dump method or the aforementioned checkpoint-based application recovery method.

Brief Description of the Drawings

The drawings described herein are provided for a further understanding of the present application and constitute a part of it. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not constitute an improper limitation on it. In the drawings:

FIG. 1 is a flow chart of a checkpoint-based application dump method provided by an exemplary embodiment of the present application;

FIG. 2 is a logical diagram of a checkpoint-based application dump method provided by an exemplary embodiment of the present application;

FIG. 3 is a flow chart of a checkpoint-based application recovery method provided by an exemplary embodiment of the present application;

FIG. 4 is a logical diagram of a checkpoint-based application recovery method provided by an exemplary embodiment of the present application;

FIG. 5 is a schematic structural diagram of a computing node provided by another exemplary embodiment of the present application.

Detailed Description

To make the purpose, technical solutions, and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present application.

At present, HPC applications often cannot be recovered, so elastic use of computing resources by HPC applications cannot be supported. To address this, some embodiments of the present application propose that, while a target application is being dumped based on a checkpoint, state description information under the target application is assembled for a target specific device installed on the computing node and added to the checkpoint file constructed for the target application. In addition to the content required for traditional application recovery, the checkpoint file thus carries state description information that supports restoring the device state of the target specific device under the target application to a specified checkpoint. On this basis, while the target application is being recovered based on a checkpoint, the state description information can be read from the checkpoint file and used to restore the device state of the target specific device under the target application to the specified checkpoint. This provides the target application with the correct device state and thereby ensures its normal recovery. By modifying the dump process and the recovery process in this way, the embodiments of the present application ensure that a target application can still be recovered normally when it uses stateful specific devices.

The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.

FIG. 1 is a flow chart of a checkpoint-based application dump method provided by an exemplary embodiment of the present application, and FIG. 2 is a logical diagram of that method. The method may be executed by a data processing apparatus, which may be implemented as software, hardware, or a combination of the two, and which may be integrated into a computing node. Referring to FIG. 1, the method may include:

Step 100: in response to a checkpoint creation instruction for a target application, obtain state description information of a target specific device under the target application, where the state description information is used to support restoring the device state of the target specific device under the target application to the current checkpoint; Step 101: add the state description information to a checkpoint file constructed for the target application; Step 102: dump the checkpoint file, so that when the target application is restored to the current checkpoint, the device state of the target specific device is restored based on the state description information.
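Steps 100 to 102 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation described in the application: `CheckpointFile`, `collect_state_description`, and `dump_checkpoint` are hypothetical names, the register values are invented, and a local JSON file stands in for the persistent storage resource.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field


@dataclass
class CheckpointFile:
    """Hypothetical container: conventional memory image plus device state."""
    app_id: str
    memory_image: dict
    device_states: dict = field(default_factory=dict)


def collect_state_description(device_id: str, app_id: str) -> dict:
    """Step 100 (stand-in): gather what is needed to rebuild the device state.

    A real collector would query the device and its driver; the values
    returned here are made up for illustration.
    """
    return {"device": device_id, "app": app_id, "registers": {"CTRL": 1}}


def dump_checkpoint(app_id: str, device_id: str, memory_image: dict, out_dir: str) -> str:
    state = collect_state_description(device_id, app_id)   # step 100
    ckpt = CheckpointFile(app_id=app_id, memory_image=memory_image)
    ckpt.device_states[device_id] = state                   # step 101
    path = os.path.join(out_dir, f"{app_id}.ckpt.json")
    with open(path, "w", encoding="utf-8") as fh:           # step 102
        json.dump(asdict(ckpt), fh)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(dump_checkpoint("app1", "dev0", {"heap": [1, 2]}, tmp))
```

The point of the sketch is only the ordering: device state is collected first, merged into the same file as the conventional checkpoint content, and then persisted as one unit.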

The application dump method provided in this embodiment is applicable to scenarios that require dumping and recovering applications, for example scenarios that support elastic resource usage by applications in a cloud environment, or checkpoint-based automatic fault tolerance. This embodiment does not limit the application scenario.

Checkpoint and restart (CR) is a backward recovery technique. By setting checkpoints while an application runs normally and saving the corresponding checkpoint files, the application can be rolled back to a checkpoint and re-executed from there instead of from the beginning. A checkpoint can be understood as a snapshot of the application at a given moment. At present, CR works by building the application's memory data into a checkpoint file and storing that file persistently, so that the memory data can later be restored for the application on the original computing node or on a new one, thereby restoring the application to the checkpoint. The memory data mainly holds information that is already maintained in memory, such as the state of the process itself and the operating system environment; these items are not enumerated here.

However, the inventor found during research that when the application's memory data is simply copied according to the traditional CR scheme, the application often cannot be recovered. This embodiment therefore provides an improved scheme to address that problem.

Referring to FIG. 2, in this embodiment a computing node may be equipped with specific devices. The node may carry one or more specific devices of one or more types; this embodiment does not limit either. Applications hosted on the computing node may use these specific devices. The computing node in this embodiment may be a cloud server, a server cluster, or the like, which is likewise not limited.

In this embodiment, a specific device is usually a stateful device, including but not limited to a stateful IO device or a stateful heterogeneous acceleration device. A stateful device is one whose behavior is guided by its device state; that is, the device state affects how the device behaves. If an application needs to call a target specific device, the computing node must compute the device state of the target specific device under the target process and return it to the target application, which then operates on the target specific device based on that computed state. Giving the target application the correct device state of the target specific device is therefore a necessary condition for the application to run normally. However, this device state computation is transparent to, and leaves no trace in, the memory data of the target application: the state values in the specific device are not part of the memory data. Copying the memory data according to the traditional CR scheme therefore cannot dump the device state. It is precisely because the device state cannot be dumped correctly that applications often fail to recover under the traditional CR scheme.

To address this, this embodiment proposes assembling, for a specific device installed on the computing node, its state description information under the target application, so as to support dumping and restoring the device state.

For ease of description, this embodiment uses the target specific device installed on the computing node as the example. The target specific device may be any specific device installed on the computing node that is used by the target application. From the target application's perspective, it may use multiple specific devices on the computing node; the technical solution provided in this embodiment can equally be applied to the specific devices it uses other than the target specific device.

Referring to FIG. 1, in step 100, state description information of the target specific device under the target application may be obtained in response to a checkpoint creation instruction for the target application.

In this embodiment, the state description information is used to support restoring the device state of the target specific device under the target application to the current checkpoint. That is, during subsequent application recovery, the state description information can be used to restore the device state of the target specific device under the target application to the state that should be presented to the target application as of the current checkpoint. To this end, the state description information may contain every item of information needed to compute the device state for the specific device.

As mentioned above, the computing node must compute the device state of the target specific device under the target process and return it to the target application, which then operates on the device based on that computed state. The state value recorded in the specific device and the device state finally returned to the target application may therefore differ. Such a difference is fairly common and has many possible causes. For example, if the current device state of a specific device can only be determined with reference to its previous device state (which may be called context information), then determining the current state requires not only reading the state value from the device but also combining it with the previous state before the result is returned to the application. In that case the current device state returned to the application is not the same as the state value read from the device.
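A concrete case makes the difference visible. The sketch below is an invented illustration, not a device described in the application: it assumes a hypothetical 8-bit hardware counter that wraps around, while the state presented to the application is a running total that depends on the previously observed raw value and total (the context information).

```python
COUNTER_WIDTH = 1 << 8  # assumed 8-bit hardware counter that wraps at 256


def presented_state(raw_value: int, context: dict) -> int:
    """Combine the raw register value with the previous state (context).

    `context` holds the raw value and running total seen at the previous read.
    """
    delta = (raw_value - context["last_raw"]) % COUNTER_WIDTH
    return context["last_total"] + delta


# The register reads 5, but the application is shown 1011, because the
# counter wrapped since the previous read (raw 250, total 1000).
total = presented_state(5, {"last_raw": 250, "last_total": 1000})
```

Saving only the raw register value (5) would not be enough to reproduce 1011 after recovery; the context has to travel with it.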

For this reason, this embodiment proposes that the state description information contain, in addition to the state values originally recorded in the specific device, the other information needed to compute the device state, so that the device state can be restored correctly during recovery.

The information items contained in the state description information may differ between types of specific device, depending on factors such as the device's internal working principle. This embodiment does not limit these items. Exemplary items include the identifiers of device registers in the specific device, the attributes of the device registers, the state values of the device registers, the state values of device memory, the state values of the device driver software, context information used to support state transitions, and/or mapping relationships used to support state mapping. Register attributes may include, but are not limited to, read-only, write-only, requires emulation, or requires mapping. The state value of a device register is the state data recorded in that register, which can serve as a basis for computing the device state of the specific device. The items included in the state description information can be set as needed and are not limited to those listed. The state description information in this embodiment thus contains not only the various state values recorded in the target specific device but also auxiliary information such as context information and mapping relationships, so as to support correct restoration of the device state.
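One possible in-memory layout for these items is sketched below in Python. The class and field names (`StateDescription`, `RegisterRecord`, `RegisterAttr`, and so on) are hypothetical, and JSON is an assumed serialization format; the application does not prescribe either.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Dict, List


class RegisterAttr(Enum):
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    NEEDS_EMULATION = "needs_emulation"
    NEEDS_MAPPING = "needs_mapping"


@dataclass
class RegisterRecord:
    reg_id: str          # identifier of the device register
    attr: RegisterAttr   # attribute of the device register
    value: int           # state value recorded in the register


@dataclass
class StateDescription:
    device_id: str
    registers: List[RegisterRecord] = field(default_factory=list)
    device_memory: Dict[str, int] = field(default_factory=dict)
    driver_state: Dict[str, int] = field(default_factory=dict)
    context: Dict[str, int] = field(default_factory=dict)    # for state transitions
    mappings: Dict[str, str] = field(default_factory=dict)   # for state mapping

    def to_json(self) -> str:
        payload = asdict(self)
        for reg in payload["registers"]:
            reg["attr"] = reg["attr"].value  # Enum members are not JSON serializable
        return json.dumps(payload, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "StateDescription":
        payload = json.loads(text)
        payload["registers"] = [
            RegisterRecord(r["reg_id"], RegisterAttr(r["attr"]), r["value"])
            for r in payload["registers"]
        ]
        return cls(**payload)
```

A record built this way can be embedded in the checkpoint file as text and rebuilt with `StateDescription.from_json` during recovery.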

In this embodiment, the state description information of the target specific device under the target application may be obtained in various ways. In one implementation, the target device type of the target specific device is determined, and the data collection interface adapted to that device type is called to collect the state description information of the target specific device under the target application. Different device types are adapted to different data collection interfaces, and each data collection interface defines the information items to be collected, and the collection logic, for the device types it supports.

This implementation takes into account that the information items needed to compute the device state may differ between types of specific device, and that different types may also store the relevant items in different locations and in different ways. A variety of data collection interfaces can therefore be developed in advance, with different device types adapted to different interfaces, and each interface can predefine the information items and collection logic for the device types it supports. These items and this logic can be defined with reference to, for example, the product manual of the specific device or other processing logic in the operating system related to the device, and are not specifically limited here. On this basis, calling the appropriate data collection interface makes it possible to obtain the state description information of the target specific device under the target application comprehensively and accurately.
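Selecting a collection interface by device type can be sketched as a simple registry. This is an assumed design in Python, not the application's own: the device type names (`"nic"`, `"accelerator"`), the function names, and the collected values are all hypothetical.

```python
from typing import Callable, Dict

# A collector takes (device_id, app_id) and returns state description items.
Collector = Callable[[str, str], dict]

_COLLECTORS: Dict[str, Collector] = {}


def register_collector(device_type: str) -> Callable[[Collector], Collector]:
    """Decorator that adapts a collection interface to a device type."""
    def wrap(fn: Collector) -> Collector:
        _COLLECTORS[device_type] = fn
        return fn
    return wrap


@register_collector("nic")
def collect_nic(device_id: str, app_id: str) -> dict:
    # Invented items; a real interface would follow the device manual.
    return {"device": device_id, "app": app_id, "items": ["queue_registers", "mappings"]}


@register_collector("accelerator")
def collect_accelerator(device_id: str, app_id: str) -> dict:
    return {"device": device_id, "app": app_id, "items": ["device_memory", "context"]}


def collect_state(device_type: str, device_id: str, app_id: str) -> dict:
    """Determine the interface for the device type, then call it."""
    try:
        collector = _COLLECTORS[device_type]
    except KeyError:
        raise ValueError(f"no collector registered for device type {device_type!r}") from None
    return collector(device_id, app_id)
```

Supporting a new device type then means registering one more collector, without touching the dump logic that calls `collect_state`.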

Other implementations may also be used to obtain the state description information of the target specific device under the target application. The main idea is to collect and integrate the information items that support restoring the device state of the target specific device under the target application to the current checkpoint; the means used to collect and integrate them can be flexible and varied, and are not limited in this embodiment.

Referring to FIG. 1, in step 101, the state description information obtained in step 100 may be added to the checkpoint file constructed for the target application.

The checkpoint file may be constructed according to traditional checkpoint and restart techniques. This embodiment neither limits nor elaborates on the content of a traditional checkpoint file; implementations from the related art can be used for that part. Step 101 can thus be understood as adding the state description information proposed in this embodiment to a traditional checkpoint file.

Steps 100 and 101 use the target specific device only as an example of obtaining state description information and adding it to the checkpoint file. In practice, the state description information of the other specific devices on the computing node under the target application is also obtained and added to the checkpoint file. From the computing node's perspective, the checkpoint file generated on it for the target application therefore contains the traditional checkpoint content together with the state description information, under the target application, of each specific device the target application used before the moment corresponding to the current checkpoint.

On this basis, referring to FIG. 1, in step 102 the checkpoint file may be dumped, so that when the target application is restored to the current checkpoint, the device state of the target specific device is restored based on the state description information.

In practice, the checkpoint file generated in step 101 for the target application at the current checkpoint can be dumped to a persistent storage resource. The type of persistent storage resource is not limited; it may be, for example, a cloud disk or a cloud database.

This completes the response to the current checkpoint creation instruction. In this embodiment, further checkpoint creation instructions for the target application can subsequently be initiated on the computing node to create more checkpoints, and the corresponding checkpoint files can likewise be stored persistently. When the target application is later recovered, the desired checkpoint can be specified as needed, and the checkpoint file corresponding to that checkpoint serves as the basis for recovery.

Because the checkpoint file in this embodiment contains the state description information described above, application recovery based on the checkpoint file not only achieves the traditional recovery result but also restores the device state of the specific device under the target application, ensuring that the target application runs normally after recovery.
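The recovery side can be sketched in the same spirit. The Python below is an illustration under assumptions, not the recovery procedure of the application: it assumes a JSON checkpoint file with hypothetical `memory_image` and `device_states` keys, and `FakeDevice` is a stand-in for a real device handle.

```python
import json
import os
import tempfile
from typing import Dict


class FakeDevice:
    """Stand-in for a stateful device handle (hypothetical)."""

    def __init__(self) -> None:
        self.registers: Dict[str, int] = {}

    def write_register(self, name: str, value: int) -> None:
        self.registers[name] = value


def restore_from_checkpoint(path: str, devices: Dict[str, FakeDevice]) -> dict:
    """Read the checkpoint file and re-apply the saved device state."""
    with open(path, encoding="utf-8") as fh:
        ckpt = json.load(fh)
    # Device state first, so the application sees correct state once it resumes.
    for device_id, state in ckpt.get("device_states", {}).items():
        device = devices[device_id]
        for name, value in state["registers"].items():
            device.write_register(name, value)
    # The memory image is what a traditional scheme would restore on its own.
    return ckpt["memory_image"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "app1.ckpt.json")
        with open(demo_path, "w", encoding="utf-8") as fh:
            json.dump({"memory_image": {"heap": [1]},
                       "device_states": {"dev0": {"registers": {"CTRL": 1}}}}, fh)
        dev = FakeDevice()
        print(restore_from_checkpoint(demo_path, {"dev0": dev}), dev.registers)
```

A checkpoint file without a `device_states` entry is handled like a traditional one: only the memory image is returned.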

In this embodiment, for heavy-load applications such as HPC applications, which run multiple processes in parallel that may be distributed across different computing nodes, a timeout can be set while the target application is being dumped so that none of the computing nodes hosting the target application exits prematurely. In addition, if computing the device state of the target specific device under the target application requires related data from other computing nodes, that data is also carried in the checkpoint file as information items of the state description information. This further ensures that device states remain consistent across the computing nodes when such applications are recovered, avoiding errors. In practice this makes it possible to dump stateful HPC workloads (that is, HPC applications that involve device state) and thus to migrate HPC applications without obstruction, which in turn better supports cluster scale-out and scale-in, or allows higher-priority multi-node parallel jobs that were submitted later to run earlier, improving resource utilization.
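The source does not say how the timeout is applied. One plausible reading, sketched here in Python with threads standing in for computing nodes, is that each node waits at a barrier after its local dump and gives up only when the timeout elapses. The function name and the delays are hypothetical.

```python
import threading
import time
from typing import Dict


def dump_on_all_nodes(node_delays: Dict[str, float], timeout_s: float) -> Dict[str, bool]:
    """Simulate a coordinated dump; returns per-node success flags.

    Each thread stands in for one computing node. A node that finishes its
    local dump early waits for the others (up to timeout_s) instead of exiting.
    """
    barrier = threading.Barrier(len(node_delays))
    results: Dict[str, bool] = {}
    lock = threading.Lock()

    def worker(node: str, delay: float) -> None:
        time.sleep(delay)  # stands in for the node's local dump work
        try:
            barrier.wait(timeout=timeout_s)
            ok = True
        except threading.BrokenBarrierError:
            ok = False  # some node did not arrive within the timeout
        with lock:
            results[node] = ok

    threads = [threading.Thread(target=worker, args=item) for item in node_delays.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With a generous timeout all nodes report success; if one node is slower than the timeout allows, the barrier breaks and every node reports failure, so no node treats a partial dump as complete.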

In summary, in this embodiment, in the process of dumping the target application based on a checkpoint, it is innovatively proposed to integrate, for the target specific device installed on the computing node, state description information under the target application, and to add this state description information to the checkpoint file constructed for the target application. In this way, in addition to the content required for traditional application recovery, the checkpoint file also contains state description information that supports restoring the device state of the target specific device under the target application to the specified checkpoint. On this basis, in the process of recovering the target application based on the checkpoint, the state description information can be read from the checkpoint file, and the device state of the target specific device under the target application can be restored to the specified checkpoint based on it, which provides the target application with the correct device state and thereby ensures its normal recovery. Therefore, in the embodiments of the present application, by modifying the dump process and the recovery process, it can be ensured that the target application can still be recovered normally when it uses a stateful specific device.

在上述或下述实施例中，可采用多种实现方式来支持在步骤100中获取目标特定设备在目标应用下的状态描述信息的操作。In the above or below embodiments, multiple implementations may be used to support the operation of obtaining the state description information of the target specific device under the target application in step 100 .

In one implementation: before the time corresponding to the current checkpoint, if a target process in the target application is detected initiating a state access operation on the target specific device, the state description data corresponding to that state access operation is obtained; this is the data used to calculate the device state to be returned to the target process. The state description data of the target specific device under the target process is then updated to the state description data corresponding to the state access operation. The target process can be any process in the target application that runs on the computing node and has used the target specific device before the time corresponding to the current checkpoint.

在该实现方案中，考虑到目标特定设备的设备状态是因目标应用的不断操作而不断变化的，而目标应用在对目标特定设备操作之前通常需要发起针对目标特定设备的状态访问操作，以获取到目标特定设备的实时状态，因此，这里以状态访问操作为单位，获取每个状态访问操作对应的状态描述信息。应当理解的是，与状态描述信息的定义相适配地，状态描述数据可用于支持将目标特定设备在目标进程下的设备状态恢复至当前检查点。In this implementation, considering that the device state of the target specific device is constantly changing due to the continuous operation of the target application, and the target application usually needs to initiate a state access operation for the target specific device before operating the target specific device to obtain the real-time state of the target specific device, therefore, the state description information corresponding to each state access operation is obtained here in units of state access operations. It should be understood that, in accordance with the definition of the state description information, the state description data can be used to support restoring the device state of the target specific device under the target process to the current checkpoint.

这样，基于该实现方案，状态描述信息中可包含目标特定设备在目标应用中至少一个进程下的状态描述数据。In this way, based on this implementation, the state description information may include state description data of the target specific device under at least one process in the target application.

实际应用中，对于目标应用中的目标进程来说，其在当前检查点对应的时刻之前可能会多次发起针对目标特定设备的状态访问操作，本实施例中，并无需保留全部状态访问操作对应的状态描述数据，而是可仅保留目标进程在当前检查点对应的时刻之前最后一次针对目标特定设备发起的状态访问操作对应的状态描述数据即可。保留的该状态描述数据足够支持将目标特定设备在目标进程下的设备状态恢复至当前检查点了。In actual applications, for a target process in a target application, it may initiate state access operations for a target specific device multiple times before the time corresponding to the current checkpoint. In this embodiment, it is not necessary to retain the state description data corresponding to all state access operations, but only the state description data corresponding to the last state access operation initiated by the target process for the target specific device before the time corresponding to the current checkpoint can be retained. The retained state description data is sufficient to support restoring the device state of the target specific device under the target process to the current checkpoint.

为此，在该实现方案中提出了将目标特定设备在目标进程下的状态描述数据，更新为状态访问操作对应的状态描述数据。也即是，在当前检查点对应的时刻之前，可在每次发生目标进程针对目标特定设备的状态访问操作时，都执行上述的更新操作，从而实现目标特定设备在目标应用下的状态描述信息中仅保留目标特定设备在目标应用中至少一个进程下的最新的状态描述数据即可。这可有效减少应用转储的数据代价。To this end, the implementation scheme proposes to update the state description data of the target specific device under the target process to the state description data corresponding to the state access operation. That is, before the time corresponding to the current checkpoint, the above update operation can be performed each time the state access operation of the target process to the target specific device occurs, so that only the latest state description data of the target specific device under at least one process in the target application is retained in the state description information of the target specific device under the target application. This can effectively reduce the data cost of application dumping.
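The update-in-place behaviour described above can be sketched as follows. This is a minimal illustration only: the class name `LatestStateStore`, the key layout (process id, device id), and the dictionary form of the state description data are assumptions made for the example and are not defined in the application.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class LatestStateStore:
    """Keeps only the newest state description data per (process, device)."""

    # Key: (process id, device id). Value: state description data.
    # The dict-based layout is hypothetical.
    entries: Dict[Tuple[int, str], dict] = field(default_factory=dict)

    def on_state_access(self, pid: int, device_id: str, description: dict) -> None:
        # Overwrite instead of append: earlier accesses are not needed
        # to restore the device state to the current checkpoint.
        self.entries[(pid, device_id)] = dict(description)

    def snapshot(self) -> Dict[Tuple[int, str], dict]:
        # Copy taken when a checkpoint creation instruction arrives.
        return {key: dict(value) for key, value in self.entries.items()}


store = LatestStateStore()
store.on_state_access(101, "dev0", {"reg_status": 0x1})
store.on_state_access(101, "dev0", {"reg_status": 0x3})  # replaces the first entry
store.on_state_access(102, "dev0", {"reg_status": 0x7})
```

Because each access overwrites the previous entry for the same process and device, the amount of data to dump grows with the number of processes rather than with the number of state access operations.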

承接前述实施例中提及的数据采集接口，在该实现方案中，可在发生状态访问操作时，通过调用合适的数据采集接口而为状态访问操作获取到对应的状态描述数据。Following the data acquisition interface mentioned in the above-mentioned embodiment, in this implementation, when a state access operation occurs, the corresponding state description data can be acquired for the state access operation by calling a suitable data acquisition interface.

这样，在当前检查点对应的时刻之前，可不断地整合目标特定设备在目标应用下的状态描述信息，以支持在步骤100中响应于针对目标应用的检查点创建指令而获取到目标特定设备在目标应用下的状态描述信息。In this way, before the moment corresponding to the current checkpoint, the state description information of the target specific device under the target application can be continuously integrated to support obtaining the state description information of the target specific device under the target application in response to the checkpoint creation instruction for the target application in step 100.

In one embodiment, a designated memory area may be allocated for the target application, and the state description information of the target specific device under the target application may be stored in that designated memory area. Likewise, the state description information of other specific devices under the target application may also be stored there. Based on this, in the above implementation, the state description data of the target specific device under the target process can be updated, within the designated memory area, to the state description data corresponding to the state access operation. That is, each state access operation on the target specific device in the target application triggers one update operation on the designated memory area, and the update operation updates the corresponding state description data stored there. In this way, the designated memory area is updated continuously while the target application runs, so that it maintains the state description information of each specific device under the target application.

基于该指定内存区域，在步骤101中，可将该指定内存区域中的数据添加至为目标应用构建的检查点文件中。也即是，除了按照传统方案对目标应用的传统内存数据进行转储之外，还提出了将这里为目标应用额外分配的指定内存区域中的内存数据也进行转储。Based on the designated memory area, the data in the designated memory area can be added to the checkpoint file constructed for the target application in step 101. That is, in addition to dumping the traditional memory data of the target application according to the traditional solution, it is also proposed to dump the memory data in the designated memory area additionally allocated for the target application.
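As a rough sketch of step 101, the contents of the designated memory area can be written into the checkpoint file alongside the conventional memory dump. The JSON container, the field names `app_memory` and `device_state`, and the helper names are assumptions for illustration; the application does not specify a checkpoint file format.

```python
import json
import tempfile
from pathlib import Path


def build_checkpoint(app_memory: bytes, device_state_region: dict) -> dict:
    """Combine the conventional memory dump with the designated memory area."""
    return {
        "app_memory": app_memory.hex(),       # traditional application memory data
        "device_state": device_state_region,  # extra region holding state descriptions
    }


def dump_checkpoint(checkpoint: dict, path: Path) -> None:
    path.write_text(json.dumps(checkpoint, sort_keys=True))


def load_checkpoint(path: Path) -> dict:
    return json.loads(path.read_text())


# Hypothetical region layout: device id -> process id -> state description data.
region = {"dev0": {"101": {"reg_status": 3}, "102": {"reg_status": 7}}}

with tempfile.TemporaryDirectory() as tmp:
    ckpt_path = Path(tmp) / "app.ckpt.json"
    dump_checkpoint(build_checkpoint(b"\x00\x01", region), ckpt_path)
    restored = load_checkpoint(ckpt_path)
```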

当然，除了上述示例性的实现方案外，本实施例中，还可采用其它实现方案整合目标特定设备在目标应用下的状态描述信息。例如，在接收到检查点创建指令后，回溯目标应用中各个进程各自对目标特定设备发起的最后一次状态访问操作，并为回溯的各个状态访问操作获取对应的状态描述数据，等。本实施例对状态描述信息的维护方式不做限定。Of course, in addition to the above exemplary implementation schemes, in this embodiment, other implementation schemes may be used to integrate the state description information of the target specific device under the target application. For example, after receiving the checkpoint creation instruction, the last state access operation initiated by each process in the target application to the target specific device is traced back, and the corresponding state description data is obtained for each traced state access operation, etc. This embodiment does not limit the maintenance method of the state description information.
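The alternative mentioned above, tracing back the last state access operation of each process once the checkpoint creation instruction arrives, might look like the sketch below. The access log and its record layout (timestamp, process id, device id, description data) are hypothetical; the application does not say how past accesses would be recorded.

```python
from typing import Dict, Iterable, Tuple

# (timestamp, process id, device id, state description data); hypothetical layout.
AccessRecord = Tuple[float, int, str, dict]


def last_access_per_process(
    log: Iterable[AccessRecord], device_id: str, checkpoint_time: float
) -> Dict[int, dict]:
    """Return, per process, the description data of its last access before the checkpoint."""
    latest: Dict[int, Tuple[float, dict]] = {}
    for ts, pid, dev, description in log:
        if dev != device_id or ts >= checkpoint_time:
            continue  # other device, or not before the checkpoint time
        if pid not in latest or ts > latest[pid][0]:
            latest[pid] = (ts, description)
    return {pid: description for pid, (_, description) in latest.items()}


access_log = [
    (1.0, 101, "dev0", {"reg_status": 1}),
    (2.0, 102, "dev0", {"reg_status": 7}),
    (3.0, 101, "dev0", {"reg_status": 3}),
    (9.0, 101, "dev0", {"reg_status": 5}),  # after the checkpoint, ignored
    (2.5, 101, "dev1", {"reg_status": 8}),  # different device, ignored
]
traced = last_access_per_process(access_log, "dev0", checkpoint_time=5.0)
```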

综上，本实施例中，可采用各种实现方案在计算节点中维护目标特定设备在目标应用下的状态描述信息，以支持在步骤100中可在发生检查点创建指令时，全面且准确地获取到目标特定设备在目标应用下的状态描述信息。In summary, in this embodiment, various implementation schemes can be used to maintain the status description information of the target specific device under the target application in the computing node, so as to support the comprehensive and accurate acquisition of the status description information of the target specific device under the target application when a checkpoint creation instruction occurs in step 100.

在上述或下述实施例中，还可在目标应用运行之前，在计算节点上启动预置的守护进程。并可由该守护进程来执行前述的步骤100-101。示例性地，在步骤100中，该守护进程可从为目标应用分配的指定内存区域中读取到目标特定设备在目标应用下的状态描述信息，在步骤101中，则可由该守护进程将读取到的状态描述信息添加到为目标应用构建的检查点文件中；以及在步骤102中，可由该守护进程对检查点文件进行转储。In the above or below embodiments, a preset daemon process may be started on the computing node before the target application is run. The daemon process may perform the aforementioned steps 100-101. Exemplarily, in step 100, the daemon process may read the state description information of the target specific device under the target application from the designated memory area allocated for the target application, and in step 101, the daemon process may add the read state description information to the checkpoint file constructed for the target application; and in step 102, the daemon process may dump the checkpoint file.

另外，在目标应用在计算节点上的运行期间，也可由该守护进程为目标应用维护相关的状态描述信息。也即是，可由该守护进程调用合适的数据采集接口而获取到相关状态访问操作对应的状态描述数据，并更新至指定内存区域中。In addition, during the operation of the target application on the computing node, the daemon process can also maintain relevant state description information for the target application. That is, the daemon process can call a suitable data acquisition interface to obtain the state description data corresponding to the relevant state access operation and update it to the specified memory area.

这里可以存在两种可能的设计构思：There are two possible design ideas here:

In one design concept, the computing node returns the device state of the target specific device to each process in the target application in the traditional device state response manner. In the traditional response manner, the calculation of the device state may be performed by the operating system of the computing node, by the device controller of the specific device, or the like. The daemon process in this embodiment is responsible only for obtaining the state description data and does not take part in responding with the device state.

In another design concept, the computing node intercepts each state access operation initiated by each process in the target application on the target specific device, and the daemon process in this embodiment handles the operation on the process's behalf. Under this design concept, the preset daemon process calculates, based on the obtained state description data, the device state to be returned to the target process, and then returns the calculated device state to the target process as the response to the state access operation. The calculation logic for the device state can be predefined in the daemon process; in practice, it can be defined with reference to the calculation logic used in the traditional device state response manner. In addition, modern operating systems provide facilities for tracing kernel-space activity from user space. For example, on the Linux operating system, the PRELOAD (that is, LD_PRELOAD) and ptrace mechanisms can be used to intercept device accesses made by user processes along various paths. Therefore, various feasible schemes can be used here to intercept state access operations.
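The proxying behaviour of the second design concept can be sketched in a language-neutral way as below. The sketch does not use LD_PRELOAD or ptrace; it only models the control flow once an access has been intercepted. The `acquire` callable stands in for the data acquisition interface, and the mask-based `compute` function is an invented placeholder for the predefined calculation logic.

```python
from typing import Callable, Dict


class StateAccessDaemon:
    """Answers intercepted state access operations on behalf of the device."""

    def __init__(self, acquire: Callable[[int], dict], compute: Callable[[dict], int]):
        self._acquire = acquire  # stand-in for the data acquisition interface
        self._compute = compute  # predefined device-state calculation logic
        self.latest: Dict[int, dict] = {}  # newest description data per process

    def handle_state_access(self, pid: int) -> int:
        description = self._acquire(pid)   # obtain state description data
        self.latest[pid] = description     # maintain it for a later checkpoint
        return self._compute(description)  # device state returned to the process


# Hypothetical acquisition and calculation logic, for illustration only.
def acquire(pid: int) -> dict:
    return {"reg_status": 0b1010, "mask": 0b0110}


def compute(description: dict) -> int:
    return description["reg_status"] & description["mask"]


daemon = StateAccessDaemon(acquire, compute)
state_for_process = daemon.handle_state_access(7)
```

Since every access passes through `handle_state_access`, the daemon sees each state access operation and records its description data in the same place where the response is computed.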

对于上述两种设计构思进行如下说明：在后一种设计构思中，设备状态的计算工作实质已经脱离了计算节点的操作系统环境等相关依赖环境，而是由守护进程基于获取到的状态描述信息来独立完成设备状态的计算。这种脱离依赖环境的设备状态计算方式，可在应用恢复过程中发挥更明显的优势，这是由于，在应用恢复过程中，可能已经更换了计算节点，原本的依赖环境可能并无法完全恢复，而守护进程则可脱离相关的依赖环境，而基于检查点文件中携带的状态描述信息正确计算出相关的设备状态，从而实现设备状态的正确恢复。The above two design concepts are explained as follows: In the latter design concept, the calculation of the device status has actually been separated from the operating system environment of the computing node and other related dependent environments, and the daemon process independently completes the calculation of the device status based on the acquired status description information. This method of calculating the device status without relying on the dependent environment can play a more obvious advantage in the application recovery process. This is because, during the application recovery process, the computing node may have been replaced, and the original dependent environment may not be fully restored. The daemon process can be separated from the related dependent environment and correctly calculate the relevant device status based on the status description information carried in the checkpoint file, thereby realizing the correct recovery of the device status.

As the above description shows, either of the two design concepts can be adopted in the computing node before the application dump to achieve the dump behavior required in this embodiment. The latter design concept has two advantages. First, it better avoids missing state access operations before and during the application dump, and so better guarantees the completeness of the state description information maintained for the target application. Second, the processing logic defined inside the daemon process can be reused directly on the computing node where the application is recovered, so that the dump process and the recovery process mirror each other.

综上，本实施例中，可通过在计算节点上启动预置的守护进程，而实现本实施例中的应用转储逻辑，而且，该守护进程可使得设备状态的计算工作脱离计算节点中的依赖环境，从而避免因依赖环境无法完全恢复而导致的应用恢复错误问题。In summary, in this embodiment, the application dump logic in this embodiment can be implemented by starting a preset daemon process on the computing node. Moreover, the daemon process can separate the calculation work of the device status from the dependent environment in the computing node, thereby avoiding application recovery errors caused by the inability to fully restore the dependent environment.

图3为本申请一示例性实施例提供的一种基于检查点的应用恢复方法的流程示意图。图4为本申请一示例性实施例提供的一种基于检查点的应用恢复方法的逻辑示意图。该方法可由数据处理装置执行，该数据处理装置可实现为软件、硬件或软件与硬件的结合，该数据处理装置可集成在计算节点中。参考图3，该方法可包括：FIG3 is a flow chart of a checkpoint-based application recovery method provided by an exemplary embodiment of the present application. FIG4 is a logic chart of a checkpoint-based application recovery method provided by an exemplary embodiment of the present application. The method may be executed by a data processing device, which may be implemented as software, hardware, or a combination of software and hardware, and the data processing device may be integrated in a computing node. Referring to FIG3, the method may include:

步骤300、响应于将目标应用恢复至指定检查点的恢复指令，获取目标应用在指定检查点对应的检查点文件；Step 300: In response to a restore instruction to restore a target application to a specified checkpoint, obtain a checkpoint file corresponding to the target application at the specified checkpoint;

步骤301、从检查点文件中读取目标特定设备在目标应用下的状态描述信息；Step 301, read the state description information of the target specific device under the target application from the checkpoint file;

步骤302、根据状态描述信息，将目标特定设备在目标应用下的设备状态恢复至指定检查点，以将目标应用恢复至指定检查点。Step 302: According to the state description information, the device state of the target specific device under the target application is restored to the specified checkpoint, so as to restore the target application to the specified checkpoint.
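Steps 300 to 302 can be sketched as a single function. The in-memory `checkpoints` mapping stands in for the persistent storage resource, and the field names `app_memory` and `device_state` as well as the `apply_state` callback are assumptions for this example only.

```python
from typing import Callable, Dict


def restore_application(
    checkpoints: Dict[str, dict],
    checkpoint_id: str,
    device_id: str,
    apply_state: Callable[[str, dict], None],
) -> str:
    """Restore the device state for one device, then hand back the application memory."""
    checkpoint = checkpoints[checkpoint_id]               # step 300: obtain checkpoint file
    description = checkpoint["device_state"][device_id]   # step 301: read state description
    for pid, data in description.items():                 # step 302: restore per process
        apply_state(pid, data)
    return checkpoint["app_memory"]


# Hypothetical checkpoint contents.
checkpoints = {
    "ckpt-1": {
        "app_memory": "0001",
        "device_state": {"dev0": {"101": {"reg_status": 3}}},
    }
}
applied: Dict[str, dict] = {}
memory = restore_application(checkpoints, "ckpt-1", "dev0", applied.__setitem__)
```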

参考图2，本实施例中的计算节点上可装配有多个特定设备，计算节点上的特定设备的数量可以是一个或多个，特定设备的类型可以是一种或多种，本实施例对此不做限定。计算节点上承载的应用可使用这些特定设备。为便于描述，本实施例中，将以计算节点上装配的目标特定设备作为示例进行技术方案的说明，应当理解的是，目标特定设备可以是计算节点上装配的任意一个被目标应用所使用到的特定设备。而从目标应用的视角来说，其在计算节点上所使用到的特定设备则可以是多个，对于目标应用在计算节点上所使用到的除目标特定设备之外的其它特定设备，也可采用本实施例提供的技术方案进行设备状态恢复。Referring to Figure 2, the computing node in this embodiment may be equipped with multiple specific devices, the number of specific devices on the computing node may be one or more, and the type of specific devices may be one or more, which is not limited in this embodiment. The applications carried on the computing node may use these specific devices. For ease of description, in this embodiment, the target specific device installed on the computing node will be used as an example to illustrate the technical solution. It should be understood that the target specific device may be any specific device installed on the computing node and used by the target application. From the perspective of the target application, the specific devices used on the computing node may be multiple. For other specific devices used by the target application on the computing node except the target specific device, the technical solution provided in this embodiment may also be used to restore the device status.

本实施例中，特定设备通常为有状态设备，可包括但不限于有状态的IO设备或有状态的异构加速设备等。关于有状态设备的定义在此不再重复赘述。In this embodiment, the specific device is generally a stateful device, which may include but is not limited to a stateful IO device or a stateful heterogeneous acceleration device, etc. The definition of a stateful device will not be repeated here.

In addition to the content required for application recovery in traditional checkpoint and recovery schemes, the checkpoint file in this embodiment also contains state description information, under the target application, of at least one specific device on the computing node. The state description information of the target specific device under the target application supports restoring the device state of the target specific device under the target application to the specified checkpoint.

本实施例中并不限定检查点文件的生成方案，本实施例中可采用图1所示的基于检查点的应用转储方法来生成本实施例中所需的检查点文件，相关的技术细节可参考前文实施例，在此不做重复说明。但本实施例中生成检查点文件的方案并不限于此。The generation scheme of the checkpoint file is not limited in this embodiment. In this embodiment, the checkpoint-based application dump method shown in FIG. 1 can be used to generate the checkpoint file required in this embodiment. For relevant technical details, please refer to the previous embodiment, and no repeated description is given here. However, the scheme for generating the checkpoint file in this embodiment is not limited to this.

另外，本实施例中的检查点文件可存储在持久化存储资源中，在步骤300中，可从持久化存储资源中读取到目标应用在指定检查点对应的检查点文件。本实施例中，目标应用可具有多个检查点，在步骤300中的恢复指令中可指明所需恢复至的指定检查点。In addition, the checkpoint file in this embodiment can be stored in a persistent storage resource, and the checkpoint file corresponding to the target application at the specified checkpoint can be read from the persistent storage resource in step 300. In this embodiment, the target application can have multiple checkpoints, and the recovery instruction in step 300 can indicate the specified checkpoint to be recovered.

参考图3，在步骤301中，可从检查点文件中读取目标特定设备在目标应用下的状态描述信息。其中，在一实施方式中，状态描述信息中包含目标特定设备在目标应用中至少一个进程下的状态描述数据。与状态描述信息的定义相适配地，状态描述数据则是用于支持将目标特定设备在目标进程下的设备状态恢复至指定检查点。Referring to FIG. 3 , in step 301, the state description information of the target specific device under the target application can be read from the checkpoint file. In one embodiment, the state description information includes state description data of the target specific device under at least one process in the target application. In accordance with the definition of the state description information, the state description data is used to support restoring the device state of the target specific device under the target process to a specified checkpoint.

基于此，在步骤302中，可根据从检查点文件中读取出的状态描述信息，将目标特定设备在目标应用下的设备状态恢复至指定检查点，以将目标应用恢复至指定检查点。Based on this, in step 302, the device state of the target specific device under the target application can be restored to the specified checkpoint according to the state description information read from the checkpoint file, so as to restore the target application to the specified checkpoint.

The state description information in this embodiment may contain every item of information needed to calculate the device state of a specific device. It should be understood that, for different types of specific devices, the information items contained in the state description information may not be exactly the same; this depends on factors such as the internal working principle of the specific device. This embodiment does not limit the information items contained in the state description information. Several exemplary information items are: the identifier of a device register in the specific device, the attribute of the device register, the state value of the device register, the state value of the device memory, the state value of the device driver software, context information used to support state transitions, and/or a mapping relationship used to support state mapping. The attribute of a device register may include, but is not limited to, read-only, write-only, needs emulation, or needs mapping. The state value of a device register is the state data recorded in the device register, which can serve as the basis for calculating the device state of the specific device. It is worth noting that, in this embodiment, the information items contained in the state description information can be set as needed and are not limited to those listed here.
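One possible in-memory shape for these information items is sketched below. All type and field names (`RegisterAttr`, `RegisterDescription`, `StateDescriptionData`, and so on) are illustrative assumptions; the application lists the kinds of items but does not prescribe a layout.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class RegisterAttr(Enum):
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    NEEDS_EMULATION = "needs_emulation"
    NEEDS_MAPPING = "needs_mapping"


@dataclass
class RegisterDescription:
    register_id: str                        # identifier of the device register
    attr: RegisterAttr                      # attribute of the device register
    value: int                              # state value recorded in the register
    context: Optional[dict] = None          # context supporting state transitions
    mapping: Optional[Dict[int, int]] = None  # mapping supporting state mapping


@dataclass
class StateDescriptionData:
    pid: int                                # process this description belongs to
    registers: Dict[str, RegisterDescription] = field(default_factory=dict)
    device_memory: bytes = b""              # state value of the device memory
    driver_state: dict = field(default_factory=dict)  # state of the driver software


dma_reg = RegisterDescription(
    register_id="dma_addr",
    attr=RegisterAttr.NEEDS_MAPPING,
    value=0x1000,
    mapping={0x1000: 0x8000},
)
data = StateDescriptionData(pid=101, registers={dma_reg.register_id: dma_reg})
```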

因此，在步骤302的设备状态恢复过程中，可能存在多种情况，本实施例针对几种示例性的情况分别提供了如下的一种恢复方案：Therefore, in the device status recovery process of step 302, there may be multiple situations. This embodiment provides the following recovery solutions for several exemplary situations:

In one exemplary case: if the state description data of the target specific device under a target process in the target application indicates that the attribute of a first device register is "needs emulation", the state transition process of the first device register is emulated based on the context information associated with the first device register in the state description data, so as to restore the device state of the first device register under the target process to the specified checkpoint. Here, the first device register is any device register contained in the target specific device, and the target process is any process in the target application running on the computing node.

另一种示例性的情况下：若状态描述数据中指示第一设备寄存器的属性为需映射，则基于状态描述数据中与第一设备寄存器关联的映射关系，对第一设备寄存器提供的状态值进行映射，以将第一设备寄存器在目标进程下的设备状态恢复至指定检查点。In another exemplary case: if the attribute of the first device register indicated in the state description data is to be mapped, the state value provided by the first device register is mapped based on the mapping relationship associated with the first device register in the state description data to restore the device state of the first device register under the target process to the specified checkpoint.

For example, suppose one of the state values recorded in the first device register is a DMA access address. The DMA access address recorded in the first device register is usually one assigned by the operating system, and after the computing node is changed, the same DMA access address may correspond to a different memory address in the hardware environment of the new computing node. In this case, the DMA access address recorded in the first device register can be mapped to the correct address based on the address mapping relationship carried in the state description data, and the correct address is returned to the target process. This ensures that the target process subsequently issues DMA operation instructions to the target specific device using the correct address.

That is, depending on the attributes of the different device registers in the target specific device, suitable recovery logic can be applied to each device register to restore its device state, ensuring that the device state of every device register is restored correctly.
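The attribute-based choice of recovery logic can be sketched as a small dispatch function. The dictionary layout, the attribute strings, and especially the replayable transition list (`set`, `or`, `and`) are assumptions made for this example; the application only states that context information is used to emulate state transitions and that a mapping relationship is used to map state values.

```python
def restore_register(reg: dict) -> int:
    """Return the restored state value of one device register."""
    attr = reg["attr"]
    if attr == "needs_emulation":
        # Replay the recorded state transitions from the context information.
        value = reg["context"]["initial"]
        for op, operand in reg["context"]["transitions"]:
            if op == "set":
                value = operand
            elif op == "or":
                value |= operand
            elif op == "and":
                value &= operand
            else:
                raise ValueError(f"unknown transition: {op}")
        return value
    if attr == "needs_mapping":
        # Translate the recorded value (for example a DMA access address)
        # through the mapping relationship carried in the description data.
        return reg["mapping"][reg["value"]]
    # Other attributes: use the recorded state value as is.
    return reg["value"]


emulated = restore_register({
    "attr": "needs_emulation",
    "value": 0,
    "context": {"initial": 0x00, "transitions": [("set", 0x10), ("or", 0x01)]},
})
mapped = restore_register({
    "attr": "needs_mapping",
    "value": 0x1000,
    "mapping": {0x1000: 0x8000},
})
plain = restore_register({"attr": "read_only", "value": 0x2A})
```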

本实施例中，可在计算节点上运行预置的守护进程，并利用守护进程实施上述的应用恢复方案。In this embodiment, a preset daemon process may be run on the computing node, and the daemon process may be used to implement the above application recovery solution.

基于此，在步骤302中，可在守护进程中预先定义好各种属性下的恢复逻辑，这样，守护进程可按照合适的恢复逻辑，从检查点文件携带的状态描述信息读取到计算所需的信息项，从而对各个设备寄存器的设备状态进行正确恢复。Based on this, in step 302, the recovery logic under various attributes can be pre-defined in the daemon process, so that the daemon process can read the information items required for calculation from the state description information carried by the checkpoint file according to the appropriate recovery logic, thereby correctly restoring the device state of each device register.

本实施例中的计算节点可以是目标应用转储前所处的原计算节点，当然，也可以是新的计算节点。本实施例中，只需保证计算节点上装配有目标应用所需使用的特定设备即可。The computing node in this embodiment can be the original computing node where the target application is located before dumping, or can be a new computing node. In this embodiment, it is only necessary to ensure that the computing node is equipped with the specific device required by the target application.

在一种实现方案中：可在目标应用的恢复过程中，监听目标应用中各个进程对目标特定设备发起的状态访问操作，并在监听到状态访问操作之后，在执行步骤301和步骤302。In one implementation scheme: during the recovery process of the target application, the state access operations initiated by each process in the target application to the target specific device can be monitored, and after the state access operations are monitored, steps 301 and 302 are executed.

在该实现方案中，可拦截监听到的状态访问操作；In this implementation, the monitored state access operations can be intercepted;

利用预置的守护进程从检查点文件中读取拦截的状态访问操作对应的状态描述数据，计算需返回目标进程的设备状态，目标进程为发起该状态访问操作的进程；Using the preset daemon process to read the state description data corresponding to the intercepted state access operation from the checkpoint file, calculate the device state to be returned to the target process, where the target process is the process that initiated the state access operation;

利用守护进程将计算出的设备状态返回至目标进程，作为对状态访问操作的响应。The daemon process is used to return the calculated device state to the target process in response to the state access operation.

这样，在该实现方案中，在发生状态访问操作时，不再执行传统的设备状态响应逻辑，而是由守护进程代为进行设备状态计算。而守护进程可脱离原计算节点上的依赖环境完成设备状态计算，只需从检查点文件中读取正确地的状态描述数据，并运行内部预先定义好的合适的恢复逻辑，即可计算出正确的设备状态。这样，即使应用恢复时所处的计算节点不再具备原计算节点的完整依赖环境，也可基于守护进程而在应用恢复时所处的计算节点上正确恢复目标特定设备在目标应用下的设备状态，从而保证目标应用的正常恢复。Thus, in this implementation, when a state access operation occurs, the traditional device state response logic is no longer executed, but the daemon process performs the device state calculation instead. The daemon process can complete the device state calculation without relying on the dependency environment on the original computing node. It only needs to read the correct state description data from the checkpoint file and run the appropriate recovery logic predefined internally to calculate the correct device state. In this way, even if the computing node where the application is restored no longer has the complete dependency environment of the original computing node, the device state of the target specific device under the target application can be correctly restored on the computing node where the application is restored based on the daemon process, thereby ensuring the normal recovery of the target application.

In this implementation, the daemon process may return the device state to the target process using the proxy logic described above only when it detects the first, or the earliest N, state access operations initiated by the target process on the target specific device. Subsequent state access operations initiated by the target process on the target specific device can then be handled in the traditional device state response manner. Alternatively, the daemon process may continue to return the device state for every state access operation initiated by the target process using the proxy logic, with the traditional device state response scheme disabled entirely. This is not limited here.
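The two policies above, proxying only the earliest N accesses or proxying every access, can be sketched with one counter per process. The class name `RestoreProxy`, the `native_read` callback standing in for the traditional response path, and the use of `None` to mean "always proxy" are assumptions for illustration.

```python
from typing import Callable, Dict, Optional


class RestoreProxy:
    """Serves restored device state for the earliest N accesses of each process."""

    def __init__(
        self,
        restored_state: Dict[int, int],
        native_read: Callable[[int], int],
        proxied_accesses: Optional[int] = 1,
    ):
        self._restored = restored_state  # computed from the checkpoint file
        self._native_read = native_read  # traditional device state response
        self._limit = proxied_accesses   # None means: always proxy
        self._counts: Dict[int, int] = {}

    def read_state(self, pid: int) -> int:
        count = self._counts.get(pid, 0)
        self._counts[pid] = count + 1
        if self._limit is None or count < self._limit:
            return self._restored[pid]
        return self._native_read(pid)


# Process 101: restored state is 3; the traditional path would report 9.
first_n = RestoreProxy({101: 3}, native_read=lambda pid: 9, proxied_accesses=2)
first_n_results = [first_n.read_state(101) for _ in range(3)]

always = RestoreProxy({101: 3}, native_read=lambda pid: 9, proxied_accesses=None)
always_results = [always.read_state(101) for _ in range(3)]
```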

综上，本实施例中，开拓性地提出了将目标特定设备在目标应用下的状态描述信息携带在为目标应用构建的检查点文件中，这样，检查点文件中除了包含传统的应用恢复所需内容外，还增加了用于支持将目标特定设备在目标应用下的设备状态恢复至指定检查点的状态描述信息。在此基础上，在基于检查点对目标应用进行恢复的过程中，可从检查点文件中读取到该状态描述信息，并基于该状态描述信息将目标特定设备在目标应用下的设备状态恢复至指定检查点，这可为目标应用提供正确的设备状态，从而保证目标应用的正常恢复。因此，本申请实施例中，通过对转储过程和恢复过程的改造，可保证目标应用在使用到有状态的特定设备的情况下，依然可以正常恢复。In summary, in this embodiment, it is pioneered to carry the state description information of the target specific device under the target application in the checkpoint file constructed for the target application. In this way, in addition to the content required for traditional application recovery, the checkpoint file also adds state description information for supporting the recovery of the device state of the target specific device under the target application to the specified checkpoint. On this basis, in the process of recovering the target application based on the checkpoint, the state description information can be read from the checkpoint file, and the device state of the target specific device under the target application can be restored to the specified checkpoint based on the state description information, which can provide the target application with the correct device state, thereby ensuring the normal recovery of the target application. Therefore, in the embodiment of the present application, by modifying the dump process and the recovery process, it can be ensured that the target application can still be recovered normally when using a specific device with a state.

In addition, the application dump and recovery method provided in this embodiment can be applied to an HPC cluster system to dump and restore stateful multi-process parallel jobs. This can improve the utilization of the HPC cluster system, increase cluster throughput, and save costs. By dumping HPC jobs and restoring them after rescheduling, jobs can be migrated and consolidated, which frees more available nodes for later jobs or allows idle nodes to be released. It also allows higher-priority cluster jobs that are submitted later to preempt running jobs transparently. Previously, preempting a parallel job usually meant simply killing the running job in the job system so that the high-priority job could obtain resources and be scheduled; with the checkpointing and recovery of parallel jobs in this solution, preemption of multi-machine jobs can be achieved. The method also makes cluster resource maintenance more convenient. When maintenance must be carried out on cluster nodes, the system administrator can temporarily dump the jobs on the affected nodes, perform the maintenance, and restore the jobs once the node maintenance is complete. On this basis, and combined with the scheduling algorithms and policies of the HPC cluster system, the HPC cluster system can benefit in operation and maintenance, fault handling, and other respects.

In the embodiments of the present application, during the checkpoint-based dump of the target application, it is proposed to consolidate state description information under the target application for the target specific device installed on the computing node, and to add that state description information to the checkpoint file constructed for the target application. In addition to the content conventionally required for application recovery, the checkpoint file therefore also contains state description information that supports restoring the device state of the target specific device under the target application to the specified checkpoint. On this basis, when the target application is restored from the checkpoint, the state description information can be read from the checkpoint file and used to restore the device state of the target specific device under the target application to the specified checkpoint. This provides the target application with the correct device state and thereby ensures that the target application is restored normally. Thus, by modifying the dump process and the recovery process, the embodiments of the present application ensure that the target application can still be restored normally when it uses a stateful specific device.

It should be noted that some of the processes described in the above embodiments and the accompanying drawings contain multiple operations that appear in a specific order. It should be clearly understood, however, that these operations may be executed out of the order in which they appear herein, or in parallel. The sequence numbers of the operations, such as 801 and 802, serve only to distinguish the operations from one another and do not in themselves imply any execution order. In addition, these processes may include more or fewer operations, and the operations may be executed sequentially or in parallel.

FIG. 5 is a schematic structural diagram of a computing node provided by another exemplary embodiment of the present application. As shown in FIG. 5, the computing node includes a memory 50, a processor 51, and a target specific device 52.

The processor 51 is coupled to the memory 50 and is configured to execute the computer program in the memory 50 so as to: in response to a checkpoint creation instruction for the target application, obtain state description information of the target specific device 52 under the target application, the state description information being used to support restoring the device state of the target specific device 52 under the target application to the current checkpoint; add the state description information to the checkpoint file constructed for the target application; and dump the checkpoint file, so that the device state of the target specific device 52 can be restored based on the state description information when the target application is restored to the current checkpoint.
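The dump flow just described (obtain the state description information, add it to the checkpoint file, dump the file) might be sketched as below. The JSON layout, the key names `application` and `device_state_description`, and the function name `dump_checkpoint` are assumptions made purely for illustration; the application does not specify a file format.

```python
import json
from pathlib import Path


def dump_checkpoint(app_state: dict, device_state_description: dict, path: Path) -> None:
    """Write a checkpoint file that carries device state next to application state."""
    checkpoint = {
        # Content conventionally needed to restore the application (assumed shape).
        "application": app_state,
        # Additional content: state description information for the target specific device.
        "device_state_description": device_state_description,
    }
    # Write to a temporary file first, then rename, so a partially written
    # checkpoint never replaces a complete one.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(checkpoint))
    tmp.replace(path)
```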

In one embodiment, the state description information includes state description data of the target specific device 52 under at least one process of the target application. Before responding to the checkpoint creation instruction for the target application, the processor 51 may further be configured to:

before the moment corresponding to the current checkpoint, if a target process of the target application is detected initiating a state access operation against the target specific device 52, obtain the state description data corresponding to the state access operation, this state description data being used to compute the device state to be returned to the target process; and update the state description data of the target specific device 52 under the target process to the state description data corresponding to the state access operation; wherein the target process is any process of the target application that runs on the computing node.
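One way to picture this bookkeeping is a per-process table that is overwritten each time a state access operation is observed, so that the table always holds the most recent state description data when a checkpoint is requested. The class and method names below are hypothetical and serve only as a sketch.

```python
from typing import Dict


class StateDescriptionStore:
    """Keeps the latest state description data per target process (hypothetical helper)."""

    def __init__(self) -> None:
        self._by_pid: Dict[int, dict] = {}

    def on_state_access(self, pid: int, description_data: dict) -> None:
        # Replace the stored data for this process with the data obtained
        # for the state access operation that was just observed.
        self._by_pid[pid] = dict(description_data)

    def snapshot(self) -> Dict[int, dict]:
        # Copy taken at checkpoint time; later updates do not affect it.
        return {pid: dict(data) for pid, data in self._by_pid.items()}
```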

In one embodiment, when obtaining the state description data corresponding to the state access operation, the processor 51 may specifically be configured to: determine the target device type of the target specific device 52; and call the data collection interface adapted to the target device type to collect the state description data corresponding to the state access operation; wherein different device types are adapted to different data collection interfaces, and each data collection interface defines the information items to be collected, and the collection logic, for the device types it supports.
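The per-device-type dispatch can be illustrated with a small registry. The device type names (`"network"`, `"accelerator"`), the collected fields, and all function names are assumptions chosen for the sketch, not interfaces defined by the application.

```python
from typing import Callable, Dict

Collector = Callable[[dict], dict]
_COLLECTORS: Dict[str, Collector] = {}


def register_collector(device_type: str) -> Callable[[Collector], Collector]:
    """Associate a data collection interface with one device type."""
    def decorator(fn: Collector) -> Collector:
        _COLLECTORS[device_type] = fn
        return fn
    return decorator


@register_collector("network")
def collect_network(op: dict) -> dict:
    # Information items assumed relevant for a stateful network device.
    return {"register_id": op["register_id"], "value": op["value"],
            "queue_id": op.get("queue_id")}


@register_collector("accelerator")
def collect_accelerator(op: dict) -> dict:
    # Information items assumed relevant for a stateful accelerator.
    return {"register_id": op["register_id"], "value": op["value"],
            "stream_id": op.get("stream_id")}


def collect_state_description(device_type: str, op: dict) -> dict:
    """Call the collection interface adapted to the given device type."""
    if device_type not in _COLLECTORS:
        raise ValueError(f"no collection interface for device type {device_type!r}")
    return _COLLECTORS[device_type](op)
```

Supporting a further device type in this sketch only requires registering another collector function.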

In one embodiment, the processor 51 may further be configured to: intercept the state access operation; use a preset daemon process to compute, based on the obtained state description data, the device state to be returned to the target process; and use the daemon process to return the computed device state to the target process as the response to the state access operation.

In one embodiment, the state description information includes one or more of the following: an identifier of a device register in the target specific device 52, an attribute of the device register, a state value of the device register, a state value of the device memory, a state value of the device driver software, context information used to support state transitions, and a mapping relationship used to support state mapping.
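As a rough picture of how these items might be grouped, the following sketch uses Python dataclasses. The field names, the types, and the attribute values `"direct"`, `"simulate"` and `"map"` are assumptions; the application lists the kinds of information but does not prescribe a layout.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class RegisterState:
    register_id: str                 # identifier of the device register
    attribute: str                   # assumed values: "direct", "simulate", "map"
    value: int                       # state value of the device register
    context: dict = field(default_factory=dict)            # supports state transitions
    mapping: Dict[int, int] = field(default_factory=dict)  # supports state mapping


@dataclass
class StateDescription:
    registers: List[RegisterState] = field(default_factory=list)
    device_memory: Dict[str, int] = field(default_factory=dict)  # device memory state values
    driver_state: Dict[str, int] = field(default_factory=dict)   # driver software state values


# Example instance, converted to plain dictionaries for serialization.
desc = StateDescription(
    registers=[RegisterState("r0", "map", 3, mapping={7: 3})],
    driver_state={"open_handles": 1},
)
as_dict = asdict(desc)
```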

In one embodiment, the state description information is stored in a designated memory area allocated for the target application. When updating the state description data of the target specific device 52 under the target process to the state description data corresponding to the state access operation, the processor 51 may specifically be configured to: update, in the designated memory area, the state description data of the target specific device 52 under the target process to the state description data corresponding to the state access operation.

In one embodiment, the processor 51 may further be configured to: in response to a recovery instruction to restore the target application to the current checkpoint, obtain the checkpoint file of the target application corresponding to the current checkpoint; read the state description information from the checkpoint file; and, based on the state description information, restore the device state of the target specific device 52 under the target application to the current checkpoint, so as to restore the target application to the current checkpoint.

In one embodiment, the target specific device 52 is a stateful device, such as a stateful network device or a stateful heterogeneous acceleration device.

In some other possible designs, a checkpoint-based application recovery solution may also be provided on the basis of the computing node shown in FIG. 5. In these designs, the processor 51 may be configured to execute the computer program in the memory 50 so as to: in response to a recovery instruction to restore the target application to a specified checkpoint, obtain the checkpoint file of the target application corresponding to the specified checkpoint; read, from the checkpoint file, the state description information of the target specific device 52 under the target application; and, according to the state description information, restore the device state of the target specific device 52 under the target application to the specified checkpoint, so as to restore the target application to the specified checkpoint.
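The recovery flow can be sketched as the counterpart of a dump. The JSON layout and key names are assumptions, as are the two callbacks; restoring the device before the application is likewise an assumption of this sketch, chosen so that the application sees a correct device state as soon as it resumes.

```python
import json
from pathlib import Path
from typing import Callable


def restore_from_checkpoint(
    path: Path,
    restore_device: Callable[[dict], None],
    restore_application: Callable[[dict], None],
) -> None:
    """Read a checkpoint file and restore device state, then application state."""
    checkpoint = json.loads(path.read_text())
    # Read the state description information carried in the checkpoint file.
    description = checkpoint["device_state_description"]
    # Bring the target specific device back to the specified checkpoint first.
    restore_device(description)
    # Then restore the conventional application content.
    restore_application(checkpoint["application"])
```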

In one embodiment, the state description information includes state description data of the target specific device 52 under at least one process of the target application. When restoring the device state of the target specific device 52 under the target application to the specified checkpoint according to the state description information, the processor 51 may specifically be configured to: if the state description data of the target specific device 52 under a target process of the target application indicates that the attribute of a first device register is "simulation required", simulate the state transition process of the first device register based on the context information associated with the first device register in the state description data, so as to restore the device state of the first device register under the target process to the specified checkpoint; wherein the first device register is any device register included in the target specific device 52, and the target process is any process of the target application that runs on the computing node.

In one embodiment, the processor 51 may further be configured to: if the state description data indicates that the attribute of the first device register is "mapping required", map the state value provided by the first device register based on the mapping relationship associated with the first device register in the state description data, so as to restore the device state of the first device register under the target process to the specified checkpoint.
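The two register attributes described in this and the preceding embodiment could be handled along the following lines. Everything specific here is an assumption made for illustration: the attribute strings, the shape of the context information (an initial value plus a list of recorded events), the default transition function, and the direction of the mapping (from the value the device provides after recovery to the value the process saw at the checkpoint).

```python
from typing import Callable


def restore_register(
    entry: dict,
    live_value: int,
    step: Callable[[int, int], int] = lambda state, event: state + event,
) -> int:
    """Return the value the target process should observe for one register."""
    attribute = entry["attribute"]
    if attribute == "simulate":
        # Replay the recorded events through the transition function to
        # reproduce the register state at the checkpoint.
        state = entry["context"]["initial"]
        for event in entry["context"]["events"]:
            state = step(state, event)
        return state
    if attribute == "map":
        # Translate the value provided by the device after recovery into
        # the value recorded at the checkpoint.
        return entry["mapping"][live_value]
    # Any other attribute: use the recorded value directly (assumed default).
    return entry["value"]
```

For instance, an entry with attribute `"map"` and mapping `{7: 3}` turns a live value of 7 into 3, while an entry with attribute `"simulate"`, initial value 0 and events `[1, 2, 3]` yields 6 under the default additive transition function.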

Further, as shown in FIG. 5, the computing node also includes other components such as a communication component 53 and a power supply component 54. FIG. 5 shows only some of the components schematically, which does not mean that the computing node includes only the components shown in FIG. 5. In addition, the computing node shown in FIG. 5 may be a cloud server or a logical node composed of multiple cloud servers, which is not limited here.

It is worth noting that, for the technical details of the above computing node embodiments, reference may be made to the relevant descriptions in the foregoing method embodiments. To save space, they are not repeated here, but this should not result in any loss of the scope of protection of the present application.

Accordingly, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed, can implement the steps of the above method embodiments that are executable by the computing node.

The memory in FIG. 5 is used to store the computer program and may be configured to store various other data to support operations on the computing platform. Examples of such data include instructions for any application or method operating on the computing platform, contact data, phone book data, messages, pictures, videos, and the like. The memory may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.

The communication component in FIG. 5 is configured to facilitate wired or wireless communication between the device in which it is located and other devices. The device in which the communication component is located can access a wireless network based on a communication standard, such as WiFi, a 2G, 3G, 4G/LTE or 5G mobile communication network, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

The power supply component in FIG. 5 provides power to the various components of the device in which it is located. The power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power supply component is located.

Those skilled in the art will appreciate that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) that contain computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, such that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "includes a ..." does not exclude the presence of additional identical elements in the process, method, commodity, or device that includes the element.

It should be noted that the user information (including but not limited to user device information and user personal information) and data (including but not limited to data used for analysis, stored data, and displayed data) involved in the present application are all information and data authorized by the user or fully authorized by all parties. The collection, use, and processing of the relevant data must comply with the applicable laws, regulations, and standards of the relevant countries and regions, and a corresponding operation entry is provided for users to grant or refuse authorization.

The above are merely embodiments of the present application and are not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall fall within the scope of protection of the present application.