CN112035410B

Info

Publication number: CN112035410B
Application number: CN202010833472.0A
Authority: CN
Inventors: 毛东方; 李海翔; 王建民; 黄向东; 潘安群
Original assignee: Tsinghua University; Tencent Technology Shenzhen Co Ltd
Current assignee: Tsinghua University; Tencent Technology Shenzhen Co Ltd
Priority date: 2020-08-18
Filing date: 2020-08-18
Publication date: 2023-08-18
Anticipated expiration: 2040-08-18
Also published as: CN112035410A

Abstract

The application discloses a log storage method, a log storage device, a node device, and a storage medium, and belongs to the technical field of databases. The method comprises: determining, in response to a commit event of a target transaction, the remaining capacity of a first storage medium, the first storage medium being a non-volatile storage medium for storing logs; in response to the remaining capacity being less than the data amount of the uncached log of the target transaction, creating a log checkpoint and storing the business data generated by modification operations in a second storage medium to a third storage medium; and writing the uncached log of the target transaction to the first storage medium. Because the log is persisted directly in the first storage medium, no cumbersome two-layer log caching process needs to be executed. This greatly reduces the space occupied by log storage, improves the system performance of the database, avoids limiting the throughput upper limit of the database system, and facilitates data expansion.

Description

Translated from Chinese

Log storage method, device, node device and storage medium

Technical Field

The present application relates to the field of database technology, and in particular to a log storage method, device, node device and storage medium.

Background Art

In mainstream database systems, a log module is usually used to optimize system performance. When data is written, it is first cached quickly in memory and then asynchronously persisted to disk. If the system crashes, the data in memory that has not yet been persisted is lost, and the system relies on the log module to recover that data and ensure data reliability. However, for the log module to be able to recover the in-memory data, the log corresponding to each write must itself be persisted to disk, which greatly limits the upper limit of the database system's throughput.

Summary of the Invention

The embodiments of the present application provide a log storage method, apparatus, node device and storage medium, which can avoid limiting the throughput upper limit of the database system and optimize the performance of the database system. The technical solution is as follows:

In one aspect, a log storage method is provided, the method comprising:

in response to a commit event of a target transaction, determining a remaining capacity of a first storage medium in a database system, wherein the first storage medium is a non-volatile storage medium for storing logs;

in response to the remaining capacity being less than the data volume of the uncached log of the target transaction, creating a log checkpoint and storing the business data generated by modification operations in a second storage medium to a third storage medium, wherein the second storage medium is a volatile storage medium and the third storage medium is a non-volatile storage medium;

writing the uncached log of the target transaction to the first storage medium.
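The three steps above can be sketched as a toy in-memory model. This is only an illustration under stated assumptions: the class names `LogStore` and `Node`, the dictionaries standing in for the second and third storage media, and the choice to discard all older log records once a checkpoint is created are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass, field


@dataclass
class LogStore:
    """Toy stand-in for the first storage medium (non-volatile log area)."""
    capacity: int                                 # total bytes available for logs
    used: int = 0                                 # bytes currently occupied by logs
    records: list = field(default_factory=list)   # persisted log records

    def remaining(self) -> int:
        return self.capacity - self.used


@dataclass
class Node:
    log_store: LogStore
    dirty_pages: dict = field(default_factory=dict)  # second medium (volatile)
    data_files: dict = field(default_factory=dict)   # third medium (non-volatile)
    checkpoints: int = 0

    def create_checkpoint(self) -> None:
        # Persist the modified business data. Assumption: after that, the
        # older log records are no longer needed and their space is reused.
        self.data_files.update(self.dirty_pages)
        self.dirty_pages.clear()
        self.log_store.records.clear()
        self.log_store.used = 0
        self.checkpoints += 1

    def commit(self, txn_log: bytes) -> None:
        # Step 1: determine the remaining capacity on commit.
        if self.log_store.remaining() < len(txn_log):
            # Step 2: not enough room, so create a log checkpoint first.
            self.create_checkpoint()
        if self.log_store.remaining() < len(txn_log):
            raise ValueError("log record is larger than the whole log area")
        # Step 3: write the uncached log to the first storage medium.
        self.log_store.records.append(txn_log)
        self.log_store.used += len(txn_log)
```

In this sketch a commit that fits is a single append, and the checkpoint path is taken only when the log area is too full to hold the new record.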

In one aspect, a log storage device is provided, the device comprising:

a determination module, configured to determine, in response to a commit event of a target transaction, a remaining capacity of a first storage medium in a database system, wherein the first storage medium is a non-volatile storage medium for storing logs;

a storage module, configured to create a log checkpoint in response to the remaining capacity being less than the data volume of the uncached log of the target transaction, and to store the business data generated by modification operations in a second storage medium to a third storage medium, wherein the second storage medium is a volatile storage medium and the third storage medium is a non-volatile storage medium;

a writing module, configured to write the uncached log of the target transaction to the first storage medium.

In one possible implementation, the writing module is executed in response to the remaining capacity being greater than or equal to the data volume of the uncached log of the target transaction.

In one possible implementation, the device further includes:

an acquisition module, configured to acquire the storage capacity of the first storage medium;

a configuration module, configured to set the log space capacity parameter of the database system to the storage capacity of the first storage medium.

In one possible implementation, the storage module is further configured to:

at intervals of a first target duration, create a log checkpoint in response to a target condition being met;

store the business data generated by modification operations in the second storage medium to the third storage medium;

copy the log in the last log block stored in the first storage medium to the first log block in the first storage medium;

move the start write position pointer of the first storage medium to the first log block.

In one possible implementation, the target condition includes at least one of the following:

the remaining capacity of the first storage medium is less than a capacity threshold;

the difference between the maximum log sequence number in the database system and the log sequence number corresponding to the business data with the minimum timestamp in the second storage medium is greater than a first target threshold;

the difference between the maximum log sequence number in the database system and the log sequence number of the last log checkpoint in the first storage medium is greater than a second target threshold.
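The periodic check described above can be sketched as follows. The names `LogState`, `meets_target_condition` and `recycle_log_blocks` are hypothetical, and treating the target condition as satisfied when any one of the three listed items holds is an assumption made here for illustration; the text only says the condition includes at least one of them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LogState:
    remaining_capacity: int    # free bytes in the first storage medium
    max_lsn: int               # largest log sequence number in the system
    oldest_dirty_lsn: int      # LSN of the oldest business data still in memory
    last_checkpoint_lsn: int   # LSN recorded by the previous log checkpoint


def meets_target_condition(s: LogState, capacity_threshold: int,
                           first_threshold: int, second_threshold: int) -> bool:
    """Return True if any of the three listed conditions holds (assumed any-of)."""
    return (
        s.remaining_capacity < capacity_threshold
        or s.max_lsn - s.oldest_dirty_lsn > first_threshold
        or s.max_lsn - s.last_checkpoint_lsn > second_threshold
    )


def recycle_log_blocks(blocks: List[bytes], last_used: int) -> int:
    """Copy the last stored log block into the first block and
    return the new start write position (the first block)."""
    blocks[0] = blocks[last_used]
    return 0
```

A caller would run `meets_target_condition` once per interval and, when it returns True, create the checkpoint, flush the business data, and then call `recycle_log_blocks`.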

a recovery module, configured to obtain, in response to the database system restarting after a crash, the log to be recovered from the first storage medium, and to perform data recovery based on the log to be recovered.

In one possible implementation, the recovery module includes:

a verification unit, configured to verify the log blocks starting from the first log block of the first storage medium, and to determine the logs stored in the log blocks that pass verification as the log to be recovered;

a redo unit, configured to store the log to be recovered in a hash table, traverse the hash table, and redo the log to be recovered stored in the hash table to obtain the recovered business data.
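The verification and redo units can be illustrated with a small sketch. Several details here are assumptions rather than statements from the application: the block layout (a JSON payload plus a CRC32 checksum), stopping the scan at the first block that fails verification, keying the hash table by a page identifier, and representing each log record as `[lsn, page_id, value]`.

```python
import json
import zlib
from collections import defaultdict


def make_block(records):
    """Build a log block: serialized records plus a CRC32 checksum (assumed layout)."""
    payload = json.dumps(records).encode()
    return {"payload": payload, "crc": zlib.crc32(payload)}


def collect_logs(blocks):
    """Verify blocks from the first one onward and gather the logs to recover."""
    valid = []
    for block in blocks:
        if zlib.crc32(block["payload"]) != block["crc"]:
            break  # assumption: the scan stops at the first failing block
        valid.extend(json.loads(block["payload"]))
    return valid


def redo(records):
    """Group records in a hash table, then replay each group in LSN order."""
    table = defaultdict(list)
    for lsn, page_id, value in records:
        table[page_id].append((lsn, value))
    pages = {}
    for page_id, entries in table.items():
        for _lsn, value in sorted(entries, key=lambda e: e[0]):
            pages[page_id] = value  # the latest record wins
    return pages
```

Grouping by page before replaying means each page is touched once during the traversal, which is one plausible reason for using a hash table at this step.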

In one aspect, a node device is provided. The node device includes one or more processors and one or more memories, the one or more memories store at least one piece of program code, and the at least one piece of program code is loaded and executed by the one or more processors to implement the log storage method of any of the possible implementations described above.

In one aspect, a storage medium is provided. The storage medium stores at least one piece of program code, and the at least one piece of program code is loaded and executed by a processor to implement the log storage method of any of the possible implementations described above.

In one aspect, a computer program product or computer program is provided. The computer program product or computer program includes one or more pieces of program code stored in a computer-readable storage medium. One or more processors of a node device can read the program code from the computer-readable storage medium and execute it, so that the node device can perform the log storage method of any of the possible implementations described above.

The beneficial effects of the technical solution provided by the embodiments of the present application include at least the following:

When the target transaction is committed, its uncached log is persisted in the NVM storage medium (that is, the first storage medium), so the log is durably maintained at the moment the transaction commits. Because the log no longer has to be stored twice, once in memory (that is, the second storage medium) and once on disk (that is, the third storage medium), the space occupied by log storage is greatly reduced. There is no need to build the two-layer log storage system of log buffer plus log file used by traditional InnoDB, and the traditional log file is eliminated. As a result, if the remaining capacity of the NVM storage medium is insufficient, free space can be reclaimed from the NVM storage medium simply by creating a log checkpoint, without the cumbersome, slow I/O caching process of flushing the log from memory to disk. This improves the system performance of the database and avoids limiting the throughput upper limit of the database system.

Brief Description of the Drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of an implementation environment of a log storage method provided in an embodiment of the present application;

FIG. 2 is a schematic diagram of the principle of an NVM-MySQL log architecture provided in an embodiment of the present application;

FIG. 3 is a flowchart of a log storage method provided in an embodiment of the present application;

FIG. 4 is a schematic diagram of the architecture of an NVM-MySQL log storage system provided in an embodiment of the present application;

FIG. 5 is a schematic diagram of the structure of a log block provided in an embodiment of the present application;

FIG. 6 is a schematic flowchart of the principle of a log storage method provided in an embodiment of the present application;

FIG. 7 is a flowchart of initializing an NVM medium provided in an embodiment of the present application;

FIG. 8 is a flowchart of periodically checking an NVM medium provided in an embodiment of the present application;

FIG. 9 is a schematic flowchart of the principle of periodically checking an NVM medium provided in an embodiment of the present application;

FIG. 10 is a flowchart of a disaster recovery method provided in an embodiment of the present application;

FIG. 11 is a schematic flowchart of the principle of a disaster recovery method provided in an embodiment of the present application;

FIG. 12 is a schematic diagram of the structure of a log storage device provided in an embodiment of the present application;

FIG. 13 is a schematic diagram of the structure of a node device provided in an embodiment of the present application.

Detailed Description

To make the objectives, technical solutions and advantages of the present application clearer, the implementations of the present application are described in further detail below with reference to the accompanying drawings.

In this application, the terms "first", "second", and so on are used to distinguish identical or similar items that have substantially the same role and function. It should be understood that "first", "second", and "nth" imply no logical or temporal dependency and place no limitation on quantity or execution order.

In this application, the term "at least one" means one or more, and "a plurality of" means two or more. For example, a plurality of first positions means two or more first positions.

Before the embodiments of the present application are introduced, some basic concepts in the field of cloud technology need to be introduced:

Cloud Technology: a hosting technology that unifies hardware, software, network and other resources within a wide area network or local area network to realize the computing, storage, processing and sharing of data. It is also the general term for the network technology, information technology, integration technology, management platform technology, application technology and so on that are applied under the cloud computing business model; these can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support in the field of cloud technology. The background services of technical network systems, such as video websites, picture websites and portal websites, require large amounts of computing and storage resources. With the rapid development and application of the Internet industry, every item may in the future have its own identification mark, which must be transmitted to a background system for logical processing. Data of different levels will be processed separately, and all kinds of industry data require strong system support, which can be provided through cloud computing.

Cloud Storage: a new concept extended and developed from the concept of cloud computing. A distributed cloud storage system (hereinafter referred to as a storage system) is a storage system that uses functions such as cluster applications, grid technology and distributed storage file systems to bring together, through application software or application interfaces, a large number of storage devices of different types in the network (storage devices are also called storage nodes) so that they work together and jointly provide data storage and business access functions to the outside.

Database: in short, a database can be regarded as an electronic filing cabinet, a place for storing electronic files, in which users can add, query, update and delete the data in the files. A "database" is a collection of data that is stored together in a certain way, can be shared by multiple users, has as little redundancy as possible, and is independent of the application.

The database system involved in the embodiments of the present application may be a stand-alone database system, a stand-alone database system that is mainly transactional, or a stand-alone database system that is mainly analytical but requires transaction processing capabilities. It may be a NoSQL (Non-relational SQL, generally referring to non-relational databases) system, or it may be a distributed database system or a distributed big data processing system. For a distributed database system, different variables being stored on different physical nodes corresponds to the case in which the data state consistency model contains two or more variables. The data state consistency model and the corresponding transaction processing flow are described in detail in the embodiments below and are not elaborated here.

The database system may include at least one node device. The database of each node device may store multiple data tables, and each data table may be used to store one or more data items. The database of a node device may be any type of distributed database and may include at least one of a relational database or a non-relational database, such as a SQL (Structured Query Language) database, NoSQL, or NewSQL (generally referring to various new scalable, high-performance databases). The type of database is not specifically limited in the embodiments of the present application.

In some embodiments, the embodiments of the present application can also be applied to a database system based on blockchain technology (hereinafter referred to as a "blockchain system"). A blockchain system is essentially a decentralized distributed database system. It uses a consensus algorithm to keep the ledger data recorded by different node devices on the blockchain consistent, uses cryptographic algorithms to ensure that ledger data is transmitted between node devices in encrypted form and cannot be tampered with, uses a script system to extend the ledger functions, and uses network routing to interconnect the node devices.

A blockchain system may include one or more blockchains. A blockchain is a chain of data blocks that are generated and linked using cryptographic methods. Each data block contains information about a batch of network transactions, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.

The node devices in a blockchain system can form a peer-to-peer (P2P) network. The P2P protocol is an application layer protocol that runs on top of the Transmission Control Protocol (TCP). In a blockchain system, any node device can have the following functions: 1) routing, a basic function of a node device that supports communication between node devices; 2) application, which is deployed in the blockchain to implement specific services according to actual business needs, records the data related to those functions to form ledger data, carries a digital signature in the ledger data to indicate the data source, and sends the ledger data to the other node devices in the blockchain system, so that those node devices add the ledger data to a temporary block once they have successfully verified its source and integrity, where the services implemented by the application may include wallets, shared ledgers, smart contracts and so on; 3) blockchain, which comprises a series of blocks that follow one another in chronological order, where a new block, once added to the blockchain, is never removed, and the blocks record the ledger data submitted by the node devices in the blockchain system.

In some embodiments, each block may include the hash value of the transaction records stored in the block (the hash value of the block) and the hash value of the previous block, and the blocks are connected through these hash values to form a blockchain. In addition, a block may also include information such as the timestamp at which the block was generated.
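The hash linking described above can be illustrated with a short sketch. The use of SHA-256, the JSON serialization, the all-zero hash for the first block, and hashing the previous block's hash together with the transaction records are assumptions chosen for the example; the text does not name a hash function or a block layout.

```python
import hashlib
import json

GENESIS_PREV = "0" * 64  # assumed placeholder for "no previous block"


def block_hash(transactions, prev_hash):
    """Hash the block's transaction records together with the previous hash."""
    body = json.dumps({"tx": transactions, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()


def append_block(chain, transactions):
    """Append a block that records the hash of the block before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV
    chain.append({
        "tx": transactions,
        "prev": prev_hash,
        "hash": block_hash(transactions, prev_hash),
    })


def verify(chain):
    """Check that every block's hash is valid and links to its predecessor."""
    prev = GENESIS_PREV
    for block in chain:
        if block["prev"] != prev:
            return False
        if block["hash"] != block_hash(block["tx"], block["prev"]):
            return False
        prev = block["hash"]
    return True
```

Because each hash covers the previous hash, changing the records of any earlier block makes `verify` fail for that block, which is the tamper-evidence property the paragraph refers to.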

FIG. 1 is a schematic diagram of an implementation environment of a log storage method provided in an embodiment of the present application. Referring to FIG. 1, this embodiment can be applied to a distributed database system, which may include a gateway server 101, a global timestamp generation cluster 102, a distributed storage cluster 103, and a distributed coordination system 104 (for example, ZooKeeper). The distributed storage cluster 103 may include data node devices and coordination node devices.

The gateway server 101 is used to receive external read and write requests and to distribute the read and write transactions corresponding to those requests to the distributed storage cluster 103. For example, after logging in to the application client on a terminal, the user triggers the application client to generate a read or write request, and the API (Application Programming Interface) provided by the distributed database system is called to send the request to the gateway server 101. The API may be, for example, the MySQL API (an API provided by a relational database system).

In some embodiments, the gateway server 101 can be merged onto the same physical machine as any data node device or any coordination node device in the distributed storage cluster 103; that is, a data node device or coordination node device acts as the gateway server 101.

The global timestamp generation cluster 102 is used to generate the global commit timestamp (Global Timestamp, Gts) of a global transaction. A global transaction, also called a distributed transaction, is a transaction that involves multiple data node devices. For example, a global read transaction may read data stored on multiple data node devices, and a global write transaction may write data on multiple data node devices. The global timestamp generation cluster 102 can be logically regarded as a single point, but in some embodiments a one-master-three-slave architecture can be used to provide a more highly available service. Generating the global commit timestamp with a cluster prevents single points of failure and thus avoids the single-point bottleneck problem.

Optionally, the global commit timestamp is a timestamp identifier that is globally unique and monotonically increasing within the distributed database system. It can be used to mark the global commit order of each transaction and thereby reflect the real-time ordering between transactions (the total order of transactions). The global commit timestamp can use at least one of a physical clock, a logical clock, a hybrid physical clock, or a hybrid logical clock (HLC). The embodiments of the present application do not specifically limit the type of the global commit timestamp.

In an exemplary scenario, the global commit timestamp can be generated with a hybrid physical clock and can consist of eight bytes. The first 44 bits can hold the value of the physical timestamp (that is, the Unix timestamp, accurate to milliseconds), which can represent 2⁴⁴ unsigned integers, so in theory physical timestamps spanning about 2⁴⁴ / (1000 × 60 × 60 × 24 × 365) ≈ 557.8 years can be represented. The last 20 bits can hold a monotonically increasing count within a given millisecond, giving 2²⁰ (about 1 million) counts per millisecond. With this data structure, if the transaction throughput of a single machine (any data node device) is 100,000 transactions per second, a distributed storage cluster 103 containing 10,000 node devices can theoretically be supported. The number of global commit timestamps also represents the total number of transactions that the system can theoretically support: with this data structure, the system can theoretically support (2⁴⁴ - 1) × 2²⁰ transactions. This is only an exemplary description of one way to define the global commit timestamp. Depending on business requirements, the number of bits in the global commit timestamp can be extended to support more nodes and more transactions. The embodiments of the present application do not specifically limit how the global commit timestamp is defined.
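The 44-bit plus 20-bit layout in this scenario can be sketched with plain integer arithmetic. The function names `pack_gts` and `unpack_gts` are hypothetical; only the bit widths and the millisecond unit come from the text above.

```python
PHYSICAL_BITS = 44                      # Unix time in milliseconds
COUNTER_BITS = 20                       # per-millisecond monotonic counter
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def pack_gts(unix_ms: int, counter: int) -> int:
    """Pack a millisecond timestamp and a counter into one 64-bit integer."""
    if not 0 <= unix_ms < (1 << PHYSICAL_BITS):
        raise ValueError("physical timestamp does not fit in 44 bits")
    if not 0 <= counter <= COUNTER_MASK:
        raise ValueError("counter does not fit in 20 bits")
    return (unix_ms << COUNTER_BITS) | counter


def unpack_gts(gts: int):
    """Split a packed timestamp back into (unix_ms, counter)."""
    return gts >> COUNTER_BITS, gts & COUNTER_MASK


# Range of the 44-bit millisecond field, in 365-day years (about 557.8).
YEARS_REPRESENTABLE = (1 << PHYSICAL_BITS) / (1000 * 60 * 60 * 24 * 365)

# Counts per second across the cluster, divided by 100,000 transactions per
# second per node, gives roughly the 10,000 nodes mentioned in the text.
NODES_SUPPORTED = ((1 << COUNTER_BITS) * 1000) // 100_000
```

Placing the physical time in the high bits means that comparing two packed values as integers orders them first by millisecond and then by counter, which is what makes the value monotonically increasing.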

In some embodiments, the global timestamp generation cluster 102 may be physically independent, or it may be merged with the distributed coordination system 104 (for example, ZooKeeper).

The distributed storage cluster 103 may include data node devices and coordination node devices, and each coordination node device may correspond to at least one data node device. The division into data node devices and coordination node devices is made per transaction. Taking a given global transaction as an example, the node that initiates the global transaction can be called the coordination node device, and the other node devices involved in the global transaction are called data node devices. The number of data node devices or coordination node devices may be one or more, and the embodiments of the present application do not specifically limit the number of data node devices or coordination node devices in the distributed storage cluster 103. Because the distributed database system provided in this embodiment lacks a global transaction manager, XA (eXtended Architecture, the X/Open distributed transaction specification) / 2PC (Two-Phase Commit) technology can be used in the system to support cross-node transactions (global transactions) and to guarantee the atomicity and consistency of data during cross-node write operations. In this case, the coordination node device acts as the coordinator in the 2PC algorithm, and the data node devices corresponding to that coordination node device act as the participants in the 2PC algorithm.
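The coordinator and participant roles in two-phase commit can be sketched as follows. This is a deliberately simplified, single-process illustration: the `Participant` class and its `can_commit` flag are hypothetical, and the sketch omits the logging, timeouts and failure recovery that a real XA/2PC implementation needs.

```python
class Participant:
    """A data node device taking part in a global transaction."""

    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit   # hypothetical: whether prepare succeeds
        self.state = "active"

    def prepare(self) -> bool:
        # Phase 1: vote on whether the local part can be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants) -> bool:
    """Run by the coordination node device; returns True on global commit."""
    votes = [p.prepare() for p in participants]   # phase 1: collect votes
    if all(votes):
        for p in participants:                    # phase 2: commit everywhere
            p.commit()
        return True
    for p in participants:                        # phase 2: roll back everywhere
        p.rollback()
    return False
```

Either every participant ends in the committed state or every participant ends in the aborted state, which is the atomicity guarantee the paragraph attributes to XA/2PC.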

可选地，每个数据节点设备或协调节点设备可以是单机设备，也可以采用主备结构(也即是为一主多备集群)，如图1所示，以节点设备(数据节点设备或协调节点设备)为一主两备集群为例进行示意，每个节点设备中包括一个主机和两个备机，可选地，每个主机或备机都对应配置有代理(agent)设备，代理设备可以与主机或备机是物理独立的，当然，代理设备还可以作为主机或备机上的一个代理模块，以节点设备1为例，节点设备1包括一个主数据库及代理设备(主database+agent，简称主DB+agent)，此外还包括两备数据库及代理设备(备database+agent，简称备DB+agent)。Optionally, each data node device or coordination node device can be a stand-alone device, or it can adopt a master-slave structure (that is, a one-master and multiple-slave cluster), as shown in Figure 1, taking the node device (data node device or coordination node device) as a one-master and two-slave cluster as an example for illustration, each node device includes a host and two standby machines, optionally, each host or standby machine is correspondingly configured with an agent device, the agent device can be physically independent of the host or standby machine, of course, the agent device can also serve as an agent module on the host or standby machine, taking node device 1 as an example, node device 1 includes a master database and an agent device (master database+agent, referred to as master DB+agent), in addition to two standby databases and agent devices (standby database+agent, referred to as standby DB+agent).

在一个示例性场景中，每个节点设备所对应的主机或备机的数据库实例集合称为一个SET(集合)，例如，假设某一节点设备为单机设备，那么该节点设备的SET仅为该单机设备的数据库实例，假设某一节点设备为一主两备集群，那么该节点设备的SET为主机数据库实例以及两个备机数据库实例的集合，此时可以基于云数据库的强同步技术来保证主机的数据与备机的副本数据之间的一致性，可选地，每个SET可以进行线性扩容，以应付大数据场景下的业务处理需求，在一些金融业务场景下，全局事务通常是指跨SET的转账。In an exemplary scenario, the set of database instances of the host or standby corresponding to each node device is called a SET (set). For example, assuming that a node device is a stand-alone device, then the SET of the node device is only the database instance of the stand-alone device. Assuming that a node device is a one-master and two-standby cluster, then the SET of the node device is a set of the host database instance and the two standby database instances. At this time, the consistency between the host data and the replica data of the standby can be guaranteed based on the strong synchronization technology of the cloud database. Optionally, each SET can be linearly expanded to meet the business processing needs in big data scenarios. In some financial business scenarios, global transactions usually refer to transfers across SETs.

The distributed coordination system 104 can be used to manage at least one of the gateway server 101, the global timestamp generation cluster 102, or the distributed storage cluster 103. Optionally, a technician can access the distributed coordination system 104 through a scheduler on a terminal, so that the front-end scheduler controls the back-end distributed coordination system 104 and each cluster or server can be managed. For example, the technician can use the scheduler to direct ZooKeeper to remove a node device from the distributed storage cluster 103, that is, to invalidate that node device.

FIG. 1 above provides only an architecture diagram of lightweight global transaction processing, which is a kind of quasi-distributed database system. The entire distributed database system can be regarded as jointly maintaining one large logical table. The data stored in this large table is scattered by primary key across the node devices in the distributed storage cluster 103, and the data stored on each node device is independent of the other node devices, so that the node devices horizontally partition the large logical table. Because each data table in each database can be horizontally partitioned and then stored in a distributed manner, such a system is also described as having a "database and table sharding" architecture.
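The horizontal partitioning described above can be illustrated with a minimal sketch. The source states only that rows are scattered by primary key; the hash-modulo placement policy, the row format, and the names `node_for_key` and `shard_rows` are assumptions made for illustration.

```python
import zlib


def node_for_key(primary_key: str, node_count: int) -> int:
    """Map a primary key to a node index with a stable hash (assumed policy)."""
    return zlib.crc32(primary_key.encode("utf-8")) % node_count


def shard_rows(rows, node_count):
    """Split the rows of one logical table into per-node shards."""
    shards = [[] for _ in range(node_count)]
    for row in rows:
        # Each row lands on exactly one node, chosen from its primary key.
        shards[node_for_key(row["id"], node_count)].append(row)
    return shards
```

Every row is stored on exactly one node, so each node holds a slice of the logical table that is independent of the other nodes.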

In the distributed database system described above, atomicity and consistency of data during write operations are already achieved with the XA/2PC algorithm. From a technical point of view, however, the sharding architecture lacks a global transaction manager and therefore lacks distributed transaction processing capability. Constructing a lightweight, decentralized distributed transaction processing mechanism can give the distributed database system capabilities such as horizontal scaling, keep the system simple and easy to adopt, and raise transaction processing efficiency, which will have a significant impact on distributed database architectures designed around traditional concurrency control.

The log storage method provided in the embodiments of the present application can be applied to the distributed system with the sharding architecture described above. For example, the distributed system may be a distributed transactional database system or a distributed relational database system. The method can also be applied to some stand-alone database systems. The log storage method gives the storage engine of a single node in a database system the ability to use NVM (Non-Volatile Memory), so as to meet the application needs of different customers and improve transaction processing efficiency.

In some embodiments, the distributed database system composed of the gateway server 101, the global timestamp generation cluster 102, the distributed storage cluster 103, and the distributed coordination system 104 can be regarded as a server that provides data services to user terminals. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. Optionally, the user terminal may be a smartphone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, or the like, but is not limited to these. The terminal and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in the present application.

Before the embodiments of the present application are described, the different types of storage media involved in the present application are introduced:

First storage medium: an NVM (Non-Volatile Memory) medium, a new type of persistent storage medium, for example PCM (Phase Change Memory). The data write speed of PCM lies between the data write speeds of the second storage medium and the third storage medium, while the data read speed of PCM is roughly on par with that of the second storage medium and far greater than that of the third storage medium. Hereinafter, the first storage medium is also referred to as the NVM medium.

Second storage medium: a volatile storage medium, usually main memory, used as a cache for system acceleration, for example DRAM (Dynamic Random Access Memory). It is characterized by small capacity, fast IO (Input/Output), high price, volatility, and byte addressing. Hereinafter, the second storage medium is also referred to as memory.

Third storage medium: a non-volatile storage medium, usually a hard disk, used for persistent storage of business data, for example an HDD (Hard Disk Drive) or an SSD (Solid State Disk). It is characterized by large capacity, slow IO, low price, non-volatility, and block addressing. Hereinafter, the third storage medium is also referred to as the hard disk or disk.

On the basis of the above system architecture, the underlying storage architecture of most current databases is disk-oriented, and the overall storage architecture can be divided into two storage tiers. On the one hand, the main storage medium is an HDD or SSD, which has large capacity, slow IO, low price, non-volatility, and block addressing, and which guarantees data durability when the database suffers a major failure (such as a system crash or power outage); that is, it provides persistent storage of data. On the other hand, DRAM serves as a cache for system acceleration and has small capacity, fast IO, high price, volatility, and byte addressing. Because of the large gap in IO speed between these two storage tiers, the database uses a series of modules to bridge the gap and thereby optimize system performance. One such module is the log module, which is introduced below.

In a database system, the log module is indispensable for guaranteeing data reliability, and it also significantly affects system performance. To raise throughput, the database system first caches written data in memory and later persists it to the hard disk asynchronously. Because data in memory is volatile, data that has not yet been persisted is at risk of being lost when the system crashes because of an unrecoverable problem. The system therefore relies on the log module to restore the data in memory, which means that the redo log (Redo Log) corresponding to each write must be persisted to disk. Although this step can be completed with the relatively fast sequential IO of the disk, it still limits the upper bound of database system throughput.

In the traditional InnoDB system, the log storage system has two layers: the log buffer and the log files. The log buffer is located in memory (the second storage medium, a volatile storage medium). It is a contiguous region of virtual memory that is logically divided into multiple log blocks. Internally, the log buffer is divided into two parts of equal size: a write area and a persistence area. The write area caches log data newly generated by transactions; all transaction log data is first written directly into the write area of the log buffer. The persistence area is used to persist log data; log data in the persistence area is asynchronously persisted to the log files on disk. When the amount of log data written to the write area reaches a threshold and all logs in the persistence area have been persisted, InnoDB swaps the two areas: newly generated log data is written from the beginning of the former persistence area, overwriting old data that has already been persisted, while the log data in the former write area begins to be persisted asynchronously.
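The swap between the write area and the persistence area can be sketched as follows. This is a simplified illustration rather than InnoDB code: the class name `DoubleLogBuffer`, the in-memory `log_file` that stands in for the on-disk log file, and the synchronous `flush` are all assumptions.

```python
class DoubleLogBuffer:
    """Toy model of a log buffer split into a write area and a persistence area."""

    def __init__(self, swap_threshold: int):
        self.swap_threshold = swap_threshold
        self.write_area = bytearray()    # receives newly generated log data
        self.persist_area = bytearray()  # waiting to be persisted to disk
        self.log_file = bytearray()      # stands in for the on-disk log file

    def append(self, record: bytes) -> None:
        self.write_area += record
        # Swap only when the write area is large enough AND the persistence
        # area has been fully persisted (modelled here as being empty).
        if len(self.write_area) >= self.swap_threshold and not self.persist_area:
            self.write_area, self.persist_area = self.persist_area, self.write_area

    def flush(self) -> None:
        """Persist the persistence area (asynchronous in a real engine)."""
        self.log_file += self.persist_area
        self.persist_area.clear()
```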

In addition, in the traditional InnoDB system the log files are located on disk (the third storage medium, a non-volatile storage medium) and are managed by multiple redo log file groups, each of which contains one or more log files. Normally InnoDB is configured with only one redo log file group. Within a redo log file group, log data is written cyclically: when the current log file is full, writing continues in the next log file, and when all files are full, writing wraps around and overwrites from the beginning of the first log file. To upper-layer applications, this cyclic write mechanism makes a log file group appear as an unbounded contiguous write space. Each log file is divided into a file header and log data. The file header consists of four log blocks. The first log block contains basic metadata, such as the ID (identification) of the log group to which the file belongs and the starting LSN (Log Sequence Number) of the log records in the file. The second and fourth log blocks record log checkpoint information, and only the header of the first log file in each log group contains this information. The log data part stores the actual log records, with a structure similar to that of the log buffer.
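The cyclic write behaviour of a log file group can be sketched as below. The sketch deliberately ignores the file headers and checkpoint blocks described above; `LogFileGroup`, `file_size`, and `write_pos` are hypothetical names, and the files are modelled as in-memory byte arrays.

```python
class LogFileGroup:
    """Toy log file group: fixed-size files written in a ring."""

    def __init__(self, file_count: int, file_size: int):
        self.file_size = file_size
        self.files = [bytearray(file_size) for _ in range(file_count)]
        self.write_pos = 0  # logical position; grows without bound

    def write(self, data: bytes) -> None:
        capacity = len(self.files) * self.file_size
        for byte in data:
            # Wrap the logical position back to the first file when the
            # last file is full, overwriting the oldest data.
            offset = self.write_pos % capacity
            self.files[offset // self.file_size][offset % self.file_size] = byte
            self.write_pos += 1
```

Callers only ever see the growing logical position, which is why the group looks like an unbounded contiguous space from above.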

The embodiments of the present application propose a new log architecture for the storage engine, called NVM-MySQL. In this architecture, all redo logs (hereinafter also simply called logs) are stored in the NVM medium (the first storage medium, a new type of non-volatile storage medium). The traditional two-layer architecture of log buffer plus log files is simplified into a single-layer NVM architecture, which avoids interaction between the log module and the disk. When a redo log is generated, it is persisted directly to the NVM medium; during disaster recovery, the redo log is read directly through the high-speed IO of the NVM medium, with low time overhead.

FIG. 2 is a schematic diagram of the principle of an NVM-MySQL log architecture provided by an embodiment of the present application. Referring to FIG. 2, the NVM-MySQL log architecture provides a single-layer log storage system, as shown at 200. When a transaction operation occurs, data is written to the data buffer pool 201 in memory and, at the same time, the redo log corresponding to the transaction operation is written to the NVM medium 202. The data cached in the data buffer pool 201 is subsequently persisted to disk in the form of data files. During disaster recovery, the system only needs to obtain the redo log from the NVM medium 202 and rewrite the data into the data buffer pool 201 according to the redo log.
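The write path and recovery path of FIG. 2 can be modelled with a small sketch. It is a toy model under stated assumptions: the buffer pool, NVM log, and disk are plain Python containers, a redo record is simply a key/value pair, and the class and method names are hypothetical.

```python
class NvmMySqlSketch:
    """Toy model of the single-layer log architecture."""

    def __init__(self):
        self.buffer_pool = {}   # volatile: data buffer pool in memory
        self.nvm_redo_log = []  # persistent: redo log on the NVM medium
        self.disk = {}          # persistent: data files on disk

    def write(self, key, value):
        # The redo record goes straight to NVM; no in-memory log buffer.
        self.nvm_redo_log.append((key, value))
        self.buffer_pool[key] = value

    def flush_to_disk(self):
        # Persistence of cached data to data files (asynchronous in practice).
        self.disk.update(self.buffer_pool)

    def crash(self):
        # Memory contents are lost; NVM and disk survive.
        self.buffer_pool = {}

    def recover(self):
        # Replay the redo log from NVM into the buffer pool.
        for key, value in self.nvm_redo_log:
            self.buffer_pool[key] = value
```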

In the traditional InnoDB storage engine (a MySQL database engine), redo logs must exist in both memory and disk. In the NVM-MySQL log architecture, by contrast, storing all redo logs in the NVM medium is sufficient to persist them, so the two-layer log storage system is restructured into a single-layer log storage system. The log-related operation flows also need to be redesigned so that the NVM-MySQL log architecture can be adapted to various database systems. At the code level, the main changes are to the core data structures and the log-related operation flows, after which the NVM-MySQL log architecture is compatible with various database systems.

FIG. 3 is a flowchart of a log storage method provided by an embodiment of the present application. Referring to FIG. 3, this embodiment is applied to any node device of a database system and includes the following steps:

301. In response to a commit event of a target transaction, the node device determines the remaining capacity of a first storage medium in the database system, the first storage medium being an NVM medium used to store logs.

The target transaction is any mini-transaction (Mini Transaction, MTR) into which any logical transaction is divided. Optionally, the logical transaction is a global transaction or a local transaction; the embodiments of the present application do not specifically limit the type of the logical transaction. A global transaction (also called a distributed transaction) is a transaction that involves cross-node operations, and a local transaction (also called a partial transaction) is a transaction that involves operations on a single node only.

Optionally, the node device is any node device in the database system. For example, in a stand-alone database system, the node device is the stand-alone device of that system. In a distributed database system, because a distributed transaction may involve cross-node operations, the node device may be a coordination node device or a data node device. The node that initiates a distributed transaction is called the coordination node device, and the other nodes involved in the distributed transaction are called data node devices.

The MTR is an important mechanism for guaranteeing the integrity and durability of physical page writes and is commonly known as a "physical transaction". In InnoDB, every data modification or read operation goes through an MTR. A logical transaction may include one or more MTRs. Each MTR temporarily stores the logs it generates in its own dynamic log area; once the current MTR commits, the temporarily stored logs are written to the NVM medium (the first storage medium) that holds the redo log.

In the NVM-MySQL architecture, the NVM medium is used entirely to store log records. The NVM medium is divided into multiple log blocks (Log Block) for management, and each log block corresponds to a contiguous range of storage addresses in the NVM medium. The structure resembles the log buffer, but the space does not need to be divided into a write area and a persistence area as in the traditional InnoDB two-layer storage system. Instead, log data is written directly into the NVM medium in a cyclic, overwriting manner.

Optionally, each log block in the NVM medium can store one or more logs (also called log records or log data), or part of the data of one log. That is, when a log is small, it can be recorded in the same log block together with data of other logs; when a log is large, it can be spread across multiple consecutive log blocks.

FIG. 4 is a schematic diagram of the architecture of an NVM-MySQL log storage system provided by an embodiment of the present application. As shown at 400, the NVM medium includes multiple log blocks. If the log of a single transaction is small, it may be recorded in the same log block together with logs of other transactions; if the log of a single transaction is large, it is spread across multiple log blocks. Optionally, each log block is a 512-byte data block.

In some embodiments, each log block includes a metadata area and a data area. The metadata area is located at the head and the tail of the log block, and the data area is the storage region between the head and the tail. The head of the log block occupies 12 bytes, belongs to the metadata area, and includes: 1) Number (log block sequence number): indicates which log block the current block is; occupies 4 bytes; Number = LSN/512 + 1. 2) Data Len (data length): indicates the number of bytes used in the current log block; occupies 2 bytes; if the entire log block is full, Data Len = 512. 3) First Rec offset (start position of the first record): indicates where, in the current log block, the first log record of the first new MTR begins; occupies 2 bytes. Because the current log block may hold data left over from the previous log block, this start position is not necessarily at the beginning of the data part. 4) Checkpoint No (log checkpoint sequence number): indicates the log checkpoint sequence number corresponding to the current log block; occupies 4 bytes; it is set only when the current log block is full and is used during disaster recovery to determine whether the log records in the current log block need to be reapplied (redone). The tail of the log block occupies 4 bytes, belongs to the metadata area, and records the checksum of the current log block. The storage region between the head and the tail is called the data area; it occupies 496 bytes and stores the log records.

FIG. 5 is a schematic structural diagram of a log block provided by an embodiment of the present application. As shown in FIG. 5, the log block 500 is divided into a metadata area 501 and a data area 502. The head and the tail of the log block both belong to the metadata area 501, and the storage region between the head and the tail belongs to the data area 502. Optionally, the head of the log block occupies 12 bytes and includes the log block sequence number (4 bytes), the data length (2 bytes), the start position of the first record (2 bytes, hereinafter also called the "start position"), and the log checkpoint sequence number (4 bytes). The tail of the log block occupies the last 4 bytes of the block and contains the checksum of the log block. The data area 502 lies after the 12-byte head and before the 4-byte tail, occupies 496 bytes, and stores the log records.
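The 512-byte layout described above (12-byte head, 496-byte data area, 4-byte tail) can be expressed as a small packing sketch. Several details are assumptions that the text does not specify: big-endian byte order, CRC-32 as the checksum, Data Len counting the 12 head bytes plus the payload (and set to 512 for a full block), and the function names themselves.

```python
import struct
import zlib

BLOCK_SIZE = 512
HEADER = struct.Struct(">IHHI")  # Number, Data Len, First Rec offset, Checkpoint No
TRAILER = struct.Struct(">I")    # checksum
DATA_SIZE = BLOCK_SIZE - HEADER.size - TRAILER.size  # 496 bytes


def block_number(lsn: int) -> int:
    """Number = LSN/512 + 1, using integer division."""
    return lsn // BLOCK_SIZE + 1


def pack_block(number, first_rec_offset, checkpoint_no, payload: bytes) -> bytes:
    if len(payload) > DATA_SIZE:
        raise ValueError("payload does not fit in one log block")
    # Assumed convention: a full block reports Data Len = 512.
    full = len(payload) == DATA_SIZE
    data_len = BLOCK_SIZE if full else HEADER.size + len(payload)
    body = HEADER.pack(number, data_len, first_rec_offset, checkpoint_no)
    body += payload.ljust(DATA_SIZE, b"\x00")
    return body + TRAILER.pack(zlib.crc32(body))


def unpack_block(block: bytes) -> dict:
    number, data_len, first_rec, ckpt = HEADER.unpack_from(block, 0)
    (checksum,) = TRAILER.unpack_from(block, BLOCK_SIZE - TRAILER.size)
    if zlib.crc32(block[:BLOCK_SIZE - TRAILER.size]) != checksum:
        raise ValueError("checksum mismatch")
    used = DATA_SIZE if data_len == BLOCK_SIZE else data_len - HEADER.size
    payload = block[HEADER.size:HEADER.size + used]
    return {"number": number, "data_len": data_len,
            "first_rec_offset": first_rec, "checkpoint_no": ckpt,
            "payload": payload}
```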

Compared with the traditional InnoDB system, the NVM medium no longer uses log files. Log checkpoint (Checkpoint) information is therefore no longer needed to locate the position in a log file from which to recover logs, and log metadata no longer needs to be saved. The redundant and complex two-layer log storage system is thus simplified into a single-layer log storage system based on the NVM medium.

In some embodiments, because the NVM-MySQL system changes the log storage architecture, the core log data structures in InnoDB need to be modified accordingly. These structures are mainly log_t and log_group_t. log_t represents the entire log system, has a single instance, and holds the pointers related to the log buffer. log_group_t represents a log file group, has as many instances as there are file groups, and holds the pointers related to the file header buffer and the log checkpoint buffer.

In the NVM-MySQL system, all log data is stored in the NVM medium, so log_group_t no longer needs to be retained. Instead, attributes related to the NVM medium are added to log_t to manage the log space of the NVM medium. Optionally, the added attributes include at least: 1) ulint buf_free: the starting offset of the free space; 2) byte* buf: the pointer to the starting write position of the log space; 3) ulint buf_size: the capacity of the log space, controlled by the parameter innodb_log_buffer_size; 4) ulint max_buf_free: when buf_free exceeds this value, the log space is insufficient.
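For readers who prefer code to prose, the four attributes can be mirrored in a small sketch. It is not the actual log_t definition: the class name `LogSys`, the use of a `bytearray` in place of a `byte*` pointer into NVM, and the helper `space_is_short` are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LogSys:
    """Illustrative stand-in for the NVM-related attributes added to log_t."""

    buf_size: int        # capacity of the log space
    max_buf_free: int    # threshold above which the log space is short
    buf_free: int = 0    # starting offset of the free space
    buf: bytearray = field(init=False)  # the log space itself

    def __post_init__(self):
        self.buf = bytearray(self.buf_size)

    def space_is_short(self) -> bool:
        # When buf_free exceeds max_buf_free, the log space is insufficient.
        return self.buf_free > self.max_buf_free
```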

In step 301 above, the node device can query the number of occupied log blocks in the first storage medium, multiply that number by the storage capacity of a single log block to obtain the occupied capacity of the first storage medium, and subtract the occupied capacity from the total storage capacity of the first storage medium to obtain the remaining capacity.
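The arithmetic of step 301 is short enough to state directly. The function name and the default block size of 512 bytes (taken from the optional block size mentioned earlier) are illustrative choices.

```python
def remaining_capacity(total_capacity: int, occupied_blocks: int,
                       block_size: int = 512) -> int:
    """Remaining capacity = total capacity - occupied blocks * block size."""
    occupied_capacity = occupied_blocks * block_size
    return total_capacity - occupied_capacity
```

For example, a 4096-byte log space with 3 occupied 512-byte blocks has 4096 - 1536 = 2560 bytes remaining.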

302. In response to the remaining capacity being greater than or equal to the data amount of the uncached log of the target transaction, the node device writes the uncached log of the target transaction to the first storage medium.

In the above process, after obtaining the remaining capacity, the node device determines whether it is sufficient, that is, whether it can hold the data amount required by the current log write. If the remaining capacity is greater than or equal to the data amount of the uncached log of the target transaction, the remaining capacity is sufficient and the entire uncached log of the target transaction is persisted to the first storage medium. Otherwise, step 303 below is performed.

Optionally, when writing the uncached log of the target transaction to the first storage medium, the node device can start from the last log block of the first storage medium and write the uncached log in units of log blocks. The uncached log is thus written block by block, which ensures that the logs recorded in the log blocks follow commit order from oldest to newest.

Optionally, after the node device writes the uncached log into the last log block, if the free capacity of the last log block is smaller than the data amount of the uncached log, part of the uncached log remains unwritten when the last log block becomes full. In that case, once the last log block has been filled, another log block is created after it. Because the newly created log block follows the former last log block, it serves as the last log block in the next iteration, and the operation of writing the uncached log into the last log block is repeated until the entire uncached log has been written to the first storage medium.
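The fill-then-create loop described above might look like the following sketch. Blocks are modelled as byte arrays holding only the 496-byte data area; head and tail metadata are omitted, and `append_log` is a hypothetical name.

```python
DATA_SIZE = 496  # bytes of log data per 512-byte log block


def append_log(blocks, log: bytes):
    """Write `log` block by block, starting from the last existing block."""
    if not blocks:
        blocks.append(bytearray())
    pos = 0
    while True:
        last = blocks[-1]
        room = DATA_SIZE - len(last)
        last += log[pos:pos + room]  # fill the last block; stop when it is full
        pos += room
        if pos >= len(log):          # nothing left to write
            break
        blocks.append(bytearray())   # new block becomes the last block
    return blocks
```

With a last block that already holds 490 bytes, a 600-byte log fills the remaining 6 bytes, then one full new block, and leaves 98 bytes in a third block.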

The above process is one possible implementation of writing the uncached log in units of log blocks starting from the last log block. By repeatedly filling the last log block and then creating another log block, the uncached log is written to the first storage medium block by block. As a result, even if the system goes down at any moment, the amount of data that must be recovered from the redo log is kept to a minimum.

In some embodiments, the node device can instead first create a number of log blocks determined by the data amount of the uncached log, such that the combined capacity of the created log blocks exceeds that data amount. It then starts writing the uncached log from the last existing log block and, once that block is full, continues writing the remainder into the newly created log blocks. This avoids frequent log block creation and simplifies the log writing flow.

303. In response to the remaining capacity being less than the data amount of the uncached log of the target transaction, the node device creates a log checkpoint, stores the business data generated by modification operations in a second storage medium to a third storage medium, and writes the uncached log of the target transaction to the first storage medium.

The second storage medium is a volatile storage medium, and the third storage medium is a non-volatile storage medium. The embodiments of the present application are described with memory as the second storage medium and disk as the third storage medium.

In the above process, after obtaining the remaining capacity, the node device determines whether it is sufficient, that is, whether it can hold the data amount required by the current log write. If the remaining capacity is less than the data amount of the uncached log of the target transaction, the remaining capacity is insufficient. The node device then creates a log checkpoint: it first flushes the business data held in the in-memory data buffer pool (the dirty data) to disk, deletes the log records that are no longer needed to free up storage space, and then writes the entire uncached log of the target transaction to the first storage medium.

In some embodiments, the data buffer pool is located in memory and stores the business data generated by each transaction. Because a transaction may end either committed or rolled back, only the business data generated by modification operations needs to be persisted when the business data in memory is persisted. Business data generated by rollback operations is discarded directly so that invalid data does not occupy disk storage space.

其中，未缓存日志的存储过程与上述步骤302类似，这里不做赘述。The storage process of the uncached log is similar to the above step 302 and will not be described in detail here.

FIG. 6 is a schematic flowchart of a log storage method provided by an embodiment of the present application. As shown at 600, taking an MTR as the target transaction: in step 601, the MTR commits. In step 602, the node device checks whether the free part of the log space (that is, the remaining capacity of the NVM) is sufficient. If it is sufficient, step 603 is executed: the MTR's uncached log is written into the last Block (log block) of the log space, stopping when the Block is full. Step 604 then checks whether the MTR still has uncached log. If it does, step 605 is executed: a new Block is created after the last Block of the log space and initialized, and the flow returns to step 603. This repeats until the MTR has no uncached log left, at which point the flow ends. Conversely, if the free part is insufficient, step 606 is executed: a log checkpoint is created, the dirty data in the data buffer pool is flushed to disk, and the log corresponding to that dirty data is deleted from the log space. Once sufficient free space has been reclaimed, the flow returns to step 602.
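The flow of steps 601-606 can be sketched as a small in-memory model. This is only an illustrative sketch, not the patented implementation: the class name `LogSpace`, the constant `BLOCK_CAPACITY`, and the simplification that a checkpoint discards every existing block are all assumptions made for the example.

```python
from dataclasses import dataclass, field

BLOCK_CAPACITY = 8  # hypothetical payload bytes per log block


@dataclass
class LogSpace:
    """Toy model of the NVM log space: a bounded list of fixed-size blocks."""
    max_blocks: int
    blocks: list = field(default_factory=lambda: [bytearray()])
    checkpoints: int = 0

    def free_bytes(self) -> int:
        # Free room in the last block plus all blocks not yet created.
        unused_blocks = self.max_blocks - len(self.blocks)
        return unused_blocks * BLOCK_CAPACITY + BLOCK_CAPACITY - len(self.blocks[-1])

    def checkpoint(self) -> None:
        # Step 606 (simplified): dirty pages are assumed to be flushed to disk
        # here, so every existing log record can be dropped.
        self.blocks = [bytearray()]
        self.checkpoints += 1

    def commit(self, log: bytes) -> None:
        if len(log) > self.max_blocks * BLOCK_CAPACITY:
            raise ValueError("log is larger than the whole log space")
        if self.free_bytes() < len(log):          # step 602
            self.checkpoint()                     # step 606
        remaining = log
        while remaining:
            last = self.blocks[-1]
            room = BLOCK_CAPACITY - len(last)
            last.extend(remaining[:room])         # step 603: fill the last block
            remaining = remaining[room:]
            if remaining:                         # step 604: log left over?
                self.blocks.append(bytearray())   # step 605: new initialized block
```

In this sketch the space check happens once per commit, before any bytes are written, which mirrors the return to step 602 after the checkpoint.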

In one example, the algorithm for reclaiming free log space in the NVM-MySQL system is shown in Table 1 below:

Table 1

At the code level, the node device can use the finish_write() method to cache log records. In the traditional InnoDB system, if the log buffer is found to have insufficient space before a log is cached, InnoDB calls the log_write_up_to() method to persist the log data cached in the log buffer to disk. In the NVM-MySQL system, the checkpoint_for_NVM_MySQL() method is called instead to reclaim free space by creating a log checkpoint. This method directly persists to disk the business data in the data buffer pool that was generated by modification operations (including updates, inserts, deletes, and so on). Because that business data has now been persisted, its corresponding log no longer needs to be kept, so the corresponding log can be deleted. Alternatively, based on an overwrite mechanism, the last Block of the log space is moved to the start position (that is, the log recorded in the last Block is copied into the first Block), and the start write position pointer of the first storage medium is reset to point to the position currently being written within the copied first Block, so that the next write of uncached log overwrites directly from the start position.
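The overwrite-based variant described above (copy the last Block to the first Block, then reset the write pointer) can be illustrated on a flat byte array. This is a hedged sketch only: `reclaim_log_space`, `BLOCK_SIZE`, and the flat `bytearray` layout are hypothetical names and choices, not the body of checkpoint_for_NVM_MySQL().

```python
BLOCK_SIZE = 16  # hypothetical block size in bytes


def reclaim_log_space(space: bytearray, last_block: int, used_in_last: int) -> int:
    """Copy the last block to block 0 and return the new write offset.

    Assumes the dirty pages covered by the earlier blocks were already
    flushed to disk, so those blocks may safely be overwritten.
    """
    start = last_block * BLOCK_SIZE
    space[0:BLOCK_SIZE] = space[start:start + BLOCK_SIZE]
    # The next write continues inside the copied first block, right after
    # the bytes that were already used in the old last block.
    return used_in_last
```

Returning the offset rather than storing it keeps the sketch free of global state; a real log module would keep this pointer in its core log structure.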

All of the above optional technical solutions can be combined in any manner to form optional embodiments of the present disclosure, which are not described one by one here.

With the method provided by the embodiments of the present application, the uncached log of the target transaction is persisted in the NVM storage medium (that is, the first storage medium) when the target transaction commits, so the log is durably maintained at commit time. Because the log does not need to be stored twice, once in memory (the second storage medium) and once on disk (the third storage medium), the space occupied by log storage is greatly reduced. There is no need to build the two-layer log storage system of log buffer plus log file used by traditional InnoDB, and the traditional log file is eliminated. For this reason, if the remaining capacity of the NVM storage medium is insufficient, free storage space can be reclaimed from the NVM storage medium simply by creating a log checkpoint, without the cumbersome low-speed IO process of flushing the log from memory to disk. This improves the system performance of the database and avoids capping the throughput of the database system.

The above embodiment provides the strategy for writing the uncached log of the target transaction to the first storage medium: if the remaining capacity is sufficient, the uncached log is written starting from the last log block; if it is insufficient, a log checkpoint is created to reclaim free space before the uncached log is written. This embodiment provides the initialization strategy for the first storage medium, described in detail taking an MTR as the target transaction and an NVM medium as the first storage medium.

FIG. 7 is a flowchart of initializing an NVM medium provided by an embodiment of the present application. Referring to FIG. 7, the initialization process is applied to a node device and includes:

701. The node device obtains the storage capacity of the NVM medium.

In the traditional InnoDB system, the log module is initialized when the database starts; the main work is to create the relevant data structures and allocate memory to the relevant buffers. InnoDB keeps log records, log metadata, and similar information in corresponding in-memory structures, so initialization must allocate memory for the log record buffer and the metadata buffer and initialize the first Block of the log buffer.

In the NVM-MySQL system provided by the embodiments of the present application, only the log (that is, log records) is stored in the NVM medium, so initialization only needs to configure the log space capacity parameter of the database system to match the storage capacity of the NVM medium and to initialize the first Block. Optionally, "match" here means that the log space capacity parameter is less than or equal to the storage capacity of the NVM medium.

In step 701 above, the node device calculates the storage capacity of the NVM medium (also called the "log space size" below); this calculation is performed when the database system starts for the first time. Optionally, after the storage capacity is calculated, it can be recorded in the ulint buf_size parameter of the core log data structure log_t for later recording and access.

702. The node device configures the log space capacity parameter of the database system as the storage capacity of the NVM medium.

In the above process, the node device allocates the NVM medium to the log space of the database system. Optionally, after the log space in log_t is initialized, the NVM medium is mapped as the log space of the database system through mmap (a method of memory-mapping a file), which completes the space allocation of the NVM medium. The mmap call is located in the log_init() method.

703. The node device initializes the first Block in the NVM medium.

In the above process, the node device initializes the metadata area of the first Block in the NVM medium, sets its data area to empty, and sets the start position to the initialized first Block.
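Steps 701-703 can be approximated with Python's standard mmap module, using an ordinary file as a stand-in for the NVM device. This is a sketch under stated assumptions: the function `init_log_space`, the block size, and the three-field header layout are hypothetical and are not taken from log_init().

```python
import mmap
import os
import struct
import tempfile

BLOCK_SIZE = 512                 # hypothetical log block size in bytes
HEADER = struct.Struct("<IHH")   # hypothetical header: block no., bytes used, flags


def init_log_space(path: str, capacity: int) -> mmap.mmap:
    """Map `capacity` bytes of an existing file as the log space and reset block 0."""
    with open(path, "r+b") as f:
        f.truncate(capacity)                        # steps 701/702: size the log space
        log_space = mmap.mmap(f.fileno(), capacity)
    log_space[:BLOCK_SIZE] = bytes(BLOCK_SIZE)      # step 703: empty data area
    log_space[:HEADER.size] = HEADER.pack(0, 0, 0)  # block 0, nothing written yet
    return log_space


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "nvm_log.bin")
        open(path, "wb").close()                    # stand-in for the NVM device file
        space = init_log_space(path, 4 * BLOCK_SIZE)
        print(len(space), HEADER.unpack(space[:HEADER.size]))
        space.close()
```

Only one mapping and one block header are set up, which reflects the point made above that no separate log buffer or file header buffer has to be allocated.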

In one example, the log module initialization algorithm of the NVM-MySQL system is shown in Table 2 below:

Table 2

This embodiment provides a log initialization method for the NVM-MySQL system. Compared with the traditional InnoDB system, memory does not need to be allocated separately to a log buffer and a file header buffer, the redundant two-layer log storage system is eliminated, and only the NVM medium needs to be allocated as the first storage medium. This greatly simplifies log initialization and improves the system performance of the database.

The above embodiment provides the log initialization strategy of NVM-MySQL. Building on it: in traditional InnoDB the in-memory log is not durable, so it must be persisted from memory to disk at appropriate times. In the NVM-MySQL system, the NVM medium is itself non-volatile, which means the log is already persisted once it is in the NVM medium. No additional log persistence is required, and log metadata does not need to be maintained. The NVM-MySQL log module involves no low-speed disk IO at all and log records need no additional persistence, which saves a large amount of log maintenance time and avoids capping the throughput of the database system.

FIG. 8 is a flowchart of periodically checking an NVM medium provided by an embodiment of the present application. Referring to FIG. 8, the description takes an MTR as the target transaction and an NVM medium as the first storage medium. The periodic check process is applied to a node device and includes the following steps:

801. At intervals of a first target duration, the node device creates a log checkpoint in response to a target condition being met.

Optionally, the target condition includes at least one of the following: the remaining capacity of the first storage medium is less than a capacity threshold; or the difference between the maximum log sequence number in the database system and the log sequence number corresponding to the business data with the smallest timestamp in the second storage medium is greater than a first target threshold; or the difference between the maximum log sequence number in the database system and the log sequence number of the last log checkpoint in the first storage medium is greater than a second target threshold.

The first target threshold and the second target threshold are each any value greater than or equal to 0.

In the above process, because NVM-MySQL uses a single-layer log storage architecture, a log checkpoint can be created directly when the free log space (that is, the remaining capacity) in the NVM medium is insufficient, for example when the remaining capacity is less than the capacity threshold, or when the ratio of the remaining capacity to the total storage capacity of the NVM medium is less than a ratio threshold. After the log checkpoint is created, the system persists to disk the business data in the data buffer pool that was generated by modification operations (including updates, inserts, deletes, and so on). The redo log of the target transaction does not need to be persisted separately, because it is already persisted in the NVM medium. The system then moves the start write position pointer of the log space and the last Block to the first Block. Because the business data generated by modification operations in the data buffer pool has been persisted, the earlier log records can be overwritten directly.

In some embodiments, for the NVM-MySQL system, the node device can periodically (at intervals of the first target duration) call the log_check_margins() method to check whether the target condition is met. Specifically, three items can be checked: whether the log space is insufficient, for example the remaining capacity of the first storage medium is less than the capacity threshold; whether business data generated by modification operations in the data buffer pool has gone too long without being persisted, for example the difference between the maximum log sequence number in the database system and the log sequence number corresponding to the business data with the smallest timestamp in the second storage medium is greater than the first target threshold; and whether too long has passed since a log checkpoint was created, for example the difference between the maximum log sequence number in the database system and the log sequence number of the last log checkpoint in the first storage medium is greater than the second target threshold. During the check, if any one item is satisfied, the target condition is determined to be met and the step of creating a log checkpoint is executed; only if none of the three items is satisfied is the target condition determined not to be met.
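The three-item check can be written as a single predicate. The sketch below is illustrative only: `LogState`, its field names, and `needs_checkpoint` are hypothetical and do not reproduce the body of log_check_margins().

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogState:
    free_bytes: int           # remaining capacity of the NVM log space
    max_lsn: int              # largest log sequence number in the system
    oldest_dirty_lsn: int     # LSN of the oldest unpersisted modification
    last_checkpoint_lsn: int  # LSN of the last log checkpoint


def needs_checkpoint(state: LogState, capacity_threshold: int,
                     first_threshold: int, second_threshold: int) -> bool:
    """Return True if any one of the three target-condition items holds."""
    return (
        state.free_bytes < capacity_threshold                            # log space low
        or state.max_lsn - state.oldest_dirty_lsn > first_threshold      # stale dirty data
        or state.max_lsn - state.last_checkpoint_lsn > second_threshold  # stale checkpoint
    )
```

Because the items are combined with `or`, the checkpoint is skipped only when all three comparisons fail, matching the rule stated above.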

Optionally, when the log checkpoint is created, the checkpoint_for_NVM_MySQL() method is called to tidy the NVM medium and reclaim free space. The algorithm for creating a log checkpoint is similar to step 303 above and is not repeated here.

In one example, the periodic check algorithm of the NVM-MySQL system is shown in Table 3 below:

Table 3

802. The node device stores the business data in memory that was generated by modification operations to disk.

In the above process, storing the business data generated by modification operations from the in-memory data buffer pool to disk is one possible implementation of storing such business data from the second storage medium to the third storage medium. In other words, the data objects operated on by each transaction are persisted, while the logs corresponding to each transaction's read and write operations do not need to be persisted again, which greatly reduces the time overhead of the periodic check.

803. The node device copies the log in the last stored log block of the NVM medium to the first log block of the first storage medium (that is, of the NVM medium).

Step 803 is part of the series of operations performed on the NVM medium after the log checkpoint is created. Optionally, after determining the last stored log block in the NVM medium, the node device reads the log stored in that block, copies it into the first log block by overwriting, and then executes step 804 below.

804. The node device moves the start write position pointer of the NVM medium to the first log block.

Optionally, the node device moves the start write position pointer of the NVM medium to the end of the last log record in the first log block, which serves as the start position of the next log storage process.

FIG. 9 is a schematic flowchart of periodically checking an NVM medium provided by an embodiment of the present application. Referring to FIG. 9, in step 901 the node device determines whether the log space is sufficient. If it is insufficient, steps 904-906 are executed. If it is sufficient, step 902 is executed: determine whether the data buffer pool contains data modifications that have gone too long without being persisted (that is, business data generated by modification operations whose timestamp is too old). If so, steps 904-906 are executed. If not, step 903 is executed: determine whether too long has passed since a log checkpoint (Checkpoint) was created. If so, steps 904-906 are executed; otherwise, the flow ends. In step 904, the business data generated by modification operations in the data buffer pool is persisted to disk. In step 905, the log in the last Block of the log space is copied to the first Block. In step 906, the start write position pointer of the log space is moved to the first Block.

In this embodiment, during a periodic check the traditional InnoDB system must persist both the business data and the log, whereas the NVM-MySQL system only needs to persist the business data. The log was already persisted in the NVM medium when the transaction committed and does not need to be persisted again, which greatly reduces the time overhead of the periodic check.

The above embodiment provides the periodic check strategy of the NVM-MySQL system: when any of the three items is satisfied, the target condition is determined to be met, and the operations of creating a log checkpoint and persisting data modifications are performed. This embodiment provides a disaster recovery scheme for NVM-MySQL, described in detail below.

FIG. 10 is a flowchart of a disaster recovery method provided by an embodiment of the present application. Referring to FIG. 10, the method includes the following steps:

1001. In response to the database system restarting after a crash, the node device verifies the log blocks starting from the first log block of the NVM medium, and determines the logs stored in the log blocks that pass verification as the logs to be recovered.

In the above process, because all log data that needs to be recovered in the NVM-MySQL system is stored in the NVM medium, disaster recovery only needs to verify and parse the log records in the NVM medium, save them to a hash table through step 1002 below, and finally traverse the logs in the hash table and reapply them.

Step 1001 above is one possible implementation of the node device obtaining the logs to be recovered from the NVM medium. Optionally, the recv_group_scan_log_recs() method is modified so that it parses and verifies log records in batches directly from the NVM medium log_sys->buf and caches them in the hash table.

In some embodiments, the node device obtains the first Block of the NVM medium and determines whether its checksum (Checksum) is correct. If the Checksum is incorrect, this is the Block that was being written when the system crashed; all log records have been recovered, and the method exits. If the Checksum is correct, the log data in the Block is parsed, verified, and added to the hash table. The node device then determines whether the Block is full. If it is not full, this is the last Block of the NVM medium at the time of the crash; all log records have been recovered, and the method exits. If it is full, the next Block is read and the verification is repeated. After all log blocks of the NVM medium have been traversed in this way, the hash table holds all logs to be recovered, and step 1002 below is executed.
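The block scan with its two exit conditions (bad checksum, or a block that is not full) can be sketched as follows. The block representation as a dict, the use of CRC-32 from Python's zlib module as the checksum, and the names `make_block`, `scan_for_recovery`, and `PAYLOAD_SIZE` are assumptions for illustration; the source does not specify the checksum algorithm or block layout.

```python
import zlib

PAYLOAD_SIZE = 8  # hypothetical payload capacity of one log block


def make_block(payload: bytes) -> dict:
    """Build a block whose stored checksum matches its payload."""
    return {"payload": payload, "checksum": zlib.crc32(payload)}


def scan_for_recovery(blocks: list) -> list:
    """Return the payloads of all blocks that should be redone, in order."""
    recovered = []
    for block in blocks:
        if zlib.crc32(block["payload"]) != block["checksum"]:
            break  # torn block: it was being written when the crash happened
        recovered.append(block["payload"])
        if len(block["payload"]) < PAYLOAD_SIZE:
            break  # partially filled block: this was the last block of the log
    return recovered
```

Stopping at the first partially filled block also prevents stale blocks left over from before the last checkpoint from being replayed.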

1002. The node device stores the logs to be recovered in a hash table, traverses the hash table, and redoes the logs to be recovered stored in it to obtain the recovered business data.

Step 1002 above is one possible implementation of the node device performing data recovery based on the logs to be recovered. In traditional InnoDB, disaster recovery must read the logs to be recovered from disk, whereas in this embodiment they are read directly from the NVM medium. Because the IO speed of NVM is far higher than that of disk, the time consumed by disaster recovery is greatly reduced. Storing the logs in a hash table further speeds up disaster recovery.
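A minimal sketch of the hash-table redo in step 1002 is shown below. It assumes, purely for illustration, that each log record is a tuple `(lsn, page_id, offset, data)`, that the hash table is keyed by page identifier, and that pages are fixed-size byte arrays; none of these details are given in the source.

```python
from collections import defaultdict

PAGE_SIZE = 16  # hypothetical page size in bytes


def build_redo_table(records):
    """Group (lsn, page_id, offset, data) records by page_id in a hash table."""
    table = defaultdict(list)
    for lsn, page_id, offset, data in records:
        table[page_id].append((lsn, offset, data))
    return table


def redo(table, pages):
    """Apply every record to its page in LSN order and return the pages."""
    for page_id, recs in table.items():
        page = pages.setdefault(page_id, bytearray(PAGE_SIZE))
        for _lsn, offset, data in sorted(recs, key=lambda r: r[0]):
            page[offset:offset + len(data)] = data
    return pages
```

Grouping by page lets each page be brought up to date in one pass, which is one plausible reason a hash table speeds up recovery.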

FIG. 11 is a schematic flowchart of a disaster recovery method provided by an embodiment of the present application. Referring to FIG. 11, during disaster recovery, step 1101 parses and verifies the log in the NVM medium and puts it into the hash table, and step 1102 redoes the log in the hash table. No lengthy read of the logs to be recovered from disk is needed, which greatly reduces the time consumed by disaster recovery.

In one example, the disaster recovery strategy of NVM-MySQL is shown in Table 4 below:

Table 4

The disaster recovery method provided in this embodiment does not need to locate a log file on disk and then read it; it only needs to read the logs to be recovered directly from the NVM medium and redo them. Because the IO speed of NVM is far higher than that of disk, the time consumed by disaster recovery is greatly reduced.

The above embodiments describe how the NVM-MySQL system optimizes the log module of the traditional InnoDB system in log caching, log persistence, periodic checking, and disaster recovery. Next, the performance gains of the NVM-MySQL system for each of these operations are analyzed theoretically.

In this embodiment, a PCM (Phase Change Memory) storage medium is used as the representative NVM device. Assume that, for each unit of data read or written, the write and read latencies of DRAM are DRAM_W and DRAM_R, those of PCM are PCM_W and PCM_R, and those of HDD are HDD_W and HDD_R. The write latencies are then related by DRAM_W×10⁵ = PCM_W×5000 = HDD_W, and the read latencies by DRAM_R×10⁵ = PCM_R×10⁵ = HDD_R.
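Solving the two assumed relations for the pairwise ratios makes the later comparisons easier to follow. The following restates them in LaTeX; it adds no assumption beyond the relations above.

```latex
\begin{aligned}
\mathrm{PCM\_W} &= \tfrac{10^{5}}{5000}\,\mathrm{DRAM\_W} = 20\,\mathrm{DRAM\_W}, &
\mathrm{HDD\_W} &= 10^{5}\,\mathrm{DRAM\_W} = 5000\,\mathrm{PCM\_W},\\
\mathrm{PCM\_R} &= \mathrm{DRAM\_R}, &
\mathrm{HDD\_R} &= 10^{5}\,\mathrm{DRAM\_R} = 10^{5}\,\mathrm{PCM\_R}.
\end{aligned}
```

In other words, under these assumptions PCM writes are 20 times slower than DRAM writes but 5000 times faster than HDD writes, while PCM reads are as fast as DRAM reads.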

1. Log caching stage

To simplify the analysis, insufficient log space is not considered during log caching; that scenario is analyzed in section "3. Periodic check".

Assume the amount of log data written at one time is log. For one write to the log buffer, InnoDB only needs to write the log to memory, which takes log×DRAM_W. NVM-MySQL must write the log to NVM, which takes log×PCM_W = 20×log×DRAM_W.

In summary, writing a log at the caching stage takes NVM-MySQL 20 times as long as InnoDB.

2. Log persistence stage

Assume the size of the log buffer to be persisted is buf and the size of the metadata is metadata, that the log metadata must be updated once per persistence, and that buf is much larger than metadata.

When InnoDB persists the log, it must write the cached log metadata and log records to disk, which takes metadata×HDD_W + buf×HDD_W ≈ buf×HDD_W. NVM-MySQL does nothing at this stage, so its actual time is 0.

In summary, unlike InnoDB, NVM-MySQL does not need to persist log data, so it incurs no time overhead at this stage.

3. Periodic check

Assume that at each periodic check the probability that the log buffer has insufficient space is rate_Buf, and that the buffer size to be persisted in that case is buf; by the conclusion of the previous section "2. Log persistence stage", persisting the log buffer takes approximately buf×HDD_W. The probability that the data buffer pool has gone too long without flushing dirty data is rate_Data, and the size of the data to be flushed is data. The probability that a log checkpoint must be created is rate_CP, and the size of the checkpoint information is checkpoint.

To simplify the analysis, assume that each time the log buffer fills up, one round of log and data persistence is triggered, that is, rate_Buf = rate_Data = rate_CP = rate. In practice, buf and data are both much larger than checkpoint.

When InnoDB performs a periodic check, it may write the log buffer, the data buffer pool, and the log checkpoint information to disk, which takes:

rate×buf×HDD_W + rate×data×HDD_W + rate×checkpoint×HDD_W ≈ (buf+data)×rate×HDD_W

When NVM-MySQL detects insufficient space, it only needs to write the data buffer pool to disk and reset the write position of the log space, which takes rate×data×HDD_W.

The analysis shows that the time NVM-MySQL saves is mainly the time spent persisting the log buffer; the size of the gain depends on the ratio of buf to data.
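The two periodic-check cost expressions can be compared numerically with a small model. The latency value and the sizes below are arbitrary illustrative units, and the function names are hypothetical; only the formulas come from the analysis above.

```python
HDD_W = 1e5  # HDD write latency per unit of data, in units of DRAM_W


def innodb_check_cost(rate: float, buf: float, data: float, checkpoint: float) -> float:
    """InnoDB: log buffer, data buffer pool and checkpoint info all go to disk."""
    return rate * (buf + data + checkpoint) * HDD_W


def nvm_check_cost(rate: float, data: float) -> float:
    """NVM-MySQL: only the data buffer pool is written to disk."""
    return rate * data * HDD_W


def saved_fraction(rate: float, buf: float, data: float, checkpoint: float = 0.0) -> float:
    """Share of InnoDB's periodic-check cost that NVM-MySQL avoids."""
    innodb = innodb_check_cost(rate, buf, data, checkpoint)
    return (innodb - nvm_check_cost(rate, data)) / innodb
```

For example, with buf = 64 and data = 192 (and the checkpoint information neglected), the saved share works out to buf/(buf+data) = 0.25, independent of rate.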

4. Disaster recovery

Assume there is only one log file group, the size of the log checkpoint information is checkpoint, and the size of the log to be redone is redo, where redo is generally much larger than checkpoint.

During disaster recovery, InnoDB must read the log checkpoint information and the log data from disk, which takes checkpoint×HDD_R + redo×HDD_R ≈ redo×HDD_R. NVM-MySQL only needs to read the log data from NVM, which takes redo×PCM_R.

In summary, NVM-MySQL moves the reading of log data during disaster recovery from disk to NVM, whose read speed is close to that of DRAM, which greatly reduces the time overhead.

5. Overall system optimization

While an InnoDB system is running, the main log overhead falls into two stages: 1) normal operation; 2) disaster recovery. The two stages are therefore analyzed separately.

(I) Normal operation stage

During normal operation, log data is persisted mainly during periodic checks, so the analysis only needs to combine log caching and periodic checks. The earlier analysis yields the results in Table 5, where the difference column indicates how much less time NVM-MySQL takes than InnoDB.

Table 5

Normally, the log buffer fills up after every buf/log log-caching operations, and the next periodic check triggers one round of log and data persistence. Taken together, the theoretical time saved by NVM-MySQL during normal operation is:

-(buf ÷ log) × log × PCM_W + buf × HDD_W = buf × (HDD_W - PCM_W) ≈ buf × HDD_W

This saving amounts to roughly buf/(buf+data) of the original InnoDB time overhead.

In summary, the time actually saved by NVM-MySQL is the cost of persisting logs through low-speed disk I/O, and the size of the gain depends on the ratio of the log buffer size to the amount of data modifications that need to be persisted.
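The derivation above can also be written as a short calculation. This is a sketch under assumptions: the function name is hypothetical, `pcm_w` and `hdd_w` stand for the per-unit write times of NVM and disk, and the first term of the formula is read here as the extra cost of writing each log to NVM during log caching.

```python
def net_saving_per_buffer_fill(buf, log, pcm_w, hdd_w):
    """Net time saved by NVM-MySQL over one fill of a log buffer of size buf."""
    caching_ops = buf / log                      # log caching operations per fill
    extra_nvm_cost = caching_ops * log * pcm_w   # logs written to NVM instead
    avoided_disk_cost = buf * hdd_w              # log buffer no longer flushed to disk
    return avoided_disk_cost - extra_nvm_cost
```

Because pcm_w is assumed to be much smaller than hdd_w, the result stays close to buf × hdd_w, which matches the approximation in the formula.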

(II) Disaster recovery phase

From the preceding analysis, the time overhead of InnoDB and NVM-MySQL in the disaster recovery phase can be compared as shown in Table 6, where the difference column indicates how much less time NVM-MySQL takes than InnoDB.

Table 6

| Operation | InnoDB | NVM-MySQL | Difference |
| --- | --- | --- | --- |
| Disaster recovery | redo × HDD_R | redo × PCM_R | redo × HDD_R |

Table 6 shows that NVM-MySQL needs no low-speed disk I/O when reading logs during disaster recovery, and NVM reads are much faster than disk reads, so the time for reading logs during disaster recovery can be reduced by a factor of about 100,000.
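The recovery comparison can be illustrated with a short sketch. The function names are hypothetical, and the sample latencies in the usage note are assumed values chosen only to show the order of magnitude stated above; they are not measurements from the application.

```python
def recovery_read_time(redo, per_unit_read_time):
    """Time to read `redo` units of log at the given per-unit read time."""
    return redo * per_unit_read_time


def read_speedup(hdd_r, pcm_r):
    """Ratio of the disk read time to the NVM read time for the same log size."""
    return hdd_r / pcm_r
```

For example, with an assumed hdd_r of 5e-3 seconds and an assumed pcm_r of 50e-9 seconds per unit, `read_speedup(5e-3, 50e-9)` is about 100,000.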

In the embodiments of the present application, on the one hand, NVM-MySQL saves the time overhead of persisting logs through low-speed disk I/O in the transaction execution phase, which improves the overall throughput of the system; the size of the gain depends on the ratio of the log buffer size to the amount of data modifications that need to be persisted each time. On the other hand, NVM-MySQL needs no low-speed disk I/O when reading logs in the disaster recovery phase, and NVM reads are much faster than disk reads, so the time for reading logs during disaster recovery is also greatly shortened and is almost negligible compared with traditional InnoDB.

In some embodiments, in addition to storing all logs in NVM, the logs can be merged directly into the data records, and the non-volatility of NVM can be used to guarantee data persistence directly, so that the log module is removed, which saves storage space and simplifies the disaster recovery process. Table 7 shows the differences and connections between the scheme that stores only the logs in NVM (corresponding to NVM-Data) and the scheme that stores the data records together with the logs in NVM (corresponding to NVM-Log).

Table 7

As can be seen from Table 7, in the NVM-Log scheme the data records and logs are merged and stored together in the NVM medium, where they are persisted. This is equivalent to removing the entire log module: the step of persisting data records at transaction commit is eliminated, the transaction commit process is simplified, and system storage space is saved.

FIG. 12 is a schematic structural diagram of a log storage device provided in an embodiment of the present application. Referring to FIG. 12, the device includes:

a determination module 1201, configured to determine, in response to a commit event of a target transaction, the remaining capacity of a first storage medium in the database system, the first storage medium being a non-volatile storage medium for storing logs;

a storage module 1202, configured to, in response to the remaining capacity being less than the data amount of the uncached log of the target transaction, create a log checkpoint and store the business data generated by modification operations in a second storage medium to a third storage medium, the second storage medium being a volatile storage medium and the third storage medium being a non-volatile storage medium;

a writing module 1203, configured to write the uncached log of the target transaction to the first storage medium.

With the device provided in the embodiments of the present application, the uncached log of the target transaction is persisted in the NVM storage medium (that is, the first storage medium) when the target transaction is committed, so that the log is persisted at the same time as the transaction is committed. Because the log does not need to be stored twice, once in memory (that is, the second storage medium) and once on disk (that is, the third storage medium), the space occupied by log storage is greatly reduced. There is no need to build the two-layer log storage system of log buffer and log file used by traditional InnoDB, and the traditional log file is eliminated. For this reason, if the remaining capacity of the NVM storage medium is insufficient, free storage space can be reclaimed from the NVM storage medium directly by creating a log checkpoint, without the cumbersome low-speed I/O process of flushing logs from memory to disk. This improves the system performance of the database and avoids limiting the upper limit of the throughput of the database system.
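The cooperation of modules 1201 to 1203 at commit time can be sketched as follows. This is a toy in-memory model under stated assumptions, not the implementation of the application: the class and method names are hypothetical, the three storage media are simulated with Python objects, and the checkpoint simply resets the log space instead of reproducing the block-level steps.

```python
class LogStorageDevice:
    """Toy model of modules 1201-1203; all names here are hypothetical."""

    def __init__(self, nvm_capacity):
        self.nvm_capacity = nvm_capacity  # size of the first storage medium (NVM)
        self.nvm_log = bytearray()        # logs persisted in the first storage medium
        self.dirty_pages = {}             # second storage medium (volatile memory)
        self.disk_pages = {}              # third storage medium (disk)
        self.checkpoints = 0

    def remaining_capacity(self):
        """Determination module 1201: remaining capacity of the log space."""
        return self.nvm_capacity - len(self.nvm_log)

    def create_checkpoint(self):
        """Storage module 1202: flush business data, then reuse the log space."""
        self.disk_pages.update(self.dirty_pages)
        self.dirty_pages.clear()
        self.nvm_log.clear()
        self.checkpoints += 1

    def write_log(self, log):
        """Writing module 1203: persist the uncached log in NVM."""
        self.nvm_log += log

    def on_commit(self, log):
        """Handle the commit event of a transaction whose uncached log is `log`."""
        if len(log) > self.nvm_capacity:
            raise ValueError("log larger than the whole log space")  # assumed policy
        if self.remaining_capacity() < len(log):
            self.create_checkpoint()
        self.write_log(log)
```

In this sketch, committing a 5-byte log twice into an 8-byte log space triggers exactly one checkpoint before the second write.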

In a possible implementation, based on the device composition of FIG. 12, the writing module 1203 includes:

a writing unit, configured to write the uncached log in units of log blocks, starting from the last log block already stored in the first storage medium.

In a possible implementation, the writing unit is configured to:

write the uncached log to the last log block;

if the storage capacity of the last log block is less than the data amount of the uncached log, create another log block after the last log block once the last log block has been filled.
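The block-wise write described above can be sketched as follows. This is a minimal illustration under assumptions: the function name, the in-memory list of `bytearray` blocks standing in for the first storage medium, and the default block size of 512 bytes are all hypothetical choices.

```python
BLOCK_SIZE = 512  # assumed log block size in bytes


def append_log(blocks, log, block_size=BLOCK_SIZE):
    """Append `log` to `blocks`, starting from the last stored log block."""
    if not blocks:
        blocks.append(bytearray())
    offset = 0
    while offset < len(log):
        last = blocks[-1]
        free = block_size - len(last)
        if free == 0:
            # The last block is full: create another block after it.
            blocks.append(bytearray())
            continue
        chunk = log[offset:offset + free]
        last.extend(chunk)
        offset += len(chunk)
    return blocks
```

With a block size of 4 and a last block that already holds `b"abc"`, appending `b"defghij"` fills that block and creates two further blocks.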

In a possible implementation, the writing module 1203 is executed in response to the remaining capacity being greater than or equal to the data amount of the uncached log of the target transaction.

In a possible implementation, based on the device composition of FIG. 12, the device further includes:

an acquisition module, configured to acquire the storage capacity of the first storage medium;

a configuration module, configured to set the log space capacity parameter of the database system to the storage capacity of the first storage medium.

In a possible implementation, the storage module 1202 is further configured to:

store the business data generated by modification operations in the second storage medium to the third storage medium;

copy the logs in the last log block already stored in the first storage medium to the first log block of the first storage medium;

move the start write position pointer of the first storage medium to the first log block.
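The three checkpoint steps listed above can be sketched as follows. The class, function and field names are hypothetical, and the log space is modeled as a fixed list of blocks with an index that marks the block currently being written; the real on-media layout is not specified here.

```python
class NvmLogSpace:
    """Hypothetical model of the log space in the first storage medium."""

    def __init__(self, num_blocks):
        self.blocks = [b""] * num_blocks
        self.write_pos = 0  # index of the last stored (currently written) block


def create_log_checkpoint(space, dirty_pages, disk_pages):
    """Flush business data, then recycle the log space from its first block."""
    # Step 1: store business data from volatile memory to disk.
    disk_pages.update(dirty_pages)
    dirty_pages.clear()
    # Step 2: copy the logs of the last stored block into the first block.
    space.blocks[0] = space.blocks[space.write_pos]
    # Step 3: move the start write position pointer to the first block.
    space.write_pos = 0
```

After the call, later log writes start again at the first block, so the blocks behind it can be overwritten.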

In a possible implementation, the target condition includes at least one of the following:

the remaining capacity of the first storage medium is less than a capacity threshold;

the difference between the maximum log sequence number in the database system and the log sequence number corresponding to the business data with the smallest timestamp in the second storage medium is greater than a first target threshold;

the difference between the maximum log sequence number in the database system and the log sequence number of the last log checkpoint in the first storage medium is greater than a second target threshold.
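A check of the target condition can be sketched as below. The function and parameter names are hypothetical, and the sketch assumes that all three items are configured and that meeting any one of them is sufficient; an embodiment may use only a subset.

```python
def target_condition_met(remaining_capacity, capacity_threshold,
                         max_lsn, oldest_dirty_lsn, last_checkpoint_lsn,
                         first_threshold, second_threshold):
    """Return True if any of the three listed items holds."""
    low_space = remaining_capacity < capacity_threshold
    # Gap to the log sequence number of the oldest business data still in memory.
    dirty_gap = (max_lsn - oldest_dirty_lsn) > first_threshold
    # Gap to the log sequence number of the last log checkpoint.
    checkpoint_gap = (max_lsn - last_checkpoint_lsn) > second_threshold
    return low_space or dirty_gap or checkpoint_gap
```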

a recovery module, configured to, in response to the database system restarting after a crash, obtain the logs to be recovered from the first storage medium and perform data recovery based on the logs to be recovered.

In a possible implementation, based on the device composition of FIG. 12, the recovery module includes:

a verification unit, configured to verify the log blocks starting from the first log block of the first storage medium, and to determine the logs stored in the log blocks that pass verification as the logs to be recovered;

a redo unit, configured to store the logs to be recovered in a hash table, traverse the hash table, and redo the logs to be recovered stored in the hash table to obtain the recovered business data.
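The verification and redo units can be sketched as follows. Everything specific here is an assumption made for illustration: the block layout (a 4-byte CRC-32 followed by the payload), the `page=value` record format, stopping at the first block that fails verification, and the function names. The source only states that blocks are verified and that logs are redone through a hash table.

```python
import zlib
from collections import defaultdict


def make_block(payload):
    """Build a block as CRC-32 of the payload (4 bytes, big endian) + payload."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def block_is_valid(block):
    """Verify a block by recomputing the CRC-32 of its payload."""
    return len(block) >= 4 and int.from_bytes(block[:4], "big") == zlib.crc32(block[4:])


def collect_logs(blocks):
    """Scan from the first block and keep payloads until a block fails."""
    logs = []
    for block in blocks:
        if not block_is_valid(block):
            break
        logs.append(bytes(block[4:]))
    return logs


def redo(logs):
    """Group records by page in a hash table, then replay them in order."""
    table = defaultdict(list)
    for record in logs:
        page_id, _, value = record.partition(b"=")
        table[page_id].append(value)
    pages = {}
    for page_id, values in table.items():
        for value in values:
            pages[page_id] = value  # later records overwrite earlier ones
    return pages
```

Grouping by page in the hash table lets all records for one page be replayed together, which is the design choice this sketch is meant to show.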

It should be noted that the division into functional modules described above is only an example of how the log storage device provided in the above embodiments stores logs. In practice, the functions can be assigned to different functional modules as needed; that is, the internal structure of the node device can be divided into different functional modules to perform all or part of the functions described above. In addition, the log storage device provided in the above embodiments is based on the same concept as the log storage method embodiments; for its specific implementation, refer to the log storage method embodiments, which are not repeated here.

FIG. 13 is a schematic structural diagram of a node device provided in an embodiment of the present application. Optionally, the device type of the node device 1300 includes: a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, or a desktop computer. The node device 1300 may also be referred to by other names, such as user equipment, portable node device, laptop node device, or desktop node device.

Generally, the node device 1300 includes a processor 1301 and a memory 1302.

Optionally, the processor 1301 includes one or more processing cores, for example a 4-core processor or an 8-core processor. Optionally, the processor 1301 is implemented in at least one of the following hardware forms: DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). In some embodiments, the processor 1301 includes a main processor and a coprocessor. The main processor, also called a CPU (Central Processing Unit), processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 1301 is integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 1301 further includes an AI (Artificial Intelligence) processor, which handles computing operations related to machine learning.

In some embodiments, the memory 1302 includes one or more computer-readable storage media; optionally, the computer-readable storage media are non-transitory. Optionally, the memory 1302 further includes a high-speed random access memory and a non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 1302 stores at least one piece of program code, which is executed by the processor 1301 to implement the log storage method provided in the embodiments of the present application.

In some embodiments, the node device 1300 optionally further includes a peripheral device interface 1303 and at least one peripheral device. The processor 1301, the memory 1302, and the peripheral device interface 1303 can be connected through a bus or signal lines. Each peripheral device can be connected to the peripheral device interface 1303 through a bus, a signal line, or a circuit board. Specifically, the peripheral devices include at least one of a radio frequency circuit 1304, a touch display screen 1305, a camera assembly 1306, an audio circuit 1307, a positioning component 1308, and a power supply 1309.

The peripheral device interface 1303 can be used to connect at least one I/O (Input/Output) related peripheral device to the processor 1301 and the memory 1302. In some embodiments, the processor 1301, the memory 1302, and the peripheral device interface 1303 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1301, the memory 1302, and the peripheral device interface 1303 are implemented on a separate chip or circuit board, which is not limited in this embodiment.

The radio frequency circuit 1304 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1304 communicates with communication networks and other communication devices through electromagnetic signals. It converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals. Optionally, the radio frequency circuit 1304 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. Optionally, the radio frequency circuit 1304 communicates with other node devices through at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to, a metropolitan area network, the various generations of mobile communication networks (2G, 3G, 4G, and 5G), a wireless local area network, and/or a WiFi (Wireless Fidelity) network. In some embodiments, the radio frequency circuit 1304 further includes circuits related to NFC (Near Field Communication), which is not limited in the present application.

The display screen 1305 is used to display a UI (User Interface). Optionally, the UI includes graphics, text, icons, video, and any combination thereof. When the display screen 1305 is a touch display screen, it can also collect touch signals on or above its surface. The touch signals can be input to the processor 1301 as control signals for processing. Optionally, the display screen 1305 is also used to provide virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard. In some embodiments, there is one display screen 1305, disposed on the front panel of the node device 1300; in other embodiments, there are at least two display screens 1305, disposed on different surfaces of the node device 1300 or in a folding design; in still other embodiments, the display screen 1305 is a flexible display screen disposed on a curved or folding surface of the node device 1300. Optionally, the display screen 1305 can even have a non-rectangular irregular shape, that is, a special-shaped screen. Optionally, the display screen 1305 is made using LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or similar technology.

The camera assembly 1306 is used to capture images or video. Optionally, the camera assembly 1306 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the node device, and the rear camera is disposed on the back of the node device. In some embodiments, there are at least two rear cameras, each being one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be combined for background blur, and the main camera and the wide-angle camera can be combined for panoramic shooting, VR (Virtual Reality) shooting, or other combined shooting functions. In some embodiments, the camera assembly 1306 further includes a flash. Optionally, the flash is a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash and is used for light compensation at different color temperatures.

In some embodiments, the audio circuit 1307 includes a microphone and a speaker. The microphone collects sound waves from the user and the environment and converts them into electrical signals, which are input to the processor 1301 for processing or to the radio frequency circuit 1304 for voice communication. For stereo capture or noise reduction, there are multiple microphones, disposed at different parts of the node device 1300. Optionally, the microphone is an array microphone or an omnidirectional microphone. The speaker converts electrical signals from the processor 1301 or the radio frequency circuit 1304 into sound waves. Optionally, the speaker is a conventional film speaker or a piezoelectric ceramic speaker. A piezoelectric ceramic speaker can convert electrical signals not only into sound waves audible to humans but also into sound waves inaudible to humans, for purposes such as ranging. In some embodiments, the audio circuit 1307 further includes a headphone jack.

The positioning component 1308 is used to determine the current geographic location of the node device 1300 for navigation or LBS (Location Based Service). Optionally, the positioning component 1308 is a positioning component based on GPS (Global Positioning System), the BeiDou system, the GLONASS system, or the Galileo system.

The power supply 1309 is used to supply power to the components in the node device 1300. Optionally, the power supply 1309 is alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 1309 includes a rechargeable battery, the rechargeable battery supports wired or wireless charging. The rechargeable battery can also support fast charging technology.

In some embodiments, the node device 1300 further includes one or more sensors 1310. The one or more sensors 1310 include, but are not limited to, an acceleration sensor 1311, a gyroscope sensor 1312, a pressure sensor 1313, a fingerprint sensor 1314, an optical sensor 1315, and a proximity sensor 1316.

In some embodiments, the acceleration sensor 1311 detects the magnitude of acceleration on the three axes of the coordinate system established with respect to the node device 1300. For example, the acceleration sensor 1311 is used to detect the components of gravitational acceleration on the three axes. Optionally, the processor 1301 controls the touch display screen 1305 to display the user interface in landscape or portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1311. The acceleration sensor 1311 is also used to collect motion data for games or of the user.

In some embodiments, the gyroscope sensor 1312 detects the body orientation and rotation angle of the node device 1300, and works with the acceleration sensor 1311 to capture the user's 3D actions on the node device 1300. Based on the data collected by the gyroscope sensor 1312, the processor 1301 implements the following functions: motion sensing (for example, changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.

Optionally, the pressure sensor 1313 is disposed on the side frame of the node device 1300 and/or under the touch display screen 1305. When the pressure sensor 1313 is disposed on the side frame of the node device 1300, it can detect the user's grip signal on the node device 1300, and the processor 1301 performs left-hand or right-hand recognition or shortcut operations according to the grip signal collected by the pressure sensor 1313. When the pressure sensor 1313 is disposed under the touch display screen 1305, the processor 1301 controls the operable controls on the UI according to the user's pressure operation on the touch display screen 1305. The operable controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.

The fingerprint sensor 1314 is used to collect the user's fingerprint. The processor 1301 identifies the user according to the fingerprint collected by the fingerprint sensor 1314, or the fingerprint sensor 1314 identifies the user according to the collected fingerprint. When the user's identity is recognized as trusted, the processor 1301 authorizes the user to perform related sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. Optionally, the fingerprint sensor 1314 is disposed on the front, back, or side of the node device 1300. When the node device 1300 has a physical button or a manufacturer logo, the fingerprint sensor 1314 can be integrated with the physical button or the manufacturer logo.

The optical sensor 1315 is used to collect the ambient light intensity. In one embodiment, the processor 1301 controls the display brightness of the touch display screen 1305 according to the ambient light intensity collected by the optical sensor 1315. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 1305 is increased; when the ambient light intensity is low, the display brightness is decreased. In another embodiment, the processor 1301 also dynamically adjusts the shooting parameters of the camera assembly 1306 according to the ambient light intensity collected by the optical sensor 1315.

The proximity sensor 1316, also called a distance sensor, is usually disposed on the front panel of the node device 1300. The proximity sensor 1316 is used to measure the distance between the user and the front of the node device 1300. In one embodiment, when the proximity sensor 1316 detects that the distance between the user and the front of the node device 1300 is gradually decreasing, the processor 1301 controls the touch display screen 1305 to switch from the screen-on state to the screen-off state; when the proximity sensor 1316 detects that the distance is gradually increasing, the processor 1301 controls the touch display screen 1305 to switch from the screen-off state to the screen-on state.

Those skilled in the art will understand that the structure shown in FIG. 13 does not limit the node device 1300, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.

In an exemplary embodiment, a computer-readable storage medium is also provided, for example a memory including at least one piece of program code, where the at least one piece of program code can be executed by a processor in a terminal to perform the log storage method in the above embodiments. For example, the computer-readable storage medium includes a ROM (Read-Only Memory), a RAM (Random-Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, and the like.

In an exemplary embodiment, a computer program product or computer program is also provided, including one or more pieces of program code stored in a computer-readable storage medium. One or more processors of a node device can read the one or more pieces of program code from the computer-readable storage medium and execute them, so that the node device performs the log storage method in the above embodiments.

A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments can be implemented by hardware, or by a program instructing the relevant hardware. Optionally, the program is stored in a computer-readable storage medium; optionally, the storage medium mentioned above is a read-only memory, a magnetic disk, an optical disc, or the like.

The above are only optional embodiments of the present application and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the protection scope of the present application.