CN115774699A - Database shared dictionary compression method and device, electronic equipment and storage medium - Google Patents

Database shared dictionary compression method and device, electronic equipment and storage medium

Info

Publication number
CN115774699A
Authority
CN
China
Prior art keywords
data
dictionary
database
preset
page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310045920.4A
Other languages
Chinese (zh)
Other versions
CN115774699B (en)
Inventor
林科旭
张程伟
张皖川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Primitive Data Beijing Information Technology Co ltd
Original Assignee
Primitive Data Beijing Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Primitive Data Beijing Information Technology Co ltd
Priority to CN202310045920.4A
Publication of CN115774699A
Application granted
Publication of CN115774699B
Status: Active
Anticipated expiration

Abstract

An embodiment of the present application discloses a database shared dictionary compression method and apparatus, an electronic device, and a storage medium, relating to the technical field of data compression.

Description

Database shared dictionary compression method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of data compression technologies, and in particular, to a method and an apparatus for compressing a database shared dictionary, an electronic device, and a storage medium.
Background
With the continuous development of the information era, information data of all kinds grow exponentially, and whether data are being transmitted or stored during processing, compressing them effectively saves space and reduces cost. Database compression is a method of compressing the contents stored in a database to save space, and dictionary compression is currently the most widely used compression algorithm. However, database dictionary compression algorithms in the related art suffer from problems such as a low compression rate, or the dictionary being stored together with the data and therefore being difficult to manage.
Disclosure of Invention
The present application is directed to solving at least one of the problems in the prior art. Accordingly, the embodiments of the present application provide a database shared dictionary compression method and apparatus, an electronic device, and a storage medium, which can improve the compression rate of database data. At the same time, data pages are associated with dictionaries through a mapping relationship, so that the corresponding dictionary can be found quickly to compress and decompress data, thereby improving database performance and facilitating storage management of the dictionaries.
In a first aspect, an embodiment of the present application provides a database shared dictionary compression method, which is applied to a database, where the database includes multiple database tables, the database tables include multiple data pages, and the data pages include multiple data rows, and the method includes:
performing a write operation on the data page, the write operation writing data to a plurality of the data rows;
when the written data of the database table reach a preset threshold value, training at least one dictionary by using the written data; the data page is stored with first metadata, the data line is stored with second metadata, the first metadata is used for storing the mapping relation between the dictionary and at least one data page, and the second metadata is used for storing attribute information of the data line;
storing the trained dictionary into a dictionary file;
and selecting the corresponding dictionary from the dictionary file to compress the written data of the data line in the data page based on the mapping relation.
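The four steps of the first aspect can be sketched in miniature with Python's `zlib` preset-dictionary support; the function names and the zlib backend here are assumptions made for illustration, not the patent's actual coder (the description later discusses zstd/lz4-style algorithms).

```python
import zlib


def train_dictionary(samples, max_size=1024):
    """Toy 'training': keep the tail of the concatenated sample rows.

    zlib matches row bytes against the preset dictionary, so any content
    shared between the dictionary and a row improves that row's compression.
    """
    return b"".join(samples)[-max_size:]


def store_dictionary(path, dictionary):
    """Persist the trained dictionary into a separate dictionary file."""
    with open(path, "wb") as f:
        f.write(dictionary)


def compress_rows(rows, dictionary):
    """Compress each data row with the shared, page-mapped dictionary."""
    out = []
    for row in rows:
        c = zlib.compressobj(zdict=dictionary)
        out.append(c.compress(row) + c.flush())
    return out
```

A row compressed this way is recovered with `zlib.decompressobj(zdict=...)` holding the same dictionary, which is what makes keeping the dictionary in a findable, separate file essential.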
In some embodiments of the present application, after the written data of the database table reaches a preset threshold, training at least one dictionary using the written data, further including:
acquiring the uncompressed quantity of the uncompressed data pages in the database table;
when the uncompressed quantity reaches a preset quantity threshold value, taking the written data of the uncompressed data page as dictionary training data;
and inputting the dictionary training data into a dictionary generating model to generate a plurality of dictionaries, wherein the number of the dictionaries is a preset number.
In some embodiments of the application, the inputting the dictionary training data into a dictionary generation model to generate a plurality of dictionaries further includes:
generating the preset number and the size of the dictionary according to the dictionary training data and a preset compression rate;
inputting the dictionary training data into a dictionary generating model, and generating a plurality of dictionaries based on the preset number and the size of the dictionaries;
and generating the first metadata of the data page to the dictionary based on the mapping relation.
In some embodiments of the application, when the uncompressed quantity reaches a preset quantity threshold, taking the written data of the uncompressed data page as dictionary training data, further includes:
taking the written data of the uncompressed data page as initial dictionary training data;
and acquiring the dictionary training data from the initial dictionary training data according to a preset selection strategy.
In some embodiments of the present application, the obtaining the dictionary training data from the initial dictionary training data according to a preset selection policy further includes:
acquiring a training data threshold;
and randomly selecting or selecting data with corresponding quantity and size from the initial dictionary training data according to the training data threshold value to serve as the dictionary training data.
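A minimal sketch of such a selection strategy, assuming a random policy capped by the training-data threshold (the fixed seed and byte-count cap are illustrative choices, not specified by the text):

```python
import random


def select_training_data(initial_samples, threshold_bytes, seed=0):
    """Randomly pick rows until the training-data threshold is reached.

    Shuffles the candidate rows, then takes rows for as long as their
    total size stays within threshold_bytes.
    """
    rng = random.Random(seed)
    shuffled = list(initial_samples)
    rng.shuffle(shuffled)
    picked, total = [], 0
    for sample in shuffled:
        if total + len(sample) > threshold_bytes:
            break
        picked.append(sample)
        total += len(sample)
    return picked
```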
In some embodiments of the present application, after the selecting the corresponding dictionary from the dictionary file based on the mapping relationship to compress the write data of the data line in the data page, the method further includes:
if the compression ratio of the data page does not reach a preset compression threshold value, the written data of the data page is used as alternative data;
acquiring newly-added write data;
training the dictionary by using the newly added write data and the alternative data to update the dictionary of the data page;
and compressing the data page by using the updated dictionary.
In some embodiments of the present application, the first metadata is stored in a first preset location of the data page, and the first metadata includes: one or more of dictionary filename, file offset, dictionary length.
In some embodiments of the present application, the second metadata is stored at a second preset position of the data line, and the second metadata includes: line data length and/or line transaction information.
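As an illustration only, the two metadata records could be laid out with fixed-size `struct` formats; the field widths and the 16-byte file-name slot below are assumptions, not the patent's on-disk format:

```python
import struct

# First metadata: dictionary file name, file offset, dictionary length.
FIRST_META = struct.Struct("<16sQI")   # 16-byte name, u64 offset, u32 length
# Second metadata: row data length and row transaction information.
SECOND_META = struct.Struct("<IQ")     # u32 row length, u64 transaction id


def pack_first_metadata(dict_filename, offset, length):
    return FIRST_META.pack(dict_filename.encode().ljust(16, b"\0"),
                           offset, length)


def unpack_first_metadata(blob):
    name, offset, length = FIRST_META.unpack(blob)
    return name.rstrip(b"\0").decode(), offset, length


def pack_second_metadata(row_len, txn_id):
    return SECOND_META.pack(row_len, txn_id)
```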
In some embodiments of the present application, a mapping relationship between the dictionary and at least one of the data pages is generated according to a preset mapping rule, where the preset mapping rule includes a continuous mapping rule, a discontinuous mapping rule, or a content-related mapping rule;
when the preset mapping rule is a continuous mapping rule, selecting a first number of continuous data pages to be associated with the same dictionary;
when the preset mapping rule is a discontinuous mapping rule, selecting a discontinuous second number of data pages to be associated with the same dictionary;
and when the preset mapping rule is a content-related mapping rule, selecting a third number of data pages with write-in data having relevance to be associated with the same dictionary.
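The three preset mapping rules can be illustrated as functions from a page number to a dictionary id; the round-robin reading of "discontinuous" and the hash standing in for content relevance are assumptions for the sketch:

```python
def page_to_dict_id(page_no, rule, pages_per_dict=32, content_key=None):
    """Map a data page to a dictionary id under the three preset rules.

    continuous:    runs of consecutive pages share one dictionary;
    discontinuous: pages are striped across dictionaries round-robin;
    content:       pages whose write data are related (modeled here by a
                   hash of a content key) share one dictionary.
    """
    if rule == "continuous":
        return page_no // pages_per_dict
    if rule == "discontinuous":
        return page_no % pages_per_dict
    if rule == "content":
        return hash(content_key) % pages_per_dict
    raise ValueError(f"unknown mapping rule: {rule}")
```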
In some embodiments of the present application, the first metadata of the data pages and the second metadata of the data rows in each of the data pages are not compressed when the write data of the data pages in the database table is compressed.
In some embodiments of the present application, after the selecting the corresponding dictionary from the dictionary file based on the mapping relationship to compress the write data of the data line in the data page, the method further includes: decompressing the compressed data line to obtain the corresponding write-in data;
the decompression process comprises the following steps:
reading the first metadata of the data page to obtain the mapping relation;
searching the corresponding dictionary in the dictionary file by using the mapping relation;
decompressing the written data of the data line using the dictionary.
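The three decompression steps map directly to code. A sketch using `zlib` as a stand-in for the patent's coder, with the first metadata modeled as a plain dict holding the dictionary's offset and length in the dictionary file:

```python
import zlib


def decompress_row(page_first_metadata, compressed_row, dict_file_path):
    """Decompress one row following the three steps above.

    Reads the page's first metadata to locate the dictionary inside the
    dictionary file, loads that dictionary, then decompresses the row
    with it.
    """
    offset = page_first_metadata["offset"]
    length = page_first_metadata["length"]
    with open(dict_file_path, "rb") as f:
        f.seek(offset)
        dictionary = f.read(length)
    d = zlib.decompressobj(zdict=dictionary)
    return d.decompress(compressed_row) + d.flush()
```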
In a second aspect, an embodiment of the present application further provides a database shared dictionary compression apparatus, including:
a write module to perform a write operation on a data page, the write operation to write data into a plurality of data rows;
the training module is used for training at least one dictionary by utilizing the written data after the written data of the database table reach a preset threshold value;
the storage module is used for storing the trained dictionary into a dictionary file;
and the compression module is used for selecting the corresponding dictionary from the dictionary file based on the mapping relation to compress the written data of the data line in the data page.
In a third aspect, an embodiment of the present application further provides an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor implements the database shared dictionary compression method according to the embodiment of the first aspect of the present application when executing the computer program.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, where the storage medium stores a program, and the program is executed by a processor to implement the database shared dictionary compression method according to the embodiment of the first aspect of the present application.
The embodiments of the present application have at least the following beneficial effects. The embodiments provide a database shared dictionary compression method and apparatus, an electronic device, and a storage medium. First, a write operation is performed on a data page in a database table, writing data into data rows of the data page. After the written data reach a preset threshold, a dictionary is trained with the written data. First metadata are then stored in the data page to record the mapping relationship between the data page and the dictionary, and the trained dictionary is stored in an independent dictionary file. Finally, the corresponding dictionary is selected from the dictionary file according to the mapping relationship to compress the written data of the data rows of the data page. Because dictionary training uses written data that have reached the preset threshold, training speed and compression ratio are both improved; storing the dictionary in an independent file facilitates management, and associating data pages with dictionaries through the mapping relationship makes it easy to find the corresponding dictionary for compression, decompression, and the like.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The above and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart illustrating a database shared dictionary compression method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of step S102 in FIG. 1;
FIG. 3 is a schematic flow chart of step S203 in FIG. 2;
FIG. 4 is a diagram illustrating a first metadata mapping relationship, according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of step S202 in FIG. 2;
FIG. 6 is a schematic flow chart of step S402 in FIG. 5;
FIG. 7 is a schematic flow chart of FIG. 1 after step S104;
FIG. 8 is a schematic flow chart of FIG. 1 after step S104;
FIG. 9 is a diagram illustrating a mapping relationship provided by an embodiment of the present application;
FIG. 10 is a diagram of an apparatus for compressing a shared dictionary in a database according to an embodiment of the present application;
FIG. 11 is a flowchart of a database shared dictionary creation application provided by one embodiment of the present application;
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Reference numerals: writing module 100, training module 200, storage module 300, compression module 400, electronic device 1000, processor 1001, memory 1002.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative and are only for the purpose of explaining the present application and are not to be construed as limiting the present application.
In the description of the present application, it is to be understood that the positional descriptions, such as the directions of up, down, front, rear, left, right, etc., referred to herein are based on the directions or positional relationships shown in the drawings, and are only for convenience of description and simplification of description, and do not indicate or imply that the referred device or element must have a specific direction, be constructed and operated in a specific direction, and thus, should not be construed as limiting the present application.
In the description of the present application, "several" means one or more, and "a plurality" means two or more; "greater than", "less than", "exceeding", and the like are understood as excluding the stated number, while "above", "below", "within", and the like are understood as including it. If "first" and "second" are used, they serve only to distinguish technical features and are not to be understood as indicating or implying relative importance, the number of technical features indicated, or the precedence of the technical features indicated.
In the description of the present application, unless otherwise expressly limited, terms such as set, mounted, connected and the like should be construed broadly, and those skilled in the art can reasonably determine the specific meaning of the terms in the present application by combining the detailed contents of the technical solutions.
For a better understanding of the technical solutions provided in the present application, the terms appearing herein are described accordingly:
data compression: the method is a technical method for reducing the data volume to reduce the storage space and improve the transmission, storage and processing efficiency on the premise of not losing information. Or reorganize the data according to a certain algorithm to reduce the redundancy and storage space of the data.
A database: a database is a "warehouse that organizes, stores, and manages data according to a data structure," which is an organized, sharable, uniformly managed collection of large amounts of data that is stored in a computer for a long period of time.
A database table: the Table, a database object containing all data in the database, is an object used for storing data in the database, is a set of structured data, and is the basis of the whole database system.
Data page: the Page is the basic unit for exchanging data between disk and memory in a database, and also the basic unit for managing storage space in the database; it represents the unit in which the database writes data to disk, generally 8 KB, 16 KB, and so on.
Metadata: also called intermediary data or relay data, metadata is descriptive information about data and information resources, mainly information describing data attributes, used to support functions such as indicating storage locations, historical data, resource searching, and file recording.
Compression ratio: the ratio of the data size before compression to the size after compression; the higher the compression ratio, the smaller the data after compression.
Disk IO: the Input and Output of a disk, referring to the speed at which bytes are read and written, i.e. the read-write capability of a hard disk.
Persistence: a mechanism for transitioning program data between a persistent state and a transient state by saving data (e.g., objects in memory) to a permanently storable storage device (e.g., a disk). The main application of persistence is storing in-memory objects in a database, a disk file, an XML data file, and so on; that is, persisting transient data (such as data in memory, which cannot be saved permanently) into persistent data (such as data in a database, which can be saved permanently).
Caching: a cache is memory placed in front of main memory that can exchange data with the CPU at high speed, and is therefore very fast.
Byte: abbreviated as B, a unit used in computer information technology to measure storage capacity; in some programming languages it also denotes a data type. One byte stores an 8-bit unsigned number with a value range of 0-255. Here 1 Byte = 8 bits, 1 KB (Kilobyte) = 1024 B, 1 MB (Megabyte) = 1024 KB, and 1 GB (Gigabyte) = 1024 MB.
Database compression is a method of compressing the main contents stored in a database to save space, and compressing data in a database has at least the following advantages. 1. Reduced storage cost: research shows that the storage cost of data is in practice higher than that of processors and memory, and the growth of storage demand outpaces that of computation. 2. Reduced disk IO: in many business scenarios, the performance bottleneck of a database lies in disk IO; if the originally uncompressed IO data are replaced with compressed data, the IO volume can be greatly reduced, avoiding performance bottlenecks and improving database performance. 3. Improved cache hit rate: modern servers have relatively high storage hardware configurations, but memory is always limited because not all user data can be cached in memory; if the data in the cache are compressed, more data can be cached, increasing the cache hit rate and thereby improving performance.
In the related art, database row-level compression technology mainly includes table-level dictionaries and page-level dictionaries. The table-level dictionary compression algorithm, represented mainly by DB2, creates a corresponding static dictionary for each database table and stores it in a hidden data row of that table; the static dictionary is used to compress the data of every data row in the table. However, the dictionary is never updated after creation, while data subsequently written to the table must still be compressed with it, so the compression rate declines as the written data grow; moreover, the dictionary size of each table is fixed and cannot adapt to tables of different sizes. The page-level dictionary compression algorithm, represented mainly by Oracle and DB2, creates a dictionary per data page and compresses and decompresses the data of each data row in the page through that dictionary. Because a dictionary must be stored in every data page, its space overhead is large; for example, an 8 KB data page usually needs 512 B to 1 KB just to store the dictionary, which reduces the compression rate.
Based on this, the embodiment of the application provides a database shared dictionary compression method, device, electronic device and storage medium, which can improve the compression rate and performance of database data and facilitate the lookup and management of a dictionary.
Referring to fig. 1, the embodiment of the present application provides a database shared dictionary compression method, which is applied to a database, and it should be understood that the database includes a plurality of database tables, the database tables include a plurality of data pages, and the data pages include a plurality of records, i.e., data rows, and the database shared dictionary compression method includes, but is not limited to, the following steps S101 to S104.
In step S101, a write operation is performed on a data page.
In some embodiments, the write operation is used to write data into multiple data rows, i.e. to store the write data into data pages of a database table in a database. Specifically, the write operation may be triggered by a user's input request, by the system's cache data, or by a preset time of a system timer, which is not limited in this embodiment. Further, in some embodiments, when performing a write operation on a data page, it is first necessary to determine whether the data page has enough remaining space. For example, if 16 KB of write data is to be written into a 32 KB data page into which 24 KB of data has already been written, the remaining space of the data page is 8 KB, smaller than the 16 KB of write data, so it is determined that the data page does not have enough remaining space.
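The remaining-space check in the example reduces to a one-line comparison:

```python
def has_room(page_size, bytes_written, incoming):
    """True when the data page still has space for the incoming write."""
    return page_size - bytes_written >= incoming
```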
And S102, when the written data in the database table reach a preset threshold value, training at least one dictionary by using the written data.
In some embodiments, the written data in the database table reach a preset threshold when, for example, the data amount of the written data reaches the amount stored in a preset number of data pages, or reaches a compressed-data-amount threshold adaptively selected by a database algorithm. Specifically, every compression algorithm, whether zstd, lz4, or another, has trade-offs: a high compression ratio (saving more space after compression) usually comes with slow decompression, and vice versa, so it is difficult to be both fast and highly compressing. An adaptive database algorithm therefore selects according to the data entering the system and the specific conditions. For example, the database file may be divided into several blocks, each compressed with a suitable algorithm; or a compression format may be chosen that allows splitting and recompression, such as a splittable container file format like Avro or Parquet, which can be combined with other compression formats to achieve the desired speed and compression ratio. Different compression algorithms thus correspond to different compressed-data-amount thresholds under adaptive selection. In some embodiments, at least one dictionary is trained using the written data.
In some embodiments, each data page stores first metadata and each data row stores second metadata. Specifically, the first metadata is used to store mapping relationships between a dictionary and a preset number of data pages, for example, data rows of ten data pages share one dictionary, then the ten data pages all store mapping relationships corresponding to the dictionary, and the second metadata is used to store attribute information of corresponding data rows.
And step S103, storing the trained dictionary into a dictionary file.
In some embodiments, after dictionary training is completed, the dictionary is persistently stored in a separate dictionary file. That is, the dictionary is stored separately from the database: it is kept neither in the database table nor in the data page, but in an independent dictionary file. Further, the dictionary file is cached in memory. It can be understood that a cache is a memory chip on the hard disk controller with an extremely fast access speed, serving as a buffer between the internal storage of the hard disk and the external interface. Because the internal and external data transfer speeds of a hard disk differ, the cache acts as a buffer; its size and speed are important factors directly affecting the hard disk's transfer speed, and a good cache can greatly improve the disk's overall performance. When the hard disk accesses fragmented data, data must be exchanged continuously between the disk and memory; with a large cache, the fragmented data can be staged in the cache, improving the data transfer speed. Hard disk caching mainly serves three functions. The first is read-ahead: when the hard disk is instructed by the processor to start reading data, the control chip on the disk directs the head to read the data of the next cluster or clusters into the cache; when those data are subsequently needed, they are transferred directly from the cache to memory without another disk read, and because the cache is far faster than the head's read-write speed, performance improves significantly. The second is caching write operations, and the third is temporarily storing recently accessed data.
Therefore, in this embodiment, a compression or decompression operation requires only reading the in-memory dictionary file and querying the corresponding dictionary, so that data can be compressed or decompressed rapidly, effectively improving database performance.
In some embodiments, since a database table includes multiple data pages, it may be preset that different numbers of data pages share one dictionary, so that one database table usually corresponds to multiple dictionaries, and one dictionary file may store multiple dictionaries, for example, a database table stores 1024 data pages of write data, and every 32 pages of data pages share one dictionary, so that the database table corresponds to a dictionary file in which 1024/32=32 dictionaries should be stored. It can be understood that each database table in the database corresponds to an independent dictionary file, the name of the dictionary file is consistent with that of the database table, and all dictionaries of the database table are arranged in the dictionary file in sequence.
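The dictionary-file arithmetic above (1024 data pages, every 32 pages sharing one dictionary, hence 1024/32 = 32 dictionaries stored in sequence) generalizes to:

```python
def dictionary_count(total_pages, pages_per_dict):
    """Number of dictionaries a table's dictionary file must hold."""
    return -(-total_pages // pages_per_dict)  # ceiling division


def dict_index_for_page(page_no, pages_per_dict):
    """Position, within the table's dictionary file, of the dictionary
    covering a given data page (assuming the continuous mapping rule)."""
    return page_no // pages_per_dict
```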
In some embodiments, once the dictionary has been persisted to the dictionary file, the dictionary training is considered complete.
And step S104, selecting a corresponding dictionary from the dictionary file to compress the written data of the data lines in the data page.
In some embodiments, according to the first metadata stored in the data page, based on the mapping relation stored in the first metadata, the corresponding dictionary is selected from the dictionary file to compress the write data of the data line in the data page. It can be understood that, in order to compress the written data of the data page, the corresponding dictionary is first searched for by the first metadata stored in the data page, and the compression is performed according to the dictionary.
Write data are written into data rows through a write operation. After the written data reach a preset threshold, a dictionary is trained with them, and the trained dictionary is stored in an independent dictionary file. The mapping relationship between the dictionary and the data page is recorded by storing first metadata in the data page, so that when write data are compressed, the corresponding dictionary is found through the mapping relationship. The dictionary is therefore easy to look up, store, and manage, and training it only once the preset threshold is reached effectively improves the training speed.
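The benefit of a shared dictionary on short rows can be demonstrated with `zlib`'s preset-dictionary support (a stand-in for the zstd/lz4-style coders discussed later): a short row barely compresses alone, but shrinks sharply when the dictionary already holds its common content.

```python
import zlib


def compressed_size(row, dictionary=None):
    """Compressed size of one row, with or without a preset dictionary."""
    c = zlib.compressobj(zdict=dictionary) if dictionary else zlib.compressobj()
    return len(c.compress(row) + c.flush())
```

With a dictionary containing the row's repeated substrings, the deflate stream can encode most of the row as a back-reference into the preset window instead of literals.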
Referring to fig. 2, in some embodiments of the present application, the step S102 may further include, but is not limited to, the following steps S201 to S203.
In step S201, the uncompressed number of uncompressed data pages in the database table is obtained.
In some embodiments, obtaining the uncompressed data pages in the database table may be triggered by a compression request, after which the number of uncompressed data pages is obtained, for example by counting: the uncompressed quantity might be the 16 or 32 data pages in the database table that have not yet been compressed.
Step S202, when the uncompressed quantity reaches a preset quantity threshold value, writing data of the uncompressed data page is used as dictionary training data.
In some embodiments, the preset number threshold may be set to 20 data pages. It can be understood that when the write data of one data page is 16 KB, the write data of the 20 data pages at the preset number threshold is 320 KB; that is, the preset threshold of write data is 320 KB, so the write data of the 20 uncompressed data pages can be used as the dictionary training data.
Step S203, inputting dictionary training data into a dictionary generation model to generate a plurality of dictionaries.
In some embodiments, a plurality of dictionaries are generated by the dictionary generation model, the number of generated dictionaries being a preset number that may be configured by the user; for example, different dictionaries are generated for different compression algorithms, and the dictionary with the highest compression rate is kept. It can be understood that the dictionary generation model is a neural network model that compresses according to the field frequency of the data in the database table; the compression algorithm may be zstd, lzw, lz4, and so on, which is not limited by the present application. Specifically, consider the write data of the data page in the database table shown in Table 1 below. As can be seen from Table 1, Boo appears three times in the NAME column; Street appears 7 times in the ADDRESS column; in the CITY column, San appears 7 times with the highest frequency, Francisco three times, and Jose four times; and CA 9 appears 7 times in the STATE ZIP column. The dictionary generation model therefore performs compression training by counting the frequency of each field, yielding the dictionary of Table 2. In the dictionary, each field that appears multiple times in the original data page is represented by an index: specifically, 1 represents Boo, 2 represents Street, 3 represents San, 4 represents Francisco, 5 represents Jose, and 6 represents CA 9. With this dictionary, the original data page can be represented after compression as shown in Table 3, thereby compressing the data of the database.
TABLE 1
Figure SMS_1
TABLE 2
Figure SMS_2
TABLE 3
Figure SMS_3
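The frequency-based substitution illustrated by Tables 1-3 can be sketched in a few lines of Python; the function names and the minimum-repeat cutoff below are illustrative, not part of the patented method:

```python
from collections import Counter

def build_dictionary(rows, min_count=2):
    """Map every token that repeats across rows to a small integer index,
    most frequent tokens first (as in Table 2)."""
    counts = Counter(tok for row in rows for tok in row)
    repeated = [tok for tok, c in counts.most_common() if c >= min_count]
    return {tok: i + 1 for i, tok in enumerate(repeated)}

def encode_row(row, dictionary):
    """Replace dictionary tokens with their index; rare tokens stay as-is."""
    return [dictionary.get(tok, tok) for tok in row]

def decode_row(encoded, dictionary):
    """Invert the substitution to recover the original row."""
    reverse = {i: tok for tok, i in dictionary.items()}
    return [reverse[tok] if isinstance(tok, int) else tok for tok in encoded]
```

On rows resembling Table 1, `build_dictionary` assigns indices to repeated fields such as Street and San, and `encode_row`/`decode_row` round-trip the data as in Table 3.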
The number of uncompressed data pages in the database table is monitored and, once it reaches the preset number threshold, the written data of those uncompressed data pages is used as dictionary training data for the dictionary generation model, which trains a preset number of dictionaries according to the compression algorithm. Training on selected partial data effectively improves training speed, and comparing the resulting dictionaries allows the one with the highest compression rate to be selected for compressing subsequent database data, achieving the optimal compression rate.
Referring to fig. 3, in some embodiments of the present application, the step S203 may further include, but is not limited to, the following steps S301 to S303.
Step S301, generating a preset number and a dictionary size according to the dictionary training data and a preset compression rate.
In some embodiments, setting the preset compression rate guarantees the compression performance of the dictionary. It can be understood that, in the process of generating the preset number and the dictionary size from the dictionary training data and the preset compression rate, both values may be configured by the user or selected adaptively by a database algorithm so as to achieve the optimal compression rate, which is not limited in the present application.
Specifically, if the size of the dictionary is configured by the user, the startup configuration file of the database is modified, for example the FIRST_DICT_ROW_NUM parameter. By default FIRST_DICT_ROW_NUM=1000000, meaning that one million rows share one dictionary, and the user configures the dictionary size by modifying this value. If the user does not configure the size, it is selected adaptively by a database algorithm, for example choosing how many data rows share one dictionary according to the data type of the written data, so as to achieve the optimal compression rate.
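The rows-per-dictionary choice described above (user configuration first, otherwise adaptive selection by data type) might look like the following sketch; the type keys and the adaptive values are assumptions for illustration, only the one-million-row default comes from the text:

```python
def rows_per_dictionary(configured=None, data_type=None):
    """How many data rows share one dictionary.

    A user-configured value wins; otherwise the value adapts to the data
    type of the written data (the adaptive values here are illustrative).
    """
    if configured is not None:
        return configured
    adaptive = {"text": 500_000, "numeric": 2_000_000}
    return adaptive.get(data_type, 1_000_000)  # default: one million rows
```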
Step S302, inputting dictionary training data into a dictionary generation model, and generating a plurality of dictionaries based on a preset number and the size of the dictionaries.
In some embodiments, after the dictionary training data is input into the dictionary generation model, dictionaries are generated according to the preset number and the dictionary size so that the generated dictionaries satisfy the preset condition; for example, the preset number is 16 and the dictionary size may be 64KB, 4MB, and so on. Generally speaking, the larger the dictionary, the better the compression effect and the smaller the compressed file, but the slower the compression and the more memory and processor resources occupied during compression; the smaller the dictionary, the faster the compression and the fewer resources occupied, but the worse the compression effect and the larger the compressed file. The dictionary size should therefore be chosen within a preset range so that the optimal compression rate is achieved without occupying excessive resources.
Step S303, generating first metadata mapping the data page to the dictionary based on the mapping relation.
In some embodiments, the dictionary is stored in an independent dictionary file after training is completed. To compress or decompress the written data of a data page with the dictionary, the corresponding dictionary must be found in the dictionary file through a mapping relationship, and this mapping relationship is stored in the first metadata of the data page; therefore, when the dictionary is generated, the first metadata mapping the data page to the dictionary is generated synchronously. Specifically, the first metadata is stored at a first preset position of each data page, for example the tail or the head of the data page, and may include: the dictionary file name, the file offset, and the dictionary length. It can be understood that the dictionary file name is generally consistent with the name of the database table, one database table corresponds to one dictionary file, a plurality of dictionaries are stored in one dictionary file, the file offset represents the offset of the dictionary within the dictionary file, and the dictionary length records the length of the dictionary.
Referring to fig. 4, the left side of the drawing represents one database table in the database, which contains n data pages. An enlarged data page, page 1, is shown in the drawing: it stores the header information page header, free space for storing write data, and a first metadata part meta stored at the tail of page 1. Specifically, the first metadata part meta stores a dictionary file name dict file, a file offset dict offset, and a dictionary length dict len. With this information, the dictionary file part on the right side of the drawing can be located; as can be seen from the drawing, the compression dictionary corresponding to data page 1 is the first dictionary dict 0 stored in the dictionary file part, so the dictionary corresponding to data page 1 is found for compression or decompression.
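The page-tail metadata and the lookup it enables can be sketched as follows; the field and function names are illustrative, and a real engine would read the metadata out of the on-disk page image rather than a Python object:

```python
from dataclasses import dataclass

@dataclass
class PageDictMeta:
    """First metadata stored at a data page's tail (fig. 4)."""
    dict_file: str  # dictionary file name, usually the table name
    offset: int     # byte offset of this page's dictionary in the file
    length: int     # length of the dictionary in bytes

def load_dictionary(meta: PageDictMeta) -> bytes:
    """Locate and read the raw dictionary bytes the page's metadata points at."""
    with open(meta.dict_file, "rb") as f:
        f.seek(meta.offset)
        return f.read(meta.length)
```

Given a page's `PageDictMeta`, `load_dictionary` seeks directly to the right dictionary inside the shared dictionary file, with no decompression needed to resolve the mapping.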
After the dictionary training data is input into the dictionary generation model, the preset number and the dictionary size are generated according to the dictionary training data and the preset compression rate, and a plurality of dictionaries are generated accordingly. Selecting the dictionary size and preset number through user configuration or adaptive selection by a database algorithm helps achieve the optimal compression rate. After training, the dictionaries are stored in an independent dictionary file, and first metadata storing the mapping relation between each data page and its dictionary is generated synchronously and stored in the data page, which facilitates storage management of the dictionaries and enables rapid positioning and lookup, thereby realizing compression and decompression of data written in the database.
Referring to fig. 5, in some embodiments of the present application, the step S202 may further include, but is not limited to, the following steps S401 to S402.
In step S401, the written data of the uncompressed data page is used as initial dictionary training data.
In some embodiments, the written data of the uncompressed data page is usually not completely random but has a certain correlation: for example, a nationality field in the database may mostly be China; an age field mostly falls in the range of 0 to 80; a gender field is male or female. The written data of the uncompressed data page is therefore used as the initial dictionary training data, and the performance of the trained dictionary is then improved through further analysis and processing.
Step S402, dictionary training data are obtained from the initial dictionary training data according to a preset selection strategy.
In some embodiments, relevant written data is selected from the initial dictionary training data according to a preset selection strategy as the dictionary training data. It can be understood that when 10GB of written data in a database table share one dictionary, that is, all 10GB are compressed or decompressed with the same dictionary, the database does not need to train the dictionary on the complete 10GB; instead, the 10GB of written data is sampled, and a portion reaching a preset threshold is selected for training.
By using the written data of the uncompressed data pages as the initial dictionary training data and then selecting from it according to the preset selection strategy, the dictionary can be trained on the selected partial data without loss of compression ratio, improving the training performance of the dictionary.
Referring to fig. 6, in some embodiments of the present application, the step S402 may further include, but is not limited to, the following steps S501 to S502.
Step S501, a training data threshold is acquired.
In some embodiments, if a large amount of written data is to be compressed or decompressed with the same dictionary, only a part of the written data needs to be selected for training the shared dictionary that will compress or decompress all of it. It can be understood that the selected partial write data must reach a training data threshold; otherwise, too little training data would degrade the compression performance of the trained dictionary. Specifically, when 10GB of write data share one dictionary, the training data threshold may be set to 128MB or 256MB, which is not limited in this embodiment.
Step S502, selecting data of a corresponding quantity and size from the initial dictionary training data, either randomly or at a preset position, as the dictionary training data.
In some embodiments, after the training data threshold is obtained, write data of the corresponding quantity and size may be selected from the initial dictionary training data either randomly or from a preset position. Specifically, if the obtained training data threshold is 128MB, 128MB of write data may be selected at random from the 10GB of write data, or the first contiguous 128MB or the second contiguous 128MB may be selected, which is not limited in this application.
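The two selection strategies (random sampling vs. a contiguous run from a preset position) can be sketched as one function; the signature and strategy names are assumptions for illustration:

```python
import random

def sample_training_data(pages, threshold_bytes, strategy="random", seed=None):
    """Select just enough page write-data to reach the training threshold.

    strategy="random" visits pages in shuffled order; strategy="head" takes
    a contiguous run from a preset position (here, the start of the table).
    """
    order = list(range(len(pages)))
    if strategy == "random":
        random.Random(seed).shuffle(order)
    sample, total = [], 0
    for i in order:
        if total >= threshold_bytes:  # threshold reached: stop sampling
            break
        sample.append(pages[i])
        total += len(pages[i])
    return sample
```

Either strategy stops as soon as the accumulated sample reaches the threshold, so a 10GB table contributes only about 128MB of training input.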
By obtaining the training data threshold and selecting write data of the corresponding quantity and size, by random sampling or from a preset position, as dictionary training data, the dictionary can be trained and the remaining write data sharing that dictionary can subsequently be compressed or decompressed with it, improving the training speed of the dictionary without loss of compression ratio.
Referring to fig. 7, in some embodiments of the present application, after step S104, if the compression ratio of the data page does not reach the preset compression threshold, the written data of the data page is taken as alternative data; the method may then further include, but is not limited to, the following steps S601 to S603.
In step S601, new write data is obtained.
In some embodiments, new write data is continuously written into the database through a write operation, and the new write data may be obtained through a trigger condition such as a user request or a database command.
Step S602, the dictionary is trained by using the newly added write data and the alternative data to update the dictionary of the data page.
In some embodiments, if the compression ratio of the data page does not reach the preset compression threshold, for example 80%, that is, the ratio achieved during compression is less than 80%, the compression is considered to have failed, and the written data of the data page is used as alternative data. The dictionary is then retrained with the newly added write data and the alternative data, thereby updating the dictionary of the data page whose compression failed, and in particular the data associated with the first metadata stored in that data page.
In step S603, the data page is compressed using the updated dictionary.
In some embodiments, the corresponding updated dictionary is found in the dictionary file through the mapping relation stored in the first metadata and used to compress the data page. If the compressed data page reaches the preset compression threshold, the compression is considered successful; if not, the written data of the data page continues to serve as alternative data, newly added write data is obtained again for training, and the above process repeats until the compressed data page reaches the preset compression threshold.
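The retry cycle of steps S601-S603 can be sketched as a loop; `train` and `compress` are placeholders standing in for the database's actual dictionary-training and compression routines:

```python
def compress_with_retraining(page, new_writes, train, compress, target=0.8):
    """Keep retraining until the page's compression ratio reaches `target`.

    `new_writes` yields batches of newly added write data (step S601);
    `train(samples)` builds a dictionary and `compress(data, dictionary)`
    applies it -- both are stand-ins for the real database routines.
    """
    candidates = [page]                           # failed page = alternative data
    for batch in new_writes:                      # S601: obtain new write data
        dictionary = train(candidates + [batch])  # S602: retrain/update dictionary
        compressed = compress(page, dictionary)   # S603: recompress the page
        if 1 - len(compressed) / len(page) >= target:
            return compressed                     # preset threshold reached
        candidates.append(batch)                  # still failing: keep as data
    return None                                   # no batch reached the target
```

Each failed round folds the new batch into the candidate pool, so the dictionary sees progressively more of the workload before the next attempt.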
In some embodiments, when the write data of a data page in the database table is compressed, neither the first metadata of the data page nor the second metadata of each data row is compressed. Specifically, the first metadata is stored in each data page; when the data page is read, the corresponding dictionary can be found through the mapping relation stored in the first metadata, and the dictionary information can be obtained without any decompression operation, so that compression or decompression of the written data can proceed, which benefits database performance. It can be understood that the written data stored in each data row comprises user data and second metadata; the second metadata stores attribute information of the data row, including the row data length and/or row transaction information, where the row data length represents the length of the data row and the row transaction information represents the sequence of operations. When a data row is compressed, mainly the user data is compressed; the second metadata need not be compressed, which reduces the number of decompressions and improves database performance. Because the second metadata occupies little space, usually only 12B or 24B, its effect on the compression ratio is almost negligible.
The written data of a data page that fails to reach the preset compression threshold is used as alternative data and, together with newly added write data in the database, serves as dictionary training data to retrain the dictionary and update the associated information; the updated dictionary is then used to compress the data. During compression, the first metadata stored in the data page and the second metadata stored in the data rows are not compressed, so the corresponding dictionary can be found without decompression, reducing the number of decompressions, effectively improving database performance, and allowing the database to reach the optimal compression rate.
Referring to fig. 8, in some embodiments of the present application, after step S104, compressed write data must be decompressed after being read to obtain the corresponding write data; the decompression process includes, but is not limited to, the following steps S701 to S703.
Step S701, reading the first metadata of the data page to obtain a mapping relationship.
In some embodiments, the mapping relationship between the data page and the dictionary is obtained by reading the first metadata of the data page. Specifically, after the data page is read, whether it is compressed is first determined from the written data of the data row; if the written data of the data row is compressed, the mapping relationship between the data page and the dictionary is obtained from the first metadata stored in the data page where that data row is located.
In step S702, the mapping relation is used to search the corresponding dictionary in the dictionary file.
In some embodiments, the mapping relationship is embodied by one or more of the dictionary file name, the offset of the dictionary in the dictionary file, and the dictionary length. It can be understood that searching the dictionary file with the mapping relationship means first locating the dictionary file by its name, then using the offset and/or the dictionary length to locate the specific dictionary within it.
In step S703, the dictionary is used to decompress the written data of the data page.
In some embodiments, the dictionary is used to decompress the compressed data page to obtain the corresponding write data; specifically, the read data row is decompressed. It can be understood that the dictionary may be cached in memory to further improve the performance of compression or decompression, and the encoder used during compression or the decoder used during decompression may likewise be cached in memory.
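Steps S701-S703, together with the in-memory dictionary cache mentioned above, might be sketched as follows; zlib's preset-dictionary support stands in for the database's actual codec, and all names are illustrative:

```python
import zlib
from functools import lru_cache

@lru_cache(maxsize=64)
def cached_dictionary(dict_file, offset, length):
    """Cache dictionary bytes in memory so repeated reads skip the file I/O."""
    with open(dict_file, "rb") as f:
        f.seek(offset)
        return f.read(length)

def decompress_row(compressed, dict_file, offset, length):
    """S701-S703: resolve the mapping (the three arguments come from the
    page's first metadata), fetch the cached dictionary, decompress the row."""
    zdict = cached_dictionary(dict_file, offset, length)
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(compressed) + d.flush()
```

Because `cached_dictionary` is memoized on the (file, offset, length) triple, every row of a page that shares one dictionary hits the cache after the first lookup.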
When data is read from a data row of a data page in a database table and the written data of that row is determined to be compressed, the mapping relation between the data row and the dictionary is obtained from the first metadata stored in the corresponding data page, and the corresponding dictionary in the dictionary file is quickly located and retrieved based on that mapping relation, so that the data row can be decompressed to obtain the written data; caching the dictionary in memory effectively improves performance.
In some embodiments of the present application, the mapping between each dictionary in the dictionary file and at least one data page is generated according to a preset mapping rule, specifically, the preset mapping rule includes a continuous mapping rule, a discontinuous mapping rule or a content-related mapping rule.
In some embodiments, when the preset mapping rule is a continuous mapping rule, a first number of continuous data pages are associated with the same dictionary. For example, with the first number set to 128, every 128 continuous data pages share one dictionary: data pages 0 to 127 share the first dictionary, data pages 128 to 255 share the second dictionary, and so on. The same dictionary is used to compress or decompress the 128 continuous data pages, and the corresponding mapping relationship is stored in each data page through the first metadata.
In some embodiments, when the preset mapping rule is a non-continuous mapping rule, a second number of non-continuous data pages are associated with the same dictionary. For example, with the second number set to 128, 128 non-continuous data pages such as 0, 2, 4, 5, 6, 9, 12, 16, 32 … may share one dictionary. Specifically, referring to the mapping relationship diagram shown in fig. 9, the first data page 1 and the third data page 3 share the same dictionary dict 1, while the second data page 2, the fourth data page 4, and the fifth data page 5 share the same dictionary dict 2; the mapping relationship stored in each page's first metadata is used to quickly locate the corresponding dictionary for compressing or decompressing data.
In some embodiments, when the preset mapping rule is a content-related mapping rule, a third number of data pages whose written data are correlated are associated with the same dictionary. For example, with the third number set to 128, 128 data pages with a correlation between them are selected, such as pages containing the same fields. Specifically, an enterprise database may contain a table of employee information in which each data page includes the same field data, such as an age field and a gender field; such pages can be regarded as correlated data pages and can share one dictionary for compression or decompression, improving database performance and dictionary training efficiency.
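The three mapping rules reduce to a page-number-to-dictionary function; a minimal sketch, where the group size and table contents are illustrative:

```python
def dict_index_continuous(page_no, group_size=128):
    """Continuous rule: every `group_size` consecutive pages share one
    dictionary, so pages 0-127 map to dictionary 0, 128-255 to 1, etc."""
    return page_no // group_size

def dict_index_explicit(page_no, mapping):
    """Non-continuous / content-related rules: an explicit page->dictionary
    table, as would be recorded in each page's first metadata."""
    return mapping[page_no]
```

The continuous rule needs no stored table at all, while the other two rules pay for per-page metadata in exchange for grouping pages by actual content similarity.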
An embodiment of the present invention further provides a compression apparatus for a database shared dictionary that can implement the compression method described above. As shown in fig. 10, in some embodiments of the present application, the compression apparatus comprises a writing module 100, a training module 200, a storage module 300, and a compression module 400. Specifically, the writing module 100 is configured to perform a write operation on a data page, the write operation writing data into a plurality of data rows; the training module 200 is configured to train at least one dictionary using the written data when the written data of the database table reaches a preset threshold; the storage module 300 is configured to persistently store the trained dictionary in the dictionary file; and the compression module 400 is configured to select the corresponding dictionary from the dictionary file, based on the mapping relationship, to compress the write data of the data rows in the data page.
Referring to fig. 11, in some embodiments of the present application, the object of compression is each data row in a data page, so the data rows in the same data page share a dictionary. Specifically, in fig. 11.1, an empty database table is created and no data has been written. In fig. 11.2, some data has been written into the database table, but the preset threshold for creating a dictionary has not yet been reached; for example, with the preset threshold set to 128MB, the volume of written data must reach 128MB before a new dictionary is created. In fig. 11.3, data continues to be written and the data volume in the database table reaches the preset threshold for creating the dictionary. In fig. 11.4.1, the written data is used as dictionary training data and input into the dictionary generation model; a dictionary is trained according to the preset dictionary size and number, persistently stored in a separate dictionary file, and optionally cached in memory. In fig. 11.4.2, the data pages in the database table are compressed using the dictionary trained in fig. 11.4.1. Specifically, some data pages are compressed successfully, for example data page 1, data page 3, and data page 128, while some are not, for example data page 2. It can be understood that compression fails when the dictionary's compression ratio for a data page does not reach the preset compression threshold; for example, with the preset compression threshold set to 80%, a failure means the compression ratio of data page 2 is below 80%, so subsequently added write data must be used as new dictionary training data, the dictionary retrained into the dictionary file, and the page recompressed with the new dictionary.
The specific implementation of the database shared dictionary compression apparatus of this embodiment is substantially the same as the specific implementation of the database shared dictionary compression method, and is not described in detail herein.
Fig. 12 shows an electronic device 1000 provided in an embodiment of the present application. The electronic device 1000 includes: a processor 1001, a memory 1002, and a computer program stored in the memory 1002 and executable on the processor 1001, the computer program being operable to perform the database shared dictionary compression method described above.
The processor 1001 and the memory 1002 may be connected by a bus or other means.
The memory 1002, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs and non-transitory computer-executable programs, such as the database shared dictionary compression method described in the embodiments of the present application. The processor 1001 implements the method by running the non-transitory software programs and instructions stored in the memory 1002.
The memory 1002 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data for performing the database shared dictionary compression method described above. Further, the memory 1002 may include high-speed random access memory and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state memory device. In some embodiments, the memory 1002 may optionally include memory located remotely from the processor 1001, and such remote memory may be connected to the electronic device 1000 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the database shared dictionary compression method described above are stored in the memory 1002 and, when executed by the one or more processors 1001, perform the method, for example method steps S101 to S104 in fig. 1, method steps S201 to S203 in fig. 2, method steps S301 to S303 in fig. 3, method steps S401 to S402 in fig. 5, method steps S501 to S502 in fig. 6, method steps S601 to S603 in fig. 7, and method steps S701 to S703 in fig. 8.
The embodiment of the application further provides a storage medium, which is a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the database shared dictionary compression method is implemented. The memory, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs and non-transitory computer-executable programs. Further, the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and such remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
According to the compression method, apparatus, electronic device, and storage medium of the database shared dictionary, a write operation is performed on a data page in the database table, writing data into the data rows of the page; after the written data reaches the preset threshold, the dictionary is trained with the written data; first metadata recording the mapping relation between the data page and the dictionary is then stored in the data page, and the trained dictionary is stored in an independent dictionary file; finally, the corresponding dictionary is selected from the dictionary file according to the mapping relation to compress the written data of the data rows of the page. Compression and decompression are performed at the row granularity of data rows, and the first metadata of the data page is kept uncompressed during compression, which effectively reduces the number of decompressions and improves database performance. Training uses only a small amount of write data reaching the preset threshold, and the size and number of shared dictionaries can be configured by the user or selected adaptively by a database algorithm, which helps achieve the optimal compression rate while keeping dictionary training efficient. Storing the dictionaries in an independent file and caching them in memory facilitates management and lookup, so that the corresponding dictionary can be quickly located, based on the mapping relation, for compression and decompression.
The above-described apparatus embodiments are merely illustrative; the units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, or suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is known to those skilled in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media.
It should also be appreciated that the various implementations provided in the embodiments of the present application can be combined arbitrarily to achieve different technical effects. While the present application has been described with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (14)

1. A database shared dictionary compression method, applied to a database, wherein the database comprises a plurality of database tables, each database table comprises a plurality of data pages, and each data page comprises a plurality of data rows; the method comprises the following steps:
performing a write operation on the data page, the write operation being used to write data into a plurality of the data rows;
when the written data of the database table reaches a preset threshold value, training at least one dictionary by using the written data; wherein first metadata is stored in the data page and second metadata is stored in the data row, the first metadata being used to store a mapping relationship between the dictionary and at least one data page, and the second metadata being used to store attribute information of the data row;
storing the trained dictionary into a dictionary file;
and selecting, based on the mapping relationship, the corresponding dictionary from the dictionary file to compress the written data of the data rows in the data page.
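The flow of claim 1 can be sketched with Python's standard-library `zlib`, whose preset-dictionary support stands in for a trained dictionary. The write threshold, the in-memory page layout, and the concatenation-based "trainer" below are all illustrative assumptions, not the patented implementation:

```python
import zlib

PRESET_THRESHOLD = 3        # assumed write threshold (rows) that triggers training
pages = {1: []}             # page id -> written rows (bytes)
page_dict_map = {}          # "first metadata": page id -> dictionary id
dictionaries = {}           # stands in for the dictionary file

def write_row(page_id: int, row: bytes) -> None:
    pages[page_id].append(row)

def train_dictionary(sample_rows) -> bytes:
    # Stand-in "training": concatenate samples (zlib preset dicts max out at 32 KiB).
    # A real system would run a dictionary trainer over the samples instead.
    return b"".join(sample_rows)[-32768:]

def compress_page(page_id: int) -> list:
    # Select the dictionary via the page's mapping and compress each row with it.
    zdict = dictionaries[page_dict_map[page_id]]
    compressed = []
    for row in pages[page_id]:
        c = zlib.compressobj(zdict=zdict)
        compressed.append(c.compress(row) + c.flush())
    return compressed

for i in range(PRESET_THRESHOLD):
    write_row(1, b"user:alice,city:beijing,row=%d" % i)

if len(pages[1]) >= PRESET_THRESHOLD:          # threshold reached: train
    dictionaries["dict-0"] = train_dictionary(pages[1])
    page_dict_map[1] = "dict-0"                # record the page -> dictionary mapping

compressed_rows = compress_page(1)
```

Because every row of the page is deflated against the same shared dictionary, short rows that repeat the dictionary's substrings compress far better than they would independently.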
2. The database shared dictionary compression method of claim 1, wherein training at least one dictionary by using the written data when the written data of the database table reaches the preset threshold value further comprises:
acquiring the uncompressed quantity of uncompressed data pages in the database table;
when the uncompressed quantity reaches a preset quantity threshold value, taking the written data of the uncompressed data pages as dictionary training data;
and inputting the dictionary training data into a dictionary generation model to generate a plurality of dictionaries, wherein the number of the dictionaries is a preset number.
3. The database shared dictionary compression method of claim 2, wherein inputting the dictionary training data into the dictionary generation model to generate a plurality of dictionaries further comprises:
generating the preset number and the size of the dictionaries according to the dictionary training data and a preset compression rate;
inputting the dictionary training data into the dictionary generation model, and generating a plurality of dictionaries based on the preset number and the size of the dictionaries;
and generating the first metadata of the data page based on the mapping relationship between the data page and the dictionary.
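One way the dictionary count and size of claim 3 could be derived from the training data and a target compression rate. The per-dictionary byte budget and the 1% sizing rule of thumb are assumptions chosen for illustration, not values from the patent:

```python
def plan_dictionaries(sample_sizes, target_ratio, bytes_per_dict=1 << 20):
    """Return (preset number, dictionary size) for a batch of training pages.

    sample_sizes: byte lengths of the uncompressed training pages
    target_ratio: desired compressed/uncompressed ratio, e.g. 0.25
    bytes_per_dict: training bytes covered by one dictionary (assumption)
    """
    total = sum(sample_sizes)
    count = max(1, -(-total // bytes_per_dict))      # ceiling division
    # Rule of thumb: size a dictionary near 1% of the data it covers,
    # scaled down when the target ratio is more aggressive than 0.25.
    size = max(4096, int(total / count * 0.01 * (target_ratio / 0.25)))
    return count, size

count, size = plan_dictionaries([512 * 1024] * 4, target_ratio=0.25)
```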
4. The database shared dictionary compression method of claim 2, wherein taking the written data of the uncompressed data pages as dictionary training data when the uncompressed quantity reaches the preset quantity threshold value further comprises:
taking the written data of the uncompressed data pages as initial dictionary training data;
and acquiring the dictionary training data from the initial dictionary training data according to a preset selection strategy.
5. The database shared dictionary compression method of claim 4, wherein acquiring the dictionary training data from the initial dictionary training data according to the preset selection strategy further comprises:
acquiring a training data threshold value;
and, according to the training data threshold value, randomly selecting data, or selecting data of a corresponding quantity and size, from the initial dictionary training data as the dictionary training data.
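The random selection of claim 5 might look like the following sketch, where `max_bytes` plays the role of the training data threshold; the fixed seed and the stop-on-overflow behavior are assumptions:

```python
import random

def select_training_data(initial_samples, max_bytes, seed=0):
    """Randomly draw samples from the initial dictionary training data
    until the training-data threshold (in bytes) would be exceeded."""
    rng = random.Random(seed)
    pool = list(initial_samples)
    rng.shuffle(pool)
    chosen, used = [], 0
    for sample in pool:
        if used + len(sample) > max_bytes:
            break               # threshold reached, stop selecting
        chosen.append(sample)
        used += len(sample)
    return chosen

subset = select_training_data([b"a" * 10, b"b" * 10, b"c" * 10], max_bytes=25)
```

Capping the training set keeps dictionary training cheap even when many uncompressed pages are eligible.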
6. The database shared dictionary compression method of claim 1, wherein after selecting, based on the mapping relationship, the corresponding dictionary from the dictionary file to compress the written data of the data rows in the data page, the method further comprises:
if the compression ratio of the data page does not reach a preset compression threshold value, taking the written data of the data page as candidate data;
acquiring newly added written data;
training the dictionary by using the newly added written data and the candidate data to update the dictionary of the data page;
and compressing the data page by using the updated dictionary.
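The retraining loop of claim 6, sketched with `zlib`. The 0.2 savings threshold and the concatenation "trainer" are assumptions; a real system would measure the page's actual compression ratio and rerun its dictionary trainer:

```python
import zlib

def compression_saving(rows, zdict) -> float:
    """Fraction of bytes saved when compressing rows with the given dictionary."""
    total_in = total_out = 0
    for row in rows:
        c = zlib.compressobj(zdict=zdict)
        out = c.compress(row) + c.flush()
        total_in += len(row)
        total_out += len(out)
    return 1 - total_out / total_in

def update_dictionary(page_rows, new_rows, zdict, min_saving=0.2) -> bytes:
    """If the page misses the preset compression threshold, treat its rows as
    candidate data and retrain together with newly written data."""
    if compression_saving(page_rows, zdict) >= min_saving:
        return zdict                                  # ratio acceptable, keep
    return b"".join(page_rows + new_rows)[-32768:]    # retrain on candidate + new data

old_dict = b"unrelated-dictionary-content"
kept = update_dictionary([b"a" * 1000], [], old_dict)          # compresses well: kept
rebuilt = update_dictionary([bytes(range(256))], [b"new"], old_dict)
```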
7. The database shared dictionary compression method of claim 1, wherein the first metadata is stored at a first preset location of the data page, the first metadata comprising one or more of: a dictionary file name, a file offset, and a dictionary length.
8. The database shared dictionary compression method of claim 1, wherein the second metadata is stored at a second preset location of the data row, the second metadata comprising: a row data length and/or row transaction information.
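Claims 7 and 8 pin the metadata to preset locations in the page and row. A sketch with Python's `struct`, where the field widths and the 16-byte file-name slot are assumptions rather than the patented layout:

```python
import struct

# "First metadata" at the head of a data page: dictionary file name,
# file offset, dictionary length (little-endian; widths are assumptions).
PAGE_META = struct.Struct("<16sQI")
# "Second metadata" at the head of a data row: row data length, transaction id.
ROW_META = struct.Struct("<IQ")

def pack_page_meta(dict_file: str, offset: int, length: int) -> bytes:
    # struct's "16s" null-pads a shorter file name automatically.
    return PAGE_META.pack(dict_file.encode(), offset, length)

def unpack_page_meta(buf: bytes):
    name, offset, length = PAGE_META.unpack_from(buf, 0)
    return name.rstrip(b"\0").decode(), offset, length

meta = pack_page_meta("dict-0.bin", offset=4096, length=1024)
```

Fixed-width fields at fixed offsets let the reader locate the dictionary mapping without decompressing anything, which is why claim 10 leaves the metadata uncompressed.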
9. The database shared dictionary compression method of claim 1, wherein the mapping relationship between the dictionary and at least one data page is generated according to a preset mapping rule, the preset mapping rule comprising a contiguous mapping rule, a non-contiguous mapping rule, or a content-related mapping rule;
when the preset mapping rule is the contiguous mapping rule, selecting a first number of contiguous data pages to be associated with the same dictionary;
when the preset mapping rule is the non-contiguous mapping rule, selecting a second number of non-contiguous data pages to be associated with the same dictionary;
and when the preset mapping rule is the content-related mapping rule, selecting a third number of data pages whose written data have relevance to be associated with the same dictionary.
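The three preset mapping rules of claim 9 can be sketched as simple assignment policies; the group size, the round-robin interpretation of "non-contiguous", and the `content_key` function are all assumptions:

```python
def map_pages_to_dictionaries(page_ids, rule, group=4, content_key=None):
    """Assign each data page to a dictionary id under one of the three
    preset mapping rules."""
    mapping = {}
    ordered = sorted(page_ids)
    if rule == "contiguous":          # runs of consecutive pages share a dictionary
        for i, pid in enumerate(ordered):
            mapping[pid] = i // group
    elif rule == "non_contiguous":    # interleaved pages share a dictionary
        for i, pid in enumerate(ordered):
            mapping[pid] = i % group
    elif rule == "content":           # pages with related written data share one
        for pid in page_ids:
            mapping[pid] = content_key(pid)
    return mapping

contig = map_pages_to_dictionaries(range(8), "contiguous", group=4)
spread = map_pages_to_dictionaries(range(8), "non_contiguous", group=4)
```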
10. The database shared dictionary compression method of claim 1, wherein, when the written data of the data pages in the database table is compressed, the first metadata of the data pages and the second metadata of the data rows in each data page are not compressed.
11. The database shared dictionary compression method of claim 1, wherein after selecting, based on the mapping relationship, the corresponding dictionary from the dictionary file to compress the written data of the data rows in the data page, the method further comprises: decompressing the compressed data rows to obtain the corresponding written data;
the decompression process comprises the following steps:
reading the first metadata of the data page to obtain the mapping relationship;
searching for the corresponding dictionary in the dictionary file by using the mapping relationship;
and decompressing the written data of the data rows by using the dictionary.
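The three decompression steps of claim 11, sketched with `zlib`. The `(file name, offset, length)` metadata triple mirrors the fields named in claim 7; the in-memory `dictionary_files` map and all names are illustrative assumptions:

```python
import zlib

def decompress_row(first_metadata, dictionary_files, compressed_row) -> bytes:
    """Read the page's first metadata for the mapping, locate the dictionary
    inside its dictionary file, then inflate the row with it."""
    dict_file, offset, length = first_metadata
    zdict = dictionary_files[dict_file][offset:offset + length]
    dec = zlib.decompressobj(zdict=zdict)
    return dec.decompress(compressed_row) + dec.flush()

# Round-trip demo: compress a row against a dictionary stored at offset 3.
zdict = b"city:beijing,user:"
row = b"city:beijing,user:alice"
c = zlib.compressobj(zdict=zdict)
blob = c.compress(row) + c.flush()
files = {"dict-0.bin": b"xyz" + zdict}
restored = decompress_row(("dict-0.bin", 3, len(zdict)), files, blob)
```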
12. A database shared dictionary compression apparatus, comprising:
a write module, configured to perform a write operation on a data page, the write operation being used to write data into a plurality of data rows;
a training module, configured to train at least one dictionary by using the written data after the written data of the database table reaches a preset threshold value;
a storage module, configured to store the trained dictionary into a dictionary file;
and a compression module, configured to select, based on a mapping relationship, the corresponding dictionary from the dictionary file to compress the written data of the data rows in the data page.
13. An electronic device, comprising a memory storing a computer program and a processor, wherein the processor, when executing the computer program, implements the database shared dictionary compression method according to any one of claims 1 to 11.
14. A computer-readable storage medium, wherein the storage medium stores a program that is executed by a processor to implement the database shared dictionary compression method according to any one of claims 1 to 11.
CN202310045920.4A, filed 2023-01-30 (priority 2023-01-30): Database shared dictionary compression method and device, electronic equipment and storage medium. Status: Active. Granted as CN115774699B (en).

Priority Applications (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN202310045920.4A | CN115774699B (en) | 2023-01-30 | 2023-01-30 | Database shared dictionary compression method and device, electronic equipment and storage medium


Publications (2)

Publication Number | Publication Date
CN115774699A (en) | 2023-03-10
CN115774699B (en) | 2023-05-23

Family

ID=85393752

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310045920.4A (Active, granted as CN115774699B) | Database shared dictionary compression method and device, electronic equipment and storage medium | 2023-01-30 | 2023-01-30

Country Status (1)

Country | Link
CN | CN115774699B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN116881208A (en)* | 2023-07-20 | 2023-10-13 | 中国工商银行股份有限公司 | File processing method and device, electronic equipment and storage medium
CN117076388A (en)* | 2023-10-12 | 2023-11-17 | 中科信工创新技术(北京)有限公司 | File processing method and device, storage medium and electronic equipment
CN117376429A (en)* | 2023-10-10 | 2024-01-09 | 南京邮电大学 | Intelligent compression method for wireless sensor network data
WO2024222768A1 (en)* | 2023-04-26 | 2024-10-31 | 华为技术有限公司 | Data compression method and related system
CN119376617A (en)* | 2024-09-19 | 2025-01-28 | 山东云海国创云计算装备产业创新中心有限公司 | Data compression method, device and equipment

Citations (10)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20030086620A1 (en)* | 2001-06-27 | 2003-05-08 | Lucco Steven E. | System and method for split-stream dictionary program compression and just-in-time translation
CN103326732A (en)* | 2013-05-10 | 2013-09-25 | 华为技术有限公司 | Method for packing data, method for unpacking data, coder and decoder
US20150178305A1 (en)* | 2013-12-23 | 2015-06-25 | Ingo Mueller | Adaptive dictionary compression/decompression for column-store databases
CN105630864A (en)* | 2014-11-25 | 2016-06-01 | Sap欧洲公司 | Forced ordering of a dictionary storing row identifier values
CN113518088A (en)* | 2021-07-12 | 2021-10-19 | 北京百度网讯科技有限公司 | Data processing method, apparatus, server, client and medium
US20220019562A1 (en)* | 2020-07-17 | 2022-01-20 | Alipay (Hangzhou) Information Technology Co., Ltd. | Data compression based on key-value store
US20220308763A1 (en)* | 2021-03-26 | 2022-09-29 | James Guilford | Method and apparatus for a dictionary compression accelerator
CN115208414A (en)* | 2022-09-15 | 2022-10-18 | 本原数据(北京)信息技术有限公司 | Data compression method, data compression device, computer device and storage medium
WO2022257124A1 (en)* | 2021-06-11 | 2022-12-15 | Huawei Technologies Co., Ltd. | Adaptive compression
CN115483935A (en)* | 2021-05-31 | 2022-12-16 | 华为技术有限公司 | Data processing method and device


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
卢秉亮; 朱健; 王玉湘; 蒙刚; 甄雷: "Proportional compression algorithm for dynamically grouped data dictionaries based on a real-time database" (基于实时数据库的动态分组数据字典比例压缩算法), 微电子学与计算机 (Microelectronics & Computer)*
周其力; 严朝军; 王勇佳: "Adaptive dictionary compression algorithm and installation program" (自适应字典压缩算法与安装程序), 杭州电子工业学院学报 (Journal of Hangzhou Institute of Electronics Engineering)*
梁捷; 蒋雯倩; 李金瑾: "Research on metering data compression based on dynamic dictionaries and differential encoding" (基于动态字典和差分编码的计量数据压缩研究), 信息技术 (Information Technology)*

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
WO2024222768A1 (en)* | 2023-04-26 | 2024-10-31 | 华为技术有限公司 | Data compression method and related system
CN119363124A (en)* | 2023-04-26 | 2025-01-24 | 华为技术有限公司 | Data compression method and related system
CN116881208A (en)* | 2023-07-20 | 2023-10-13 | 中国工商银行股份有限公司 | File processing method and device, electronic equipment and storage medium
CN117376429A (en)* | 2023-10-10 | 2024-01-09 | 南京邮电大学 | Intelligent compression method for wireless sensor network data
CN117076388A (en)* | 2023-10-12 | 2023-11-17 | 中科信工创新技术(北京)有限公司 | File processing method and device, storage medium and electronic equipment
CN119376617A (en)* | 2024-09-19 | 2025-01-28 | 山东云海国创云计算装备产业创新中心有限公司 | Data compression method, device and equipment

Also Published As

Publication number | Publication date
CN115774699B (en) | 2023-05-23

Similar Documents

Publication | Title
CN115774699B (en) | Database shared dictionary compression method and device, electronic equipment and storage medium
US10719450B2 (en) | Storage of run-length encoded database column data in non-volatile memory
CN106951375B (en) | Method and device for deleting snapshot volume in storage system
CN110196847A (en) | Data processing method and device, storage medium and electronic device
CN115427941A (en) | Data management system and control method
US11327929B2 (en) | Method and system for reduced data movement compression using in-storage computing and a customized file system
US20140359233A1 (en) | Read-write control method for memory, and corresponding memory and server
US11314689B2 (en) | Method, apparatus, and computer program product for indexing a file
CN106537327A (en) | Flash memory compression
CN107491523A (en) | Method and device for storing data objects
CN107888687B (en) | Proxy client storage acceleration method and system based on distributed storage system
WO2022037015A1 (en) | Column-based storage method, apparatus and device based on persistent memory
US11947826B2 (en) | Method for accelerating image storing and retrieving differential latency storage devices based on access rates
CN112346659A (en) | Storage method, device and storage medium for distributed object storage metadata
CN105975638A (en) | NoSQL-based massive small file storage structure for aviation logistics and storage method thereof
CN111159176A (en) | Method and system for storing and reading mass stream data
CN111400306A (en) | Radix tree access system based on RDMA and non-volatile memory
CN108959500A (en) | Object storage method, device, equipment and computer-readable storage medium
CN116166179A (en) | Data storage system, smart network card and computing node
CN114415966B (en) | Method for constructing a KV SSD storage engine
CN109002400A (en) | Content-aware computer cache management system and method
EP4530878A1 (en) | Hash engine for conducting point queries
US10417215B2 (en) | Data storage over immutable and mutable data stages
CN117493284B (en) | File storage method, file reading method, and file storage and reading system
US20240168630A1 (en) | Hybrid design for large scale block device compression using flat hash table

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
