Technical Field
The present invention relates to the technical field of data compression, and in particular to a data center disaster recovery system.
Background Art
In IT, a disaster recovery system is an environment provided for a computer information system that can cope with various disasters. When the computer system suffers an irresistible natural disaster such as a flood, or a man-made disaster such as war, the disaster recovery system safeguards user data. Today's hospitals have developed into modern general hospitals that rely on comprehensive hospital information management systems to make hospital management scientific and modern and to share data fully. A hospital information system spans several important subsystems, including clinical information, laboratory information, medical imaging management, and patient information, so such a large system inevitably generates a large amount of data. As key institutions, hospitals also typically store important data such as experimental results in these systems. If the hospital information system is hit by a natural disaster or a hacker intrusion, data corruption or even a system outage is likely. Establishing a hospital information disaster recovery system is therefore very important.
Because hospital information system data is large and complex, backing it up to build a disaster recovery system can consume considerable manpower and material resources. Compressing the data before backup improves backup efficiency, reduces the load on the computer system, and preserves data integrity. Search-buffer-based encoding, a lossless compression method that exploits data repetition, achieves a high compression ratio but considers only the repetition of data within the current search buffer. If the search buffer is long, the time efficiency of encoding drops; if it is short, the buffer is less likely to contain strings that occur in the data sequence to be encoded, and compression efficiency drops.
Summary of the Invention
The present invention provides a data center disaster recovery system to address the problems described above.
The data center disaster recovery system of the present invention adopts the following technical solution:
One embodiment of the present invention provides a data center disaster recovery system comprising the following modules:
a data preprocessing module, configured to collect hospital information data, process the hospital information data with a smoothing algorithm, and expand the hospital information data row by row to obtain a data sequence to be encoded;
a similarity acquisition module, configured to: perform a matching operation on the data sequence to be encoded according to a preset search buffer to obtain a matched string; encode the matched string according to the encoding algorithm to obtain an encoding result; update the search buffer according to the matched string; obtain character frequency distribution sequences according to the character types and frequencies contained in the search buffer before and after the update; and obtain the similarity of the search buffer before and after the update according to those character frequency distribution sequences;
an encoding module, configured to: adjust the length of the search buffer according to the similarity of the search buffer before and after the update to obtain a final search buffer; continue the matching operation on the data sequence to be encoded according to the final search buffer, stopping the iteration when all characters in the data sequence to be encoded have been traversed; and assemble the encoding results of all strings matched during encoding into the compressed data of the hospital information data;
a storage module, configured to store the compressed data of the hospital information data, thereby constructing the hospital information data disaster recovery system.
Preferably, updating the search buffer according to the matched string specifically comprises:
removing from the search buffer the matched string and all characters that precede it in the search buffer, and appending to the end of the search buffer the matched string from the data sequence to be encoded together with the character immediately following it, thereby completing the update of the search buffer.
Preferably, obtaining the character frequency distribution sequences according to the character types and frequencies contained in the search buffer before and after the update specifically comprises:
obtaining the character types of the search buffer before and after the update and merging them; then counting the frequency of every character type in the pre-update and post-update search buffers respectively, to form the character frequency distribution sequence of the pre-update search buffer and that of the post-update search buffer, where each position in the two sequences corresponds to the same character.
Preferably, the similarity of the search buffer before and after the update is obtained from the character frequency distribution sequences of the search buffer before and after the update by a specific formula,
where the result of the formula is the similarity of the search buffer before and after a given update, indexed over the search buffers needed to traverse the entire data sequence to be encoded; the formula uses the normalized length of the string matched between that pre-update search buffer and the data sequence to be encoded; the frequencies of a given character in the pre-update and post-update character frequency distribution sequences; the total number of character types across the pre-update and post-update search buffers; and the exponential function with the natural constant e as its base.
Preferably, adjusting the length of the search buffer according to the similarity of the search buffer before and after the update to obtain the final search buffer specifically comprises:
presetting a similarity threshold and judging the similarity of the search buffer before and after the update as follows: when the similarity is greater than or equal to the similarity threshold, the updated search buffer is taken as the final search buffer; when the similarity is less than the similarity threshold, the updated search buffer is extended forward by a characters according to a preset extension length a to obtain a re-updated search buffer, the similarity between the pre-update search buffer and the re-updated search buffer is obtained, and the judgment is repeated on that similarity until the final search buffer is obtained, at which point the iteration stops.
The beneficial effects of the technical solution of the present invention are as follows: the search buffer is updated according to the result of matching it against the data sequence to be encoded, which shortens the search buffer and speeds up matching; the similarity of the search buffer before and after the update determines whether the search buffer is adjusted, and the search buffer is extended forward when needed, which preserves the compression ratio; this embodiment thereby improves the compression efficiency of hospital information data.
Brief Description of the Drawings
To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a structural block diagram of a data center disaster recovery system according to the present invention.
Detailed Description
To further explain the technical means adopted by the present invention to achieve its intended purpose and the resulting effects, the specific implementation, structure, features, and effects of the data center disaster recovery system proposed by the present invention are described in detail below with reference to the drawings and preferred embodiments. In the following description, different references to "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. In addition, particular features, structures, or characteristics of one or more embodiments may be combined in any suitable form.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present invention.
The specific scheme of the data center disaster recovery system provided by the present invention is described below with reference to the drawings.
Referring to FIG. 1, which shows a structural block diagram of a data center disaster recovery system provided by one embodiment of the present invention, the system comprises the following modules:
Data collection module 101: collects hospital information data through the hospital information system and classifies it to obtain the data sequence to be encoded.
It should be noted that data is obtained from the hospital information system and classified into categories such as clinical information, laboratory information, patient information, and medical imaging information. Because the encoding algorithm compresses data on the basis of repetition, and unprocessed data may be affected by noise and other factors that reduce its repetition, the compression effect may suffer. This embodiment therefore applies a smoothing algorithm to the collected data of each category so that data at adjacent time points and spatial points become closer in value, which increases the likelihood of repeated data.
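The text does not say which smoothing algorithm is used. As a minimal sketch only, assuming a centered moving average over a numeric series (the function name `moving_average`, the window size, and the sample values are all hypothetical), the smoothing step could look like this:

```python
from typing import List, Sequence


def moving_average(values: Sequence[float], window: int = 3) -> List[float]:
    """Centered moving average; the window shrinks at both ends of the series.

    Assumption: the source only says "a smoothing algorithm", so this
    particular filter is an illustrative stand-in.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical daily patient counts with one outlier.
counts = [40.0, 41.0, 55.0, 42.0, 40.0]
smoothed = moving_average(counts)
```

Neighbouring values end up closer to each other after smoothing, which is the property the embodiment relies on to increase repetition.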
It should be noted that the hospital information data collected through the hospital information system usually includes several fields such as date, department, and number of patients, and is stored in a two-dimensional data table. To facilitate subsequent compression, this embodiment treats the data table as a two-dimensional matrix, transposes the matrix, expands it row by row into a one-dimensional data sequence, and treats that sequence as the data sequence to be encoded.
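The transpose-then-expand step can be sketched as follows. The function name `table_to_sequence` and the sample records are hypothetical; the sketch assumes the table is rectangular and that each cell is handled as one symbol.

```python
from typing import List, Sequence


def table_to_sequence(table: Sequence[Sequence[str]]) -> List[str]:
    """Transpose a rectangular table, then flatten it row by row."""
    if not table:
        return []
    width = len(table[0])
    if any(len(row) != width for row in table):
        raise ValueError("table must be rectangular")
    transposed = list(zip(*table))  # columns of the table become rows
    return [cell for row in transposed for cell in row]


# Hypothetical records: date, department, number of patients.
records = [
    ["2023-08-01", "cardiology", "42"],
    ["2023-08-02", "cardiology", "40"],
]
sequence = table_to_sequence(records)
```

After transposition, values from the same column sit next to each other in the sequence, so fields that repeat across records (the department here) become adjacent.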
At this point, the data sequence to be encoded has been obtained from the hospital information.
Similarity acquisition module 102: reads in the data to be encoded, matches it against the preset search buffer, updates the search buffer, and obtains the similarity of the search buffer before and after the update.
It should be noted that the traditional encoding algorithm fixes the length of the search buffer. If the search buffer is too long, more of the data can be matched but the time efficiency of encoding drops; if it is too short, time efficiency improves but fewer matches are found and compression efficiency drops. Moreover, because the search buffer length in this encoding is fixed, after each match the search buffer slides forward in the data sequence to be encoded by the length of the matched string so that its length stays constant. The matching result then depends only on the search buffer length, which easily leads to either poor compression efficiency or poor time efficiency. This embodiment therefore adjusts the search buffer length according to the matching result and keeps the search buffer as short as possible.
First, an initial search buffer length is preset, and the longest match between the data sequence to be encoded and the initial search buffer is found. This embodiment is described with a particular initial length as an example; the initial length is not restricted.
The initial search buffer is slid along the data sequence to be encoded to find the longest match, and the matched string is denoted L. The matched string L and the one character following L are encoded according to the search buffer to obtain an encoding result. This encoding step is existing technology in the encoding algorithm and is not described in detail here. After the encoding result is obtained, the search buffer is updated: the position of the matched string L in the search buffer is located, string L and all characters before it are removed from the initial search buffer, and the string L matched in the data sequence to be encoded, together with the character immediately following it, is appended to the end of the search buffer. This completes the update, and the length of the updated search buffer is recorded.
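One matching-and-update step can be sketched as below. This is an illustration under assumptions rather than the patented implementation: the names `longest_match` and `encode_step` are hypothetical, the token layout (start index in the buffer, match length, next character) is assumed from the triple described in the decompression passage, and matches are searched only inside the search buffer.

```python
from typing import Tuple


def longest_match(buffer: str, data: str, pos: int) -> Tuple[int, int]:
    """Return (start, length) of the longest prefix of data[pos:] found in buffer."""
    best_start, best_len = 0, 0
    # Keep one character in reserve to serve as the explicit next character.
    max_len = len(data) - pos - 1
    for length in range(1, max_len + 1):
        start = buffer.find(data[pos:pos + length])
        if start == -1:
            break
        best_start, best_len = start, length
    return best_start, best_len


def encode_step(buffer: str, data: str, pos: int) -> Tuple[Tuple[int, int, str], str, int]:
    """Encode one token, then update the buffer as the embodiment describes."""
    start, length = longest_match(buffer, data, pos)
    next_char = data[pos + length]
    token = (start, length, next_char)
    consumed = data[pos:pos + length + 1]  # matched string plus next character
    # Drop the matched string and everything before it, then append `consumed`.
    new_buffer = buffer[start + length:] + consumed
    return token, new_buffer, pos + length + 1


token, new_buffer, new_pos = encode_step("abcabd", "abdxabc", 0)
```

In this example the string "abd" is matched at index 3 of the buffer, so the buffer shrinks from six characters to the four characters "abdx".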
It should be noted that the part removed from the search buffer includes the matched string L and the characters that precede it. Because the matched string is appended to the end of the search buffer, a later occurrence of string L in the data sequence to be encoded can still be matched. However, the removed part also contains the characters that preceded L in the initial search buffer; if those characters occur again in the data sequence to be encoded, the shortened search buffer cannot match them and compression efficiency drops. Therefore, to improve compression efficiency while preserving encoding time efficiency, this embodiment evaluates the similarity of the search buffer before and after the update.
It should be noted that the higher the similarity of the search buffer before and after the update, the smaller the effect of the update on the matching result; the lower the similarity, the larger the effect, and the search buffer needs to be adjusted. Because character frequencies directly reflect how similar the contents of two search buffers are in character distribution, this embodiment counts the character frequencies in the pre-update and post-update search buffers, generates their character frequency distribution sequences, and quantifies the similarity of the two search buffers by the divergence of the two sequences.
It should be noted that the longer the string matched between the search buffer and the data sequence to be encoded, the better the pre-update search buffer performs, and the more appropriate it is to measure the similarity of the search buffer before and after the update by the divergence. The shorter the matched string, the worse the pre-update search buffer performs, and the less appropriate it is to use the divergence between the pre-update and post-update search buffers as the measure. This embodiment therefore uses the matched string length to adjust how the divergence quantifies the similarity of the search buffer before and after the update.
To obtain the character frequency distribution sequences of the search buffer before and after the update, this embodiment first builds the character set of the pre-update search buffer from the character types that occur in it, builds the character set of the post-update search buffer in the same way, and takes the union of the two sets as the full character set of the search buffer before and after the update. It then counts, in the pre-update and post-update search buffers respectively, the frequency of every character type in the union, which yields the character frequency distribution sequence of the pre-update search buffer and that of the post-update search buffer. Each position in the two sequences corresponds to the same character.
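A short sketch of building the two aligned frequency sequences follows. The function name `frequency_sequences` is hypothetical, and two choices are assumptions not stated in the text: frequencies are taken as relative frequencies (count divided by buffer length), and the shared character order is the sorted union.

```python
from collections import Counter
from typing import List, Tuple


def frequency_sequences(before: str, after: str) -> Tuple[List[str], List[float], List[float]]:
    """Return the union alphabet and the two position-aligned frequency sequences."""
    alphabet = sorted(set(before) | set(after))  # union of both character sets

    def distribution(buffer: str) -> List[float]:
        counts = Counter(buffer)
        total = len(buffer)
        return [counts[ch] / total if total else 0.0 for ch in alphabet]

    return alphabet, distribution(before), distribution(after)


alphabet, p, q = frequency_sequences("abcabd", "abdx")
```

A character that occurs in only one buffer gets frequency 0 in the other sequence, which keeps the two sequences the same length and aligned position by position.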
Because the divergence is defined in terms of the relative difference between two sequences, using it to quantify the similarity of the search buffer before and after the update captures the frequency change of each corresponding character more accurately. This embodiment therefore quantifies the similarity by computing the divergence of the two character frequency distribution sequences. In addition, the longer the string matched between the search buffer and the data sequence to be encoded, the more convincing the divergence is as a measure of the similarity between the pre-update and post-update search buffers. The shorter the matched string, the worse the pre-update search buffer matches, and the matched string length is then needed to adjust the divergence so that a more accurate similarity is obtained. This embodiment accordingly constructs a similarity formula,
where the result of the formula is the similarity of the search buffer before and after a given update, indexed over the search buffers needed to traverse the entire data sequence to be encoded; the formula uses the normalized length of the string matched between that pre-update search buffer and the data sequence to be encoded; the frequencies of a given character in the pre-update and post-update character frequency distribution sequences; all character types counted across the pre-update and post-update search buffers; and the exponential function with the natural constant e as its base.
It should be noted that the larger the divergence between the character frequency distribution sequences of the search buffer before and after the update, the lower their similarity, so this embodiment uses the exponential function to build a negative correlation between the divergence and the similarity. Furthermore, the longer the string matched between the search buffer and the data sequence to be encoded, the better the divergence alone measures the similarity, so an adjustment coefficient derived from the matched string length is applied. The longer the matched string, the closer this coefficient tends to 0, and the adjusted similarity stays almost identical to what the divergence alone gives. Conversely, the shorter the matched string, the less appropriate the divergence alone is as a measure, and the adjustment makes the similarity result larger, which makes the subsequent search buffer extension more accurate.
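The exact similarity formula is not recoverable from this text, so the following is only a sketch of the negative exponential mapping from divergence to similarity. It assumes the divergence is the Kullback-Leibler divergence, adds a small constant `eps` so that zero frequencies do not break the logarithm, and omits the length-based adjustment coefficient because its form is not stated. The names `kl_divergence` and `similarity` are hypothetical.

```python
import math
from typing import List, Sequence


def _smooth(dist: Sequence[float], eps: float) -> List[float]:
    """Add eps to every entry and renormalise so the entries sum to 1."""
    raised = [x + eps for x in dist]
    total = sum(raised)
    return [x / total for x in raised]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-9) -> float:
    """Assumed divergence: KL(p || q) over position-aligned frequency sequences."""
    ps, qs = _smooth(p, eps), _smooth(q, eps)
    return sum(a * math.log(a / b) for a, b in zip(ps, qs))


def similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """Map divergence to similarity with exp(-D): larger divergence, lower similarity.

    The matched-length adjustment described in the text is left out here.
    """
    return math.exp(-kl_divergence(p, q))


same = similarity([0.5, 0.5, 0.0], [0.5, 0.5, 0.0])
different = similarity([0.5, 0.5, 0.0], [0.5, 0.25, 0.25])
```

Identical sequences give a similarity of 1, and the value falls toward 0 as the two character distributions move apart.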
At this point, the similarity of the search buffer before and after each update during encoding has been obtained.
Encoding module 103: adjusts the search buffer length according to the similarity result and repeats the similarity check until the similarity reaches the preset threshold, then encodes the data sequence to be encoded according to the final search buffer.
It should be noted that the higher the similarity of the search buffer before and after the update, the more suitable the updated search buffer is for continuing the encoding. This embodiment therefore judges the similarity of the search buffer before and after the update as follows:
A similarity threshold is preset. When the similarity is greater than or equal to the threshold, the updated search buffer is taken as the final search buffer. When the similarity is less than the threshold, the updated search buffer is extended forward by a preset extension length, in characters, to obtain a re-updated search buffer; the similarity between the pre-update search buffer and the re-updated search buffer is obtained, and the judgment is repeated on that similarity until the final search buffer is obtained, at which point the iteration stops. This embodiment is described with particular values of the threshold and the extension length as examples; neither value is restricted.
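The threshold-and-extend loop can be sketched as follows, under several labeled assumptions: "extending forward" is read as restoring characters that were just removed from the front of the buffer, the buffer is modelled as a suffix of the already processed text `history`, and the similarity is exp(-KL) without the length adjustment. The names `similarity` and `adjust_buffer`, the threshold 0.8, and the step 2 are hypothetical.

```python
import math
from collections import Counter


def similarity(before: str, after: str, eps: float = 1e-9) -> float:
    """exp(-KL) between the character distributions of two buffers (assumed form)."""
    alphabet = sorted(set(before) | set(after))

    def distribution(buffer: str):
        counts = Counter(buffer)
        raised = [counts[ch] + eps for ch in alphabet]
        total = sum(raised)
        return [x / total for x in raised]

    p, q = distribution(before), distribution(after)
    return math.exp(-sum(a * math.log(a / b) for a, b in zip(p, q)))


def adjust_buffer(history: str, before: str, after_start: int,
                  threshold: float, step: int) -> int:
    """Return the start index in `history` of the final search buffer.

    The updated buffer is history[after_start:]; while it is not similar
    enough to `before`, its start moves back by `step` characters.
    """
    start = after_start
    while start > 0 and similarity(before, history[start:]) < threshold:
        start = max(0, start - step)
    return start


history = "abcabdabdx"      # text processed so far
before = history[:6]        # pre-update buffer "abcabd"
final_start = adjust_buffer(history, before, after_start=6, threshold=0.8, step=2)
final_buffer = history[final_start:]
```

Here the updated buffer "abdx" lacks the character "c", so the loop extends it twice, to "cabdabdx", before the similarity reaches the threshold.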
It should be noted that the encoding algorithm proposed in this embodiment encodes while it searches: the search buffer keeps sliding to the right, and its length is updated and adjusted according to the length of the string matched between the search buffer and the data sequence to be encoded and according to the similarity, which yields the final search buffer.
During encoding, the matching operation on the data sequence to be encoded continues according to the final search buffer, and the iteration stops when all characters in the data sequence to be encoded have been traversed. All encoding results produced during encoding together form the compressed data of the hospital information data.
At this point, the compressed data of the hospital information data has been obtained.
Storage module 104: stores the compressed hospital information data, thereby establishing the hospital information data disaster recovery system.
It should be noted that storing the hospital information data in compressed form improves both the storage rate and the storage efficiency of the data. In the event of a natural disaster or a man-made attack, the stored hospital information data remains continuously accessible and usable.
The compressed data must be decompressed before use. The decompression process is as follows. An empty decoded data sequence is created, and the compressed data sequence of the hospital information data is read in sequentially. The compressed data is typically a series of triples, each containing a pointer to the data, the length of the matched string, and the next character. The pointer position and the length value locate the string in the search buffer, and the located string followed by the next character in the triple is the decoding result of that triple. The string found in the search buffer and all characters before it are then removed, and the decoding result is appended to the end of the search buffer, which completes the update. The similarity acquisition module gives the similarity of the search buffer before and after the update: if the similarity is greater than the threshold, the next triple is decoded with the updated search buffer; otherwise the search buffer is extended forward, and the next triple is decoded once the final search buffer has been obtained.
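The decoder mirrors the encoder's buffer update, which a round trip makes visible. The sketch below is a simplification under stated assumptions: the similarity-based extension is left out on both sides, both sides start from an empty search buffer, and the triple is (start index in the buffer, match length, next character). The names `encode` and `decode` are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[int, int, str]


def encode(data: str) -> List[Token]:
    """Simplified encoder: longest match in the buffer, then the buffer update."""
    buffer, pos, tokens = "", 0, []
    while pos < len(data):
        start, length = 0, 0
        # Grow the candidate while it still occurs in the buffer, keeping
        # one character in reserve as the explicit next character.
        while pos + length + 1 < len(data):
            found = buffer.find(data[pos:pos + length + 1])
            if found == -1:
                break
            start, length = found, length + 1
        next_char = data[pos + length]
        tokens.append((start, length, next_char))
        buffer = buffer[start + length:] + data[pos:pos + length + 1]
        pos += length + 1
    return tokens


def decode(tokens: List[Token]) -> str:
    """Rebuild the data by replaying the same buffer updates as the encoder."""
    buffer, pieces = "", []
    for start, length, next_char in tokens:
        piece = buffer[start:start + length] + next_char
        pieces.append(piece)
        # Remove the located string and everything before it, append the piece.
        buffer = buffer[start + length:] + piece
    return "".join(pieces)


tokens = encode("abababc")
restored = decode(tokens)
```

Because the decoder applies exactly the same update rule to its buffer, each pointer refers to the same buffer contents the encoder saw, which is why the extension step, when used, must also be reproduced identically on the decoding side.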
Establishing the hospital information data disaster recovery system minimizes the risk of data loss and business interruption, ensures the security and reliability of hospital information data, and protects hospital data with the disaster recovery system.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
| Application Number | Publication | Priority Date | Filing Date | Title |
|---|---|---|---|---|
| CN202310966546.1A | CN116683916B (en) | 2023-08-03 | 2023-08-03 | A data center disaster recovery system |
| Publication Number | Publication Date |
|---|---|
| CN116683916A | 2023-09-01 |
| CN116683916B | 2023-10-10 |
| Application Number | Status | Publication | Priority Date | Filing Date | Title |
|---|---|---|---|---|---|
| CN202310966546.1A | Active | CN116683916B (en) | 2023-08-03 | 2023-08-03 | A data center disaster recovery system |
| Country | Link |
|---|---|
| CN (1) | CN116683916B (en) |
| Title |
|---|
| S. JAIN, R. K. BANSAL: "On Match Lengths, Zero Entropy, and Large Deviations—With Application to Sliding Window Lempel–Ziv Algorithm", 《IEEE TRANSACTIONS ON INFORMATION THEORY》, pages 120 - 132* |
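| 杨长生, 宋广华, 卓越: "HLZ: An Adaptive Lossless Coding Algorithm Using a Hybrid Dictionary" (HLZ:一种采用混合字典的自适应无损编码算法), Journal of Zhejiang University (Engineering Science) (《浙江大学学报(工学版)》), pages 40 - 43* |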
| Publication number | Publication date |
|---|---|
| CN116683916B (en) | 2023-10-10 |
| Publication | Title | Publication Date |
|---|---|---|
| JP4961126B2 (en) | An efficient algorithm for finding candidate objects for remote differential compression | |
| CN102831222B (en) | Differential compression method based on data de-duplication | |
| CN113704261B (en) | Key value storage system based on cloud storage | |
| US8849772B1 (en) | Data replication with delta compression | |
| CN101467148A (en) | Efficient data storage using similarity of data segments | |
| US20120191672A1 (en) | Dictionary for data deduplication | |
| CN113328755B (en) | Compressed data transmission method facing edge calculation | |
| CN105912268B (en) | Distributed repeated data deleting method and device based on self-matching characteristics | |
| WO2017128763A1 (en) | Data compression device and method | |
| WO2017096532A1 (en) | Data storage method and apparatus | |
| CN101464910A (en) | Balance clustering compression method based on data similarity | |
| CN111061428B (en) | Method and device for data compression | |
| CN103248369A (en) | Compression system and method based on FPGA (Field Programmable Gate Array) | |
| CN115408350A (en) | Log compression method, log recovery method, log compression device, log recovery device, computer equipment and storage medium | |
| CN105515586B (en) | A kind of quick residual quantity compression method | |
| CN114647764A (en) | Graph structure query method and device and storage medium | |
| JP2025124637A (en) | Methods for Compression of Genomic Sequence Data | |
| Dolgorsuren et al. | StarZIP: Streaming graph compression technique for data archiving | |
| CN112508182A (en) | Pruning method, device, equipment, program product and medium for machine learning model | |
| CN112149416A (en) | A method for detecting hot academic research topics in a distributed academic data warehouse | |
| CN112162973A (en) | Fingerprint collision avoidance, deduplication and recovery method, storage medium and deduplication system | |
| CN116683916A (en) | Disaster recovery system of data center | |
| CN110083743A (en) | A kind of quick set of metadata of similar data detection method based on uniform sampling | |
| CN119166615A (en) | A cross-port data migration management system and method | |
| Guerrini et al. | Lossy compressor preserving variant calling through extended BWT | |
| Code | Title | Description |
|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| PE01 | Entry into force of the registration of the contract for pledge of patent right | Denomination of invention: A Disaster Recovery System for Data Centers; Effective date of registration: 20231207; Granted publication date: 20231010; Pledgee: China People's Property Insurance Co.,Ltd. Qingdao Branch; Pledgor: Shandong Wukesong Electric Technology Co.,Ltd.; Registration number: Y2023370010127 |
| PC01 | Cancellation of the registration of the contract for pledge of patent right | Granted publication date: 20231010; Pledgee: China People's Property Insurance Co.,Ltd. Qingdao Branch; Pledgor: Shandong Wukesong Electric Technology Co.,Ltd.; Registration number: Y2023370010127 |
| PE01 | Entry into force of the registration of the contract for pledge of patent right | Denomination of invention: A data center disaster recovery system; Granted publication date: 20231010; Pledgee: China People's Property Insurance Co.,Ltd. Qingdao Branch; Pledgor: Shandong Wukesong Electric Technology Co.,Ltd.; Registration number: Y2024370010141 |