Technical Field
The invention belongs to the field of computer software, and in particular relates to a Bloom-filter-based multi-level index design method for medical image data.
Background Art
With the continuous development of medical informatization and the rapid growth of data volume, the relational-database storage scheme used by PACS (Picture Archiving and Communication System) can no longer meet daily storage and retrieval needs. Using big-data distributed platforms such as Hadoop together with new NoSQL databases to store and retrieve massive medical image data has become one of the effective ways to solve this problem.
HBase is a column-oriented database in the Hadoop ecosystem with excellent compatibility; it can read from and write to HDFS directly. Moreover, HBase's extensibility means users need not define a table schema in advance but can extend it dynamically as needed, removing the requirement, inherent to relational databases, of fixing the schema beforehand. However, HBase is still weak on indexing: it supports indexes only on the row key (primary key), so queries on non-primary-key columns require a full table scan, which is inefficient. Several design methods for non-primary-key indexes on HBase have been proposed; three representative ones are described below.
(1) Secondary index. This method adopts the inverted-index idea: the column to be indexed in the main table becomes the row key of an index table, and the main table's row key becomes the index table's value. A query first reads the index table to obtain the row keys associated with the indexed column value, then reads the corresponding rows from the main table. The method is simple but requires two queries, sacrificing some performance.
(2) Linearized index. This method maps K-dimensional data into a one-dimensional space and then relies on HBase's one-dimensional index. It works well for spatial data but is not ideal for textual data; since medical image files consist mainly of numeric and textual data, this method is also unsuitable.
(3) Double-layer index. This method combines a global index with local indexes to reduce the number of data nodes that must be queried, narrowing the search range from the high-level index to the low-level index. However, both index layers must be maintained on every write, which is expensive, and the two layers require different data structures, making the implementation complex.
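The two-query pattern of the secondary index method (1) can be sketched as follows; plain dicts stand in for the HBase tables, and the table contents and column names are illustrative assumptions, not taken from any concrete system:

```python
# Main table: row key -> row. A toy stand-in for the HBase data table.
main_table = {
    "row001": {"patient": "Zhang", "modality": "CT"},
    "row002": {"patient": "Li", "modality": "MR"},
    "row003": {"patient": "Wang", "modality": "CT"},
}

# Index table built on the inverted-index idea: the indexed column value
# becomes the key, and the main table's row keys become the value.
index_table = {}
for row_key, row in main_table.items():
    index_table.setdefault(row["modality"], []).append(row_key)

def query_by_modality(value):
    # Query 1: read the index table to obtain the matching row keys.
    row_keys = index_table.get(value, [])
    # Query 2: read the matching rows from the main table.
    return [main_table[rk] for rk in row_keys]

print(query_by_modality("CT"))  # the two CT rows
```

The two dictionary reads correspond to the two round trips that the method pays for its simplicity.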
An index is itself a data structure whose purpose is to locate data quickly. Indexed lookup can be divided into two steps: first, determine whether the queried item is in the collection; second, if it is, locate its exact position.
Consider the first step: deciding whether an element belongs to a collection of n items. The most common approach is to compare the element with each member of the collection one by one, as in a sequential list, but this takes O(n) time and is inefficient. A hash table instead stores the elements in an array with a large subscript range: each element's key is fed through a predefined hash function, and the result is used as the array subscript at which the element is stored. Hashing locates elements quickly and exactly, in O(1) time. Of course, collisions can occur, where the keys of different elements hash to the same value, and many collision-resolution schemes have been devised. However, hashing wastes memory, because with a large data volume the storage array must also be made very large. The Bloom filter is in principle also a hashing mechanism, but with better space efficiency than a hash table; its core idea is to randomize the hash mapping. First a large bit string is allocated. Each key in the collection is run through several hash functions, each resulting value is taken modulo the length of the bit string, and, as in a hash table, the corresponding positions are set to 1; we call these the key's signature bits. In short, every key is mapped to several positions in the bit string. To look up a key quickly, run it through the same hash functions and inspect the bits they address: if all of them are 1, the key matches; if at least one is 0, the lookup fails. The algorithm has one flaw: an element that does not belong to the collection may be misjudged as belonging to it, a "false positive". If a non-member element happens to hash to positions that are all also set by members of the collection, a misjudgment occurs. The present invention adopts optimizations to reduce this false-positive rate.
Now consider the second step: once the queried element is judged to be in the collection, how is it located? This patent adopts an improved secondary index method. The traditional secondary index is inefficient mainly because it requires two queries: the result of the first query must be returned to the client, which then issues a second query. This produces redundant I/O operations, and network I/O is far slower than in-memory retrieval, so any method that reduces this I/O overhead can greatly improve the query speed of the secondary index.
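The Bloom filter lookup described above can be sketched as follows. Salted SHA-256 digests stand in for the k independent hash functions; the key values are illustrative:

```python
import hashlib

# Minimal Bloom filter sketch: each key is hashed by k functions, each
# result is taken modulo the bit-string length m, and the addressed bits
# are set to 1. A query succeeds only if all k bits are 1; a single
# 0 bit proves the key absent.
class BloomFilter:
    def __init__(self, m_bits: int, k: int):
        self.m = m_bits
        self.k = k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive k independent hash values from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = BloomFilter(m_bits=1 << 16, k=8)
bf.add("CT-2019-000123")
print(bf.might_contain("CT-2019-000123"))  # True
print(bf.might_contain("MR-2018-999999"))  # almost certainly False
```

Note that `might_contain` can only answer "definitely absent" or "possibly present"; the second answer is wrong with exactly the false-positive probability analyzed later.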
In summary, the present invention uses an improved Bloom filter to decide whether the queried element is in the collection, obtaining fast feedback with a very low error rate, and then uses a redesigned secondary index scheme to overcome the inefficiency of the traditional secondary index. The invention therefore proposes a new index scheme combining the two: a bit-vector method replaces the single global bit string of the Bloom filter with a structure in which every hash function is allocated its own bit string, further reducing the error rate. Meanwhile, by controlling HBase's data-splitting behavior, the index table and the data table are kept physically in the same Region, so that both queries can be issued within that Region, saving two time-consuming I/O operations.
Summary of the Invention
The invention comprises the following:
① A method for building an HBase index based on the Bloom filter algorithm is proposed, used to determine whether the queried element is present in the index table, with the error rate reduced through optimization. A traditional Bloom filter maps the values produced by hashing the queried element with multiple hash functions into a single bit string; the optimization of the present invention maps the value from each hash function onto its own separate bit string, composing these bit strings into a vector group, which reduces the error rate at the cost of a modest increase in memory.
② An improved design method for the HBase secondary index is proposed. Using HBase's built-in coprocessor, the data table and the index table are kept physically in the same Region, reducing the two I/O operations required by the traditional HBase secondary index scheme to one and greatly improving retrieval speed.
③ A sampling hash method is proposed to solve the write-hotspot problem of HBase Regions, namely that logically adjacent data is always written to the same or adjacent Regions. The method pre-estimates the total number of Regions, then samples and hashes the row keys so that write requests are distributed evenly across the Regions, eliminating the write hotspot.
The invention is an integrated index design method. Given the many advantages of the Bloom filter algorithm, an optimized Bloom filter is used as the first stage of the index, and an improved secondary index scheme further raises the retrieval speed of the HBase index.
To achieve the above object, the present invention adopts the following technical scheme:
Step 1. Optimize the traditional Bloom filter with the bit-vector method. A traditional Bloom filter maps the values produced by hashing the queried element with multiple hash functions into a single bit string; the optimization of the present invention maps the value from each hash function onto its own separate bit string, composing these bit strings into a vector group, which reduces the error rate at the cost of a modest increase in memory. This filter is applied to the HBase index as the first layer: if the queried element is in the index table, proceed to Step 2; if not, skip to Step 4.
Step 2. Disable HBase's automatic region splitting and use the improved secondary index method of the invention: estimate the number of Regions and their split points, then hash the data table's row keys and divide the key range evenly. As a result, writes are not concentrated on any hotspot, better exploiting HBase's ability to spread requests across servers; most importantly, the data table and the index table stay co-located, reducing the time spent on I/O operations.
Step 3. Build the index module with HBase's built-in Coprocessor. When Step 1 finds the queried element in the index table, the coprocessor performs both queries locally on the server, reducing query time.
Step 4. Return the query result.
Brief Description of the Drawings
Fig. 1 Flowchart of the optimized Bloom filter algorithm
Fig. 2 Flowchart of the optimized secondary index scheme
Fig. 3 Flowchart of the pre-splitting algorithm
Fig. 4 Flowchart of the overall query procedure
Detailed Description
The invention starts from the basic scheme of combining a Bloom filter with a secondary index, improves the underlying algorithms, and applies them to the index, aiming at faster retrieval over the HBase index.
A traditional Bloom filter produces "false positives": elements that do not belong to the collection may be misjudged as belonging to it. Let us analyze the probability of such a misjudgment. Suppose the Bloom filter has a bit string of m bits and each element is fingerprinted by k hash functions; some of the m bits are 1 and some are 0. First consider the probability that a given bit is 0. When an element is inserted, its first hash function sets some bit of the filter to 1, so the probability that any particular bit is set is 1/m, and the probability that it remains 0 is 1 - 1/m. For a particular bit, the probability that none of the element's k hash functions sets it to 1 is given by Equation 1:

(1 - 1/m)^k  (1)
If a second element is inserted into the filter, the probability that the particular bit is still not set to 1 is given by Equation 2:

(1 - 1/m)^(2k)  (2)
After n elements have been inserted, the probability that the bit has still not been set to 1 is given by Equation 3:

(1 - 1/m)^(kn)  (3)
Conversely, the probability that a bit has been set to 1 after n elements are inserted is given by Equation 4:

1 - (1 - 1/m)^(kn)  (4)
Now suppose all n elements have been placed in the Bloom filter's bit string, and a new element not in the collection arrives. Since its fingerprint hash functions are random, the probability that its first hash function happens to hit a bit whose value is 1 is exactly the probability above. For a non-member element to be misidentified as a member, the bits addressed by all k hash functions must be 1; this probability p is given by Equation 5:

p = [1 - (1 - 1/m)^(kn)]^k  (5)
Rewriting the inner term as an exponential, this simplifies to:

p = [1 - e^(kn·ln(1 - 1/m))]^k  (6)
If n and m are large, using ln(1 - 1/m) ≈ -1/m, this can be approximated as:

p ≈ (1 - e^(-kn/m))^k  (7)
Assuming 16 bits per element (m/n = 16) and k = 8, the false-positive probability is about five in ten thousand.
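Equation 7 and the numeric claim above can be checked directly:

```python
import math

# False-positive probability of a standard Bloom filter (Equation 7):
# p ≈ (1 - e^(-k*n/m))^k, expressed via bits per element m/n.
def bloom_false_positive(bits_per_element: float, k: int) -> float:
    """bits_per_element is m/n; k is the number of hash functions."""
    return (1.0 - math.exp(-k / bits_per_element)) ** k

# The example in the text: 16 bits per element, k = 8.
p = bloom_false_positive(16, 8)
print(f"{p:.6f}")  # roughly 0.00057, i.e. about five in ten thousand
```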
Next we discuss the improvement of the algorithm. After one key is mapped by d hash functions, the probability that a given bit of a bit string V of length N is 1 is approximately d/N. Each function h_i (i = 1, …, d) is independent and random. Given a collection S = {X1, X2, X3, …, Xy} of y members, once all members of S have been mapped by these hash functions into the bit string of length N, the probability P that a given bit is 1 is given by Equation 8:

P = 1 - (1 - 1/N)^(dy)  (8)
If some element K outside the collection is mistaken for a member of the data set represented by the hash mapping, then every one of its d hashes lands on a bit that is 1. The error rate P_err is therefore given by Equation 9:
P_err = P^d  (9)
From this error rate we can derive the average decision time T of the Bloom filter. A lookup evaluates the hash functions one by one and stops at the first 0 bit, so the expected number of hash probes (in units of one probe) is given by Equation 10:

T = 1 + P + P^2 + … + P^(d-1) = (1 - P^d) / (1 - P)  (10)
Generally, the range N addressed by the bit vector is much larger than the size y of the data source collection S, because if y > N the Bloom filter's error rate becomes very large: collisions are then inevitable when the data Xi are hashed into the bit vector, and they can only be reduced by a careful choice of hash functions. Such collisions arise from the intrinsic nature of the hash representation; we call them internal collisions. Corresponding to internal collisions are external collisions, caused by multiple hash functions mapping into the same bit vector. The root cause of collisions is thus an insufficient number of mapping addresses. Can some address space be added without sacrificing query performance? Based on this idea, each hash function h_i is given an independent bit vector for address mapping, forming a vector group V. Suppose there is a data set A and x is an element of A; the vector group V is then characterized by Equation 11:

V(i, h_i(x)) = 1, i = 1, 2, …, d  (11)
where V(i, j) denotes the j-th bit of the i-th vector. Assuming all hash functions are uniformly random, the error rate p_new of the address mapping caused by internal collisions of each function within its own bit string is given by Equation 12:

p_new = 1 - (1 - 1/N)^y  (12)
The formula for the average decision time is unchanged after the improvement, but because the error rate p_new is smaller, the overall average decision time also decreases.
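The vector-group structure can be sketched as follows; salted SHA-256 digests again stand in for the d independent hash functions, and the sizes are illustrative:

```python
import hashlib

# Sketch of the bit-vector optimization: instead of one shared bit string,
# each of the d hash functions owns its own bit string of length N, and
# the d strings together form the vector group V of Equation 11.
class VectorGroupBloomFilter:
    def __init__(self, n_bits: int, d: int):
        self.n = n_bits
        self.d = d
        # One independent bit string per hash function.
        self.vectors = [bytearray((n_bits + 7) // 8) for _ in range(d)]

    def _position(self, i: int, key: str) -> int:
        digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
        return int.from_bytes(digest, "big") % self.n

    def add(self, key: str) -> None:
        for i, vec in enumerate(self.vectors):
            p = self._position(i, key)
            vec[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # The i-th hash is checked only against its own vector V_i,
        # so external collisions between hash functions cannot occur.
        for i, vec in enumerate(self.vectors):
            p = self._position(i, key)
            if not vec[p // 8] & (1 << (p % 8)):
                return False
        return True

vg = VectorGroupBloomFilter(n_bits=1 << 14, d=8)
vg.add("CT-2019-000123")
print(vg.might_contain("CT-2019-000123"))  # True
```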
The structure of the algorithm can be seen clearly in Fig. 1.
Now consider the improved secondary index scheme. In HBase the data volume of a single table is usually very large, so a table's data is stored across one or more Regions, each supervised by a RegionServer. A Region carries start and end row keys that delimit its key range; during a read or write, the Region whose key range contains the requested row key is hit and serves the operation. The problem is that a Region splits once it grows to a certain size, a behavior determined by HBase's LSM-tree storage structure. It can therefore happen that data of the data table and the index table that originally shared a Region is split into different Regions. A single client query then incurs four I/O operations: first querying the index table with the search value, second receiving the main table's row key, third querying the main table by that row key, and fourth receiving the final result. Although excellent I/O frameworks such as Netty can greatly improve I/O speed, reducing the number of such I/O operations yields a predictable performance gain. The concrete implementation is as follows:
The Coprocessor provided by HBase can run programs directly on the server, achieving the effect of computation running local to the data. The remaining problem is to ensure that the main table and the index table reside on the same RegionServer. As explained above, the Region in which a row is stored is determined by the range its row key falls into, so it suffices for the index table's row keys to match the main table's; and because row-key retrieval follows the leftmost-prefix principle, it is enough for the index table's row key to begin with the same prefix as the main table's. Since row keys must also be unique, we design the index table's row key as: the Region's start row key + index name + index value + the main table's row key. The leading part keeps the index table in the same Region as the main table, and the trailing part guarantees the uniqueness of the index row key.
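The row-key layout above can be sketched as a simple concatenation; the separator and the example values are illustrative assumptions, not part of the patent's specification:

```python
# Index-table row key layout from the text:
# region start key + index name + index value + main-table row key.
SEP = "\x00"  # an assumed separator byte, chosen to sort before data bytes

def make_index_row_key(region_start_key: str, index_name: str,
                       index_value: str, main_row_key: str) -> str:
    # The leading region start key pins the index row to the same Region
    # as the data row; the trailing main-table row key keeps it unique.
    return SEP.join([region_start_key, index_name, index_value, main_row_key])

rk = make_index_row_key("region-0010", "idx_modality", "CT", "row001")
print(rk.split(SEP))  # ['region-0010', 'idx_modality', 'CT', 'row001']
```

Because the key starts with the Region's start key, a prefix scan for `region start key + index name + index value` finds all matching index rows without leaving the Region.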
HBase offers essentially three region-splitting methods. The first is pre-splitting: before the table is created, the number of partitions and the row-key range of each Region are fixed, and only then is data written. The second is automatic splitting, HBase's default: the table starts with a single Region, which grows as data is written and, upon reaching a certain size, splits into two Regions of equal size; writes then continue into the newly created Regions, which split in turn, and so on. The third is forced splitting: specific commands are issued through the HBase shell to control splitting explicitly.
With the default automatic splitting, where data is written in increasing row-key order, a Region write hotspot can arise. After a Region splits in two, its key range is also halved, yet data continues to arrive in increasing key order; subsequent writes therefore always land on the Region with the larger start key, while the Region with the smaller keys is rarely written again. This squanders the load-balancing capability of the distributed database and hurts HBase's otherwise strong write performance.
This scheme adopts the first method, pre-splitting, but additionally solves the write-hotspot problem with a sampling hash method, making good use of load balancing. Pre-splitting can be implemented through the programming interface provided by HBase, but before the table is created one must know the number of Regions of the data table and the split point of each Region, i.e. fix the row-key range of every partition. To solve the Region write-hotspot problem, this scheme designs a sampling hash method, described in detail with reference to Fig. 3:
Step 1: Estimate the number of Regions M.
The number of Regions strongly affects HBase's overall read/write efficiency: too many Regions and memory usage becomes excessive; too few and concurrency is under-utilized. We therefore take the reasonable values established by extensive industry experimentation and combine them with the size of our data table to estimate the number of Regions, with the following formula:

M = (RSXmx × hbase.regionserver.global.memstore.size) / (hbase.hregion.memstore.flush.size × cf)
where RSXmx is the heap size of one RegionServer, hbase.regionserver.global.memstore.size and hbase.hregion.memstore.flush.size take the optimal values officially recommended by HBase (available in the HBase reference guide), and cf is the number of column families of the data table. This yields the number of Regions M.
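A worked instance of the sizing rule, under assumed parameter values: 0.4 and 128 MB below are HBase's shipped defaults for the two memstore settings, while the heap size and column-family count vary per cluster:

```python
# Estimate of the Region count M from the sizing formula above.
def estimate_region_count(rs_heap_mb: float,
                          global_memstore_fraction: float = 0.4,
                          flush_size_mb: float = 128.0,
                          column_families: int = 1) -> int:
    m = (rs_heap_mb * global_memstore_fraction) / (flush_size_mb * column_families)
    return max(1, int(m))

# A RegionServer with a 16 GB heap and a single column family:
print(estimate_region_count(16 * 1024))  # 51
```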
Step 2: Hash the row keys to produce order-free strings.
Because retrieval must remain possible, a reversible encryption algorithm is preferred; this scheme uses the AES algorithm to scramble each row key into a random-looking string.
Step 3: Sample, randomly drawing a certain number of row keys, then sort them in ascending order and place them in a collection.
Step 4: Divide the collection evenly according to the estimated partition count M to find the split points.
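Steps 2-4 can be sketched as follows. AES is not in the Python standard library, so a keyed, deterministic scrambling based on hashlib stands in here for the reversible AES step; a real deployment would encrypt the row key as the text describes and keep the ciphertext for lookup. The key format is an illustrative assumption:

```python
import hashlib
import random

def scramble(row_key: str, secret: str = "demo-key") -> str:
    # Stand-in for the AES scrambling of Step 2 (deterministic, keyed,
    # but NOT reversible, unlike the AES step in the scheme).
    return hashlib.sha256((secret + row_key).encode()).hexdigest()

def find_split_points(row_keys, num_regions, sample_size=1000, seed=42):
    # Step 2: scramble every row key so logically adjacent keys scatter.
    scrambled = [scramble(k) for k in row_keys]
    # Step 3: draw a random sample and sort it ascending.
    rng = random.Random(seed)
    sample = sorted(rng.sample(scrambled, min(sample_size, len(scrambled))))
    # Step 4: pick M-1 evenly spaced sample elements as Region split points.
    step = len(sample) // num_regions
    return [sample[i * step] for i in range(1, num_regions)]

splits = find_split_points([f"patient-{i:06d}" for i in range(10000)],
                           num_regions=8)
print(len(splits))  # 7 split points delimit 8 Regions
```

The resulting split points would be passed to table creation so that each pre-created Region receives roughly the same share of writes.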
Finally, with reference to Fig. 4, the steps of the overall scheme are summarized:
Step 1: Start the query request.
Step 2: Parse the query through the Coprocessor.
Step 3: Query the index table.
Step 4: Filter through the Bloom filter. If the element may be present, go to Step 5; if it is definitely absent, skip to Step 6.
Step 5: Query the main table.
Step 6: Return the final result.
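The whole flow can be sketched end to end. A single object stands in for a Region holding the index table, the main table, and a Bloom filter over the indexed values, so both lookups happen "locally", as the coprocessor scheme intends; dicts and a single bit string are simplifying stand-ins for the real HBase structures:

```python
import hashlib

class Region:
    def __init__(self, m_bits=1 << 16, k=8):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)
        self.index_table = {}   # indexed value -> list of main-table row keys
        self.main_table = {}    # row key -> row

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h, "big") % self.m

    def put(self, row_key, row, indexed_value):
        self.main_table[row_key] = row
        self.index_table.setdefault(indexed_value, []).append(row_key)
        for p in self._positions(indexed_value):
            self.bits[p // 8] |= 1 << (p % 8)

    def query(self, value):
        # Step 4: Bloom filter; any 0 bit means definitely absent.
        if not all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(value)):
            return []                                # Step 6: nothing found
        rows = self.index_table.get(value, [])       # Steps 3/5: index, then main
        return [self.main_table[rk] for rk in rows]  # Step 6: final result

r = Region()
r.put("row001", {"patient": "Zhang"}, indexed_value="CT")
print(r.query("CT"))   # [{'patient': 'Zhang'}]
print(r.query("PET"))  # []
```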
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910070748.1A | 2019-01-25 | 2019-01-25 | HBase index design method based on Bloom filter for medical imaging data |
| Publication Number | Publication Date |
|---|---|
| CN109977113A | 2019-07-05 |