Summary of the invention
The object of the present invention is to provide a Hadoop-based mass small-file access method and system that reduce NameNode memory consumption, improve access efficiency, and reduce system load.
In order to solve the above technical problems, an embodiment of the present invention provides a Hadoop-based mass small-file access method, comprising:
Step 1, judging whether a small file needs to be saved;
If so, step 2, classifying the small file according to a predetermined characteristic, and then putting the small file and the small-file index of the small file into a small-file queue;
Step 3, judging whether the length of the small-file queue reaches a threshold;
If so, step 4, merging the multiple small files in the small-file queue into a large file, establishing a global index, storing the correspondence into the file index, and then initiating a storage request to the NameNode;
Step 5, after the NameNode divides the large file into data blocks according to a default block size, storing the large file on at least one DataNode, and writing the DataNode where each data block resides and the state of that DataNode into the namespace.
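Steps 1 through 4 above can be sketched in miniature as follows. This is an illustrative in-memory model, not HDFS code: the class name, the queue threshold, and the index layout are all hypothetical stand-ins for the structures the method describes.

```python
QUEUE_THRESHOLD = 3  # hypothetical small-file-queue length threshold

class SmallFileStore:
    """Toy model of the store path: queue small files, merge on threshold."""

    def __init__(self):
        self.queue = []          # (name, data) pairs awaiting merge
        self.global_index = {}   # name -> (big_file_id, offset, length)
        self.big_files = []      # merged large files, one bytes blob each

    def save(self, name, data):
        self.queue.append((name, data))
        if len(self.queue) >= QUEUE_THRESHOLD:
            self._merge()        # step 4: merge queue into one large file

    def _merge(self):
        big_id = len(self.big_files)
        blob, offset = b"", 0
        for name, data in self.queue:
            # record the small file's position inside the large file
            self.global_index[name] = (big_id, offset, len(data))
            blob += data
            offset += len(data)
        self.big_files.append(blob)
        self.queue.clear()

    def read(self, name):
        big_id, off, length = self.global_index[name]
        return self.big_files[big_id][off:off + length]
```

In the real method the merged blob would be handed to the NameNode for block splitting (step 5); here it simply stays in memory.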
Wherein, step 2 includes:
Sorting the small files by the file type or the creation time of the small files.
Wherein, step 4 includes:
Merging the multiple small files in the small-file queue into a large file using a MapFile, where the MapFile includes an index part and a data part: the data part stores the file data, and the index part serves as the data index of the file, recording the key of each record and the offset position of the record in the file.
Wherein, after step 5, the method further includes:
Step 6, judging whether a small-file read request is received;
Step 7, pre-reading, from the large file where the small file corresponding to the small-file read request resides, the small files related to that small file.
Wherein, after step 7, the method further includes:
Step 8, judging whether the frequency with which the small file is accessed within a predetermined time reaches a threshold;
If so, step 9, storing the small file in a cache.
Wherein, after step 9, the method further includes:
Step 10, judging whether the time since the small file in the cache was last accessed reaches a predetermined length T;
If so, deleting the small file from the cache.
In addition, an embodiment of the present invention also provides a Hadoop-based mass small-file access system, comprising:
a small-file storage request module, configured to output a preprocessing command after detecting that a small file is to be stored;
a small-file preprocessing module, connected to the small-file storage request module and configured to: receive the preprocessing command; classify the small files according to a predetermined characteristic and then put the small files and their small-file indexes into a small-file queue; after the length of the small-file queue reaches a threshold, merge the multiple small files in the queue into a large file, establish a global index, store the correspondence into the file index, and then initiate a storage request to the NameNode; and control the NameNode, after it divides the large file into data blocks according to a preset block size, to store the large file on at least one DataNode and to write the DataNode where each data block resides and the state of that DataNode into the namespace.
Wherein, the system further includes a small-file read request module connected to the small-file preprocessing module and an index prefetch module. The small-file read request module is configured to issue a pre-read request to the index prefetch module after detecting a small-file read request, and the index prefetch module pre-reads, from the large file where the small file corresponding to the read request resides, the small files related to that small file.
Wherein, the system further includes a cache module connected to the index prefetch module, the cache module being configured to store the small files whose access frequency within a predetermined time reaches a threshold.
Wherein, the system further includes a cache cleaning module connected to the cache module. After detecting that the time since a small file in the cache was last accessed reaches a predetermined length T, the cache cleaning module deletes that small file from the cache.
Compared with the prior art, the Hadoop-based mass small-file access method and system provided by the embodiments of the present invention have the following advantages:
In the method and system provided by the embodiments of the present invention, before small files are stored they are first sorted by a predetermined characteristic and merged into large files from the small-file queue, and a small-file index and a global index are established, forming a double index. During reading, a lookup goes first to the global index and then to the small-file index, so the query speed is faster and small files are located quickly. Because fewer index files are needed, memory consumption and system load are reduced and access efficiency is improved; and because storage is organized by the predetermined characteristic, storage efficiency is higher and reading efficiency is correspondingly improved.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
Please refer to FIG. 1 to FIG. 4. FIG. 1 is a step flow diagram of one specific embodiment of the Hadoop-based mass small-file access method provided by an embodiment of the present invention; FIG. 2 is a step flow diagram of another specific embodiment of the method; FIG. 3 is a connection-structure diagram of one specific embodiment of the Hadoop-based mass small-file access system provided by an embodiment of the present invention; and FIG. 4 is a connection-structure diagram of another specific embodiment of the system.
In a specific embodiment, the Hadoop-based mass small-file access method comprises:
Step 1, judging whether a small file needs to be saved. Here it must be determined whether a small-file storage request exists before the subsequent steps are started, which saves memory. The storage request may be detected periodically, or at random intervals assigned by the system, for example every 1-3 s, or according to the recent arrival frequency of small files: if the frequency has been high, large-scale small-file storage is in progress, so the detection frequency should be raised and the detection interval shortened; conversely, the detection interval can be lengthened.
If so, step 2, classifying the small files according to a predetermined characteristic and then putting the small files and their small-file indexes into a small-file queue. The purpose of classification is to ease storage and subsequent reading: files sharing a characteristic can generally be stored and read together, which makes subsequent reads convenient. Otherwise, even locating a single small file would take much time, increasing both memory consumption and read time and greatly reducing access efficiency.
Step 3, judging whether the length of the small-file queue reaches a threshold. The purpose of this check is that the large files synthesized later all have a uniform length: the lengths of different large files are roughly equal, so the space they occupy is roughly equal, similar to packing cargo into boxes of one size, which greatly improves space utilization. The present invention places no limitation on the queue-length threshold; it can adapt automatically to the size of the large-file storage. For example, if the storage space for large files is 100 GB and 100 large files are allowed, each large file is at most 1 GB; if the space grows to 200 GB while 100 files are still allowed, each large file may grow to at most 2 GB. Alternatively, each large file may be capped at 1 GB regardless of the storage space, with only the number of files changing with the size of the space. The present invention is not limited in this respect.
If so, step 4, merging the multiple small files in the small-file queue into a large file, establishing a global index, storing the correspondence into the file index, and then initiating a storage request to the NameNode. The global index established here forms a double index together with the small-file indexes created earlier in the small-file queue, so that subsequent reads can use the double-index structure and locate small files more quickly, speeding up both the reading and the storage of small files.
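The double-index lookup described above can be illustrated with a minimal sketch. The dictionary structures and file names here are hypothetical; the point is only the two-hop resolution: the global index finds the large file, and that large file's own small-file index finds the record inside it.

```python
# First hop: global index maps a small file to the large file containing it.
global_index = {"a.log": "big_0", "b.log": "big_0", "c.log": "big_1"}

# Second hop: each large file's small-file index maps a name to (offset, length).
small_file_index = {
    "big_0": {"a.log": (0, 5), "b.log": (5, 3)},
    "big_1": {"c.log": (0, 4)},
}

def locate(name):
    """Resolve a small file via the double index: global, then per-file."""
    big = global_index[name]                       # which large file
    offset, length = small_file_index[big][name]   # where inside it
    return big, offset, length
```

Because each hop is a direct dictionary lookup rather than a scan over every small file's metadata, the query path stays short even as the number of small files grows.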
Step 5, after the NameNode divides the large file into data blocks according to a default block size, storing the large file on at least one DataNode, and writing the DataNode where each data block resides and the state of that DataNode into the namespace.
By sorting small files by a predetermined characteristic before storage, merging them into large files from the small-file queue, and establishing a small-file index and a global index, a double index is formed. During reading, a lookup goes first to the global index and then to the small-file index, so the query speed is faster and small files are located quickly. Because fewer index files are needed, memory consumption and system load are reduced and access efficiency is improved; and because storage is organized by the predetermined characteristic, storage efficiency is higher and reading efficiency is correspondingly improved.
In the present invention, certain preprocessing is required before small files are stored: they are first sorted into categories and then merged into large files. The present invention places no limitation on the classification method or its requirements. In an embodiment of the present invention, step 2 includes:
Sorting the small files by the file type or the creation time of the small files.
It should be pointed out that a single classification scheme is generally used in the present invention, for example classifying only by file type or only by creation time. A mixed scheme could make one small file belong to both categories at once, which makes the division ambiguous. The present invention may certainly also classify in other manners and is not limited in this respect.
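A single-scheme classifier of the kind described can be sketched as below. The function name and the category formats (lower-cased extension, calendar day of creation) are illustrative choices, not prescribed by the method.

```python
import time

def classify(name, ctime, by="type"):
    """Assign a small file to one category, by extension OR by creation day.

    Only one scheme is applied per call, so each file lands in exactly one
    category, avoiding the ambiguity of a mixed scheme.
    """
    if by == "type":
        return name.rsplit(".", 1)[-1].lower() if "." in name else "other"
    # by creation time: bucket files by the UTC day they were created
    return time.strftime("%Y-%m-%d", time.gmtime(ctime))
```

Files sharing a category would then be queued and merged together, so related files end up in the same large file.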
In the present invention, the small files are merged into large files after classification, which resembles a standardization of small-file storage. Once merged, a read first accesses the large file and then reads the small file within it. The present invention places no limitation on the merging method. In one embodiment, step 4 includes:
Merging the multiple small files in the small-file queue into a large file using a MapFile, where the MapFile includes an index part and a data part: the data part stores the file data, and the index part serves as the data index of the file, recording the key of each record and the offset position of the record in the file.
When a MapFile is accessed, its index file is loaded into memory, and the position of a specified record in the file can be located rapidly through the index mapping, which greatly improves retrieval efficiency and in turn access efficiency.
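The lookup mechanism can be sketched as follows. This is a simplification, not the Hadoop `MapFile` API: a real MapFile keeps a sparse index (one entry every N records) over sorted keys and scans forward from the nearest entry, whereas this sketch indexes every key. The principle, that a small sorted in-memory index yields the record's offset in the data part, is the same.

```python
import bisect

# Index part loaded into memory: sorted keys with their data-part offsets.
# Values here are placeholders for illustration.
index_keys = ["a", "c", "f", "k"]
index_offsets = [0, 120, 260, 410]

def seek(key):
    """Binary-search the in-memory index for a key's offset in the data part."""
    i = bisect.bisect_left(index_keys, key)
    if i < len(index_keys) and index_keys[i] == key:
        return index_offsets[i]   # jump straight to the record
    return None                   # key not present
```

Since only the compact index part lives in memory, many small files can be resolved without keeping per-file metadata for each of them.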
The preprocessing applied to small files in the present invention changes the storage layout so that they are convenient to read. During reading, a better read mechanism can further improve access efficiency. In an embodiment of the present invention, the Hadoop-based mass small-file access method further includes, after step 5:
Step 6, judging whether a small-file read request is received;
Step 7, pre-reading, from the large file where the small file corresponding to the small-file read request resides, the small files related to that small file.
After a small-file read request is received, the requested small file and the files related to it are pre-read, which saves the steps and time of interacting with the NameNode. Under this prefetch mechanism, the number of accesses to the NameNode node is greatly reduced, noticeably improving the operating efficiency of the NameNode.
In one embodiment, when HDFS attempts to read a small file in a MapFile, the metadata of the other related small files in the same MapFile is prefetched from the NameNode node. Because the small files in one MapFile are correlated, a user reading one file often accesses the related files as well. When the metadata of the related small files is stored in the HDFS client cache, the client saves the steps and time of interacting with the NameNode, so the number of NameNode accesses drops greatly, noticeably improving NameNode operating efficiency, reducing the memory consumed by the NameNode, and lowering the system load.
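The prefetch effect can be sketched with hypothetical in-process stand-ins for the HDFS client and the NameNode. The first read of a small file costs one NameNode round trip but pulls in the metadata of every sibling in the same MapFile, so later reads of related files hit the client cache instead.

```python
# Hypothetical NameNode-side view: large file -> {small file: metadata}.
namenode_metadata = {
    "big_0": {"a.log": "meta-a", "b.log": "meta-b", "c.log": "meta-c"},
}
membership = {"a.log": "big_0", "b.log": "big_0", "c.log": "big_0"}

class Client:
    """Toy HDFS client with a metadata cache fed by sibling prefetch."""

    def __init__(self):
        self.cache = {}
        self.namenode_calls = 0

    def read_metadata(self, name):
        if name not in self.cache:
            self.namenode_calls += 1            # one NameNode round trip...
            big = membership[name]
            self.cache.update(namenode_metadata[big])  # ...fetches all siblings
        return self.cache[name]
```

Reading three correlated files this way costs one NameNode access instead of three, which is the access-volume reduction the text describes.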
In order to further improve reading efficiency, in an embodiment of the present invention, after step 7 the method further includes:
Step 8, judging whether the frequency with which the small file is accessed within a predetermined time reaches a threshold;
If so, step 9, storing the small file in a cache.
Some files are queried repeatedly, and the access frequency of each file differs. To improve reading speed, an access record is written after a user reads a file, and access counts are accumulated. Files with high access frequency are placed on a cache server serving as the cache; when a user reads the same file again, it need only be read from the cache server, so the time consumed reading these files drops greatly and access efficiency improves.
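Steps 8 and 9 amount to count-then-promote caching, sketched below. The class name, the threshold value, and the `loader` callback (standing in for a read from HDFS) are all hypothetical.

```python
from collections import Counter

HOT_THRESHOLD = 3  # hypothetical access-count threshold for promotion

class HotFileCache:
    """Count accesses per file; promote frequently read files into the cache."""

    def __init__(self):
        self.hits = Counter()
        self.cache = {}

    def on_read(self, name, loader):
        self.hits[name] += 1
        if name in self.cache:
            return self.cache[name]       # fast path: served from cache
        data = loader(name)               # slow path: read from backing store
        if self.hits[name] >= HOT_THRESHOLD:
            self.cache[name] = data       # step 9: promote the hot file
        return data
```

After the threshold is reached, repeat reads of the same file never touch the backing store, which is the speedup the text attributes to the cache server.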
However, user access behavior often changes. If too many files, including infrequently used ones, are stored in the cache, the cache becomes bloated; and because cache space is limited, the number of small files it can hold is limited, so the efficiency of this high-quality storage resource cannot be fully exploited. To solve this technical problem, in an embodiment of the present invention, after step 9 the method further includes:
Step 10, judging whether the time since the small file in the cache was last accessed reaches a predetermined length T;
If so, step 11, deleting the small file from the cache.
By judging the interval since a small file was last used: if the interval exceeds the threshold T, the probability that the file will be used again has declined, and its value has fallen below the lower limit for caching, so keeping it would lower the usage efficiency of the cache. Deleting it lets the cache hold files with a higher probability of being accessed, which helps improve file reading speed. By using the double-index mechanism together with the caching mechanism, the present invention raises the hit probability of files on both the client side and the server side and enhances the robustness of the system.
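The eviction rule of steps 10 and 11 can be written as a small function. The value of T and the shape of the bookkeeping dictionaries are illustrative assumptions.

```python
T = 60.0  # hypothetical predetermined idle limit, in seconds

def evict_idle(cache, last_access, now):
    """Drop every cached file whose idle time since last access reaches T.

    `cache` maps name -> data; `last_access` maps name -> timestamp of the
    most recent read. Both are mutated in place; the cache is returned.
    """
    stale = [n for n, t in last_access.items() if now - t >= T]
    for name in stale:
        cache.pop(name, None)
        del last_access[name]
    return cache
```

Running this check on each access (or periodically) keeps the limited cache space reserved for files likely to be read again.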
In addition, an embodiment of the present invention also provides a Hadoop-based mass small-file access system, comprising:
a small-file storage request module 10, configured to output a preprocessing command after detecting that a small file is to be stored;
a small-file preprocessing module 20, connected to the small-file storage request module 10 and configured to: receive the preprocessing command; classify the small files according to a predetermined characteristic and then put the small files and their small-file indexes into a small-file queue; after the length of the small-file queue reaches a threshold, merge the multiple small files in the queue into a large file, establish a global index, store the correspondence into the file index, and then initiate a storage request to the NameNode; and control the NameNode, after it divides the large file into data blocks according to a preset block size, to store the large file on at least one DataNode and to write the DataNode where each data block resides and the state of that DataNode into the namespace.
Since the Hadoop-based mass small-file access system is a system based on the above Hadoop-based mass small-file access method, it has the same beneficial effects, and the present invention is not limited in this respect.
In order to further improve file reading efficiency, in an embodiment of the present invention the Hadoop-based mass small-file access system further includes a small-file read request module 30 connected to the small-file preprocessing module 20, and an index prefetch module 40. The small-file read request module 30 is configured to issue a pre-read request to the index prefetch module 40 after detecting a small-file read request, and the index prefetch module 40 pre-reads, from the large file where the small file corresponding to the read request resides, the small files related to that small file.
With this pre-read approach, the index prefetch module sits between the HDFS client and the NameNode. When HDFS attempts to read a small file in a large file (for example a MapFile produced by the merging described above), the metadata of the other related small files in the same large file is prefetched from the NameNode node. Because the small files in one MapFile are correlated, a user reading one file often accesses the related files as well; when the metadata of the related small files is stored in the HDFS client cache, the client saves the steps and time of interacting with the NameNode. Under this prefetch mechanism, the number of NameNode accesses is greatly reduced, noticeably improving the operating efficiency of the NameNode.
In order to further improve file reading efficiency, in an embodiment of the present invention, the Hadoop-based mass small-file access system further includes a cache module 50 connected to the index prefetch module 40, the cache module 50 being configured to store the small files whose access frequency within a predetermined time reaches a threshold.
Some files are queried repeatedly, and the access frequency of each file differs. To improve reading speed, an access record is written after a user reads a file, and access counts are accumulated. Files with high access frequency are placed on a cache server serving as the cache; when a user reads the same file again, it need only be read from the cache server, so the time consumed reading these files drops greatly and access efficiency improves.
Thus, by adding the cache module, the present invention places the files that are read and used most frequently in the cache, exploits the naturally efficient read characteristics of a cache, and improves the reading efficiency of files.
However, user access behavior often changes, and high access frequency is tied to a particular time period. If the cache is clogged with or filled by files that are no longer frequently used, then, since its space is very limited, fewer useful files can be stored and its utilization efficiency drops. To solve this technical problem, in an embodiment of the present invention, the Hadoop-based mass small-file access system further includes a cache cleaning module connected to the cache module 50. After detecting that the time since a small file in the cache was last accessed reaches a predetermined length T, the cache cleaning module deletes that small file from the cache.
A timer is set up in the cache server to record the interval between the last access to a file and the present. When the interval exceeds the predetermined length T, the system automatically deletes the file. This realizes regular updating of the files in the cache, so that its frequency of use and usage efficiency always remain at a high level, improving service efficiency.
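The cache server's timer can be sketched as a periodic cleaner thread. This is one possible realization under stated assumptions: the wake-up interval, T, and the lock-guarded dictionaries are placeholders, and a real cache server would persist its state rather than hold it in process memory.

```python
import threading
import time

def start_cleaner(cache, last_access, T, interval, lock):
    """Start a daemon thread that periodically evicts files idle >= T seconds.

    `cache` maps name -> data, `last_access` maps name -> last-read timestamp;
    `lock` guards both against concurrent reader updates.
    """
    def loop():
        while True:
            time.sleep(interval)          # the timer tick
            now = time.time()
            with lock:
                stale = [n for n, t in last_access.items() if now - t >= T]
                for name in stale:
                    cache.pop(name, None)
                    del last_access[name]
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```

Readers would update `last_access[name]` under the same lock on every hit, so only genuinely idle files age out.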
In conclusion the mass small documents access method and system provided in an embodiment of the present invention based on Hadoop, passes throughBefore small documents storage, is first sorted out according to predetermined characteristic, big file is synthesized in small documents queue, establishes small documents ropeAnd vertical global index, so that double indexes are formed by small documents rope and vertical global index, so that small documents in reading processReading process in, first indexed again to small documents from global index, inquiry velocity has more block, realizes the quick positioning of small documents,Simultaneously because the index file needed is less, memory consumption, system load are reduced, improves access efficiency, while storing according to pre-Determine characteristic storage, storage efficiency is higher, can also improve reading efficiency accordingly.
The Hadoop-based mass small-file access method and system provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and implementation of the invention, and the above description of the embodiments is only intended to help understand the method and core idea of the invention. It should be pointed out that those of ordinary skill in the art can make several improvements and modifications to the present invention without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of the present invention.