A small file storage system adapted to the MapReduce computation model

Technical Field
The present invention relates to the fields of MapReduce and small file storage, and specifically to a small file storage system adapted to the MapReduce computation model.
Background Art
Hadoop is a distributed computing framework developed by Doug Cutting and the development team around him at Yahoo. Guided by the papers on GFS and MapReduce published by Google, this team implemented in the Java language a counterpart of Google's MapReduce, namely Hadoop, together with a distributed file system, HDFS.
The small file problem has gradually attracted attention in both academia and industry. The well-known social networking site Facebook stores 260 billion pictures with a total capacity exceeding 20 PB, and the overwhelming majority of these files are smaller than 64 MB. Most of the data accessed on the Internet consists of small files with high access frequency.
Sean Quinlan, the technical lead of GFS, mentioned in an interview on GFS that one of the application scenarios of BigTable is oriented toward small files. A report on the Small File Problem published by Cloudera, a well-known Hadoop company, likewise points out that Hadoop has problems when handling massive numbers of small files.
Handling such small files causes serious problems for the performance and scalability of HDFS. First, a huge number of small files brings a huge amount of metadata. Because the metadata of every directory and file in HDFS is kept in the memory of the name node, a system holding a large number of small files inevitably loses storage efficiency and storage capacity. For example, if the system holds 10 million small files and each small file occupies one block, the NameNode needs roughly 3 GB of memory (on the order of a few hundred bytes of metadata per file and block), so the memory size of the NameNode severely restricts the growth of the cluster.

Second, accessing a large number of small files is far slower than accessing a few large files, because it means constantly jumping from one DataNode to another, which is an inefficient data access pattern.

Third, the number of map tasks used to access large files differs greatly from the number used to access small files. For example, a 1 GB file is divided into sixteen 64 MB blocks, whereas 10,000 files of 100 KB each (1 GB in total) each require their own map task, so the resulting MapReduce job may take tens or even hundreds of times longer than the job on a single 1 GB file. Although Hadoop offers mitigations such as JVM reuse, they do not solve these problems well.
Hadoop itself provides Hadoop Archive (HAR) for merging small files into large files. A HAR file works by building a layered file system on top of HDFS; HAR files are created with Hadoop's archive command, which actually runs a MapReduce job to pack the small files into the HAR file.
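For reference, such an archive is created with the standard hadoop archive command, for example:

    hadoop archive -archiveName files.har -p /user/hadoop dir1 dir2 /user/outputdir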
Summary of the Invention
The present invention provides a method for merging and storing small files online that is suitable for the MapReduce computation process.
First, when the first file is uploaded to a directory in Hadoop under which small files are stored, the system creates a large file (referred to as a block) with a size of 64 MB, writes the content of this small file starting at offset 0 of the block, sets the small-file counter written at the end of the block to 1, and records the file name of this small file, its offset within the block, and its size. When a further file is uploaded under this directory, its content is written into the block starting from the beginning of the current blank area; its file name, offset within the block, and size are written at the end of the blank area; and the small-file counter at the end of the block is updated. In other words, the contents of the small files are stored one after another from the beginning of the block, the retrieval information of the small files is stored one after another from the end of the block, and the small-file counter is kept up to date.
The storage layout is shown in Fig. 1.
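For concreteness, the dual-ended layout described above can be pictured as follows (a textual sketch of the idea behind Fig. 1; how the name, offset, size, and counter are encoded is not fixed by this description):

    beginning of block                                                end of block
    | file 0 | file 1 | file 2 |  ... blank ...  | entry 2 | entry 1 | entry 0 | count |
      contents grow forward -->                    <-- retrieval entries grow backward

    entry i = { file name, offset within the block, file size } of the i-th small file
    count   = number of small files currently stored in the block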
When MapReduce reads these small files, it first reads the small files' retrieval information recorded at the end of the block, then organizes the files into key-value pairs and processes them in map. An input class that reads the small files stored under this merge scheme therefore needs to be implemented.
The MapReduce framework in Hadoop relies on InputFormat to supply data and on OutputFormat to emit data; every MapReduce program performs its input and output through these classes. Hadoop provides a series of InputFormat and OutputFormat classes to ease development. For example, TextInputFormat is used to read plain-text files: the file is divided into lines ending with LF or CR, the key is the position of each line (its byte offset, of type LongWritable), and the value is the content of the line, of type Text. KeyValueTextInputFormat likewise reads text files: if a line contains a separator (a tab by default), it is split into two parts, the first being the key and the remainder the value; if there is no separator, the whole line is the key and the value is empty. SequenceFileInputFormat is used to read sequence files; a sequence file is a binary file in which Hadoop stores data in a user-defined format. It has two subclasses: SequenceFileAsBinaryInputFormat, which reads keys and values as BytesWritable, and SequenceFileAsTextInputFormat, which reads keys and values as Text.
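As a concrete illustration of this contract (a minimal example, not taken from the invention), a mapper whose input types match TextInputFormat looks like this:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Input types match TextInputFormat: the key is the byte offset of the
    // line within the file, the value is the line itself.
    public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Emit each line keyed by its content, carrying its offset.
            context.write(line, offset);
        }
    }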
In the present invention, a custom input class, SmallBulkInputFormat, is needed to read the small files out of a bulk file and hand each small file to the map operation as a single key-value pair (an approach that is very common in applications such as large-scale picture processing).
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the layout of small files within a block.
Detailed Description of the Embodiments
Step 1: improve the HDFS flow for reading and writing small files.
When Hadoop writes a small file, several large 64 MB block files are first generated in advance. After the NameServer receives a client's request to write a file, it selects a DataServer according to load balancing to receive the write request and sends this DataServer's information to the client; the client then calls the improved write function interface (implemented identically to the original write interface, only the function name differs). After the DataServer receives the write request, it first selects one of the preallocated 64 MB files, writes the content of the small file carried in the request into this large file, and records the retrieval information of this small file in the file.
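As an illustration of the DataServer-side write step, the following minimal sketch appends one small file to a preallocated 64 MB block, using a local RandomAccessFile as a stand-in for the server's storage. The fixed-width index encoding (64-byte name field, 8-byte offset, 4-byte size, 4-byte tail counter) is an assumption made for illustration, and error handling such as a full-block check is omitted.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class SmallFileBlockWriter {

        static final long BLOCK_SIZE = 64L * 1024 * 1024; // preallocated block size
        static final int NAME_LEN = 64;                   // assumed name field width
        static final int ENTRY_SIZE = NAME_LEN + 8 + 4;   // name + offset + size

        // Appends one small file to a preallocated 64 MB block (opened in "rw" mode).
        public static void append(RandomAccessFile block, String name, byte[] content)
                throws IOException {
            // Read the small-file counter from the last 4 bytes of the block.
            block.seek(BLOCK_SIZE - 4);
            int count = block.readInt();

            // The first blank data byte is the end of the previously written
            // file's content, or 0 when the block is still empty.
            long dataOff = 0;
            if (count > 0) {
                block.seek(BLOCK_SIZE - 4 - (long) count * ENTRY_SIZE + NAME_LEN);
                dataOff = block.readLong() + block.readInt();
            }

            // Write the content forward from the blank position.
            block.seek(dataOff);
            block.write(content);

            // Write the index entry backward from the tail: name, offset, size.
            byte[] nameField = Arrays.copyOf(name.getBytes(StandardCharsets.UTF_8), NAME_LEN);
            block.seek(BLOCK_SIZE - 4 - (long) (count + 1) * ENTRY_SIZE);
            block.write(nameField);
            block.writeLong(dataOff);
            block.writeInt(content.length);

            // Update the counter at the end of the block.
            block.seek(BLOCK_SIZE - 4);
            block.writeInt(count + 1);
        }
    }

A freshly preallocated, zero-filled block reads back a counter of 0, so the first upload lands at offset 0, matching the flow described above.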
Step 2: provide the new input class.
First, SmallBulkInputFormat is defined to inherit from FileInputFormat; its core code is given below.
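The original core listing is not reproduced in this text; the following is a minimal sketch of such an input class, assuming the same fixed-width index encoding as the write-side sketch in step 1 (a 4-byte counter in the last bytes of the block, preceded by 76-byte entries of name, offset, and size); the real class would parse whatever encoding the write path actually uses.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class SmallBulkInputFormat extends FileInputFormat<Text, BytesWritable> {

        static final int NAME_LEN = 64;                  // assumed name field width
        static final int ENTRY_SIZE = NAME_LEN + 8 + 4;  // name + offset + size

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // one 64 MB bulk file is one HDFS block; read it whole
        }

        @Override
        public RecordReader<Text, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new SmallBulkRecordReader();
        }

        static class SmallBulkRecordReader extends RecordReader<Text, BytesWritable> {
            private FSDataInputStream in;
            private long blockLen;
            private int fileCount;  // counter stored in the last 4 bytes of the block
            private int current = -1;
            private final Text key = new Text();
            private final BytesWritable value = new BytesWritable();

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context)
                    throws IOException {
                Path path = ((FileSplit) split).getPath();
                FileSystem fs = path.getFileSystem(context.getConfiguration());
                blockLen = fs.getFileStatus(path).getLen();
                in = fs.open(path);
                in.seek(blockLen - 4);
                fileCount = in.readInt();
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (++current >= fileCount) {
                    return false;
                }
                // Index entries are written backwards from the tail of the block.
                in.seek(blockLen - 4 - (long) (current + 1) * ENTRY_SIZE);
                byte[] name = new byte[NAME_LEN];
                in.readFully(name);
                long offset = in.readLong();
                int size = in.readInt();
                key.set(new String(name, StandardCharsets.UTF_8).trim());
                // Read the small file's content from the data area of the block,
                // so each small file becomes one key-value pair for map().
                byte[] content = new byte[size];
                in.seek(offset);
                in.readFully(content);
                value.set(content, 0, size);
                return true;
            }

            @Override public Text getCurrentKey() { return key; }
            @Override public BytesWritable getCurrentValue() { return value; }
            @Override public float getProgress() {
                return fileCount == 0 ? 1.0f : (float) (current + 1) / fileCount;
            }
            @Override public void close() throws IOException {
                if (in != null) { in.close(); }
            }
        }
    }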
Step 3: the developer uses the new input class.
The developer writes files through the new write interface function and sets the input format of the Job to the SmallBulkInputFormat class.
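A sketch of this step under the assumptions above (SmallBulkInputFormat as sketched in step 2; the mapper here is an illustrative example that emits each small file's name and size, not part of the invention):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SmallBulkJob {

        // Each map() call receives one whole small file: its name as the key
        // and its content as the value.
        public static class SizeMapper
                extends Mapper<Text, BytesWritable, Text, IntWritable> {
            @Override
            protected void map(Text name, BytesWritable content, Context context)
                    throws IOException, InterruptedException {
                context.write(name, new IntWritable(content.getLength()));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "small-bulk-demo");
            job.setJarByClass(SmallBulkJob.class);
            job.setInputFormatClass(SmallBulkInputFormat.class); // the new input class
            job.setMapperClass(SizeMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }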