A small file storage system adapted to the MapReduce computation model

Technical Field
The present invention relates to the fields of MapReduce and small file storage, and specifically to a small file storage system adapted to the MapReduce computation model.
Background Art
Hadoop is a distributed computing framework developed by Doug Cutting and the development team around him at Yahoo. Guided by the papers on GFS and MapReduce published by Google, this team implemented in the Java language a counterpart of Google's MapReduce, namely Hadoop, together with a distributed file system, HDFS.
The small file problem has gradually attracted attention in both academia and industry. The well-known social networking site Facebook stores 260 billion pictures with a total capacity exceeding 20 PB, and the overwhelming majority of these files are smaller than 64 MB. Most of the data accessed on the Internet consists of small files with high access frequency.
Sean Quinlan, the technical lead of GFS, mentioned in an interview on GFS that one of the application scenarios of BigTable is oriented toward small files. A report on the Small File Problem published by Cloudera, a well-known Hadoop company, likewise points out that Hadoop has problems when handling massive numbers of small files.
Handling such small files causes serious problems for the performance and scalability of HDFS. First, a huge number of small files brings a huge amount of metadata. Because the metadata of every directory and file in HDFS is kept in the memory of the name node, a system holding a large number of small files inevitably loses storage efficiency and storage capacity. For example, if the system holds 10 million small files and each small file occupies one block, the NameNode needs roughly 3 GB of memory (on the order of a few hundred bytes of metadata per file and block), so the memory size of the NameNode severely restricts the growth of the cluster.

Second, accessing a large number of small files is far slower than accessing a few large files, because it means constantly jumping from one DataNode to another, which is an inefficient data access pattern.

Third, the number of map tasks used to access large files differs greatly from the number used to access small files. For example, a 1 GB file is divided into sixteen 64 MB blocks, whereas 10,000 files of 100 KB each (1 GB in total) each require their own map task, so the resulting MapReduce job may take tens or even hundreds of times longer than the job on a single 1 GB file. Although Hadoop offers mitigations such as JVM reuse, they do not solve these problems well.
Hadoop itself provides Hadoop Archive (HAR) for merging small files into large files. A HAR file works by building a layered file system on top of HDFS; HAR files are created with Hadoop's archive command, which actually runs a MapReduce job to pack the small files into the HAR file.
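For reference, such an archive is created with the standard hadoop archive command, for example:

    hadoop archive -archiveName files.har -p /user/hadoop dir1 dir2 /user/outputdir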
Summary of the Invention
The present invention provides a method for merging and storing small files online that is suitable for the MapReduce computation process.
First, when the first file is uploaded to a directory in Hadoop under which small files are stored, the system creates a large file (referred to as a block) with a size of 64 MB, writes the content of this small file starting at offset 0 of the block, sets the small-file counter written at the end of the block to 1, and records the file name of this small file, its offset within the block, and its size. When a further file is uploaded under this directory, its content is written into the block starting from the beginning of the current blank area; its file name, offset within the block, and size are written at the end of the blank area; and the small-file counter at the end of the block is updated. In other words, the contents of the small files are stored one after another from the beginning of the block, the retrieval information of the small files is stored one after another from the end of the block, and the small-file counter is kept up to date.
The storage layout is shown in Fig. 1.
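For concreteness, the dual-ended layout described above can be pictured as follows (a textual sketch of the idea behind Fig. 1; how the name, offset, size, and counter are encoded is not fixed by this description):

    beginning of block                                                end of block
    | file 0 | file 1 | file 2 |  ... blank ...  | entry 2 | entry 1 | entry 0 | count |
      contents grow forward -->                    <-- retrieval entries grow backward

    entry i = { file name, offset within the block, file size } of the i-th small file
    count   = number of small files currently stored in the block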
When MapReduce reads these small files, it first reads the small files' retrieval information recorded at the end of the block, then organizes the files into key-value pairs and processes them in map. An input class that reads the small files stored under this merge scheme therefore needs to be implemented.
The MapReduce framework in Hadoop relies on InputFormat to supply data and on OutputFormat to emit data; every MapReduce program performs its input and output through these classes. Hadoop provides a series of InputFormat and OutputFormat classes to ease development. For example, TextInputFormat is used to read plain-text files: the file is divided into lines ending with LF or CR, the key is the position of each line (its byte offset, of type LongWritable), and the value is the content of the line, of type Text. KeyValueTextInputFormat likewise reads text files: if a line contains a separator (a tab by default), it is split into two parts, the first being the key and the remainder the value; if there is no separator, the whole line is the key and the value is empty. SequenceFileInputFormat is used to read sequence files; a sequence file is a binary file in which Hadoop stores data in a user-defined format. It has two subclasses: SequenceFileAsBinaryInputFormat, which reads keys and values as BytesWritable, and SequenceFileAsTextInputFormat, which reads keys and values as Text.
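As a concrete illustration of this contract (a minimal example, not taken from the invention), a mapper whose input types match TextInputFormat looks like this:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Input types match TextInputFormat: the key is the byte offset of the
    // line within the file, the value is the line itself.
    public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Emit each line keyed by its content, carrying its offset.
            context.write(line, offset);
        }
    }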
In the present invention, a custom input class, SmallBulkInputFormat, is needed to read the small files out of a bulk file and hand each small file to the map operation as a single key-value pair (an approach that is very common in applications such as large-scale picture processing).
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the layout of small files within a block.
Detailed Description of the Embodiments
Step 1: improve the HDFS flow for reading and writing small files.
When Hadoop writes a small file, several large 64 MB block files are first generated in advance. After the NameServer receives a client's request to write a file, it selects a DataServer according to load balancing to receive the write request and sends this DataServer's information to the client; the client then calls the improved write function interface (implemented identically to the original write interface, only the function name differs). After the DataServer receives the write request, it first selects one of the preallocated 64 MB files, writes the content of the small file carried in the request into this large file, and records the retrieval information of this small file in the file.
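As an illustration of the DataServer-side write step, the following minimal sketch appends one small file to a preallocated 64 MB block, using a local RandomAccessFile as a stand-in for the server's storage. The fixed-width index encoding (64-byte name field, 8-byte offset, 4-byte size, 4-byte tail counter) is an assumption made for illustration, and error handling such as a full-block check is omitted.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class SmallFileBlockWriter {

        static final long BLOCK_SIZE = 64L * 1024 * 1024; // preallocated block size
        static final int NAME_LEN = 64;                   // assumed name field width
        static final int ENTRY_SIZE = NAME_LEN + 8 + 4;   // name + offset + size

        // Appends one small file to a preallocated 64 MB block (opened in "rw" mode).
        public static void append(RandomAccessFile block, String name, byte[] content)
                throws IOException {
            // Read the small-file counter from the last 4 bytes of the block.
            block.seek(BLOCK_SIZE - 4);
            int count = block.readInt();

            // The first blank data byte is the end of the previously written
            // file's content, or 0 when the block is still empty.
            long dataOff = 0;
            if (count > 0) {
                block.seek(BLOCK_SIZE - 4 - (long) count * ENTRY_SIZE + NAME_LEN);
                dataOff = block.readLong() + block.readInt();
            }

            // Write the content forward from the blank position.
            block.seek(dataOff);
            block.write(content);

            // Write the index entry backward from the tail: name, offset, size.
            byte[] nameField = Arrays.copyOf(name.getBytes(StandardCharsets.UTF_8), NAME_LEN);
            block.seek(BLOCK_SIZE - 4 - (long) (count + 1) * ENTRY_SIZE);
            block.write(nameField);
            block.writeLong(dataOff);
            block.writeInt(content.length);

            // Update the counter at the end of the block.
            block.seek(BLOCK_SIZE - 4);
            block.writeInt(count + 1);
        }
    }

A freshly preallocated, zero-filled block reads back a counter of 0, so the first upload lands at offset 0, matching the flow described above.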
Step 2: provide the new input class.
First, SmallBulkInputFormat is defined to inherit from FileInputFormat; its core code is given below.
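The original core listing is not reproduced in this text; the following is a minimal sketch of such an input class, assuming the same fixed-width index encoding as the write-side sketch in step 1 (a 4-byte counter in the last bytes of the block, preceded by 76-byte entries of name, offset, and size); the real class would parse whatever encoding the write path actually uses.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class SmallBulkInputFormat extends FileInputFormat<Text, BytesWritable> {

        static final int NAME_LEN = 64;                  // assumed name field width
        static final int ENTRY_SIZE = NAME_LEN + 8 + 4;  // name + offset + size

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // one 64 MB bulk file is one HDFS block; read it whole
        }

        @Override
        public RecordReader<Text, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new SmallBulkRecordReader();
        }

        static class SmallBulkRecordReader extends RecordReader<Text, BytesWritable> {
            private FSDataInputStream in;
            private long blockLen;
            private int fileCount;  // counter stored in the last 4 bytes of the block
            private int current = -1;
            private final Text key = new Text();
            private final BytesWritable value = new BytesWritable();

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context)
                    throws IOException {
                Path path = ((FileSplit) split).getPath();
                FileSystem fs = path.getFileSystem(context.getConfiguration());
                blockLen = fs.getFileStatus(path).getLen();
                in = fs.open(path);
                in.seek(blockLen - 4);
                fileCount = in.readInt();
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (++current >= fileCount) {
                    return false;
                }
                // Index entries are written backwards from the tail of the block.
                in.seek(blockLen - 4 - (long) (current + 1) * ENTRY_SIZE);
                byte[] name = new byte[NAME_LEN];
                in.readFully(name);
                long offset = in.readLong();
                int size = in.readInt();
                key.set(new String(name, StandardCharsets.UTF_8).trim());
                // Read the small file's content from the data area of the block,
                // so each small file becomes one key-value pair for map().
                byte[] content = new byte[size];
                in.seek(offset);
                in.readFully(content);
                value.set(content, 0, size);
                return true;
            }

            @Override public Text getCurrentKey() { return key; }
            @Override public BytesWritable getCurrentValue() { return value; }
            @Override public float getProgress() {
                return fileCount == 0 ? 1.0f : (float) (current + 1) / fileCount;
            }
            @Override public void close() throws IOException {
                if (in != null) { in.close(); }
            }
        }
    }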
Step 3: the developer uses the new input class.
The developer writes files through the new write interface function and sets the input format of the Job to the SmallBulkInputFormat class.
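A sketch of this step under the assumptions above (SmallBulkInputFormat as sketched in step 2; the mapper here is an illustrative example that emits each small file's name and size, not part of the invention):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SmallBulkJob {

        // Each map() call receives one whole small file: its name as the key
        // and its content as the value.
        public static class SizeMapper
                extends Mapper<Text, BytesWritable, Text, IntWritable> {
            @Override
            protected void map(Text name, BytesWritable content, Context context)
                    throws IOException, InterruptedException {
                context.write(name, new IntWritable(content.getLength()));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "small-bulk-demo");
            job.setJarByClass(SmallBulkJob.class);
            job.setInputFormatClass(SmallBulkInputFormat.class); // the new input class
            job.setMapperClass(SizeMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }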