CN103500089A - Small file storage system suitable for Mapreduce calculation model - Google Patents

Small file storage system suitable for Mapreduce calculation model

Publication number
CN103500089A
Authority
CN
China
Prior art keywords
small documents
small
file
file storage
hadoop
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310430402.0A
Other languages
Chinese (zh)
Inventor
王雷
王鲁俊
龙翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2013-09-18
Filing date
2013-09-18
Publication date
2014-01-08
Application filed by Beihang University
Priority to CN201310430402.0A
Publication of CN103500089A
Pending

Abstract

Translated from Chinese

The present invention provides a way of merging small files into large files online in HDFS on Hadoop, as shown in Fig. 1, which reduces the number of map tasks started in MapReduce. It mainly provides a new interface for uploading small files together with a corresponding input format class. Using the upload interface and the input class provided by the invention, small files can be stored and processed online in this way.

Description

A small file storage system suitable for the MapReduce computation model
Technical field
The present invention relates to the fields of MapReduce and small file storage, and specifically to a small file storage system suited to the MapReduce computation model.
Background
Hadoop is a distributed framework developed by Doug Cutting and his development team at Yahoo. Guided by the papers on GFS and MapReduce published by Google, the team used Java to build an implementation similar to Google's MapReduce, namely Hadoop, together with a distributed file system, HDFS.
The small file problem has gradually attracted attention in both academia and industry. The well-known social networking site Facebook has stored 260 billion pictures with a total capacity of more than 20 PB, and the overwhelming majority of these files are smaller than 64 MB. Most data accessed on the Internet consists of small files with high access frequency.
Sean Quinlan, technical lead for GFS, mentioned in an interview about GFS that one of the application scenarios of BigTable is oriented towards small files. The report on the Small File Problem published by Cloudera, a well-known Hadoop company, also points out that Hadoop has problems handling massive numbers of small files.
Handling such small files causes serious problems for the performance and scalability of HDFS. First, massive numbers of small files produce a large amount of metadata. Because the metadata of every directory and file in HDFS is held in the memory of the name node (NameNode), a large number of small files inevitably reduces the storage efficiency and capacity of the whole storage system. For example, if the system holds 10 million small files and each small file occupies one block, the NameNode needs about 3 GB of memory, so NameNode memory seriously restricts cluster expansion. Second, accessing a large number of small files is far slower than accessing a few large files, because the reader must constantly jump from one DataNode to another, which is an inefficient data access pattern. Third, the number of map tasks differs greatly between accessing large files and accessing small files. For example, a 1 GB file is divided into 16 blocks of 64 MB, whereas 10,000 files of 100 KB each (about 1 GB in total) each need their own map task, so the final MapReduce job may take hundreds of times longer than the job on the single 1 GB file. Hadoop offers mitigations such as JVM reuse, but these still do not solve the problems well.
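The arithmetic behind the first and third problems can be checked with a short sketch. The figure of roughly 150 bytes of NameNode memory per file or block object is an assumed rule of thumb that does not appear in this document, and the class and method names are illustrative only.

```java
// Back-of-the-envelope cost of small files, under stated assumptions.
class SmallFileCost {
    static final long MB = 1024L * 1024L;

    // One map task per input split; a file smaller than a block still yields one split.
    static long mapTasks(long fileCount, long fileSizeBytes, long blockSizeBytes) {
        long splitsPerFile = Math.max(1, (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes);
        return fileCount * splitsPerFile;
    }

    // Assumption: each small file costs one file object plus one block object in memory.
    static long nameNodeBytes(long fileCount, long bytesPerObject) {
        return fileCount * 2 * bytesPerObject;
    }

    public static void main(String[] args) {
        System.out.println(mapTasks(1, 1024 * MB, 64 * MB));        // one 1 GB file: 16
        System.out.println(mapTasks(10_000, 100 * 1024, 64 * MB));  // 10,000 files of 100 KB: 10000
        System.out.println(nameNodeBytes(10_000_000L, 150));        // 3000000000 bytes, about 3 GB
    }
}
```

Under these assumptions the 10 million file example lands at about 3 GB of NameNode memory, and the small-file job launches 625 times as many map tasks as the single large file.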
Hadoop itself provides Hadoop Archive (HAR) for merging small files into large files. A HAR file works by building a layered file system on top of HDFS. HAR files are created with Hadoop's archive command, which actually runs a MapReduce job to pack the small files into a HAR file.
Summary of the invention
The present invention provides a method for merging small files online for storage, together with support that makes the merged files usable in MapReduce computation.
First, when the first file is uploaded to a Hadoop directory that stores small files, the system creates a large file of 64 MB (referred to here as a block). The content of the small file is written starting at file offset 0. At the end of the block, the system writes the count of small files currently stored in the block, which is 1, followed by the file name of the small file, its offset within the block, and its size. When further files are uploaded to this directory, the content of each small file is written at the start of the block's current free space; its file name, offset within the block, and size are written at the end of the free space; and the small file count at the end of the block is updated. In other words, small file contents are stored sequentially from the beginning of the block, the index information for the small files is stored sequentially from the end of the block, and the small file count is updated on each upload.
The storage layout is shown in Fig. 1.
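The two-ended layout described above can be sketched in memory as follows. The source does not specify the byte format of the index, so the trailer layout here is an assumption made for illustration: the last 4 bytes hold the file count, and each index entry is a fixed 72 bytes (a 64-byte zero-padded UTF-8 name, a 4-byte offset, a 4-byte size). The class name SmallFileBlock and its methods are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch of one merged block: contents grow from the front, index entries from the back.
// Assumed layout: count in the last 4 bytes; entry i starts at blockSize - 4 - (i + 1) * 72.
class SmallFileBlock {
    static final int NAME_BYTES = 64;
    static final int ENTRY_BYTES = NAME_BYTES + 4 + 4;

    private final ByteBuffer block;
    private int dataEnd = 0;  // start of the free space
    private int count = 0;    // number of small files stored so far

    SmallFileBlock(int blockSize) {
        block = ByteBuffer.allocate(blockSize);
        block.putInt(blockSize - 4, 0);
    }

    // Start position of index entry i, counted back from the end of the block.
    private int entryPos(int i) {
        return block.capacity() - 4 - (i + 1) * ENTRY_BYTES;
    }

    // Appends one small file; returns false if the free space is too small.
    boolean append(String name, byte[] content) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        if (nameBytes.length > NAME_BYTES) {
            throw new IllegalArgumentException("file name too long");
        }
        int newEntry = entryPos(count);
        if (dataEnd + content.length > newEntry) {
            return false;
        }
        // Content goes at the start of the free space.
        block.position(dataEnd);
        block.put(content);
        // Index entry goes at the end of the free space.
        block.position(newEntry);
        block.put(nameBytes);
        block.put(new byte[NAME_BYTES - nameBytes.length]);
        block.putInt(dataEnd);
        block.putInt(content.length);
        dataEnd += content.length;
        count++;
        // Update the small file count at the end of the block.
        block.putInt(block.capacity() - 4, count);
        return true;
    }

    int count() { return count; }
    int dataEnd() { return dataEnd; }
    byte[] bytes() { return block.array(); }
}
```

Calling `new SmallFileBlock(64 * 1024 * 1024)` mirrors the 64 MB block of the text; a much smaller size is convenient for experiments. A fixed-size entry was chosen here only because it makes the position of entry i computable without scanning.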
When MapReduce reads these small files, it first analyzes the file header information for these small files, then organizes them into key-value pairs that are processed in map. An input class therefore has to be implemented to read small files stored under this merging scheme.
The MapReduce framework in Hadoop relies on InputFormat to supply data and on OutputFormat to write output; every MapReduce program performs its input and output through these classes. Hadoop provides a series of InputFormat and OutputFormat classes to ease development. TextInputFormat reads plain text files: the file is split into lines terminated by LF or CR, the key is the position of each line (its offset, of type LongWritable), and the value is the content of the line, of type Text. KeyValueTextInputFormat also reads text files: if a line contains the separator (a tab by default), it is split into two parts, the first being the key and the rest the value; if there is no separator, the whole line is the key and the value is empty. SequenceFileInputFormat reads sequence files, the binary file format Hadoop uses to store data in a user-defined format. It has two subclasses: SequenceFileAsBinaryInputFormat reads keys and values as BytesWritable, and SequenceFileAsTextInputFormat reads keys and values as Text.
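The record shapes produced by the two text formats can be imitated in plain Java. This is only an illustration of the key and value semantics described above: it does not use Hadoop classes (long and String stand in for LongWritable and Text), it handles LF line endings only, and the class name RecordSketch is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java imitation of the records two text input formats hand to map.
class RecordSketch {
    // TextInputFormat style: key = byte offset of the line, value = line content.
    static List<Map.Entry<Long, String>> offsetRecords(String text) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            records.add(new SimpleEntry<Long, String>(offset, line));
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the LF
        }
        return records;
    }

    // KeyValueTextInputFormat style: split at the first separator (a tab by default).
    static Map.Entry<String, String> keyValue(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // No separator: the whole line is the key and the value is empty.
            return new SimpleEntry<String, String>(line, "");
        }
        return new SimpleEntry<String, String>(line.substring(0, pos), line.substring(pos + 1));
    }
}
```

For the input "ab\ncde\nf", offsetRecords yields the keys 0, 3, and 7; keyValue("k\tv1\tv2", '\t') yields the key "k" and the value "v1\tv2", because only the first separator splits the line.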
In the present invention, a custom input class, SmallBulkInputFormat, is needed to read small files out of the large block file and to run the map operation with each small file as one key-value pair (a pattern that is very common in fields such as bulk image processing).
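The parsing step that such an input class would need can be sketched without Hadoop. The code below is not the patent's SmallBulkInputFormat; it only shows how a block could be turned into (file name, content) pairs under an assumed index format that the source does not specify: the last 4 bytes of the block hold the file count, and entry i is a 72-byte record (64-byte zero-padded UTF-8 name, 4-byte offset, 4-byte size) starting at blockSize - 4 - (i + 1) * 72. The class name SmallFileBlockReader is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Reads every small file out of one merged block, in upload order.
class SmallFileBlockReader {
    static final int NAME_BYTES = 64;
    static final int ENTRY_BYTES = NAME_BYTES + 4 + 4;

    // Returns file name -> content for every small file indexed in the block.
    static Map<String, byte[]> readAll(byte[] blockBytes) {
        ByteBuffer block = ByteBuffer.wrap(blockBytes);
        int count = block.getInt(blockBytes.length - 4);
        Map<String, byte[]> files = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            int entry = blockBytes.length - 4 - (i + 1) * ENTRY_BYTES;
            // The name field is zero-padded; stop at the first zero byte.
            int nameLen = 0;
            while (nameLen < NAME_BYTES && blockBytes[entry + nameLen] != 0) {
                nameLen++;
            }
            String name = new String(blockBytes, entry, nameLen, StandardCharsets.UTF_8);
            int offset = block.getInt(entry + NAME_BYTES);
            int size = block.getInt(entry + NAME_BYTES + 4);
            // Each small file becomes one key-value pair: name -> content.
            files.put(name, Arrays.copyOfRange(blockBytes, offset, offset + size));
        }
        return files;
    }
}
```

In a real Hadoop job this logic would presumably live in a RecordReader returned by the input format, emitting one record per small file instead of building a map in memory.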
Brief description of the drawings
Fig. 1 is a schematic diagram of how small files are laid out within a block.
Detailed description
Step 1: improve the flow by which HDFS reads and writes small files.
When Hadoop writes small files, several large 64 MB block files are generated in advance. After the NameServer receives a client's request to write a file, it selects a DataServer according to load balancing to receive the write request and sends the information about this DataServer to the client. The client then calls the improved write function interface (implemented in the same way as the original write function interface, only under a different function name). After the DataServer receives the write request, it first selects one of the preallocated 64 MB files, writes the content of the small file in the request into this large file, and records the index information for the small file in the same file.
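The source says only that the NameServer picks a DataServer "according to load balancing" and does not name a policy. As one possible reading, the sketch below picks the server with the fewest bytes stored; the policy, the WritePlacement class, and the DataServer fields are all assumptions made for illustration.

```java
import java.util.Comparator;
import java.util.List;

// Sketch of the placement decision made when a write request arrives.
class WritePlacement {
    // Hypothetical view of a DataServer: its name and the bytes it already stores.
    static final class DataServer {
        final String name;
        final long usedBytes;

        DataServer(String name, long usedBytes) {
            this.name = name;
            this.usedBytes = usedBytes;
        }
    }

    // Assumed policy: choose the DataServer with the fewest bytes stored.
    static DataServer choose(List<DataServer> servers) {
        return servers.stream()
                .min(Comparator.comparingLong((DataServer s) -> s.usedBytes))
                .orElseThrow(() -> new IllegalStateException("no DataServer available"));
    }
}
```

Other policies, such as round-robin or choosing by free space in the preallocated blocks, would fit the description equally well.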
Step 2: provide the new input class.
First, SmallBulkInputFormat is defined as a subclass of FileInputFormat. The core code is as follows:
[Figure: core code of SmallBulkInputFormat (image 2013104304020100002DEST_PATH_IMAGE001)]
Step 3: the developer uses the new input class.
The developer writes files with the new write interface function and sets the Job's input format to the SmallBulkInputFormat class.

Claims (2)

1. Online HDFS small file storage, characterized in that small files are stored online, rather than in offline archive files as with HAR. The invention provides a new interface function for uploading small files, used for online small file storage.
2. A new input format, SmallBulkInputFormat, characterized in that, by using this input format class, map operations can be performed on the small files created through the new small file upload interface, with each small file treated as one key-value pair.
CN201310430402.0A | 2013-09-18 | 2013-09-18 | Small file storage system suitable for Mapreduce calculation model | Pending | CN103500089A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201310430402.0A, CN103500089A (en) | 2013-09-18 | 2013-09-18 | Small file storage system suitable for Mapreduce calculation model

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201310430402.0A, CN103500089A (en) | 2013-09-18 | 2013-09-18 | Small file storage system suitable for Mapreduce calculation model

Publications (1)

Publication Number | Publication Date
CN103500089A (en) | 2014-01-08

Family

Family ID: 49865304

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN201310430402.0A | Pending | CN103500089A (en) | 2013-09-18 | 2013-09-18 | Small file storage system suitable for Mapreduce calculation model

Country Status (1)

Country | Link
CN (1) | CN103500089A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN102222092A (en) * | 2011-06-03 | 2011-10-19 | 复旦大学 | Massive high-dimension data clustering method for MapReduce platform
CN102902716A (en) * | 2012-08-27 | 2013-01-30 | 苏州两江科技有限公司 | Storage system based on Hadoop distributed computing platform

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
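张春明 et al.: "一种Hadoop小文件存储和读取的方法" (A method for storing and reading small files in Hadoop), 《计算机应用与软件》 (Computer Applications and Software), vol. 29, no. 11, 15 November 2012 (2012-11-15) *
江柳: "HDFS下小文件存储优化相关技术研究" (Research on technologies for optimizing small file storage in HDFS), 《中国优秀硕士学位论文全文数据库信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology), no. 9, 15 September 2011 (2011-09-15) *
洪旭升 et al.: "基于MapFile的HDFS小文件存储效率问题" (HDFS small file storage efficiency based on MapFile), 《计算机系统应用》 (Computer Systems & Applications), vol. 21, no. 11, 15 November 2012 (2012-11-15) *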

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN103970874A (en) * | 2014-05-14 | 2014-08-06 | 浪潮(北京)电子信息产业有限公司 | Method and device for processing Hadoop files
CN104331428A (en) * | 2014-10-20 | 2015-02-04 | 暨南大学 | Storage and access method of small files and large files
CN104331428B (en) * | 2014-10-20 | 2017-07-04 | 暨南大学 | The storage of a kind of small documents and big file and access method
CN105139281A (en) * | 2015-08-20 | 2015-12-09 | 北京中电普华信息技术有限公司 | Method and system for processing big data of electric power marketing
CN106708606B (en) * | 2015-11-17 | 2020-07-07 | 阿里巴巴集团控股有限公司 | Data processing method and device based on MapReduce
CN106708606A (en) * | 2015-11-17 | 2017-05-24 | 阿里巴巴集团控股有限公司 | MapReduce based data processing method and MapReduce based data processing device
WO2017084509A1 (en) * | 2015-11-17 | 2017-05-26 | 阿里巴巴集团控股有限公司 | Mapreduce-based data processing method and device
CN106855872A (en) * | 2015-12-08 | 2017-06-16 | 山东商务职业学院 | The method for quickly retrieving of the mass picture based on Hadoop platform
CN106855861A (en) * | 2015-12-09 | 2017-06-16 | 北京金山安全软件有限公司 | File merging method and device and electronic equipment
US11301154B2 (en) | 2016-02-06 | 2022-04-12 | Huawei Technologies Co., Ltd. | Distributed storage method and device
WO2017133216A1 (en) * | 2016-02-06 | 2017-08-10 | 华为技术有限公司 | Distributed storage method and device
US12260102B2 (en) | 2016-02-06 | 2025-03-25 | Huawei Technologies Co., Ltd. | Distributed storage method and device
CN107045422A (en) * | 2016-02-06 | 2017-08-15 | 华为技术有限公司 | Distributed storage method and equipment
US11809726B2 (en) | 2016-02-06 | 2023-11-07 | Huawei Technologies Co., Ltd. | Distributed storage method and device
CN107948334A (en) * | 2018-01-09 | 2018-04-20 | 无锡华云数据技术服务有限公司 | Data processing method based on distributed memory system
CN110018997A (en) * | 2019-03-08 | 2019-07-16 | 中国农业科学院农业信息研究所 | A kind of mass small documents storage optimization method based on HDFS
CN110018997B (en) * | 2019-03-08 | 2021-07-23 | 中国农业科学院农业信息研究所 | An optimization method for massive small file storage based on HDFS
CN110321329A (en) * | 2019-06-18 | 2019-10-11 | 中盈优创资讯科技有限公司 | Data processing method and device based on big data
CN110457265A (en) * | 2019-08-20 | 2019-11-15 | 上海商汤智能科技有限公司 | Data processing method, device and storage medium
CN111221472B (en) * | 2019-12-26 | 2023-08-25 | 天津中科曙光存储科技有限公司 | Multi-block allocation strategy optimization method and system for disk space allocation
CN111221472A (en) * | 2019-12-26 | 2020-06-02 | 天津中科曙光存储科技有限公司 | Multi-block allocation strategy optimization method and system for disk space allocation
CN113568877A (en) * | 2020-04-28 | 2021-10-29 | 杭州海康威视数字技术股份有限公司 | File merging method and device, electronic equipment and storage medium
CN115982232A (en) * | 2022-12-13 | 2023-04-18 | 国网湖北省电力有限公司电力科学研究院 | Hadoop-based power grid data processing method and system
CN115982232B (en) * | 2022-12-13 | 2025-08-05 | 国网湖北省电力有限公司电力科学研究院 | A power grid data processing method and system based on Hadoop

Legal Events

Code | Title
C06 | Publication
PB01 | Publication
C10 | Entry into substantive examination
SE01 | Entry into force of request for substantive examination
WD01 | Invention patent application deemed withdrawn after publication
WD01 | Invention patent application deemed withdrawn after publication

Application publication date: 2014-01-08

