Storage system based on the Hadoop distributed computing platform

Technical field
The invention belongs to the technical field of file systems for the Hadoop distributed computing platform, and specifically relates to a storage system based on the Hadoop distributed computing platform.
Background art
The Hadoop Distributed File System, abbreviated HDFS, is a distributed file system. HDFS is highly fault-tolerant and is designed to be deployed on inexpensive (low-cost) hardware. It provides high-throughput access to application data and is suitable for applications with very large data sets. HDFS relaxes certain POSIX requirements so as to enable streaming access to the data in the file system. HDFS was originally built as infrastructure for the open-source Apache Nutch project; HDFS is part of the Hadoop project, and Hadoop was originally part of Lucene.
As the volume of data that enterprises must process keeps growing, the MapReduce paradigm has attracted more and more attention. Hadoop is an open-source implementation of MapReduce and, thanks to its good scalability and fault tolerance, has been applied more and more widely. Although the value of Hadoop as a basic data processing platform is widely recognized, many problems remain, and the small file problem of HDFS is one of them. A small file is a file whose size is smaller than the block size of HDFS. Such files cause serious problems for the scalability and performance of Hadoop. First, in HDFS every block, file and directory is stored in the namenode's memory as an object, and each object occupies about 150 bytes. If there are 10,000,000 small files and each file occupies one block, the namenode needs about 2 GB of memory (storing two parts); to store 100,000,000 files, the namenode needs about 20 GB. The memory of the namenode thus severely restricts the expansion of the cluster. Second, accessing a large number of small files is far slower than accessing a few large files. HDFS was originally developed for streaming access to large files; accessing a large number of small files requires constantly jumping from one datanode to another, which seriously harms performance. Finally, processing a large number of small files is far slower than processing large files of the same total size, because each small file occupies a task slot and task startup consumes a great deal of time; indeed, most of the time may be spent starting and releasing tasks. Solving the small file problem of HDFS helps to broaden the range of applications of HDFS and to strengthen its scalability and performance. Hence the present invention.
Summary of the invention
The object of the invention is to provide a storage system based on the Hadoop distributed computing platform, which solves prior-art problems such as the marked performance degradation of the Hadoop distributed computing platform when the number of small files is too large.
To solve these problems of the prior art, the technical scheme provided by the invention is as follows:
A storage system based on the Hadoop distributed computing platform comprises an HDFS general file processing module, and is characterized in that the system further comprises a file type judging module, a small file processing module and a timing module. The file type judging module is used to judge whether a file uploaded by a user is a small file: when the size of the uploaded file is smaller than the block size of the HDFS file system, the file type judging module judges the file to be a small file; otherwise it judges the file to be a large file.
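The file type judgment can be sketched as follows. This is a minimal Python illustration of the rule only (the actual system is built on Hadoop's Java stack); the 128 MB block size is an assumed common default, since the patent only requires comparison against the HDFS block size.

```python
# Assumed HDFS block size in bytes; the real value is a cluster setting.
HDFS_BLOCK_SIZE = 128 * 1024 * 1024

def classify_file(size_bytes: int) -> str:
    """Return 'small' for files below the HDFS block size, else 'large'."""
    return "small" if size_bytes < HDFS_BLOCK_SIZE else "large"
```

A file exactly at the block size is treated as large, matching the definition "smaller than the block size".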
The timing module, driven by a timer, counts the total size of the small file sequence in the small file processing module when the predetermined period arrives, and judges whether the size of the small file sequence is greater than the block size of the HDFS file system.
The small file processing module is used to store each small file as a Record in the SequenceFile class, forming a small file queue. When the timing module judges that the size of the small file sequence is greater than the HDFS block size, the file name of each small file is used as the Key and the file content as the Value, the whole small file queue is written into a MapFile in one operation, and the processed small files are deleted at the same time.
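The accumulate-and-flush behaviour of the small file processing module can be sketched in plain Python as follows. This shows the logic only; the real system uses Hadoop's Java SequenceFile and MapFile classes, and the names SmallFileQueue and flush_if_full are hypothetical.

```python
class SmallFileQueue:
    """Accumulates (file name, content) records and flushes them in one
    batch once their total size exceeds the HDFS block size, mimicking
    the SequenceFile-to-MapFile write described above."""

    def __init__(self, block_size: int):
        self.block_size = block_size
        self.records = []        # (key = file name, value = file content)
        self.total_size = 0

    def add(self, name: str, content: bytes) -> None:
        self.records.append((name, content))
        self.total_size += len(content)

    def flush_if_full(self, store: dict) -> bool:
        """If the queue exceeds the block size, write every record into
        `store` (standing in for the MapFile) and empty the queue."""
        if self.total_size <= self.block_size:
            return False
        for name, content in self.records:
            store[name] = content   # Key = file name, Value = content
        self.records.clear()        # delete the processed small files
        self.total_size = 0
        return True
```

A dict stands in for the MapFile here purely so the batching behaviour can be exercised without a Hadoop cluster.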
Preferably, the system further comprises a small file status database: a MySQL database stores the file name, file size, upload date and storage path of each small file, and a FileList object is used to maintain the small file queue. After the small file queue is written into the MapFile, the FileList object and the MySQL database are updated before the processed small files are deleted.
Preferably, the system is further provided with a file information table, which contains a small file status field flag: flag=0 indicates that the small file is pending; flag=1 indicates that the small file has been processed and resides in a MapFile in HDFS; flag=2 indicates that the small file has been regenerated and written to the local disk.
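The three flag values form a small state machine; a sketch of them as named constants (a hypothetical convenience, not part of the patented scheme) makes the later flow descriptions easier to follow:

```python
from enum import IntEnum

class SmallFileState(IntEnum):
    """The three values of the flag field in the file information table."""
    PENDING = 0      # waiting under the local upload directory
    PROCESSED = 1    # written into a MapFile in HDFS
    REGENERATED = 2  # read back out of HDFS onto the local disk
```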
Preferably, the system builds a MySQL index and a MapFile index, the MapFile index being built on the file name field of the file information table.
Another object of the invention is to provide a file upload and storage method for the storage system based on the Hadoop distributed computing platform, characterized in that the method comprises the following steps:
(1) A user uploads a file to a server equipped with the storage system based on the Hadoop distributed computing platform;
(2) The file type judging module judges whether the uploaded file is a small file. A large file is uploaded and stored directly in HDFS; a small file is stored as a Record in the SequenceFile class to form the small file queue, and the timing module is started;
(3) When the predetermined period arrives, the total size of the small file sequence in the small file processing module is counted and compared with the block size of the HDFS file system. When the timing module judges that the size of the small file sequence is greater than the HDFS block size, the file name of each small file is used as the Key and the file content as the Value, the whole small file queue is written into a MapFile in one operation, and the processed small files are deleted at the same time.
A further object of the invention is to provide a file download and read method for the storage system based on the Hadoop distributed computing platform, characterized in that the method comprises the following steps:
(1) A user sends a file download request to a server equipped with the storage system based on the Hadoop distributed computing platform;
(2) The storage system based on the Hadoop distributed computing platform judges whether the file is stored on the local disk;
(3) If the file is present on the local disk, it is accessed and downloaded directly through the download component;
(4) If the file is not present on the local disk, the storage system based on the Hadoop distributed computing platform uses the MySQL index and the MapFile index to first read the small file from the MapFile to the local disk, and then accesses and downloads it through the download component.
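The read path of steps (2) through (4) can be sketched as follows. Plain dicts stand in for the filetb status table, the local disk and the MapFile; the function name read_small_file is a hypothetical label, and the flag values follow the convention given above (0 pending, 1 in HDFS, 2 regenerated).

```python
def read_small_file(name: str, filetb: dict, local_disk: dict,
                    mapfile: dict) -> bytes:
    """Download path sketch: if the file is only in HDFS (flag == 1),
    restore it from the MapFile to the local disk and mark it
    regenerated (flag = 2) before serving it."""
    if filetb[name] != 1:
        return local_disk[name]          # already on the local disk
    local_disk[name] = mapfile[name]     # restore from the HDFS MapFile
    filetb[name] = 2                     # update the status database
    return local_disk[name]
```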
Compared with the schemes of the prior art, the advantages of the invention are as follows:
The Hadoop-based small file storage method of the technical scheme of the invention solves the small file storage problem of HDFS. In other words, the scheme uses SequenceFile to handle small file reads and writes: each small file is stored in a SequenceFile as a Record, with the file name as the Key and the file content as the Value. Both in theory and in extensive practice, this is currently the best route to handling small files in HDFS.
Description of the drawings
The invention is further described below with reference to the drawings and embodiments:
Fig. 1 is a schematic flow chart of file upload in the storage system based on the Hadoop distributed computing platform.
Fig. 2 is a schematic flow chart of file download in the storage system based on the Hadoop distributed computing platform.
Embodiment
The above scheme is further described below with reference to specific embodiments. It should be understood that these embodiments are intended to illustrate the invention, not to limit its scope. The implementation conditions used in the embodiments may be further adjusted according to the conditions of a specific manufacturer; unspecified implementation conditions are generally those of routine experiments.
Embodiment
As shown in Fig. 1, the present embodiment adds a small file processing module, a file type judging module and a timing module on top of the original HDFS. The file type judging module is used to judge whether a file uploaded by a user is a small file: when the size of the uploaded file is smaller than the block size of the HDFS file system, the module judges the file to be a small file; otherwise it judges the file to be a large file. The timing module, driven by a timer, counts the total size of the small file sequence in the small file processing module when the predetermined period arrives, and judges whether that size is greater than the HDFS block size. The small file processing module stores each small file as a Record in the SequenceFile class, forming a small file queue; when the timing module judges that the size of the small file sequence is greater than the HDFS block size, the file name of each small file is used as the Key and the file content as the Value, the whole queue is written into a MapFile in one operation, and the processed small files are deleted at the same time.
During a concrete file upload, as shown in Fig. 1, the operation flow is as follows:
1. When a user uploads a file, judge whether the file is a small file; if so, hand it to the small file processing module; otherwise, hand it to the general file processing module.
2. In the small file module, start a timed task whose main function is: when the total size of the files in the module exceeds the HDFS block size, write these small files into HDFS in one operation through the SequenceFile component, with the file names as keys and the corresponding file contents as values.
3. At the same time, delete the processed files and write the result into the database.
4. When the user performs a read operation, the file can be read according to the result flag in the database.
The technical scheme of the invention reduces the number of small files in HDFS and effectively improves the file-reading performance of HDFS.
When a small file is uploaded to the server through the upload component, a FileList object is used at the same time to maintain the small file queue and to record the total file size under the upload directory; the file name list, together with the FileList object, is then persisted to the local disk through object serialization. Meanwhile, the basic information of the small file, such as its file name, file size, upload date and storage path, is recorded in the MySQL database. A small file has three states, namely pending (under the local upload directory), processed (in HDFS) and regenerated (under the local download directory), corresponding to the values 0, 1 and 2 of the flag field of the file information table filetb. Under the web mode, a timed task is specified with Timer and TimerTask; every five minutes this task reads the FileList object into memory and, by checking the total file size, decides whether to write the small files under the upload directory into HDFS. If the total file size is greater than the HDFS block size, the MapFile component writes these small files into HDFS in one operation, with the file names as keys and the corresponding file contents as values; the FileList object and the MySQL database are updated at the same time, and finally these small files are deleted. What the timed task mainly executes here is the MapFileWriter method in the MapFileTools class.
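The FileList bookkeeping and its persistence by object serialization can be sketched as follows. This is a Python illustration only: the embodiment uses Java's Timer/TimerTask and Java object serialization, for which pickle stands in here, and the method names add, persist and restore are hypothetical.

```python
import pickle

class FileList:
    """Maintains the small file queue: the file names and the total
    file size under the upload directory."""

    def __init__(self):
        self.names = []
        self.total_size = 0

    def add(self, name: str, size: int) -> None:
        self.names.append(name)
        self.total_size += size

    def should_flush(self, block_size: int) -> bool:
        """The check the five-minute timed task performs: flush the
        upload directory into HDFS once it exceeds the block size."""
        return self.total_size > block_size

def persist(filelist: FileList) -> bytes:
    """Serialize the FileList (pickle standing in for Java
    object serialization) so it can be written to the local disk."""
    return pickle.dumps(filelist)

def restore(blob: bytes) -> FileList:
    """Read the persisted FileList back into memory."""
    return pickle.loads(blob)
```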
When a small file is to be downloaded, the flag field of the filetb table is first checked to judge whether the small file is on the local disk. If the small file is not on the local disk, it is first read from the MapFile in HDFS to the local disk and the database is updated; the download component is then used to download the small file.
In order to read small files at random efficiently, the present embodiment adopts a two-level index: the first level is the MySQL index, and the second level is the index of the MapFile. The MapFile index is built on the file name field of filetb, so that the MapFile containing a given small file can be located quickly. A MapFile comprises two files: a data file and an index file. A MapFile can look up the value corresponding to a single key (the small file name). When a lookup is executed, MapFile.Reader() reads the index into memory and then performs a simple binary search to locate the data: it finds the last index key that is not greater than the key being sought, and then scans forward in the data file. Hadoop also provides a very effective optimization: when reading the index file, it can skip a number of index keys between the keys it reads, which effectively reduces the size of the index that is read into memory. The number of keys to skip is set by io.map.index.skip.
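The MapFile lookup described above can be sketched as follows: a sparse in-memory index (thinned by a skip parameter, in the spirit of io.map.index.skip), a binary search for the last index key not greater than the target, then a forward scan in the data. This is a Python illustration of the principle, not Hadoop's actual implementation; sorted parallel lists stand in for the data file.

```python
from bisect import bisect_right

def build_sparse_index(sorted_keys: list, skip: int = 0) -> list:
    """Keep every (skip+1)-th key with its position; a larger skip
    yields a smaller in-memory index, as with io.map.index.skip."""
    return [(k, i) for i, k in enumerate(sorted_keys)
            if i % (skip + 1) == 0]

def lookup(key: str, sorted_keys: list, values: list, index: list):
    """MapFile-style lookup: binary-search the sparse index for the
    last entry not greater than `key`, then scan forward in the data."""
    index_keys = [k for k, _ in index]
    pos = bisect_right(index_keys, key) - 1
    if pos < 0:
        return None                      # key precedes the whole index
    _, start = index[pos]
    for i in range(start, len(sorted_keys)):
        if sorted_keys[i] == key:
            return values[i]
        if sorted_keys[i] > key:
            break                        # passed where it would be
    return None
```

With skip=1 the index holds half the keys, yet every key in the data remains reachable because the forward scan covers the gap.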
Several Java components implement file upload: 1. SmartUpload, the most widely used component, which is no longer updated and supports both upload and download; 2. FileUpload, the file upload component implemented by Apache, whose functionality is complete; 3. J2KUpload, the file upload component implemented by java2000, which works entirely in memory and is suitable for multiple small files of no more than 10 MB. The present invention mainly adopts the second component: local files are uploaded to the server by copying, multi-file upload is supported, and the size and type of uploaded files can be configured.
Downloading is relatively simple: it suffices to provide the download address of the file. The storage path of a file is divided into a physical path and a virtual path. The physical path is the location where the file is stored on the server's hard disk; the virtual path is the location where the file is stored in HDFS. The process of converting a virtual path into a physical path has been explained above and is not repeated here.
The above examples merely illustrate the technical concept and features of the invention; their purpose is to enable those familiar with the art to understand the content of the invention and to implement it accordingly, and they cannot limit the scope of protection of the invention. All equivalent transformations or modifications made according to the spirit and essence of the invention shall fall within the scope of protection of the invention.