CN107402924A - Implementation method and device for MR file application in HDFS - Google Patents

Info

Publication number
CN107402924A
Authority
CN
China
Prior art keywords
file
index
block
data
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201610333313.8A
Other languages
Chinese (zh)
Inventor
刘哲
胡伦良
张海斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Putian Information Technology Co Ltd
Original Assignee
Putian Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Putian Information Technology Co Ltd
Priority to CN201610333313.8A
Publication of CN107402924A

Abstract

This application provides a method and device for applying MR files in HDFS. The MR file server does not upload MR files to HDFS (the HDFS client) one by one. Instead, it aggregates all MR files from the same period into a single summary file (forming one large data file) and uploads that file to HDFS (the HDFS client). This matches the properties of HDFS itself and preserves the performance and scalability of Hadoop.

Description

Implementation method and device for MR file application in HDFS

Technical field

The present application relates to data communication technology, and in particular to a method and device for applying measurement report (MR) files in the Hadoop Distributed File System (HDFS).

Background

A user configures measurement items as required: some are configured to be reported as statistical values, and the others as sample values. For measurement items reported as statistical values, the eNodeB collects samples of the relevant measurement data, computes statistics, generates an MR statistics file, and periodically uploads the file to the MR file server. For measurement items reported as sample values, the eNodeB collects and organizes the measurement data into an MR sample file, which it periodically uploads to the MR file server. MR statistics files and MR sample files are collectively called MR files. Finally, the MR file server uploads the MR files to HDFS, so that analysis of the MR data can support quality evaluation and coverage analysis of the whole network or part of it, as well as network optimization and monitoring.

HDFS is mainly used for analyzing large data files. Its defining feature is that it splits one very large file into multiple small files and distributes them across many low-specification machines for storage and analysis. Here, a small file produced by the split is a file smaller than the HDFS block size (64 MB by default).

However, an MR file uploaded to HDFS by the MR file server is usually smaller than 1 MB, far below the HDFS block size (64 MB by default). Moreover, one eNodeB generates nearly 300 MR files per day, so 1000 eNodeBs generate 300 × 1000 = 300,000 MR files per day. Uploading that many MR files to HDFS would seriously degrade the performance and scalability of HDFS.
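The arithmetic above can be reproduced with a short sketch. The per-eNodeB file count, the fleet size, and the 64 MB block size come from the text. Treating every MR file as exactly 1 MB is an assumption used only to get an upper bound on the aggregated volume, and the function names are hypothetical.

```python
import math

FILES_PER_ENODEB_PER_DAY = 300  # "nearly 300" in the text
ENODEB_COUNT = 1000             # example fleet size from the text
MAX_MR_FILE_MB = 1              # assumption: upper bound, files are "usually smaller than 1 MB"
HDFS_BLOCK_MB = 64              # default block size quoted in the text


def daily_file_count(enodebs: int = ENODEB_COUNT) -> int:
    """Number of separate MR files produced per day."""
    return FILES_PER_ENODEB_PER_DAY * enodebs


def blocks_if_aggregated(enodebs: int = ENODEB_COUNT) -> int:
    """Upper bound on HDFS blocks needed if one day's files are stored as one large file."""
    total_mb = daily_file_count(enodebs) * MAX_MR_FILE_MB
    return math.ceil(total_mb / HDFS_BLOCK_MB)
```

Under these assumptions, one day of data is 300,000 separate files when uploaded individually, but at most 4,688 blocks when stored as a single aggregated file.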

Summary of the invention

The present application provides a method and device for applying MR files in HDFS, so that MR files can be used in HDFS without degrading HDFS.

The technical solution provided by the present application includes:

A method for applying measurement report (MR) files in the distributed file system HDFS, comprising:

the MR file server receives an MR file;

the MR file server determines whether the acquisition time of the MR file falls in the same period as the acquisition time of the MR files in the local unfinished summary file, where the same period includes but is not limited to the same day, the same week, or the same month;

if so, it aggregates the received MR file into the summary file, parses the MR file to form a corresponding MR file index block, and stores the MR file index block in the local index file;

if not, it closes the summary file, uploads it to the HDFS client as a completed summary file, creates a new local summary file marked as unfinished, aggregates the received MR file into the newly created summary file, parses the MR file to form a corresponding MR file index block, and stores the MR file index block in the local index file.

A device for applying measurement report (MR) files in the distributed file system HDFS, the device being applied to an MR file server and comprising:

a receiving unit, configured to receive an MR file;

a judging unit, configured to determine whether the acquisition time of the MR file falls in the same period as the acquisition time of the MR files in the local unfinished summary file, where the same period includes but is not limited to the same day, the same week, or the same month;

a summary unit, configured to, when the judging unit's result is yes, aggregate the received MR file into the summary file, parse the MR file to form a corresponding MR file index block, and store the MR file index block in the local index file; and,

when the judging unit's result is no, to close the summary file, upload it to the HDFS client as a completed summary file, create a new local summary file marked as unfinished, aggregate the received MR file into the newly created summary file, parse the MR file to form a corresponding MR file index block, and store the MR file index block in the local index file.

As the above technical solution shows, in the present invention the MR file server does not upload MR files to HDFS (the HDFS client) one by one. Instead, it aggregates all MR files from the same period into a single summary file (forming one large data file) and uploads that file to HDFS (the HDFS client). This matches the properties of HDFS itself and preserves the performance and scalability of Hadoop.

Description of drawings

Fig. 1 is a flowchart of the method provided by the present invention;

Fig. 2 is an application diagram of the flow shown in Fig. 1;

Fig. 3 is a structural diagram of the summary file provided by the present invention;

Fig. 4 is a structural diagram of the index file provided by the present invention;

Fig. 5 is a structural diagram of the index data logical information provided by the present invention;

Fig. 6 is a flowchart of MR file query provided by the present invention;

Fig. 7 is a flowchart of another MR file query provided by the present invention;

Fig. 8 is a structural diagram of the device provided by the present invention.

Detailed description

In order to make the purpose, technical solution, and advantages of the present invention clearer, the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.

Referring to Fig. 1, Fig. 1 is a flowchart of the method provided by the present invention. As shown in Fig. 1, the flow may include the following steps:

Step 101: the MR file server receives an MR file.

In the present invention, the user configures measurement items as required: some are configured to be reported as statistical values, and the others as sample values. For measurement items reported as statistical values, the eNodeB collects samples of the relevant measurement data, computes statistics, generates an MR file, and periodically uploads the MR file to the MR file server. For measurement items reported as sample values, the eNodeB collects and organizes the measurement data into an MR file and periodically uploads it to the MR file server.

Step 102: the MR file server determines whether the acquisition time of the MR file falls in the same period as the acquisition time of the MR files in the local unfinished summary file. If so, step 103 is executed; if not, step 104 is executed.

The name of an MR file contains its acquisition time, so in step 102 the MR file server parses the MR file name to obtain the acquisition time. The MR file in the local unfinished summary file referred to in step 102 may be any MR file in that summary file, and in a specific implementation the same period includes but is not limited to the same day, the same week, or the same month. Taking the same day as the period, in step 102 the MR file server parses the MR file name to obtain the acquisition time of the MR file, and then determines whether that acquisition time falls on the same day as the acquisition time of any MR file in the local unfinished summary file. If so, step 103 is executed; if not, step 104 is executed.
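The check in step 102 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the file name layout <eNodeB ID>_<MR file type>_<board ID>_<acquisition time>.<extension> is taken from the index description in this document, while the timestamp format YYYYMMDDHHMMSS and all function names are assumptions.

```python
from datetime import datetime

# Assumption: the source does not specify how the acquisition time is written.
TIME_FORMAT = "%Y%m%d%H%M%S"


def acquisition_time(mr_file_name: str) -> datetime:
    """Read the acquisition time from the last "_"-separated field of the file name."""
    stem = mr_file_name.rsplit(".", 1)[0]   # drop the extension
    time_text = stem.rsplit("_", 1)[1]      # last field before the extension
    return datetime.strptime(time_text, TIME_FORMAT)


def period_key(moment: datetime, period: str = "day") -> tuple:
    """Map a time to a key that is equal for all times in the same period."""
    if period == "day":
        return (moment.year, moment.month, moment.day)
    if period == "week":
        iso = moment.isocalendar()          # (ISO year, ISO week, weekday)
        return (iso[0], iso[1])
    if period == "month":
        return (moment.year, moment.month)
    raise ValueError(f"unsupported period: {period}")


def same_period(new_name: str, existing_name: str, period: str = "day") -> bool:
    """Step 102: compare the new file with any file already in the unfinished summary file."""
    new_key = period_key(acquisition_time(new_name), period)
    old_key = period_key(acquisition_time(existing_name), period)
    return new_key == old_key
```

Because every file in an unfinished summary file belongs to the same period, comparing against a single existing file is sufficient, which is why the text allows "any" MR file to be used.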

Step 103: aggregate the received MR file into the summary file, parse the MR file to form a corresponding MR file index block, and store the MR file index block in the local index file.

Step 104: close the summary file, upload it to the HDFS client as a completed summary file, create a new local summary file marked as unfinished, aggregate the received MR file into the newly created summary file, parse the MR file to form a corresponding MR file index block, and store the MR file index block in the local index file.

This completes the flow shown in Fig. 1.
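Steps 101 to 104 can be sketched as a small aggregator. This is a simplified illustration under several assumptions: the period is one day, the acquisition time is the last "_"-separated field of the file name in YYYYMMDDHHMMSS form, the index is kept in memory rather than in the on-disk index file format, and `upload` is a hypothetical callback standing in for the hand-off to the HDFS client. The class and file names are invented for the example.

```python
import os
from typing import Callable, List, Optional, Tuple


def day_of(mr_file_name: str) -> str:
    """Assumed layout: ..._<YYYYMMDDHHMMSS>.<ext>; returns the YYYYMMDD part."""
    stem = mr_file_name.rsplit(".", 1)[0]
    return stem.rsplit("_", 1)[1][:8]


class MrAggregator:
    def __init__(self, work_dir: str, upload: Callable[[str], None]):
        self.work_dir = work_dir
        self.upload = upload                      # hypothetical hook to the HDFS client
        self.current_day: Optional[str] = None
        self.summary_path: Optional[str] = None   # the local unfinished summary file
        # In-memory stand-in for the index file:
        # (MR file name, summary file name, start offset, length)
        self.index: List[Tuple[str, str, int, int]] = []

    def receive(self, mr_file_name: str, payload: bytes) -> None:
        """Step 101: accept one MR file and route it per steps 102 to 104."""
        day = day_of(mr_file_name)                # step 102
        if self.summary_path is not None and day != self.current_day:
            # Step 104: the file is only open during each append, so it is
            # already closed here and can be handed over as completed.
            self.upload(self.summary_path)
            self.summary_path = None
        if self.summary_path is None:
            # Create a new unfinished summary file. Doing this for the very
            # first MR file as well is an assumption; the text does not say.
            self.current_day = day
            self.summary_path = os.path.join(self.work_dir, f"summary_{day}.dat")
            open(self.summary_path, "wb").close()
        offset = os.path.getsize(self.summary_path)
        with open(self.summary_path, "ab") as summary:   # step 103: append
            summary.write(payload)
        self.index.append(
            (mr_file_name, os.path.basename(self.summary_path), offset, len(payload))
        )
```

Recording the start offset and length at append time is what later allows a single MR file to be read back out of the large summary file.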

As the flow in Fig. 1 shows, in the present invention the MR file server does not upload MR files to HDFS (the HDFS client) one by one. Instead, it aggregates all MR files from the same period into a single summary file (forming one large data file) and uploads that file to HDFS (the HDFS client). This matches the properties of HDFS itself and preserves the performance and scalability of Hadoop. Fig. 2 shows an application of the flow in Fig. 1, and Fig. 3 shows the structure of the summary file.

It should be noted that, in the present invention, the index file stores the index fields used to access the individual small MR files within a summary file. It mainly includes:

1) data blocks;

2) file information;

3) data index blocks;

4) file tail information.

Fig. 4 shows the structure of the index file.

The data block and the data index block are collectively referred to as the MR file index block.

As shown in Fig. 4, in a preferred embodiment, a data block mainly contains the following two parts:

Data block flag: a field marker, preferably 8 bytes, with the fixed bytes preferably being {'D', 'A', 'T', 'A', 'B', 'L', 'K', 99};

Index data logical information: describes the MR file and its location in the summary file. Its structure is shown in Fig. 5 and may specifically include:

MR file name: has the form <eNodeB ID>_<MR file type>_<board ID>_<acquisition time>.<extension>, with a field length of preferably 100 bytes. The eNodeB ID identifies the base station (eNodeB) that uploaded the MR file and preferably has a field length of 30 bytes; the board ID also preferably has a field length of 30 bytes.

Timestamp: the generation time of the data block, field length preferably 30 bytes;

MR file length: field length preferably 30 bytes;

Start position in the summary file: the position at which the MR file starts within the summary file, field length preferably 30 bytes;

MR file type: field length preferably 30 bytes;

Summary file name: the name of the summary file into which the MR file is aggregated, field length preferably 100 bytes;

Summary file extension: the extension of the summary file into which the MR file is aggregated, field length preferably 10 bytes.
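The data block layout above can be illustrated with fixed-width packing. The field order and byte widths follow the preferred lengths listed in the text, which add up to 338 bytes. The text does not specify how values are encoded, so storing every value as NUL-padded ASCII text (including the numeric length and start position) is an assumption, as are the function and key names.

```python
import struct

DATA_BLOCK_FLAG = b"DATABLK" + bytes([99])   # {'D','A','T','A','B','L','K',99}

# flag, MR file name, timestamp, MR file length, start position,
# MR file type, summary file name, summary file extension
DATA_BLOCK = struct.Struct("<8s100s30s30s30s30s100s10s")   # 338 bytes in total


def pack_data_block(mr_file_name: str, timestamp: str, mr_file_length: int,
                    start: int, mr_file_type: str,
                    summary_name: str, summary_ext: str) -> bytes:
    """Serialize one data block; short values are padded with NUL bytes."""
    def text(value) -> bytes:
        return str(value).encode("ascii")

    return DATA_BLOCK.pack(
        DATA_BLOCK_FLAG, text(mr_file_name), text(timestamp),
        text(mr_file_length), text(start), text(mr_file_type),
        text(summary_name), text(summary_ext),
    )


def unpack_data_block(raw: bytes) -> dict:
    """Parse one data block and check its flag."""
    fields = DATA_BLOCK.unpack(raw)
    if fields[0] != DATA_BLOCK_FLAG:
        raise ValueError("not a data block")
    name, stamp, length, start, kind, summary, ext = (
        field.rstrip(b"\0").decode("ascii") for field in fields[1:]
    )
    return {
        "mr_file_name": name, "timestamp": stamp,
        "mr_file_length": int(length), "start": int(start),
        "mr_file_type": kind, "summary_name": summary, "summary_ext": ext,
    }
```

Fixed-width records like this let a reader jump to the n-th block by multiplying the record size, at the cost of padding short values.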

As shown in Fig. 4, in a preferred embodiment, the file information may contain various kinds of additional information, and may specifically include:

File information flag: a field marker, field length preferably 8 bytes, with the fixed bytes preferably being {'F', 'I', 'L', 'E', 'F', 'L', 'G', 99}.

Vendor information: field length preferably 30 bytes; it can be set to the configured vendor information.

Reserved field (Reserve): field length preferably 100 bytes.

As shown in Fig. 4, in a preferred embodiment, the data index block records the offset of a data block within the index file and mainly contains:

Data index block flag: a field marker, field length preferably 8 bytes, with the fixed bytes preferably being {'D', 'A', 'T', 'A', 'I', 'N', 'D', 99}.

Start position (offset) of the data block in the index file: field length preferably 8 bytes;

Length of the data block: integer type, field length preferably 4 bytes;

Row key of the data block: type char, field length preferably 100 bytes, containing at least the timestamp described above.
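A data index block with the preferred field widths above occupies 8 + 8 + 4 + 100 = 120 bytes. The sketch below packs and unpacks one such block. The little-endian byte order, the signed integer types, and the NUL-padded ASCII row key are assumptions, since the text gives only field lengths; the function names are hypothetical.

```python
import struct

DATA_INDEX_FLAG = b"DATAIND" + bytes([99])   # {'D','A','T','A','I','N','D',99}

# flag (8 bytes), offset of the data block (8 bytes),
# length of the data block (4 bytes), row key (100 bytes)
DATA_INDEX_BLOCK = struct.Struct("<8sqi100s")   # 120 bytes in total


def pack_data_index_block(offset: int, length: int, row_key: str) -> bytes:
    """Serialize one data index block; the row key holds at least the timestamp."""
    return DATA_INDEX_BLOCK.pack(
        DATA_INDEX_FLAG, offset, length, row_key.encode("ascii")
    )


def unpack_data_index_block(raw: bytes) -> tuple:
    """Return (offset, length, row_key) after checking the flag."""
    flag, offset, length, row_key = DATA_INDEX_BLOCK.unpack(raw)
    if flag != DATA_INDEX_FLAG:
        raise ValueError("not a data index block")
    return offset, length, row_key.rstrip(b"\0").decode("ascii")
```

A reader would use the returned offset and length to seek to the matching data block inside the index file.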

As shown in Fig. 4, in a preferred embodiment, the file tail information is the trailing field of the file. It contains additional information needed to read the file and mainly includes:

File tail flag: a field marker, field length preferably 8 bytes, with the following fixed bytes: {'F', 'I', 'L', 'E', 'T', 'N', 'D', 99}.

Number of data blocks: the number of data block fields, type int, field length preferably 4 bytes;

Number of data index blocks: the number of data index block fields, type int, field length preferably 4 bytes;

File version (Version): type char, field length preferably 50 bytes.
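With the preferred widths above, the file tail occupies 8 + 4 + 4 + 50 = 66 bytes. The sketch below reads it from the end of an index file held in memory. That the tail occupies exactly the last 66 bytes of the file is an assumption based on its name, as are the little-endian integers, the NUL-padded ASCII version string, and the function names.

```python
import struct

FILE_TAIL_FLAG = b"FILETND" + bytes([99])   # {'F','I','L','E','T','N','D',99}

# flag (8 bytes), data block count (4), data index block count (4), version (50)
FILE_TAIL = struct.Struct("<8sii50s")       # 66 bytes in total


def pack_file_tail(data_blocks: int, index_blocks: int, version: str) -> bytes:
    """Serialize the file tail information."""
    return FILE_TAIL.pack(
        FILE_TAIL_FLAG, data_blocks, index_blocks, version.encode("ascii")
    )


def read_file_tail(index_file_bytes: bytes) -> tuple:
    """Return (data block count, data index block count, version)."""
    raw = index_file_bytes[-FILE_TAIL.size:]   # assumption: tail is at the very end
    flag, data_blocks, index_blocks, version = FILE_TAIL.unpack(raw)
    if flag != FILE_TAIL_FLAG:
        raise ValueError("file tail flag not found")
    return data_blocks, index_blocks, version.rstrip(b"\0").decode("ascii")
```

Reading the tail first tells a reader how many data blocks and data index blocks to expect before it scans the rest of the index file.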

The index file used in the present invention has been described above.

Preferably, in the present invention, when an MR file needs to be queried, it can be read by means of the index file described above.

Referring to Fig. 6, Fig. 6 is a flowchart of MR file query provided by the present invention. As shown in Fig. 6, the flow may include the following steps:

Step 601: the MR file server receives a query request for an MR file.

Here, the query request may carry the file name of the MR file.

Step 602: find the corresponding MR file index block in the local index file according to the acquisition time of the MR file to be queried.

As described above, the query request carries the file name of the MR file, and the file name contains at least the acquisition time of the MR file. Therefore, in step 602, the acquisition time can be obtained by parsing the file name carried in the query request, and that acquisition time is then used as a keyword to find the MR file index block in the local index file that contains it.

Step 603: using the summary file name and the start position of the MR file within the summary file, both taken from the MR file index block that was found, read the corresponding MR file from the HDFS client.

This completes the flow shown in Fig. 6.
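Steps 601 to 603 can be sketched as a lookup followed by a ranged read. This is an illustration under assumptions: index entries are held in memory as simple records, the acquisition time is the last "_"-separated field of the file name, and `read_range` is a hypothetical callable standing in for a positioned read through the HDFS client. Comparing the full file name after the time matches is an added tie-breaker, not something the text specifies, for the case where several eNodeBs share an acquisition time.

```python
from typing import Callable, List, NamedTuple, Optional


class IndexEntry(NamedTuple):
    mr_file_name: str
    summary_file_name: str
    start: int     # start position of the MR file in the summary file
    length: int    # MR file length


def acquisition_time_text(mr_file_name: str) -> str:
    """Assumed layout: the acquisition time is the last "_" field before the extension."""
    return mr_file_name.rsplit(".", 1)[0].rsplit("_", 1)[1]


def find_entry(index: List[IndexEntry], requested_name: str) -> Optional[IndexEntry]:
    """Step 602: search by acquisition time, then confirm the full name."""
    keyword = acquisition_time_text(requested_name)
    for entry in index:
        if keyword in entry.mr_file_name and entry.mr_file_name == requested_name:
            return entry
    return None


def query(index: List[IndexEntry], requested_name: str,
          read_range: Callable[[str, int, int], bytes]) -> Optional[bytes]:
    """Steps 601 to 603: return the MR file bytes, or None if it is not indexed."""
    entry = find_entry(index, requested_name)
    if entry is None:
        return None
    # Step 603: read only the byte range that belongs to this MR file.
    return read_range(entry.summary_file_name, entry.start, entry.length)
```

The scan here is linear in the number of index entries, which is the cost the bisection-based flow in Fig. 7 aims to reduce.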

As one embodiment, the present invention also provides a flow that uses bisection to locate and query MR files quickly. This flow requires the MR file index blocks in the local index file to be arranged in chronological order. Referring to Fig. 7, Fig. 7 is a flowchart of another MR file query provided by the present invention. As shown in Fig. 7, the flow may include the following steps:

Step 701: the MR file server receives a row key that contains at least a timestamp (T1 in this example).

Step 702: the MR file server computes the middle position (mid) in the local index file.

Here, mid can be computed as follows:

mid = (low + high) / 2, where low and high are the lowest and highest positions of the MR file index blocks in the local index file.

Step 703: take mid as the current position.

Step 704: locate the MR file index block at the current position in the local index file.

Step 705: determine whether the received row key matches the row key contained in the data index block of the located MR file index block. If so, step 706 is executed; if not, step 707 is executed.

Step 706: using the summary file name and the start position of the MR file within the summary file, both taken from the located MR file index block, read the corresponding MR file from the HDFS client.

Step 707: let T2 be the timestamp in the row key contained in the data index block of the located MR file index block. When T1 is less than T2, take the position before the current position in the local index file as the current position and return to step 704; when T1 is greater than T2, take the position after the current position as the current position and return to step 704.

This completes the flow shown in Fig. 7.
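Steps 701 to 707 can be sketched as follows, following the text literally: the search starts at mid and then moves one position at a time toward T1. The text does not say what happens when no index block matches, so the bounds check and the direction-reversal check that end the search are added assumptions. Row keys are reduced to bare fixed-width timestamp strings for the example, and the function name is hypothetical.

```python
from typing import Optional, Sequence


def locate(timestamps: Sequence[str], t1: str) -> Optional[int]:
    """Return the position of the index block whose timestamp equals t1.

    `timestamps` holds the row-key timestamps of the MR file index blocks
    in chronological (ascending) order, as the flow requires.
    """
    if not timestamps:
        return None
    low, high = 0, len(timestamps) - 1
    current = (low + high) // 2            # steps 702 and 703
    direction = 0
    while low <= current <= high:          # assumption: stop at either end
        t2 = timestamps[current]           # step 704
        if t1 == t2:                       # step 705
            return current                 # the caller then performs step 706
        step = -1 if t1 < t2 else 1        # step 707
        if direction and step != direction:
            # Assumption: reversing direction means t1 lies between two
            # neighbouring blocks, so no block matches.
            return None
        direction = step
        current += step
    return None
```

Because the position changes by one block per iteration after the initial jump to mid, the worst case is still linear in the number of index blocks. A conventional binary search would instead narrow low and high by half on every comparison; Python's standard `bisect` module provides that behaviour for sorted sequences.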

The flows shown in Fig. 6 and Fig. 7 make it possible to locate the required MR file quickly.

The method provided by the present invention has been described above. The device provided by the present invention is described below.

Referring to Fig. 8, Fig. 8 is a structural diagram of the device provided by the present invention. The device is applied to an MR file server and, as shown in Fig. 8, includes:

a receiving unit, configured to receive an MR file;

a judging unit, configured to determine whether the acquisition time of the MR file falls in the same period as the acquisition time of the MR files in the local unfinished summary file, where the same period includes but is not limited to the same day, the same week, or the same month;

a summary unit, configured to, when the judging unit's result is yes, aggregate the received MR file into the summary file, parse the MR file to form a corresponding MR file index block, and store the MR file index block in the local index file; and,

when the judging unit's result is no, to close the summary file, upload it to the HDFS client as a completed summary file, create a new local summary file marked as unfinished, aggregate the received MR file into the newly created summary file, parse the MR file to form a corresponding MR file index block, and store the MR file index block in the local index file.

Preferably, the MR file index block that the summary unit produces for an MR file includes:

a data block, containing at least a data block flag and index data logical information, where the index data logical information contains at least the MR file name, the generation time of the data block, the name and extension of the summary file into which the MR file is aggregated, and the start position in the summary file; the MR file name contains at least the acquisition time and the length of the MR file;

a data index block, containing at least a data index block flag, the length of the data block and its start position in the index file, and the row key of the data block.

Preferably, in addition to the MR file index blocks, the index file further includes:

file information, containing at least a file information flag, vendor information (Vendor), and a reserved field (Reserve), where the vendor information is preconfigured;

file tail information, containing at least a file tail flag, the number of data blocks, the number of data index blocks, and the file version (Version).

Preferably, the device further includes:

a first query unit, configured to receive a query request for an MR file, find the corresponding MR file index block in the local index file according to the acquisition time of the MR file to be queried, and, using the summary file name and the start position of the MR file within the summary file taken from the MR file index block that was found, read the corresponding MR file from the HDFS client.

Preferably, the MR file index blocks in the local index file are arranged in chronological order;

the device further includes:

a second query unit, configured to receive a row key that contains at least a timestamp T1; compute the middle position mid in the local index file, where mid = (low + high) / 2 and low and high are the lowest and highest positions of the MR file index blocks in the local index file; take mid as the current position; locate the MR file index block at the current position in the local index file; and determine whether the received row key matches the row key contained in the data index block of the located MR file index block;

if so, using the summary file name and the start position of the MR file within the summary file taken from the located MR file index block, read the corresponding MR file from the HDFS client;

if not, when the timestamp T1 in the received row key is less than the timestamp T2 in the row key contained in the data index block of the located MR file index block, take the position before the current position in the local index file as the current position and return to locating the MR file index block at the current position in the local index file; when T1 is greater than T2, take the position after the current position as the current position and return to locating the MR file index block at the current position in the local index file.

This completes the description of the device shown in Fig. 8.

The device provided by the present invention has been described above.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A method for applying measurement report (MR) files in the distributed file system HDFS, characterized in that the method comprises:

an MR file server receives an MR file;

the MR file server determines whether the acquisition time of the MR file falls in the same period as the acquisition time of the MR files in the local unfinished summary file, where the same period includes but is not limited to the same day, the same week, or the same month;

if so, it aggregates the received MR file into the summary file, parses the MR file to form a corresponding MR file index block, and stores the MR file index block in the local index file;

if not, it closes the summary file, uploads it to the HDFS client as a completed summary file, creates a new local summary file marked as unfinished, aggregates the received MR file into the newly created summary file, parses the MR file to form a corresponding MR file index block, and stores the MR file index block in the local index file.

2. The method according to claim 1, characterized in that the MR file index block corresponding to the MR file comprises:

a data block, containing at least a data block flag and index data logical information, where the index data logical information contains at least the MR file name, the generation time of the data block, the name and extension of the summary file into which the MR file is aggregated, and the start position in the summary file; the MR file name contains at least the acquisition time and the length of the MR file;

a data index block, containing at least a data index block flag, the length of the data block and its start position in the index file, and the row key of the data block.

3. The method according to claim 2, characterized in that, in addition to the MR file index blocks, the index file further comprises:

file information, containing at least a file information flag, vendor information (Vendor), and a reserved field (Reserve), where the vendor information is preconfigured;

file tail information, containing at least a file tail flag, the number of data blocks, the number of data index blocks, and the file version (Version).

4. The method according to claim 2, characterized in that the method further comprises:

receiving a query request for an MR file;

finding the corresponding MR file index block in the local index file according to the acquisition time of the MR file to be queried;

using the summary file name and the start position of the MR file within the summary file, both taken from the MR file index block that was found, reading the corresponding MR file from the HDFS client.

5. The method according to claim 2, characterized in that the MR file index blocks in the local index file are arranged in chronological order;

the method further comprises:

receiving a row key that contains at least a timestamp T1;

computing the middle position mid in the local index file, where mid = (low + high) / 2 and low and high are the lowest and highest positions of the MR file index blocks in the local index file, and taking mid as the current position;

locating the MR file index block at the current position in the local index file;

determining whether the received row key matches the row key contained in the data index block of the located MR file index block;

if so, using the summary file name and the start position of the MR file within the summary file, both taken from the located MR file index block, reading the corresponding MR file from the HDFS client;

if not, when the timestamp T1 in the received row key is less than the timestamp T2 in the row key contained in the data index block of the located MR file index block, taking the position before the current position in the local index file as the current position and returning to locating the MR file index block at the current position in the local index file; when T1 is greater than T2, taking the position after the current position as the current position and returning to locating the MR file index block at the current position in the local index file.

6. A device for applying measurement report (MR) files in the distributed file system HDFS, the device being applied to an MR file server, characterized in that the device comprises:

a receiving unit, configured to receive an MR file;

a judging unit, configured to determine whether the acquisition time of the MR file falls in the same period as the acquisition time of the MR files in the local unfinished summary file, where the same period includes but is not limited to the same day, the same week, or the same month;

a summary unit, configured to, when the judging unit's result is yes, aggregate the received MR file into the summary file, parse the MR file to form a corresponding MR file index block, and store the MR file index block in the local index file; and,

when the judging unit's result is no, to close the summary file, upload it to the HDFS client as a completed summary file, create a new local summary file marked as unfinished, aggregate the received MR file into the newly created summary file, parse the MR file to form a corresponding MR file index block, and store the MR file index block in the local index file.
stored in the local index file.7.根据权利要求6所述的装置,其特征在于,所述汇总单元解析出的MR文件对应的MR文件索引块包括:7. The device according to claim 6, wherein the MR file index block corresponding to the MR file parsed by the summary unit includes:数据块block,至少包含数据block标识flag和索引数据逻辑信息,其中,索引数据逻辑信息至少包含MR文件名称、数据block的生成时间、MR文件汇总至的汇总文件名称、扩展名、以及位于汇总文件中的起始位置;所述MR文件名称至少包含采集时间、MR文件的长度;The data block block at least includes the data block identification flag and index data logic information, wherein the index data logic information includes at least the name of the MR file, the generation time of the data block, the name of the summary file to which the MR file is summarized, the extension, and the location of the summary file The starting position in; the MR file name at least includes the acquisition time and the length of the MR file;数据索引block,至少包含数据索引block flag、数据block的长度及在索引文件中的起始位置、以及数据block中的行关键字。The data index block at least includes the data index block flag, the length of the data block and the starting position in the index file, and the row key in the data block.8.根据权利要求7所述的装置,其特征在于,所述索引文件除了包含MR文件索引块之外,还进一步包括:8. The device according to claim 7, characterized in that, in addition to including the MR file index block, the index file further includes:文件信息,至少包含文件信息flag、厂商Vendor信息、保留字段Reserve,其中,Vendor信息为预配置的厂商信息;File information, at least including file information flag, vendor Vendor information, reserved field Reserve, where Vendor information is pre-configured vendor information;文件尾信息,至少包含文件尾flag,数据block数量、数据索引block数量、文件版本Version。File tail information, including at least the file tail flag, the number of data blocks, the number of data index blocks, and the file version.9.根据权利要求7所述的装置,其特征在于,该装置进一步包括:9. 
The device according to claim 7, further comprising:第一查询单元,用于接收用于查询MR文件的查询请求,依据待查询的MR文件的采集时间在本地索引文件中找到对应的MR文件索引块,并根据找到的MR文件索引块中的汇总文件名称、MR文件位于汇总文件中的起始位置去HDFS客户端读取对应的MR文件。The first query unit is configured to receive a query request for querying the MR file, find the corresponding MR file index block in the local index file according to the acquisition time of the MR file to be queried, and summarize the MR file according to the found MR file index block The file name, the starting position of the MR file in the summary file, and the HDFS client reads the corresponding MR file.10.根据权利要求7所述的装置,其特征在于,本地索引文件中的MR文件索引块按照时间先后顺序排列;10. The device according to claim 7, wherein the index blocks of the MR file in the local index file are arranged in chronological order;该装置进一步包括:The device further includes:第二查询单元,用于接收行关键字,接收的行关键字中至少包含时间戳T1,计算本地索引文件中的中间位置mid,mid=(本地索引文件中MR文件索引块处于的最低端low位置+最高端high位置)/2,将mid作为当前位置,定位本地索引文件中处于当前位置的MR文件索引块,判断接收的行关键字是否与定位出的MR文件索引块中数据索引block包含的行关键字一致,The second query unit is used to receive the row key, and the received row key contains at least the time stamp T1, and calculates the middle position mid in the local index file, mid=(the lowest end low of the MR file index block in the local index file position + the highest high position)/2, use mid as the current position, locate the MR file index block at the current position in the local index file, and judge whether the received row keyword is included in the data index block in the located MR file index block The row keywords are the same,如果是,根据定位出的MR文件索引块中的汇总文件名称、MR文件位于汇总文件中的起始位置去HDFS客户端读取对应的MR文件;If so, go to the HDFS client to read the corresponding MR file according to the summary file name in the located MR file index block and the starting position of the MR file in the summary file;如果否,当接收的行关键字中包含的时间戳T1小于定位出的MR文件索引块中数据索引block包含的行关键字中的时间戳T2,将本地索引文件中当前位置的上一个位置作为当前位置,返回定位本地索引文件中处于当前位置的MR文件索引块;当T1大于T2,将本地索引文件中当前位置的下一个位置作为当前位置,返回定位本地索引文件中处于当前位置的MR文件索引块。If not, when the timestamp T1 contained in 
the received row key is smaller than the timestamp T2 contained in the row key contained in the data index block in the located MR file index block, the previous position of the current position in the local index file is used as Current position, returns and locates the MR file index block at the current position in the local index file; when T1 is greater than T2, takes the next position of the current position in the local index file as the current position, returns and locates the MR file at the current position in the local index file index block.
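The aggregation flow of claims 1 and 4 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the class and field names (`MrAggregator`, `IndexEntry`), the file names, and the choice of "same day" as the period are assumptions made for illustration, the index entry is a simplified stand-in for the MR file index block of claim 2, and the upload to the HDFS client is replaced by a plain list so the example runs without a Hadoop installation.

```python
import tempfile
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path


@dataclass
class IndexEntry:
    """Simplified stand-in for one MR file index block (hypothetical layout)."""
    mr_name: str
    collected: datetime
    summary_name: str
    offset: int   # starting position of the MR file inside the summary file
    length: int   # length of the MR file in bytes


class MrAggregator:
    """Appends all MR files of one period (assumed here: one day) to one summary file."""

    def __init__(self, work_dir):
        self.work_dir = work_dir
        self.period = None    # day covered by the unfinished summary file
        self.summary = None   # path of the unfinished summary file
        self.index = []       # local index, in arrival order
        self.uploaded = []    # stand-in for "upload to the HDFS client"

    def _open_new_summary(self, day):
        self.period = day
        self.summary = self.work_dir / f"summary-{day.isoformat()}.dat"
        self.summary.write_bytes(b"")

    def receive(self, mr_name, collected, payload):
        day = collected.date()
        if self.period is None:
            self._open_new_summary(day)
        elif day != self.period:
            # Different period: the old summary file is complete, hand it off
            # and start a new unfinished one.
            self.uploaded.append(self.summary)
            self._open_new_summary(day)
        offset = self.summary.stat().st_size
        with self.summary.open("ab") as f:
            f.write(payload)
        entry = IndexEntry(mr_name, collected, self.summary.name, offset, len(payload))
        self.index.append(entry)
        return entry

    def find(self, collected):
        """Claim 4 style lookup: first index entry with this collection time."""
        return next((e for e in self.index if e.collected == collected), None)

    def read(self, entry):
        """Read one MR file back out of its summary file by offset and length."""
        with (self.work_dir / entry.summary_name).open("rb") as f:
            f.seek(entry.offset)
            return f.read(entry.length)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        agg = MrAggregator(Path(tmp))
        a = agg.receive("mr_a.xml", datetime(2016, 5, 19, 8, 0), b"AAAA")
        b = agg.receive("mr_b.xml", datetime(2016, 5, 19, 9, 0), b"BBBBBB")
        c = agg.receive("mr_c.xml", datetime(2016, 5, 20, 8, 0), b"CC")
        found = agg.find(datetime(2016, 5, 19, 9, 0))
        return (a.offset, b.offset, c.offset, len(agg.uploaded), agg.read(found))
```

Calling `demo()` stores two files in the summary file for 19 May, rolls over to a new summary file when the file for 20 May arrives, and then reads the second file back using only the summary file name, offset, and length held in the index.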
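The positional lookup of claims 5 and 10 can be sketched in Python as follows. The sketch makes several assumptions that the claims do not state: the row key is reduced to its timestamp, the index is an in-memory list of hypothetical `IndexBlock` records sorted by timestamp, and the search stops with "not found" when it leaves the index or when the direction of movement reverses. The claims themselves give no termination rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexBlock:
    """Hypothetical, reduced MR file index block."""
    timestamp: int      # T2, taken from the row key of the data index block
    summary_name: str   # summary file that holds the MR file
    offset: int         # starting position of the MR file in the summary file


def find_block(blocks, t1):
    """Return the position of the block whose timestamp equals t1, or None.

    `blocks` must be sorted by timestamp in ascending order.
    """
    if not blocks:
        return None
    low, high = 0, len(blocks) - 1
    pos = (low + high) // 2       # mid = (low + high) / 2
    last_move = 0                 # direction of the previous step, 0 = none yet
    while low <= pos <= high:
        t2 = blocks[pos].timestamp
        if t1 == t2:
            return pos
        move = -1 if t1 < t2 else 1
        if last_move and move != last_move:
            # Direction reversed: t1 lies between two neighbouring blocks,
            # so no block matches (assumed stop rule).
            return None
        last_move = move
        pos += move               # previous or next position, as in the claim
    return None
```

As worded, the claim moves one position at a time from the midpoint, so the worst case is linear in the number of index blocks. Narrowing `low` and `high` after each comparison would give a conventional binary search, but that is not what the claim text describes.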
Application CN201610333313.8A, priority date 2016-05-19, filing date 2016-05-19: MR files apply the implementation method and device in HDFS (Withdrawn), published as CN107402924A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201610333313.8A (publication CN107402924A (en)) | 2016-05-19 | 2016-05-19 | MR files apply the implementation method and device in HDFS

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201610333313.8A (publication CN107402924A (en)) | 2016-05-19 | 2016-05-19 | MR files apply the implementation method and device in HDFS

Publications (1)

Publication Number | Publication Date
CN107402924A (en) | 2017-11-28

Family

ID=60393950

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201610333313.8A (Withdrawn; publication CN107402924A (en)) | MR files apply the implementation method and device in HDFS | 2016-05-19 | 2016-05-19

Country Status (1)

Country | Link
CN (1) | CN107402924A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN102902716A (en) * | 2012-08-27 | 2013-01-30 | 苏州两江科技有限公司 | Storage system based on Hadoop distributed computing platform
CN104572670A (en) * | 2013-10-15 | 2015-04-29 | 方正国际软件(北京)有限公司 | Small file storage, query and deletion method and system
CN104765876A (en) * | 2015-04-24 | 2015-07-08 | 中国人民解放军信息工程大学 | Massive GNSS small file cloud storage method
CN104778270A (en) * | 2015-04-24 | 2015-07-15 | 成都汇智远景科技有限公司 | Storage method for multiple files
CN104978330A (en) * | 2014-04-04 | 2015-10-14 | 西南大学 | Data storage method and device
CN105138571A (en) * | 2015-07-24 | 2015-12-09 | 四川长虹电器股份有限公司 | Distributed file system and method for storing lots of small files
CN105183839A (en) * | 2015-09-02 | 2015-12-23 | 华中科技大学 | Hadoop-based storage optimizing method for small file hierachical indexing

Similar Documents

Publication | Title
CN111526060B (en) | Method and system for processing service log
US8054756B2 (en) | Path discovery and analytics for network data
CN107634848B (en) | System and method for collecting and analyzing network equipment information
CN103559217A (en) | Heterogeneous database oriented massive multicast data storage implementation method
CN102332030A (en) | Data storage, management and query method and system for distributed key-value storage system
CN105512283A (en) | Data quality management and control method and device
US20230044850A1 (en) | Tracing and exposing data used for generating analytics
CN113312376B (en) | Method and terminal for real-time processing and analysis of Nginx logs
WO2020042029A1 (en) | Discovery method for invoked link, apparatus, device, and storage medium
CN108228322B (en) | Distributed link tracking and analyzing method, server and global scheduler
CN107133329A (en) | Data processing method, data processing equipment and storage medium
WO2019187208A1 (en) | Information processing device, data management system, data management method, and non-temporary computer-readable medium in which data management program is stored
CN108228432A (en) | A kind of distributed link tracking, analysis method and server, global scheduler
CN113297245A (en) | Method and device for acquiring execution information
CN112269726A (en) | Data processing method and device
CN106326280A (en) | Data processing method, apparatus and system
CN113162960A (en) | Data processing method, device, equipment and medium
CN112217657B (en) | Data transmission method, data processing method, device and medium based on SD-WAN system
CN102946423B (en) | Data mapping and pushing system and method based on distributed system architecture
CN114745436B (en) | Data collection method, device, computer equipment and storage medium
CN120045587A (en) | Data query method, device, equipment, storage medium and program product for distributed database
CN113778996B (en) | A method, device, electronic device and storage medium for processing large data streams
CN107341198B (en) | Electric power mass data storage and query method based on theme instance
CN107402924A (en) | MR files apply the implementation method and device in HDFS
US10027754B2 (en) | Large data set updating for network usage records

Legal Events

Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
WW01 | Invention patent application withdrawn after publication | Application publication date: 20171128
