CN116700596A - File storage method, device and system and computer readable storage medium - Google Patents

Publication number
CN116700596A
Authority
CN
China
Prior art keywords
file
target
metadata
size
namenode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210188324.7A
Other languages
Chinese (zh)
Inventor
冯榆斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile IoT Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile IoT Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile IoT Co Ltd
Priority to CN202210188324.7A
Publication of CN116700596A
Legal status: Pending

Abstract

Translated from Chinese

An embodiment of the present application provides a file storage method applied in a Hadoop-based file storage system, where Hadoop includes a NameNode, a Client, and a target cluster. The method includes: the NameNode receives a write request for a target file sent by the Client, where the write request carries the file size of the target file; the NameNode determines the metadata of the target file, where the metadata includes the storage address of the target file; when the file size is greater than or equal to a preset threshold, the NameNode stores the metadata of the target file locally; when the file size is smaller than the preset threshold, the NameNode stores the metadata of the target file in the target cluster; and the NameNode returns the storage address of the target file to the Client so that the file content of the target file can be stored. Embodiments of the present application also provide a file storage device, a file storage system, and a computer-readable storage medium.

Description

A file storage method, device, system and computer-readable storage medium

Technical field

The present application relates to the technical field of reading and writing files in a distributed file system, and in particular to a file storage method, device, system, and computer-readable storage medium.

Background

With the spread of the mobile Internet, the volume of data worldwide is growing explosively, and emerging industries such as artificial intelligence, machine learning, and cloud computing are developing rapidly. Traditional computer storage systems can no longer meet the demands of generating and using data at this scale, which gave rise to distributed file storage systems. Commonly used open-source distributed file systems include Hadoop, Lustre, MogileFS, and GoogleFS. Hadoop offers high reliability, strong scalability, fast storage, and high fault tolerance. One of its components is the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant, is designed to be deployed on low-cost hardware, and provides high-throughput access to application data, which makes it suitable for applications with very large data sets.

However, with the emergence of applications on mobile terminals such as mobile phones and tablet computers, a large number of small files such as pictures, videos, and documents are produced, most of them larger than 1 KB and smaller than 10 MB. When HDFS reads and writes small files, the NameNode needs a large amount of space to hold their metadata, so the directory information on the NameNode becomes large and complex, and retrieval is inefficient when files are queried.

Summary of the application

The present application mainly provides a file storage method, device, system, and computer-readable storage medium that can improve file retrieval efficiency.

The technical solution of the present application is implemented as follows:

An embodiment of the present application provides a file storage method applied in a Hadoop-based file storage system, where Hadoop includes a NameNode, a client (Client), and a target cluster. The method includes:

The NameNode receives a write request for a target file sent by the Client, where the write request carries the file size of the target file;

The NameNode determines the metadata of the target file, where the metadata includes the storage address of the target file;

When the file size is greater than or equal to a preset threshold, the NameNode stores the metadata of the target file locally;

When the file size is smaller than the preset threshold, the NameNode stores the metadata of the target file in the target cluster;

The NameNode returns the storage address of the target file to the Client so that the file content of the target file can be stored.

An embodiment of the present application provides a file storage device arranged in a Hadoop-based file storage system, where Hadoop includes a NameNode, a Client, and a target cluster. The device includes:

A receiving module, configured to receive a write request for a target file sent by the Client, where the write request carries the file size of the target file;

A determining module, configured to determine the metadata of the target file, where the metadata includes the storage address of the target file;

A first storage module, configured to store the metadata of the target file locally when the file size is greater than or equal to a preset threshold;

A second storage module, configured to store the metadata of the target file in the target cluster when the file size is smaller than the preset threshold;

A sending module, configured to return the storage address of the target file to the Client so that the file content of the target file can be stored.

An embodiment of the present application further provides a file storage system, including a processor and a storage medium storing instructions executable by the processor. The storage medium relies on the processor, through a communication bus, to perform operations; when the instructions are executed by the processor, the file storage method of one or more of the above embodiments is performed.

An embodiment of the present application provides a computer-readable storage medium storing executable instructions; when the executable instructions are executed by one or more processors, the file storage method of one or more of the above embodiments is performed.

The present application provides a file storage method, device, system, and computer-readable storage medium. The method is applied in a Hadoop-based file storage system, where Hadoop includes a NameNode, a Client, and a target cluster. The method includes: the NameNode receives a write request for a target file sent by the Client, where the write request carries the file size of the target file; the NameNode determines the metadata of the target file, where the metadata includes the storage address of the target file; when the file size is greater than or equal to a preset threshold, the NameNode stores the metadata locally; when the file size is smaller than the preset threshold, the NameNode stores the metadata in the target cluster; and the NameNode returns the storage address of the target file to the Client so that the file content can be stored. In the embodiments of the present application, after receiving the write request from the Client, the NameNode checks the size of the target file, keeps the metadata of files at or above the preset threshold locally, and stores the metadata of files below the preset threshold in the target cluster. Because the metadata of large and small files is stored separately, the directory information on the NameNode is simpler, so the directory information of a file can be found quickly at retrieval time, which improves the stability of HDFS and the file retrieval rate.

Description of drawings

FIG. 1 is a schematic flowchart of an optional file storage method provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of another optional file storage method provided by an embodiment of the present application;

FIG. 3 is a schematic flowchart of yet another optional file storage method provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of an optional HDFS architecture provided by an embodiment of the present application;

FIG. 5 is a schematic flowchart of Example 1 of an optional file storage method provided by an embodiment of the present application;

FIG. 6 is a schematic flowchart of Example 2 of an optional file storage method provided by an embodiment of the present application;

FIG. 7 is a schematic flowchart of Example 1 of an optional file merging method provided by an embodiment of the present application;

FIG. 8 is a schematic flowchart of Example 2 of an optional file merging method provided by an embodiment of the present application;

FIG. 9 is a schematic flowchart of an optional procedure for executing a file merge provided by an embodiment of the present application;

FIG. 10 is a schematic structural diagram of an optional file storage device provided by an embodiment of the present application;

FIG. 11 is a schematic structural diagram of an optional file storage system provided by an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application.

Several solutions to the HDFS small-file storage problem have been proposed in the related art:

1. Hadoop Archive (HAR) file archiving

HAR merges a large number of small files in HDFS into large files, which reduces the memory used by the NameNode. The generated archive consists of a two-level index and data files. However, reading a file from a HAR archive requires traversing both index levels, which can be slower than reading the file directly.

2. SequenceFile method

The core idea of a SequenceFile is that the key stores the file name and the value stores the file content; small files are merged into a large file before being stored. A SequenceFile compresses the files and saves disk space, but a file query has to traverse the entire SequenceFile, so retrieval is inefficient.

Neither solution handles the storage of large numbers of small files well. In particular, neither addresses the problem that a large number of small files makes the NameNode directory complex, which in turn makes file retrieval inefficient.

To improve file retrieval efficiency, an embodiment of the present application provides a file storage method applied in a Hadoop-based file storage system, where Hadoop includes a NameNode, a Client, and a target cluster. FIG. 1 is a schematic flowchart of an optional file storage method provided by an embodiment of the present application. Referring to FIG. 1, the file storage method may include:

S101: The NameNode receives a write request for a target file sent by the Client.

The write request carries the file size of the target file.

It should be noted that the Client is the entry point for all HDFS reads and writes. It is responsible for managing the basic information of the file system and for sending read and write requests.

Specifically, the Client can split the target file into multiple data blocks and can send a write request for the target file to the NameNode agent deployed on the NameNode. The NameNode agent obtains the file size of the target file from the write request and uses it to decide where the metadata of the target file will be stored.

In this way, once the file size is known, it determines whether the metadata of the target file is stored on the NameNode or in the target cluster. Storing metadata by category keeps the directory information on the NameNode simple and improves file retrieval efficiency.

S102: The NameNode determines the metadata of the target file.

The metadata of the target file includes the storage address of the target file.

Specifically, when a user uploads a file, the write request sent by the Client may also include the file name. The NameNode agent first uses the file name to check whether the target file already exists in the directory. If it does, the target file has already been written to HDFS. If it does not, the write is allowed, that is, the NameNode determines the metadata of the target file. To do so, after obtaining the file name and file size, the NameNode selects, based on the file size, from the free data blocks (blocks) on the DataNodes, and determines the storage address, the number of replicas, the block address list, the replica addresses of the blocks, and similar information for the target file. Together these make up the metadata of the target file. In practice, the metadata of a file in HDFS can be inspected through the FileStatus class.
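The check-then-allocate order described in S102 can be sketched as follows. This is a minimal illustration under stated assumptions, not the NameNode's actual implementation: the function name, the `known_files` set, and the representation of free blocks as `(datanode, block_id)` pairs are all hypothetical, and the sketch ignores HDFS's real replica placement policy.

```python
import math


def determine_metadata(file_name, file_size, known_files, free_blocks,
                       block_size, replication=3):
    """Return metadata for a new file, or None if the name already exists.

    All names are hypothetical. free_blocks is a list of
    (datanode, block_id) pairs that is consumed as blocks are allocated.
    """
    if file_name in known_files:
        return None  # the file has already been written

    # Number of data blocks needed, rounded up, with at least one block.
    n_blocks = max(1, math.ceil(file_size / block_size))
    needed = n_blocks * replication
    if needed > len(free_blocks):
        raise RuntimeError("not enough free blocks")

    # Take `replication` free blocks per data block: the first holds the
    # block itself, the remaining ones hold its replicas.
    groups = [free_blocks[i * replication:(i + 1) * replication]
              for i in range(n_blocks)]
    del free_blocks[:needed]
    known_files.add(file_name)

    return {
        "storage_address": groups[0][0],
        "replica_count": replication,
        "block_addresses": [g[0] for g in groups],
        "replica_addresses": [g[1:] for g in groups],
    }
```

A second call with the same file name returns None, which mirrors the case where the target file is already present in the directory.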

In this way, the NameNode obtains the storage address of the target file from the metadata it has determined and then stores the metadata by category, which keeps the directory information on the NameNode simple and improves file retrieval efficiency.

S103: When the file size is greater than or equal to the preset threshold, the NameNode stores the metadata of the target file locally.

The metadata of the target file further includes the size of the block storage space of the target file and/or the file size of the target file. In the related art, the metadata of the target file is stored in the format {storage address of the target file, number of replicas of the target file, block address list of the target file, replica addresses of the blocks of the target file}. The metadata of the target file can therefore be stored in the format {storage address of the target file, number of replicas of the target file, block address list of the target file, replica addresses of the blocks of the target file, size of the block storage space of the target file, file size of the target file}.
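The extended metadata format can be pictured as a record with six fields. The sketch below is illustrative only: the class name, field names, and the address strings such as "dn1:/blk_0" are assumptions, and the text allows either of the last two fields to be omitted.

```python
from dataclasses import asdict, dataclass
from typing import List

MB = 1024 * 1024


@dataclass
class FileMetadata:
    """Hypothetical record mirroring the extended metadata format."""
    storage_address: str
    replica_count: int
    block_addresses: List[str]
    replica_addresses: List[List[str]]
    block_size: int  # added field: size of the block storage space, in bytes
    file_size: int   # added field: size of the target file, in bytes


meta = FileMetadata(
    storage_address="dn1:/blk_0",
    replica_count=3,
    block_addresses=["dn1:/blk_0"],
    replica_addresses=[["dn2:/blk_0", "dn3:/blk_0"]],
    block_size=64 * MB,
    file_size=5 * MB,
)
record = asdict(meta)  # plain dict, e.g. for storing in a key-value store
```

Keeping the block size and file size in the record lets a later step compare the two without reading the file itself.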

It should be noted that the number of replicas is the number of backup copies of the target file. Each data block of the target file can have multiple replicas stored on different DataNodes; the default is generally 3. A replica address is the storage address of a replica of a data block, that is, the location on a DataNode where that replica is stored.

The NameNode acts as the master node, and each DataNode acts as a slave node of the NameNode and communicates with it continuously. When HDFS is initialized, each DataNode reports the data blocks it holds to the NameNode, so the NameNode knows where every data block and its replicas are stored. While storing data, a DataNode also periodically sends information about its local changes to the NameNode and receives instructions from the NameNode, which include creating, moving, or deleting data blocks on the local Linux disk.

In this way, the NameNode stores the metadata of files at or above the preset threshold locally and manages the metadata of larger files itself, so that the metadata of target files is stored by category.

S104: When the file size is smaller than the preset threshold, the NameNode stores the metadata of the target file in the target cluster.

In this way, the metadata of files smaller than the preset threshold is stored in the target cluster. Because the target cluster provides a query service, the metadata of these files can be queried there, which improves file retrieval efficiency.

Optionally, the target cluster may be a Redis cluster or another cluster that provides a fast query service; the embodiments of the present application do not specifically limit this.

When the target cluster is a Redis cluster, its high throughput allows it to serve fast queries for the metadata of target files smaller than the preset threshold, which greatly improves file retrieval efficiency.

Optionally, the preset threshold is the product of a threshold adjustment parameter and the size of the HDFS block storage space, where the threshold adjustment parameter is greater than 0 and less than 1.

For example, the threshold adjustment parameter can be denoted ratioSmallFile. When ratioSmallFile is 0.5 and the block storage space is 64 MB, the preset threshold is 32 MB. That is, when the file size is smaller than the preset threshold (ratioSmallFile * block = 0.5 * 64 MB = 32 MB), the target file is treated as a small file and its metadata is stored in the target cluster.

It should be noted that in HDFS a block stores the file content of the target file, and its size determines how much file content it can hold. Setting the preset threshold to the product of the threshold adjustment parameter and the size of the block storage space ties the classification of target files to the block size, so that the metadata of different target files is stored by category and file retrieval efficiency improves.

In this way, the preset threshold can be adjusted flexibly through the threshold adjustment parameter, so that large and small files can be distinguished flexibly when their metadata is stored by category.
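The threshold arithmetic above can be checked with a short sketch. Only ratioSmallFile and the 0.5 and 64 MB values come from the text; the function names are illustrative.

```python
MB = 1024 * 1024


def preset_threshold(ratio_small_file: float, block_size: int) -> float:
    """Preset threshold = threshold adjustment parameter * block size."""
    if not 0 < ratio_small_file < 1:
        raise ValueError("ratio_small_file must be greater than 0 and less than 1")
    return ratio_small_file * block_size


def is_small_file(file_size: int, ratio_small_file: float, block_size: int) -> bool:
    """A file is small when its size is strictly below the preset threshold."""
    return file_size < preset_threshold(ratio_small_file, block_size)
```

With ratioSmallFile = 0.5 and a 64 MB block, a file of exactly 32 MB is not a small file, because the local-storage branch uses "greater than or equal to".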

S105: The NameNode returns the storage address of the target file to the Client so that the file content of the target file can be stored.

It should be noted that the storage address returned by the NameNode to the Client contains the storage address of each data block.

Specifically, the NameNode returns to the Client the locations of the DataNodes on which the data blocks are to be stored and records log information while the target file is being written. For a given data block, the Client establishes a channel directly with one DataNode, which is the primary node for writing that block, and the Client can also write replicas of the block to other DataNodes. After the other DataNodes have received the block, they send a write-success message to the primary node, and the primary node reports to the Client that the block has been written in full. The remaining data blocks of the target file are handled in the same way and are not described again here.
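The per-block acknowledgement flow can be sketched as below. This is a simplified in-memory model, not the HDFS data transfer protocol: the DataNode class, its store method, and write_block are hypothetical names, and the acknowledgements are plain return values rather than network messages.

```python
class DataNode:
    """Hypothetical in-memory stand-in for a DataNode."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def store(self, block_id, data):
        self.blocks[block_id] = data
        return True  # acknowledgement that this copy was written


def write_block(block_id, data, primary, replicas):
    """Write one block to the primary node and to each replica node.

    Returns the status the primary would report to the Client: True only
    when the primary copy and every replica copy were acknowledged.
    """
    primary_ok = primary.store(block_id, data)
    acks = [node.store(block_id, data) for node in replicas]
    return primary_ok and all(acks)


dn1, dn2, dn3 = DataNode("dn1"), DataNode("dn2"), DataNode("dn3")
status = write_block("blk_0", b"abc", primary=dn1, replicas=[dn2, dn3])
```

The Client would repeat the same call for each remaining data block of the target file.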

To store the target file, the Client, after receiving the storage address of the target file, writes the target file to that address, and the DataNode at that address returns a write-success message to the Client. This completes the storage of the target file.

In addition, after all data blocks of the target file have been written successfully, the Client returns a write-success message to the NameNode.

In this way, the file content of the target file is written to the storage address contained in the metadata determined by the NameNode, so that the target file is successfully written to HDFS.

In the embodiments of the present application, after receiving the write request for the target file from the Client, the NameNode checks the file size, keeps the metadata of target files at or above the preset threshold locally, and stores the metadata of target files below the preset threshold in the target cluster. Because the metadata of large and small files is stored separately, the directory information on the NameNode is simpler, so the directory information of a file can be found quickly at retrieval time, which improves the stability of HDFS and the file retrieval rate.
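Steps S101 to S105 can be summarized in a small routing sketch. It is an illustration under assumptions: the class and method names are hypothetical, the storage address is passed in rather than allocated, and a plain dict stands in for the target cluster (for example a Redis cluster) so that the example runs without external services.

```python
class NameNodeSketch:
    """Hypothetical model of the metadata routing in S101 to S105."""

    def __init__(self, threshold, target_cluster):
        self.threshold = threshold
        self.local_metadata = {}              # metadata kept on the NameNode
        self.target_cluster = target_cluster  # dict stand-in for the cluster

    def handle_write(self, file_name, file_size, storage_address):
        # S101: the write request carries the file size.
        # S102: determine the metadata (reduced to two fields here).
        metadata = {"storage_address": storage_address, "file_size": file_size}
        if file_size >= self.threshold:
            # S103: large file, metadata stays on the NameNode.
            self.local_metadata[file_name] = metadata
        else:
            # S104: small file, metadata goes to the target cluster.
            self.target_cluster[file_name] = metadata
        # S105: return the storage address to the Client.
        return metadata["storage_address"]


MB = 1024 * 1024
cluster = {}
namenode = NameNodeSketch(threshold=32 * MB, target_cluster=cluster)
big_addr = namenode.handle_write("video.mp4", 200 * MB, "dn1:/blk_7")
small_addr = namenode.handle_write("photo.jpg", 2 * MB, "dn2:/blk_9")
```

After these two calls the NameNode holds only the entry for video.mp4, while the entry for photo.jpg lives in the cluster stand-in.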

Based on the foregoing embodiments, the present application provides another optional file storage method. FIG. 2 is a schematic flowchart of another optional file storage method provided by an embodiment of the present application. Referring to FIG. 2, the method may include:

S201: The NameNode receives a read request for a target file sent by the Client.

The read request carries the file size and file name of the target file.

Specifically, the Client sends a read request to the NameNode agent deployed on the NameNode to find the storage addresses of the data blocks that make up the target file to be read.

S202: When the file size is greater than or equal to the preset threshold, the NameNode looks up the metadata corresponding to the file name in its locally stored metadata and obtains the storage address of the target file from that metadata.

S203: When the file size is smaller than the preset threshold, the NameNode sends the read request to the Redis cluster, so that the Redis cluster looks up the metadata corresponding to the file name in the metadata it stores, obtains the storage address of the target file from that metadata, and sends the storage address to the NameNode.

Specifically, when reading a file, the user can call DFSDataInputStream.read directly, or read the metadata of the target file through FileSystem-related calls. The storage addresses of the data blocks that make up the target file can be obtained from that metadata.

S204: The NameNode returns the storage address of the target file to the Client so that the file content of the target file can be read.

Specifically, the NameNode returns the storage address of each data block of the target file to the Client. For each data block, the Client receives the addresses of the DataNodes that hold the block and its replicas, and can request the file content from the corresponding DataNode address. The Client thus reads the file content of the target file from the corresponding DataNodes, which completes the read.

In the embodiments of the present application, after receiving the file read request from the Client, the NameNode checks the size of the file. For a target file at or above the preset threshold, the metadata is looked up locally; for a target file below the preset threshold, the metadata is looked up in the Redis cluster. The storage address of the target file is obtained from the metadata found and returned to the Client. Looking up the metadata of large and small files in separate places makes metadata lookup more efficient. In addition, the Redis cluster provides a fast query service, so querying the directory information of a small file does not require searching the large and complex directory of the NameNode, which effectively improves the stability of HDFS and the file retrieval rate.
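The lookup in S201 to S204 can be sketched in the same style. The function name and the dict-based stores are assumptions; in particular, the dict named target_cluster only stands in for the Redis cluster and no Redis client API is used.

```python
def handle_read(file_name, file_size, threshold, local_metadata, target_cluster):
    """Return the storage address of a file, choosing the store by size.

    Hypothetical sketch: local_metadata and target_cluster are dicts that
    map file names to metadata records containing "storage_address".
    """
    # S202 / S203: pick the metadata store from the file size in the request.
    store = local_metadata if file_size >= threshold else target_cluster
    metadata = store.get(file_name)
    if metadata is None:
        raise FileNotFoundError(file_name)
    # S204: the storage address is returned to the Client.
    return metadata["storage_address"]


MB = 1024 * 1024
local = {"video.mp4": {"storage_address": "dn1:/blk_7"}}
cluster = {"photo.jpg": {"storage_address": "dn2:/blk_9"}}
addr = handle_read("photo.jpg", 2 * MB, 32 * MB, local, cluster)
```

Only one store is consulted per request, which is why a small-file lookup never touches the NameNode's own directory in this model.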

To address the problem that all stored files become small files after the size of the block storage space of HDFS is increased, the present application provides yet another optional file storage method, in which HDFS further includes Spark. FIG. 3 is a schematic flowchart of yet another optional file storage method provided by an embodiment of the present application. Referring to FIG. 3, the method may further include:

S301: When the block storage space of Hadoop changes and no task to be merged is being executed, the search engine of Hadoop scans the metadata of the files stored in the NameNode and in the Redis cluster, and determines the tasks to be merged according to the metadata of the stored files;

S302: Execute the tasks to be merged according to a preset small-file merging algorithm to obtain a new file, determine the new file as the target file, and send it to the Client, so as to return to the step in which the NameNode receives the write request for the target file sent by the Client.

It should be noted that the search engine may be Spark, Flink, or MapReduce, which is not specifically limited in this embodiment of the present application.

Specifically, Spark performs steps S201 to S204 to read the data of the stored files from the NameNode and the Redis cluster. Because the size of the block storage space has changed, the files to be merged must be determined from the stored files according to the size of the block storage space and the file sizes of the stored files.

Optionally, HDFS further includes Zookeeper for monitoring file merging. A merge task identifier can be set in Zookeeper, which records the time at which the merge task started and whether a merge is in progress.

Specifically, the metadata of the files stored in the NameNode and the Redis cluster is scanned, and the files are then merged, only when the size of the block storage space has changed and the data recorded by Zookeeper shows that no task to be merged is being executed.

For example, a merge task identifier can be set in Zookeeper, with the parameter hdfsMergeStatus indicating that a merge is in progress and the parameter hdfsMergeStart indicating the time at which the merge task started. Suppose the size of the block storage space is changed from 64 MB to 128 MB and then from 128 MB to 256 MB. The merge task identifier is obtained from Zookeeper to determine whether the merge task for the change from 64 MB to 128 MB is still in progress. If so, the merge task for the change from 128 MB to 256 MB is executed after the merge task for the change from 64 MB to 128 MB has completed; if not, the merge task for the change from 128 MB to 256 MB is executed directly.

In this way, the file merging step is executed only when the size of the block storage space has changed and the data recorded by Zookeeper shows that no task to be merged is being executed, which prevents an unfinished merge task from being overwritten by a new merge task.
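The guard described above can be sketched as follows. This is only an illustration under stated assumptions: a plain dictionary stands in for the Zookeeper node, the status values `"merging"` and `"idle"` are hypothetical, and a real deployment would need the check and the update to be atomic on the coordination service.

```python
import time
from typing import Optional


def try_start_merge(coord: dict, now: Optional[float] = None) -> bool:
    """Start a merge only if none is in progress; record its start time."""
    if coord.get("hdfsMergeStatus") == "merging":
        return False  # an earlier merge task is still running, so do not overwrite it
    coord["hdfsMergeStatus"] = "merging"
    coord["hdfsMergeStart"] = time.time() if now is None else now
    return True


def finish_merge(coord: dict) -> None:
    """Mark the running merge as finished so that the next one may start."""
    coord["hdfsMergeStatus"] = "idle"
```

In the 64 MB to 128 MB to 256 MB example, the second block-size change would call `try_start_merge` and receive `False` until `finish_merge` has been called for the first merge.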

It should be noted that, after Spark determines the new file as the target file, steps S101 to S105 are performed to write the file back into HDFS.

In addition, the preset threshold is set to the product of a threshold adjustment parameter and the size of the block storage space, so that the preset threshold changes whenever the size of the block storage space changes. The file storage system can thus determine the preset threshold from the size of the block storage space, and the classified storage of the metadata of the target file adapts to the new size of the block storage space. In other words, by adjusting the preset threshold, the file storage system classifies and stores the metadata of the target file in a way that follows the changed size of the block storage space, which resolves more intelligently the mismatch in classifying target files that a change in the size of the block storage space would otherwise cause.
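As a small worked illustration of the threshold rule, the sketch below computes the preset threshold as the product of the threshold adjustment parameter and the block size. The function names are hypothetical, and the restriction of the parameter to the open interval (0, 1) follows the ratioSmallFile example given elsewhere in this description.

```python
MB = 1024 * 1024


def preset_threshold(ratio_small_file: float, block_size: int) -> float:
    """Preset threshold = threshold adjustment parameter * block storage space size."""
    if not 0 < ratio_small_file < 1:
        raise ValueError("threshold adjustment parameter must lie in (0, 1)")
    return ratio_small_file * block_size


def is_small_file(file_size: int, ratio_small_file: float, block_size: int) -> bool:
    """A file is small when its size is strictly below the preset threshold."""
    return file_size < preset_threshold(ratio_small_file, block_size)
```

With a parameter of 0.5, a 40 MB file is a large file under a 64 MB block size (threshold 32 MB) but becomes a small file once the block size is raised to 128 MB (threshold 64 MB), which is the adaptation the paragraph describes.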

In this way, executing the tasks to be merged according to the preset small-file merging algorithm makes the fullest use of the computing resources of the file storage system. The new files obtained by merging are written back into HDFS, so that after the size of the block storage space is increased, the files already stored in HDFS can be reorganized to avoid wasting memory.

In an optional embodiment, determining the tasks to be merged according to the metadata of the stored files includes:

reading the stored files according to the metadata of the stored files, and determining the stored files as files to be merged;

grouping the files to be merged to obtain grouped files to be merged;

determining the grouped files to be merged as the tasks to be merged.

The sum of the file sizes in each group of files to be merged is less than or equal to the size of the block storage space.

For example, when the size of the block storage space is changed from 64 MB to 128 MB, ten 12.8 MB files to be merged can be placed in one group, or a single 128 MB file to be merged that contains two 64 MB data blocks can form a group by itself, which is not specifically limited in this embodiment of the present application.

In this way, grouping files to be merged whose total size is less than or equal to the size of the block storage space allows the block storage space to be fully used.

Specifically, one group of files to be merged can be determined as one task to be merged.

In this way, the tasks to be merged are obtained by grouping the files to be merged, and executing a task to be merged combines one group of files to be merged into one new file, which improves the efficiency of file merging.

Determining the tasks to be merged by this grouping allows them to be executed according to the preset small-file merging algorithm, which makes the fullest use of the computing resources of the file storage system. The new files obtained by merging are written back into HDFS, so that after the size of the block storage space is increased, the files already stored in HDFS can be reorganized to avoid wasting memory.
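The description fixes only the constraint that each group's total size must not exceed the block size; it does not name a grouping algorithm. The sketch below uses first-fit decreasing as one possible choice, which is an assumption, and the function name `group_files` is hypothetical.

```python
from typing import List, Tuple


def group_files(files: List[Tuple[str, int]], block_size: int) -> List[List[str]]:
    """Group (name, size) pairs so that each group's total size is <= block_size."""
    groups: List[List[str]] = []
    used: List[int] = []  # running total size of each group
    # First-fit decreasing: place the largest files first.
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        if size > block_size:
            continue  # cannot satisfy the constraint; left out of this sketch
        for i, total in enumerate(used):
            if total + size <= block_size:
                groups[i].append(name)
                used[i] += size
                break
        else:
            groups.append([name])
            used.append(size)
    return groups
```

Each returned group corresponds to one task to be merged. A file whose size already equals the block size ends up in a group of its own, which matches the 128 MB example above.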

In this embodiment of the present application, after the size of the block storage space changes, the files to be merged are determined by scanning the metadata of the files stored in the NameNode and the Redis cluster, and the files stored in HDFS are merged and written back. The stored files are thereby reorganized, which improves the utilization of the storage space.

Based on the foregoing embodiments, in other embodiments of the present application, S304 may include:

obtaining the list of tasks to be merged, the actual queue size of the execution queue, the estimated queue size of the execution queue, and the current performance parameters of the file storage system;

where the initial values of the actual queue size and the estimated queue size are equal, and the list of tasks to be merged is formed from the tasks to be merged;

when the number of tasks in the list of tasks to be merged is greater than 0, calculating the resource occupancy rate of the file storage system according to the current performance parameters, where the resource occupancy rate is the resource occupancy rate of the file storage system when processing data;

adjusting the estimated queue size according to the preset interval into which the resource occupancy rate falls;

when the estimated queue size is greater than the actual queue size, updating the actual queue size to the estimated queue size, adding tasks to be merged from the list of tasks to be merged to the execution queue so that the tasks in the queue are executed, and updating the number of tasks in the list of tasks to be merged;

when the estimated queue size is less than or equal to the actual queue size, returning to the step of adjusting the estimated queue size according to the preset interval into which the resource occupancy rate falls;

when the number of tasks in the list of tasks to be merged equals 0, ending.

It should be noted that the current performance parameters of the file storage system include the current central processing unit (CPU) occupancy rate, the current memory occupancy rate, and the current bandwidth occupancy rate.

For example, the resource occupancy rate can be denoted Temp. The initial values of the CPU occupancy rate PC, the memory occupancy rate Pm, and the bandwidth occupancy rate Pb are obtained, together with the current CPU occupancy rate PC1, the current memory occupancy rate Pm1, and the current bandwidth occupancy rate Pb1, where PC, Pm, Pb, PC1, Pm1, and Pb1 are all between 0 and 100%. Temp can be calculated by formula (1);

When Temp falls in (-1, -0.25), that is, when the current CPU occupancy rate PC1, the current memory occupancy rate Pm1, and the current bandwidth occupancy rate Pb1 are lower than the values obtained at the previous adjustment, the file storage system is running under low load and its processing resources are not fully used, so the estimated queue size is increased;

When Temp falls in [-0.25, 0.25], the processing resources of the file storage system are currently fully used, so the estimated queue size does not need to be changed;

When Temp falls in (0.25, 1), that is, when the current CPU occupancy rate PC1, the current memory occupancy rate Pm1, and the current bandwidth occupancy rate Pb1 are higher than the values obtained at the previous adjustment, the file storage system is overloaded, so the estimated queue size is decreased;

After the estimated queue size has been adjusted according to the preset interval into which Temp falls, it is determined whether the estimated queue size is greater than the actual queue size;

If so, the value of the estimated queue size is assigned to the actual queue size, tasks to be merged from the list of tasks to be merged are added to the execution queue so that the tasks in the queue are executed, and the number of tasks in the list of tasks to be merged is reduced. The tasks to be merged are executed through the execution queue. Because the actual queue size of the execution queue depends on the current performance parameters, it changes as those parameters change and in turn determines how many tasks to be merged are executed. The actual queue size therefore matches the computing resources of the file storage system, which improves the utilization of the file storage system.

If not, the method returns to the step of adjusting the estimated queue size according to the preset interval into which the resource occupancy rate falls;

When all tasks in the list of tasks to be merged have been completed, file merging ends.

In this embodiment of the present application, the resource occupancy rate of the file storage system is calculated according to the current performance parameters, the estimated queue size is adjusted according to the resource occupancy rate, and the estimated queue size is compared with the actual queue size. The actual queue size is thus adjusted adaptively, and the computing resources of the file storage system can be fully used.
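The adaptive adjustment can be sketched as a small loop. Several points are assumptions rather than part of the description: Temp is supplied as a sequence of readings instead of being computed, the estimate grows or shrinks by a factor of 0.5, and each time the queue grows a batch of `int(actual)` tasks is moved into the execution queue. The names `adjust_estimated_size` and `schedule` are hypothetical.

```python
from typing import Iterable, List


def adjust_estimated_size(temp: float, estimated: float, step: float = 0.5) -> float:
    """Adjust the estimated queue size from the interval that Temp falls into."""
    if -1 < temp < -0.25:   # low load: increase the estimate
        return estimated * (1 + step)
    if 0.25 < temp < 1:     # overload: decrease the estimate
        return estimated * (1 - step)
    return estimated        # [-0.25, 0.25] (or out-of-range readings): unchanged


def schedule(tasks: List[str], temps: Iterable[float], initial_size: int) -> List[List[str]]:
    """Return the batches of tasks dispatched to the execution queue."""
    pending = list(tasks)
    actual = estimated = float(initial_size)  # both start from the same value
    batches: List[List[str]] = []
    for temp in temps:
        if not pending:      # task list is empty: end
            break
        estimated = adjust_estimated_size(temp, estimated)
        if estimated > actual:
            actual = estimated
            count = int(actual)  # assumed batch size for this sketch
            batch, pending = pending[:count], pending[count:]
            batches.append(batch)
        # otherwise fall through and re-evaluate with the next Temp reading
    return batches
```

Tasks are dispatched only on readings that push the estimate above the current actual size; readings in the neutral or overloaded intervals dispatch nothing, which is how the loop throttles merging under load.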

Based on the foregoing embodiments, in other embodiments of the present application, in S304, after the new file is obtained and before the new file is determined as the target file, the method may further include:

the search engine determines the files to be merged that correspond to the new file as target files and sends a deletion request for the metadata of the target files to the Client, which forwards the request to the NameNode;

where the deletion request carries the file size and the file name of the target file;

when the file size is greater than or equal to the preset threshold, the NameNode looks up the metadata corresponding to the file name in its locally stored metadata, returns that metadata to the Client, and deletes it;

when the file size is smaller than the preset threshold, the Redis cluster receives the deletion request forwarded by the NameNode, looks up the metadata corresponding to the file name in its locally stored metadata, returns that metadata to the Client, and deletes it;

the Client deletes the file content of the target file according to the storage address of the target file in the metadata corresponding to the file name.

It should be noted that the search engine sends the Client a deletion request for the files to be merged that correspond to the new file, and the deletion request contains the file size and the file name of each such file. If a file to be merged is greater than or equal to the preset threshold, the NameNode finds the locally stored metadata according to the file name and obtains the storage address of the file from that metadata. If a file to be merged is smaller than the preset threshold, the NameNode agent on the NameNode forwards the deletion request to the Redis cluster, which finds the corresponding metadata according to the file name and obtains the storage address of the file from that metadata. The NameNode agent receives the storage addresses sent by the Redis cluster and returns them to the Client together with the storage addresses obtained by the NameNode. The Client deletes the file content of the files to be merged according to the storage addresses, and the NameNode or the Redis cluster deletes their metadata.

In this embodiment of the present application, after the tasks in the list of tasks to be merged have been executed and the new file has been obtained, the metadata and file content of the files that were merged into the new file are deleted, which avoids storing redundant data and reduces the memory burden on HDFS.
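The clean-up after a merge can be sketched as below, again with dictionaries standing in for the NameNode's local metadata and the Redis cluster. The function name `delete_merged_sources` and the `"addresses"` field are hypothetical; the returned addresses are what the Client would use to delete the file content.

```python
from typing import Dict, List, Tuple


def delete_merged_sources(
    sources: List[Tuple[str, int]],     # (file name, file size) of each merged source file
    threshold: int,
    local_meta: Dict[str, dict],        # stand-in for NameNode metadata
    redis_meta: Dict[str, dict],        # stand-in for Redis cluster metadata
) -> List[str]:
    """Remove the metadata of merged source files and return their storage addresses."""
    addresses: List[str] = []
    for name, size in sources:
        # Route by size, exactly as for reads and writes.
        store = local_meta if size >= threshold else redis_meta
        meta = store.pop(name, None)    # look up and delete in one step
        if meta is not None:
            addresses.extend(meta["addresses"])
    return addresses
```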

Based on the foregoing embodiments, in other embodiments of the present application, the method further includes:

when a task to be merged fails and no new file is obtained, the search engine obtains the directory information of the files of the failed task from the metadata of those files;

the search engine generates a merge failure list according to the directory information of the files of the failed task and sends it to the Client.

Specifically, sending the merge failure list to the Client may consist of generating a prompt message and sending it to the user's terminal by e-mail or in another form, which is not specifically limited in this embodiment of the present application.

In this embodiment of the present application, the search engine notifies the Client of the files that failed to merge, so that those files can be handled promptly.

Based on the foregoing embodiments, in other embodiments of the present application, the method further includes:

recording the start time and end time of the task to be merged that corresponds to each generated new file.

It should be noted that this recording may be performed by Zookeeper in HDFS or by Redis, which is not specifically limited in this embodiment of the present application.

Zookeeper is a distributed application coordination service.

Understandably, recording the start time and end time of the task to be merged that corresponds to each new file makes it possible to know whether each task is still being executed or has already completed. This provides the basis for deciding whether to merge the files in the affected blocks again when the size of the block storage space changes.

This helps merge the files in blocks whose storage space has changed, so as to consolidate the idle storage space in the file storage system and improve the utilization of the file storage space.
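A minimal sketch of this bookkeeping is shown below, assuming a dictionary as a stand-in for the record kept in Zookeeper or Redis. The function names and the record layout are hypothetical.

```python
from typing import Dict, List, Optional

MergeLog = Dict[str, Dict[str, Optional[float]]]


def record_start(log: MergeLog, task_id: str, ts: float) -> None:
    """Record when a task to be merged started; the end time is not yet known."""
    log[task_id] = {"start": ts, "end": None}


def record_end(log: MergeLog, task_id: str, ts: float) -> None:
    """Record when a task to be merged finished."""
    log[task_id]["end"] = ts


def running_tasks(log: MergeLog) -> List[str]:
    """Tasks with a start time but no end time are still being executed."""
    return [task_id for task_id, rec in log.items() if rec["end"] is None]
```

A new block-size change would check `running_tasks(log)` and proceed with a fresh merge only when the result is empty.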

The file storage method in one or more of the above embodiments is described below by way of example.

FIG. 4 is a schematic diagram of an optional HDFS architecture provided by an embodiment of the present application. As shown in FIG. 4, HDFS includes a Client, a large-file processing unit, and a small-and-medium-file processing unit. The large-file processing unit includes a NameNode (NN) and a SecondaryNameNode (SNN), on which a NameNode agent (NN agent) and a SecondaryNameNode agent (SNN agent) are deployed, respectively. The small-and-medium-file processing unit includes a Redis cluster containing multiple nodes, such as node 1, node 2, and node 3 in FIG. 4, which communicate with one another.

FIG. 5 is a schematic flowchart of a first example of an optional file storage method provided by an embodiment of the present application. The method can be applied to the HDFS architecture of FIG. 4. As shown in FIG. 5, for file writing, the file storage method may include:

S501: The NameNode agent receives a write request for the target file sent by the Client;

The write request carries the file size of the target file, so that the NameNode agent can classify the file by size according to a preset small-file identification rule.

For example, the preset small-file identification rule may be as follows: a threshold adjustment parameter is added to the HDFS configuration file; the parameter takes a value in (0, 1) and can be denoted ratioSmallFile. When the size of the block storage space is 64 MB, the target file is judged to be a small file if the size of the file to be written is smaller than ratioSmallFile multiplied by the size of the block storage space; otherwise, it is judged to be a large file.

S502: When the target file is a large file, the NameNode agent sends the request to the NameNode, and the metadata of the target file is stored in the NameNode;

When the target file is a small file, the NameNode agent sends the request to the Redis cluster, and the metadata of the target file is stored in the Redis cluster.

S503: The NameNode agent receives the storage address returned by the NameNode or the Redis cluster and sends it to the Client;

The storage address refers to the node address information of the DataNodes that can be used to store the file, including the location of the primary storage node and the storage locations of the replicas.

S504: The Client writes the file to the corresponding DataNode according to the storage address and sends replicas to the corresponding DataNodes for backup;

It should be noted that, in S504, the Client may send replicas to multiple DataNodes.

S505: After a slave DataNode that backs up a replica has written the file, it sends a notification of successful file writing to the master DataNode, and the master DataNode sends a notification of successful file writing to the Client;

It should be noted that the way the master DataNode and the slave DataNodes write data blocks is the same as in S105 and is not repeated here.

S506: The Client notifies the NameNode agent that file writing is complete;

S507: If the target file is a large file, the NameNode agent notifies the NameNode that writing is complete; if it is a small file, the NameNode agent notifies the Redis cluster that writing is complete;

It should be noted that a DataNode cannot notify the NameNode or the Redis cluster that file writing is complete. Instead, the Client sends a message that the file has been written successfully to the NameNode agent, and the NameNode agent notifies the NameNode or the Redis cluster.

S508: The SecondaryNameNode sends a request for data to the NameNode;

S509: The NameNode returns the data to the SecondaryNameNode, and the SecondaryNameNode backs up the metadata of the NameNode.

What the SecondaryNameNode obtains is the metadata of the target file stored in the NameNode.

FIG. 6 is a schematic flowchart of a second example of an optional file storage method provided by an embodiment of the present application. The method can be applied to the HDFS architecture of FIG. 4. As shown in FIG. 6, for file reading, the file storage method may include:

S601: The Client sends a read request for the target file to the NameNode agent;

The read request carries the file size and the file name of the target file;

S602: The NameNode agent receives the read request for the target file. If the target file is a large file, the NameNode agent sends the read request to the NameNode; if it is a small file, the NameNode agent sends the read request to the Redis cluster;

The method for determining whether the target file is a large file or a small file is the same as in S501;

S603: The NameNode or the Redis cluster sends the storage address of the target file to the NameNode agent;

S604: The NameNode agent returns the storage address to the Client;

S605: The Client requests the data of the target file from the corresponding DataNode according to the storage address;

S606: The DataNode sends the data of the target file to the Client.

FIG. 7 is a schematic flowchart of a first example of an optional file merging method provided by an embodiment of the present application. The method can be applied to a file storage system that includes the HDFS architecture of FIG. 4. As shown in FIG. 7, for file merging, generating the list of files to be merged may include:

S701: Scan the metadata in the NameNode and Redis;

S702: Determine whether the file needs to be merged; if so, execute S703; if not, execute S704;

It should be noted that whether a target file needs to be merged can be determined from the size of the block storage space and the size of the target file.

S703: Add the file to the list of files to be merged;

Adding a file to the list means adding the target file that corresponds to the metadata stored in the NameNode or the Redis cluster to the list of files to be merged.

S704: Determine whether all metadata has been scanned; if so, execute S705; if not, return to S701;

It should be noted that, because the size of the block storage space has changed, all files stored in HDFS can be regarded as small files that need to be merged, so all metadata stored in HDFS must be scanned.

S705: Generate the list of files to be merged.

The list of files to be merged contains multiple files to be merged.
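Steps S701 to S705 can be sketched as a single scan over both metadata stores. The merge criterion used here, a file size no larger than the new block size, is an assumption, since the description says only that the decision depends on the block size and the file size. The `"size"` field and the function name are hypothetical.

```python
from typing import Dict, List


def build_merge_list(
    local_meta: Dict[str, dict],   # stand-in for metadata held by the NameNode
    redis_meta: Dict[str, dict],   # stand-in for metadata held by the Redis cluster
    block_size: int,
) -> List[str]:
    """Scan all metadata and collect the names of files that need merging."""
    merge_list: List[str] = []
    for store in (local_meta, redis_meta):       # S701: scan both stores
        for name, meta in store.items():
            if meta["size"] <= block_size:       # S702: assumed merge criterion
                merge_list.append(name)          # S703: add to the list
    return merge_list                            # S705: list is complete after the full scan
```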

FIG. 8 is a schematic flowchart of a second example of an optional file merging method provided by an embodiment of the present application. The method can be applied to a file storage system that includes the HDFS architecture of FIG. 4. As shown in FIG. 8, file merging may include:

S801: Obtain the list of files to be merged;

Here, the list of files to be merged generated in S705 is obtained, together with the files to be merged that it contains.

S802: Adaptively adjust the size of the merge queue according to cluster performance;

S803: Execute file merging and write the merging result to the target directory;

Writing the merging result to the target directory means determining the new merged file as the target file and writing the target file into HDFS according to steps S501 to S507.

S802~S803是通过S901~S907实现的,图9为本申请实施例提供的一种可选的执行文件合并的流程示意图,该方法可以应用于包含图4的HDFS架构的文件存储系统中,如图9所示,S802~S803可以包括:S802~S803 are realized through S901~S907. FIG. 9 is a schematic flow diagram of an optional execution file merging provided by the embodiment of the present application. This method can be applied to a file storage system including the HDFS architecture in FIG. 4, such as As shown in Figure 9, S802-S803 may include:

S901: The execution queue is LN, with its initial queue size set to N; the list of tasks to be merged is list; obtain the initial values of the CPU usage, memory usage, and bandwidth usage;

Here, LN is the execution queue used for file merging. Setting its initial queue size to N means setting the initial value of the estimated size of the execution queue to N. The initial values obtained for the CPU usage, memory usage, and bandwidth usage are denoted P_C, P_m, and P_b, respectively.

It should be noted that the list of files to be merged is generated in S705. Files to be merged whose sizes sum to the size of the block storage space are placed in one group, each group is treated as one task to be merged, and the resulting tasks form the list of tasks to be merged.

S902: Determine whether list.size equals 0; if yes, end; if no, execute S903;

Here, determining whether list.size equals 0 means determining whether any tasks to be merged remain in the list of tasks to be merged.

S903: Obtain the current cluster CPU usage, current memory usage, and current bandwidth usage;

Here, the current cluster CPU usage, current memory usage, and current bandwidth usage are denoted P_C1, P_m1, and P_b1, respectively.

S904: Calculate the maximum resource occupancy rate Temp;

Here, Temp is calculated using formula (1) above.

S905 may include S9051~S9055:

S9051: Determine whether Temp < -0.5 holds; if yes, let N = N + 0.5N, P_C = P_C1, P_m = P_m1, P_b = P_b1; if no, execute S9052;

S9052: Determine whether Temp < -0.25 and Temp ≥ -0.5 hold; if yes, let N = N + 0.25N, P_C = P_C1, P_m = P_m1, P_b = P_b1; if no, execute S9053;

S9053: Determine whether Temp > 0.25 and Temp < 0.5 hold; if yes, let N = min(N*0.75, 1), P_C = P_C1, P_m = P_m1, P_b = P_b1; if no, execute S9054;

S9054: Determine whether Temp ≥ 0.5 holds; if yes, let N = min(N*0.5, 1), P_C = P_C1, P_m = P_m1, P_b = P_b1; if no, execute S9055;

S9055: Let P_C = P_C1, P_m = P_m1, P_b = P_b1;

Here, S9051~S9055 adjust the value of N; that is, they update the estimated queue size.
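The interval test in S9051~S9055 can be sketched as a single function. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `adjust_queue_size` is hypothetical, Temp is taken as an input because formula (1) is defined elsewhere, and the shrinking branches use `max(..., 1)` on the assumption that a lower bound of 1 was intended, whereas the text above writes `min(..., 1)`. Copying the current usage values into P_C, P_m, and P_b is left out.

```python
def adjust_queue_size(n: float, temp: float) -> float:
    """Return the updated estimated queue size N for a given Temp."""
    if temp < -0.5:               # S9051: grow N by 50%
        return n + 0.5 * n
    if -0.5 <= temp < -0.25:      # S9052: grow N by 25%
        return n + 0.25 * n
    if 0.25 < temp < 0.5:         # S9053: shrink N to 75%
        return max(n * 0.75, 1)   # assumption: floor of 1 (the text writes min)
    if temp >= 0.5:               # S9054: shrink N to 50%
        return max(n * 0.5, 1)    # assumption: floor of 1 (the text writes min)
    return n                      # S9055: N is left unchanged


new_n = adjust_queue_size(10, -0.6)  # falls in the S9051 interval
```

Values of Temp between -0.25 and 0.25 inclusive match none of the four tests, so N is returned unchanged, which corresponds to S9055.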

S906: Determine whether LN.size < N; if yes, execute S907; if no, suspend the task using the sleep(t) function;

Here, LN.size is the actual size of the execution queue. Determining whether LN.size < N means determining whether the actual size of the execution queue is smaller than the estimated queue size.

S907: Update LN.size to N, subtract m from list.size, and return to S902.

It should be noted that the updated estimated queue size N is assigned to the actual size of the execution queue, LN.size, and the execution queue executes merge tasks according to the value of N. m tasks to be merged are moved from the list of tasks to be merged into the execution queue, and the size of the list of tasks to be merged is reduced by m.

For example, suppose LN.size is 5 and N is 10. Then 10 is assigned to LN.size. If LN has already executed 7 merge tasks, then m = 10 - 7 = 3, so 3 tasks to be merged are added from the list of tasks to be merged to LN and, at the same time, removed from that list.
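The refill arithmetic in this example can be sketched as follows. This is a hedged illustration only: the function name `refill` and the parameter `occupied` (the 7 tasks in the example) are hypothetical, and clamping m to the number of tasks still pending is an assumption not stated in the text.

```python
from collections import deque


def refill(occupied: int, n: int, pending: deque) -> list:
    """Move m = n - occupied tasks from the pending list to the execution queue.

    m is clamped to [0, len(pending)] (assumption) so the pending list
    is never over-drained.
    """
    m = max(0, min(n - occupied, len(pending)))
    return [pending.popleft() for _ in range(m)]


pending_tasks = deque(["task0", "task1", "task2", "task3", "task4"])
moved = refill(occupied=7, n=10, pending=pending_tasks)  # m = 10 - 7 = 3
```

After the call, `moved` holds the three tasks handed to the execution queue and `pending_tasks` holds the two that remain, matching "list.size minus m".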

S804: Determine whether the merge succeeded; if yes, execute S805; if no, execute S806;

Here, recording the directory in the failure list means generating a failure list from the directory information of the files used for the merge.

S805: Delete the source file directory and move the result file to the source file directory;

Here, the source files are the original files corresponding to the new file, and the result file is the new merged file.

S806: Determine whether the merge list is empty; if yes, execute S808; if no, return to S802;

S808: Output the failure list.

Here, outputting the failure list means outputting the failure list obtained in S806 to the user's client, that is, informing the user that the tasks in the list failed to merge.

Based on the same inventive concept, an embodiment of the present application provides a file storage system that includes a NameNode, a Client, and a target cluster. FIG. 10 is a schematic structural diagram of an optional file storage device provided by an embodiment of the present application. As shown in FIG. 10, the device includes:

a receiving module 101, configured to receive a write request for a target file sent by the Client, wherein the write request carries the file size of the target file;

a determining module 102, configured to determine the metadata of the target file, wherein the metadata of the target file includes the storage address of the target file;

a first storage module 103, configured to store the metadata of the target file locally when the file size is greater than or equal to a preset threshold;

a second storage module 104, configured to store the metadata of the target file in the target cluster when the file size is less than the preset threshold;

a sending module 105, configured to return the storage address of the target file to the Client, so that the file content of the target file can be stored.
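The cooperation of modules 101 to 105 on a write request can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class name `MetadataRouter`, the two dictionaries standing in for the NameNode's local metadata and for the target cluster, and the metadata field names are all hypothetical.

```python
class MetadataRouter:
    """Toy stand-in for size-based metadata routing (all names hypothetical)."""

    def __init__(self, threshold: int):
        self.threshold = threshold   # preset threshold, in bytes
        self.local_store = {}        # stands in for NameNode-local metadata
        self.cluster_store = {}      # stands in for the target cluster

    def handle_write(self, name: str, size: int, address: str) -> str:
        """Store the metadata according to file size; return the storage address."""
        metadata = {"address": address, "size": size}
        if size >= self.threshold:
            self.local_store[name] = metadata    # first storage module 103
        else:
            self.cluster_store[name] = metadata  # second storage module 104
        return address                           # sending module 105


router = MetadataRouter(threshold=64)
router.handle_write("big.dat", 100, "dn1:/blk_1")
router.handle_write("small.txt", 10, "dn2:/blk_2")
```

A file whose size equals the threshold is routed to local storage, because the first branch uses "greater than or equal to".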

Optionally, the target cluster is a Redis cluster.

In other embodiments of the present application, the device is further configured to:

receive a read request for the target file sent by the Client, wherein the read request carries the file size and file name of the target file;

when the file size is greater than or equal to the preset threshold, look up the metadata corresponding to the file name in the locally stored metadata, and obtain the storage address of the target file from that metadata;

when the file size is less than the preset threshold, send the read request to the Redis cluster, so that the Redis cluster looks up the metadata corresponding to the file name in its locally stored metadata, obtains the storage address of the target file from that metadata, and sends the storage address of the target file to the NameNode;

return the storage address of the target file to the Client, so that the file content of the target file can be read.
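The read path mirrors the write path: the file size carried in the request decides which metadata store is consulted. The sketch below is illustrative only; the function name `lookup_address`, the dictionaries standing in for the NameNode's local metadata and the Redis cluster, and the `"address"` field are hypothetical.

```python
from typing import Optional


def lookup_address(name: str, size: int, threshold: int,
                   local_store: dict, cluster_store: dict) -> Optional[str]:
    """Return the storage address for `name`, or None if no metadata is found."""
    # Files at or above the threshold have metadata on the NameNode;
    # smaller files have metadata in the cluster.
    store = local_store if size >= threshold else cluster_store
    metadata = store.get(name)
    return None if metadata is None else metadata["address"]


local = {"big.dat": {"address": "dn1:/blk_1"}}
cluster = {"small.txt": {"address": "dn2:/blk_2"}}
big_address = lookup_address("big.dat", 100, 64, local, cluster)
small_address = lookup_address("small.txt", 10, 64, local, cluster)
```

Because the request carries the file size, only one store is queried per read; the sketch does not fall back to the other store on a miss.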

Optionally, the preset threshold is the product of a threshold adjustment parameter and the size of the storage space of a Hadoop block, wherein the threshold adjustment parameter is greater than 0 and less than 1.
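As a worked instance of this product, write T for the preset threshold, α for the threshold adjustment parameter, and B for the block size; these symbols are introduced here for illustration only. The values α = 0.5 and B = 128 MiB are assumed (128 MiB is a common HDFS default block size); the text does not fix either value.

```latex
\[
  T = \alpha \cdot B, \qquad 0 < \alpha < 1
\]
\[
  \alpha = 0.5,\quad B = 128\ \text{MiB}
  \;\Longrightarrow\;
  T = 0.5 \times 128\ \text{MiB} = 64\ \text{MiB}
\]
```

Under these assumed values, the metadata of a 10 MiB file would be stored in the target cluster, while the metadata of a 100 MiB file would be stored locally by the NameNode.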

Optionally, the metadata of the target file further includes the size of the block storage space of the target file and/or the file size of the target file.

In other embodiments of the present application, the device is further configured to:

when the storage space of the Hadoop block changes and no task to be merged is currently being executed, have the Hadoop search engine scan the metadata of the files stored in the NameNode and in the Redis cluster, respectively, and determine the tasks to be merged according to the metadata of the stored files;

have the search engine execute the tasks to be merged according to a preset small-file merging algorithm to obtain a new file, determine the new file as the target file, and send it to the Client, so as to return to the step in which the NameNode receives the write request for the target file sent by the Client.

In other embodiments of the present application, the device determining the tasks to be merged according to the metadata of the stored files includes:

reading the stored file content according to the metadata of the stored files, and determining the stored files as files to be merged;

grouping the files to be merged to obtain grouped files to be merged, wherein the sum of the file sizes in each group of files to be merged is less than or equal to the size of the storage space of the block;

determining the grouped files to be merged as the tasks to be merged.
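One way to satisfy the grouping constraint (each group's total size no larger than the block size) is a greedy first-fit-decreasing pass. The text does not say which grouping strategy is used, so this choice, the function name `group_files`, and the assumption that no single file exceeds the block size are all illustrative.

```python
def group_files(files: dict, block_size: int) -> list:
    """Group file names so that each group's total size is <= block_size.

    `files` maps file name -> size. Uses first-fit decreasing (assumption).
    """
    groups = []  # each entry: [remaining_capacity, [file names]]
    # Place larger files first so they claim fresh groups early.
    for name, size in sorted(files.items(), key=lambda kv: kv[1], reverse=True):
        if size > block_size:
            raise ValueError(f"{name} is larger than one block")
        for group in groups:
            if group[0] >= size:          # first group with enough room
                group[0] -= size
                group[1].append(name)
                break
        else:
            groups.append([block_size - size, [name]])
    return [names for _, names in groups]


sizes = {"a": 60, "b": 50, "c": 40, "d": 30, "e": 20}
tasks = group_files(sizes, block_size=100)
```

Each inner list in `tasks` corresponds to one task to be merged; with the sizes above, both groups fill a 100-unit block exactly.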

In other embodiments of the present application, the device executing the tasks to be merged according to the preset small-file merging algorithm to obtain a new file includes:

obtaining the list of tasks to be merged, the actual queue size of the execution queue, the estimated queue size of the execution queue, and the current performance parameters of the file storage system, wherein the initial values of the actual queue size and the estimated queue size are equal, and the list of tasks to be merged is formed from the tasks to be merged;

when the number of tasks in the list of tasks to be merged is greater than 0, calculating the resource occupancy rate of the file storage system according to the current performance parameters, wherein the resource occupancy rate is the resource occupancy rate of the file storage system when processing data;

adjusting the estimated queue size according to the preset interval into which the resource occupancy rate falls;

when the estimated queue size is greater than the actual queue size, updating the actual queue size to the estimated queue size, adding tasks to be merged from the list of tasks to be merged to the execution queue so that the tasks in the queue are executed, and updating the number of tasks in the list of tasks to be merged;

when the estimated queue size is less than or equal to the actual queue size, returning to the step of adjusting the estimated queue size according to the preset interval into which the resource occupancy rate falls;

when the number of tasks in the list of tasks to be merged equals 0, ending.

In other embodiments of the present application, the device is further configured to:

after the new file is obtained and before the new file is determined as the target file, determine the files to be merged that correspond to the new file as the target file, send a deletion request for the metadata of the target file to the Client, and have the request forwarded to the NameNode, wherein the deletion request carries the file size and file name of the target file;

when the file size is greater than or equal to the preset threshold, return the metadata corresponding to the file name found in the locally stored metadata to the Client, and delete the metadata corresponding to the file name;

when the file size is less than the preset threshold, receive the deletion request forwarded by the NameNode, return the metadata corresponding to the file name found in the locally stored metadata to the Client, and delete the metadata corresponding to the file name;

delete the file content of the target file according to the storage address of the target file in the metadata corresponding to the file name.
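The deletion steps above can be sketched in the same style. This is an illustration under stated assumptions: the function name `delete_file`, the dictionaries standing in for the two metadata stores, and the `contents` dictionary standing in for the stored file content keyed by storage address are all hypothetical.

```python
from typing import Optional


def delete_file(name: str, size: int, threshold: int,
                local_store: dict, cluster_store: dict,
                contents: dict) -> Optional[dict]:
    """Remove the metadata for `name`, then the content at its storage address.

    Returns the removed metadata (what would be returned to the Client),
    or None if no metadata was found.
    """
    store = local_store if size >= threshold else cluster_store
    metadata = store.pop(name, None)         # look up and delete the metadata
    if metadata is None:
        return None
    contents.pop(metadata["address"], None)  # delete the file content
    return metadata


local_meta = {}
cluster_meta = {"small.txt": {"address": "dn2:/blk_2"}}
stored = {"dn2:/blk_2": b"hello"}
removed = delete_file("small.txt", 10, 64, local_meta, cluster_meta, stored)
```

The metadata is read before it is deleted because the storage address it carries is needed to locate the file content.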

In other embodiments of the present application, the device is further configured to:

record the start time and end time of the task to be merged that corresponds to the generated new file.

In other embodiments of the present application, the device is further configured to:

when a task to be merged fails and no new file is obtained, obtain the directory information of the files of the failed merge task from the metadata of the files in that task;

generate a merge failure list according to the directory information of the files of the failed merge task, and send it to the Client.

FIG. 11 is a schematic structural diagram of an optional file storage system provided by an embodiment of the present application. As shown in FIG. 11, an embodiment of the present application provides a file storage system 1100. The file storage system 1100 is connected to a client and includes:

a processor 111 and a storage medium 112 storing instructions executable by the processor 111. The storage medium 112 depends on the processor 111 to perform operations via a communication bus 113. When the instructions are executed by the processor 111, the file storage method described in one or more of the above embodiments is performed.

It should be noted that, in practical applications, the components in the terminal are coupled together through the communication bus 113. It can be understood that the communication bus 113 is used to implement connection and communication between these components. In addition to a data bus, the communication bus 113 also includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as the communication bus 113 in FIG. 11.

An embodiment of the present application provides a computer storage medium storing executable instructions. When the executable instructions are executed by one or more processors, the processors perform the file storage method described in one or more of the above embodiments.

The computer-readable storage medium may be a memory such as a ferromagnetic random access memory (FRAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM).

Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

The above are merely preferred embodiments of the present application and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of protection of the present application.

Claims (14)

Application Number | Priority Date | Filing Date | Title | Status | Publication
CN202210188324.7A | 2022-02-28 | 2022-02-28 | File storage method, device and system and computer readable storage medium | Pending | CN116700596A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210188324.7A (CN116700596A (en)) | 2022-02-28 | 2022-02-28 | File storage method, device and system and computer readable storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210188324.7A (CN116700596A (en)) | 2022-02-28 | 2022-02-28 | File storage method, device and system and computer readable storage medium

Publications (1)

Publication Number | Publication Date
CN116700596A (en) | 2023-09-05

Family

ID=87834437

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN202210188324.7A | Pending | CN116700596A (en) | 2022-02-28 | 2022-02-28 | File storage method, device and system and computer readable storage medium

Country Status (1)

Country | Link
CN (1) | CN116700596A (en)


Legal Events

Date | Code | Title | Description
 | PB01 | Publication | Publication
 | SE01 | Entry into force of request for substantive examination | Entry into force of request for substantive examination
