CN103020315B

Info

Publication number: CN103020315B
Application number: CN201310009182.4A
Authority: CN
Inventors: 王蕾; 何连跃; 徐叶; 李姗姗; 戴华东; 吴庆波; 丁滟; 黄辰林; 付松龄
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2013-01-10
Filing date: 2013-01-10
Publication date: 2015-08-19
Anticipated expiration: 2033-01-10
Also published as: CN103020315A

Abstract

The invention discloses a method for storing massive numbers of small files on a master-slave distributed file system, addressing the problems that arise when such a file system stores massive numbers of small files. The technical solution first deploys and initializes a massive small file storage system; the client's SmallFileAPI then creates or reads small files according to instructions received from the keyboard. When creating a small file, SmallFileAPI creates the data file for the small file according to the small file path obtained from the client, writes the small file's data into it, and creates the small file's index on a data node. When reading a small file, the system uses the small file path to obtain the data nodes assigned to its parent directory, sends an index request to any one of those data nodes, and finally reads the small file's data from the data file according to the returned index information. The invention resolves the problem of oversized metadata in massive small file storage, improves the write efficiency of the massive small file storage system, and preserves system reliability.

Description

Translated from Chinese

A Massive Small File Storage Method Based on a Master-Slave Distributed File System

Technical Field

The invention relates to a method for storing massive numbers of small files on a master-slave distributed file system designed for storing massive numbers of large files.

Background

With the development of new computing technologies, both enterprise and personal data have begun to grow rapidly. This growth raises not only storage capacity problems but also challenges for data management and storage performance, which have become core problems of the cloud computing era. To keep data highly available, reliable, and economical, cloud computing stores data in a distributed manner and relies on redundant storage for reliability. To serve large numbers of users, cloud storage technology must offer high throughput and high transfer rates. Industry and academia have proposed many solutions to the cloud data storage problem, including the Google File System (GFS), the open-source Hadoop file system HDFS, and NoSQL storage systems for semi-structured and structured data such as Dynamo, Cassandra, and MongoDB.

In the early days of cloud computing, storage systems were designed mainly for efficient storage of and access to massive numbers of large files, and support for small files was weak. With the spread of personal terminals and the mobile Internet, however, small files make up an ever larger share of cloud storage, and storing and accessing them efficiently has become an urgent problem. A small file here means a file whose size ranges from a few KB to tens of KB. For example, Taobao must store a huge number of product images, all of which are small files; search engines such as Google and Baidu crawl up to hundreds of millions of web pages, which are also small files. Small files on the scale of trillions constitute massive small files. If they cannot be stored efficiently, applications built on them either cannot be implemented or cannot meet customer requirements. The invention mainly addresses the efficient storage of massive small files.

The architecture of a master-slave distributed file system is shown in Figure 1. Such a file system consists of one centralized metadata server (also called the metadata node) and multiple distributed data servers (also called data nodes). The metadata server manages the file system's metadata, including the directory structure and each file's storage location, size, and attributes. The data servers store the file system's data, that is, the files themselves. To access a file, a client first contacts the metadata server to obtain the file's metadata and then uses that information to contact the data server that stores the file. The advantages of this design are that it is simple to implement and easy to manage, and that high fault tolerance, high reliability, and high throughput can be achieved with simple techniques. The disadvantage is that a single metadata server becomes a performance bottleneck for system access and a single point of failure, while a metadata server cluster complicates metadata management and lowers the efficiency of metadata access.

Typical representatives of this kind of distributed file system are the Google File System (GFS), the Hadoop file system HDFS, Lustre, and PVFS. HDFS is open source and runs on commodity hardware. A cluster running HDFS consists of one metadata node and multiple data nodes. HDFS is highly fault tolerant, suited to deployment on inexpensive machines, provides high-throughput data access, is well suited to applications on large data sets, and supports streaming reads of file system data.

At present, the main methods for storing massive small files on a master-slave distributed file system are as follows:

Method 1 is the Hadoop archive, or HAR (Hadoop Archive). Hadoop introduced HAR files to address the problem that storing massive small files in HDFS exhausts the metadata node's memory. HAR packs files efficiently into HDFS blocks, which reduces memory use on the metadata node while still allowing transparent access to the files. The method packs multiple small files into one HAR file. HAR files have shortcomings, however. First, a HAR file cannot be modified once created; adding or deleting small files requires creating a new HAR file. Second, creating a HAR file does not automatically delete the original small files, which must be deleted manually. HAR files are therefore very inefficient for handling massive small files.

Method 2 is the approach proposed by Xuhui Liu, Jizhong Han, and others of the Chinese Academy of Sciences in "Implementing WebGIS on Hadoop: A Case Study of Improving Small File I/O Performance on HDFS". Small files are packed into one large file whose header holds the index information for all small files in it. The large file is stored as an ordinary file of the master-slave distributed file system. For each small file retrieval, the client first queries the metadata server for the large file's metadata, then interacts with the data server to read the large file, obtains the index information from the header, and finally reads the file content that follows. Because distributed file systems designed for massive large files mostly use streaming reads, random reads are inefficient and have high latency. When multiple clients read multiple small files at the same time, efficiency is very low and flexibility is poor.

Method 3 is the approach published by Jiang Liu and others of Beijing University of Posts and Telecommunications in their study of small file storage optimization techniques on HDFS. Small files are stored in the data blocks of the data nodes; each small file's metadata records information such as its position within the data block and is kept on the data node. The metadata of all small files is also kept on the metadata server's hard disk. For each small file retrieval, the client first checks whether the most recently accessed data node holds the small file. If not, it must contact the metadata node; the metadata server reads its hard disk to obtain the complete metadata of the requested small file and returns it to the client, which then interacts with the data node to fetch the small file. The problem with this method is that metadata access for small files is inefficient and has long latency.

In summary, current master-slave distributed file systems are all built around large file storage. When used to store massive small files, they generally suffer from low system efficiency, inefficient small file index queries, and poor system reliability. How to store massive small files efficiently and reliably on a master-slave distributed file system designed for massive large files is a technical problem of concern to those skilled in the art.

Summary of the Invention

The technical problem to be solved by the invention is as follows. A general master-slave distributed file system can store files of very large data size and offers high fault tolerance and high scalability, but using it to store massive small files causes problems: (1) With centralized metadata service, the master-slave distributed file system has only a single metadata node, and the number of files determines the size of the metadata. Massive small files consume the metadata node's memory; their metadata can exhaust that memory and exceed what the computer hardware can provide. (2) Retrieval of massive small files is inefficient. Once the volume of file data reaches a certain scale, retrieval efficiency drops sharply and the system becomes slow.

The technical solution of the invention is:

Step 1: deploy the massive small file storage system. The system consists of a master-slave distributed file system together with software on each of its nodes for handling massive small files. This software comprises the index location maintenance module on the metadata node, the small file index module on the data nodes, the client cache module, and SmallFileAPI, the client's dedicated interface for small file operations.

The index location maintenance module assigns data nodes (identified by IP address and port number) to each directory, keeps the mapping between directories and data nodes in sorted order, and returns to the client the data nodes that manage a small file's index, that is, the data nodes assigned to the small file's directory. It stores the mapping between directories and data nodes in an index location mapping table. Each entry of the table has seven items: directory, primary data node, primary data node update flag, first replica data node, first replica data node update flag, second replica data node, and second replica data node update flag. Each of the three update flags takes the value "Y" or "N". "Y" indicates that the small file index for that directory on the data node is up to date and needs no update; "N" indicates that it is not up to date and needs updating. The module creates a queue, waitArrangeQueue, whose items are directory paths. The queue records directories to which data nodes could not be assigned, so that when a new data node joins the distributed file system the module can reassign the directories in waitArrangeQueue. The module also creates two empty files on disk, dirToDatanode and waitArrangeDir: dirToDatanode stores the contents of the index location mapping table, and waitArrangeDir stores the contents of waitArrangeQueue.

The client cache module is initialized with capacity M, meaning it can hold cache records for M directories. Each cache record stores the addresses of the data nodes that manage the small file index for a directory the user has used recently and frequently (that is, the directory's index location mapping table entry), together with the input and output information of the small files' data files. M is a positive integer set by the user as needed.

The small file index module receives client requests to create or query an index, decides from the update flag in the index location mapping table whether it needs to load small file index data, and returns to the client either the small file index or the result of whether creation succeeded. On startup the module creates the small file index data structure Index, a B-tree (see R. Bayer and E. M. McCreight, "Organization and Maintenance of Large Ordered Indexes", Acta Informatica, 1972) ordered by directory path. Each node of Index is the set of small file index records under one directory, which is itself a B-tree ordered by small file name. A small file index record represents the index of one small file and contains the small file's path name, the path of its data file, and its offset within the data file.

SmallFileAPI is the software through which the client exchanges data with the massive small file storage system; it includes the operations for creating and reading small files.
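The records described in Step 1 can be pictured with a few plain data classes. This is a minimal sketch only: the class and field names (`IndexLocationEntry`, `SmallFileIndexRecord`, `IndexLocationMaintainer`) are hypothetical, and ordinary Python containers stand in for the table and queue that the text describes.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Tuple

# A data node is identified by IP address and port number, as in the text.
DataNode = Tuple[str, int]


@dataclass
class IndexLocationEntry:
    """One entry of the index location mapping table (seven items)."""
    directory: str
    primary: DataNode
    primary_flag: str    # "Y" or "N"
    replica1: DataNode
    replica1_flag: str   # "Y" or "N"
    replica2: DataNode
    replica2_flag: str   # "Y" or "N"

    def nodes(self) -> List[DataNode]:
        """Return the primary and the two replica data nodes."""
        return [self.primary, self.replica1, self.replica2]


@dataclass(frozen=True)
class SmallFileIndexRecord:
    """Index of one small file: path name, data file path, offset."""
    path: str
    data_file: str
    offset: int


@dataclass
class IndexLocationMaintainer:
    """In-memory state of the index location maintenance module."""
    # directory path -> mapping table entry
    table: Dict[str, IndexLocationEntry] = field(default_factory=dict)
    # directories that could not be assigned data nodes yet
    wait_arrange_queue: Deque[str] = field(default_factory=deque)
```

Keeping the entry as a flat seven-field record mirrors the table layout in the text; a real implementation would also have to persist it to dirToDatanode.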

Step 2: initialize the massive small file storage system, as follows:

2.1 Initialize the index location mapping table by reading its data from the file dirToDatanode. If dirToDatanode is empty, the table is initialized as an empty table. Thereafter, whenever the table is modified, all of its data is saved to dirToDatanode again.

2.2 Initialize the waiting queue waitArrangeQueue by reading the queue data from the file waitArrangeDir. If waitArrangeDir is empty, waitArrangeQueue is initialized as an empty queue. Thereafter, whenever waitArrangeQueue is modified, all of its data is saved to waitArrangeDir again.

2.3 Initialize the index data structure Index as an empty B-tree. Index reads index data from the index files and index log files dynamically, as client requests require.
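Steps 2.1 and 2.2 follow the same load-then-rewrite pattern, which can be sketched as below. The text does not specify an on-disk format, so JSON is an assumption here, as are the function names `load_state` and `save_state`; the write-to-temporary-then-replace step is likewise an added safeguard and not something the text requires.

```python
import json
import os


def load_state(path, empty):
    """Load saved state, or return `empty` if the file is missing or empty.

    Mirrors steps 2.1/2.2: an empty dirToDatanode or waitArrangeDir file
    yields an empty table or an empty queue.
    """
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return empty
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def save_state(path, state):
    """Rewrite the whole state after every modification.

    Writing to a temporary file and replacing the original is an assumption
    added here so that a crash cannot leave a half-written file.
    """
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)
```

For example, `load_state("dirToDatanode", {})` would give the mapping table and `load_state("waitArrangeDir", [])` the queue contents, with `save_state` called after each change.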

Step 3: the client's SmallFileAPI operates on small files according to instructions received from the keyboard. To create a small file, go to Step 4; to read a small file, go to Step 8.

Step 4: SmallFileAPI obtains from the client the path of the small file to create, then obtains the data file under the directory indicated by that path (the small file directory for short) and the index location mapping table entry for the small file directory. A data file is a real file in the namespace of the master-slave distributed file system and stores the data of all small files in the same directory. It consists of a data file header followed by data records. The header has four fields: the first occupies three bytes and describes the file type, distinguishing data files from ordinary files; the second occupies one byte and gives the data file's version number (Version); the third gives the key type, that is, the data type in which keys are stored; the fourth gives the value type, that is, the data type in which values are stored. The header is followed by one or more records, each holding the complete data of one small file. Each record has four items: record length, key length, key, and value. The key is the small file's file name, the value is the small file's content, and the key length is the length of the file name. Each small file is stored as one record, appended directly to the end of the data file.
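The data file layout described above can be illustrated with a short encoder and decoder. The text fixes only the three-byte type field and the one-byte version field, so everything else here is an assumption made for the sketch: the magic bytes `SFD`, one-byte key and value type codes, and 4-byte big-endian length fields.

```python
import io
import struct

MAGIC = b"SFD"         # hypothetical 3-byte file type marker
VERSION = 1            # 1-byte version number
KEY_TYPE_UTF8 = 1      # hypothetical 1-byte code: keys are UTF-8 text
VALUE_TYPE_BYTES = 2   # hypothetical 1-byte code: values are raw bytes


def write_header(f):
    """Write the four header fields: type, version, key type, value type."""
    f.write(MAGIC + bytes([VERSION, KEY_TYPE_UTF8, VALUE_TYPE_BYTES]))


def append_record(f, name, content):
    """Append one small file as a record and return its offset."""
    key = name.encode("utf-8")
    f.seek(0, io.SEEK_END)
    offset = f.tell()
    # Record length counts everything after itself:
    # the key-length field (4 bytes), the key, and the value.
    record_len = 4 + len(key) + len(content)
    f.write(struct.pack(">II", record_len, len(key)))
    f.write(key)
    f.write(content)
    return offset


def read_record(f, offset):
    """Read the record at `offset` and return (file name, content)."""
    f.seek(offset)
    record_len, key_len = struct.unpack(">II", f.read(8))
    key = f.read(key_len).decode("utf-8")
    value = f.read(record_len - 4 - key_len)
    return key, value
```

Because records are only ever appended, the offset returned by `append_record` stays valid for the life of the data file, which is what lets an index record refer to a small file by data file name and offset alone.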

4.1 The client's SmallFileAPI checks whether the cache module holds the information for the small file directory, namely the directory's index location mapping table entry and the input and output information of the small file data file. If the information is in the cache module, go to Step 5; otherwise go to step 4.2.

4.2 The client's SmallFileAPI extracts the path of the small file directory from the small file's path. If the directory does not exist, it is created; the index location maintenance module on the metadata node then assigns three data nodes to the directory and inserts the mapping between the directory and the three data nodes into the index location mapping table as a new entry.

The index location maintenance module assigns three data nodes to the directory as follows:

4.2.1 The module randomly selects three data nodes from the information on all data nodes maintained by the metadata node (data nodes of a master-slave distributed file system register with the metadata node, so the metadata node holds information on every data node in the cluster). If three data nodes are obtained, their update flags are initialized to Y and execution goes to step 4.2.3; if three data nodes cannot be found, go to step 4.2.2.

4.2.2 Add the directory that could not be assigned data nodes to the queue waitArrangeQueue and save the contents of waitArrangeQueue to the waitArrangeDir file again. Return an operation failure signal to the client and go to Step 13, ending the operation.

4.2.3 The index location maintenance module saves the index location mapping table to the dirToDatanode file again.
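Steps 4.2.1 to 4.2.3 amount to a random choice with a fallback queue, sketched below. The function name `allocate_nodes` and the dictionary layout of an entry are hypothetical, and the writes to dirToDatanode and waitArrangeDir are left as comments.

```python
import random
from collections import deque


def allocate_nodes(directory, registered_nodes, table, wait_queue, rng=random):
    """Assign three randomly chosen data nodes to `directory`.

    Returns the new mapping table entry, or None when fewer than three
    data nodes are registered (step 4.2.2).
    """
    nodes = list(registered_nodes)
    if len(nodes) < 3:
        # Step 4.2.2: remember the directory for later reassignment.
        wait_queue.append(directory)
        # (the queue would be saved to waitArrangeDir here)
        return None
    primary, replica1, replica2 = rng.sample(nodes, 3)
    # Step 4.2.1: all three update flags start as "Y".
    entry = {
        "primary": primary, "primary_flag": "Y",
        "replica1": replica1, "replica1_flag": "Y",
        "replica2": replica2, "replica2_flag": "Y",
    }
    table[directory] = entry
    # Step 4.2.3: the table would be saved to dirToDatanode here.
    return entry
```

`random.sample` draws without replacement, so the three data nodes are always distinct.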

4.3 Let the variable X = 1.

4.4 If the data file dataX does not exist in the small file directory, the metadata node creates dataX in that directory.

4.5 If the data file dataX exists in the small file directory, the client's SmallFileAPI asks the metadata node for the output information of dataX. If the output information is obtained, go to step 4.6. If it is not obtained (dataX is at that moment occupied by another client), increase X by 1; if X <= P, go to step 4.4, and if X > P, return an error message to the client and go to Step 13. P is the number of data files that may be created in the directory; it is a positive integer set by the user, typically P = 32.
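The loop over data1 to dataP in steps 4.3 to 4.5 can be sketched as follows. The three callables `exists`, `create`, and `try_open_output` are hypothetical stand-ins for requests to the metadata node; `try_open_output` is assumed to return None when the output information cannot be obtained.

```python
def acquire_data_file(directory, exists, create, try_open_output, max_files=32):
    """Find a data file in `directory` that this client can write to.

    Tries data1, data2, ... up to data<max_files> (P in the text) and
    returns (path, output) for the first one whose output information
    can be obtained.
    """
    for x in range(1, max_files + 1):          # step 4.3 and the X += 1 loop
        path = f"{directory}/data{x}"
        if not exists(path):                   # step 4.4
            create(path)
        output = try_open_output(path)         # step 4.5
        if output is not None:
            return path, output
    # X > P: every data file was unavailable, report an error.
    raise RuntimeError(f"no writable data file in {directory}")
```

Allowing up to P data files per directory means that up to P clients can write small files into the same directory at once, each appending to a different data file.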

4.6 The client's SmallFileAPI asks the index location maintenance module on the metadata node for the data nodes assigned to the small file directory. The module looks up the index location mapping table and returns the entry for the small file directory to the client.

4.7 The client's SmallFileAPI records the small file directory's index location mapping table entry and the output information of the small file's data file in the cache module. If the cache module is full, an entry is evicted using the LRU (least recently used) algorithm.
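A client cache with capacity M and LRU eviction, as in step 4.7, can be sketched with the standard library's `OrderedDict`. The class name `DirectoryCache` and the shape of the cached record are hypothetical; the text only says that a record holds the mapping table entry and the data file input and output information.

```python
from collections import OrderedDict


class DirectoryCache:
    """Per-directory cache holding at most `capacity` records (M in the text)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be a positive integer")
        self.capacity = capacity
        self._records = OrderedDict()  # least recently used entry first

    def get(self, directory):
        """Return the cached record, or None on a miss."""
        if directory not in self._records:
            return None
        self._records.move_to_end(directory)   # mark as most recently used
        return self._records[directory]

    def put(self, directory, record):
        """Insert or refresh a record, evicting the LRU one when full."""
        self._records[directory] = record
        self._records.move_to_end(directory)
        if len(self._records) > self.capacity:
            self._records.popitem(last=False)  # drop least recently used

    def __len__(self):
        return len(self._records)
```

A hit in this cache is what allows step 4.1 to skip the metadata node entirely and go straight to writing the data.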

Step 5: the client writes the small file's data into the data file. The client writes the small file as one data record into the data file dataX obtained in step 4.5 and returns the small file's offset in the data file, that is, the position of the data record within the data file.

Step 6: the small file index module on the data node creates the small file index. The client sends the small file path, the name of the data file storing the small file, the small file's offset in the data file, and the data node update flag to the primary data node among the three data nodes of the index location mapping table entry obtained in step 4.6, and requests creation of the small file index. If the primary data node has failed, a failure result is returned to the client and execution goes to Step 13. If the primary data node is working normally, its small file index module handles the request as follows:

6.1 The small file index module uses the primary data node's update flag to decide whether the small file index for the directory must be updated. If the flag is Y, go to step 6.2; if the flag is N, go to step 6.3.

6.2 The small file index module reads two files from the distributed file system, /index/<small file directory path>.index and /index/<small file directory path>.log, where <small file directory path> stands for the path of the small file directory. The .index file is called the index file; it holds all small file index records for the directory, saved after being sorted by small file path name in a B-tree data structure. The .log file is called the log file; it holds records of operations on the directory's small file index, including index creation and deletion. Each log record consists of an operation type and an index record, where the operation type is the action performed, such as create or delete.

The small file index module reads the index data from the index file and the log file as follows:

6.2.1 The small file index module reads the data in the index file, builds a B-tree from it, and inserts that B-tree as a node into the in-memory index data structure Index.

6.2.2 The small file index module reads the operation records in the log file in order and replays them. If the operation type is create, for example, the index information of the record is inserted into Index using the B-tree insertion algorithm.

6.2.3 The small file index module renames the index file from <small file directory path>.index to <small file directory path>.index.tmp, creates a new index file named <small file directory path>.index, saves all index records for the small file directory in Index to the new index file, and deletes the index file with the .tmp suffix.
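Steps 6.2.1 to 6.2.3 describe a snapshot-plus-log recovery followed by a rewrite of the snapshot. The sketch below illustrates that pattern under several assumptions that are not in the text: records are stored as JSON lines, a plain dictionary stands in for the per-directory B-tree, and the local file system stands in for the distributed file system.

```python
import json
import os


def read_json_lines(path):
    """Return the JSON objects stored one per line, or [] if no file."""
    if not os.path.exists(path):
        return []
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def rebuild_directory_index(index_path, log_path):
    """Rebuild one directory's index and rewrite its index file."""
    # 6.2.1: load the saved index records, keyed by small file path.
    records = {r["path"]: r for r in read_json_lines(index_path)}
    # 6.2.2: replay the logged operations in order.
    for entry in read_json_lines(log_path):
        record = entry["record"]
        if entry["op"] == "create":
            records[record["path"]] = record
        elif entry["op"] == "delete":
            records.pop(record["path"], None)
    # 6.2.3: move the old index aside, write the new one, drop the old.
    tmp_path = index_path + ".tmp"
    if os.path.exists(index_path):
        os.replace(index_path, tmp_path)
    with open(index_path, "w", encoding="utf-8") as f:
        for path in sorted(records):           # keep records in path order
            f.write(json.dumps(records[path]) + "\n")
    if os.path.exists(tmp_path):
        os.remove(tmp_path)
    return records
```

Once the new index file contains everything the log contributed, the log can be cleared safely, which is what step 6.3 does next.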

6.3 Clear the contents of the log file, and have the small file index module obtain the write information for the log file in preparation for writing log records.

6.4 The small file index module builds an index record from the small file path name, data file name, and offset received from the client, searches the index data structure Index, inserts the record at its position in the Index tree in order of small file path name, and writes the index creation operation to the log file obtained in step 6.3.
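The two-level Index (directories on the outside, file names on the inside) and the insert-then-log behavior of step 6.4 can be sketched as below. A sorted list searched with `bisect` is used as a simple stand-in for the B-tree named in the text; the class names `SortedMap` and `SmallFileIndex` and the list used as a log are hypothetical.

```python
import bisect


class SortedMap:
    """Sorted key/value container standing in for a B-tree."""

    def __init__(self):
        self._keys = []
        self._values = []

    def insert(self, key, value):
        """Insert or overwrite `key`, keeping keys in sorted order."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value
        else:
            self._keys.insert(i, key)
            self._values.insert(i, value)

    def get(self, key):
        """Return the value stored under `key`, or None."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def keys(self):
        return list(self._keys)


class SmallFileIndex:
    """Outer map ordered by directory path; inner maps ordered by file name."""

    def __init__(self):
        self._dirs = SortedMap()

    def create(self, path, data_file, offset, log):
        """Insert an index record and append the operation to `log`."""
        directory, _, name = path.rpartition("/")
        files = self._dirs.get(directory)
        if files is None:
            files = SortedMap()
            self._dirs.insert(directory, files)
        record = (path, data_file, offset)
        files.insert(name, record)
        log.append(("create", record))         # step 6.4: log the operation

    def lookup(self, path):
        """Return the index record for `path`, or None if absent."""
        directory, _, name = path.rpartition("/")
        files = self._dirs.get(directory)
        return None if files is None else files.get(name)

    def file_names(self, directory):
        """Return the sorted file names indexed under `directory`."""
        files = self._dirs.get(directory)
        return [] if files is None else files.keys()
```

A sorted list has linear-time insertion, so it only illustrates the ordering; a B-tree gives the logarithmic insert and lookup that a large directory would need.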

6.5 The small file index module sends the client a signal that the small file was created successfully.

Step 7: the client's cache module modifies the update flag of the primary data node in the cache record for the small file directory. The primary data node's update flag is changed to N, and the update flags of the first replica and second replica data nodes are changed to Y. Go to Step 13.

Steps 4 through 7 make up the small file creation process of the massive small file storage system.

Step 8: the client's SmallFileAPI derives the small file directory from the small file path and looks the directory up in the client cache module. If the data node information for the directory is not found, go to Step 9; if it is found, go to Step 10.

Step 9: the client's SmallFileAPI asks the index location maintenance module on the metadata node for the directory's index location. The module looks up the index location mapping table and returns the entry for the small file directory to the client, which records it in the cache module.

In the tenth step, the client's SmallFileAPI selects any one of the three data nodes and sends it a request to query the small file index. If the update flag is Y, the small file index module updates the indexes of all small files under the small file directory, as follows:

10.1 The small file index module reads the data of the file named by the small file directory path with the suffix .index, builds a B-tree from it, and inserts that B-tree as a node into the index data structure Index.

10.2 The small file index module reads the index operation records in the file named by the small file directory path with the suffix .log in order and replays these operations.

10.3 The data node queries Index through the small file index module and returns the small file's index record to the client.
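
Steps 10.1 through 10.3 follow a snapshot-plus-log pattern, sketched below. The on-disk formats are assumptions: both the .index file and the .log file are taken to hold one JSON object per line, and a dictionary sorted by path stands in for the per-directory B-tree. `load_directory_index` and `lookup` are hypothetical names.

```python
import json
import os


def _read_json_lines(file_path):
    """Yield one parsed object per non-empty line; a missing file yields nothing."""
    if not os.path.exists(file_path):
        return
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def load_directory_index(directory_prefix):
    """Rebuild {small file path: (data file, offset)} for one directory."""
    records = {}
    # 10.1: load the persisted index data.
    for rec in _read_json_lines(directory_prefix + ".index"):
        records[rec["path"]] = (rec["data_file"], rec["offset"])
    # 10.2: replay the logged index operations in order.
    for op in _read_json_lines(directory_prefix + ".log"):
        if op["op"] == "create":
            records[op["path"]] = (op["data_file"], op["offset"])
    return dict(sorted(records.items()))  # ordered by small file path


def lookup(records, small_file_path):
    """10.3: return the index record for one small file, or None."""
    return records.get(small_file_path)
```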

In the eleventh step, the client's SmallFileAPI uses the data file name in the small file's index record to look up the data file's input information in the client cache module. If it is not cached, SmallFileAPI obtains the data file's input information through the distributed file system's read-file interface and records it in the cache module.

In the twelfth step, the client's SmallFileAPI reads the small file's data from the data file at the offset given in the small file's index record.

The eighth through twelfth steps constitute the process by which the massive small file storage system reads a small file.

In the thirteenth step, the client's SmallFileAPI determines whether there is further instruction input. If so, go to the third step; if not, end.

The present invention is a method for storing massive numbers of small files based on a master-slave distributed file system, and it achieves the following technical effects:

(1) The fifth step stores small file data in the master-slave distributed file system, which provides distributed storage and fault tolerance for the data and thereby achieves large-scale, reliable data storage.

(2) The sixth step distributes the small file indexes across the data nodes for management, which solves the problem a single metadata node faces when storing massive numbers of small files. In addition, step 6.2 stores the small file index in the distributed file system, so the distributed file system's own fault-tolerance mechanism protects the index and reduces the risk of losing it.

(3) On this basis, the massive small file storage system caches the user's frequently used small file index location information and data file information in the client cache module, which avoids frequent interaction with the metadata node and greatly improves system performance.

Experiments show that the present invention effectively solves the problem of the huge metadata volume of massive small file storage, that the write efficiency of the massive small file storage system is greatly improved, and that fault tolerance of the small file index ensures the reliability of the system.

Description of the drawings

Fig. 1 is a structural diagram of the master-slave distributed file system of the background art;

Fig. 2 is an overall structural diagram of the massive small file storage system deployed in the first step of the present invention;

Fig. 3 is an overall flow chart of the present invention;

Fig. 4 is a structural diagram of the index position mapping table of the present invention;

Fig. 5 is a structural diagram of the data file created in the fourth step of the present invention;

Fig. 6 is a structural diagram of the small file index record generated by the small file index module in step 6.4 of the present invention.

Detailed description of the embodiments

Specific embodiments of the present invention are described below with reference to the accompanying drawings.

Fig. 1 is a structural diagram of the master-slave distributed file system.

Fig. 2 is an overall structural diagram of the massive small file storage system built in the first step of the present invention. The massive small file storage system consists of a master-slave distributed file system and the software on each of its nodes for handling massive numbers of small files. This software comprises the index position maintenance module on the metadata node, the small file index module on the data nodes, the client cache module, and SmallFileAPI, the dedicated client interface for operating on small files.

The index position maintenance module assigns data nodes (identified by IP address and port number) to each directory, sorts the mappings between directories and data nodes, and returns to the client the data nodes that manage the index of a small file, that is, the data nodes to which the small file's directory has been assigned. The module stores the mapping between directories and data nodes in the index position mapping table. As shown in Fig. 4, the index position mapping table consists of seven items: directory, primary data node, primary data node update flag, first replica data node, first replica data node update flag, second replica data node, and second replica data node update flag. Each of the three update flags takes one of two values, "Y" or "N". "Y" indicates that the small file index for the directory on that data node is up to date and does not need to be updated; "N" indicates that it is not up to date and needs to be updated. The index position maintenance module creates a queue, waitArrangeQueue, whose items are directory paths. The queue records directories to which data nodes could not be assigned, so that when a new data node joins the distributed file system the module can reassign the directories in waitArrangeQueue. The index position maintenance module also creates two empty files on disk, dirToDatanode and waitArrangeDir: dirToDatanode stores the contents of the index position mapping table, and waitArrangeDir stores the contents of the queue waitArrangeQueue.

The capacity of the client cache module is initialized to M, so it can hold the cache records of M directories. Each cache record stores the addresses of the data nodes that manage the small file indexes the user has used most often recently (that is, the index position mapping table entry for the directory), together with the data file input and output information of the small files. M is a positive integer set according to the user's needs.

The small file index module receives index creation and query requests from the client, decides from the update flag in the index position mapping table whether the small file index data needs to be loaded, and returns to the client either the small file index or the result of whether creation succeeded. On startup, the small file index module creates the small file index data structure Index, a B-tree ordered by directory path. Each of its nodes is the set of small file index records under one directory, which is itself a B-tree ordered by small file name. A small file index record represents the index of one small file and contains the small file's path name, the path of its data file, and the small file's offset within the data file.

SmallFileAPI is the software that handles data exchange between the client and the massive small file storage system, including the operations for creating and reading small files.
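
The interplay between the index position mapping table and waitArrangeQueue can be sketched as below. The text does not state the assignment policy or when an assignment fails, so this sketch assumes round-robin selection and treats "fewer than three known data nodes" as the failure case. Update flags and the on-disk files dirToDatanode and waitArrangeDir are omitted. `IndexLocationTable`, `assign`, and `add_data_node` are hypothetical names.

```python
from collections import deque

REPLICAS = 3  # primary, first replica, second replica


class IndexLocationTable:
    def __init__(self):
        self.nodes = []                    # known data nodes as "ip:port"
        self.table = {}                    # directory -> assigned data nodes
        self.wait_arrange_queue = deque()  # directories awaiting assignment
        self._next = 0                     # round-robin cursor (assumed policy)

    def assign(self, directory):
        """Assign three data nodes to a directory, or queue it on failure."""
        if directory in self.table:
            return self.table[directory]
        if len(self.nodes) < REPLICAS:
            if directory not in self.wait_arrange_queue:
                self.wait_arrange_queue.append(directory)
            return None
        chosen = [self.nodes[(self._next + i) % len(self.nodes)]
                  for i in range(REPLICAS)]
        self._next = (self._next + 1) % len(self.nodes)
        self.table[directory] = {"primary": chosen[0],
                                 "replica1": chosen[1],
                                 "replica2": chosen[2]}
        return self.table[directory]

    def add_data_node(self, node):
        """Register a new data node and retry every queued directory once."""
        self.nodes.append(node)
        for _ in range(len(self.wait_arrange_queue)):
            self.assign(self.wait_arrange_queue.popleft())
```

A directory requested while only two data nodes exist stays in the queue and receives its three nodes as soon as a third node joins.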

Fig. 3 is an overall flow chart of the present invention.

In the first step, the massive small file storage system is deployed.

In the second step, the massive small file storage system is initialized.

In the third step, the operation on the small file is selected. To create a small file, go to the fourth step; to read a small file, go to the eighth step.

In the fourth step, the client's SmallFileAPI creates a new data file for storing the small file according to the path of the small file to be created.

In the fifth step, the client's SmallFileAPI writes the small file's data into the data file.

In the sixth step, the small file index module creates the index of the small file.

In the seventh step, the client modifies the update flag of the primary data node in the cache record corresponding to the small file directory in the cache module. Go to the thirteenth step.

In the eighth step, the client's SmallFileAPI derives the small file directory from the small file path and looks up the data node information for the directory in the cache module. If the data node information is not found, go to the ninth step; if it is found, go to the tenth step.

In the ninth step, the client's SmallFileAPI asks the index position maintenance module on the metadata node for the index position of the directory. The module returns the three data nodes and their update flags to the client, and the client records the returned information in the cache module.

In the tenth step, the client's SmallFileAPI selects any one of the three data nodes and sends it a request to query the small file index.

In the eleventh step, the client's SmallFileAPI uses the data file name in the small file's index record to look up the data file's input information in the cache module. If it is not cached, SmallFileAPI obtains the data file's input information through the distributed file system's read-file interface and records it in the cache module.

In the twelfth step, the client reads the small file's data from the data file at the offset given in the small file's index record.

In the thirteenth step, the client's SmallFileAPI determines whether there is further instruction input from the keyboard. If so, go to the third step; if not, end.

Fig. 4 shows the structure of the mapping table between small file directories and data nodes. Each directory corresponds to three data nodes, each given by its specific location. The update flag following each data node indicates whether the small file index currently held in that data node's memory needs to be updated: Y means it needs to be updated, and N means it does not.

Fig. 5 shows the structure of the data file in which small files are stored. A data file consists of a data file header followed by data records. The header has four fields. The first field occupies three bytes and describes the file type, distinguishing the data file from ordinary files. The second field occupies one byte and gives the version number (Version) of the data file. The third field gives the key type, that is, the data type in which keys are stored. The fourth field gives the value type, that is, the data type in which values are stored. The header is followed by one or more records, each of which stores the complete data of one small file. Each record consists of four items: record length, key length, key, and value. The key is the small file's file name, the value is the small file's content, and the key length is the length of the small file's file name.
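
The layout of Fig. 5, together with the offset-based read of the twelfth step, can be sketched as below. Only the three-byte file type and the one-byte version are sized in the text. Everything else is an assumption made for illustration: one byte each for the key type and value type, four-byte big-endian unsigned integers for both lengths, a record length that counts the key-length field, key, and value, and the marker `b"SMF"` with its type codes. All function names are hypothetical.

```python
import io
import struct

MAGIC = b"SMF"        # hypothetical 3-byte file type marker
VERSION = 1
KEY_TYPE_TEXT = 1     # hypothetical type codes
VALUE_TYPE_BYTES = 2

HEADER = struct.Struct(">3sBBB")  # file type, version, key type, value type
LENGTHS = struct.Struct(">II")    # record length, key length


def write_header(stream):
    stream.write(HEADER.pack(MAGIC, VERSION, KEY_TYPE_TEXT, VALUE_TYPE_BYTES))


def append_record(stream, file_name, content):
    """Append one small file and return the offset to store in its index record."""
    key = file_name.encode("utf-8")
    stream.seek(0, io.SEEK_END)
    offset = stream.tell()
    # Assumed meaning of record length: key-length field + key + value.
    record_length = 4 + len(key) + len(content)
    stream.write(LENGTHS.pack(record_length, len(key)))
    stream.write(key)
    stream.write(content)
    return offset


def read_record(stream, offset):
    """Read the small file stored at `offset`; returns (file name, content)."""
    stream.seek(offset)
    record_length, key_length = LENGTHS.unpack(stream.read(LENGTHS.size))
    key = stream.read(key_length).decode("utf-8")
    value = stream.read(record_length - 4 - key_length)
    return key, value
```

Because each record carries its own lengths, a reader needs only the offset from the index record to fetch one small file without scanning the rest of the data file.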

Fig. 6 shows the structure of the index record generated by the small file index module in step 6.4. Each index record contains the small file's path, the data file name, and the small file's offset within the data file.