CN110674087A

Info

Publication number: CN110674087A
Application number: CN201910829794.5A
Authority: CN
Inventors: 钱克功; 沈网中
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-09-03
Filing date: 2019-09-03
Publication date: 2020-01-10
Also published as: WO2021043088A1

Abstract

The invention relates to artificial intelligence technology and discloses a file query method comprising: acquiring a favorite file set of a client, creating a business description of the favorite file set in a file system, and storing the favorite file set, once its business description has been created, in cloud storage; extracting keywords from the business description with a keyword extraction algorithm, converting the keywords into word vectors, and storing the word vectors; receiving query content input by a user and calculating the similarity between the query content and the word vectors; and selecting the corresponding business description according to the similarity, querying the cloud storage for favorite files through a multi-strategy retrieval method, and returning the query result to the user. The invention also provides a file query device and a computer-readable storage medium. The invention enables precise file queries.

Description

File query method, device and computer-readable storage medium

Technical field

The present invention relates to the technical field of artificial intelligence, and in particular to a file query method, a file query device, and a computer-readable storage medium.

Background

As technology develops, the volume of information grows explosively, and more and more files must be stored on users' computers. A computer's file system creates files for users and controls access to them by storing, reading, modifying, and dumping files. Files that a user no longer needs can be revoked or deleted, so the file system can support the storage of massive numbers of files. For the user, however, retrieving a target file from such a mass of files takes considerable time and effort, and at present the industry offers no technology or product for fast file querying.

Summary of the invention

The present invention provides a file query method, a device, and a computer-readable storage medium, whose main purpose is to present accurate file query results to a user who queries for files.

To achieve the above purpose, the file query method provided by the present invention includes:

acquiring a favorite file set of a client, creating a business description of the favorite file set in a file system, and storing the favorite file set, once its business description has been created, in cloud storage;

extracting keywords from the business description with a keyword extraction algorithm to obtain the keywords of the business description, converting the keywords into word vectors, and storing the word vectors;

receiving query content input by a user, and calculating the similarity between the query content and the word vectors;

selecting the corresponding business description according to the similarity, querying the cloud storage for favorite files through a multi-strategy retrieval method, and returning the query result to the user.

Optionally, acquiring the favorite file set of the client includes:

obtaining the favorite file set by traversing and searching the local disk of the client; or

obtaining the favorite file set by searching a search engine with keywords chosen according to the user's needs.

Optionally, extracting keywords from the business description with the keyword extraction algorithm includes:

performing word segmentation on the business description;

calculating the dependency relevance of any two words W_i and W_j in the business description:

where Dep(W_i, W_j) denotes the dependency relevance of the words W_i and W_j, len(W_i, W_j) denotes the length of the dependency path between W_i and W_j, and b is a hyperparameter;

calculating the gravitation between the words W_i and W_j:

where f_grav(W_i, W_j) denotes the gravitation between the words W_i and W_j, tfidf(W_i) denotes the TF-IDF value of the word W_i, tfidf(W_j) denotes the TF-IDF value of the word W_j, TF is the term frequency, IDF is the inverse document frequency, and d is the Euclidean distance between the word vectors of W_i and W_j;

obtaining, from the calculated dependency relevance and gravitation, the association strength between the words W_i and W_j:

weight(W_i, W_j) = Dep(W_i, W_j) * f_grav(W_i, W_j)

calculating the importance score of the word W_i from the association strength:

where the set appearing in the formula is the set of vertices related to vertex W_i, and η is the damping coefficient;

selecting, according to the importance scores of the words W_i, the t highest-scoring words as the keywords of the business description.
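
The formulas for this step are not reproduced in the text. One set of forms consistent with the stated definitions is sketched below purely as an assumption: the exponential decay in Dep, the inverse-square law in f_grav, the TextRank-style recursion, and the symbols WS(·) and C(·) are illustrative choices that the text does not confirm.

```latex
\begin{align}
\mathrm{Dep}(W_i, W_j) &= \frac{1}{b^{\,\mathrm{len}(W_i, W_j)}} \\
f_{grav}(W_i, W_j) &= \frac{\mathrm{tfidf}(W_i)\,\mathrm{tfidf}(W_j)}{d^{2}} \\
\mathrm{weight}(W_i, W_j) &= \mathrm{Dep}(W_i, W_j)\, f_{grav}(W_i, W_j) \\
WS(W_i) &= (1-\eta) + \eta \sum_{W_j \in C(W_i)}
  \frac{\mathrm{weight}(W_i, W_j)}{\sum_{W_k \in C(W_j)} \mathrm{weight}(W_j, W_k)}\, WS(W_j)
\end{align}
```

Under these assumed forms, words that are close in the dependency tree, carry high TF-IDF values, and lie near each other in vector space reinforce each other's scores.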

Optionally, the similarity between the query content and the word vector is calculated by the following formula:

where X denotes the word vector and Y denotes the query content.
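
The formula itself is not reproduced in the text. Because the detailed description names cosine similarity as the measure, the standard definition is presumably what is meant; the component notation x_k, y_k and the dimension N are introduced here only for illustration.

```latex
\cos(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}
           = \frac{\sum_{k=1}^{N} x_k\, y_k}
                  {\sqrt{\sum_{k=1}^{N} x_k^{2}}\;\sqrt{\sum_{k=1}^{N} y_k^{2}}}
```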

Optionally, querying the cloud storage for favorite files through the multi-strategy retrieval method includes:

denoting the original character string in the query content input by the user as m, and the target character string in the business description of a favorite file as n;

recording the number L of deletion, insertion, and replacement edits required to transform the original string m into the target string n;

selecting the favorite file with the smallest L as the query result and returning it to the user.
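
The edit count L described here is the Levenshtein distance, which the detailed description names explicitly. Its standard recurrence can be written as follows, where m_i and n_j (the i-th and j-th characters of m and n) are notation assumed for this sketch:

```latex
\mathrm{lev}_{m,n}(i, j) =
\begin{cases}
\max(i, j) & \text{if } \min(i, j) = 0, \\[4pt]
\min \begin{cases}
  \mathrm{lev}_{m,n}(i-1, j) + 1 \\
  \mathrm{lev}_{m,n}(i, j-1) + 1 \\
  \mathrm{lev}_{m,n}(i-1, j-1) + [\, m_i \neq n_j \,]
\end{cases} & \text{otherwise,}
\end{cases}
\qquad L = \mathrm{lev}_{m,n}(|m|, |n|)
```

The three branches of the inner minimum correspond to deletion, insertion, and replacement respectively.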

In addition, to achieve the above purpose, the present invention also provides a file query device comprising a memory and a processor, the memory storing a file query program executable on the processor, and the file query program, when executed by the processor, implementing the following steps:

where f_grav(W_i, W_j) denotes the gravitation between the words W_i and W_j, tfidf(W_i) denotes the TF-IDF value of the word W_i, tfidf(W_j) denotes the TF-IDF value of the word W_j, TF is the term frequency, IDF is the inverse document frequency, and d is the Euclidean distance between the word vectors of W_i and W_j;

weight(W_i, W_j) = Dep(W_i, W_j) * f_grav(W_i, W_j)

where the set appearing in the formula is the set of vertices related to vertex W_i, and η is the damping coefficient;

In addition, to achieve the above purpose, the present invention also provides a computer-readable storage medium storing a file query program that can be executed by one or more processors to implement the steps of the file query method described above.

With the file query method, device, and computer-readable storage medium proposed by the present invention, when a user performs a file query, the business descriptions of the client's favorite files are analyzed, the similarity between the user's query content and the analyzed business descriptions is calculated, files are queried through a multi-strategy retrieval method according to that similarity, and the query result is returned to the user, so that the user is presented with accurate file query results.

Description of drawings

FIG. 1 is a schematic flowchart of a file query method provided by an embodiment of the present invention;

FIG. 2 is a schematic diagram of the internal structure of a file query device provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of the modules of the file query program in a file query device provided by an embodiment of the present invention.

The realization of the purpose, the functional characteristics, and the advantages of the present invention are further described below with reference to the embodiments and the accompanying drawings.

Detailed description

It should be understood that the specific embodiments described here are intended only to explain the present invention, not to limit it.

The present invention provides a file query method. FIG. 1 is a schematic flowchart of a file query method provided by an embodiment of the present invention. The method may be performed by a device, and the device may be implemented in software and/or hardware.

In this embodiment, the file query method includes:

S1. Acquire a favorite file set of a client, create a business description of the favorite file set in a file system, and store the favorite file set, once its business description has been created, in cloud storage.

In a preferred embodiment of the present invention, the client, also called the user end, is a program that corresponds to a server and provides local services to the customer. The favorite file set of the client is obtained in one of two ways: in the first, it is obtained by traversing and searching the local disk of the client; in the second, it is obtained by searching a search engine with keywords chosen according to the user's needs.

Cloud storage is a model of online storage in which data is kept on multiple virtual servers, usually hosted by a third party, rather than on dedicated servers.

Preferably, the file system in the present invention is the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant and can be deployed on low-cost hardware. It also relaxes some Portable Operating System Interface (POSIX) requirements so that file data can be accessed as a stream, which gives applications high-throughput access to their data and suits applications with large data sets.

In detail, HDFS consists of one NameNode (master node) and n DataNodes (slave nodes). The NameNode is the master server that manages the file namespace and client access, and the DataNodes manage file storage. In a preferred embodiment of the present invention, the business description of the favorite file set is created in the master node of the HDFS file system.

Further, the business description is a brief summary of the content of the favorite file set, and may also be the name of the favorite file set. In a preferred embodiment of the present invention, several different business descriptions are created on the Hadoop master node, and several slave nodes are set up under the master node to store the favorite files corresponding to each business description. The favorite files corresponding to a business description can therefore be queried by retrieving that business description.
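
The relationship between business descriptions and their favorite files can be pictured as a small lookup index. The sketch below is an illustration only: `DescriptionIndex`, `add`, `descriptions`, `files_for`, and the sample paths are hypothetical names, and a plain in-memory dictionary stands in for the HDFS master/slave layout, which is not modeled here.

```python
from collections import defaultdict


class DescriptionIndex:
    """Toy stand-in for the master-node catalogue: description -> stored files."""

    def __init__(self):
        self._files = defaultdict(list)

    def add(self, description, file_path):
        # Register one favorite file under its business description.
        self._files[description].append(file_path)

    def descriptions(self):
        # All business descriptions that can be searched.
        return list(self._files)

    def files_for(self, description):
        # Files stored under a description; empty list if it is unknown.
        return list(self._files.get(description, []))


index = DescriptionIndex()
index.add("quarterly sales report", "/favorites/sales_q1.xlsx")
index.add("quarterly sales report", "/favorites/sales_q2.xlsx")
index.add("employee onboarding guide", "/favorites/onboarding.pdf")
```

Retrieval then reduces to finding the best-matching description and calling `files_for` on it.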

S2. Extract keywords from the business description with a keyword extraction algorithm to obtain the keywords of the business description, convert the keywords into word vectors, and store the word vectors.

In a preferred embodiment of the present invention, extracting keywords from the business description with the keyword extraction algorithm includes:

where the set appearing in the formula is the set of vertices related to vertex W_i, and η is the damping coefficient;

Preferably, the present invention selects, according to the importance scores of the words, the t highest-scoring words as the keywords of the business description.
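
The keyword extraction step can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the formulas are not reproduced in the text, so the forms Dep = 1 / b^len and f_grav = tfidf_i * tfidf_j / d^2 and the weighted TextRank-style iteration are assumed, and the function names, the default values of `b`, `eta`, and `iters`, and the input dictionaries are all hypothetical. Word segmentation, TF-IDF values, word vectors, and dependency path lengths are taken as precomputed inputs.

```python
import math
from itertools import combinations


def dep(path_len, b=2.0):
    # Assumed form: relevance decays exponentially with dependency path length.
    return 1.0 / (b ** path_len)


def f_grav(tfidf_i, tfidf_j, vec_i, vec_j):
    # Assumed form: TF-IDF values act as masses, d is the Euclidean distance.
    # Requires distinct vectors, otherwise d is zero.
    d = math.dist(vec_i, vec_j)
    return tfidf_i * tfidf_j / (d * d)


def extract_keywords(words, tfidf, vectors, path_len, t=2, eta=0.85, iters=50):
    # weight(Wi, Wj) = Dep(Wi, Wj) * f_grav(Wi, Wj), stored symmetrically.
    weight = {w: {} for w in words}
    for wi, wj in combinations(words, 2):
        w = dep(path_len[frozenset((wi, wj))]) * f_grav(
            tfidf[wi], tfidf[wj], vectors[wi], vectors[wj])
        weight[wi][wj] = weight[wj][wi] = w

    # Assumed importance score: damped, weight-normalized iteration.
    score = {w: 1.0 for w in words}
    for _ in range(iters):
        new = {}
        for wi in words:
            s = 0.0
            for wj, w_ij in weight[wi].items():
                s += w_ij / sum(weight[wj].values()) * score[wj]
            new[wi] = (1 - eta) + eta * s
        score = new

    # Keep the t highest-scoring words.
    return sorted(words, key=score.get, reverse=True)[:t]


# Hypothetical toy input: "storage" has the highest TF-IDF and sits between the others.
words = ["storage", "cloud", "file"]
tfidf = {"storage": 0.9, "cloud": 0.3, "file": 0.3}
vectors = {"storage": (0.0, 0.0), "cloud": (1.0, 0.0), "file": (-1.0, 0.0)}
path_len = {
    frozenset(("storage", "cloud")): 1,
    frozenset(("storage", "file")): 1,
    frozenset(("cloud", "file")): 2,
}
top = extract_keywords(words, tfidf, vectors, path_len, t=1)
```

In this toy input, "storage" is strongly tied to both other words, so it ends up with the highest score and is the single keyword returned.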

Further, the present invention uses one-hot representation to convert the keywords into word vectors. One-hot representation is a basic method of representing words as vectors and is similar in spirit to the bag-of-words model. A dictionary is built from all the words in the corpus, and each word in the dictionary is represented by a word vector whose dimension equals the size of the dictionary; only the dimension corresponding to the current word has the value 1, and all other dimensions are 0. The present invention therefore sets the dimensions corresponding to the keywords of the business description to 1 and the dimensions of all other words to 0, which yields the word-vector representation of the keywords.
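
A minimal Python sketch of this encoding is shown below. The helper names `build_vocab`, `one_hot`, and `keywords_vector` and the sample vocabulary are assumptions made for illustration.

```python
def build_vocab(words):
    """Map each distinct word to a fixed index, in first-seen order."""
    vocab = {}
    for w in words:
        vocab.setdefault(w, len(vocab))
    return vocab


def one_hot(word, vocab):
    """Vector with 1 at the word's dimension and 0 everywhere else."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


def keywords_vector(keywords, vocab):
    """Vector with 1 at every keyword's dimension and 0 for all other words."""
    vec = [0] * len(vocab)
    for w in keywords:
        if w in vocab:
            vec[vocab[w]] = 1
    return vec


vocab = build_vocab(["cloud", "storage", "file", "query", "report"])
single = one_hot("file", vocab)                       # [0, 0, 1, 0, 0]
combined = keywords_vector(["cloud", "query"], vocab)  # [1, 0, 0, 1, 0]
```

The vector length grows with the dictionary, so this representation is simple but sparse for large vocabularies.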

S3. Receive the query content input by the user, and calculate the similarity between the query content and the word vectors.

A preferred embodiment of the present invention calculates the similarity between the query content and the word vector with cosine similarity. Cosine similarity uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals: the closer the cosine is to 1, the closer the angle between the two vectors is to 0 degrees, and the more similar the two vectors are. The cosine similarity is calculated by the following formula:

where X denotes the word vector and Y denotes the query content. The cosine value ranges from -1 to 1. A value of -1 means the query content and the word vector point in exactly opposite directions, which is taken as a similarity of 0; a value of 1 means they point in exactly the same direction, which is taken as a similarity of 100%; a value of 0 means the two are independent, which is taken as moderate similarity or dissimilarity. The present invention obtains the similarity between the query content and the word vector from the cosine value.
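
The standard cosine similarity computation can be sketched in Python as follows. The function name `cosine_similarity` and the choice to return 0.0 for a zero vector are assumptions of this sketch; the text does not say how an all-zero vector should be handled.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # Assumption: a zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_x * norm_y)


same = cosine_similarity([1, 0, 1], [1, 0, 1])      # close to 1
orthogonal = cosine_similarity([1, 0], [0, 1])      # 0
opposite = cosine_similarity([1, 0], [-1, 0])       # -1
```

With vectors whose entries are only 0 or 1, the cosine can never be negative, so in practice the result falls between 0 and 1.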

S4. Select the corresponding business description according to the similarity, query the cloud storage for favorite files through a multi-strategy retrieval method, and return the query result to the user.

In a preferred embodiment of the present invention, the multi-strategy retrieval method includes the Levenshtein distance (LD) method. When the user inputs query content, it is compared, using the similarity calculation described above, with the business descriptions of the favorite files in cloud storage to determine whether they match. If there is a match, the favorite file is returned to the user directly. If there is no match, the similarity between the query content and the keywords of the business descriptions of the favorite files is calculated against a preset threshold of 0.8, and the favorite files whose business descriptions yield a similarity above the threshold are returned to the user as the query result.
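
The cascade of strategies can be sketched in Python as below. Several points are assumptions: the text does not define "match", so exact string equality is used; `retrieve`, `token_overlap`, and the sample index are hypothetical; the similarity function is passed in as a parameter, with a simple token-overlap ratio standing in for the keyword-vector cosine similarity; and the final fallback is likewise a parameter so that an edit-distance strategy can be plugged in.

```python
def token_overlap(a, b):
    """Stand-in similarity: shared tokens divided by all tokens (Jaccard)."""
    ta, tb = set(a.split()), set(b.split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(query, index, similarity, fallback, threshold=0.8):
    """index maps each business description to its list of favorite files."""
    # Strategy 1: direct match on the business description (assumed: equality).
    if query in index:
        return list(index[query])
    # Strategy 2: every description whose similarity exceeds the threshold.
    hits = [f for desc, files in index.items()
            if similarity(query, desc) > threshold
            for f in files]
    if hits:
        return hits
    # Strategy 3: hand over to the fallback strategy (e.g. edit distance).
    return fallback(query, index)


index = {
    "quarterly sales report": ["sales_q1.xlsx"],
    "employee onboarding guide": ["onboarding.pdf"],
}
no_fallback = lambda query, idx: []

exact = retrieve("quarterly sales report", index, token_overlap, no_fallback)
fuzzy = retrieve("sales report quarterly", index, token_overlap, no_fallback)
miss = retrieve("holiday schedule", index, token_overlap, no_fallback)
```

Ordering the strategies from cheapest to most expensive means the edit-distance computation runs only when the earlier strategies find nothing.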

Further, when none of the similarity results exceeds the preset threshold, the present invention uses the LD to calculate the similarity between the query content input by the user and the character strings in the business descriptions of the favorite files. In detail, the original character string in the query content is denoted m and the target character string in the business description of a favorite file is denoted n. The number L of deletion, insertion, and replacement edits needed to transform m into n is recorded and written lev_{m,n}(|m|, |n|), where |m| and |n| are the lengths of the strings m and n. The larger L is, the lower the similarity of the strings, so the present invention selects the favorite file with the smallest L as the query result and returns it to the user.
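
The Levenshtein fallback can be sketched in Python with the standard dynamic-programming algorithm. The function names `levenshtein` and `closest_description` and the sample strings are assumptions made for illustration.

```python
def levenshtein(m, n):
    """Minimum number of deletions, insertions, and replacements turning m into n."""
    # prev[j] holds the distance between the processed prefix of m and n[:j].
    prev = list(range(len(n) + 1))
    for i, cm in enumerate(m, start=1):
        curr = [i]
        for j, cn in enumerate(n, start=1):
            cost = 0 if cm == cn else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # replacement or match
        prev = curr
    return prev[-1]


def closest_description(query, descriptions):
    """Business description with the smallest edit count L from the query."""
    return min(descriptions, key=lambda d: levenshtein(query, d))


distance = levenshtein("kitten", "sitting")
best = closest_description("sales reprot", ["sales report", "onboarding guide"])
```

Keeping only two rows of the table keeps memory proportional to the length of n rather than to |m| times |n|.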

The present invention also provides a file query device. FIG. 2 is a schematic diagram of the internal structure of a file query device provided by an embodiment of the present invention.

In this embodiment, the file query device 1 may be a PC (personal computer), a terminal device such as a smartphone, tablet computer, or portable computer, or a server. The file query device 1 includes at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.

The memory 11 includes at least one type of readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, a magnetic disk, or an optical disc. In some embodiments the memory 11 may be an internal storage unit of the file query device 1, such as its hard disk. In other embodiments the memory 11 may be an external storage device of the file query device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the file query device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the file query device 1. The memory 11 can be used not only to store application software installed on the file query device 1 and various kinds of data, such as the code of the file query program 01, but also to temporarily store data that has been output or is about to be output.

In some embodiments the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, used to run program code stored in the memory 11 or to process data, for example to execute the file query program 01.

The communication bus 13 provides the connections and communication between these components.

The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface), and is typically used to establish a communication connection between the device 1 and other electronic devices.

Optionally, the device 1 may further include a user interface. The user interface may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (organic light-emitting diode) touch device, or the like. The display may also be called a display screen or display unit, and is used to display the information processed in the file query device 1 and to display a visual user interface.

FIG. 2 shows only the file query device 1 with the components 11-14 and the file query program 01. Those skilled in the art will understand that the structure shown in FIG. 2 does not limit the file query device 1, which may include fewer or more components than shown, combine certain components, or arrange the components differently.

In the embodiment of the device 1 shown in FIG. 2, the file query program 01 is stored in the memory 11, and the processor 12 implements the following steps when executing the file query program 01 stored in the memory 11:

Step 1: Acquire the favorite file set of the client, create a business description of the favorite file set in the file system, and store the favorite file set, once its business description has been created, in cloud storage.

Step 2: Extract keywords from the business description with a keyword extraction algorithm to obtain the keywords of the business description, convert the keywords into word vectors, and store the word vectors.

Perform word segmentation on the business description; calculate the dependency relevance of any two words W_i and W_j in the business description:

Step 3: Receive the query content input by the user, and calculate the similarity between the query content and the word vectors.

Step 4: Select the corresponding business description according to the similarity, query the cloud storage for favorite files through a multi-strategy retrieval method, and return the query result to the user.

Optionally, in other embodiments, the file query program may also be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to carry out the present invention. A module in the present invention is a series of computer program instruction segments capable of performing a specific function, and is used to describe the execution of the file query program in the file query device.

For example, FIG. 3 is a schematic diagram of the program modules of the file query program in an embodiment of the file query device of the present invention. In this embodiment, the file query program may be divided into a business description creation module 10, a keyword extraction module 20, a similarity calculation module 30, and a query module 40. By way of example:

The business description creation module 10 is used to acquire the favorite file set of the client, create a business description of the favorite file set in the file system, and store the favorite file set, once its business description has been created, in cloud storage.

The keyword extraction module 20 is used to extract keywords from the business description with a keyword extraction algorithm to obtain the keywords of the business description, convert the keywords into word vectors, and store the word vectors.

The similarity calculation module 30 is used to receive the query content input by the user and calculate the similarity between the query content and the word vectors.

The query module 40 is used to select the corresponding business description according to the similarity, query the cloud storage for favorite files through a multi-strategy retrieval method, and return the query result to the user.

The functions or operation steps implemented when the business description creation module 10, the keyword extraction module 20, the similarity calculation module 30, the query module 40, and the other program modules are executed are substantially the same as in the embodiments above and are not repeated here.

In addition, an embodiment of the present invention also provides a computer-readable storage medium storing a file query program that can be executed by one or more processors to implement the following operations:

The specific implementation of the computer-readable storage medium of the present invention is basically the same as the embodiments of the file query device and method above, and is not described again here.

It should be noted that the serial numbers of the above embodiments are for description only and do not indicate the relative merits of the embodiments. The terms "comprising", "including", and any variations of them are intended to cover a non-exclusive inclusion, so that a process, device, article, or method comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, device, article, or method. Without further limitation, an element qualified by the phrase "comprising a..." does not exclude the presence of additional identical elements in the process, device, article, or method that includes it.

From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better choice. On that understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, computer, server, network device, or the like) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise covered by the patent protection of the present invention.

Claims

1. A file query method, characterized in that the method comprises:

2. The file query method according to claim 1, characterized in that acquiring the collection file set of the client comprises:

3. The file query method according to claim 1, characterized in that performing keyword extraction on the business description through a keyword extraction algorithm comprises:

where Dep(W_i, W_j) denotes the dependency relatedness of the words W_i and W_j, len(W_i, W_j) denotes the length of the dependency path between the words W_i and W_j, and b is a hyperparameter;

where f_grav(W_i, W_j) denotes the gravitational attraction between the words W_i and W_j, tfidf(W_i) denotes the TF-IDF value of the word W_i, tfidf(W_j) denotes the TF-IDF value of the word W_j, TF denotes term frequency, IDF denotes inverse document frequency, and d is the Euclidean distance between the word vectors of the words W_i and W_j;

weight(W_i, W_j) = Dep(W_i, W_j) * f_grav(W_i, W_j)

The importance score of the word W_i is calculated in combination with the association strength:

where

is the set associated with the vertex W_i, and η is the damping coefficient;

4. The file query method according to claim 1, characterized in that the similarity between the query content and the word vector is calculated by the formula:

5. The file query method according to any one of claims 1 to 4, characterized in that querying the cloud storage for collection files through a multi-strategy retrieval method comprises:

6. A file query device, characterized in that the device comprises a memory and a processor, the memory stores a file query program executable on the processor, and the file query program, when executed by the processor, implements the following steps:

7. The file query device according to claim 6, characterized in that acquiring the collection file set of the client comprises:

8. The file query device according to claim 6, characterized in that performing keyword extraction on the business description through a keyword extraction algorithm comprises:

performing a word segmentation operation on the business description;

where

is the set associated with the vertex W_i, and η is the damping coefficient;

9. The file query device according to claim 6, characterized in that the similarity between the query content and the word vector is calculated by the formula:

10. A computer-readable storage medium, characterized in that a file query program is stored on the computer-readable storage medium, and the file query program is executable by one or more processors to implement the steps of the file query method according to any one of claims 1 to 5.
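
For illustration only, the keyword-weighting step recited in claims 3 and 8 may be sketched in Python as follows. Only the product weight(W_i, W_j) = Dep(W_i, W_j) * f_grav(W_i, W_j) is taken from the claim text. The closed forms of Dep and f_grav and the importance-score formula do not survive in this text, so the forms below are assumptions: Dep is taken to decay as 1 / b^len, f_grav is taken to follow a Newtonian analogy tfidf(W_i) * tfidf(W_j) / d^2, and the score is taken to be a TextRank-style iteration with damping coefficient η. All function names, default values, and the sample graph are hypothetical.

```python
import math


def dep(path_len, b=2.0):
    """Dependency relatedness Dep(W_i, W_j).

    ASSUMPTION: decays as 1 / b**len; the claim names only len and b.
    """
    return 1.0 / (b ** path_len)


def f_grav(tfidf_i, tfidf_j, vec_i, vec_j):
    """Attraction f_grav(W_i, W_j).

    ASSUMPTION: tfidf_i * tfidf_j / d**2, where d is the Euclidean
    distance between the two word vectors.
    """
    d = math.dist(vec_i, vec_j)
    if d == 0.0:
        raise ValueError("identical word vectors: distance is zero")
    return tfidf_i * tfidf_j / (d * d)


def weight(path_len, tfidf_i, tfidf_j, vec_i, vec_j, b=2.0):
    """Association strength as stated in the claim: Dep * f_grav."""
    return dep(path_len, b) * f_grav(tfidf_i, tfidf_j, vec_i, vec_j)


def importance_scores(weights, eta=0.85, iterations=50):
    """Importance score per word.

    ASSUMPTION: TextRank-style update with damping coefficient eta.
    weights[u][v] is the symmetric association strength between u and v.
    """
    scores = {word: 1.0 for word in weights}
    for _ in range(iterations):
        updated = {}
        for wi in weights:
            total = 0.0
            for wj, w_ij in weights[wi].items():
                # Normalize by the total strength attached to the neighbor.
                neighbor_sum = sum(weights[wj].values())
                if neighbor_sum > 0:
                    total += w_ij / neighbor_sum * scores[wj]
            updated[wi] = (1.0 - eta) + eta * total
        scores = updated
    return scores


# Hypothetical inputs: one word pair, then a three-word star graph.
w = weight(path_len=1, tfidf_i=0.5, tfidf_j=0.4,
           vec_i=(0.0, 0.0), vec_j=(3.0, 4.0))
graph = {
    "file": {"query": 1.0, "cloud": 1.0},
    "query": {"file": 1.0},
    "cloud": {"file": 1.0},
}
scores = importance_scores(graph)
```

In this sketch the word connected to the most neighbors ("file") ends up with the highest score, which matches the intent of selecting the highest-scoring words as keywords. `math.dist` requires Python 3.8 or later.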