Technical Field
The present application relates generally to the field of natural language processing, and in particular to a method and device for extracting multi-word units from a sentence, and to a method and device for training an artificial neural network for extracting multi-word units from a sentence.
Background
Classical natural language processing systems usually assume that each word is a single semantic unit, which does not cover the case of multi-word units. Multi-word units cross word boundaries and therefore require special interpretation. Identifying and extracting multi-word units is a major concern in this field and is also regarded as a bottleneck for further research. The multi-word unit is a common concept in natural language processing that lacks a precise definition. In general, a multi-word unit is a combination of two or more word units that co-occur with relatively high probability and that together carry complete semantics. Because multi-word units are quite common in natural language, their identification and extraction are very important. Since collocation knowledge is insufficient and the information about a word combination is scattered among its individual segmented words, it is very difficult to recombine the separated words according to their original meaning into an independent semantic unit and thereby recover the complete original semantics, especially for languages such as Chinese in which there are no delimiters between characters.
The identification and extraction of multi-word units can be widely applied to machine translation, efficient syntactic analysis, optimized information retrieval, word sense disambiguation, and the like. Methods commonly used at present to identify and extract multi-word units include ranking methods, the Local Maxima method, and Conditional Random Fields. Feature values used in identifying and extracting multi-word units include mutual information between segmented words, t-score, entropy, and co-occurrence frequency. In addition, identifying and extracting multi-word units also involves the use of word segmentation tools, word-form tagging tools, part-of-speech tagging tools, stop-word lists, and the like.
Prior-art methods for identifying and extracting multi-word units basically proceed as follows: perform word segmentation and/or part-of-speech tagging on the target sentence; compute corresponding feature values, such as frequency, segment co-occurrence rate, and mutual information, from the segmentation and/or tagging results; and use a specific algorithm or model to screen candidate multi-word units according to the computed feature values, so as to obtain relatively accurate multi-word units. However, prior-art methods cannot guarantee the accuracy of the word segmentation and/or part-of-speech tagging of the target sentence and thus often introduce erroneous information, so that the training data themselves contain contradictory examples, or the feature values used in practical applications deviate from the actual situation.
A multi-word unit is a different concept from a phrase or a chunk, so methods for identifying and extracting multi-word units differ from those for phrases or chunks. Specifically, some prepositional phrases do not carry complete semantics, so applying phrase identification and extraction methods to multi-word units does not yield good results. In addition, chunks are defined at the syntactic level, so identifying and extracting chunks requires considering the syntactic and part-of-speech information of the constituent words, with no strict requirement on semantic completeness; applying chunk identification and extraction methods to multi-word units is therefore also infeasible.
Therefore, it is desirable to provide a method and device for extracting multi-word units from a sentence that can improve the accuracy and efficiency of multi-word unit identification and extraction.
Summary of the Invention
In the following, a brief overview of the present invention is given in order to provide a basic understanding of some aspects of the invention. It should be understood that this overview is not exhaustive; it is not intended to identify key or critical parts of the invention, nor to delimit its scope. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description discussed later.
The present invention applies an artificial neural network to the identification and extraction of multi-word units. An artificial neural network is an algorithmic model that simulates the behavioral characteristics of animal neural networks to perform distributed, parallel information processing. Depending on the complexity of the system, an artificial neural network processes information by adjusting the interconnections among a large number of internal nodes. An artificial neural network comprises a large number of nodes and the connections among them. Each node represents a particular output function, and each connection carries a weighting value, called a weight, which serves as the memory of the network. The output of an artificial neural network varies with its connection pattern, weight values, and output functions.
According to an embodiment of the present invention, a method for extracting multi-word units from a sentence is provided, comprising: for each of a plurality of segment blocks obtained by segmenting the sentence, acquiring one or more linguistic features of the segmented words in the segment block as feature quantities; inputting the feature quantities into an artificial neural network as its parameters; using the artificial neural network to compute a first likelihood that a segmented word in each segment block is part of a multi-word unit and a second likelihood that the segmented word is not part of a multi-word unit, and judging from the first and second likelihoods whether the segmented word is part of a multi-word unit; and extracting two or more adjacent segmented words judged to be parts of a multi-word unit, so as to form the multi-word unit. The method further comprises: acquiring the judgment result of the segment block preceding and adjacent to the current segment block as feedback information, and using the feedback information as an additional feature quantity of the segmented words in the current segment block.
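The per-block classification loop with feedback can be sketched as follows. This is a minimal sketch, not the claimed implementation: `score_block` is a hypothetical stand-in for the trained network (here it simply treats content words as likely members), and the feature encoding is illustrative.

```python
# Minimal sketch of the extraction loop described above. The trained
# network is stood in for by `score_block`, a hypothetical function
# returning (first likelihood, second likelihood) for a block.

def score_block(features):
    # Hypothetical stand-in for the ANN forward pass: nouns and verbs
    # are scored as likely parts of a multi-word unit.
    p_in = 0.9 if features["pos"] in {"noun", "verb"} else 0.1
    return p_in, 1.0 - p_in

def extract_mwus(blocks):
    """blocks: list of per-segment-block feature dicts, in sentence order."""
    flags = []
    feedback = 0  # judgment on the previous block, fed back as a feature
    for block in blocks:
        features = dict(block, prev_in_mwu=feedback)
        p_in, p_not = score_block(features)
        in_mwu = p_in > p_not          # compare the two likelihoods
        flags.append(in_mwu)
        feedback = int(in_mwu)
    # group runs of two or more adjacent flagged words into MWUs
    mwus, run = [], []
    for block, flag in zip(blocks, flags):
        if flag:
            run.append(block["form"])
        else:
            if len(run) >= 2:
                mwus.append("".join(run))
            run = []
    if len(run) >= 2:
        mwus.append("".join(run))
    return mwus

blocks = [
    {"form": "最初", "pos": "adjective"},
    {"form": "施用", "pos": "verb"},
    {"form": "引", "pos": "noun"},
    {"form": "物", "pos": "noun"},
    {"form": "的", "pos": "preposition"},
    {"form": "步骤", "pos": "noun"},
]
print(extract_mwus(blocks))  # → ['施用引物']
```

With these toy scores, the adjacent flagged words "施用", "引", "物" are merged into one multi-word unit, while the single flagged word "步骤" is discarded because a multi-word unit needs at least two members.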
According to the above method for extracting multi-word units from a sentence, the method further comprises: sequentially combining N adjacent segmented words of the sentence into N-tuples to form the segment blocks, where N is a natural number greater than or equal to 2.
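Forming segment blocks as N-tuples amounts to a sliding window over the segmented sentence; a minimal sketch (the function name is illustrative):

```python
# Sliding window combining N adjacent segmented words into N-tuples,
# each tuple serving as one segment block (here N = 2).

def make_ntuples(tokens, n=2):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["最初", "施用", "引", "物", "的", "步骤"]
print(make_ntuples(tokens))
# [('最初', '施用'), ('施用', '引'), ('引', '物'), ('物', '的'), ('的', '步骤')]
```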
According to the above method for extracting multi-word units from a sentence, the method further comprises: replacing the word forms of the segmented words in an N-tuple with their corresponding parts of speech, so as to obtain a generalized N-tuple mixing word forms and parts of speech; and, according to the word-form and part-of-speech features of the segmented words in the generalized N-tuple, obtaining from a part-of-speech fault-tolerance template the extraction probability that a segmented word in the generalized N-tuple is part of a multi-word unit as part-of-speech fault-tolerance information, and using the part-of-speech fault-tolerance information as an additional feature quantity of the segmented words in the N-tuple.
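The generalization step and template lookup can be sketched as follows. The template contents and probabilities below are illustrative assumptions, not values from the source.

```python
# Minimal sketch of generalized N-tuples and the fault-tolerance lookup.

def generalize(ntuple, pos_tags):
    """Yield variants of the N-tuple with each word form replaced by its
    part of speech, producing tuples that mix forms and POS tags."""
    for i in range(len(ntuple)):
        variant = list(ntuple)
        variant[i] = pos_tags[i]
        yield tuple(variant)

# Hypothetical part-of-speech fault-tolerance template: generalized
# tuple -> extraction probability that its words belong to an MWU.
template = {
    ("施用", "NOUN"): 0.8,
    ("VERB", "引"): 0.7,
}

ntuple, tags = ("施用", "引"), ("VERB", "NOUN")
probs = [template.get(v, 0.0) for v in generalize(ntuple, tags)]
print(probs)  # → [0.7, 0.8], one fault-tolerance feature per variant
```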
According to another embodiment of the present invention, a device for extracting multi-word units from a sentence is provided, comprising: a linguistic feature acquisition unit that, for each of a plurality of segment blocks obtained by segmenting the sentence, acquires one or more linguistic features of the segmented words in the segment block as feature quantities; an input unit that inputs the feature quantities into an artificial neural network as its parameters; a judging unit that uses the artificial neural network to compute a first likelihood that a segmented word in each segment block is part of a multi-word unit and a second likelihood that the segmented word is not part of a multi-word unit, and judges from the first and second likelihoods whether the segmented word is part of a multi-word unit; and an extracting unit that extracts two or more adjacent segmented words judged to be parts of a multi-word unit, so as to form the multi-word unit. The device further comprises: a feedback information acquisition unit that acquires the judgment result of the segment block preceding and adjacent to the current segment block as feedback information, and uses the feedback information as an additional feature quantity of the current segment block.
According to the above device for extracting multi-word units from a sentence, the device further comprises: a combining unit that sequentially combines N adjacent segmented words of the sentence into N-tuples to form the segment blocks, where N is a natural number greater than or equal to 2.
According to the above device for extracting multi-word units from a sentence, the device further comprises: a generalization unit that replaces the word forms of the segmented words in an N-tuple with their corresponding parts of speech, so as to obtain a generalized N-tuple mixing word forms and parts of speech; and a part-of-speech fault-tolerance information acquisition unit that, according to the word-form and part-of-speech features of the segmented words in the generalized N-tuple, obtains from a part-of-speech fault-tolerance template the extraction probability that a segmented word in the generalized N-tuple is part of a multi-word unit as part-of-speech fault-tolerance information, and uses the part-of-speech fault-tolerance information as an additional feature quantity of the segmented words in the N-tuple.
According to yet another embodiment of the present invention, a method for training an artificial neural network is provided, the artificial neural network being used to extract multi-word units from a sentence. The method comprises: for each of a plurality of segment blocks obtained by segmenting each training sentence, acquiring one or more linguistic features of the segmented words in the segment block as feature quantities, wherein the multi-word units in the training sentences have been annotated; inputting the feature quantities into the artificial neural network as its parameters; using the artificial neural network to compute a first likelihood that a segmented word in each segment block is part of a multi-word unit and a second likelihood that the segmented word is not part of a multi-word unit, and judging from a comparison of the first and second likelihoods whether the segmented word is part of a multi-word unit; and training the artificial neural network according to the judgment results and the annotation results. The method further comprises: acquiring the judgment result of the segment block preceding and adjacent to the current segment block as feedback information, and using the feedback information as an additional feature quantity of the segmented words in the current segment block.
According to the above method for training an artificial neural network, the method further comprises: sequentially combining N adjacent segmented words of each training sentence into N-tuples to form the segment blocks, where N is a natural number greater than or equal to 2.
According to the above method for training an artificial neural network, the method further comprises: replacing the word forms of the segmented words in an N-tuple with their corresponding parts of speech, so as to obtain a generalized N-tuple mixing word forms and parts of speech; and, according to the annotation results and the word-form and part-of-speech features of the segmented words in the generalized N-tuple, computing the extraction probability that a segmented word in the generalized N-tuple is part of a multi-word unit as part-of-speech fault-tolerance information, so as to generate the part-of-speech fault-tolerance template.
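One natural way to compute such extraction probabilities from annotated data is relative-frequency counting; the following is a sketch under that assumption (the estimator and the sample tuples are illustrative, not the source's actual procedure).

```python
# Minimal sketch of generating the fault-tolerance template from
# annotated training data: the extraction probability of a generalized
# tuple is estimated as the fraction of its occurrences whose words
# were annotated as part of a multi-word unit.

from collections import Counter

def build_template(samples):
    """samples: iterable of (generalized_tuple, in_mwu) pairs."""
    seen, positive = Counter(), Counter()
    for gtuple, in_mwu in samples:
        seen[gtuple] += 1
        if in_mwu:
            positive[gtuple] += 1
    return {g: positive[g] / seen[g] for g in seen}

samples = [
    (("VERB", "引"), True),
    (("VERB", "引"), True),
    (("VERB", "引"), False),
    (("的", "NOUN"), False),
]
template = build_template(samples)
print(template[("VERB", "引")])  # 2 of 3 occurrences were inside an MWU
```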
According to still another embodiment of the present invention, a device for training an artificial neural network is provided, the artificial neural network being used to extract multi-word units from a sentence. The device comprises: a linguistic feature acquisition means that, for each of a plurality of segment blocks obtained by segmenting each training sentence, acquires one or more linguistic features of the segmented words in the segment block as feature quantities, wherein the multi-word units in the training sentences have been annotated; an input means that inputs the feature quantities into the artificial neural network as its parameters; a judging means that uses the artificial neural network to compute a first likelihood that a segmented word in each segment block is part of a multi-word unit and a second likelihood that the segmented word is not part of a multi-word unit, and judges from a comparison of the first and second likelihoods whether the segmented word is part of a multi-word unit; and a training means that trains the artificial neural network according to the judgment results and the annotation results. The device further comprises: a feedback information acquisition means that acquires the judgment result of the segment block preceding and adjacent to the current segment block as feedback information, and uses the feedback information as an additional feature quantity of the segmented words in the current segment block.
According to the present invention, applying an artificial neural network with a feedback configuration to the identification and extraction of multi-word units can improve the accuracy and efficiency of that identification and extraction.
Brief Description of the Drawings
The present invention can be better understood by reference to the description given below in conjunction with the accompanying drawings, in which the same or similar reference numerals are used throughout to designate the same or similar parts. The drawings, together with the following detailed description, are incorporated in and form a part of this specification and serve to further illustrate preferred embodiments of the invention and to explain its principles and advantages. In the drawings:
Fig. 1 is a schematic flowchart illustrating a method for extracting multi-word units from a sentence according to an embodiment of the present invention;
Fig. 2 is a schematic diagram illustrating the extraction of multi-word units from a sentence using an artificial neural network with a feedback configuration according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart illustrating a method for extracting multi-word units from a sentence using N-tuples according to an embodiment of the present invention;
Fig. 4 is a schematic diagram illustrating the extraction of multi-word units from a sentence using N-tuples according to an embodiment of the present invention;
Fig. 5 is a schematic flowchart illustrating a method for obtaining word-form extraction probabilities and/or part-of-speech extraction probabilities using N-tuples according to an embodiment of the present invention;
Fig. 6 is a schematic flowchart illustrating a method for part-of-speech fault tolerance using N-tuples according to an embodiment of the present invention;
Fig. 7 is a schematic diagram illustrating part-of-speech fault tolerance using N-tuples according to an embodiment of the present invention;
Fig. 8 is a schematic block diagram illustrating a device for extracting multi-word units from a sentence according to an embodiment of the present invention;
Fig. 9 is a schematic block diagram illustrating a device for extracting multi-word units from a sentence according to another embodiment of the present invention;
Fig. 10 is a schematic block diagram illustrating a device for extracting multi-word units from a sentence according to another embodiment of the present invention;
Fig. 11 is a schematic block diagram illustrating a device for extracting multi-word units from a sentence according to another embodiment of the present invention;
Fig. 12 is a schematic flowchart illustrating a method for training an artificial neural network for extracting multi-word units from a sentence according to an embodiment of the present invention;
Fig. 13 is a schematic flowchart illustrating a method for training an artificial neural network for extracting multi-word units from a sentence using N-tuples according to an embodiment of the present invention;
Fig. 14 is a schematic flowchart illustrating a method for generating word-form templates and/or part-of-speech templates using N-tuples according to an embodiment of the present invention;
Fig. 15 is a schematic flowchart illustrating a method for generating a part-of-speech fault-tolerance template using N-tuples according to an embodiment of the present invention;
Fig. 16 is a schematic diagram illustrating the generation of a part-of-speech fault-tolerance template using N-tuples according to an embodiment of the present invention;
Fig. 17 is a schematic block diagram illustrating a device for training an artificial neural network for extracting multi-word units from a sentence according to an embodiment of the present invention;
Fig. 18 is a schematic block diagram illustrating a device for training an artificial neural network for extracting multi-word units from a sentence according to another embodiment of the present invention;
Fig. 19 is a schematic block diagram illustrating a device for training an artificial neural network for extracting multi-word units from a sentence according to another embodiment of the present invention;
Fig. 20 is a schematic block diagram illustrating a device for training an artificial neural network for extracting multi-word units from a sentence according to another embodiment of the present invention; and
Fig. 21 is a schematic block diagram illustrating an information processing device that can be used to implement embodiments of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described below with reference to the accompanying drawings. For the sake of clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that in developing any such actual implementation, many implementation-specific decisions must be made in order to achieve the developer's specific goals, and that these decisions may vary from one implementation to another.
It should also be noted that, to avoid obscuring the present invention with unnecessary detail, the drawings show only those device structures closely related to the solution according to the present invention, and other details of little relevance to the present invention are omitted.
A method for extracting multi-word units from a sentence according to an embodiment of the present invention will be described below with reference to Figs. 1 and 2. Fig. 1 is a schematic flowchart illustrating a method for extracting multi-word units from a sentence according to an embodiment of the present invention, and Fig. 2 is a schematic diagram illustrating the extraction of multi-word units from a sentence using an artificial neural network with a feedback configuration according to an embodiment of the present invention.
As shown in Fig. 1, the process starts at S100 and then proceeds to S102.
At S102, for each of a plurality of segment blocks obtained by segmenting the sentence, one or more linguistic features of the segmented words in the segment block are acquired as feature quantities.
The sentences in the corpus are segmented, so that each sentence is divided into a plurality of segment blocks, each of which may contain at least one segmented word. The segmented words in each of the resulting segment blocks are processed in turn, following their word order in the original sentence. For example, a segmented word may be processed to obtain one or more of its linguistic features, such as its part of speech, its word form, its sequence number, or its occurrence probability. Those skilled in the art will understand that the linguistic features of a segmented word are not limited to the examples listed above. After the linguistic features of the segmented words are acquired, they can be used as feature quantities for subsequent processing.
For example, the sentence "最初施用引物的步骤" ("the step of initially applying primers") is segmented to give the result "最初/施用/引/物/的/步骤"; that is, the sentence is divided into the segment blocks {"最初", "施用", "引", "物", "的", "步骤"}, each containing one segmented word. The segmented words in these blocks are then processed in the order "最初" → "施用" → "引" → "物" → "的" → "步骤". For example, the segmented words may be processed to obtain their respective parts of speech: {"最初" adjective, "施用" verb, "引" noun, "物" noun, "的" preposition, "步骤" noun}. Those skilled in the art will understand that other linguistic features of these segmented words can also be obtained, which are not described again here.
After S102, the process proceeds to S104. At S104, the feature quantities are input into the artificial neural network as its parameters.
As shown in Fig. 2, each circle in the artificial neural network 205 represents one or more neurons that process the information indicated inside the circle. The neurons of the artificial neural network 205 are organized into three layers: an input layer 202, a hidden layer 203, and an output layer 204. The value of a neuron in a later layer is computed from the values of the neurons in the preceding layer. The black arrows in Fig. 2 indicate the direction of information flow in the artificial neural network 205; two adjacent layers of neurons are fully connected, and information flows from the earlier layer to the later layer. Those skilled in the art will understand that although only one hidden layer 203 is shown in Fig. 2, the hidden layer 203 may comprise two or more layers according to actual needs.
As shown in Fig. 2, in the input layer 202 of the artificial neural network 205, the t feature quantities {feature quantity 1, feature quantity 2, …, feature quantity i, …, feature quantity t-1, feature quantity t} of the segmented word currently being processed are input into the artificial neural network 205 as its parameters, where i and t are natural numbers greater than or equal to 1 and 1 ≤ i ≤ t. The one or more linguistic features of the segmented word extracted in step S102 above, such as its part of speech, word form, sequence number, or occurrence probability, may be used as these feature quantities.
Again taking the sentence "最初施用引物的步骤" as an example, for the segmented word "最初", its part of speech "adjective", its word form "最初", its sequence number "1", and its occurrence probability "0.43", for instance, may be acquired as its feature quantities and input into the artificial neural network 205 as parameters of the artificial neural network 205.
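Assembling these feature quantities into the network's input vector might look as follows. The concrete encodings (POS index, hashed word-form bucket, normalization) are illustrative assumptions, not the source's actual scheme.

```python
# Minimal sketch of building the input-layer feature vector for one
# segmented word; each element is one feature quantity.

POS_INDEX = {"adjective": 0, "verb": 1, "noun": 2, "preposition": 3}

def features(form, pos, seq_no, occ_prob, prev_in_mwu):
    return [
        POS_INDEX[pos],                 # part of speech
        hash(form) % 1000 / 1000.0,     # word form, hashed to a bucket
        seq_no,                         # sequence number in the sentence
        occ_prob,                       # occurrence probability
        prev_in_mwu,                    # feedback from the previous block
    ]

vec = features("最初", "adjective", 1, 0.43, 0)
print(len(vec))  # one value per feature quantity fed to the input layer
```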
After S104, the process proceeds to S106. At S106, the artificial neural network is used to compute a first likelihood that the segmented word in each segment block is part of a multi-word unit and a second likelihood that the segmented word is not part of a multi-word unit, and whether the segmented word is part of a multi-word unit is judged from the first and second likelihoods.
After the feature quantities are input into the artificial neural network 205 as its parameters, the artificial neural network 205 determines the value of the current neuron according to the following formula:
f(x) = K((∑_i w_i × g_i(x)) + biasW + biasV)
Here, K denotes the activation function. w_i denotes the weight between the current neuron and the i-th neuron in the previous layer, represented by a black line in FIG. 2. g_i(x) denotes the values of the neurons in the previous layer that are connected to the current neuron by black lines. biasW and biasV denote the bias weight and the bias value of the current neuron, respectively. Those skilled in the art should understand that the above activation function and the formula used to determine the value of the current neuron are merely exemplary; other forms of activation functions, or other formulas for determining the value of the current neuron, may also be used.
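The neuron update above can be sketched in Python as follows. This is a minimal illustration rather than the embodiment's implementation: the source leaves the concrete activation function open, so the sigmoid used here, as well as the names `neuron_value`, `weights`, and so on, are assumptions for illustration only.

```python
import math

def sigmoid(x):
    # One common choice for the activation function K; the embodiment
    # does not fix a specific function, so sigmoid is an assumption here.
    return 1.0 / (1.0 + math.exp(-x))

def neuron_value(weights, prev_values, bias_w, bias_v, activation=sigmoid):
    """Compute f(x) = K((sum_i w_i * g_i(x)) + biasW + biasV).

    weights     -- w_i, one weight per connection from the previous layer
    prev_values -- g_i(x), the values of the connected previous-layer neurons
    bias_w      -- biasW, the neuron's bias weight
    bias_v      -- biasV, the neuron's bias value
    """
    weighted_sum = sum(w * g for w, g in zip(weights, prev_values))
    return activation(weighted_sum + bias_w + bias_v)
```

For instance, with weights [0.5, -0.25], previous-layer values [1.0, 2.0], biasW = 0.1, and biasV = 0.2, the pre-activation sum is 0.3 and the neuron's value is sigmoid(0.3).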
In the artificial neural network 205 shown in FIG. 2, the values of the neurons in the input layer 202 are the values of the feature quantities themselves, and each black line represents a specific weight. Except for the neurons in the input layer 202, the neurons in the hidden layer 203 and the output layer 204 each have a bias weight and a bias value.
As shown in FIG. 2, the output layer 204 of the artificial neural network 205 includes two neurons: a neuron 206 representing the first possibility that the currently processed word segment is part of a multi-word unit, and a neuron 207 representing the second possibility that the currently processed word segment is not part of a multi-word unit. Specifically, the value of the neuron 206 represents the possibility or probability, calculated by the artificial neural network 205, that the currently processed word segment is part of a multi-word unit. For example, if the value of the neuron 206 is 0.9, the artificial neural network 205 has determined through calculation that the possibility or probability that the currently processed word segment is part of a multi-word unit is 0.9. Similarly, the value of the neuron 207 represents the possibility or probability, calculated by the artificial neural network 205, that the currently processed word segment is not part of a multi-word unit. For example, if the value of the neuron 207 is 0.6, the artificial neural network 205 has determined through calculation that the possibility or probability that the currently processed word segment is not part of a multi-word unit is 0.6.
After the first possibility represented by the value of the neuron 206 and the second possibility represented by the value of the neuron 207 are calculated, the first possibility and the second possibility may be compared, as shown at 208 in FIG. 2. If the first possibility is greater than or equal to the second possibility, the currently processed word segment is judged to be part of a multi-word unit, as shown at 210 in FIG. 2. If the first possibility is less than the second possibility, the currently processed word segment is judged not to be part of a multi-word unit, as shown at 209 in FIG. 2. For example, for the currently processed word segment, if the first possibility represented by the value of the neuron 206 is 0.9 and the second possibility represented by the value of the neuron 207 is 0.6, then since the first possibility 0.9 is greater than the second possibility 0.6, the currently processed word segment is judged to be part of a multi-word unit. Then, at 211 in FIG. 2, the sequence number n of the word segment may be incremented by 1 to obtain the word segment with sequence number n+1, so that the word segment with sequence number n+1 can be processed.
After S106, the process proceeds to S108. In S108, two or more adjacent word segments that have been judged to be part of a multi-word unit are extracted to form a multi-word unit.
Taking the sentence “最初施用引物的步骤” as an example again, among the word segments {“最初”, “施用”, “引”, “物”, “的”, “步骤”} in the word-segmentation blocks obtained by segmenting the sentence, suppose that the word segments “引” and “物” are judged to be part of a multi-word unit. Since “引” and “物” are two adjacent word segments, they are extracted to form the multi-word unit “引物” (“primer”). If more than two adjacent word segments are all judged to be part of a multi-word unit, those word segments are likewise extracted to form a multi-word unit.
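The comparison of the two possibilities (S106) followed by the extraction of adjacent flagged word segments (S108) can be sketched as follows. The function name `extract_multiword_units` and the probability lists are hypothetical, and the sketch assumes the two possibilities have already been computed for every word segment.

```python
def extract_multiword_units(tokens, first_probs, second_probs):
    """Flag each word segment as part of a multi-word unit when the first
    possibility is greater than or equal to the second (step S106), then
    join every run of two or more adjacent flagged segments (step S108)."""
    flags = [p1 >= p2 for p1, p2 in zip(first_probs, second_probs)]
    units, run = [], []
    for token, flagged in zip(tokens, flags):
        if flagged:
            run.append(token)
        else:
            if len(run) >= 2:  # a single flagged segment does not form a unit
                units.append("".join(run))
            run = []
    if len(run) >= 2:          # flush a run that reaches the end of the sentence
        units.append("".join(run))
    return units
```

With the example above, where only “引” (0.9 vs 0.6) and “物” are flagged, the function returns the single unit “引物”.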
After S108, the process proceeds to S110. In S110, the judgment result of the previous word-segmentation block adjacent to the current word-segmentation block is obtained as feedback information, and the feedback information is also used as a feature quantity of the word segments in the current word-segmentation block.
As shown in FIG. 2, suppose that n, n+1, and so on denote the sequence numbers of the word-segmentation blocks being processed. After the block with sequence number n has been processed, the sequence number is incremented by 1 so that the next block (i.e., the block with sequence number n+1) is processed. At this point, the block with sequence number n+1 becomes the current block, and the block with sequence number n is the previous block adjacent to the current block. Because the previous block with sequence number n has already been processed, the judgment result of whether the word segments in that block are part of a multi-word unit has already been obtained. Therefore, as shown in FIG. 2, the judgment result of the previous block with sequence number n can be fed back to the input layer 202 of the artificial neural network 205 as feedback information, and when the current block with sequence number n+1 is processed, this feedback information is also input into the artificial neural network 205 as a feature quantity of the word segments in the current block. In other words, the judgment result of the previous block with sequence number n participates in the judgment processing of the current block with sequence number n+1.
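The feedback configuration described above can be sketched as a loop in which the judgment for block n is appended to the feature quantities of block n+1. The callable `network`, assumed to return the two possibilities, stands in for the trained artificial neural network 205; the name `label_with_feedback` and the neutral initial feedback value are illustrative assumptions.

```python
def label_with_feedback(feature_blocks, network):
    """Process word-segmentation blocks in order; the judgment result for
    block n becomes an extra feature quantity of block n+1 (steps S110/211)."""
    results = []
    prev_judgment = 0.0  # neutral feedback before the first block (assumption)
    for features in feature_blocks:
        # network returns (first possibility, second possibility)
        p_in, p_out = network(features + [prev_judgment])
        is_part = p_in >= p_out
        results.append(is_part)
        prev_judgment = 1.0 if is_part else 0.0
    return results
```

Because each judgment feeds into the next block's features, the network's decision for block n+1 can depend on its decision for block n, which is exactly the feedback behavior the embodiment relies on.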
Since the artificial neural network 205 has a feedback configuration, that is, when judging whether a word segment in the current word-segmentation block is part of a multi-word unit it also considers whether the word segments in the previous adjacent word-segmentation block are part of a multi-word unit, the accuracy and efficiency with which the artificial neural network 205 judges whether a word segment is part of a multi-word unit can be improved to a large extent.
Finally, the process ends at S112.
According to the method of this embodiment, by applying an artificial neural network with a feedback configuration to the identification and extraction of multi-word units, the accuracy and efficiency of identifying and extracting multi-word units can be improved.
A method of extracting multi-word units in a sentence using N-tuples according to an embodiment of the present invention will be described below with reference to FIG. 3 and FIG. 4. FIG. 3 is a schematic flowchart illustrating the method of extracting multi-word units in a sentence using N-tuples according to an embodiment of the present invention, and FIG. 4 is a schematic diagram illustrating the extraction of multi-word units in a sentence using N-tuples according to an embodiment of the present invention.
As shown in FIG. 3, the process starts at S300. Next, the process proceeds to S302.
In S302, adjacent N word segments in the sentence are successively combined into N-tuples to form word-segmentation blocks, where N is a natural number greater than or equal to 2.
Adjacent N word segments in the sentence may be combined into an N-tuple to form a word-segmentation block, and subsequent processing is performed in units of N-tuples. For example, the two word segments immediately to the left and right of the current word segment may be combined with the current word segment into a triple. For a word segment at the beginning of the sentence, the first element of the triple is empty; for a word segment at the end of the sentence, the last element of the triple is empty.
Taking the sentence “最初施用引物的步骤” as an example again, as shown by the black boxes in FIG. 4, the word segments “最初” and “施用” in the sentence may be combined into the triple <NULL, 最初, 施用>, the word segments “最初”, “施用”, and “引” may be combined into the triple <最初, 施用, 引>, ..., and the word segments “的” and “步骤” may be combined into the triple <的, 步骤, NULL>, where NULL denotes an empty element. It is easy to understand that a triple here is an example of a word-segmentation block containing three word segments.
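The triple formation above can be sketched as a sliding window with empty boundary elements. The function name `make_ntuples` and the use of Python's `None` to play the role of NULL are illustrative assumptions.

```python
def make_ntuples(tokens, n=3):
    """Slide a window of size n over the word segments (step S302).
    Positions before the first segment and after the last segment are
    filled with None, corresponding to NULL in the embodiment."""
    pad = (n - 1) // 2
    padded = [None] * pad + tokens + [None] * pad
    return [tuple(padded[i:i + n]) for i in range(len(tokens))]
```

Applied to the six word segments of the example sentence, this yields the six triples listed above, from <NULL, 最初, 施用> through <的, 步骤, NULL>.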
After the N-tuples are determined, the linguistic features of each element in an N-tuple can be obtained. For example, a part-of-speech analysis tool can be used to obtain the part of speech of each element in the N-tuple; for instance, the Stanford part-of-speech tagger may be used. As shown in FIG. 4, for the triple <最初, 施用, 引>, the part of speech of the first element “最初” is obtained as the adjective JJ, the part of speech of the second element “施用” as the verb VBG, and the part of speech of the third element “引” as the noun NN. In addition, corresponding tools may also be used to obtain other linguistic features of each element in the N-tuple, which will not be described in detail here.
After the linguistic features of each element in the N-tuple are obtained, each obtained linguistic feature may be used as an attribute of that element. For example, as shown in FIG. 4, for each element in the N-tuple, a total of m attributes {attribute 1, attribute 2, attribute 3, ..., attribute m} are listed, where m is a natural number greater than or equal to 1. The m attributes may be, for example, the part of speech of the word segment, the word form of the word segment, the sequence number of the word segment, or the occurrence probability of the word segment, but are not limited thereto. For example, for the first element “最初” in the triple <最初, 施用, 引>, the value of its attribute 1 may be obtained as “1”, the value of attribute 2 as “2”, the value of attribute 3 as “23”, ..., and the value of attribute m as “false”.
The m attributes of each element in an N-tuple may be input, N-tuple by N-tuple, into the artificial neural network (ANN) 205 as feature quantities for calculation, so as to judge whether the element is part of a multi-word unit. The specific judgment process and subsequent processing are similar to the processing of steps S106 to S110 in FIG. 1, differing only in the number of word segments contained in the word-segmentation block, so the details are not repeated here. In FIG. 4, a cross indicates that the corresponding element is judged not to be part of a multi-word unit, and a check mark indicates that the corresponding element is judged to be part of a multi-word unit. Two or more consecutive check marks represent a complete multi-word unit. As shown in FIG. 4, since the element “引” corresponds to a check mark, the element “物” also corresponds to a check mark, and the elements “引” and “物” are adjacent to each other, “引物” is extracted as a multi-word unit.
Finally, the process ends at S304.
According to the method of this embodiment, processing can be performed in units of N-tuples to extract the multi-word units in a sentence, thereby further improving the accuracy and efficiency of identifying and extracting multi-word units.
A method of obtaining word-form extraction probabilities and/or part-of-speech extraction probabilities using N-tuples according to an embodiment of the present invention will be described below with reference to FIG. 5. FIG. 5 is a schematic flowchart illustrating the method of obtaining word-form extraction probabilities and/or part-of-speech extraction probabilities using N-tuples according to an embodiment of the present invention.
As shown in FIG. 5, the process starts at S500. Next, the process proceeds to S502.
In step S502, according to the word-form features of the word segments in an N-tuple, a word-form extraction probability that the word segments in the N-tuple are part of a multi-word unit is obtained from a word-form template, and the word-form extraction probability is also used as a feature quantity of the word segments in the N-tuple.
For example, for the triple <最初, 施用, 引>, the word-form feature of the word segments in the triple is “最初, 施用, 引”. The corresponding word form can be looked up in the word-form template according to this word-form feature, so as to obtain the word-form extraction probability corresponding to that word form; this probability represents the probability that the word segments “最初”, “施用”, or “引” in the triple are part of a multi-word unit. The obtained word-form extraction probability may then also be input into the artificial neural network 205 as a feature quantity of the word segments in the triple. If no word-form extraction probability is found, a preset default probability is used instead. The word-form template stores in advance the word forms of N-tuples and their corresponding word-form extraction probabilities, each of which represents the probability that the word segments in the corresponding N-tuple are part of a multi-word unit. Those skilled in the art will understand that the word-form template may be preset. Alternatively, the word-form template may also be generated by training the artificial neural network 205. As a non-limiting example, how to generate the word-form template by training the artificial neural network 205 will be described in detail below.
After S502, the process proceeds to S504. In S504, according to the part-of-speech features of the word segments in the N-tuple, a part-of-speech extraction probability that the word segments in the N-tuple are part of a multi-word unit is obtained from a part-of-speech template, and the part-of-speech extraction probability is also used as a feature quantity of the word segments in the N-tuple.
Similarly, for example, for the triple <最初, 施用, 引>, the part-of-speech feature of the word segments in the triple is “adjective, verb, noun”. The corresponding part of speech can be looked up in the part-of-speech template according to this part-of-speech feature, so as to obtain the part-of-speech extraction probability corresponding to that part of speech; this probability represents the probability that the word segments “最初”, “施用”, or “引” in the triple are part of a multi-word unit. The obtained part-of-speech extraction probability may then also be input into the artificial neural network 205 as a feature quantity of the word segments in the triple. If no part-of-speech extraction probability is found, a preset default probability is used instead. The part-of-speech template stores in advance the parts of speech of N-tuples and their corresponding part-of-speech extraction probabilities, each of which represents the probability that the word segments in the corresponding N-tuple are part of a multi-word unit. Those skilled in the art will understand that the part-of-speech template may be preset. Alternatively, the part-of-speech template may also be generated by training the artificial neural network 205. As a non-limiting example, how to generate the part-of-speech template by training the artificial neural network 205 will be described in detail below.
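Steps S502 and S504 amount to table lookups with a default fallback. A minimal sketch follows; the function name, the template contents, the probability values, and the default of 0.5 are all hypothetical (the embodiment states only that some preset default probability is used when no entry is found).

```python
def gather_template_features(ngram, pos_tags, form_template, pos_template,
                             default=0.5):
    """Look up the word-form extraction probability (step S502) and the
    part-of-speech extraction probability (step S504) for one N-tuple,
    falling back to a preset default probability when a key is absent.
    The two probabilities are then appended to the feature quantities
    fed to the artificial neural network."""
    form_p = form_template.get(tuple(ngram), default)
    pos_p = pos_template.get(tuple(pos_tags), default)
    return [form_p, pos_p]
```

For instance, a word-form template entry for ("最初", "施用", "引") and a part-of-speech template entry for ("adjective", "verb", "noun") would both be retrieved in one call; an unseen triple falls back to the default for both probabilities.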
Finally, the process ends at S506.
Those skilled in the art should understand that steps S502 and S504 shown in FIG. 5 may be performed sequentially or in parallel, or only one of steps S502 and S504 may be performed. According to the method of this embodiment, the word-form extraction probability and/or the part-of-speech extraction probability can be obtained from the word-form template and the part-of-speech template according to the N-tuple, so as to make use of existing knowledge about multi-word units and to increase the feature quantities input into the artificial neural network, thereby further improving the accuracy and efficiency of identifying and extracting multi-word units.
A method of part-of-speech fault tolerance using N-tuples according to an embodiment of the present invention will be described below with reference to FIG. 6 and FIG. 7. FIG. 6 is a schematic flowchart illustrating the method of part-of-speech fault tolerance using N-tuples according to an embodiment of the present invention, and FIG. 7 is a schematic diagram illustrating part-of-speech fault tolerance using N-tuples according to an embodiment of the present invention.
As shown in FIG. 6, the process starts at S600. Next, the process proceeds to S602.
In step S602, the word forms of the word segments in an N-tuple are replaced with the corresponding parts of speech, so as to obtain generalized N-tuples in which word forms and parts of speech are mixed.
The method of part-of-speech fault tolerance using N-tuples according to an embodiment of the present invention will now be described with reference to FIG. 7. As shown in FIG. 7, at 702, an N-tuple that may contain an erroneous part of speech is selected for processing. For example, segmenting the sentence “抗原释放物释放抗原” (“the antigen-releasing substance releases the antigen”) yields the word segments {“抗原”, “释放”, “物”, “释放”, “抗原”}, and the word segments “抗原”, “释放”, and “物” can be formed into the triple <抗原, 释放, 物>, in which the part of speech of “抗原” is tagged as “noun”, that of “释放” as “verb”, and that of “物” as “noun”. Suppose the triple to be processed is <抗原, 释放, 物>, and “抗原释放物” should be a multi-word unit; however, because the part of speech of the word segment “释放” is erroneously tagged as a verb, “释放” will not be marked as part of a multi-word unit when it is analyzed, so that the complete multi-word expression “抗原释放物” cannot be correctly identified.
As shown in FIG. 7, N-tuple generalization is performed at 704. The generalization process of an N-tuple is described below with reference to FIG. 16. As shown in FIG. 16, at 1602, the N-tuple to be generalized is determined, and the number N of elements in the N-tuple is determined. At 1604, the number x of elements to be generalized is selected; x generally starts from 1, and according to the value of x, any x word segments are generalized into their parts of speech. At 1606, x elements are selected from the N-tuple according to the value of x, all possible combinations are listed, each selected element is put back into the N-tuple with its part of speech in place of its word form, and all possible generalized N-tuples are stored. At 1608, it is judged whether x equals N; if not, x is incremented by 1 at 1610 to obtain a new value of x at 1612. The processing at 1604, 1606, and 1608 is then repeated with the new value of x until x equals N.
Taking again the word segments {“抗原”, “释放”, “物”, “释放”, “抗原”} obtained by segmenting the sentence “抗原释放物释放抗原” as an example, suppose the triple <抗原, 释放, 物> is to be generalized. The number N of elements in this triple is 3, and x may be 1, 2, or 3. When x is 1, the word form of one element in the triple <抗原, 释放, 物> is replaced with its part of speech, yielding the following generalized triples: <noun, 释放, 物>, <抗原, verb, 物>, <抗原, 释放, noun>. When x is 2, the word forms of two elements in the triple are replaced with their parts of speech, yielding the following generalized triples: <noun, verb, 物>, <抗原, verb, noun>, <noun, 释放, noun>. When x is 3, the word forms of all three elements in the triple are replaced with their parts of speech, yielding the following generalized triple: <noun, verb, noun>.
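The enumeration in FIG. 16 can be sketched with `itertools.combinations`: for each x from 1 to N, every choice of x positions is replaced by its part of speech. The function name `generalize_ngram` and the tuple representation are illustrative assumptions.

```python
from itertools import combinations

def generalize_ngram(words, pos_tags):
    """Enumerate every generalized N-tuple obtained by replacing x of the
    N word forms with their parts of speech, for x = 1 .. N (FIG. 16)."""
    n = len(words)
    results = []
    for x in range(1, n + 1):                 # number of elements to generalize
        for idxs in combinations(range(n), x):  # all C(n, x) position choices
            results.append(tuple(
                pos_tags[i] if i in idxs else words[i] for i in range(n)
            ))
    return results
```

For the triple <抗原, 释放, 物> with parts of speech (noun, verb, noun), this produces the 3 + 3 + 1 = 7 generalized triples enumerated above.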
After S602, the process proceeds to S604. In S604, according to the word-form features and part-of-speech features of the word segments in the generalized N-tuples, the extraction probability that the word segments in the generalized N-tuples are part of a multi-word unit is obtained from a part-of-speech fault-tolerance template as part-of-speech fault-tolerance information, and the part-of-speech fault-tolerance information is also used as a feature quantity of the word segments in the N-tuple.
All possible generalized N-tuples can be obtained through the processing of step S602 above. Then, as shown in FIG. 7, at 706, the corresponding generalized N-tuples can be looked up in the part-of-speech fault-tolerance template according to all the possible generalized N-tuples, so as to obtain the extraction probabilities corresponding to the generalized N-tuples as part-of-speech fault-tolerance information; each extraction probability represents the probability that the word segments in the corresponding generalized N-tuple are part of a multi-word unit. The obtained part-of-speech fault-tolerance information may also be input into the artificial neural network 205 as feature quantities of the word segments in the N-tuple, and at 708 training is performed in combination with the other feature quantities of the artificial neural network, so that at 710 the artificial neural network strengthens the influence of this information on the judgment result. Therefore, as described at 712, when an erroneous part of speech appears in the target element, the deviation caused by the part-of-speech error can be reduced, thereby achieving part-of-speech fault tolerance.
If no extraction probability is found as part-of-speech fault-tolerance information, a preset default probability is used instead. The part-of-speech fault-tolerance template stores in advance generalized N-tuples and their corresponding extraction probabilities, each of which represents the probability that the word segments in the corresponding generalized N-tuple are part of a multi-word unit. Those skilled in the art will understand that the part-of-speech fault-tolerance template may be preset. Alternatively, the part-of-speech fault-tolerance template may also be generated by training the artificial neural network 205. As a non-limiting example, how to generate the part-of-speech fault-tolerance template by training the artificial neural network 205 will be described in detail below.
Taking the above triple <抗原, 释放, 物> as an example again, the following series of generalized triples can be obtained through generalization: <noun, 释放, 物>, <抗原, verb, 物>, <抗原, 释放, noun>, <noun, verb, 物>, <抗原, verb, noun>, <noun, 释放, noun>, <noun, verb, noun>. According to each of these generalized triples, the corresponding generalized triple is looked up in the part-of-speech fault-tolerance template, so as to obtain, as part-of-speech fault-tolerance information, the extraction probability that the word segments in the triple <抗原, 释放, 物> are part of a multi-word unit.
Finally, the process ends at S606.
According to the method of this embodiment, the deviation of feature values caused by part-of-speech tagging errors can be alleviated, so that even if erroneous information is introduced during part-of-speech tagging, the multi-word units in the sentence can still be correctly identified and extracted, thereby further improving the accuracy and efficiency of identifying and extracting multi-word units.
A device for extracting multi-word units in a sentence according to an embodiment of the present invention will be described below with reference to FIG. 8 to FIG. 11.
FIG. 8 is a schematic block diagram illustrating a device for extracting multi-word units in a sentence according to an embodiment of the present invention. As shown in FIG. 8, a device 800 for extracting multi-word units in a sentence includes: a linguistic feature obtaining unit 802, which, for each of a plurality of word-segmentation blocks obtained by segmenting the sentence, obtains one or more linguistic features of the word segments in the block as feature quantities; an input unit 804, which inputs the feature quantities into an artificial neural network as parameters of the artificial neural network; a judging unit 806, which uses the artificial neural network to calculate a first possibility that a word segment in each word-segmentation block is part of a multi-word unit and a second possibility that the word segment is not part of a multi-word unit, and judges whether the word segment is part of a multi-word unit according to the first possibility and the second possibility; an extracting unit 808, which extracts two or more adjacent word segments judged to be part of a multi-word unit, so as to form a multi-word unit; and a feedback information obtaining unit 810, which obtains the judgment result of the previous word-segmentation block adjacent to the current word-segmentation block as feedback information and also uses the feedback information as a feature quantity of the word segments in the current word-segmentation block.
需要指出的是,在与设备有关的实施例中所涉及的相关术语或表述与以上对根据本发明的实施例的方法的实施例阐述中所使用的术语或表述对应,在此不再赘述。It should be noted that the relevant terms or expressions involved in the embodiments related to the device correspond to the terms or expressions used in the above descriptions of the method according to the embodiments of the present invention, and will not be repeated here.
图9是示出根据本发明的另一实施例的提取语句中的多词单元的设备的示意性框图。如图9所示,提取语句中的多词单元的设备900包括语言学特征获取单元802、输入单元804、判断单元806、提取单元808、反馈信息获取单元810和组合单元902。提取语句中的多词单元的设备900中的语言学特征获取单元802、输入单元804、判断单元806、提取单元808和反馈信息获取单元810与提取语句中的多词单元的设备800中的相应单元相同,其细节在此不再赘述。另外,提取语句中的多词单元的设备900中的组合单元902用于依次将语句中相邻的N个分词组合为N元组以形成分词块,其中N为大于或等于2的自然数。Fig. 9 is a schematic block diagram illustrating a device for extracting multi-word units in a sentence according to another embodiment of the present invention. As shown in Fig. 9, the device 900 for extracting multi-word units in a sentence includes a linguistic feature acquisition unit 802, an input unit 804, a judging unit 806, an extraction unit 808, a feedback information acquisition unit 810 and a combination unit 902. The linguistic feature acquisition unit 802, the input unit 804, the judging unit 806, the extraction unit 808 and the feedback information acquisition unit 810 in the device 900 are the same as those in the device 800, and their details will not be repeated here. In addition, the combination unit 902 in the device 900 is used to sequentially combine N adjacent word segments in the sentence into N-tuples to form word segmentation blocks, where N is a natural number greater than or equal to 2.
图10是示出根据本发明的另一实施例的提取语句中的多词单元的设备的示意性框图。如图10所示,提取语句中的多词单元的设备1000包括语言学特征获取单元802、输入单元804、判断单元806、提取单元808、反馈信息获取单元810、组合单元902、词形提取概率获取单元1002和词性提取概率获取单元1004。提取语句中的多词单元的设备1000中的语言学特征获取单元802、输入单元804、判断单元806、提取单元808、反馈信息获取单元810和组合单元902与提取语句中的多词单元的设备900中的相应单元相同,其细节在此不再赘述。另外,提取语句中的多词单元的设备1000中的词形提取概率获取单元1002,其根据N元组中的分词的词形特征,从词形模板中获取N元组中的分词是多词单元的一部分的词形提取概率,并且将词形提取概率也作为N元组中的分词的特征量;词性提取概率获取单元1004,其根据N元组中的分词的词性特征,从词性模板中获取N元组中的分词是多词单元的一部分的词性提取概率,并且将词性提取概率也作为N元组中的分词的特征量。Fig. 10 is a schematic block diagram illustrating a device for extracting multi-word units in a sentence according to another embodiment of the present invention. As shown in Fig. 10, the device 1000 for extracting multi-word units in a sentence includes a linguistic feature acquisition unit 802, an input unit 804, a judging unit 806, an extraction unit 808, a feedback information acquisition unit 810, a combination unit 902, a word form extraction probability acquisition unit 1002 and a part-of-speech extraction probability acquisition unit 1004. The linguistic feature acquisition unit 802, the input unit 804, the judging unit 806, the extraction unit 808, the feedback information acquisition unit 810 and the combination unit 902 in the device 1000 are the same as those in the device 900, and their details will not be repeated here. In addition, the word form extraction probability acquisition unit 1002 in the device 1000 obtains, from the word form template and according to the word form features of the word segments in the N-tuple, the word form extraction probability that a word segment in the N-tuple is part of a multi-word unit, and also uses this probability as a feature quantity of the word segments in the N-tuple; the part-of-speech extraction probability acquisition unit 1004 obtains, from the part-of-speech template and according to the part-of-speech features of the word segments in the N-tuple, the part-of-speech extraction probability that a word segment in the N-tuple is part of a multi-word unit, and also uses this probability as a feature quantity of the word segments in the N-tuple.
图11是示出根据本发明的另一实施例的提取语句中的多词单元的设备的示意性框图。如图11所示,提取语句中的多词单元的设备1100包括语言学特征获取单元802、输入单元804、判断单元806、提取单元808、反馈信息获取单元810、组合单元902、泛化单元1102和词性容错信息获取单元1104。提取语句中的多词单元的设备1100中的语言学特征获取单元802、输入单元804、判断单元806、提取单元808、反馈信息获取单元810和组合单元902与提取语句中的多词单元的设备900中的相应单元相同,其细节在此不再赘述。另外,提取语句中的多词单元的设备1100中的泛化单元1102将N元组中的分词的词形替换为相应的词性,以得到混合了词形与词性的泛化模板;词性容错信息获取单元1104获取泛化模板中的中间分词为多词单元的一部分的概率作为词性容错信息,并且将词性容错信息也作为N元组中的每个分词的特征量。Fig. 11 is a schematic block diagram illustrating a device for extracting multi-word units in a sentence according to another embodiment of the present invention. As shown in Fig. 11, the device 1100 for extracting multi-word units in a sentence includes a linguistic feature acquisition unit 802, an input unit 804, a judging unit 806, an extraction unit 808, a feedback information acquisition unit 810, a combination unit 902, a generalization unit 1102 and a part-of-speech fault-tolerance information acquisition unit 1104. The linguistic feature acquisition unit 802, the input unit 804, the judging unit 806, the extraction unit 808, the feedback information acquisition unit 810 and the combination unit 902 in the device 1100 are the same as those in the device 900, and their details will not be repeated here. In addition, the generalization unit 1102 in the device 1100 replaces the word forms of the word segments in the N-tuple with the corresponding parts of speech, so as to obtain generalized templates that mix word forms and parts of speech; the part-of-speech fault-tolerance information acquisition unit 1104 acquires the probability that the middle word segment in a generalized template is part of a multi-word unit as part-of-speech fault-tolerance information, and also uses this information as a feature quantity of each word segment in the N-tuple.
上述图8至图11中的各个装置和/或单元例如可以被配置成按照相应方法中的相应步骤的工作方式来操作。细节参见上述针对根据本申请的实施例的方法所阐述的实施例,在此不再赘述。Each of the devices and/or units in FIGS. 8 to 11 above may, for example, be configured to operate in accordance with the corresponding steps of the corresponding method. For details, refer to the embodiments described above for the method according to the embodiments of the present application; they will not be repeated here.
下面将结合图12来描述根据本发明的实施例的训练用于提取语句中的多词单元的人工神经网络的方法。图12是示出根据本发明的实施例的训练用于提取语句中的多词单元的人工神经网络的方法的示意性流程图。A method for training an artificial neural network for extracting multi-word units in a sentence according to an embodiment of the present invention will be described below with reference to FIG. 12 . Fig. 12 is a schematic flowchart illustrating a method for training an artificial neural network for extracting multi-word units in a sentence according to an embodiment of the present invention.
如图12所示,该处理在S1200开始。接着,该处理前进到S1202。As shown in FIG. 12, the process starts at S1200. Next, the process advances to S1202.
在S1202,针对将每个训练语句进行分词得到的多个分词块中的每个分词块,获取每个分词块中的分词的一个或更多个语言学特征作为特征量,其中,训练语句中的多词单元已被标注。In S1202, for each of the multiple word segmentation blocks obtained by segmenting each training sentence, one or more linguistic features of the word segments in each block are acquired as feature quantities, wherein the multi-word units in the training sentences have been annotated.
除了是处理对每个训练语句进行分词得到的多个分词块之外,S1202的处理与图1中的S102的处理基本相同,其具体细节在此不再赘述。另外,关于训练语句,已经对其中的多词单元进行了标注。The processing of S1202 is basically the same as the processing of S102 in FIG. 1 except for processing multiple word segmentation blocks obtained by segmenting each training sentence, and details thereof will not be repeated here. In addition, regarding the training sentences, the multi-word units in them have been marked.
在S1202之后,该处理前进到S1204。在S1204,将特征量作为人工神经网络的参数输入到人工神经网络中。After S1202, the process proceeds to S1204. In S1204, the feature quantity is input into the artificial neural network as a parameter of the artificial neural network.
除了是处理对每个训练语句进行分词得到的多个分词块之外,S1204的处理与图1中的S104的处理基本相同,其具体细节在此不再赘述。The processing of S1204 is basically the same as the processing of S104 in FIG. 1 except for processing multiple word segmentation blocks obtained by segmenting each training sentence, and details thereof will not be repeated here.
在S1204之后,该处理前进到S1206。在S1206,采用人工神经网络计算每个分词块中的分词是多词单元的一部分的第一可能性和该分词不是多词单元的一部分的第二可能性,并且根据第一可能性和第二可能性来判断该分词是否为多词单元的一部分。After S1204, the process proceeds to S1206. In S1206, the artificial neural network is used to calculate the first possibility that the word in each word segmentation block is a part of the multi-word unit and the second possibility that the word is not a part of the multi-word unit, and according to the first possibility and the second Possibility to determine whether the word is part of a multi-word unit.
除了是处理对每个训练语句进行分词得到的多个分词块之外,S1206的处理与图1中的S106的处理基本相同,其具体细节在此不再赘述。The processing of S1206 is basically the same as the processing of S106 in FIG. 1 except for processing multiple word segmentation blocks obtained by segmenting each training sentence, and details thereof will not be repeated here.
在S1206之后,该处理前进到S1208。在S1208,根据判断的结果和标注的结果,来训练人工神经网络。After S1206, the process proceeds to S1208. At S1208, the artificial neural network is trained according to the judgment result and the marked result.
人工神经网络205的训练过程就是对人工神经网络205中的权值进行求解的过程。本发明中采用BP(Back Propagation,误差反向传播)算法来进行人工神经网络205的训练。具体过程如下:The training process of the artificial neural network 205 is the process of solving the weights in the artificial neural network 205 . In the present invention, a BP (Back Propagation, error backpropagation) algorithm is used to train the artificial neural network 205 . The specific process is as follows:
a)初始化人工神经网络205,选用随机产生的权重;a) Initialize the artificial neural network 205 and select randomly generated weights;
b)将带有期望值的训练数据的项目逐一输入到人工神经网络205中,并且计算输出值;b) Input the items of the training data with expected values into the artificial neural network 205 one by one, and calculate the output value;
c)比较输出值与期望值之间的差异,计算人工神经网络205中的每个神经元的误差;c) comparing the difference between the output value and the expected value, and calculating the error of each neuron in the artificial neural network 205;
d)调整权重并减小误差;d) adjust weights and reduce errors;
e)重复执行步骤b)-d),直至误差小于预定的阈值为止。本领域技术人员应当理解,可以根据经验值、或者根据实验来设定上述预定的阈值。e) Repeat steps b)-d) until the error is smaller than a predetermined threshold. Those skilled in the art should understand that the aforementioned predetermined threshold can be set according to empirical values or experiments.
训练人工神经网络205的过程由输出层神经元权值向隐匿层神经元权值逐一进行求解,分别计算每个权重的变化量。首先,按照下面的公式求解每个输出层神经元的误差:δi = (di − yi) × f′(neti),其中,di是第i个神经元所期望的输出值,yi是第i个神经元的实际输出值,neti是第i个神经元的净输入,f′是活化函数的导数。接着,按照下面的公式计算隐匿层神经元的误差:δhi = yhi × (1 − yhi) × Σj wij × δj,其中,wij是第j个输出层神经元与第i个隐匿层神经元之间的权值,δj是第j个输出层神经元的误差,yhi是第i个隐匿层神经元的实际输出值,上标h表示该神经元是隐匿层神经元(对于S型活化函数,有f′(net) = y × (1 − y))。输入层神经元的输入值即为输出值,因此没有误差。The process of training the artificial neural network 205 solves the weights one by one, from the output layer neuron weights to the hidden layer neuron weights, calculating the change of each weight separately. First, the error of each output layer neuron is solved according to the following formula: δi = (di − yi) × f′(neti), where di is the expected output value of the i-th neuron, yi is the actual output value of the i-th neuron, neti is the net input of the i-th neuron, and f′ is the derivative of the activation function. Next, the error of the hidden layer neurons is calculated according to the following formula: δhi = yhi × (1 − yhi) × Σj wij × δj, where wij is the weight between the j-th output layer neuron and the i-th hidden layer neuron, δj is the error of the j-th output layer neuron, and yhi is the actual output value of the i-th hidden layer neuron, the superscript h indicating that the neuron is a hidden layer neuron (for a sigmoid activation function, f′(net) = y × (1 − y)). The input value of an input layer neuron is its output value, so it has no error.
计算出每个神经元的误差后,可以计算权重的调整幅度:Δw=ρ×δi×ni,其中ρ是学习率,δi是第i个神经元的误差,ni是当前神经元的值。新的权重就是当前权重加上Δw。After calculating the error of each neuron, the adjustment range of the weight can be calculated: Δw=ρ×δi ×ni , where ρ is the learning rate, δi is the error of the i-th neuron, and ni is the current neuron value. The new weight is the current weight plus Δw.
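A minimal Python sketch of steps a) to e) and the two error formulas above. The layer sizes, sigmoid activation, learning rate ρ, error threshold and bias terms are illustrative assumptions — the text does not fix the topology of the artificial neural network 205 — and logical OR merely stands in for the real (feature vector, expected value) training items.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, layer):
    # layer: one [weights, bias] pair per neuron
    return [sigmoid(sum(w * v for w, v in zip(ws, x)) + b) for ws, b in layer]

def train(samples, n_in, n_hid, n_out, rho=1.0, threshold=0.1, max_epochs=5000):
    random.seed(0)
    # a) initialise the network with randomly generated weights
    hid = [[[random.uniform(-1, 1) for _ in range(n_in)], 0.0] for _ in range(n_hid)]
    out = [[[random.uniform(-1, 1) for _ in range(n_hid)], 0.0] for _ in range(n_out)]
    for _ in range(max_epochs):
        worst = 0.0
        for x, target in samples:          # b) feed training items one by one
            h = forward(x, hid)
            y = forward(h, out)
            # c) output-layer error: delta_i = (d_i - y_i) * f'(net_i),
            #    with f'(net) = y * (1 - y) for the sigmoid
            d_out = [(d - o) * o * (1 - o) for d, o in zip(target, y)]
            # hidden-layer error: delta_h_i = h_i * (1 - h_i) * sum_j w_ij * delta_j
            d_hid = [h[i] * (1 - h[i]) * sum(out[j][0][i] * d_out[j]
                     for j in range(n_out)) for i in range(n_hid)]
            # d) adjust each weight by delta_w = rho * delta_i * n_i
            for j in range(n_out):
                for i in range(n_hid):
                    out[j][0][i] += rho * d_out[j] * h[i]
                out[j][1] += rho * d_out[j]
            for i in range(n_hid):
                for k in range(n_in):
                    hid[i][0][k] += rho * d_hid[i] * x[k]
                hid[i][1] += rho * d_hid[i]
            worst = max(worst, max(abs(d - o) for d, o in zip(target, y)))
        if worst < threshold:              # e) repeat until the error is small
            break
    return lambda x: forward(forward(x, hid), out)

# toy stand-in for (feature vector, expected value) pairs: logical OR
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [1])]
predict = train(data, n_in=2, n_hid=3, n_out=1)
```

The stopping rule mirrors step e): training ends once the worst per-neuron error over an epoch falls below the predetermined threshold.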
本领域技术人员应当理解,上述训练人工神经网络205的方法仅是示例性的,还可以采用其它的方法来训练人工神经网络205。Those skilled in the art should understand that the above method for training the artificial neural network 205 is only exemplary, and other methods can also be used for training the artificial neural network 205 .
在S1208之后,该处理前进到S1210。在S1210,获取与当前分词块相邻的先前分词块的判断的结果作为反馈信息,并且将反馈信息也作为当前分词块中的分词的特征量。After S1208, the process proceeds to S1210. In S1210, the judgment result of the previous word segmentation block adjacent to the current word segmentation block is acquired as feedback information, and the feedback information is also used as a feature quantity of the word segments in the current word segmentation block.
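The feedback step can be sketched as a sequential labelling loop in which the judgment for the previous block becomes an extra feature of the current block. The `classify` function below is a toy stand-in for the trained network and is an assumption, not part of the text:

```python
def classify(features):
    # toy stand-in for the network's judgment: "part of a multi-word
    # unit" (1) when the feature sum is positive, otherwise 0
    return 1 if sum(features) > 0 else 0

def label_blocks(blocks):
    """Judge each word segmentation block in order, feeding the previous
    judgment back in as an additional feature (S1210)."""
    labels = []
    prev = 0                    # no previous judgment before the first block
    for feats in blocks:
        feats = feats + [prev]  # feedback information as an extra feature
        prev = classify(feats)
        labels.append(prev)
    return labels
```

For example, `label_blocks([[1.0], [-2.0], [0.5]])` judges the second block with the first block's label 1 appended to its features.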
除了是处理对每个训练语句进行分词得到的多个分词块之外,S1210的处理与图1中的S110的处理基本相同,其具体细节在此不再赘述。The processing of S1210 is basically the same as the processing of S110 in FIG. 1 except for processing multiple word segmentation blocks obtained by segmenting each training sentence, and details thereof will not be repeated here.
最后,该处理在S1212处结束。Finally, the process ends at S1212.
根据本实施例的方法,通过训练可以得到具有反馈配置的人工神经网络,将训练得到的人工神经网络应用于多词单元的识别和提取,可以提高多词单元的识别和提取的准确性和效率。According to the method of this embodiment, an artificial neural network with a feedback configuration can be obtained through training, and applying the trained artificial neural network to the identification and extraction of multi-word units can improve the accuracy and efficiency of that identification and extraction.
下面结合图13来描述根据本发明的实施例的采用N元组来训练用于提取语句中的多词单元的人工神经网络的方法。图13是示出根据本发明的实施例的采用N元组来训练用于提取语句中的多词单元的人工神经网络的方法的示意性流程图。The following describes a method for using N-tuples to train an artificial neural network for extracting multi-word units in a sentence according to an embodiment of the present invention with reference to FIG. 13 . FIG. 13 is a schematic flowchart illustrating a method for training an artificial neural network for extracting multi-word units in a sentence using N-tuples according to an embodiment of the present invention.
如图13所示,该处理在S1300开始。接着,该处理前进到S1302。As shown in FIG. 13, the process starts at S1300. Next, the process advances to S1302.
在S1302,依次将训练语句中相邻的N个分词组合为N元组以形成分词块,其中N为大于或等于2的自然数。At S1302, sequentially combine N adjacent word segments in the training sentence into N-tuples to form a word segment block, where N is a natural number greater than or equal to 2.
除了是处理对每个训练语句进行分词得到的多个分词块之外,S1302的处理与图3中的S302的处理基本相同,其具体细节在此不再赘述。The processing of S1302 is basically the same as the processing of S302 in FIG. 3 except for processing multiple word segmentation blocks obtained by segmenting each training sentence, and details thereof will not be repeated here.
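The N-tuple formation of S1302 can be sketched as a sliding window over the word segments of a sentence:

```python
def ngrams(tokens, n):
    """Combine adjacent N word segments into N-tuples (S1302).
    N must be a natural number greater than or equal to 2."""
    if n < 2:
        raise ValueError("N must be >= 2")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

A sentence shorter than N yields no N-tuples.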
最后,该处理在S1304处结束。Finally, the process ends at S1304.
根据本实施例的方法,可以根据N元组的诸如词性组合知识、词形组合知识等已有知识来训练人工神经网络,将训练得到的人工神经网络应用于提取语句中的多词单元,可以进一步提高多词单元的识别和提取的准确性和效率。According to the method of this embodiment, the artificial neural network can be trained on existing knowledge of the N-tuples, such as part-of-speech combination knowledge and word form combination knowledge, and applying the trained artificial neural network to extract multi-word units in sentences can further improve the accuracy and efficiency of the identification and extraction of multi-word units.
下面结合图14来描述根据本发明的实施例的采用N元组生成词形模板和/或词性模板的方法。图14是示出根据本发明的实施例的采用N元组生成词形模板和/或词性模板的方法的示意性流程图。The method for generating word form templates and/or part-of-speech templates using N-tuples according to an embodiment of the present invention will be described below with reference to FIG. 14 . Fig. 14 is a schematic flow chart illustrating a method for generating word form templates and/or part-of-speech templates using N-tuples according to an embodiment of the present invention.
如图14所示,该处理开始于S1400。接着,该处理前进到S1402。As shown in FIG. 14, the process starts at S1400. Next, the process advances to S1402.
在步骤S1402,根据标注的结果和N元组中的分词的词形特征,计算N元组中的分词被标注为多词单元的一部分的词形提取概率,以生成词形模板。In step S1402, according to the tagged result and the morphological features of the word segments in the N-tuple, calculate the morphological extraction probability that the word segment in the N-tuple is marked as a part of the multi-word unit, so as to generate a word form template.
例如,对于三元组<最初,施用,引>,其中的分词"最初"和"施用"被标注为不是多词单元的一部分,而其中的分词"引"被标注为是多词单元的一部分,并且该三元组<最初,施用,引>中的分词的词形特征为"最初,施用,引"。可以根据上述信息,通过人工神经网络205来计算该三元组<最初,施用,引>中的分词"最初"、"施用"或"引"被标注为多词单元的一部分的词形提取概率,并且相关联地存储该词形提取概率和当前分词所对应的三元组,从而生成词形模板。For example, for the triple <最初, 施用, 引>, the word segments "最初" and "施用" are annotated as not being part of a multi-word unit, while the word segment "引" is annotated as being part of a multi-word unit, and the word form features of the word segments in this triple are "最初, 施用, 引". Based on the above information, the word form extraction probability that the word segment "最初", "施用" or "引" in the triple is annotated as part of a multi-word unit can be calculated by the artificial neural network 205, and the word form extraction probability is stored in association with the triple corresponding to the current word segment, so as to generate a word form template.
在步骤S1404,根据标注的结果和N元组中的分词的词性特征,计算N元组中的分词是多词单元的一部分的词性提取概率,以生成词性模板。In step S1404, according to the tagged result and the part-of-speech feature of the word segmentation in the N-tuple, calculate the part-of-speech extraction probability that the word segmentation in the N-tuple is a part of the multi-word unit, so as to generate a part-of-speech template.
类似地,例如,对于三元组<最初,施用,引>,其中的分词"最初"和"施用"被标注为不是多词单元的一部分,而其中的分词"引"被标注为是多词单元的一部分,并且该三元组<最初,施用,引>中的分词的词性特征为"形容词,动词,名词"。可以根据上述信息,通过人工神经网络205来计算该三元组<最初,施用,引>中的分词"最初"、"施用"或"引"被标注为多词单元的一部分的词性提取概率,并且相关联地存储该词性提取概率和当前分词所对应的三元组,从而生成词性模板。Similarly, for example, for the triple <最初, 施用, 引>, the word segments "最初" and "施用" are annotated as not being part of a multi-word unit, while the word segment "引" is annotated as being part of a multi-word unit, and the part-of-speech features of the word segments in this triple are "adjective, verb, noun". Based on the above information, the part-of-speech extraction probability that the word segment "最初", "施用" or "引" in the triple is annotated as part of a multi-word unit can be calculated by the artificial neural network 205, and the part-of-speech extraction probability is stored in association with the triple corresponding to the current word segment, so as to generate a part-of-speech template.
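Steps S1402 and S1404 differ only in which feature of the word segments keys the template. A hedged sketch, assuming a simple relative-frequency estimate over the annotated N-tuples (the text itself only states that the probabilities are computed and stored with the corresponding tuples); the example data and labels below are hypothetical:

```python
from collections import defaultdict

def build_template(annotated, feature_index):
    """Estimate, for each position of each N-tuple, the probability that
    the word segment there is annotated as part of a multi-word unit.
    annotated: list of (ngram, labels) where ngram is a tuple of
    (word_form, pos_tag) pairs and labels[i] is 1 if word i is part of
    a multi-word unit. feature_index 0 keys the template by word forms
    (word form template, S1402), 1 by parts of speech (S1404)."""
    seen = defaultdict(int)
    hits = defaultdict(int)
    for ngram, labels in annotated:
        key = tuple(word[feature_index] for word in ngram)
        for pos, lab in enumerate(labels):
            seen[(key, pos)] += 1
            hits[(key, pos)] += lab
    return {k: hits[k] / seen[k] for k in seen}

# hypothetical annotated triples: (form, tag) pairs plus per-word labels
annotated = [
    ((("initially", "adj"), ("apply", "verb"), ("primer", "noun")), (0, 0, 1)),
    ((("initially", "adj"), ("apply", "verb"), ("primer", "noun")), (0, 1, 1)),
]
form_tpl = build_template(annotated, 0)  # word form template (S1402)
pos_tpl = build_template(annotated, 1)   # part-of-speech template (S1404)
```

Both templates can be built in one pass over the same annotated corpus, which is why S1402 and S1404 may run sequentially or in parallel.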
最后,该处理在S1406处结束。Finally, the process ends at S1406.
本领域技术人员应当理解,图14中所示的步骤S1402和S1404可以顺序执行,也可以并行执行,或者可以仅执行步骤S1402和S1404中的任一个。根据本实施例的方法,可以采用N元组来训练人工神经网络以生成词形模板或词性模板,将生成的词形模板和词性模板应用于多词单元的识别和提取,可以进一步提高多词单元的识别和提取的准确性和效率。Those skilled in the art should understand that steps S1402 and S1404 shown in Fig. 14 may be executed sequentially or in parallel, or only one of steps S1402 and S1404 may be executed. According to the method of this embodiment, N-tuples can be used to train the artificial neural network to generate word form templates or part-of-speech templates, and applying the generated templates to the identification and extraction of multi-word units can further improve the accuracy and efficiency of that identification and extraction.
下面结合图15和图16来描述根据本发明的实施例的采用N元组生成词性容错模板的方法。图15是示出根据本发明的实施例的采用N元组生成词性容错模板的方法的示意性流程图。图16是示出根据本发明的实施例的采用N元组生成词性容错模板的示意图。The method for generating part-of-speech fault-tolerant templates using N-tuples according to an embodiment of the present invention will be described below with reference to FIG. 15 and FIG. 16 . Fig. 15 is a schematic flowchart illustrating a method for generating a part-of-speech fault-tolerant template using N-tuples according to an embodiment of the present invention. Fig. 16 is a schematic diagram showing the use of N-tuples to generate part-of-speech fault-tolerant templates according to an embodiment of the present invention.
如图15所示,该处理开始于S1500。接着,该处理前进到S1502。As shown in FIG. 15, the process starts at S1500. Next, the process advances to S1502.
在步骤S1502,将N元组中的分词的词形替换为相应的词性,以得到混合了词形与词性的泛化N元组。In step S1502, the form of the word segmentation in the N-tuple is replaced with the corresponding part of speech, so as to obtain a generalized N-tuple that mixes the form and part of speech.
除了是处理对每个训练语句进行分词得到的多个分词之外,S1502的处理与图6中的S602的处理基本相同,其具体细节在此不再赘述。The processing of S1502 is basically the same as the processing of S602 in FIG. 6 except for processing multiple word segmentations obtained by segmenting each training sentence, and details thereof will not be repeated here.
在S1502之后,该处理前进到S1504。在S1504,根据标注的结果和泛化N元组中的分词的词形特征和词性特征,计算泛化N元组中的分词被标注为多词单元的一部分的提取概率作为词性容错信息,以生成词性容错模板。After S1502, the process proceeds to S1504. In S1504, according to the annotation results and the word form and part-of-speech features of the word segments in the generalized N-tuples, the extraction probability that a word segment in a generalized N-tuple is annotated as part of a multi-word unit is calculated as part-of-speech fault-tolerance information, so as to generate a part-of-speech fault-tolerant template.
通过上述步骤S1502的处理可以得到所有可能的泛化后的N元组。然后,可以根据标注的结果和所有可能的泛化后的N元组,分别计算泛化N元组中的分词被标注为多词单元的一部分的提取概率作为词性容错信息。All possible generalized N-tuples can be obtained through the processing of step S1502 above. Then, according to the annotation results and all possible generalized N-tuples, the extraction probability that a word segment in each generalized N-tuple is annotated as part of a multi-word unit can be calculated as part-of-speech fault-tolerance information.
还是以上述三元组<抗原,释放,物>为例,其中分词"抗原"、"释放"和"物"均被标注为是多词单元的一部分,上述三元组通过泛化可以得到以下一系列的泛化三元组:<名词,释放,物>,<抗原,动词,物>,<抗原,释放,名词>,<名词,动词,物>,<抗原,动词,名词>,<名词,释放,名词>,<名词,动词,名词>。因此,如图16所示,在1614处,根据上述标注的结果和上述一系列的泛化三元组中的每个,分别计算上述泛化三元组中的分词被标注为多词单元的一部分的提取概率作为词性容错信息,并且相关联地存储该词性容错信息和当前分词所对应的三元组,从而生成词性容错模板。Taking the triple <抗原, 释放, 物> again as an example, in which the word segments "抗原" (antigen), "释放" (release) and "物" (thing) are all annotated as part of a multi-word unit, the triple can be generalized into the following series of generalized triples: <noun, release, thing>, <antigen, verb, thing>, <antigen, release, noun>, <noun, verb, thing>, <antigen, verb, noun>, <noun, release, noun>, <noun, verb, noun>. Therefore, as shown in Fig. 16, at 1614, according to the annotation results and each of the above generalized triples, the extraction probability that a word segment in the generalized triple is annotated as part of a multi-word unit is calculated as part-of-speech fault-tolerance information, and this information is stored in association with the triple corresponding to the current word segment, thereby generating a part-of-speech fault-tolerant template.
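The enumeration of generalized N-tuples can be sketched as replacing every non-empty subset of positions with its part of speech, which yields the 2^N − 1 mixed tuples listed above (seven for a triple); the English token and tag names are illustrative stand-ins:

```python
from itertools import product

def generalize(forms, tags):
    """Return all generalized N-tuples: for every non-empty subset of
    positions, the word form is replaced by its part of speech. The
    unmodified all-word-form tuple itself is excluded, giving 2**N - 1
    variants for an N-tuple."""
    variants = []
    for mask in product([False, True], repeat=len(forms)):
        if not any(mask):
            continue  # skip the tuple with no replacements
        variants.append(tuple(t if m else f
                              for f, t, m in zip(forms, tags, mask)))
    return variants

triples = generalize(("antigen", "release", "thing"), ("noun", "verb", "noun"))
```

For the triple in the text this produces exactly the seven generalized triples enumerated above.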
由于大部分词性容错模板中均包含词性信息和词形信息,并且N元组模板中不仅包含当前目标分词还包含当前分词的前后分词信息,所以可以极大地弱化单个错误词性所造成的影响。当将错误词性输入到人工神经网络中时,可以通过人工神经网络计算词性容错模板中的分词是多词单元的一部分的概率,来抑制错误词性对最终判断结果的影响。Since most part-of-speech fault-tolerant templates contain both part-of-speech information and word form information, and the N-tuple template contains not only the current target word segment but also the word segments before and after it, the influence of a single erroneous part of speech can be greatly weakened. When an erroneous part of speech is input into the artificial neural network, the probability that a word segment in the part-of-speech fault-tolerant template is part of a multi-word unit, as calculated by the artificial neural network, suppresses the influence of the erroneous part of speech on the final judgment result.
最后,该处理在S1506处结束。Finally, the process ends at S1506.
根据本实施例的方法,可以在训练人工神经网络的过程中缓解由词性标注错误引起的特征值的偏差,并且生成词性容错模板。如果将生成的词性容错模板应用于多词单元的识别和提取,则即使在词性标注过程中引用了错误信息,也可以正确地识别和提取语句中的多词单元,从而可以进一步提高多词单元的识别和提取的准确性和效率。According to the method of this embodiment, deviations in feature values caused by part-of-speech tagging errors can be alleviated during the training of the artificial neural network, and a part-of-speech fault-tolerant template can be generated. If the generated part-of-speech fault-tolerant template is applied to the identification and extraction of multi-word units, the multi-word units in a sentence can be correctly identified and extracted even when erroneous information is introduced during part-of-speech tagging, which can further improve the accuracy and efficiency of the identification and extraction of multi-word units.
下面结合图17至图20来说明根据本发明的实施例的训练用于提取语句中的多词单元的人工神经网络的设备。A device for training an artificial neural network for extracting multi-word units in a sentence according to an embodiment of the present invention will be described below with reference to FIGS. 17 to 20 .
图17是示出根据本发明的实施例的训练用于提取语句中的多词单元的人工神经网络的设备的示意性框图。如图17所示,训练用于提取语句中的多词单元的人工神经网络的设备1700包括:语言学特征获取装置1702,其针对将每个训练语句进行分词得到的多个分词块中的每个分词块,获取每个分词块中的分词的一个或更多个语言学特征作为特征量,其中,训练语句中的多词单元已被标注;输入装置1704,其将特征量作为人工神经网络的参数输入到人工神经网络中;判断装置1706,其采用人工神经网络计算每个分词块中的分词是多词单元的一部分的第一可能性和该分词不是多词单元的一部分的第二可能性,并且根据第一可能性和第二可能性的比较结果来判断该分词是否为多词单元的一部分;训练装置1708,其根据判断的结果和标注的结果,来训练人工神经网络;以及反馈信息获取装置1710,其获取与当前分词块相邻的先前分词块的判断的结果作为反馈信息,并且将反馈信息也作为当前分词块中的分词的特征量。Fig. 17 is a schematic block diagram illustrating a device for training an artificial neural network for extracting multi-word units in a sentence according to an embodiment of the present invention. As shown in Fig. 17, the device 1700 for training an artificial neural network for extracting multi-word units in a sentence includes: a linguistic feature acquisition means 1702, which, for each of the multiple word segmentation blocks obtained by segmenting each training sentence, acquires one or more linguistic features of the word segments in that block as feature quantities, wherein the multi-word units in the training sentences have been annotated; an input means 1704, which inputs the feature quantities into the artificial neural network as parameters of the artificial neural network; a judging means 1706, which uses the artificial neural network to calculate a first possibility that a word segment in each block is part of a multi-word unit and a second possibility that it is not, and judges from a comparison of the two possibilities whether the word segment is part of a multi-word unit; a training means 1708, which trains the artificial neural network according to the judgment results and the annotation results; and a feedback information acquisition means 1710, which acquires the judgment result of the previous word segmentation block adjacent to the current block as feedback information and also uses the feedback information as a feature quantity of the word segments in the current block.
需要指出的是,在与设备有关的实施例中所涉及的相关术语或表述与以上对根据本发明的实施例的方法的实施例阐述中所使用的术语或表述对应,在此不再赘述。It should be noted that the relevant terms or expressions involved in the embodiments related to the device correspond to the terms or expressions used in the above descriptions of the method according to the embodiments of the present invention, and will not be repeated here.
图18是示出根据本发明的另一实施例的训练用于提取语句中的多词单元的人工神经网络的设备的示意性框图。如图18所示,训练用于提取语句中的多词单元的人工神经网络的设备1800包括语言学特征获取装置1702、输入装置1704、判断装置1706、训练装置1708、反馈信息获取装置1710和组合装置1802。训练用于提取语句中的多词单元的人工神经网络的设备1800中的语言学特征获取装置1702、输入装置1704、判断装置1706、训练装置1708和反馈信息获取装置1710与训练用于提取语句中的多词单元的人工神经网络的设备1700中的语言学特征获取装置1702、输入装置1704、判断装置1706、训练装置1708和反馈信息获取装置1710相同,其细节在此不再赘述。另外,训练用于提取语句中的多词单元的人工神经网络的设备1800中的组合装置1802依次将训练语句中相邻的N个分词组合为N元组以形成分词块,其中N为大于或等于2的自然数。Fig. 18 is a schematic block diagram showing a device for training an artificial neural network for extracting multi-word units in a sentence according to another embodiment of the present invention. As shown in Figure 18, the device 1800 for training the artificial neural network used to extract the multi-word units in the sentence includes linguistic feature acquisition means 1702, input means 1704, judgment means 1706, training means 1708, feedback information acquisition means 1710 and combination device 1802. The linguistic feature acquisition means 1702, the input means 1704, the judging means 1706, the training means 1708 and the feedback information acquisition means 1710 in the device 1800 for training the artificial neural network for extracting the multi-word units in the sentence are used for training in the extraction of the sentence The linguistic feature acquiring means 1702, input means 1704, judging means 1706, training means 1708 and feedback information acquiring means 1710 in the multi-word unit artificial neural network device 1700 are the same, and the details thereof will not be repeated here. In addition, the combination device 1802 in the device 1800 for training the artificial neural network for extracting the multi-word units in the sentence sequentially combines the adjacent N participles in the training sentence into N-tuples to form participle blocks, where N is greater than or A natural number equal to 2.
图19是示出根据本发明的另一实施例的训练用于提取语句中的多词单元的人工神经网络的设备的示意性框图。如图19所示,训练用于提取语句中的多词单元的人工神经网络的设备1900包括语言学特征获取装置1702、输入装置1704、判断装置1706、训练装置1708、反馈信息获取装置1710、组合装置1802、词形模板生成装置1902和词性模板生成装置1904。训练用于提取语句中的多词单元的人工神经网络的设备1900中的语言学特征获取装置1702、输入装置1704、判断装置1706、训练装置1708、反馈信息获取装置1710和组合装置1802与设备1800中的相应装置相同,其细节在此不再赘述。另外,训练用于提取语句中的多词单元的人工神经网络的设备1900中的词形模板生成装置1902,其根据标注的结果和N元组中的分词的词形特征,计算N元组中的分词是多词单元的一部分的词形提取概率,以生成词形模板;和/或词性模板生成装置1904,其根据标注的结果和N元组中的分词的词性特征,计算N元组中的分词是多词单元的一部分的词性提取概率,以生成词性模板。Fig. 19 is a schematic block diagram illustrating a device for training an artificial neural network for extracting multi-word units in a sentence according to another embodiment of the present invention. As shown in Fig. 19, the device 1900 for training an artificial neural network for extracting multi-word units in a sentence includes a linguistic feature acquisition means 1702, an input means 1704, a judging means 1706, a training means 1708, a feedback information acquisition means 1710, a combination means 1802, a word form template generation means 1902 and a part-of-speech template generation means 1904. The linguistic feature acquisition means 1702, the input means 1704, the judging means 1706, the training means 1708, the feedback information acquisition means 1710 and the combination means 1802 in the device 1900 are the same as those in the device 1800, and their details will not be repeated here. In addition, the word form template generation means 1902 in the device 1900 calculates, according to the annotation results and the word form features of the word segments in the N-tuple, the word form extraction probability that a word segment in the N-tuple is part of a multi-word unit, so as to generate a word form template; and/or the part-of-speech template generation means 1904 calculates, according to the annotation results and the part-of-speech features of the word segments in the N-tuple, the part-of-speech extraction probability that a word segment in the N-tuple is part of a multi-word unit, so as to generate a part-of-speech template.
Fig. 20 is a schematic block diagram showing a device for training an artificial neural network for extracting multi-word units in a sentence according to another embodiment of the present invention. As shown in Fig. 20, the device 2000 for training an artificial neural network for extracting multi-word units in a sentence includes a linguistic feature acquisition device 1702, an input device 1704, a judgment device 1706, a training device 1708, a feedback information acquisition device 1710, a combination device 1802, a generalization device 2002, and a part-of-speech fault-tolerant template generation device 2004. The linguistic feature acquisition device 1702, the input device 1704, the judgment device 1706, the training device 1708, the feedback information acquisition device 1710, and the combination device 1802 in the device 2000 are the same as their counterparts in the device 1800 for training an artificial neural network for extracting multi-word units in a sentence, and their details are not repeated here.
In addition, the device 2000 includes the generalization device 2002, which replaces the word forms of the word segments in the N-tuple with the corresponding parts of speech, so as to obtain a generalized N-tuple in which word forms and parts of speech are mixed; and the part-of-speech fault-tolerant template generation device 2004, which calculates, according to the annotation results and the word form features and part-of-speech features of the word segments in the generalized N-tuple, the extraction probability that a word segment in the generalized N-tuple is part of a multi-word unit as part-of-speech fault-tolerant information, so as to generate a part-of-speech fault-tolerant template.
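A hedged sketch of the generalization step: one simple way to mix word forms and parts of speech is to enumerate every variant in which each position keeps either its word form or its POS tag (the specific enumeration strategy is an assumption for illustration, not dictated by the embodiment):

```python
from itertools import product

def generalize(ngram, pos_tags):
    """Produce all mixed word-form/POS variants of an N-tuple.

    ngram:    tuple of word forms, e.g. ("deep", "learning")
    pos_tags: parallel tuple of POS tags, e.g. ("JJ", "NN")
    Yields every combination in which each position is either the
    original word form or its part of speech.
    """
    choices = [(w, t) for w, t in zip(ngram, pos_tags)]
    for combo in product(*choices):
        yield combo

variants = list(generalize(("deep", "learning"), ("JJ", "NN")))
# 2 positions x 2 choices each -> 4 mixed variants
```

Each generalized variant can then be counted against the annotation results exactly like an ordinary N-tuple pattern to produce the part-of-speech fault-tolerant template.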
Those skilled in the art will understand that the steps of the above-described methods for extracting multi-word units in a sentence according to the embodiments of the present invention, or the functional units of the devices for extracting multi-word units in a sentence, may be combined arbitrarily according to actual needs. That is, the processing steps of one method embodiment for extracting multi-word units in a sentence may be combined with those of another such method embodiment, and the functional units of one device embodiment for extracting multi-word units in a sentence may be combined with those of another such device embodiment, so as to achieve the desired technical purpose. Similarly, the steps of the above-described methods for training an artificial neural network according to the embodiments of the present invention, or the functional units of the devices for training an artificial neural network, may be combined arbitrarily according to actual needs: the processing steps of one method embodiment for training an artificial neural network may be combined with those of another such method embodiment, and the functional units of one device embodiment for training an artificial neural network may be combined with those of another such device embodiment, so as to achieve the desired technical purpose.
In addition, the embodiments of the present application further provide a program product carrying machine-executable instructions which, when executed on an information processing device, cause the information processing device to perform the method for extracting multi-word units in a sentence according to the above embodiments of the present invention. Similarly, the embodiments of the present application further provide a program product carrying machine-executable instructions which, when executed on an information processing device, cause the information processing device to perform the method for training an artificial neural network according to the above embodiments of the present invention.
In addition, the embodiments of the present application further provide a storage medium including machine-readable program code which, when executed on an information processing device, causes the information processing device to perform the method for extracting multi-word units in a sentence according to the above embodiments of the present invention. Similarly, the embodiments of the present application further provide a storage medium including machine-readable program code which, when executed on an information processing device, causes the information processing device to perform the method for training an artificial neural network according to the above embodiments of the present invention.
Correspondingly, storage media for carrying the above program products storing machine-readable instruction codes are also included in the disclosure of the present invention. Such storage media include, but are not limited to, floppy disks, optical disks, magneto-optical disks, memory cards, memory sticks, and the like.
The device for extracting multi-word units in a sentence according to the embodiments of the present invention and its constituent units may be configured by means of software, firmware, hardware, or a combination thereof. Similarly, the device for training an artificial neural network according to the embodiments of the present invention and its constituent units may also be configured by means of software, firmware, hardware, or a combination thereof. The specific means or manners available for such configuration are well known to those skilled in the art and are not repeated here. Where the configuration is realized by software or firmware, a program constituting the software is installed from a storage medium or a network onto an information processing device having a dedicated hardware structure (for example, the general-purpose computer 2100 shown in Fig. 21), which is able to perform various functions when the various programs are installed.
In Fig. 21, a central processing unit (CPU) 2101 executes various processes according to programs stored in a read-only memory (ROM) 2102 or loaded from a storage section 2108 into a random access memory (RAM) 2103. The RAM 2103 also stores, as needed, data required when the CPU 2101 executes the various processes. The CPU 2101, the ROM 2102, and the RAM 2103 are connected to one another via a bus 2104. An input/output interface 2105 is also connected to the bus 2104.
The following components are connected to the input/output interface 2105: an input section 2106 (including a keyboard, a mouse, and the like), an output section 2107 (including a display, such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker), a storage section 2108 (including a hard disk and the like), and a communication section 2109 (including a network interface card such as a LAN card or a modem). The communication section 2109 performs communication processing via a network such as the Internet. A drive 2110 may also be connected to the input/output interface 2105 as needed. A removable medium 2111, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 2110 as needed, so that a computer program read therefrom is installed into the storage section 2108 as needed.
Where the above series of processes is realized by software, the programs constituting the software are installed from a network such as the Internet or from a storage medium such as the removable medium 2111.
Those skilled in the art should understand that such a storage medium is not limited to the removable medium 2111 shown in Fig. 21, in which the program is stored and which is distributed separately from the device so as to provide the program to the user. Examples of the removable medium 2111 include magnetic disks (including floppy disks (registered trademark)), optical disks (including compact disc read-only memories (CD-ROM) and digital versatile discs (DVD)), magneto-optical disks (including MiniDiscs (MD) (registered trademark)), and semiconductor memories. Alternatively, the storage medium may be the ROM 2102, a hard disk contained in the storage section 2108, or the like, in which the programs are stored and which is distributed to the user together with the device containing it.
When the instruction codes are read and executed by a machine, the above methods according to the embodiments of the present invention can be performed.
Finally, it should also be noted that the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Furthermore, without further limitation, an element defined by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article, or device comprising that element. Moreover, technical features or parameters qualified by the words "first", "second", "third", and so on do not, by virtue of these words, have any particular order, priority, or degree of importance; these words are used only to distinguish or identify the technical features or parameters and carry no other limiting meaning.
It is apparent from the above description that the technical solutions provided by the embodiments of the present invention include, but are not limited to, the following:
Supplementary Note 1. A method for extracting multi-word units in a sentence, comprising:
for each of a plurality of word segmentation blocks obtained by segmenting the sentence, acquiring one or more linguistic features of the word segments in the block as feature quantities;
inputting the feature quantities into an artificial neural network as parameters of the artificial neural network;
using the artificial neural network to calculate a first likelihood that a word segment in each word segmentation block is part of a multi-word unit and a second likelihood that the word segment is not part of a multi-word unit, and judging, according to the first likelihood and the second likelihood, whether the word segment is part of a multi-word unit; and
extracting two or more adjacent word segments judged to be part of a multi-word unit, so as to form a multi-word unit,
wherein the method further comprises: acquiring the judgment result of the previous word segmentation block adjacent to the current word segmentation block as feedback information, and using the feedback information also as a feature quantity of the word segments in the current word segmentation block.
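The claimed pipeline can be illustrated with a minimal sketch. All names here are hypothetical stand-ins: `score_fn` substitutes for the trained network, and the feature vectors are toy values, not the linguistic features of the embodiments.

```python
def extract_mwus(blocks, score_fn):
    """Toy version of the claimed extraction pipeline.

    blocks:   list of word-segmentation blocks; each block is a list of
              (word, features) pairs
    score_fn: stand-in for the trained network -- maps a feature vector
              to (p_in_mwu, p_not_in_mwu)
    Returns the multi-word units formed by two or more adjacent word
    segments judged to be part of an MWU.
    """
    flags, words = [], []
    feedback = 0.0  # judgment of the previous block, fed back as a feature
    for block in blocks:
        block_flags = []
        for word, feats in block:
            p_in, p_out = score_fn(feats + [feedback])
            block_flags.append(p_in > p_out)
            words.append(word)
        feedback = 1.0 if any(block_flags) else 0.0
        flags.extend(block_flags)
    # merge runs of two or more adjacent positive judgments
    mwus, run = [], []
    for word, flag in zip(words, flags):
        if flag:
            run.append(word)
        else:
            if len(run) >= 2:
                mwus.append(" ".join(run))
            run = []
    if len(run) >= 2:
        mwus.append(" ".join(run))
    return mwus

# toy usage: the dummy score_fn just reads the first feature as p_in
blocks = [[("natural", [1]), ("language", [1])], [("is", [0])]]
result = extract_mwus(blocks, lambda v: (v[0], 1 - v[0]))
```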
Supplementary Note 2. The method according to Supplementary Note 1, wherein the linguistic features are one or more of the following: the part of speech of a word segment, the word form of a word segment, the sequence number of a word segment, or the occurrence probability of a word segment.
Supplementary Note 3. The method according to any one of Supplementary Notes 1-2, further comprising:
sequentially combining adjacent N word segments in the sentence into N-tuples to form word segmentation blocks, where N is a natural number greater than or equal to 2.
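The sliding combination of adjacent word segments into N-tuples can be sketched as follows (a straightforward window over the segment list; the function name is illustrative):

```python
def make_ngram_blocks(segments, n=2):
    """Combine adjacent N word segments into N-tuples (N >= 2),
    sliding one segment at a time, to form word segmentation blocks."""
    if n < 2:
        raise ValueError("N must be >= 2")
    return [tuple(segments[i:i + n]) for i in range(len(segments) - n + 1)]

blocks = make_ngram_blocks(["machine", "translation", "system"], n=2)
```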
Supplementary Note 4. The method according to Supplementary Note 3, further comprising:
acquiring, from a word form template according to the word form features of the word segments in the N-tuple, the word form extraction probability that a word segment in the N-tuple is part of a multi-word unit, and using the word form extraction probability also as a feature quantity of the word segments in the N-tuple; and/or
acquiring, from a part-of-speech template according to the part-of-speech features of the word segments in the N-tuple, the part-of-speech extraction probability that a word segment in the N-tuple is part of a multi-word unit, and using the part-of-speech extraction probability also as a feature quantity of the word segments in the N-tuple.
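A minimal sketch of the template lookup, assuming a template is simply a mapping from an N-tuple pattern to its previously computed extraction probability (the fallback value for unseen patterns is an assumption, not specified by the claim):

```python
def template_feature(ngram, template, default=0.0):
    """Look up the extraction probability of an N-tuple pattern
    (word forms or POS tags) in a template and return it as an
    extra feature quantity for the network."""
    return template.get(tuple(ngram), default)

# hypothetical word form template with one known pattern
word_form_template = {("natural", "language"): 0.9}
feat = template_feature(["natural", "language"], word_form_template)
```

The same lookup applies to a part-of-speech template, with the POS tag tuple as the key.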
Supplementary Note 5. The method according to Supplementary Note 4, further comprising:
replacing the word forms of the word segments in the N-tuple with the corresponding parts of speech, so as to obtain a generalized N-tuple in which word forms and parts of speech are mixed; and
acquiring, from a part-of-speech fault-tolerant template according to the word form features and part-of-speech features of the word segments in the generalized N-tuple, the extraction probability that a word segment in the generalized N-tuple is part of a multi-word unit as part-of-speech fault-tolerant information, and using the part-of-speech fault-tolerant information also as a feature quantity of the word segments in the N-tuple.
Supplementary Note 6. A device for extracting multi-word units in a sentence, comprising:
a linguistic feature acquisition unit, which, for each of a plurality of word segmentation blocks obtained by segmenting the sentence, acquires one or more linguistic features of the word segments in the block as feature quantities;
an input unit, which inputs the feature quantities into an artificial neural network as parameters of the artificial neural network;
a judgment unit, which uses the artificial neural network to calculate a first likelihood that a word segment in each word segmentation block is part of a multi-word unit and a second likelihood that the word segment is not part of a multi-word unit, and judges, according to the first likelihood and the second likelihood, whether the word segment is part of a multi-word unit; and
an extraction unit, which extracts two or more adjacent word segments judged to be part of a multi-word unit, so as to form a multi-word unit,
wherein the device further comprises: a feedback information acquisition unit, which acquires the judgment result of the previous word segmentation block adjacent to the current word segmentation block as feedback information, and uses the feedback information also as a feature quantity of the word segments in the current word segmentation block.
Supplementary Note 7. The device according to Supplementary Note 6, wherein the linguistic features are one or more of the following: the part of speech of a word segment, the word form of a word segment, the sequence number of a word segment, or the occurrence probability of a word segment.
Supplementary Note 8. The device according to any one of Supplementary Notes 6-7, further comprising:
a combination unit, which sequentially combines adjacent N word segments in the sentence into N-tuples to form word segmentation blocks, where N is a natural number greater than or equal to 2.
Supplementary Note 9. The device according to Supplementary Note 8, further comprising:
a word form extraction probability acquisition unit, which acquires, from a word form template according to the word form features of the word segments in the N-tuple, the word form extraction probability that a word segment in the N-tuple is part of a multi-word unit, and uses the word form extraction probability also as a feature quantity of the word segments in the N-tuple; and/or
a part-of-speech extraction probability acquisition unit, which acquires, from a part-of-speech template according to the part-of-speech features of the word segments in the N-tuple, the part-of-speech extraction probability that a word segment in the N-tuple is part of a multi-word unit, and uses the part-of-speech extraction probability also as a feature quantity of the word segments in the N-tuple.
Supplementary Note 10. The device according to Supplementary Note 8, further comprising:
a generalization unit, which replaces the word forms of the word segments in the N-tuple with the corresponding parts of speech, so as to obtain a generalized N-tuple in which word forms and parts of speech are mixed; and
a part-of-speech fault-tolerant information acquisition unit, which acquires, from a part-of-speech fault-tolerant template according to the word form features and part-of-speech features of the word segments in the generalized N-tuple, the extraction probability that a word segment in the generalized N-tuple is part of a multi-word unit as part-of-speech fault-tolerant information, and uses the part-of-speech fault-tolerant information also as a feature quantity of each word segment in the N-tuple.
Supplementary Note 11. A method for training an artificial neural network, the artificial neural network being used for extracting multi-word units in a sentence, the method comprising:
for each of a plurality of word segmentation blocks obtained by segmenting each training sentence, acquiring one or more linguistic features of the word segments in the block as feature quantities, wherein the multi-word units in the training sentences have been annotated;
inputting the feature quantities into the artificial neural network as parameters of the artificial neural network;
using the artificial neural network to calculate a first likelihood that a word segment in each word segmentation block is part of a multi-word unit and a second likelihood that the word segment is not part of a multi-word unit, and judging, according to the comparison result of the first likelihood and the second likelihood, whether the word segment is part of a multi-word unit; and
training the artificial neural network according to the judgment results and the annotation results,
wherein the method further comprises: acquiring the judgment result of the previous word segmentation block adjacent to the current word segmentation block as feedback information, and using the feedback information also as a feature quantity of the word segments in the current word segmentation block.
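As a hedged illustration of the training signal, one can compare the network's probability that a segment is part of a multi-word unit with the annotated label. The binary cross-entropy loss below is a generic choice for this kind of two-class judgment, not necessarily the loss used in the embodiments:

```python
import math

def train_step(p_in, labeled_in):
    """Illustrative training signal: compare the network's probability
    that a word segment is part of an MWU (p_in) against the annotated
    label, and return the cross-entropy loss and its gradient with
    respect to p_in (to be backpropagated through the network)."""
    y = 1.0 if labeled_in else 0.0
    eps = 1e-12
    loss = -(y * math.log(p_in + eps) + (1 - y) * math.log(1 - p_in + eps))
    # d(loss)/d(p_in) = (p_in - y) / (p_in * (1 - p_in))
    grad = (p_in - y) / max(p_in * (1 - p_in), eps)
    return loss, grad

# a segment the network is unsure about (p_in = 0.5) but annotated as MWU
loss, grad = train_step(0.5, labeled_in=True)
```

The negative gradient for an annotated MWU segment pushes p_in upward, which is the qualitative behavior the training step requires.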
Supplementary Note 12. The method according to Supplementary Note 11, wherein the linguistic features are one or more of the following: the part of speech of a word segment, the word form of a word segment, the sequence number of a word segment, or the occurrence probability of a word segment.
Supplementary Note 13. The method according to Supplementary Note 11 or 12, further comprising:
sequentially combining adjacent N word segments in the training sentence into N-tuples to form word segmentation blocks, where N is a natural number greater than or equal to 2.
Supplementary Note 14. The method according to Supplementary Note 13, further comprising:
calculating, according to the annotation results and the word form features of the word segments in the N-tuple, the word form extraction probability that a word segment in the N-tuple is part of a multi-word unit, so as to generate a word form template; and/or
calculating, according to the annotation results and the part-of-speech features of the word segments in the N-tuple, the part-of-speech extraction probability that a word segment in the N-tuple is part of a multi-word unit, so as to generate a part-of-speech template.
Supplementary Note 15. The method according to Supplementary Note 13, further comprising:
replacing the word forms of the word segments in the N-tuple with the corresponding parts of speech, so as to obtain a generalized N-tuple in which word forms and parts of speech are mixed; and
calculating, according to the annotation results and the word form features and part-of-speech features of the word segments in the generalized N-tuple, the extraction probability that a word segment in the generalized N-tuple is part of a multi-word unit as part-of-speech fault-tolerant information, so as to generate a part-of-speech fault-tolerant template.
Supplementary Note 16. A device for training an artificial neural network, the artificial neural network being used for extracting multi-word units in a sentence, the device comprising:
a linguistic feature acquisition device, which, for each of a plurality of word segmentation blocks obtained by segmenting each training sentence, acquires one or more linguistic features of the word segments in the block as feature quantities, wherein the multi-word units in the training sentences have been annotated;
an input device, which inputs the feature quantities into the artificial neural network as parameters of the artificial neural network;
a judgment device, which uses the artificial neural network to calculate a first likelihood that a word segment in each word segmentation block is part of a multi-word unit and a second likelihood that the word segment is not part of a multi-word unit, and judges, according to the comparison result of the first likelihood and the second likelihood, whether the word segment is part of a multi-word unit; and
a training device, which trains the artificial neural network according to the judgment results and the annotation results,
wherein the device further comprises: a feedback information acquisition device, which acquires the judgment result of the previous word segmentation block adjacent to the current word segmentation block as feedback information, and uses the feedback information also as a feature quantity of the word segments in the current word segmentation block.
Supplementary Note 17. The device according to Supplementary Note 16, wherein the linguistic features are one or more of the following: the part of speech of a word segment, the word form of a word segment, the sequence number of a word segment, or the occurrence probability of a word segment.
Supplementary Note 18. The device according to Supplementary Note 16 or 17, further comprising:
a combination device, which sequentially combines adjacent N word segments in the training sentence into N-tuples to form word segmentation blocks, where N is a natural number greater than or equal to 2.
Supplementary Note 19. The device according to Supplementary Note 18, further comprising:
a word form template generation device, which calculates, according to the annotation results and the word form features of the word segments in the N-tuple, the word form extraction probability that a word segment in the N-tuple is part of a multi-word unit, so as to generate a word form template; and/or
a part-of-speech template generation device, which calculates, according to the annotation results and the part-of-speech features of the word segments in the N-tuple, the part-of-speech extraction probability that a word segment in the N-tuple is part of a multi-word unit, so as to generate a part-of-speech template.
Supplementary Note 20. The device according to Supplementary Note 18, further comprising:
a generalization device, which replaces the word forms of the word segments in the N-tuple with the corresponding parts of speech, so as to obtain a generalized N-tuple in which word forms and parts of speech are mixed; and
a part-of-speech fault-tolerant template generation device, which calculates, according to the annotation results and the word form features and part-of-speech features of the word segments in the generalized N-tuple, the extraction probability that a word segment in the generalized N-tuple is part of a multi-word unit as part-of-speech fault-tolerant information, so as to generate a part-of-speech fault-tolerant template.
While preferred embodiments of the present invention have been shown and described, it is contemplated that those skilled in the art may devise various modifications of the present invention within the spirit and scope of the appended claims.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201210320806.XA CN103678318B (en) | 2012-08-31 | 2012-08-31 | Multi-word unit extraction method and equipment and artificial neural network training method and equipment |
| Publication Number | Publication Date |
|---|---|
| CN103678318A CN103678318A (en) | 2014-03-26 |
| CN103678318B (en) | 2016-12-21 |
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201210320806.XA CN103678318B (en), Expired - Fee Related | Multi-word unit extraction method and equipment and artificial neural network training method and equipment | 2012-08-31 | 2012-08-31 |
| Country | Link |
|---|---|
| CN (1) | CN103678318B (en) |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105404632B (en)* | 2014-09-15 | 2020-07-31 | Shenzhen-Hong Kong Industry-University-Research Base | System and method for carrying out serialized annotation on biomedical text based on deep neural network |
| CN107301454B (en)* | 2016-04-15 | 2021-01-22 | Cambricon Technologies Corporation Limited | Artificial neural network reverse training device and method supporting discrete data representation |
| CN107977352A (en)* | 2016-10-21 | 2018-05-01 | Fujitsu Limited | Information processor and method |
| CN107273356B (en) | 2017-06-14 | 2020-08-11 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Word segmentation method, device, server and storage medium based on artificial intelligence |
| CN109829162B (en)* | 2019-01-30 | 2022-04-08 | New H3C Big Data Technologies Co., Ltd. | Text word segmentation method and device |
| CN110532551A (en)* | 2019-08-15 | 2019-12-03 | Suzhou Langdong Network Technology Co., Ltd. | Method, equipment and the storage medium for automatic extraction of text keywords |
| CN111291195B (en)* | 2020-01-21 | 2021-08-10 | Tencent Technology (Shenzhen) Co., Ltd. | Data processing method, device, terminal and readable storage medium |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101093504A (en)* | 2006-03-24 | 2007-12-26 | 国际商业机器公司 | System for extracting new compound word |
| CN101187921A (en)* | 2007-12-20 | 2008-05-28 | 腾讯科技(深圳)有限公司 | Chinese compound words extraction method and system |
| CN101354712A (en)* | 2008-09-05 | 2009-01-28 | 北京大学 | Chinese term automatic extraction system and method |
| Title |
|---|
| A study on multi-word extraction from Chinese documents; Wen Zhang et al.; *Advanced Web and Network Technologies, and Applications*; 2008-04-28; pp. 42-53* |
| Improving word representations via global context and multiple word prototypes; Eric H. Huang et al.; *Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics*; 2012-07-14; pp. 873-882* |
| Optimization of a neural-network-based Chinese word segmentation model; He Jia et al.; *Journal of Chengdu University of Information Technology*; 2006-12-31; pp. 812-815* |
| Research on Chinese word segmentation combining neural networks and matching; Li Hua; *Mind and Computation*; 2010-06-30; pp. 117-127* |
| Publication number | Publication date |
|---|---|
| CN103678318A (en) | 2014-03-26 |
| Publication | Publication Date | Title |
|---|---|---|
| Zhao et al. | | MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance |
| US11580415B2 (en) | | Hierarchical multi-task term embedding learning for synonym prediction |
| CN106777275B (en) | | Entity attribute and property value extracting method based on more granularity semantic chunks |
| Korhonen | | Subcategorization acquisition |
| Klementiev et al. | | Inducing crosslingual distributed representations of words |
| CN103678318B (en) | | Multi-word unit extraction method and equipment and artificial neural network training method and equipment |
| CN105988990B (en) | | Chinese zero-reference resolution device and method, model training method and storage medium |
| JP5936698B2 (en) | | Word semantic relation extraction device |
| US9892111B2 (en) | | Method and device to estimate similarity between documents having multiple segments |
| JP5356197B2 (en) | | Word semantic relation extraction device |
| US9514098B1 (en) | | Iteratively learning coreference embeddings of noun phrases using feature representations that include distributed word representations of the noun phrases |
| Xu et al. | | Cross-domain and semisupervised named entity recognition in Chinese social media: A unified model |
| CN112805715B (en) | | Identifying entity-attribute relationships |
| Taslimipoor et al. | | SHOMA at PARSEME shared task on automatic identification of VMWEs: Neural multiword expression tagging with high generalisation |
| Zennaki et al. | | Unsupervised and lightly supervised part-of-speech tagging using recurrent neural networks |
| Korpusik et al. | | Data collection and language understanding of food descriptions |
| Banerjee et al. | | Generating abstractive summaries from meeting transcripts |
| Zhao et al. | | Integrating ontology-based approaches with deep learning models for fine-grained sentiment analysis |
| Jawad et al. | | RUSAS: Roman Urdu Sentiment Analysis System |
| Liu et al. | | Language model augmented relevance score |
| Duque et al. | | Can multilinguality improve biomedical word sense disambiguation? |
| CN118395205B (en) | | Multi-mode cross-language detection method and device |
| Behera | | An experiment with the CRF++ parts-of-speech (POS) tagger for Odia |
| Chang et al. | | Unsupervised constraint driven learning for transliteration discovery |
| CN115712713A (en) | | Text matching method, device and system and storage medium |
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | C14 | Grant of patent or utility model | |
| | GR01 | Patent grant | |
| | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 2016-12-21; Termination date: 2018-08-31 |