CN113488034A - Voice information processing method, device, equipment and medium

Info

Publication number
CN113488034A
Authority
CN
China
Prior art keywords
keyword
target
character
entity
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010342423.7A
Other languages
Chinese (zh)
Inventor
蒋鹏民
孟卫明
王月岭
高雪松
唐至威
王彦芳
刘波
刘帅帅
田羽慧
陈维强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Group Co Ltd
Hisense Co Ltd
Original Assignee
Hisense Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Co Ltd
Priority to CN202010342423.7A
Publication of CN113488034A
Legal status: Pending

Abstract

The application discloses a method, an apparatus, a device, and a medium for processing voice information, which are used to solve the problems that existing methods for processing voice information have low accuracy and consume a large amount of storage space for templates. The method directly extracts each keyword contained in the text of the voice information to be recognized, searches the knowledge graph for the target entity corresponding to each keyword, and determines the connection relations between the target entities, thereby determining the target semantics of the voice information to be recognized. Because the target semantics can be recognized accurately without considering how the voice information is phrased, subsequent processing based on the recognized semantics is more accurate and the user experience is improved.

Description

Voice information processing method, device, equipment and medium
Technical Field
The present application relates to the field of natural language understanding technologies, and in particular, to a method, an apparatus, a device, and a medium for processing voice information.
Background
With the continuous breakthroughs in artificial intelligence technology and the increasing popularity of various intelligent devices, human-computer interaction occurs more and more frequently in people's daily work and life. Because people can interact with intelligent devices directly through voice, voice interaction has become an important means of human-computer interaction, and the technology is applied to more and more intelligent devices, such as televisions, intelligent butler servers, and smart home devices. For human-computer interaction, how to recognize the target semantics of collected voice information to be recognized and process the voice information based on the recognized semantics has been a question of considerable interest in recent years.
In the prior art, an Artificial Intelligence Markup Language (AIML) template matching method may be adopted to identify the target semantics of the collected voice information to be recognized and perform corresponding processing based on the identified semantics. The AIML template matching method performs speech recognition on the voice information to be recognized, converts the recognition result into corresponding text information, and then matches the text information against preset text templates; if the text is successfully matched with a preset text template, the semantics corresponding to the matched template are obtained.
The AIML template matching method has the following problems when recognizing the target semantics of voice information to be recognized:
(1) Because human language is diverse, complex, and ambiguous, one semantic meaning generally has multiple expressions. For example, 'open music', 'please help me to open music', and 'help me to open music' all express the semantics 'open music'. To recognize all of these phrasings, text templates for the different expressions of each semantic meaning must be stored in advance, and the resources consumed to store these text templates are very large.
(2) The method matches the text information directly against templates and lacks context, so the accuracy of the recognized semantics is low, the subsequent processing performed according to the recognized semantics is inaccurate, and the user experience is reduced.
Disclosure of Invention
The application provides a voice information processing method, apparatus, device, and medium, which are used to solve the problems that existing voice information processing methods have low accuracy and consume a large amount of storage space for templates.
In a first aspect, the present application provides a method for processing voice information, where the method includes:
acquiring text information of voice information to be recognized;
identifying, through a pre-trained semantic annotation model, whether each character in the text information is a character in a keyword; determining each keyword according to the characters identified as characters in keywords; determining, according to a pre-constructed knowledge graph, the target entity corresponding to each keyword in the knowledge graph and the connection relations existing between the target entities in the knowledge graph; and determining the semantics corresponding to those connection relations in the knowledge graph as the target semantics of the voice information to be recognized.
In a second aspect, the present application further provides an apparatus for processing voice information, the apparatus comprising:
the acquiring unit is used for acquiring text information of the voice information to be recognized;
the recognition unit is used for recognizing whether each character in the text information is a character in the keyword or not through a pre-trained semantic annotation model;
a determining unit configured to determine each keyword based on each character recognized as a character in the keywords;
and the processing unit is used for determining, according to a pre-constructed knowledge graph, the target entity corresponding to each keyword in the knowledge graph and the connection relations existing between the target entities in the knowledge graph, and for determining the semantics corresponding to those connection relations in the knowledge graph as the target semantics of the voice information to be recognized.
In a third aspect, the present application further provides an intelligent housekeeping server, where the intelligent housekeeping server at least includes a processor and a memory, and the processor is configured to implement the steps of the method for processing the voice information as described in any one of the above when executing the computer program stored in the memory.
In a fourth aspect, the present application further provides a computer-readable storage medium, which stores a computer program, and the computer program, when executed by a processor, implements the steps of the method for processing voice information as described in any one of the above.
The method and the device directly extract each keyword contained in the text of the voice information to be recognized, search the knowledge graph for the target entity corresponding to each keyword, and determine the connection relations between the target entities, thereby determining the target semantics of the voice information to be recognized. Because the target semantics can be recognized accurately without considering how the voice information is phrased, subsequent processing based on the recognized semantics is more accurate and the user experience is improved.
Drawings
In order to more clearly illustrate the technical solutions in the present application or the prior art, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of a processing procedure for voice information according to some embodiments of the present application;
FIG. 2 is a schematic diagram of a specific processing procedure for voice information according to some embodiments of the present application;
FIG. 3 is a schematic diagram of a specific processing procedure for voice information according to some embodiments of the present application;
FIG. 4 is a schematic illustration of a knowledge-graph result provided by some embodiments of the present application;
FIG. 5 is a schematic diagram of a training process for a semantic annotation model according to some embodiments of the present application;
FIG. 6 is a schematic diagram of a process for preprocessing text data samples according to some embodiments of the present application;
FIG. 7 is a schematic diagram of the basic framework of a prior-art BERT pre-training model;
FIG. 8 is a schematic diagram of a specific process for testing a pre-trained semantic annotation model according to some embodiments of the present application;
FIG. 9 is a schematic diagram of a specific processing procedure for voice information according to some embodiments of the present application;
FIG. 10 is a schematic structural diagram of an apparatus for processing voice information according to some embodiments of the present application;
FIG. 11 is a schematic structural diagram of a server according to some embodiments of the present application.
Detailed Description
In order to improve the accuracy of the determined semantics and improve the user experience, the application provides a voice information processing method and device, an intelligent housekeeping server and a medium.
In order to make the purpose, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In practical applications, when a user wants to control smart home devices through voice, for example to turn on the air conditioner in a bedroom or to query information, the voice information to be recognized that the user inputs can be collected by intelligent devices such as a smart speaker, an intelligent butler server, or a smart television. The intelligent device that collects the voice information to be recognized can perform semantic recognition locally, for example using the AIML template matching method, to determine the target semantics corresponding to the voice information, and then execute the corresponding operation, so that the smart home device is controlled as the user expects.
Of course, in some possible embodiments, the intelligent device that collects the voice information to be recognized may be unable to perform semantic recognition on it because of the limited performance of its local processor. In that case, the collected voice information to be recognized, or its text information, can be sent to a server; the server performs semantic recognition on the text information, determines the target semantics corresponding to the voice information to be recognized, determines control information according to the target semantics, and sends the control information to the target smart home device, so that the target smart home device performs the corresponding operation and is controlled as the user expects.
Fig. 1 is a schematic diagram of a processing process of voice information according to some embodiments of the present application, where the process includes:
s101: and acquiring text information of the voice information to be recognized.
The processing method of the voice information is applied to an electronic device, and the electronic device can be an intelligent device or a server. The intelligent device can be, for example, a smart television or a smart home device.
The text information of the voice information to be recognized acquired by the electronic device can be generated according to the received voice information to be recognized, and can also be the text information of the voice information to be recognized sent by other intelligent devices.
The voice information to be recognized received by the electronic device may be the voice information collected by the electronic device itself, or the voice information sent by other intelligent devices, and is not specifically limited herein.
S102: and identifying whether each character in the text information of the voice information to be identified is a character in the keyword or not through a pre-trained semantic annotation model.
Specifically, after the electronic device obtains the text information of the voice information to be recognized, it can recognize whether each character in the text information is a character in a keyword through a pre-trained semantic annotation model, such as a BERT-BiLSTM-CRF model.
Keywords are words that reflect part of the semantics of the text information. For example, in the text information 'please help me turn on the air conditioner', the keywords 'turn on' and 'air conditioner' carry the semantics; in 'what is the weather in Beijing tomorrow', the keywords are 'Beijing', 'tomorrow', and 'weather'. In the present application, a keyword may be a verb, a place name, a person's name, a brand name, or the like.
It should be noted that the process of performing speech recognition on the voice information to be recognized belongs to the prior art and is not described here again.
S103: each keyword is determined from each character identified as a character in the keyword.
The characters belonging to keywords in the text information may not all be contiguous. For example, in the text information 'inquire about air tickets flying to Beijing', the characters of 'inquire', 'Beijing' ('north' and 'jing'), and 'air ticket' ('airplane' and 'ticket') are the keyword characters, so when determining each keyword from the characters identified as keyword characters, each keyword can be determined directly from each run of consecutive keyword characters.
Continuing this example, grouping the consecutive keyword characters of the text information 'inquire about air tickets flying to Beijing' yields the keywords 'inquire', 'Beijing', and 'air ticket'.
S104: respectively determining target entities corresponding to the keywords in the knowledge graph according to a pre-constructed knowledge graph, and if connection relations exist between the target entities in the knowledge graph, determining the semantics corresponding to the connection relations in the knowledge graph as the target semantics of the voice information to be recognized.
In order to accurately determine the target semantics of the voice information to be recognized, a knowledge graph is constructed in advance in the present application; the knowledge graph stores the entities corresponding to the keywords, the connection relations between entities, and the semantics corresponding to each connection relation. An entity can be represented directly as a keyword. After each keyword is acquired as in the above embodiment, the target entity corresponding to each keyword is determined by traversing the pre-constructed knowledge graph. For example, the keyword 'two tigers' corresponds to the target entity 'two tigers', the keyword 'open' corresponds to the target entity 'open', and so on.
After the target entities are obtained, it is determined whether connection relations exist between them in the knowledge graph, and thus whether the voice information to be recognized has corresponding target semantics. Specifically, if connection relations exist between the target entities in the knowledge graph, indicating that the target semantics of the voice information to be recognized are stored in the knowledge graph, the semantics corresponding to those connection relations are determined as the target semantics of the voice information to be recognized. If no connection relation exists between any two target entities in the knowledge graph, indicating that the semantics of the voice information to be recognized are not stored in the knowledge graph, the text information of the voice information to be recognized is filtered out, and the text information of the next voice information to be recognized is acquired.
For example, the text information 'turn on air conditioner' of the voice information to be recognized contains the keywords 'turn on' and 'air conditioner'. According to the pre-constructed knowledge graph, the target entities corresponding to these keywords are determined to be 'turn on' and 'air conditioner'; a connection relation exists between these target entities in the knowledge graph, so the semantics 'turn on air conditioner' corresponding to that connection relation is determined as the target semantics of the recognized voice information.
As a counter-example, the text information 'open view schoolbag' of the voice information to be recognized contains the keywords 'open', 'view', and 'schoolbag'. According to the pre-constructed knowledge graph, the corresponding target entities are determined to be 'open', 'view', and 'schoolbag'; no connection relations exist between these target entities in the knowledge graph, so the text information 'open view schoolbag' is filtered out and the text information of the next voice information to be recognized is acquired.
However, because of the complexity and diversity of language, it may happen that connection relations do not exist between all of the recognized target entities, yet the target semantics of the voice information to be recognized can still be recognized through the connection relations that exist between some of them. Therefore, in the present application, when a connection relation exists between at least two target entities, indicating that the target semantics of the voice information to be recognized may be stored in the knowledge graph, the semantics corresponding to the connection relation existing between those target entities can be determined as the target semantics of the voice information to be recognized.
For example, the text information 'tell me the price of the air ticket flying to Beijing tomorrow' contains the keywords 'tell', 'tomorrow', 'Beijing', 'air ticket', and 'price'. According to the pre-constructed knowledge graph, the corresponding target entities are determined to be 'tell', 'tomorrow', 'Beijing', 'air ticket', and 'price'; the target entities 'tell', 'tomorrow', 'Beijing', and 'air ticket' have connection relations in the knowledge graph, so the semantics 'inquire about air tickets' corresponding to those connection relations is determined as the target semantics of the recognized voice information.
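To make this lookup concrete, here is a minimal sketch of the step in Python; the graph contents and function names are illustrative assumptions modeled on the examples above, not structures defined by the application.

```python
from itertools import combinations

# Connection relations between entities and the semantics attached to them
# (contents modeled on the examples in this application).
RELATIONS = {
    frozenset(["turn on", "air conditioner"]): "turn on air conditioner",
    frozenset(["tell", "tomorrow"]): "inquire about air tickets",
    frozenset(["tell", "Beijing"]): "inquire about air tickets",
    frozenset(["tell", "air ticket"]): "inquire about air tickets",
    frozenset(["tomorrow", "Beijing"]): "inquire about air tickets",
    frozenset(["tomorrow", "air ticket"]): "inquire about air tickets",
    frozenset(["Beijing", "air ticket"]): "inquire about air tickets",
}

def target_semantics(target_entities):
    """Collect the semantics of connection relations among the target entities.

    A relation between at least two target entities suffices, mirroring the
    partial-match rule above; no relation at all (or more than one distinct
    semantics, a case treated later in the application) filters the utterance.
    """
    found = {RELATIONS[frozenset(pair)]
             for pair in combinations(set(target_entities), 2)
             if frozenset(pair) in RELATIONS}
    return found.pop() if len(found) == 1 else None

print(target_semantics(["turn on", "air conditioner"]))  # turn on air conditioner
print(target_semantics(["open", "view", "schoolbag"]))   # None -> filtered
print(target_semantics(["tell", "tomorrow", "Beijing", "air ticket", "price"]))
# -> inquire about air tickets (a relation among a subset suffices)
```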
The method and the device directly extract each keyword contained in the text of the voice information to be recognized, search the knowledge graph for the target entity corresponding to each keyword, and determine the connection relations between the target entities, thereby determining the target semantics of the voice information to be recognized. Because the target semantics can be recognized accurately without considering how the voice information is phrased, subsequent processing based on the recognized semantics is more accurate and the user experience is improved.
In order to accurately determine each keyword included in the voice information to be recognized, on the basis of the above embodiment, in the present application, recognizing through the pre-trained semantic annotation model whether each character in the text information is a character in a keyword includes:
identifying, through the semantic annotation model, whether each character in the text information is a character in a keyword and whether it is the first character of a keyword;
and determining each keyword according to the characters identified as characters in keywords includes:
for each character that is the first character of a keyword, determining the keyword from that character together with the consecutive characters after it that are identified as non-first characters of a keyword.
In some cases several keywords appear consecutively, as in 'turn on air conditioner' or 'turn on music'. If each keyword were determined only from the characters recognized as keyword characters, such consecutive keywords would easily be merged into one keyword, which would affect the recognition of the subsequent semantics. Therefore, when the text information of the voice information to be recognized is recognized through the pre-trained semantic annotation model, it is necessary to recognize not only whether each character is a character in a keyword, but also whether it is the first character of a keyword.
For example, through the pre-trained semantic annotation model, the labels corresponding to the characters of the text information 'please help me open music' (请帮我打开音乐) of the voice information to be recognized are [0 0 0 B I B I], where every character whose label is not 0 is a character in a keyword, and a character labeled B is the first character of a keyword.
Based on this, after identifying for each character in the text information whether it is a character in a keyword and whether it is the first character of a keyword, each keyword is determined, for each first character, from that character together with the consecutive characters after it that are identified as non-first keyword characters. In the example above, the first characters '打' and '音' combine with the following '开' and '乐' to give the keywords '打开' ('open') and '音乐' ('music').
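As a minimal sketch (not the application's own code), the decoding from per-character labels to keywords can look like this, using the 0/B/I scheme above:

```python
def decode_keywords(chars, labels):
    """Group each B label and the consecutive I labels after it into a keyword."""
    keywords, current = [], ""
    for ch, tag in zip(chars, labels):
        if tag == "B":                 # first character starts a new keyword
            if current:
                keywords.append(current)
            current = ch
        elif tag == "I" and current:   # continue the keyword in progress
            current += ch
        else:                          # 0 (or a stray I): close any open keyword
            if current:
                keywords.append(current)
            current = ""
    if current:
        keywords.append(current)
    return keywords

# "please help me open music": labels 0 0 0 B I B I as in the example above
print(decode_keywords(list("请帮我打开音乐"), ["0", "0", "0", "B", "I", "B", "I"]))
# -> ['打开', '音乐']  ("open", "music")
```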
By recognizing, through the pre-trained semantic annotation model, whether each character in the text information of the voice information to be recognized is a character in a keyword and whether it is the first character of a keyword, each keyword contained in the text information can subsequently be determined accurately, so that the target semantics of the voice information to be recognized can be recognized accurately.
In order to accurately determine the target semantics of the voice information to be recognized, on the basis of the foregoing embodiments, in the present application, determining, according to a pre-constructed knowledge graph, the target entity corresponding to each keyword in the knowledge graph includes:
judging, for each keyword, whether the keyword matches any preset keyword stored in the knowledge graph; if so, determining the entity corresponding to the matched preset keyword according to the correspondence between preset keywords and entities stored in the knowledge graph, and taking that entity as the target entity corresponding to the keyword.
Generally, if every keyword were taken as an entity, the knowledge graph built from those entities would be too large and complicated to search, and the connection relations between many pairs of entities would all correspond to one semantic meaning. For example, the connection relation between the entities 'open' and 'two tigers' and the connection relation between 'open' and 'forgetting water' both correspond to the semantics 'play music', because 'two tigers' and 'forgetting water' are different songs within the category of music. The music category alone contains a great many songs, and a graph with one entity per song name would be very complicated, before even counting other categories such as videos and places.
Therefore, for convenience of searching, the present application may store in advance the preset keywords included under each entity. For example, 'music' is taken as an entity, each song name included under the 'music' entity is stored, and the keyword corresponding to each song name belongs to the 'music' entity. After each keyword is obtained as in the above embodiment, it is matched against the preset keywords stored in the knowledge graph to determine the entity to which it belongs.
Specifically, if there is a preset keyword that matches the keyword, indicating that the entity corresponding to the keyword can be identified, the entity to which the matched preset keyword belongs is determined and taken as the target entity corresponding to the keyword.
For example, the keyword 'two tigers' matches the preset keyword 'two tigers' stored in the knowledge graph, indicating that the corresponding entity can be identified; the entity 'music' to which the matched preset keyword belongs is determined and taken as the target entity corresponding to the keyword 'two tigers'.
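A minimal sketch of this matching step, with an illustrative (assumed) store of preset keywords:

```python
# Preset keywords stored under each entity (illustrative contents).
PRESET_KEYWORD_TO_ENTITIES = {
    "two tigers": ["music"],                  # a song name under the "music" entity
    "forgetting water": ["music", "video"],   # may belong to more than one entity
    "open": ["open"],
}

def match_entities(keyword):
    """Return the candidate entities for a keyword, or None if no preset keyword matches."""
    return PRESET_KEYWORD_TO_ENTITIES.get(keyword)

print(match_entities("two tigers"))   # ['music']
print(match_entities("schoolbag"))    # None: no match, so no entity can be identified
```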
In order to accurately process the voice information to be recognized, in the present application, after the target semantics of the voice information to be recognized are determined, the method further includes:
determining target operation corresponding to the target semantics according to a corresponding relation between pre-stored semantics and operation;
determining target equipment corresponding to the target operation according to a pre-stored corresponding relation between the operation and the equipment;
and determining the specific operation content of the target operation executed by the target equipment according to each matched preset keyword.
After the target semantics of the voice information to be recognized are determined as in the above embodiments, the voice information needs to be processed further. To do this accurately, the present application pre-stores a correspondence between semantics and operations; for example, the operation corresponding to 'play music' is opening a music APP, and the operation corresponding to 'inquire about a ticket price' is querying and outputting the ticket price. After the target semantics of the voice information to be recognized are determined, the target operation corresponding to the target semantics is determined according to this pre-stored correspondence.
Further, different devices can perform different operations; for example, an air conditioner can adjust temperature and wind direction, while a sound box can play music and adjust volume. To process the voice information to be recognized still more accurately, the present application also pre-stores a correspondence between operations and devices, and after the target operation is determined as in the above embodiment, the target device corresponding to the target operation is determined according to this correspondence. For example, the target device corresponding to the operation of playing music is a sound box, and the target device corresponding to the operation of turning on the air conditioner is an air conditioner.
Furthermore, even after the target device and the target operation are determined, the target device may still not know how to proceed. For example, if the target device is a sound box and the target operation is playing music, the sound box cannot proceed because it does not know which song to play. Therefore, to control the target device more precisely, the specific operation content of the target operation to be performed by the target device can be enriched with the matched preset keywords obtained in the above embodiment. For example, from the matched preset keywords 'open' and 'two tigers', it is determined that the specific content of the sound box's play-music operation is to play the music of 'two tigers', so the sound box is controlled accurately.
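A minimal sketch of this semantics-to-operation-to-device chain; the stored correspondences and names are assumptions for illustration:

```python
# Pre-stored correspondences (illustrative contents, not defined by the application).
SEMANTICS_TO_OPERATION = {
    "play music": "open the music APP",
    "inquire about a ticket price": "query and output the ticket price",
}
OPERATION_TO_DEVICE = {
    "open the music APP": "sound box",
    "query and output the ticket price": "smart television",
}

def plan(target_semantics, matched_keywords):
    """Map target semantics to an operation and device, enriched by the keywords."""
    operation = SEMANTICS_TO_OPERATION[target_semantics]
    device = OPERATION_TO_DEVICE[operation]
    # the matched preset keywords supply the specifics, e.g. which song to play
    return device, operation, matched_keywords

print(plan("play music", ["open", "two tigers"]))
# -> ('sound box', 'open the music APP', ['open', 'two tigers'])
```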
The following describes in detail a method for processing speech information according to the present application, with reference to specific embodiments. Fig. 2 is a schematic diagram of a specific processing process of voice information according to some embodiments of the present application, where the process includes:
s201: and acquiring text information of the voice information to be recognized.
S202: and identifying whether each character in the text information of the voice information to be identified is a character in the keyword or not and whether each character is information of a first character in the keyword or not through a semantic annotation model.
S203: for each character that is the first character in the keyword, each keyword is determined based on the character that is the first character in the keyword and each character that is subsequent thereto and that is identified as a non-first character in the keyword.
S204: and judging whether the keyword is matched with any preset keyword stored in the knowledge graph or not for each keyword, if so, executing S205, and otherwise, executing S201.
S205: and determining an entity corresponding to the matched preset keyword according to the corresponding relation between the preset keyword and the entity stored in the knowledge graph, and taking the entity as a target entity corresponding to the keyword.
S206: and judging whether the target entities have a connection relation in the knowledge graph, if so, executing S207, otherwise, executing S201.
S207: and determining the semantics corresponding to the connection relation in the knowledge graph as the target semantics of the voice information to be recognized.
S208: and determining the target operation corresponding to the target semantic according to the corresponding relation between the pre-stored semantic and the operation.
S209: and determining target equipment corresponding to the target operation according to the corresponding relation between the operation and the equipment which is stored in advance.
S210: and determining the specific operation content of the target equipment for executing the target operation according to each matched preset keyword.
On the basis of the foregoing embodiments, in order to further accurately process the voice information to be recognized, in the present application, determining the entity corresponding to the matched preset keyword according to the correspondence between preset keywords and entities stored in the knowledge graph, and taking that entity as the target entity corresponding to the keyword, includes:
for each matched preset keyword, when the number of entities to which it belongs is greater than 1, determining each such entity as a candidate entity, outputting prompt information for selecting among the candidate entities, and taking the selected candidate entity as the target entity corresponding to the matched preset keyword;
and determining the target device corresponding to the target operation according to the pre-stored correspondence between operations and devices includes:
when the number of target devices is greater than 1, determining each target device as a candidate device, outputting prompt information for selecting among the candidate devices, and taking the selected candidate device as the target device.
In practical applications, when the target semantics of the voice information to be recognized are determined, the same keyword may correspond to different entities; for example, the entity to which the keyword 'forgetting water' belongs may be music or video, which prevents the target semantics from being determined accurately. Therefore, in order to accurately identify the entity to which each keyword belongs, in the present application, for each matched preset keyword, when the number of entities to which it belongs is determined to be greater than 1, each corresponding target entity is taken as a candidate entity and prompt information for selecting among the candidate entities is output, so that the user makes a further selection.
The prompt information for selecting among the candidate entities may be output in audio form through voice broadcast; for example, if the candidate entities are music and video, the broadcast prompt may be 'please choose: music, or video'. It may also be displayed in text form on a display interface; for example, the interface may show 'Please select: A. Music B. Video'. The two output modes can also be combined, that is, the audio prompt is broadcast while the text prompt is displayed on the display interface.
Which mode is used to output the prompt information may be preset according to the user's preference, or selected according to the capability of the electronic device; for example, some electronic devices have no display interface on which the prompt can be shown, and for these devices the prompt information can be broadcast in audio form.
After the prompt information for selecting among the candidate entities is output, the user's selection result can be received, the selected candidate entity identified, and that candidate entity taken as the target entity corresponding to the matched preset keyword.
Correspondingly, in the present application, the user can preset, for each prompt output mode, the form in which the electronic device receives the selection result. For the broadcast audio prompt and/or the text prompt displayed on the display interface, the selection result can be input as voice information or by clicking one of the candidate entities displayed on the display interface. Specifically, if the audio prompt is broadcast, a selection result input as voice information can be received; if the text prompt is displayed on the display interface, the selection result can be input either as voice information or by clicking one of the displayed candidate entities.
In addition, as the functions of intelligent devices become more and more comprehensive, some functions of different devices may overlap; for example, a smart television can also play music, and a smartphone can also play video and music, so after the target operation is determined it may be unclear which device should be controlled to perform it. Therefore, in order to control the target device accurately, in the present application, after the target devices are determined as in the above embodiment, it is judged whether their number is greater than 1, and thus whether to output prompt information for selecting a device.
Specifically, when the number of target devices is determined to be greater than 1, indicating that multiple target devices can perform the target operation, each target device is taken as a candidate device, prompt information for selecting among the candidate devices is output, and the selected candidate device is taken as the target device.
The manner of outputting the prompt information for selecting each candidate device and receiving the selection result of the candidate device selected by the user is the same as above, and is not described herein again.
In addition, false wake-ups may occur. If no selection result is received within a set time after the prompt information is output, a false wake-up is confirmed; the voice information to be recognized is then filtered out, and the next voice information to be recognized is collected.
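A minimal sketch of this candidate-selection flow; the prompt wording, timeout value, and the read_selection callback are assumptions:

```python
def choose(candidates, read_selection, timeout_s=10):
    """Ask the user to pick one candidate; None signals a false wake-up.

    `read_selection` is a hypothetical callback that returns the user's
    choice (by voice or by clicking) or None if the timeout expires.
    """
    if len(candidates) == 1:
        return candidates[0]
    options = " ".join(f"{chr(ord('A') + i)}. {c}" for i, c in enumerate(candidates))
    print(f"Please select: {options}")     # stand-in for the audio/text prompt output
    selection = read_selection(timeout=timeout_s)
    return selection                        # None -> filter the utterance

# e.g. entity = choose(["music", "video"], read_selection=my_input_fn)
```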
The following describes in detail a method for processing speech information according to the present application, with reference to specific embodiments. Fig. 3 is a schematic diagram of a specific processing process of voice information according to some embodiments of the present application, where the process includes:
s301: and acquiring text information of the voice information to be recognized.
S302: and identifying whether each character in the text information of the voice information to be identified is a character in the keyword or not and whether each character is information of a first character in the keyword or not through a semantic annotation model.
S303: for each character that is the first character in the keyword, each keyword is determined based on the character that is the first character in the keyword and each character that is subsequent thereto and that is identified as a non-first character in the keyword.
S304: and judging whether the keyword is matched with any preset keyword stored in the knowledge graph or not for each keyword, if so, executing S305, and otherwise, executing S301.
S305: and determining the entity corresponding to the matched preset keyword according to the corresponding relation between the preset keyword and the entity stored in the knowledge graph. Judging whether the number of target entities to which the matched preset keywords belong is greater than 1 or not according to each matched preset keyword, if so, determining each target entity as a candidate entity, outputting prompt information for selecting each candidate entity, and taking the selected candidate entity as the target entity corresponding to the matched preset keyword; otherwise, directly taking the entity as the target entity corresponding to the keyword.
S306: and judging whether the target entities have a connection relation in the knowledge graph, if so, executing S307, and otherwise, executing S301.
S307: and determining the semantics corresponding to the connection relation in the knowledge graph as the target semantics of the voice information to be recognized.
S308: and determining the target operation corresponding to the target semantic according to the corresponding relation between the pre-stored semantic and the operation.
S309: and determining target equipment corresponding to the target operation according to the corresponding relation between the operation and the equipment which is stored in advance. Judging whether the number of the target devices is greater than 1, if so, determining each target device as a candidate device, outputting prompt information for selecting each candidate device, and taking the selected candidate device as the target device; otherwise, S310 is performed.
S310: and determining the specific operation content of the target equipment for executing the target operation according to each matched preset keyword.
In order to accurately determine the target semantics of the speech information to be recognized, on the basis of the above embodiments, in the present application, the knowledge graph is determined as follows:
obtaining, for any text data sample in a sample set, the corresponding semantics, each keyword contained in the sample, and the entity to which each keyword belongs, and storing the correspondence between each keyword and its entity;
for each keyword in each text data sample, if the entity to which the keyword belongs does not exist in the entity graph of the knowledge graph, writing that entity into the entity graph;
and, for each text data sample, looking up the entity to which each of its keywords belongs, judging whether connection relations exist between the corresponding entities in the entity graph, and, if not, establishing the connection relations between the corresponding entities in the entity graph and storing the correspondence between those connection relations and the semantics of the text data sample.
In order to accurately determine the target semantics of the voice information to be recognized, a knowledge graph can be constructed from a pre-acquired sample set, using the semantics corresponding to each text data sample, the keywords each sample contains, and the entity to which each keyword belongs. The semantics corresponding to different text data samples in the sample set can be the same or different, but text data samples with the same semantics do not contain exactly the same keywords.
The equipment for constructing the knowledge graph and the electronic equipment for performing semantic recognition by using the knowledge graph subsequently can be the same or different.
Specifically, the semantics corresponding to any text data sample in the sample set, each keyword it contains, and the entity to which each keyword belongs are obtained, and the correspondence between each keyword and its entity is stored. If the correspondence between a keyword and its entity is already stored, the correspondence to be stored is filtered out; otherwise it is stored.
After the correspondence between each keyword and its entity is stored, it is judged, for each keyword in each text data sample, whether the entity to which the keyword belongs already exists in the entity graph; if so, the entity is not written again, and if not, it is written into the entity graph.
Although the entity to which each keyword of each text data sample belongs has now been written into the entity graph, the connection relations for each text data sample have not necessarily been established. Therefore, after it is determined that the entities of all keywords of a text data sample are stored in the entity graph, it is judged whether the connection relations between those entities exist in the entity graph; if so, the text data sample is filtered out, and if not, the connection relations between the corresponding entities are established in the entity graph and the correspondence between those connection relations and the semantics of the text data sample is stored.
The sample set contains a large number of text data samples, and the above operations are performed for each of them; when the semantics corresponding to every text data sample can be determined through the entity graph, construction of the knowledge graph is determined to be complete.
In a possible implementation, when the knowledge graph is constructed, the text data samples in the sample set can be divided into training samples and test samples; the knowledge graph is constructed from the training samples, and its reliability is then verified on the test samples.
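A minimal sketch of this construction procedure; the sample contents and data structures are illustrative assumptions:

```python
from itertools import combinations

keyword_to_entity = {}    # preset keyword -> the entity it belongs to
entity_nodes = set()      # nodes of the entity graph
relation_semantics = {}   # connection relation -> semantics

samples = [
    # (semantics, {keyword: entity to which it belongs})
    ("play music", {"open": "open", "two tigers": "music"}),
    ("play music", {"open": "open", "forgetting water": "music"}),
    ("open radio station", {"open": "open", "radio station": "radio station"}),
]

for semantics, keywords in samples:
    for keyword, entity in keywords.items():
        keyword_to_entity.setdefault(keyword, entity)  # store the correspondence once
        entity_nodes.add(entity)                       # write the entity if missing
    for a, b in combinations(set(keywords.values()), 2):
        # establish the connection relation and its semantics if not yet present
        relation_semantics.setdefault(frozenset((a, b)), semantics)

print(sorted(entity_nodes))
print(relation_semantics)
# -> {frozenset({'open', 'music'}): 'play music',
#     frozenset({'open', 'radio station'}): 'open radio station'}
```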
FIG. 4 is a schematic illustration of a knowledge-graph result provided by some embodiments of the present application. Referring to FIG. 4, 'open', 'close', 'music', 'radio station', and 'story' are all entities. Connection relations exist between some entities, such as 'open' and 'story', while none exist between others, such as 'radio station' and 'music'. A semantic meaning corresponds to each pair of connected entities: the connection relation between 'open' and 'music' corresponds to the semantics 'play music', the connection relation between 'open' and 'radio station' corresponds to 'open radio station', and the connection relation between 'close' and 'music' corresponds to 'close music'.
In order to accurately determine the target semantics of the speech information to be recognized, on the basis of the above embodiments, in the present application, the semantic annotation model is trained in the following manner:
acquiring any text data sample in a sample set and its corresponding labeling sequence, where the labeling sequence records, for each character in the text data sample, whether it is a character in a keyword and whether it is the first character of a keyword;
and training an original semantic annotation model according to the text data sample and the annotation sequence.
In order to accurately determine each keyword contained in the voice information to be recognized, the original semantic annotation model can be trained on each text data sample in a pre-acquired sample set together with its corresponding annotation sequence. The annotation sequence records, for each character of the text data sample, whether it is a character in a keyword and whether it is the first character of a keyword: characters whose label is not 0 are characters in keywords, characters whose label is 0 are not, and characters labeled B are the first characters of keywords.
In addition, in order to increase the diversity of the text data samples, the text data samples include text data samples of different expression modes, such as "turn on air conditioner", "turn off air conditioner", and the like.
It should be noted that the device for training the semantic annotation model may be the same as or different from the electronic device for subsequent semantic recognition.
The original semantic annotation model can be used to identify whether each character in a text data sample is a character in a keyword; the model is then trained according to the identification results and the corresponding annotation sequence, so as to adjust the values of its parameters.
For example, for the text data sample 'inquire about the price of air tickets flying to Beijing', the corresponding annotation sequence is [B I 0 0 B I 0 B I 0 B I]; if the sequence recognized through the original semantic annotation model, for example [B B I 0 B I 0 B I 0 B I], is inconsistent with the annotation sequence, the text data sample is determined to have been recognized incorrectly.
The sample set used to train the original semantic annotation model contains a large number of text data samples; the above operations are performed on each of them, and training of the original semantic annotation model finishes when the preset convergence condition is met.
The preset convergence condition may be, for example, that the number of text data samples in the sample set correctly identified by the original semantic annotation model exceeds a set number, or that the number of training iterations reaches a set maximum, and so on. This can be set flexibly in a specific implementation and is not limited here.
In a possible implementation, when training the original semantic annotation model, the text data samples in the sample set may be divided into training samples and test samples; the model is trained on the training samples, and the reliability of the trained semantic annotation model is then verified on the test samples.
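A minimal sketch of producing a 0/B/I annotation sequence for a text data sample from its known keywords (sample contents assumed):

```python
def label_sequence(text, keywords):
    """Label each character 0, or B/I if it starts/continues a keyword occurrence."""
    labels = ["0"] * len(text)
    for kw in keywords:
        start = text.find(kw)
        while start != -1:              # label every occurrence of the keyword
            labels[start] = "B"
            for i in range(start + 1, start + len(kw)):
                labels[i] = "I"
            start = text.find(kw, start + len(kw))
    return labels

print(label_sequence("请帮我打开音乐", ["打开", "音乐"]))
# -> ['0', '0', '0', 'B', 'I', 'B', 'I']
```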
The following describes the training process in detail, taking the BERT-BiLSTM-CRF model as the semantic annotation model.
FIG. 5 is a schematic diagram of a specific training process for the semantic annotation model according to some embodiments of the present application, and the process includes:
Firstly, BIO labeling is performed on each text data sample in the sample set to obtain its labeling sequence, where B represents the beginning of an entity or relation, I represents a non-beginning part of an entity or relation, and O represents a part that is neither. Taking the two text data samples 'open music' (打开音乐) and 'please help me open music' (请帮我打开音乐) as an example, the corresponding labeling sequences are [CLS] 打/B-open 开/I-open 音/B-music 乐/I-music [SEP] and [CLS] 请/O 帮/O 我/O 打/B-open 开/I-open 音/B-music 乐/I-music [SEP], with [CLS] added at the beginning and [SEP] added at the end of each text data sample.
Secondly, each text data sample and its corresponding labeling sequence are preprocessed to generate the corresponding word embedding vectors, sentence embedding vectors, and position embedding vectors.
FIG. 6 shows a process for preprocessing a text data sample according to some embodiments of the present application.
Here, Token Embeddings denote word embeddings and Segment Embeddings denote sentence embeddings, where E[A] marks tokens of the first sentence and E[B] tokens of the second; Position Embeddings denote position embeddings.
Taking the example above, for the pair of labeled samples, the generated word embedding vectors are E[CLS], E[打], E[开], ..., E[SEP], E[CLS], E[请], E[帮], ..., E[SEP]; the generated sentence embedding vectors are E[A], E[A], E[A], ..., E[A] for the first sample and E[B], E[B], E[B], ..., E[B] for the second; and the generated position embedding vectors are E[0], E[1], E[2], ..., E[5], E[6], E[7], E[8], ..., E[14].
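For illustration, the three kinds of inputs can be produced with the Hugging Face transformers tokenizer; the package and checkpoint name are assumptions, since the application does not specify an implementation:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
enc = tokenizer("打开音乐", "请帮我打开音乐", return_tensors="pt")

print(enc["input_ids"])       # token ids, one per character plus [CLS]/[SEP] -> word embeddings
print(enc["token_type_ids"])  # 0 for sentence A, 1 for sentence B -> sentence embeddings
# position embeddings are derived inside the model from each token's index 0..n-1
```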
The generated word embedding, sentence embedding, and position embedding vectors are then input together into the open-source Google BERT pre-training model to obtain a 512-dimensional vector for each character.
Fig. 7 is a schematic diagram of a basic framework structure of a conventional BERT pre-training model.
The BERT pre-training model is trained on BooksCorpus and English Wikipedia data totaling 3.3 billion words; the standard version has about 100 million parameters, on a par with GPT. In addition, Google used 16 TPU clusters (64 TPUs in total) to train the large version of the BERT pre-training model in 4 days, and the trained large version has more than 300 million parameters, the largest pre-training model in natural language processing at the time. On the basis of the BERT pre-training model, random masking can be used to increase the generalization capability of the model; the Transformer fully captures contextual information; sentence-level features are learned; and each character is accurately identified and a corresponding vector generated for it.
Finally, the vector obtained for each character is input into the original BiLSTM+CRF network model to identify whether each character is a character in a keyword and whether it is the first character of a keyword.
The above operations are performed on each text data sample in the sample set to obtain the trained BERT-BiLSTM-CRF model, that is, the pre-trained semantic annotation model.
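The following skeleton shows one way such a BERT-BiLSTM-CRF tagger can be assembled, assuming the Hugging Face transformers and pytorch-crf packages and a Chinese BERT checkpoint; it is a sketch under those assumptions, not the application's implementation:

```python
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF

class BertBiLstmCrf(nn.Module):
    def __init__(self, num_tags, bert_name="bert-base-chinese", lstm_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)      # per-character vectors
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            bidirectional=True, batch_first=True)
        self.emit = nn.Linear(2 * lstm_hidden, num_tags)      # per-tag emission scores
        self.crf = CRF(num_tags, batch_first=True)            # sequence-level layer

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.emit(self.lstm(hidden)[0])
        mask = attention_mask.bool()
        if tags is not None:                    # training: negative log-likelihood loss
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # inference: best tag paths
```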
After the pre-trained semantic annotation model is obtained, it is tested on test text data samples. FIG. 8 is a schematic diagram of a specific testing process for a pre-trained semantic annotation model according to some embodiments of the present application, and the process includes:
s801: any test text data sample is obtained, such as "please help me to open music", and each keyword "open" and "music" corresponding to the test text data sample.
S802: the test text data samples are input to a pre-trained BERT-BILSTM-CRF model.
S803: and obtaining a label corresponding to each word in the test text data sample, and asking O to help O I O to make B-app and start I-app sound B-app music I-app.
S804: and processing the label corresponding to each word to obtain each keyword, namely 'open' and 'music'.
In order to accurately determine the target semantics of the voice information to be recognized, on the basis of the foregoing embodiments, in this application, after the connection relations existing in the knowledge graph between the target entities are determined, and before the semantics corresponding to those connection relations are determined as the target semantics of the voice information to be recognized, the method further includes:
if the number of semantics corresponding to the connection relations in the knowledge graph is greater than 1, filtering out the text information of the voice information to be recognized.
Due to the diversity of language, when the processing method of this application is applied in practice to determine whether the target entities are connected in the knowledge graph, there may be a connection relation between some target entities and a separate connection relation between others. For example, suppose four target entities A, B, C, and D are determined, where A is connected to B and C is connected to D, and the two connection relations carry different semantics; the number of semantics corresponding to the connection relations is then greater than 1. In this case the electronic device cannot correctly recognize the target semantics of the voice information to be recognized and therefore cannot process it correctly.
Therefore, in order to accurately determine the target semantics of the voice information to be recognized, in this application, after it is determined based on the above embodiments that connection relations exist between the target entities in the knowledge graph, the number of semantics corresponding to those relations is checked. If it is greater than 1, it is determined that the electronic device cannot correctly recognize the target semantics, and the text information of the voice information to be recognized is filtered out; otherwise, the subsequent step of determining the semantics corresponding to the connection relation as the target semantics of the voice information to be recognized is executed.
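This ambiguity check reduces to a small guard; a minimal sketch, assuming relation_semantics is the list of meanings attached to the discovered connection relations:

```python
def resolve_target_semantics(relation_semantics):
    if len(relation_semantics) > 1:   # ambiguous: more than one candidate meaning
        return None                   # caller filters the text and re-acquires input
    return relation_semantics[0] if relation_semantics else None
```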
Fig. 9 is a schematic diagram of a specific processing procedure of voice information according to some embodiments of the present application, where the processing procedure includes:
s901: and acquiring text information of the voice information to be recognized.
S902: and carrying out voice recognition on the voice information to be recognized, and converting the voice information to be recognized into corresponding text information.
S903: and identifying whether each character in the text information of the voice information to be identified is a character in the keyword or not and whether each character is information of a first character in the keyword or not through a semantic annotation model.
Specifically, for each character identified as the first character of a keyword, the keyword is determined from that character together with the consecutive subsequent characters identified as non-initial characters of the keyword.
For each keyword, it is judged whether the keyword matches any preset keyword stored in the knowledge graph; if so, the subsequent steps are executed, otherwise S901 is executed.
S904: and searching a target entity corresponding to each keyword in the knowledge graph, determining whether a connection relation exists between the target entities, if so, executing S905, otherwise, filtering text information of the voice information to be recognized, and executing S901.
Specifically, according to the correspondence between preset keywords and entities stored in the knowledge graph, the entity corresponding to the matched preset keyword is determined and taken as the target entity corresponding to that keyword.
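A toy version of the keyword-to-entity and entity-to-relation lookups might look as follows; the dictionary layout is an assumption, since the patent does not prescribe a storage format for the knowledge graph:

```python
# toy knowledge graph: keyword -> entity, and (entity, entity) -> semantics
KEYWORD_TO_ENTITY = {"打开": "open_action", "音乐": "music_app"}
RELATIONS = {("open_action", "music_app"): "launch the music application"}

def find_target_entities(keywords):
    return [KEYWORD_TO_ENTITY[k] for k in keywords if k in KEYWORD_TO_ENTITY]

def find_relation_semantics(entities):
    semantics = []
    for i, a in enumerate(entities):
        for b in entities[i + 1:]:
            rel = RELATIONS.get((a, b)) or RELATIONS.get((b, a))
            if rel:
                semantics.append(rel)
    return semantics

print(find_relation_semantics(find_target_entities(["打开", "音乐"])))
# ['launch the music application']
```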
When determining the entity corresponding to a matched preset keyword, the method further includes: for each matched preset keyword, when the number of entities to which the matched preset keyword may belong is greater than 1, determining each of those entities as a candidate entity, outputting prompt information for selecting among the candidate entities, and taking the selected candidate entity as the target entity corresponding to the matched preset keyword.
S905: and determining the semantics corresponding to the connection relation between the target entities as the target semantics of the voice information to be recognized.
After S904 and before S905, it is determined whether the number of semantics corresponding to the connection relationship acquired in S904 is greater than 1, if so, the text information of the speech information to be recognized is filtered, and S901 is executed, otherwise, S905 is executed.
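Tying S901-S905 together, the whole loop can be sketched as below; speech_to_text and tag_characters stand in for the speech recognizer and the trained BERT-BiLSTM-CRF model and are hypothetical, while the remaining helpers are the sketches given earlier:

```python
def process_utterance(audio):
    text = speech_to_text(audio)                # S901/S902: speech -> text (hypothetical ASR)
    chars = list(text)
    labels = tag_characters(chars)              # S903: semantic annotation model
    keywords = labels_to_keywords(chars, labels)
    entities = find_target_entities(keywords)   # S904: knowledge-graph lookup
    semantics = find_relation_semantics(entities)
    if not semantics:
        return None                             # no connection relation: filter, back to S901
    return resolve_target_semantics(semantics)  # S905; None if ambiguous (count > 1)
```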
After the target semantics are recognized, the target operation corresponding to the target semantics is determined according to the pre-stored correspondence between semantics and operations.
The target device corresponding to the target operation is then determined according to the pre-stored correspondence between operations and devices.
When determining the target device corresponding to the target operation, the method further includes: when the number of target devices is greater than 1, determining each target device as a candidate device, outputting prompt information for selecting among the candidate devices, and taking the selected candidate device as the target device.
Finally, the specific operation content with which the target device executes the target operation is determined according to the matched preset keywords.
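The dispatch step can likewise be sketched with flat lookup tables; the table contents and the prompt_user_to_choose helper are hypothetical, since the patent only requires that the correspondences be stored in advance:

```python
SEMANTICS_TO_OPERATION = {"launch the music application": "start_app"}
OPERATION_TO_DEVICES = {"start_app": ["living_room_speaker"]}

def prompt_user_to_choose(devices):
    # hypothetical selection prompt; in practice the user picks a candidate device
    return devices[0]

def dispatch(target_semantics, keywords):
    operation = SEMANTICS_TO_OPERATION[target_semantics]
    devices = OPERATION_TO_DEVICES[operation]
    device = prompt_user_to_choose(devices) if len(devices) > 1 else devices[0]
    # the matched keywords supply the concrete operation content, e.g. "music"
    return device, operation, keywords
```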
In this application, each keyword contained in the text of the voice information to be recognized is extracted directly, the corresponding target entities are looked up in the knowledge graph according to those keywords, and the target semantics of the voice information are determined from the connection relations between the target entities. The target semantics can therefore be recognized accurately without regard to how the voice information is phrased, so subsequent processing based on the recognized semantics is more accurate and the user experience is improved.
The present application further provides a device for processing voice information. Fig. 10 is a schematic structural diagram of a device for processing voice information according to some embodiments of the present application, where the device includes:
an acquisition unit 1001, configured to acquire text information of the voice information to be recognized;
an identifying unit 1002, configured to identify, through a pre-trained semantic annotation model, whether each character in the text information is a character in a keyword;
a determining unit 1003, configured to determine each keyword from the characters identified as characters in keywords;
a processing unit 1004, configured to determine, according to a pre-constructed knowledge graph, the target entity corresponding to each keyword and the connection relations between the target entities in the knowledge graph, and to determine the semantics corresponding to those connection relations as the target semantics of the voice information to be recognized.
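Structurally, the four units of Fig. 10 could be sketched as one class whose methods delegate to the helpers above; this is an illustrative arrangement, not the patent's implementation:

```python
class VoiceInfoProcessor:
    """Sketch of Fig. 10; each method mirrors one unit."""

    def acquire(self, voice):                    # acquisition unit 1001
        return speech_to_text(voice)             # hypothetical ASR front end

    def identify(self, text):                    # identifying unit 1002
        return tag_characters(list(text))        # pre-trained annotation model

    def determine(self, text, labels):           # determining unit 1003
        return labels_to_keywords(list(text), labels)

    def process(self, keywords):                 # processing unit 1004
        entities = find_target_entities(keywords)
        semantics = find_relation_semantics(entities)
        return resolve_target_semantics(semantics)
```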
As in the method embodiment, the device extracts each keyword directly from the text of the voice information to be recognized, looks up the corresponding target entities in the knowledge graph, and determines the target semantics from the connection relations between them, so the same gains in recognition accuracy and user experience apply regardless of how the voice information is phrased.
Fig. 11 is a schematic structural diagram of a server according to some embodiments of the present application. On the basis of the foregoing embodiments, the present application further provides a server, as shown in Fig. 11, including a processor 1101, a communication interface 1102, a memory 1103, and a communication bus 1104, where the processor 1101, the communication interface 1102, and the memory 1103 communicate with each other through the communication bus 1104;
the memory 1103 stores a computer program that, when executed by the processor 1101, causes the processor 1101 to perform the following steps:
acquiring text information of voice information to be recognized;
identifying whether each character in the text information is a character in a keyword or not through a pre-trained semantic annotation model;
determining each keyword according to each character identified as a character in the keywords;
respectively determining target entities corresponding to the keywords in the knowledge graph according to a pre-constructed knowledge graph, and if connection relations exist between the target entities in the knowledge graph, determining the semantics corresponding to the connection relations in the knowledge graph as the target semantics of the voice information to be recognized.
Because the principle by which the server solves the problem is similar to that of the method for processing voice information, the implementation of the server may refer to the implementation of the method, and repeated details are not described again.
The communication bus mentioned in the above server may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
Thecommunication interface 1102 is used for communication between the server and other devices described above.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Alternatively, the memory may be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, including a central processing unit, a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
As with the method and device embodiments, the server extracts the keywords directly from the text of the voice information to be recognized and determines the target semantics through the connection relations in the knowledge graph, so the target semantics are recognized accurately regardless of how the voice information is phrased, improving the accuracy of subsequent processing and the user experience.
On the basis of the foregoing embodiments, the present application further provides a computer-readable storage medium storing a computer program executable by a processor; when the program runs on the processor, the processor is caused to execute the following steps:
acquiring text information of voice information to be recognized;
identifying whether each character in the text information is a character in a keyword or not through a pre-trained semantic annotation model;
determining each keyword according to each character identified as a character in the keywords;
respectively determining target entities corresponding to the keywords in the knowledge graph according to a pre-constructed knowledge graph, and if connection relations exist between the target entities in the knowledge graph, determining the semantics corresponding to the connection relations in the knowledge graph as the target semantics of the voice information to be recognized.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
