Disclosure of Invention
In view of the above, the present invention aims to provide a method, an apparatus and an electronic device for processing mixed multilingual navigation voice commands, so as to overcome the drawbacks of handling such commands by constructing pronunciation dictionaries and performing model adjustment.
The technical scheme adopted by the invention is as follows:
in a first aspect, the present invention provides a method for processing a mixed multilingual navigation voice command, including:
Under a navigation scene, the country or region where the user is currently located is predetermined;
Cutting a voice instruction input by a user into a place name section and a non-place name section;
Invoking a voice processing strategy matched with the country or region where the user is currently located, and identifying a voice instruction corresponding to the place name section;
and carrying out navigation intention understanding by combining the recognition results of the place name section and the non-place name section.
In at least one possible implementation manner, the pre-determining the country or region in which the user is currently located includes acquiring electronic fence information corresponding to a border line or region boundary line from a navigation map.
In at least one possible implementation manner, the processing method further comprises updating the acquired electronic fence information by using the position information of the user, or dynamically adjusting the period of acquiring the electronic fence information according to the position information and moving speed information of the user and the distance between the user and the electronic fence.
In at least one possible implementation manner, the step of cutting the voice command input by the user into the place name section and the non-place name section includes:
Cutting is performed by semantic understanding of the voice recognition content, by classifying the voice recognition content, or in the audio dimension by utilizing differences within the voice instruction itself.
In at least one possible implementation manner, after the voice command corresponding to the place name section is identified, it is judged whether the identification result meets a preset confidence requirement, and if not, the same voice processing algorithm as that used for the non-place name section is adopted to identify the voice command corresponding to the place name section.
In at least one possible implementation manner, after the voice command corresponding to the place name section is identified, when it is judged that the identification result cannot correspond to a place name of the country or region where the user is currently located, the identification result is corrected and a plurality of corrected place names close to the identification result are provided for the user to confirm.
In at least one possible implementation manner, the processing method further comprises detecting, before the cutting into the place name section and the non-place name section, whether the voice command input by the user contains a place name, and if not, directly adopting the voice processing strategy corresponding to the subject language to recognize and understand the complete input voice command.
In a second aspect, the present invention provides a mixed multilingual navigation voice instruction processing apparatus, including:
the region determination module is used for determining the country or region where the user is currently located in advance in a navigation scene;
The instruction cutting module is used for cutting a voice instruction input by a user into a place name section and a non-place name section;
The place name recognition module is used for calling a voice processing strategy matched with the country or region where the user is currently located and recognizing a voice instruction corresponding to the place name section;
and the navigation intention understanding module is used for carrying out navigation intention understanding by combining the identification results of the place name section and the non-place name section.
In a third aspect, the present invention provides an electronic device, comprising:
one or more processors, a memory and one or more computer programs, wherein the memory may employ a non-volatile storage medium, the one or more computer programs are stored in the memory, and the one or more computer programs comprise instructions which, when executed by the device, cause the device to perform the method according to the first aspect or any of its possible implementations.
The main conception of the invention is as follows: the country or region where the user is currently located is predetermined in the navigation scene; after the user inputs a voice command, the command is cut into a place name section and a non-place name section; the voice processing strategy matched with the country or region where the user is currently located is called, and the voice command corresponding to the place name section is identified; finally, the navigation intention is understood by combining the identification results of the place name section and the non-place name section. Judging the current country or region replaces the conventional positioning idea, so that a voice processing strategy matched with the local language can be predetermined, and by cutting the input voice, the input mixed multilingual voice instruction can be recognized in a targeted manner, thereby providing a more reliable and accurate navigation intention understanding result. The method and the apparatus do not require the cost of constructing a dictionary, nor a large amount of parameter adjustment of existing models, and can more economically and efficiently handle mixed-language situations in a navigation scene.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
Aiming at the situation where non-subject-language content is mixed into navigation voice commands in a navigation scene, the invention provides at least one embodiment of a mixed multilingual navigation voice command processing method, as shown in fig. 1, which specifically comprises the following steps:
Step S1, under a navigation scene, the country or region where the user is currently located is predetermined;
In practical operation, this process can be implemented by various means, for example by using coordinate information of the user's position: the longitude and latitude of the current position can be obtained through navigation technologies such as satellite positioning. This approach is relatively conventional and presents a low implementation barrier, but it has certain disadvantages: a single coordinate fix may deviate, and in a real application scenario a moving user may not reliably obtain stable coordinate information of their position, particularly during vehicle driving, where the moving speed is relatively high. Combined with regions where different countries are densely distributed, such as Europe, the country or region where the user is currently located may then not be judged reliably and accurately.
Based on such considerations, the present invention proposes in some preferred embodiments that the electronic fence information corresponding to border lines or region boundary lines in the navigation map be used for the subsequent mixed-language processing. That is, in the concept of the preferred embodiment, the country or region where the user is currently located is determined by the concept of "range" rather than by a single point position, which adapts more reliably to the navigation scenario, especially for specific scenarios such as driving toward or across border lines/region boundaries.
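The range-based determination described above can be illustrated with a minimal sketch. This is not the invention's actual implementation: the ray-casting point-in-polygon test stands in for whatever geometric engine the navigation map module uses, and the fence coordinates are purely illustrative.

```python
def point_in_fence(lon, lat, fence):
    """Return True if (lon, lat) lies inside the electronic fence polygon.

    Classic ray-casting test: count how many polygon edges a horizontal
    ray from the point crosses; an odd count means the point is inside.
    """
    inside = False
    n = len(fence)
    for i in range(n):
        x1, y1 = fence[i]
        x2, y2 = fence[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Illustrative square fence standing in for a national border polygon.
fence = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
print(point_in_fence(5.0, 5.0, fence))   # user inside the fenced country
print(point_in_fence(15.0, 5.0, fence))  # user outside
```

In practice the fence would be the border polyline delivered by the high-precision map module, and the test would be run against the fences of the candidate countries/regions to select the matching voice processing strategy.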
Although it is not intended to limit the invention, it may be further explained that, on the basis of the concept of determining the country or region where the user is located by range, in other preferred embodiments of the invention the acquired electronic fence information can be updated together with the position information of the user (which can come from the longitude and latitude coordinates), and the period of acquiring the electronic fence information can be dynamically adjusted by further combining the moving speed information of the user and the distance between the user and the electronic fence. In this way, reliable and accurate switching of the country/region where the user is located can be realized in certain scenarios, such as cross-border driving; for example, when it is judged based on such information that the user is relatively close to a certain country, the electronic fence information can be acquired once every 15 seconds (adjustable as needed).
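One possible way to realize the dynamically adjusted acquisition period is sketched below. The specific formula and thresholds (quarter of the estimated time to the border, clamped between 15 s and 60 s) are assumptions chosen for illustration, not values prescribed by the invention.

```python
def fence_fetch_period(distance_km, speed_kmh,
                       base_period_s=60.0, min_period_s=15.0):
    """Return the fence-refresh period in seconds.

    The closer and faster the user approaches the border, the shorter
    the period, down to min_period_s; a stationary user keeps the
    relaxed base period.
    """
    if speed_kmh <= 0:
        return base_period_s
    # Rough ETA to the border at the current speed.
    eta_s = distance_km / speed_kmh * 3600.0
    # Refresh several times before the border is reached, but never
    # faster than min_period_s nor slower than base_period_s.
    return max(min_period_s, min(base_period_s, eta_s / 4.0))

print(fence_fetch_period(100.0, 100.0))  # far from border: relaxed period
print(fence_fetch_period(1.0, 120.0))    # about to cross: tight period
```

A production system would likely also debounce rapid oscillation near the border line; that refinement is omitted here.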
Step S2, cutting a voice instruction input by a user into a place name section and a non-place name section;
Those skilled in the art can understand at least two aspects. One is that the scenario targeted by the present invention is navigation; accordingly, the user voice received by the system is navigation-related and will, with relatively high probability, include a voice command such as a destination query for navigation. The other is that the default language of the navigation system is necessarily a language used by the user (such as the user's native language), which the present invention calls the subject language, indicating that it is the dominant base language for the navigation system's voice processing.
Based on the above two aspects, in actual operation, after navigation-related voice provided by the user is received, the input voice can be recognized and transcribed through the default subject language, and the input voice command can be split into a place name segment and a non-place name segment according to a predetermined semantic understanding strategy. Since this is a mature technology, the present invention gives only a schematic introduction: for example, after the audio of the input voice is preprocessed, transcription is performed through a voice recognition model of the subject language, word segmentation and sentence segmentation are completed, and then, in combination with an intention understanding model, the part of the original input instruction that is likely to be a place name and the part that is not are divided according to preset expression templates, semantic slots and the like. In addition, a method of classifying the voice recognition content can be considered: in implementation, a classifier trained on a set target can be adopted to classify the voice recognition content, so as to obtain a place name part and a non-place name part. In some preferred embodiments of the invention, in view of the two aspects mentioned above, the voice part that differs from the subject language can preferably be judged directly as the place name segment through differences in the audio dimension; this method can significantly reduce the overall processing time, and mature voice and acoustic processing schemes in the field are available for reference in its implementation. However, the cutting method is not limited to the above, and the specific technical tool is not the focus of the present invention.
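The template-based variant of the cut described above can be sketched as follows. The handful of English navigation templates here are hypothetical stand-ins for a real expression-template/semantic-slot library, and a production system would use a trained intent/slot model rather than regular expressions.

```python
import re

# Illustrative subject-language (English) templates: the first capture
# group is the non-place-name command part, the second the candidate
# place name segment.
TEMPLATES = [
    r"^(navigate to|take me to|drive to|go to)\s+(.+)$",
]

def cut_instruction(text):
    """Return (non_place_segment, place_segment); place_segment is None
    when no template matches the transcribed instruction."""
    lowered = text.strip().lower()
    for pattern in TEMPLATES:
        m = re.match(pattern, lowered)
        if m:
            return m.group(1), m.group(2)
    return lowered, None

print(cut_instruction("Navigate to Red Square"))
# → ('navigate to', 'red square')
```

The audio-dimension variant preferred by the invention would operate before transcription, flagging spans whose acoustics differ from the subject language; that is not modelled here.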
Step S3, invoking a voice processing strategy matched with the country or region where the user is currently located, and identifying a voice instruction corresponding to the place name section;
Step S4, combining the recognition results of the place name section and the non-place name section to understand the navigation intention.
For the above two steps, it may be pointed out that, as described above, the recognition of the non-place name segment can be obtained through the voice processing model of the navigation system's default subject language, and in different embodiments the timing of this recognition is not limited; the recognition of the place name segment is preferentially performed using the voice processing model for the current country/region determined in the previous step. The recognition results of the two are then considered comprehensively and can be input into a natural language processing algorithm mature in the art, so as to understand the navigation expectation input by the user and perform corresponding conventional navigation operations such as path planning, which are not repeated in the present invention.
It may be further noted that, in some preferred embodiments of the present invention, in consideration of special situations in real scenes, a reliable place name recognition result may not be obtained if the voice processing model of the local country/region is used directly, and the following preferred schemes are proposed herein to address this.
(1) After place name recognition processing is performed in step S3 using the voice processing strategy matched with the local country/region, it is judged whether the recognition result meets a preset confidence requirement (in actual operation, the comparison can be made according to a scoring mechanism). If yes, step S4 continues to be executed; if not, the same voice processing algorithm as that used for the non-place name segment is adopted to recognize the voice command corresponding to the place name segment, and then step S4 is executed. That is, this embodiment takes the standpoint of the system: if the voice processing model of the local country/region cannot obtain a recognition of the place name segment with a confidence meeting the requirement, the place name segment voice is still considered, with high probability, to be in the subject language, so the voice processing model used for the non-place name segment is invoked to recognize the place name segment.
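The confidence-based fallback of scheme (1) reduces to a few lines of control flow. The recognizer interfaces below (a callable returning a text/score pair) are assumptions for illustration; the threshold value is likewise arbitrary.

```python
def recognise_place_segment(audio, local_model, subject_model,
                            confidence_threshold=0.6):
    """Recognise the place-name audio segment.

    Try the local-country/region model first; if its score does not
    meet the preset confidence requirement, fall back to the subject-
    language model used for the non-place-name segment.
    """
    text, score = local_model(audio)
    if score >= confidence_threshold:
        return text
    # Low confidence: the place name was probably spoken in the
    # subject language after all.
    fallback_text, _ = subject_model(audio)
    return fallback_text

# Stub recognizers standing in for real ASR models.
local_low = lambda audio: ("???", 0.2)
subject = lambda audio: ("red square", 0.9)
print(recognise_place_segment(b"pcm-bytes", local_low, subject))
# → red square (fallback path taken)
```

Running both models in parallel and keeping the higher-scoring result is an obvious variation, at the cost of extra compute.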
(2) After place name recognition processing is performed in step S3 using the voice processing strategy matched with the local country/region, if it is judged that the recognition result cannot correspond to a place name of that country/region, the recognition result is preferably corrected, and the corrected local place name (closest to the recognition result) or several corrected local place names (Top N closest to the recognition result) are provided for the user to confirm, after which step S4 is executed. The corrected place names may be provided in the form of voice and/or text output, and the manner of correcting the recognition result in this embodiment may refer to existing voice and text correction methods. That is, this embodiment takes the standpoint of the system: if the recognition result of the place name segment obtained by the voice processing model of the local country/region cannot be accurately matched to a corresponding local place name, for example because the user's pronunciation of the local language is inaccurate, a strategy of assisting the user with correction is adopted. In addition, the idea of matching the recognition result against local place names is mentioned here: in other embodiments of the present invention, but without limitation, place name libraries of different countries/regions can be constructed in advance using the local languages, including cities, scenic spots, roads, addresses of buildings and the like, and when the recognition result of the place name segment is obtained through the voice processing model of the local country/region, a matching operation can be performed against the place name library.
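The Top-N matching against a pre-built place name library described in scheme (2) can be sketched with a generic string-similarity metric. `difflib.SequenceMatcher` is used here only as a stand-in; a real system would more likely use a phonetic or embedding-based distance, and the library entries are invented examples.

```python
import difflib

def correct_place_name(recognised, place_library, top_n=3):
    """Return the Top-N library entries closest to the recognition
    result, for the user to confirm."""
    return sorted(
        place_library,
        key=lambda name: difflib.SequenceMatcher(
            None, recognised.lower(), name.lower()).ratio(),
        reverse=True,
    )[:top_n]

# Illustrative local place name library (cities, scenic spots, roads...).
library = ["Red Square", "Gorky Park", "Bolshoi Theatre", "Tverskaya Street"]
candidates = correct_place_name("Red Sqare", library, top_n=2)
print(candidates[0])  # best correction offered first
```

The returned candidates would then be presented by voice and/or text, and the confirmed one passed on to step S4.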
Finally, it may be further added that, in the foregoing description of step S2, the voice command input by the user is related to an address in the navigation scene; however, considering actual applications, there are navigation voice commands that do not contain a place name. Therefore, in other preferred schemes, the invention proposes that, before cutting into the place name segment and the non-place name segment, it can be detected whether the voice command input by the user contains a place name. For example, whether the voice command contains place name content can be preliminarily judged through conventional and mature techniques such as language recognition, understanding and classification; if the preliminary judgment is that it does not, the voice processing strategy corresponding to the system's default subject language can be directly adopted to recognize the complete input voice command and understand its intention, skipping the segmentation step and the invocation of the country/region voice processing algorithm, so that the user's requirement can be responded to more quickly.
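This fast-path dispatch can be outlined as follows. The destination patterns are illustrative stand-ins for a trained classifier, and both pipeline callables are hypothetical interfaces introduced only for the sketch.

```python
import re

# Coarse illustrative patterns suggesting the instruction carries a
# destination; a production system would use a trained classifier.
NAV_PATTERNS = [
    r"\b(navigate|drive|take me|go)\s+to\s+\S+",
    r"\broute to\s+\S+",
]

def contains_place_name(text):
    """Preliminary judgment: does the instruction appear to contain a
    place name?"""
    lowered = text.lower()
    return any(re.search(p, lowered) for p in NAV_PATTERNS)

def handle_instruction(text, subject_pipeline, mixed_pipeline):
    if not contains_place_name(text):
        # Fast path: no place name, so skip segmentation and the
        # local-country model and answer via the subject language alone.
        return subject_pipeline(text)
    return mixed_pipeline(text)

print(contains_place_name("Navigate to Red Square"))  # destination present
print(contains_place_name("Zoom in on the map"))      # no destination
```

The fast path avoids loading the country/region model for commands such as map zooming or volume control, which is the quicker response the scheme aims for.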
Based on the above embodiments, and in conjunction with the examples mentioned above, a reference example is provided herein to assist in describing the working process of the solution of the present invention, in which a native English-speaking user drives locally in Russia. When the navigation application is used, the navigation system, whose default subject language is English, determines that the user is currently located in Russia according to the electronic fence characterizing the Russian border line provided by the high-precision map module (the core goal of this process is to prepare the corresponding language recognition model required subsequently; in implementation, the Russian recognition model can be pre-selected for subsequent invocation). When the user inputs the mixed multilingual voice interaction instruction "Navigate to Красную площадь" (English "Navigate to" followed by the Russian for "Red Square"), the navigation system recognizes the English part and the non-English part at the acoustic level, determines the English part as the non-place name segment according to the predetermined strategy, and determines the non-English part as the place name segment.
Then, based on the prior determination by the electronic fence that the user is currently located in Russia, the constructed Russian voice recognition model is invoked to perform voice recognition processing on the audio segment corresponding to the place name segment "Красную площадь"; of course, the non-place name segment can be recognized directly using the navigation system's default-language model, i.e., the English voice recognition model.
Finally, the recognized texts output by the two voice recognition models can be spliced according to the audio time sequence of the input voice and sent to a pre-built natural language understanding algorithm for text-based semantic analysis. It may be supplemented that the natural language understanding process can be language-independent; in actual operation, the intention understanding result can be obtained simply by collecting data sets of different languages according to real requirements and training existing intention understanding models. To simplify the training of the natural language understanding algorithm, the recognized text "Navigate to Красную площадь" can also be converted into the text "Navigate to Red Square" by a preset Russian-English translation service and then sent to an intention understanding model trained in advance on English samples only. Thus, through the whole flow, the navigation system can determine that the user expects to drive from the current place to Red Square in Moscow, and then give the corresponding planned route and other relevant navigation information.
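The final assembly step, splicing per-segment recognition results by audio time order with an optional translation pass before intent understanding, can be sketched as below. The `translate` callable is a hypothetical stand-in for the preset translation service; the segment tuples are invented for the example.

```python
def assemble_for_nlu(segments, translate=None):
    """Splice per-segment recognition results for the NLU model.

    segments: list of (start_time_s, text, language) tuples.
    translate: optional callable (text, language) -> subject-language
    text, applied to non-English segments so a subject-language-only
    intent model can be used.
    """
    ordered = sorted(segments, key=lambda s: s[0])  # audio time order
    parts = []
    for _, text, lang in ordered:
        if translate is not None and lang != "en":
            text = translate(text, lang)
        parts.append(text)
    return " ".join(parts)

# Segments as they might come out of the two recognizers.
segments = [(1.2, "Красную площадь", "ru"), (0.0, "Navigate to", "en")]
fake_translate = lambda text, lang: {"Красную площадь": "Red Square"}[text]
print(assemble_for_nlu(segments, translate=fake_translate))
# → Navigate to Red Square
```

Without the translation step, the same function yields the mixed-language text for a multilingual intent model instead.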
In summary, the main concept of the present invention is to pre-determine the country or region where the user is currently located in the navigation scene; after the user inputs a voice command, cut the command into a place name segment and a non-place name segment; invoke a voice processing policy matched with the country or region where the user is currently located and identify the voice command corresponding to the place name segment; and finally combine the identification results of the place name segment and the non-place name segment to understand the navigation intention. Judging the current country or region replaces the conventional positioning idea, so that a voice processing strategy matched with the local language can be predetermined, and by cutting the input voice, the input mixed multilingual voice instruction can be recognized in a targeted manner, thereby providing a more reliable and accurate navigation intention understanding result. The invention does not require the cost of constructing a pronunciation dictionary for voice recognition, nor a large amount of parameter adjustment of existing voice recognition models, and can more economically and efficiently handle mixed-language situations in a navigation scene.
Corresponding to the above embodiments and preferred solutions, the present invention further provides an embodiment of a mixed multilingual navigation voice command processing device, as shown in fig. 2, which may specifically include the following components:
The region determination module 1 is used for determining in advance the country or region where the user is currently located in a navigation scene;
the instruction cutting module 2 is used for cutting a voice instruction input by a user into a place name section and a non-place name section;
The place name recognition module 3 is used for calling a voice processing strategy matched with the country or region where the user is currently located and recognizing a voice instruction corresponding to the place name section;
And the navigation intention understanding module 4 is used for carrying out navigation intention understanding by combining the identification results of the place name section and the non-place name section.
It should be understood that the division of the components in the mixed multilingual navigation voice command processing apparatus shown in fig. 2 is merely a division of logical functions; in actual implementation they may be fully or partially integrated into one physical entity or physically separated. These components may all be realized in the form of software called through a processing element; they may all be realized in the form of hardware; or some components may be realized in the form of software called through a processing element while the rest are realized in the form of hardware. For example, some of the above modules may be individually established processing elements, or may be integrated in a chip of the electronic device. The implementation of the other components is similar. In addition, all or part of these components can be integrated together or implemented independently. In implementation, each step of the above method or each of the above components may be completed by an integrated logic circuit of hardware in a processor element or by instructions in the form of software.
For example, the above components may be one or more integrated circuits configured to implement the above methods, such as one or more Application Specific Integrated Circuits (ASIC), one or more digital signal processors (Digital Signal Processor; DSP), one or more Field Programmable Gate Arrays (FPGA), or the like. For another example, these components may be integrated together and implemented in the form of a System-on-a-Chip (SoC).
In view of the foregoing examples and preferred embodiments thereof, it will be appreciated by those skilled in the art that in actual operation, the technical concepts of the present invention may be applied to various embodiments, and the present invention is schematically illustrated by the following carriers:
(1) An electronic device. The apparatus may in particular comprise one or more processors, a memory and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions which, when executed by the apparatus, cause the apparatus to perform the steps/functions of the foregoing embodiments or equivalent implementations.
The electronic device may be an electronic device related to a computer, such as, but not limited to, various interactive terminals, electronic products, mobile terminals, and the like.
Fig. 3 is a schematic structural diagram of an embodiment of an electronic device according to the present invention. Specifically, the electronic device 900 includes a processor 910 and a memory 930, wherein the processor 910 and the memory 930 may communicate with each other via an internal connection and transfer control and/or data signals; the memory 930 is configured to store a computer program, and the processor 910 is configured to call and execute the computer program from the memory 930. The processor 910 and the memory 930 may be combined into a single processing device or, more commonly, be components independent of each other, with the processor 910 executing the program code stored in the memory 930 to realize the functions described above. In a specific implementation, the memory 930 may also be integrated within the processor 910 or separate from the processor 910.
In addition, to further improve the functionality of the electronic device 900, the device 900 may further comprise one or more of an input unit 960, a display unit 970, audio circuitry 980 (which in turn may comprise a speaker 982, a microphone 984, etc.), a camera 990, a sensor 901, and the like. The display unit 970 may include a display screen.
Further, the apparatus 900 may also include a power supply 950 for providing electrical power to various devices or circuits in the apparatus 900.
It should be appreciated that the operation and/or function of the various components in the apparatus 900 may be found in particular in the foregoing description of embodiments of the method, system, etc., and detailed descriptions thereof are omitted here as appropriate to avoid redundancy.
It should be appreciated that the processor 910 in the electronic device 900 shown in fig. 3 may be a system on a chip (SoC); the processor 910 may include a central processing unit (Central Processing Unit; hereinafter referred to as CPU) and may further include other types of processors, such as a graphics processor (Graphics Processing Unit; hereinafter referred to as GPU), and so on.
In general, portions of the processors or processing units within the processor 910 may cooperate to implement the preceding method flows, and corresponding software programs for the portions of the processors or processing units may be stored in the memory 930.
(2) A computer data storage medium having stored thereon a computer program or the above-mentioned means which, when executed, causes a computer to perform the steps/functions of the foregoing embodiments or equivalent implementations.
In several embodiments provided by the present invention, any of the functions, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer data storage medium. Based on such understanding, the technical solution of the present invention, in essence, or the part thereof contributing to the prior art, may be embodied in the form of a software product as described below.
It is especially pointed out that the storage medium may be located in a server or a similar computer device; specifically, it is the storage means of the server or similar computer device in which the aforementioned computer program or the aforementioned apparatus is stored.
(3) A computer program product (which may comprise the apparatus described above) which, when run on a terminal device, causes the terminal device to perform the hybrid multilingual navigation voice instruction processing method of the foregoing embodiment or equivalent.
From the above description of embodiments, it will be apparent to those skilled in the art that all or part of the steps of the above described methods may be implemented in software plus necessary general purpose hardware platforms. Based on such understanding, the above-described computer program product may include, but is not limited to, an APP.
The above device/terminal may be a computer device, and the hardware structure of the computer device may further specifically include at least one processor, at least one communication interface, at least one memory and at least one communication bus, wherein the processor, the communication interface and the memory can all communicate with each other through the communication bus. The processor may be a central processing unit (CPU), a DSP, a microcontroller or a digital signal processor, and may further include a GPU, an embedded neural network processor (Neural-network Processing Units; hereinafter referred to as NPU) and an image signal processor (Image Signal Processing; hereinafter referred to as ISP); the processor may further include an ASIC, or one or more integrated circuits configured to implement embodiments of the present invention, and may further have the function of running one or more software programs, which may be stored in a storage medium such as a memory. The aforementioned memory/storage medium may include a non-volatile memory (non-removable disk, USB flash drive, removable hard disk, optical disk, etc.), a read-only memory (Read-Only Memory; hereinafter referred to as ROM), a random access memory (Random Access Memory; hereinafter referred to as RAM), etc.
In the embodiments of the present invention, "at least one" means one or more, and "a plurality" means two or more. "and/or", describes an association relation of association objects, and indicates that there may be three kinds of relations, for example, a and/or B, and may indicate that a alone exists, a and B together, and B alone exists. Wherein A, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of the following" and the like means any combination of these items, including any combination of single or plural items. For example, at least one of a, b and c may represent a, b, c, a and b, a and c, b and c, or a and b and c, wherein a, b, c may be single or plural.
Those of skill in the art will appreciate that the various modules, units, and method steps described in the embodiments disclosed herein can be implemented in electronic hardware, computer software, and combinations of electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
And wherein the modules, units, etc. illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of places, e.g. nodes of a system network. In particular, some or all modules and units in the system can be selected according to actual needs to achieve the purpose of the embodiment scheme. Those skilled in the art will understand and practice the invention without undue burden.
The construction, features and effects of the present invention are described in detail with reference to the embodiments shown in the drawings, but the above-mentioned embodiments and the technical features related to the preferred embodiments are only preferred embodiments of the present invention, and it should be understood that those skilled in the art may reasonably combine and arrange the above-mentioned embodiments into various equivalent schemes without departing from or changing the design concept and technical effects of the present invention, so that the present invention is not limited by the scope of the embodiments shown in the drawings, and all changes made according to the concepts of the present invention or modifications to equivalent embodiments are within the scope of the present invention without departing from the spirit covered by the specification and drawings.