Disclosure of Invention
The embodiment of the invention provides a voice interaction method, a server and a storage medium.
The embodiment of the invention provides a voice interaction method. The voice interaction method comprises the following steps: receiving a voice request forwarded by a vehicle, and performing downstream logic processing on the voice request; sending an asynchronous request while performing the downstream logic processing on the voice request so as to perform first semantic rejection on the voice request according to the context characteristics to obtain a first semantic rejection result; after receiving a downstream logic processing result, sending a synchronous request to perform second semantic rejection on the voice request according to the downstream logic processing result and the first semantic rejection result to obtain a second semantic rejection result; and transmitting the second semantic rejection result to the vehicle to complete voice interaction.
Therefore, the voice interaction method carries out the first semantic rejection on the voice request according to the context characteristics and carries out the second semantic rejection on the voice request according to the first semantic rejection result and the downstream logic processing result to obtain the second semantic rejection result so as to complete the voice interaction, thereby greatly reducing the end-to-end time delay of the voice interaction process and meeting the real-time requirement of the voice interaction.
The step of receiving a voice request forwarded by a vehicle and performing downstream logic processing on the voice request includes: sending the voice request to a central control service, so that the central control service sends the voice request to a dialog system downstream service for parameter-entry processing, and the downstream logic processing is implemented by the dialog system downstream service.
Thus, the present invention enables downstream logical processing of voice requests by sending the voice requests to a dialog system downstream service.
The sending of the voice request to a central control service, so that the central control service sends the voice request to a dialog system downstream service for parameter-entry processing and the downstream logic processing is implemented by the dialog system downstream service, includes: sending the voice request to the central control service, so that the central control service passes the voice request as an input parameter to the dialog system downstream service for natural language understanding, dialogue management and/or business robot processing, whereby the dialog system downstream service implements the business logic processing corresponding to natural language understanding, dialogue management and the business robot.
Therefore, the invention can implement, for the voice request, the business logic processing corresponding to natural language understanding, dialogue management or the business robot through the corresponding parameter-entry mode.
After the step of receiving the voice request forwarded by the vehicle and performing downstream logic processing on the voice request, the voice interaction method comprises the following steps: and sending the voice recognition text characteristics of the voice request to a context service through a central control service so as to store the voice recognition text characteristics as the context characteristics into a data storage service.
Therefore, the speech interaction method can write the ASR characteristics into the context characteristics and store the context characteristics stored with the ASR characteristics into the data storage service, so that the context characteristics with the ASR characteristics can be directly obtained from the context service during subsequent asynchronous requests.
Before the step of sending an asynchronous request while performing the downstream logic processing on the voice request to perform the first semantic rejection on the voice request according to the context features to obtain a first semantic rejection result, the voice interaction method includes: sending the voice request to an acoustic rejection service for processing while performing the downstream logic processing on the voice request, so as to obtain acoustic features; rejecting the voice request according to the acoustic features to obtain an acoustic rejection result; and sending the acoustic features and the acoustic rejection result to the context service, so as to store the acoustic features and the acoustic rejection result as context features in the data storage service.
Therefore, the voice interaction method can write the acoustic features and the acoustic rejection results into the context features, and store the context features in which the acoustic features and the acoustic rejection results are stored in the data storage service, so that the context features with the acoustic features and the acoustic rejection results can be directly obtained from the context service during subsequent asynchronous requests.
The sending of an asynchronous request while performing the downstream logic processing on the voice request, so as to perform a first semantic rejection on the voice request according to the context features and obtain a first semantic rejection result, includes: sending the asynchronous request to a semantic rejection service for parameter entry through the central control service; and obtaining the speech recognition text features, the acoustic features and the acoustic rejection result through the semantic rejection service to perform the first semantic rejection on the voice request and obtain the first semantic rejection result.
Therefore, the invention sends the asynchronous request to the semantic rejection service for parameter entry through the central control service, and performs the first semantic rejection on the voice request through the semantic rejection service to obtain the first semantic rejection result, so that the time delay is hidden in the backbone link and the end-to-end time delay of the voice interaction of the scheme can be reduced.
After the step of sending an asynchronous request while performing the downstream logic processing on the voice request to perform the first semantic rejection on the voice request according to the context feature to obtain a first semantic rejection result, the voice interaction method includes: and storing the first semantic rejection result into the data storage service.
Therefore, the first semantic rejection result can be read from the data storage service in time when the second semantic rejection is subsequently performed.
The sending, after receiving the downstream logic processing result, of a synchronous request to perform a second semantic rejection on the voice request according to the downstream logic processing result and the first semantic rejection result to obtain a second semantic rejection result includes: after receiving the downstream logic processing result, sending the synchronous request to the semantic rejection service for parameter entry through the central control service based on the downstream logic processing result; obtaining the first semantic rejection result from the data storage service through the semantic rejection service; and performing the second semantic rejection according to the first semantic rejection result and the downstream logic processing result, and fusing them to obtain the second semantic rejection result.
Therefore, when the second synchronous request for the same voice request is made, the voice interaction method of the invention can further reduce the end-to-end time delay of the voice interaction process by directly reading the rejection model result and performing rule prediction, while ensuring the accuracy of the voice interaction.
The invention provides a server. The server comprises a processor and a memory, wherein the memory stores a computer program, and when the computer program is executed by the processor, the voice interaction method of any one of the above embodiments is realized.
Therefore, the server of the invention performs the first semantic rejection on the voice request according to the context features, and performs the second semantic rejection on the voice request according to the first semantic rejection result and the downstream logic processing result to obtain the second semantic rejection result, so as to complete the voice interaction, thereby greatly reducing the end-to-end time delay of the voice interaction process.
The present invention also provides a non-transitory computer-readable storage medium containing the computer program. The computer program, when executed by one or more processors, implements the voice interaction method of any of the above embodiments.
Therefore, the storage medium of the invention carries out the first semantic rejection on the voice request according to the context characteristics and carries out the second semantic rejection on the voice request according to the first semantic rejection result and the downstream logic processing result to obtain the second semantic rejection result so as to complete the voice interaction, thereby greatly reducing the end-to-end time delay of the voice interaction process.
Additional aspects and advantages of embodiments of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary only for the purpose of illustrating the embodiments of the present invention and are not to be construed as limiting the embodiments of the present invention.
The related scheme for various feature extraction and model inference can be, for example, a voice interaction scheme as shown in fig. 1. The voice interaction scheme in fig. 1 performs semantic rejection on a voice request through only one synchronous request, and the bold portion in fig. 1 is the backbone link delay of that semantic rejection scheme.
Specifically, in fig. 1, the request audio and the ASR features are first obtained through an Automatic Speech Recognition (ASR) service in the cloud; the ASR service then broadcasts the ASR feature result to request the central control service, and the central control service writes the ASR features into a dialogue management context service. The dialogue management context service can store the context multi-modal features, return the processing result of the context multi-modal features to the central control service, and store the ASR feature result in the data storage service. After a downstream service completes the downstream logic processing, it returns the downstream logic processing result to the central control service; at this point the central control service sends a synchronous request carrying the text and audio to the semantic rejection service, and the semantic rejection service reads the context ASR features from the dialogue management context service, which returns the context ASR feature result to the semantic rejection service. The semantic rejection service then passes the audio request to the acoustic rejection service, which returns the acoustic features and the acoustic rejection result according to the audio request. Finally, the acoustic features and the acoustic rejection result are stored in the dialogue management context service and in the data storage service (redis), and the multi-modal feature rejection result is returned to the semantic rejection service.
In fig. 1, the time delay of the speech processing process from the ASR recognition service to the other downstream services in the dialog system is greater than 200 ms. This is too time-consuming to meet the real-time requirement of voice interaction, and the user experience is poor.
In view of the above, referring to fig. 2, the present invention provides a voice interaction method. The voice interaction method comprises the following steps:
01: receiving a voice request forwarded by a vehicle, and performing downstream logic processing on the voice request;
03: sending an asynchronous request while performing the downstream logic processing on the voice request, so as to perform a first semantic rejection on the voice request according to context features and obtain a first semantic rejection result;
05: after receiving the downstream logic processing result, sending a synchronous request to perform second semantic rejection on the voice request according to the downstream logic processing result and the first semantic rejection result to obtain a second semantic rejection result;
07: and transmitting the second semantic rejection result to the vehicle to complete voice interaction.
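Steps 01 to 07 above can be sketched as a concurrent pipeline in which the first semantic rejection runs alongside the downstream logic processing. The following Python sketch is illustrative only: the functions `downstream_logic`, `first_semantic_rejection`, and `second_semantic_rejection`, and all timings and rules inside them, are hypothetical stand-ins for the services described in the method, not the invention's actual implementation.

```python
import asyncio

# Hypothetical stand-ins for the services in steps 01-07; names, timings,
# and rules are illustrative assumptions, not the actual implementation.
async def downstream_logic(request: str) -> dict:
    await asyncio.sleep(0.2)   # simulate downstream logic processing
    return {"intent": "ac_wind_set", "query": request}

async def first_semantic_rejection(request: str, context: dict) -> int:
    await asyncio.sleep(0.02)  # model inference, hidden behind downstream work
    return 0 if context.get("acousticRejection") == "0" else 2

def second_semantic_rejection(first_result: int, downstream: dict) -> int:
    # Rule prediction fusing the cached first result with the downstream output.
    if first_result == 1:
        return 1               # the model already rejected the request
    return 0 if downstream.get("intent") else 2

async def voice_interaction(request: str, context: dict) -> int:
    # Step 03's asynchronous request runs concurrently with step 01's
    # downstream processing, so its latency is hidden in the backbone link.
    downstream_task = asyncio.create_task(downstream_logic(request))
    first_result = await first_semantic_rejection(request, context)
    downstream = await downstream_task
    return second_semantic_rejection(first_result, downstream)  # step 05

result = asyncio.run(voice_interaction("third air volume", {"acousticRejection": "0"}))
print(result)  # 0 -> pass: issue the result to the vehicle (step 07)
```

Because the asynchronous first rejection overlaps the downstream processing, its latency is hidden in the backbone link, which is the core of the delay reduction claimed by the method.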
The invention also provides a server. The server comprises a processor and a memory, wherein the memory stores a computer program, and the processor is used for: receiving the voice request forwarded by the vehicle, and performing downstream logic processing on the voice request; sending an asynchronous request while performing the downstream logic processing on the voice request, so as to perform a first semantic rejection on the voice request according to context features and obtain a first semantic rejection result; after receiving the downstream logic processing result, sending a synchronous request to perform a second semantic rejection on the voice request according to the downstream logic processing result and the first semantic rejection result to obtain a second semantic rejection result; and transmitting the second semantic rejection result to the vehicle to complete the voice interaction.
Referring to fig. 3, fig. 3 shows a flow chart of the semantic rejection scheme for voice requests according to the present invention.
The voice interaction method first receives the voice request forwarded by the vehicle and performs downstream logic processing on the voice request. Referring to fig. 3, in steps 1 to 7 of fig. 3, the present invention first obtains the request audio and the ASR features of the vehicle-forwarded voice request through an ASR recognition service, and then performs a series of downstream logic processing on the voice request. The downstream logic processing means that the ASR recognition service requests the central control service, and the central control service sends the voice request to the other downstream services in the dialog system for downstream logic processing. The bold portion in fig. 3 is the backbone link delay of the semantic rejection scheme of the present invention.
And sending an asynchronous request while performing downstream logic processing on the voice request so as to perform first semantic rejection on the voice request according to the context characteristics to obtain a first semantic rejection result. Specifically, as shown in step 8-11 in fig. 3, while performing downstream logic processing on the voice request, the central control service sends an asynchronous request to the semantic rejection service, and reads a contextual acoustic feature, an ASR feature, and an acoustic rejection result from the dialog management context service, so as to perform a first semantic rejection on the voice request according to the contextual feature, and obtain a first semantic rejection result. Wherein, rejecting refers to judging whether the voice request is an out-of-set word or an invalid input. The voice interaction method takes the context information into consideration, so that the result of the feature extraction of the voice request can be more accurate.
And after receiving the downstream logic processing result, sending a synchronous request to perform second semantic rejection on the voice request according to the downstream logic processing result and the first semantic rejection result to obtain a second semantic rejection result. Specifically, as shown in step 14-15 in fig. 3, after receiving the downstream logic processing result of other downstream services in the dialog system, the central control service may send a synchronization request to the semantic rejection service, and the semantic rejection service performs a second semantic rejection on the voice request according to the downstream logic processing result and the first semantic rejection result to obtain a second semantic rejection result.
Finally, the semantic rejection service reads the multi-modal rejection model result (i.e., the first semantic rejection result) from the data storage service, directly performs rule prediction on the read model result to obtain the second semantic rejection result, and returns the second semantic rejection result to the central control service, which issues the second semantic rejection result to the vehicle to complete the voice interaction.
That is, referring to fig. 3, the present invention can hide the time delay in the backbone link through the first asynchronous request (step 8), and directly read the model result and perform rule prediction at the second synchronous request (step 15). Comparing steps 7 to 14 of fig. 1 in the related scheme with steps 1 to 18 of fig. 3 in the scheme of the present invention, the end-to-end time delay can be reduced from 200 ms to 6-10 ms.
Therefore, the voice interaction method performs the first semantic rejection on the voice request according to the context features, and performs the second semantic rejection on the voice request according to the first semantic rejection result and the downstream logic processing result to obtain the second semantic rejection result, so as to complete the voice interaction, thereby greatly reducing the end-to-end time delay of the voice interaction process and meeting the real-time requirement of voice interaction.
Step 01 comprises:
011: sending the voice request to a central control service, so that the central control service sends the voice request to the dialog system downstream service for parameter-entry processing, and the downstream logic processing is implemented by the dialog system downstream service.
The processor is used for sending the voice request to the central control service so that the central control service sends the voice request to the downstream service of the dialogue system for participating in processing, and the downstream logic processing is realized by the downstream service of the dialogue system.
Specifically, as shown in steps 1.2 and 2.1 in fig. 3, the ASR recognition service sends the voice request to the central control service, so that the central control service sends the voice request to the dialog system downstream service for parameter-entry processing, and the downstream logic processing is implemented by the dialog system downstream service.
Thus, the present invention enables downstream logical processing of voice requests by sending the voice requests to a dialog system downstream service.
More specifically, step 011 includes:
0111: sending the voice request to the central control service, so that the central control service passes the voice request as an input parameter to the dialog system downstream service for natural language understanding, dialogue management and/or business robot processing, whereby the dialog system downstream service implements the business logic processing corresponding to natural language understanding, dialogue management and the business robot.
The processor is used for sending the voice request to the central control service, so that the central control service passes the voice request as an input parameter to the dialog system downstream service for natural language understanding, dialogue management and/or business robot processing, whereby the dialog system downstream service implements the business logic processing corresponding to natural language understanding, dialogue management and the business robot.
Specifically, the parameter-entry processing of the voice request by the dialog system downstream service comprises one or more of three modes: natural language understanding parameter entry, dialogue management parameter entry, and business robot parameter entry.
When the parameter-entry mode includes natural language understanding parameter entry, the business logic processing corresponding to natural language understanding of the voice request can be implemented; when the parameter-entry mode includes dialogue management parameter entry, the business logic processing corresponding to dialogue management of the voice request can be implemented; and when the parameter-entry mode includes business robot parameter entry, the business logic processing corresponding to the business robot for the voice request can be implemented.
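As a sketch of the one-or-more parameter-entry modes described above, the following Python fragment dispatches a voice request to hypothetical handlers for natural language understanding, dialogue management, and the business robot; all handler names and return values are placeholders, not the invention's actual services.

```python
# Hypothetical dispatch of a voice request by parameter-entry mode; the
# handler names and return values are placeholders, not actual services.
HANDLERS = {
    "nlu": lambda q: f"NLU result for: {q}",
    "dialogue_management": lambda q: f"DM result for: {q}",
    "business_robot": lambda q: f"robot result for: {q}",
}

def dispatch(query: str, modes: list) -> dict:
    # One or more of the three parameter-entry modes may be selected.
    return {mode: HANDLERS[mode](query) for mode in modes}

print(dispatch("third air volume", ["nlu", "dialogue_management"]))
```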
Therefore, the invention can implement, for the voice request, the business logic processing corresponding to natural language understanding, dialogue management or the business robot through the corresponding parameter-entry mode.
After step 01, the voice interaction method comprises the following steps:
02: and sending the voice recognition text characteristics of the voice request to a context service through a central control service so as to store the voice recognition text characteristics as the context characteristics into a data storage service.
The processor is used for sending the voice recognition text characteristics of the voice request to the context service through the central control service so as to store the voice recognition text characteristics as the context characteristics into the data storage service.
Specifically, as shown in steps 1.2, 2.1, 2.2, 3.1, and 3.2 in fig. 3, after receiving an ASR feature of an ASR Recognition service, the central control service may send the ASR feature to the context service, and store the ASR feature as the context feature in the data storage service.
In the feature processing stage (steps 1 to 7), taking as an example that the user's previous voice request is "turn on the air conditioner" and the current voice request is "third air volume", the step of storing the ASR features of the current voice request "third air volume" in the data storage service may be expressed in table form as shown in table 1 below.
TABLE 1
| Associated steps | Input | Output | Time consumed |
| --- | --- | --- | --- |
| 1.2, 2.1, 2.2, 3.1, 3.2 | User voice request "third air volume" | ASR features of "third air volume" stored in the data storage service: `{"asrAlignment": [{"conf": 1.0, "end": 820, "pinyin": "feng", "start": 640, "word": "wind"}, {"conf": 1.0, "end": 940, "pinyin": "liang", "start": 820, "word": "amount"}, {"conf": 1.0, "end": 1120, "pinyin": "san", "start": 940, "word": "three"}, {"conf": 1.0, "end": 1340, "pinyin": "dang", "start": 1120, "word": "gear"}], "eof": true}` | Asynchronous, 50 ms |
Therefore, the speech interaction method can write the ASR characteristics into the context characteristics and store the context characteristics stored with the ASR characteristics into the data storage service, so that the context characteristics with the ASR characteristics can be directly obtained from the context service during subsequent asynchronous requests.
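A minimal sketch of writing ASR features into the context features follows, assuming an in-memory dict stands in for the context and data storage services; the `asrAlignment` field names follow Table 1, while `store_asr_feature` and the record-id keying are invented for illustration.

```python
# Minimal sketch: an in-memory dict stands in for the context / data storage
# services; store_asr_feature and the record-id keying are illustrative only.
data_storage = {}

def store_asr_feature(record_id: str, asr_alignment: list) -> None:
    # Append this round's ASR features to the stored context features.
    context = data_storage.setdefault(record_id, {"data": []})
    context["data"].append({"asrAlignment": asr_alignment, "eof": True})

# Field names follow the asrAlignment structure shown in Table 1.
store_asr_feature("rcid001", [
    {"conf": 1.0, "start": 640, "end": 820, "pinyin": "feng", "word": "wind"},
    {"conf": 1.0, "start": 820, "end": 940, "pinyin": "liang", "word": "amount"},
])
print(len(data_storage["rcid001"]["data"]))  # 1
```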
Referring to fig. 4, before step 03, the voice interaction method includes:
021: sending the voice request to an acoustic rejection service for processing while performing the downstream logic processing on the voice request, so as to obtain acoustic features;
022: rejecting the voice request according to the acoustic characteristics to obtain an acoustic rejection result;
023: and sending the acoustic features and the acoustic rejection results to a context service so as to store the acoustic features and the acoustic rejection results as context features into a data storage service.
The processor is used for sending the voice request to the acoustic rejection service for processing while performing downstream logic processing on the voice request so as to obtain acoustic features; rejecting the voice request according to the acoustic characteristics to obtain an acoustic rejection result; and sending the acoustic characteristics and the acoustic rejection results to a context service so as to store the acoustic characteristics and the acoustic rejection results as context characteristics in a data storage service.
Specifically, as shown in steps 1.1, 4, 5, 6, and 7 in fig. 3, while performing downstream logic processing on the voice request, the ASR recognition service sends the voice request to the acoustic rejection service, and the acoustic rejection service performs processing to obtain acoustic features, and then determines whether the voice request is rejected according to the acoustic features, that is, rejects the voice request according to the acoustic features to obtain an acoustic rejection result. Wherein the acoustic features refer to pure audio features.
The acoustic rejection result takes one of three values — pass, reject, or uncertain — which can be represented by 0, 1, and 2: 0 represents pass, meaning the voice request is not an invalid input; 1 represents reject, meaning the voice request is an invalid input; and 2 represents that it is uncertain whether the voice request is an invalid input.
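The three-valued acoustic rejection result can be captured with a small enum; this is merely one way to encode the 0/1/2 convention above, not part of the patented scheme.

```python
from enum import IntEnum

# One possible encoding of the 0/1/2 convention described above;
# not part of the patented scheme itself.
class AcousticRejection(IntEnum):
    PASS = 0       # the voice request is not an invalid input
    REJECT = 1     # the voice request is an invalid input
    UNCERTAIN = 2  # cannot tell from acoustic features alone

def describe(code: int) -> str:
    return AcousticRejection(code).name

print(describe(0))  # PASS
```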
In the feature processing stage (steps 1 to 7), again taking the previous voice request "turn on the air conditioner" and the current voice request "third air volume" as an example, the step of storing the acoustic features and the acoustic rejection result of the voice request in the data storage service may be expressed in the table form shown in table 2 below.
TABLE 2
| Associated steps | Input | Output | Time consumed |
| --- | --- | --- | --- |
| 1.1, 4, 5, 6, 7 | User audio "third air volume" | Acoustic rejection result: `acousticRejection = 0` written to the context service (0: pass; 1: reject; 2: uncertain). Acoustic MFCC features: an N×M feature matrix written to the context service, which together with the ASR features constitutes the context features: `{"data": [{"MFCCFeat": [], "acousticRejection": "0", "asrAlignment": [{"conf": 1.0, "end": 1120, "pinyin": "dakai", "start": 660, "word": "open"}, {"conf": 0.795, "end": 1820, "pinyin": "kongtiao", "start": 1120, "word": "air conditioner"}], "eof": true}, {"MFCCFeat": [], "acousticRejection": "0", "asrAlignment": [{"conf": 1.0, "end": 820, "pinyin": "feng", "start": 640, "word": "wind"}, {"conf": 1.0, "end": 940, "pinyin": "liang", "start": 820, "word": "amount"}, {"conf": 1.0, "end": 1120, "pinyin": "san", "start": 940, "word": "three"}, {"conf": 1.0, "end": 1340, "pinyin": "dang", "start": 1120, "word": "gear"}], "eof": true}]}` | Asynchronous, 100 ms |
The acoustic rejection service may then store the acoustic features and the acoustic rejection results as context features in the data storage service to facilitate direct retrieval of the context features with the acoustic features and the acoustic rejection results from the context service upon subsequent asynchronous requests.
Therefore, the voice interaction method can write the acoustic features and the acoustic rejection results into the context features, and store the context features in which the acoustic features and the acoustic rejection results are stored in the data storage service, so that the context features with the acoustic features and the acoustic rejection results can be directly obtained from the context service during subsequent asynchronous requests.
The steps of performing downstream logic processing on the voice request to obtain the downstream logic processing result in fig. 3 may be expressed in a table form as shown in table 3 below.
TABLE 3
| Associated steps | Input | Output | Time consumed |
| --- | --- | --- | --- |
| 2.1, 8 | Asynchronous first request body: `{"data": {"async": 1, "hardwareid": "xxxxxx", "msgId": "xxxxxx", "query": "third air volume", "recordid": "rcidxxxxxx", "sceneIds": xxxxxx}, "status": "xxxxxx"}` | Second synchronous request body formed after the downstream logic processing result is obtained: `{"data": {"async": 1, "hardwareid": "xxxxxx", "msgId": "xxxxxx", "query": "third air volume", "recordid": "rcidxxxxxx", "sceneIds": xxxxxx}, "status": "xxxxxx", "domains": [{"domainConfidence": 0.91380006, "domainName": "ac", "intents": [{"intentConfidence": 1.0, "intentName": "ac wind set", "slotConfidence": 0.0, "slots": [{"name": "number", "pos": [2.2], "rawvalue": "three", "valuetype": "string"}]}]}]}` | Synchronous, 200-500 ms |
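A hedged sketch of how the second synchronous request body might be assembled from the downstream logic processing result: the field names mirror Table 3, while `build_sync_request` itself and the concrete values are hypothetical.

```python
import json

# Hedged sketch: assemble the second synchronous request body from the
# downstream logic processing result. Field names mirror Table 3;
# build_sync_request itself is a hypothetical helper, not a patented API.
def build_sync_request(base: dict, domains: list) -> dict:
    body = {"data": dict(base), "status": "ok", "domains": domains}
    return body

base = {"async": 1, "query": "third air volume", "recordid": "rcid001"}
domains = [{"domainName": "ac", "domainConfidence": 0.91,
            "intents": [{"intentName": "ac wind set", "intentConfidence": 1.0}]}]
request_body = build_sync_request(base, domains)
print(json.dumps(request_body["domains"][0]["domainName"]))
```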
Further, referring to fig. 5, step 03 includes:
031: sending the asynchronous request to a semantic rejection service for parameter entry through the central control service;
032: and acquiring voice recognition text characteristics, acoustic characteristics and acoustic rejection results through the semantic rejection service to perform first semantic rejection on the voice request to obtain a first semantic rejection result.
The processor is used for sending the asynchronous request to the semantic rejection service for parameter entry through the central control service; and obtaining the speech recognition text features, the acoustic features and the acoustic rejection result through the semantic rejection service to perform the first semantic rejection on the voice request and obtain the first semantic rejection result.
Specifically, as shown in steps 8 and 9 of fig. 3, the central control service sends the asynchronous request to the semantic rejection service for parameter entry. Then, as shown in steps 10 and 11 of fig. 3, the semantic rejection service obtains the speech recognition text features, the acoustic features and the acoustic rejection result from the dialogue management context service, and performs the first semantic rejection on the voice request to obtain the first semantic rejection result.
In the first asynchronous request stage of the central control service to the semantic rejection service (steps 8 to 13), taking the previous voice request "turn on the air conditioner" and the current voice request "third air volume" as an example, the central control service sends the asynchronous request to the semantic rejection service for parameter entry, and the semantic rejection service obtains the speech recognition text features, the acoustic features and the acoustic rejection result from the dialogue management context service to perform the first semantic rejection on the voice request. The step of obtaining the first semantic rejection result may be expressed in the table form shown in table 4 below.
TABLE 4
| Associated steps | Input | Output | Time consumed |
| 8.1 9 10 11 | Asynchronous request body: {"data":{"async":1,"hardwareid":"xxxxxx","msgId":"xxxxxx","query":"air volume third gear","recordid":"rcidxxxxxx","sceneIds":xxxxxx},"status":"xxxxxx"} | Context features: {"data":[{"MFCCFeat":[],"acousticRejection":"0","asrAlignment":[{"conf":1.0,"end":1120,"pinyin":"da kai","start":660,"word":"turn on"},{"conf":0.795,"end":1820,"pinyin":"kong tiao","start":1120,"word":"air conditioner"}],"eof":true},{"MFCCFeat":[],"acousticRejection":"0","asrAlignment":[{"conf":1.0,"end":820,"pinyin":"feng","start":640,"word":"wind"},{"conf":1.0,"end":940,"pinyin":"liang","start":820,"word":"amount"},{"conf":1.0,"end":1120,"pinyin":"san","start":940,"word":"three"},{"conf":1.0,"end":1340,"pinyin":"dang","start":1120,"word":"gear"}],"eof":true}]} | Asynchronous, 20ms |
The voice request may be subjected to the first semantic rejection through a pre-trained acoustic rejection model, and accordingly, the first semantic rejection result may also be referred to as a multi-modal rejection model result.
Therefore, the invention sends the asynchronous request with input parameters to the semantic rejection service through the central control service, and performs the first semantic rejection on the voice request through the semantic rejection service to obtain the first semantic rejection result. The time delay is thereby hidden in the backbone link, which reduces the end-to-end time delay of the voice interaction in this scheme.
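The first, asynchronous rejection stage described above can be sketched as follows. This is a minimal illustration only: the function names, the stand-in context features and the trivial acceptance logic are assumptions for clarity, not the patent's implementation, which uses a pre-trained multi-modal rejection model.

```python
import threading

def fetch_context_features(record_id):
    # Stand-in for the dialog management context service (steps 10-11):
    # returns voice recognition text features, acoustic features (MFCC)
    # and the acoustic rejection result for this voice request.
    return {
        "asrAlignment": [{"conf": 1.0, "word": "wind"}, {"conf": 1.0, "word": "amount"}],
        "MFCCFeat": [],
        "acousticRejection": "0",
    }

def first_semantic_rejection(record_id, query):
    """Produce a multi-modal rejection model result (first semantic rejection)."""
    ctx = fetch_context_features(record_id)
    # A real multi-modal model would score text + acoustic features; this
    # stand-in treats confident, acoustically accepted audio as "clear".
    confident = all(w["conf"] > 0.5 for w in ctx["asrAlignment"])
    accepted = ctx["acousticRejection"] == "0"
    clear = 1.0 if (confident and accepted) else 0.0
    return {"query": query,
            "modelConfidence": {"noise": 1.0 - clear, "clear": clear},
            "code": 10000, "msg": "ok"}

def send_async_rejection_request(record_id, query, results):
    # The central control service does not wait for the answer: the request
    # runs on a background thread, so its delay is hidden in the backbone link.
    t = threading.Thread(
        target=lambda: results.update(first_semantic_rejection(record_id, query)))
    t.start()
    return t
```

For the running example, `send_async_rejection_request("rcidxxxxxx", "air volume third gear", results)` returns immediately while the rejection result is computed in the background.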
After step 03, the voice interaction method includes:
04: storing the first semantic rejection result in a data storage service.
The processor is configured to store the first semantic rejection result in the data storage service.
Specifically, as shown in steps 12 and 13 of fig. 3, after the semantic rejection service obtains the first semantic rejection result, the invention may store the first semantic rejection result in the data storage service.
In the stage of storing the first semantic rejection result in the data storage service (steps 12 and 13), taking the user's previous round of voice request "turn on the air conditioner" and current round of voice request "air volume third gear" as an example, the step of storing the first semantic rejection result in the data storage service can be expressed in the tabular form shown in table 5 below.
TABLE 5
| Associated steps | Input | Output | Time consumed |
| 12 13 | Context features: {"data":[{"MFCCFeat":[],"acousticRejection":"0","asrAlignment":[{"conf":1.0,"end":1120,"pinyin":"da kai","start":660,"word":"turn on"},{"conf":0.795,"end":1820,"pinyin":"kong tiao","start":1120,"word":"air conditioner"}],"eof":true},{"MFCCFeat":[],"acousticRejection":"0","asrAlignment":[{"conf":1.0,"end":820,"pinyin":"feng","start":640,"word":"wind"},{"conf":1.0,"end":940,"pinyin":"liang","start":820,"word":"amount"},{"conf":1.0,"end":1120,"pinyin":"san","start":940,"word":"three"},{"conf":1.0,"end":1340,"pinyin":"dang","start":1120,"word":"gear"}],"eof":true}]} and asynchronous request body: {"data":{"async":1,"hardwareid":"xxxxxx","msgId":"xxxxxx","query":"air volume third gear","recordid":"rcidxxxxxx","sceneIds":xxxxxx},"status":"xxxxxx"} | Model result: {"query":"air volume third gear","modelConfidence":{"noise":0.0,"clear":1.0,"taskLabel":"xx","detail":{"taskLabel":0.9,"taskLabel2":0.8}},"code":10000,"msg":"ok"} | Asynchronous, 30ms |
Therefore, the first semantic rejection result can be read from the data storage service in time when the second semantic rejection is performed subsequently.
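Steps 12 and 13 amount to writing the model result under the request's record id so that the later synchronous request can read it back. A minimal sketch, with an in-memory dict standing in for the data storage service (REDIS in fig. 3) and the key name following table 7:

```python
class DataStorageService:
    """Toy stand-in for the data storage service (e.g. REDIS)."""

    def __init__(self):
        self._store = {}

    def put(self, record_id, model_result):
        # Step 12-13: persist the first semantic rejection result,
        # keyed by the voice request's record id.
        self._store[record_id] = model_result

    def get(self, record_id):
        # Returns None if the asynchronous stage has not finished yet.
        return self._store.get(record_id)

storage = DataStorageService()
storage.put("rcidxxxxxx", {"query": "air volume third gear",
                           "modelConfidence": {"noise": 0.0, "clear": 1.0},
                           "code": 10000, "msg": "ok"})
```

In a real deployment this would be a Redis SET/GET pair on the record id; the dict keeps the sketch self-contained.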
Referring to fig. 6, step 05 includes:
051: after receiving the downstream logic processing result, sending a synchronous request with input parameters, based on the downstream logic processing result, to the semantic rejection service through the central control service;
052: acquiring the first semantic rejection result from the data storage service through the semantic rejection service;
053: performing the second semantic rejection according to the first semantic rejection result and the downstream logic processing result, and fusing them to obtain the second semantic rejection result.
The processor is configured to, after receiving the downstream logic processing result, send a synchronous request with input parameters, based on the downstream logic processing result, to the semantic rejection service through the central control service; to acquire the first semantic rejection result from the data storage service through the semantic rejection service; and to perform the second semantic rejection according to the first semantic rejection result and the downstream logic processing result, fusing them to obtain the second semantic rejection result.
Referring to fig. 3, as shown in steps 14 to 18, after receiving the downstream logic processing result, the central control service may send a synchronous request with input parameters to the semantic rejection service based on the downstream logic processing result. The semantic rejection service obtains the first semantic rejection result stored in the data storage service, performs the second semantic rejection according to the first semantic rejection result and the downstream logic processing result, and fuses them to obtain the second semantic rejection result.
That is, the central control service receives the downstream logic processing results returned by the other downstream services of the dialog system (NLU/DM/BOT) and sends a synchronous request carrying the business logic to the semantic rejection service; the semantic rejection service reads the multi-modal rejection model result from the data storage service (REDIS), performs rule reasoning in combination with the business logic, and then fuses the results to obtain the second semantic rejection result, which is returned to the central control service.
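The fusion of the stored model result with rule reasoning over the downstream (NLU) result can be sketched as below. The concrete rule shown, accepting a numeric slot such as "number set" from table 8, and the fusion criterion are illustrative assumptions; the patent only specifies that rule reasoning is combined with the model result.

```python
def rule_reasoning(downstream_result):
    """Derive a rule confidence and filter reason from the NLU/DM/BOT output."""
    for domain in downstream_result.get("domains", []):
        for intent in domain.get("intents", []):
            # Illustrative rule: a filled numeric slot (e.g. "air volume
            # third gear") is treated as a valid in-domain command.
            if any(slot.get("name") == "number" for slot in intent.get("slots", [])):
                return 1.0, "number set"
    return 0.0, "no rule matched"

def second_semantic_rejection(model_result, downstream_result):
    rule_conf, reason = rule_reasoning(downstream_result)
    clear = model_result["modelConfidence"]["clear"]
    # Fusion (assumed): the request is accepted ("clear") only when both
    # the multi-modal model and the rule agree; otherwise it is rejected.
    state = "clear" if (clear >= 0.5 and rule_conf >= 0.5) else "noise"
    return {"query": model_result["query"], "queryState": state,
            "modelConfidence": model_result["modelConfidence"],
            "ruleConfidence": rule_conf, "filterReason": reason,
            "code": 10000, "msg": "ok"}
```

For the running example, the model result read from storage ("clear": 1.0) and the "ac" domain result with a "number" slot fuse into a "clear" final output, matching the shape of table 8.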
In the second, synchronous request stage of the central control service to the semantic rejection service (steps 14 to 18), again taking the user's previous round of voice request "turn on the air conditioner" and current round of voice request "air volume third gear" as an example: the step in which the central control service receives the downstream logic processing result returned by the other downstream services of the dialog system (NLU/DM/BOT) and sends the second, synchronous request with the business logic to the semantic rejection service can be expressed in the tabular form shown in table 6; the step in which the semantic rejection service reads the multi-modal rejection model result from the data storage service (REDIS) can be expressed in the tabular form shown in table 7; and the steps of performing rule reasoning in combination with the business logic, fusing to obtain the second semantic rejection result and returning it to the central control service can be expressed in the tabular form shown in table 8.
TABLE 6
| Associated steps | Input | Output |
| 15 | Second (synchronous) request body: {"data":{"async":1,"hardwareid":"xxxxxx","msgId":"xxxxxx","query":"air volume third gear","recordid":"rcidxxxxxx","sceneIds":xxxxxx},"status":"xxxxxx","domains":[{"domainConfidence":0.91380006,"domainName":"ac","intents":[{"intentConfidence":1.0,"intentName":"ac wind set","slotConfidence":0.0,"slots":[{"name":"number","pos":[2,2],"rawvalue":"three","valuetype":"string"}]}]}]} | None |
TABLE 7
| Associated steps | Input | Output | Time consumed |
| 16 17 | Key value: {"recordId":"rcidxxxxxx"} | Model result in the data storage service: {"query":"air volume third gear","modelConfidence":{"noise":0.0,"clear":1.0,"taskLabel":"xx","detail":{"taskLabel":0.9,"taskLabel2":0.8}},"code":10000,"msg":"ok"} | Synchronous, 1-5ms |
TABLE 8
| Associated steps | Input | Output | Time consumed |
| 18 | Second (synchronous) request body: {"data":{"async":1,"hardwareid":"xxxxxx","msgId":"xxxxxx","query":"air volume third gear","recordid":"rcidxxxxxx","sceneIds":xxxxxx},"status":"xxxxxx","domains":[{"domainConfidence":0.91380006,"domainName":"ac","intents":[{"intentConfidence":1.0,"intentName":"ac wind set","slotConfidence":0.0,"slots":[{"name":"number","pos":[2,2],"rawvalue":"three","valuetype":"string"}]}]}]} and the model result in the data storage service: {"query":"air volume third gear","modelConfidence":{"noise":0.0,"clear":1.0,"taskLabel":"xx","detail":{"taskLabel":0.9,"taskLabel2":0.8}},"code":10000,"msg":"ok"} | Final output result: {"query":"air volume third gear","queryRewrite":"air volume third gear","queryState":"clear","modelConfidence":{"noise":0.0,"clear":1.0,"taskLabel":"xx","detail":{"taskLabel":0.9,"taskLabel2":0.8}},"ruleConfidence":1.0,"filterReason":"number set","code":10000,"msg":"ok"} | Synchronous, 5ms |
Therefore, when the voice interaction method of the invention makes the second, synchronous request, it directly reads the rejection model result and performs only rule prediction, which further reduces the end-to-end time delay of the voice interaction process while ensuring the accuracy of the voice interaction.
In summary, the voice interaction method of the invention hides the time delay in the backbone link through the first, asynchronous request (step 8), and directly reads the model result and performs rule prediction at the second, synchronous request (step 15). Comparing steps 1 to 18 in fig. 3 of the voice interaction method of the invention with steps 7 to 14 in fig. 1 of the related art, the voice interaction method of the invention reduces the added end-to-end time delay from 200ms to 6-10ms: the voice interaction scheme of fig. 1 adds 200ms of end-to-end time delay, whereas the voice interaction scheme of the invention adds only 6-10ms. The detailed time-consumption analysis and comparison is shown in tables 9 and 10.
TABLE 9
| Voice interaction scheme of the related art | Time-consuming steps | Time consumed |
| Synchronous scheme | Steps 7-14 of fig. 1 | Synchronous, 200ms, added to the backbone link |
| End-to-end time consumption added | 200ms |
TABLE 10
| Voice interaction scheme of the invention | Time-consuming steps | Time consumed |
| Asynchronous scheme | 1.2 2.1 2.2 3.1 3.2 | Asynchronous, 50ms, hidden in the backbone link |
| 1.1 4 5 6 7 | Asynchronous, 100ms, hidden in the backbone link |
| 8.1 9 10 11 | Asynchronous, 20ms, hidden in the backbone link |
| 2.1 8 | Backbone link, 200-500ms |
| 12 13 | Asynchronous, 30ms, hidden in the backbone link |
| 16 17 | Synchronous, 1-5ms, added to the backbone link |
| 18 | Synchronous, 5ms, added to the backbone link |
| End-to-end time consumption added | 6-10ms |
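The arithmetic behind tables 9 and 10 can be checked directly: only the synchronous steps (16-17 and 18) add to the backbone link, while every asynchronous stage is hidden, so the invention adds 6-10ms versus 200ms for the related art.

```python
# Time-consumption figures taken from tables 9 and 10.
hidden_async_ms = 50 + 100 + 20 + 30   # asynchronous stages, hidden in the backbone link
sync_added_ms_min = 1 + 5              # steps 16-17 at their 1ms best, plus step 18 (5ms)
sync_added_ms_max = 5 + 5              # steps 16-17 at their 5ms worst, plus step 18 (5ms)
related_art_ms = 200                   # synchronous scheme of fig. 1 (steps 7-14)

# Added end-to-end delay of the invention: 6-10ms, versus 200ms before.
print(sync_added_ms_min, sync_added_ms_max)   # 6 10
print(related_art_ms - sync_added_ms_max)     # 190 (ms saved, at minimum)
```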
The present invention also provides a non-transitory computer-readable storage medium storing a computer program. The computer program, when executed by one or more processors, implements the voice interaction method described in any of the embodiments above.
For example, the computer program when executed by a processor implements the steps of the following voice interaction method:
01: receiving a voice request forwarded by a vehicle, and performing downstream logic processing on the voice request;
02: sending an asynchronous request while performing the downstream logic processing on the voice request, so as to perform the first semantic rejection on the voice request according to the context features and obtain the first semantic rejection result;
03: after receiving the downstream logic processing result, sending a synchronous request to perform the second semantic rejection on the voice request according to the downstream logic processing result and the first semantic rejection result, so as to obtain the second semantic rejection result;
04: transmitting the second semantic rejection result to the vehicle to complete the voice interaction.
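The four steps above hinge on step 02 running concurrently with step 01, so the rejection latency overlaps the downstream processing instead of adding to it. A hedged end-to-end sketch, in which all timings and function bodies are stand-ins rather than the patent's implementation:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def downstream_logic_processing(query):
    time.sleep(0.05)                       # stand-in for NLU/DM/BOT work (step 01)
    return {"query": query, "domains": ["ac"]}

def first_semantic_rejection(query):
    time.sleep(0.02)                       # hidden inside the downstream 50ms (step 02)
    return {"query": query, "clear": 1.0}

def handle_voice_request(query):
    with ThreadPoolExecutor(max_workers=2) as pool:
        rejection = pool.submit(first_semantic_rejection, query)       # step 02
        downstream = pool.submit(downstream_logic_processing, query)   # step 01
        # Step 03: the second semantic rejection needs both results.
        first = rejection.result()
        logic = downstream.result()
    accepted = first["clear"] >= 0.5 and bool(logic["domains"])
    # Step 04: the fused result is what gets transmitted back to the vehicle.
    return {"query": query, "accepted": accepted}

result = handle_voice_request("air volume third gear")
```

Because both futures run in parallel, the total wait is roughly the slower of the two (about 50ms here), not their sum, which is the latency-hiding effect the method claims.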
It will be appreciated that the computer program comprises computer program code. The computer program code may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable storage medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), and a software distribution medium.
The storage medium of the invention performs the first semantic rejection on the voice request according to the context features, and performs the second semantic rejection on the voice request according to the first semantic rejection result and the downstream logic processing result to obtain the second semantic rejection result, so as to complete the voice interaction, thereby greatly reducing the end-to-end time delay of the voice interaction process.