Disclosure of Invention
The invention aims to provide a driving emotion early warning method and system based on voice analysis, so as to solve the problems described in the background art.
In order to achieve the above purpose, the present invention provides the following technical solutions:
The driving emotion early warning method based on voice analysis comprises the following specific steps:
Step one, acquiring voice data of a driver and vehicle-mounted dynamic environment noise data, presetting a voice instruction validity rule, and extracting valid voice instruction segments from the voice data to obtain a valid voice instruction set;
Step two, based on the effective voice instruction set, further performing association analysis and processing on the voice signal and the vehicle-mounted dynamic environment noise data, performing definition optimization on the voice signal by adopting a multi-channel noise reduction algorithm to generate a definition-optimized instruction sequence, and calculating the background noise power Np in the instruction sequence for real-time evaluation;
Step three, presetting a voice cutting length, performing voice segmentation on the definition-optimized instruction sequence, and cutting the instruction sequence into a plurality of short-time valid instruction segments;
Step four, according to the recognized driving intentions, aggregating adjacent short-time intentions through time-series analysis to generate a stable driving intention sequence, and on this basis calculating and evaluating the command recognition instability Ls to analyze the smoothness of man-machine interaction;
Step five, based on the evaluation of the command recognition instability Ls, further calculating the high interaction friction probability Pf and analyzing the tendency toward command recognition failure or repeated user commands;
when the high interaction friction probability Pf exceeds a preset threshold, the system automatically triggers an interaction strategy adjustment early warning and adopts a corresponding adaptive interaction strategy according to the current driving scene and interaction state.
Further, the first step specifically includes:
Setting an acquisition time period, and deploying a microphone array and a noise sensor at a key position in a vehicle cockpit to acquire driver voice data and directional vehicle-mounted environment noise data in the driving process;
wherein the microphone array directionally collects voice data of a driver through a beam forming technology and generates a voice signal V (t);
the noise sensor records directional environmental background noise including engine, road noise and wind noise and collects data to generate a directional noise signal N (t);
Applying a preset validity rule, namely noise filtering and voice definition rule, to the voice signal V (t) and the directional noise signal N (t);
dividing the voice signal V(t) that conforms to the validity rule into a plurality of valid voice instruction segments according to natural speaking pauses by using a voice activity detection algorithm, then setting a minimum instruction duration, retaining only the valid voice instruction segments longer than the minimum duration, and discarding meaningless voice segments shorter than it;
and integrating all the effective voice instruction segments screened by the effectiveness rules into an effective voice instruction set.
Further, the second step specifically includes:
using an effective voice instruction set and directional vehicle-mounted environment noise data as inputs, and establishing and applying an adaptive noise cancellation model to analyze and weaken the influence of the directional environment noise data on the effective voice instruction set;
Constructing a preliminary noise reduction instruction sequence according to the model processing result, and performing time marking on each instruction data point in the sequence to form a time marked instruction sequence;
for continuously measured ambient noise within a time window T of a selected length, a background noise power Np is calculated for evaluating the noise level of the current driving environment.
Further, the second step specifically further includes:
Collecting audio signals distributively within the cockpit space using two or more microphones, wherein each microphone independently captures audio signals from the driver and background noise sources;
performing primary filtering processing on the audio signals of each channel, removing the signal part with the filtering frequency outside the filtering frequency interval threshold value through presetting the filtering frequency interval threshold value;
The signals of the channels are time-aligned according to their relative delays and superposed; the time-aligned, superposed signal is then subjected to further noise suppression and voice enhancement by a minimum-variance beamforming algorithm, whose adjustment parameters are dynamically adapted to the real-time signal so that the gain of the signal coming from the driver's direction is kept unchanged while the variance of the output signal is minimized;
finally, power spectral density estimation is performed to evaluate whether the adjustment of the frequency domain characteristics maintains the natural characteristics and intelligibility of the driver's voice signal.
Further, the third step specifically includes:
Based on the requirement of considering instruction integrity and processing efficiency, setting a voice cutting length Hm for capturing an operation instruction formed by a single word or phrase;
half of the set voice cutting length Hm is used as the step size of a sliding window, which continuously cuts short-time valid instruction segments from the instruction sequence;
And extracting the acoustic features of each short-time effective instruction segment by using the mel frequency cepstrum coefficient to form a feature vector, and classifying the driving intention of the feature vector by adopting a deep neural network.
Further, the fourth step specifically includes:
aggregating adjacent short-term recognition intents to generate a stable driving intention sequence;
The short-time recognition intention is analyzed by utilizing a time sequence analysis technology and setting a sliding window comprising specific window size and steps, so that the dominant driving intention in the same time window is recognized and aggregated to form a continuous and stable driving intention sequence in time sequence for final instruction execution.
Further, the fourth step specifically further includes:
The command recognition instability Ls is obtained by calculating the average absolute difference between successive intentions in each driving intention sequence; the specific calculation formula is as follows:

Ls = (1/(m−1)) × Σ_{j=1}^{m−1} |C_{j+1} − C_j|

wherein j denotes the index of the time window, m is the total number of time windows, C_j is the classification code of the dominant driving intent identified in the j-th time window, and C_{j+1} is the classification code of the dominant driving intent in the immediately following time window;
Presetting an instruction recognition instability threshold Si, and comparing and evaluating with the instruction recognition instability Ls;
If the command recognition instability Ls is smaller than or equal to the command recognition instability threshold Si, the man-machine interaction is considered to be smooth, and the command recognition is stable;
if the command recognition instability Ls is greater than the command recognition instability threshold Si, the voice interaction is considered to be difficult, i.e., there is a command recognition error or the driver's intention is ambiguous.
Further, the fifth step specifically includes:
Firstly, defining a high interaction friction event, namely, considering that one high interaction friction occurs when the driving intention recognition result change in two continuous time windows exceeds a command recognition instability threshold Si;
the calculation formula of the high interaction friction probability Pf is specifically:

Pf = Qhc / Gt

wherein Qhc denotes the number of occurrences of high interaction friction, i.e. the total number of events within the monitoring period for which the change of the intention recognition result exceeds the instability threshold Si, and Gt denotes the total number of time windows.
Further, the fifth step further specifically includes:
setting a preset threshold value Pth of high interaction friction probability, and carrying out comparison evaluation with the high interaction friction probability Pf;
When the high interaction friction probability Pf exceeds a preset threshold Pth, the system automatically triggers an interaction strategy adjustment early warning mechanism and invokes a corresponding self-adaptive interaction strategy library;
the adaptive interaction strategy library comprises a plurality of response measures customized for specific interaction dilemmas, including 'actively initiating a clarification inquiry', 'simplifying interface information', 'suggesting the use of physical keys' and 'retrying at a safe opportunity', and the system selects a suitable strategy according to the currently identified interaction state;
Meanwhile, the system sends a signal to the driving auxiliary module so as to temporarily reduce interference of non-key information during difficult interaction and ensure that a driver focuses on driving tasks.
The driving emotion early warning system based on voice analysis comprises the following modules:
the data acquisition processing module acquires voice data of a driver and vehicle-mounted dynamic environment noise data, presets voice instruction validity rules, and extracts valid voice instruction sections from the voice data to obtain a valid voice instruction set;
The voice signal enhancement module is used for further performing association analysis and processing on the voice signal and the vehicle-mounted dynamic environment noise data based on the effective voice instruction set, performing definition optimization on the voice signal by adopting a multi-channel noise reduction algorithm to generate a definition-optimized instruction sequence, and calculating the background noise power Np in the instruction sequence for real-time evaluation;
The driving intention recognition module is used for presetting a voice cutting length, carrying out voice segmentation processing based on the command sequence after definition optimization, and cutting the command sequence into a plurality of short-time effective command segments;
The interactive fluency assessment module is used for aggregating adjacent short-time intentions through time sequence analysis according to the identified driving intentions to generate a stable driving intention sequence, and calculating an instruction identification instability Ls and assessing the instability on the basis so as to analyze the fluency of man-machine interaction;
The self-adaptive strategy decision module is used for further calculating high interaction friction probability Pf based on the evaluation content of the command recognition instability Ls and analyzing the trend of the command recognition failure or the repeated command of the user;
When the high interaction friction probability Pf exceeds a preset threshold, the system automatically triggers an interaction strategy adjustment early warning and adopts a corresponding adaptive interaction strategy according to the current driving scene and interaction state.
Compared with the prior art, the invention has the beneficial effects that:
According to the method, an effective voice instruction set is first screened out through a voice activity detection algorithm and a minimum instruction duration rule; an adaptive noise cancellation model and an advanced multi-channel noise reduction algorithm are then applied, in particular performing noise suppression and voice enhancement on the multi-channel signals through a minimum-variance beamforming algorithm, while the background noise power Np is continuously calculated within a time window T to evaluate the noise level in real time; finally a definition-optimized instruction sequence is generated. The purity and integrity of the input voice signal are thereby guaranteed at the source, and high-quality voice information can be extracted even in strongly interfering environments such as engine booming and wind noise;
Based on the definition-optimized instruction sequence, the accuracy of voice recognition and driving intention understanding is remarkably improved by setting the voice cutting length Hm, extracting feature vectors using mel-frequency cepstral coefficients and classifying with a deep neural network. Furthermore, the invention creatively provides a set of interaction quality evaluation and early warning mechanisms: real-time quantitative monitoring of man-machine interaction smoothness is achieved by calculating the command recognition instability Ls and comparing it with the command recognition instability threshold Si, and by calculating the high interaction friction probability Pf and comparing it with the preset threshold Pth. When the high interaction friction probability Pf exceeds the preset threshold Pth, the interaction strategy adjustment early warning can be triggered automatically and the adaptive interaction strategy library invoked, thereby avoiding operation failures and driver distraction caused by recognition errors and greatly improving the interaction experience and driving safety.
Detailed Description
The present invention will be further described in detail with reference to specific embodiments in order to make the objects, technical solutions and advantages of the present invention more apparent.
It should be noted that, unless otherwise defined, technical or scientific terms used herein should be given the ordinary meaning understood by a person of ordinary skill in the art to which the present invention belongs. The terms "first", "second" and the like used herein do not denote any order, quantity or importance, but are merely used to distinguish one element from another. The word "comprising", "comprises" or the like means that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. The terms "connected", "coupled" and the like are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Up", "down", "left", "right" and the like are used only to indicate a relative positional relationship, and when the absolute position of the described object changes, the relative positional relationship may change accordingly.
Embodiment one:
Referring to fig. 1, the invention provides a driving emotion early warning method based on voice analysis, which comprises the following specific steps:
Step one, acquiring voice data of a driver and vehicle-mounted dynamic environment noise data, presetting a voice instruction validity rule, and extracting valid voice instruction segments from the voice data to obtain a valid voice instruction set;
Step two, based on the effective voice instruction set, further performing association analysis and processing on the voice signal and the vehicle-mounted dynamic environment noise data, performing definition optimization on the voice signal by adopting a multi-channel noise reduction algorithm to generate a definition-optimized instruction sequence, and calculating the background noise power Np in the instruction sequence for real-time evaluation;
Step three, presetting a voice cutting length, performing voice segmentation on the definition-optimized instruction sequence, and cutting the instruction sequence into a plurality of short-time valid instruction segments;
Step four, according to the recognized driving intentions, aggregating adjacent short-time intentions through time-series analysis to generate a stable driving intention sequence, and on this basis calculating and evaluating the command recognition instability Ls to analyze the smoothness of man-machine interaction;
Step five, based on the evaluation of the command recognition instability Ls, further calculating the high interaction friction probability Pf and analyzing the tendency toward command recognition failure or repeated user commands;
when the high interaction friction probability Pf exceeds a preset threshold, the system automatically triggers an interaction strategy adjustment early warning and adopts a corresponding adaptive interaction strategy according to the current driving scene and interaction state.
Further, the first step effectively solves the source problem of interference from the complex vehicle-mounted environment: a microphone array and a noise sensor are deployed in the cockpit, and the acquired driver voice data and vehicle-mounted dynamic environment noise data are subjected to preliminary validity rule screening, so that the quality and purity of the data processed later are ensured;
In the second step, association analysis and multi-channel noise reduction processing are performed on the effective voice instruction set and the vehicle-mounted dynamic environment noise data, voice definition is optimized in particular through the minimum-variance beamforming algorithm, and the background noise power Np is calculated in real time, which remarkably improves the signal-to-noise ratio and anti-interference capability of the voice signal and lays the foundation for accurate recognition;
The third step, based on the definition-optimized instruction sequence, realizes accurate judgment of the driver's intention through presetting the voice cutting length Hm, extracting mel-frequency cepstral coefficient features and classifying with a deep neural network, which solves the accuracy problems of voice recognition and intention understanding and avoids operation errors;
In the fourth step, adjacent short-time intentions are aggregated through time-series analysis and the command recognition instability Ls is calculated, quantitatively evaluating the smoothness and stability of man-machine interaction and providing a basis for system optimization;
In the fifth step, by calculating the high interaction friction probability Pf and comparing it with the preset threshold Pth, potential interaction dilemmas or recognition failures can be discovered and predicted in time, and adaptive interaction strategy adjustment is automatically triggered, which significantly reduces the driver's cognitive load, reduces repeated instructions and manual operations, and comprehensively improves the efficiency, accuracy and safety of vehicle-mounted voice interaction.
The first step specifically comprises:
Setting an acquisition time period, and deploying a microphone array and a noise sensor at a key position in a vehicle cockpit to acquire driver voice data and directional vehicle-mounted environment noise data in the driving process;
wherein the microphone array directionally collects voice data of a driver through a beam forming technology and generates a voice signal V (t);
the noise sensor records directional environmental background noise including engine, road noise and wind noise and collects data to generate a directional noise signal N (t);
Applying a preset validity rule, namely noise filtering and voice definition rule, to the voice signal V (t) and the directional noise signal N (t);
dividing the voice signal V(t) that conforms to the validity rule into a plurality of valid voice instruction segments according to natural speaking pauses by using a voice activity detection algorithm, then setting a minimum instruction duration, retaining only the valid voice instruction segments longer than the minimum duration, and discarding meaningless voice segments shorter than it;
and integrating all the effective voice instruction segments screened by the effectiveness rules into an effective voice instruction set.
Further, sensor deployment is carried out in the vehicle cockpit. Specifically, a linear array formed by 4 MEMS digital microphones is deployed in the driver's seat headrest at a position about 0.5 m from the driver's mouth, with a spacing of 3 cm between microphones; independent noise sensors are additionally deployed below the instrument panel and near the driver-side A-pillar interior trim, and are used to directionally capture low-frequency noise Ne(t), which includes engine noise, and high-frequency noise Nw(t), which includes wind noise;
All sensors are synchronized by a unified clock and acquire data at a 48 kHz sampling rate with 24-bit quantization depth. The acquired raw multi-channel voice data are processed by a generalized sidelobe canceller beamforming algorithm, which aims the main lobe of the beam precisely at the direction of the driver's mouth and controls the beam width within ±30°, so that the multi-channel data are synthesized into a single-channel, directivity-enhanced voice signal V(t);
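For illustration only, the fixed-beamforming stage of such a pipeline can be sketched as follows. The embodiment specifies a generalized sidelobe canceller; the sketch below shows only a simplified delay-and-sum fixed beamformer (the first stage of a GSC), with the array geometry constants taken from the text and all function names being illustrative assumptions:

```python
# Simplified fixed-beamformer stage (delay-and-sum) of a GSC-style pipeline.
# Assumes a 4-microphone linear array, 3 cm spacing, 48 kHz sampling, and a
# known steering angle toward the driver's mouth; all names are illustrative.
import numpy as np

FS = 48_000          # sampling rate (Hz)
C = 343.0            # speed of sound (m/s)
MIC_SPACING = 0.03   # 3 cm between adjacent microphones


def delay_and_sum(channels: np.ndarray, steering_deg: float) -> np.ndarray:
    """channels: (n_mics, n_samples) array; returns a single-channel V(t)."""
    n_mics, n_samples = channels.shape
    out = np.zeros(n_samples)
    for m in range(n_mics):
        # geometric delay of mic m relative to mic 0 for the steering direction
        tau = m * MIC_SPACING * np.sin(np.deg2rad(steering_deg)) / C
        shift = int(round(tau * FS))
        out += np.roll(channels[m], -shift)      # integer-sample alignment
    return out / n_mics                          # unity gain toward the driver
```

A full generalized sidelobe canceller would add an adaptive blocking/sidelobe-cancelling branch on top of this fixed beam.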
In parallel, the collected independent noise signals are processed: the instantaneous amplitude of the integrated directional noise signal at time point t, i.e. the integrated directional noise signal N(t), is generated by taking the maximum of the instantaneous absolute values of the low-frequency and high-frequency noise sensor signals, with the specific calculation formula:

N(t) = max(|Ne(t)|, |Nw(t)|)

wherein Ne(t) denotes the instantaneous amplitude of the engine/low-frequency noise sensor signal at time point t, Nw(t) denotes the instantaneous amplitude of the wind/high-frequency noise sensor signal at time point t, and max(·,·) denotes taking the maximum of the two;
subsequently, the root mean square energy of the integrated directional noise signal N (t) is calculated over a 20 ms analysis frame, denoted noise intensity Nrm (t);
Next, a hierarchical validity rule is applied to the voice signal V (t). Firstly, noise filtering is performed by setting a dynamic noise intensity threshold deltaN (t);
In each 20 ms voice frame, if the noise intensity Nrm (t) of the frame is greater than the dynamic noise intensity threshold value, setting the voice signal V (t) in the frame to zero to generate a voice signal after noise filtration;
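A minimal sketch of this frame-wise noise gate is given below. The max-combination of the two noise channels and the 20 ms analysis frame follow the text; the 1.5× margin over the tracked average noise level and the length of the tracking history are illustrative assumptions:

```python
# Noise gating per 20 ms frame: N(t) = max(|Ne(t)|, |Nw(t)|), Nrm(t) = frame RMS,
# and V(t) is zeroed in frames whose Nrm(t) exceeds a dynamic threshold.
# The 1.5x margin and the ~1 s history length are illustrative choices.
import numpy as np

FS = 48_000
FRAME = int(0.020 * FS)          # 20 ms analysis frame


def noise_gate(v: np.ndarray, ne: np.ndarray, nw: np.ndarray,
               margin: float = 1.5) -> np.ndarray:
    n_t = np.maximum(np.abs(ne), np.abs(nw))      # integrated directional noise N(t)
    v = v.copy()
    history = []                                  # recent per-frame noise intensities
    for start in range(0, len(v) - FRAME + 1, FRAME):
        frame_noise = n_t[start:start + FRAME]
        nrm = np.sqrt(np.mean(frame_noise ** 2))  # noise intensity Nrm(t)
        # dynamic threshold: recent average background level times a margin
        delta_n = margin * (np.mean(history) if history else nrm)
        history = (history + [nrm])[-50:]         # keep roughly 1 s of history
        if nrm > delta_n:
            v[start:start + FRAME] = 0.0          # drop speech drowned by noise
    return v
```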
Then, speech intelligibility assessment is performed: for each 200 ms speech segment, the speech energy E is calculated as follows:

E = ∫_{t0}^{t1} [V(t)]² dt

wherein t0 and t1 denote the lower and upper limits of the integral, t0 being the start time and t1 the end time of the analysed speech segment; V(t) is the noise-filtered voice signal at each instant from t0 to t1; the square is taken of the amplitude of the noise-filtered voice signal; and dt denotes integration over time;
If the calculated speech energy E is smaller than the preset fixed speech energy threshold, the speech segment is judged to be invalid data and rejected.
When the voice activity detection algorithm detects a non-speech section lasting more than 300 milliseconds, it automatically closes the preceding continuous speech as an independent voice instruction segment;
Finally, all effective voice instruction segments that pass the screening are integrated and stored in the form of an ordered queue to construct the effective voice instruction set, in which the audio data are uniformly stored in 16-bit PCM format with instruction serial number metadata attached;
The setting basis of the dynamic noise intensity threshold is the recent average background noise level tracked in real time, and aims to adaptively distinguish the background noise of voice and dynamic change, so that the noise can be effectively filtered in different driving environments;
The setting basis of the fixed voice energy threshold is the minimum energy standard which is counted based on a large amount of experimental data and is required for forming clearly distinguishable voices, and aims to eliminate meaningless voices with too low volume and ensure that all processed voices have basic recognizability.
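The segmentation and energy screening described above can be sketched as follows. The 300 ms pause rule and the discrete form of the energy integral follow the text; the minimum instruction duration and the fixed energy threshold value are illustrative assumptions:

```python
# Segmentation of the gated signal into valid instruction segments:
# split at non-speech gaps >= 300 ms, drop segments shorter than a minimum
# duration, and require the energy E ~ sum(V^2)/FS of each segment to exceed
# a fixed threshold. MIN_DUR and E_MIN are assumed values.
import numpy as np

FS = 48_000
GAP = int(0.300 * FS)        # 300 ms of silence closes an instruction segment
MIN_DUR = int(0.400 * FS)    # assumed minimum instruction duration (400 ms)
E_MIN = 1e-3                 # assumed fixed speech-energy threshold


def split_instructions(v_f: np.ndarray) -> list[np.ndarray]:
    active = np.abs(v_f) > 0          # frames zeroed by the noise gate count as silence
    segments, start, silence = [], None, 0
    for i, is_speech in enumerate(active):
        if is_speech:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= GAP:        # a pause long enough to end the instruction
                segments.append(v_f[start:i - silence + 1])
                start, silence = None, 0
    if start is not None:
        segments.append(v_f[start:])
    # keep only segments that are long enough and energetic enough
    return [s for s in segments
            if len(s) >= MIN_DUR and np.sum(s ** 2) / FS >= E_MIN]
```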
The second step specifically comprises:
using an effective voice instruction set and directional vehicle-mounted environment noise data as inputs, and establishing and applying an adaptive noise cancellation model to analyze and weaken the influence of the directional environment noise data on the effective voice instruction set;
Constructing a preliminary noise reduction instruction sequence according to the model processing result, and performing time marking on each instruction data point in the sequence to form a time marked instruction sequence;
for continuously measured ambient noise within a time window T of a selected length, a background noise power Np is calculated for evaluating the noise level of the current driving environment.
The second step specifically further comprises:
Collecting audio signals distributively within the cockpit space using two or more microphones, wherein each microphone independently captures audio signals from the driver and background noise sources;
performing primary filtering processing on the audio signals of each channel, removing the signal part with the filtering frequency outside the filtering frequency interval threshold value through presetting the filtering frequency interval threshold value;
The signals of the channels are time-aligned according to their relative delays and superposed; the time-aligned, superposed signal is then subjected to further noise suppression and voice enhancement by a minimum-variance beamforming algorithm, whose adjustment parameters are dynamically adapted to the real-time signal so that the gain of the signal coming from the driver's direction is kept unchanged while the variance of the output signal is minimized;
finally, power spectral density estimation is performed to evaluate whether the adjustment of the frequency domain characteristics maintains the natural characteristics and intelligibility of the driver's voice signal.
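The "keep unit gain toward the driver while minimizing output variance" criterion corresponds to what is commonly realized as an MVDR (minimum variance distortionless response) beamformer. A minimal sketch of the per-frequency-bin weight computation is given below, assuming the spatial covariance matrix R and the driver-direction steering vector d have already been estimated; the diagonal loading term is an illustrative stabilization choice, not part of the text:

```python
# Minimal MVDR weight computation: keep unit gain toward the driver (steering
# vector d) while minimizing the output variance w^H R w, where R is the
# spatial covariance of the microphone signals at one frequency bin.
import numpy as np


def mvdr_weights(R: np.ndarray, d: np.ndarray, loading: float = 1e-3) -> np.ndarray:
    """R: (n_mics, n_mics) covariance at one frequency bin; d: steering vector."""
    # diagonal loading keeps the inversion stable when R is near-singular
    R_loaded = R + loading * np.trace(R).real / R.shape[0] * np.eye(R.shape[0])
    Rinv_d = np.linalg.solve(R_loaded, d)
    return Rinv_d / (d.conj() @ Rinv_d)          # w = R^-1 d / (d^H R^-1 d)


# Example for one frequency bin, given the per-channel spectra x of shape (n_mics,):
#   y = w.conj() @ x   -> enhanced driver-direction component
```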
Further, the second step aims to carry out secondary deep noise reduction on each instruction section in the effective voice instruction set, specifically, the vehicle-mounted processing unit carries out self-adaptive noise elimination processing on any instruction section obtained from the effective voice instruction set and directional vehicle-mounted environment noise data synchronously recorded in a corresponding time period by applying spectral subtraction;
The processor performs the self-adaptive noise reduction and simultaneously performs the quantitative evaluation on the environmental noise intensity when the current instruction occurs, and calculates the average background noise power Np of the current instruction in the whole time window by utilizing the noise data segment Ns (t) synchronous with the current voice instruction segment;
The specific calculation formula is as follows:

Np = (1/L) × Σ_{i=1}^{L} [Ns(i)]²

wherein L is the total number of sampling points in the noise data segment, Ns(i) is the amplitude of the i-th sampling point of the noise data segment within the time window, and i denotes the data point index;
the calculated background noise power Np is used as a quantized environmental noise level index, and reflects the noisy degree of the cockpit when the instruction is sent out;
For the convenience of subsequent processing and traceability, the processor integrates the output of the preceding steps in a structured way: each time an effective voice instruction segment is processed, it creates a data structure containing three key items of information, namely the noise-reduced instruction audio data, the absolute start timestamp tstart within the whole driving session, and the average background noise power Np;
The processor adds the data structure body into an ordered list in time sequence to finally form a 'time-marked noise reduction instruction sequence', wherein the time-marked noise reduction instruction sequence not only contains high-quality noise reduction voice, but also is accompanied with key context metadata, so that comprehensive and instant available input is provided for a subsequent voice recognition engine;
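A minimal sketch of the Np computation and the time-marked record is shown below. The formula for Np and the three fields follow the text; the concrete dataclass layout is an illustrative assumption, not a mandated format:

```python
# Per-instruction context record: average background noise power Np over the
# instruction's time window, plus the absolute start timestamp tstart.
from dataclasses import dataclass
import numpy as np


def background_noise_power(ns: np.ndarray) -> float:
    """Np = (1/L) * sum_i ns[i]^2 over the L samples of the noise segment."""
    return float(np.mean(ns.astype(np.float64) ** 2))


@dataclass
class TimedInstruction:
    audio: np.ndarray        # noise-reduced instruction audio
    tstart: float            # absolute start time in the driving session (s)
    np_power: float          # average background noise power Np


def append_to_sequence(seq: list, audio: np.ndarray, tstart: float,
                       noise_segment: np.ndarray) -> None:
    seq.append(TimedInstruction(audio, tstart, background_noise_power(noise_segment)))
    seq.sort(key=lambda rec: rec.tstart)    # keep the sequence time-ordered
```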
In order to ensure that the noise reduction process does not cause excessive damage to the voice, a final quality verification checkpoint is set in the second step;
Specifically, the PSD estimation is performed by the Welch method, a Hamming window is used as a window function, the window length is set to 512 sampling points, the overlapping rate between windows is set to 50%, the calculated PSD result is compared with a reference human voice PSD model stored in the system in advance, and the reference model is obtained by carrying out average PSD calculation on voices recorded by a large number of speakers with different sexes and ages in a quiet environment.
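The Welch parameters above (Hamming window, 512-sample segments, 50% overlap) can be applied directly with SciPy, as sketched below. The text does not specify how the comparison with the reference voice PSD is scored, so the log-spectral distance metric and its threshold here are assumptions:

```python
# Welch PSD estimate of the denoised instruction compared against a stored
# reference voice PSD; the similarity metric and threshold are assumptions.
import numpy as np
from scipy.signal import welch

FS = 48_000


def psd_check(denoised: np.ndarray, reference_psd: np.ndarray,
              max_log_distance_db: float = 6.0) -> bool:
    _, pxx = welch(denoised, fs=FS, window='hamming',
                   nperseg=512, noverlap=256)       # 50% overlap
    # mean log-spectral distance between the instruction PSD and the reference
    eps = 1e-12
    dist = np.mean(np.abs(10 * np.log10(pxx + eps)
                          - 10 * np.log10(reference_psd + eps)))
    return dist <= max_log_distance_db              # True: naturalness preserved
```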
The third step specifically comprises:
Based on the requirement of considering instruction integrity and processing efficiency, setting a voice cutting length Hm for capturing an operation instruction formed by a single word or phrase;
half of the set voice cutting length Hm is used as the step size of a sliding window, which continuously cuts short-time valid instruction segments from the instruction sequence;
And extracting the acoustic features of each short-time effective instruction segment by using the mel frequency cepstrum coefficient to form a feature vector, and classifying the driving intention of the feature vector by adopting a deep neural network.
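The sliding-window cutting itself is straightforward; a minimal sketch follows. The hop of Hm/2 is from the text, while the concrete value of Hm (here 0.8 s, roughly a single word or short phrase) is an illustrative assumption:

```python
# Sliding-window cutting: window length Hm, hop Hm/2, applied to one
# definition-optimized instruction. Hm = 0.8 s is an assumed value.
import numpy as np

FS = 48_000
HM = int(0.8 * FS)       # assumed voice cutting length Hm (samples)
HOP = HM // 2            # step size = half of Hm


def cut_short_segments(instruction: np.ndarray) -> list[np.ndarray]:
    segments = []
    for start in range(0, max(len(instruction) - HM, 0) + 1, HOP):
        segments.append(instruction[start:start + HM])
    return segments
```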
Further, the noise-reduced instruction signal to be recognized is retrieved, the frame length is set to 25 milliseconds and the frame shift to 10 milliseconds for sliding-window framing, and a Hamming window function is applied to each frame;
Finally, the 39-dimensional feature vectors of all frames within a single complete instruction are arranged in time order to construct an N × 39 two-dimensional feature matrix, where N is the total number of frames of the instruction;
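One way to obtain such a matrix is sketched below, assuming the 39 dimensions are the common convention of 13 static MFCCs plus first- and second-order deltas (the text does not state the split); the librosa parameters mirror the 25 ms / 10 ms Hamming framing above and are otherwise illustrative:

```python
# 39-dimensional acoustic features per 25 ms frame (10 ms shift, Hamming window),
# assumed to be 13 MFCCs plus delta and delta-delta, stacked into an N x 39 matrix.
import numpy as np
import librosa

FS = 48_000


def mfcc_matrix(segment: np.ndarray) -> np.ndarray:
    mfcc = librosa.feature.mfcc(y=segment.astype(np.float32), sr=FS, n_mfcc=13,
                                n_fft=2048,
                                win_length=int(0.025 * FS),   # 25 ms frames
                                hop_length=int(0.010 * FS),   # 10 ms shift
                                window='hamming')
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T            # shape (N frames, 39)
```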
Inputting the generated feature matrix into a pre-trained hybrid neural network model for driving intention recognition, wherein the innovative structure of the hybrid neural network model is as follows:
The one-dimensional convolution module comprises two cascaded one-dimensional convolution layers, wherein the two cascaded one-dimensional convolution layers are used for automatically learning and extracting local robust acoustic modes insensitive to pronunciation change along the direction of a frequency axis from an input feature matrix, each convolution layer is provided with 64 convolution kernels with the size of 3, a modified linear unit activation function is used, and a maximum pooling layer is connected to refine features and reduce data dimension;
The long-term and short-term memory network module comprises two stacked LSTM layers, wherein 128 hidden units are arranged on each layer, the module receives a characteristic sequence processed by the one-dimensional convolution module, and the core function of the module is to learn the long-distance dependence relationship between acoustic primitives in the time dimension so as to capture complete words, phrases and syntactic structure information;
The output layer is a fully connected layer followed by a Softmax activation function, which maps the high-level temporal information captured by the LSTM layers to a probability distribution over M preset driving intention categories, such as 'navigate to company' and 'increase volume', and outputs the category with the highest probability as the recognition result;
judging the confidence coefficient of the recognition result by setting a confidence coefficient threshold A, and when the highest probability value is greater than or equal to the confidence coefficient threshold A, confirming that the classification result is effective driving intention and outputting the classification result to a vehicle-mounted control system;
otherwise, if the confidence coefficient is lower than the confidence coefficient threshold value A, the recognition is judged to be low in confidence coefficient, the system refuses to execute the instruction and can trigger a voice prompt to request the user to repeat the instruction, and the robustness and the safety of man-machine interaction are ensured.
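A non-limiting sketch of such a classifier and its confidence gate is shown below. The layer sizes (two Conv1D layers with 64 kernels of size 3, max pooling, two stacked 128-unit LSTM layers, softmax output) follow the description above; the number of classes M, the threshold A = 0.8 and everything related to training are assumptions:

```python
# Sketch of the hybrid intent classifier: Conv1D feature extraction, stacked
# LSTMs for temporal structure, softmax over M intents, gated by threshold A.
import numpy as np
from tensorflow.keras import layers, models

M_CLASSES = 6          # e.g. the six intent codes used in the later embodiment
CONF_A = 0.80          # assumed confidence threshold A


def build_intent_model() -> models.Model:
    return models.Sequential([
        layers.Input(shape=(None, 39)),                # N x 39 feature matrix
        layers.Conv1D(64, 3, padding='same', activation='relu'),
        layers.Conv1D(64, 3, padding='same', activation='relu'),
        layers.MaxPooling1D(pool_size=2),              # refine features, reduce length
        layers.LSTM(128, return_sequences=True),       # long-range temporal structure
        layers.LSTM(128),
        layers.Dense(M_CLASSES, activation='softmax'), # probability over intents
    ])


def classify(model: models.Model, features: np.ndarray):
    probs = model.predict(features[np.newaxis, ...], verbose=0)[0]
    best = int(np.argmax(probs))
    if probs[best] >= CONF_A:
        return best, float(probs[best])    # accepted as a valid driving intent
    return None, float(probs[best])        # low confidence: ask the user to repeat
```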
The fourth step specifically comprises:
aggregating adjacent short-term recognition intents to generate a stable driving intention sequence;
The short-time recognition intention is analyzed by utilizing a time sequence analysis technology and setting a sliding window comprising specific window size and steps, so that the dominant driving intention in the same time window is recognized and aggregated to form a continuous and stable driving intention sequence in time sequence for final instruction execution.
The fourth step specifically further comprises:
The command recognition instability Ls is obtained by calculating the average absolute difference between successive intentions in each driving intention sequence; the specific calculation formula is as follows:

Ls = (1/(m−1)) × Σ_{j=1}^{m−1} |C_{j+1} − C_j|

wherein j denotes the index of the time window, m is the total number of time windows, C_j is the classification code of the dominant driving intent identified in the j-th time window, and C_{j+1} is the classification code of the dominant driving intent in the immediately following time window;
Presetting an instruction recognition instability threshold Si, and comparing and evaluating with the instruction recognition instability Ls;
If the command recognition instability Ls is smaller than or equal to the command recognition instability threshold Si, the man-machine interaction is considered to be smooth, and the command recognition is stable;
if the command recognition instability Ls is greater than the command recognition instability threshold Si, the voice interaction is considered to be difficult, i.e., there is a command recognition error or the driver's intention is ambiguous.
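The Ls computation and the comparison with Si reduce to a few lines; a minimal sketch with the threshold value Si = 0.5 taken from the later embodiment:

```python
# Command recognition instability Ls: mean absolute difference between the
# classification codes of successive dominant intents, compared with threshold Si.
def instability(codes: list[int]) -> float:
    """Ls = (1/(m-1)) * sum_j |C[j+1] - C[j]| over a dominant-intent sequence."""
    m = len(codes)
    if m < 2:
        return 0.0
    return sum(abs(codes[j + 1] - codes[j]) for j in range(m - 1)) / (m - 1)


SI = 0.5                                   # instability threshold from the embodiment
ls_smooth = instability([1, 1, 1, 1])      # smooth interaction -> 0.0 <= SI
ls_hard = instability([4, 5, 4, 5, 4])     # jumping intents    -> 1.0 >  SI
```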
Further, based on dominant driving intention recognition of the confidence weighted voting, by setting a sliding window with a time window size of 1.5 seconds and a step of 0.5 seconds, performing the confidence weighted voting on the aggregated short-time recognition intention set in each time window;
Specifically, accumulating the confidence coefficient score of each short-time recognition intention to the corresponding intention category, and finally recognizing the intention category with the highest accumulated confidence coefficient score as the dominant driving intention of the time window;
setting a continuous stability window counter with a trigger threshold value set as q based on a dominant driving intention sequence output according to time sequence, and confirming the intention as a stable executable instruction only when the same dominant driving intention is continuously recognized in q time windows and sending the instruction into an instruction execution queue;
if the dominant driving intention changes at any point, the counter is reset, which effectively filters out instantaneous jitter in the instruction sequence;
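The confidence-weighted voting and the q-window stability gate described above can be sketched as follows; q = 3 matches the later embodiment, and the data layout is an illustrative assumption:

```python
# Confidence-weighted voting inside each 1.5 s window (0.5 s step) plus a
# q-window stability counter before an intent is released for execution.
from collections import defaultdict


def dominant_intent(window_hits: list[tuple[int, float]]) -> int:
    """window_hits: (intent_code, confidence) pairs falling in one time window."""
    scores = defaultdict(float)
    for code, conf in window_hits:
        scores[code] += conf                    # accumulate confidence per intent
    return max(scores, key=scores.get)


class StabilityGate:
    def __init__(self, q: int = 3):
        self.q, self.last, self.count = q, None, 0

    def update(self, intent: int):
        """Returns the intent once it has been dominant for q consecutive windows."""
        self.count = self.count + 1 if intent == self.last else 1
        self.last = intent
        return intent if self.count >= self.q else None
```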
When it is determined that there is an instruction recognition error or the driver's intention is ambiguous, first-level adjustment is started: the system broadcasts a preset voice prompt through the vehicle-mounted audio, for example "I didn't hear that, please say it again", actively requesting the user to clarify;
If the system judges the interaction to be difficult q times in succession within 1 minute, it automatically and temporarily raises the confidence threshold A of the upstream neural network to 1.5A, suppressing the output of low-quality recognition results by raising the decision threshold, until the command recognition instability Ls falls back below the threshold, after which the confidence threshold automatically reverts from 1.5A to A.
The fifth step specifically comprises:
Firstly, defining a high interaction friction event, namely, considering that one high interaction friction occurs when the driving intention recognition result change in two continuous time windows exceeds a command recognition instability threshold Si;
the calculation formula of the high interaction friction probability Pf is specifically:

Pf = Qhc / Gt

wherein Qhc denotes the number of occurrences of high interaction friction, i.e. the total number of events within the monitoring period for which the change of the intention recognition result exceeds the instability threshold Si, and Gt denotes the total number of time windows.
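A minimal sketch of this counting scheme, with Si = 0.5 as in the later embodiment:

```python
# High interaction friction probability over a monitoring period:
# Pf = Qhc / Gt, where Qhc counts window-to-window intent changes whose
# magnitude exceeds the instability threshold Si, and Gt is the window count.
def interaction_friction(dominant_codes: list[int], si: float = 0.5) -> float:
    gt = len(dominant_codes)
    if gt < 2:
        return 0.0
    qhc = sum(1 for j in range(gt - 1)
              if abs(dominant_codes[j + 1] - dominant_codes[j]) > si)
    return qhc / gt


# e.g. a 120-window period with 14 such events gives Pf = 14 / 120 ≈ 0.117 > Pth = 0.1
```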
The fifth step specifically further comprises:
setting a preset threshold value Pth of high interaction friction probability, and carrying out comparison evaluation with the high interaction friction probability Pf;
When the high interaction friction probability Pf exceeds a preset threshold Pth, the system automatically triggers an interaction strategy adjustment early warning mechanism and invokes a corresponding self-adaptive interaction strategy library;
the adaptive interaction strategy library comprises a plurality of response measures customized for specific interaction dilemmas, including 'actively initiating a clarification inquiry', 'simplifying interface information', 'suggesting the use of physical keys' and 'retrying at a safe opportunity', and the system selects a suitable strategy according to the currently identified interaction state;
Meanwhile, the system sends a signal to the driving auxiliary module so as to temporarily reduce interference of non-key information during difficult interaction and ensure that a driver focuses on driving tasks.
Furthermore, the high interaction friction probability Pf objectively quantifies the overall degree of roughness of recent interaction; after the calculation is completed, the counter Qhc is cleared. The interaction strategy adjustment early warning mechanism is characterized in that it does not directly execute a single action, but first diagnoses the reason why the high interaction friction probability Pf exceeds the preset threshold Pth, so as to distinguish different interaction dilemma states, for example:
an "intention confusion" state, whose diagnostic feature is that the dominant intention repeatedly jumps between two semantically similar instructions;
an "environment worsening" state, characterized by a high Pf value occurring in synchronization with a sharp rise in the background noise power;
According to the specific dilemma state, the system selects and executes the best-matching strategy from the preset adaptive interaction strategy library: for the "intention confusion" state, a clarification inquiry strategy is executed, e.g. "Do you want to 'navigate' or 'phone'?";
for the "environment worsening" state, an alternative-input suggestion strategy is executed, e.g. "The environment is noisy; using the physical keys is suggested";
When the early warning mechanism is triggered, the man-machine interaction system not only executes internal strategy adjustment, but also immediately broadcasts a high cognitive load state signal to the advanced driving assistance system and the instrument panel module of the vehicle through the CAN bus;
After receiving the high cognitive load state signal, the advanced driving assistance system delays the broadcast of all non-urgent safety prompt tones within 30 seconds, and the instrument panel or HUD module temporarily hides the display of secondary information such as incoming calls and media; when the subsequently calculated high interaction friction probability Pf falls back below the threshold, the system broadcasts a normal load signal and each module resumes its normal working mode.
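For illustration, the diagnosis-and-strategy-selection step can be sketched as below. The two state names come from the text; the concrete heuristics (a two-value intent set for "intention confusion", a doubling of Np for "environment worsening"), the strategy strings and the can_send placeholder are assumptions, not a real CAN or ADAS API:

```python
# Dilemma diagnosis and strategy selection once Pf exceeds Pth, followed by a
# cognitive-load broadcast. can_send is a placeholder callback, not a bus API.
def diagnose(dominant_codes: list[int], np_trend: list[float]) -> str:
    recent = dominant_codes[-10:]                       # last few windows
    if len(set(recent)) == 2:                           # ping-pong between two intents
        return "intention_confusion"
    if len(np_trend) >= 2 and np_trend[-1] > 2.0 * np_trend[0]:
        return "environment_worsening"                  # background noise power surged
    return "generic_friction"


STRATEGIES = {
    "intention_confusion": "Ask a clarification question with explicit options",
    "environment_worsening": "Suggest physical keys; environment too noisy",
    "generic_friction": "Retry the interaction at a safe opportunity",
}


def on_high_friction(dominant_codes, np_trend, can_send):
    state = diagnose(dominant_codes, np_trend)
    can_send("HIGH_COGNITIVE_LOAD")                     # placeholder CAN broadcast
    return STRATEGIES[state]
```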
Embodiment two:
Referring to fig. 2, a driving emotion early warning system based on voice analysis includes the following modules:
the data acquisition processing module acquires voice data of a driver and vehicle-mounted dynamic environment noise data, presets voice instruction validity rules, and extracts valid voice instruction sections from the voice data to obtain a valid voice instruction set;
The voice signal enhancement module is used for further performing association analysis and processing on the voice signal and the vehicle-mounted dynamic environment noise data based on the effective voice instruction set, performing definition optimization on the voice signal by adopting a multi-channel noise reduction algorithm to generate a definition-optimized instruction sequence, and calculating the background noise power Np in the instruction sequence for real-time evaluation;
The driving intention recognition module is used for presetting a voice cutting length, carrying out voice segmentation processing based on the command sequence after definition optimization, and cutting the command sequence into a plurality of short-time effective command segments;
The interactive fluency assessment module is used for aggregating adjacent short-time intentions through time sequence analysis according to the identified driving intentions to generate a stable driving intention sequence, and calculating an instruction identification instability Ls and assessing the instability on the basis so as to analyze the fluency of man-machine interaction;
The self-adaptive strategy decision module is used for further calculating high interaction friction probability Pf based on the evaluation content of the command recognition instability Ls and analyzing the trend of the command recognition failure or the repeated command of the user;
When the high interaction friction probability Pf exceeds a preset threshold, the system automatically triggers an interaction strategy adjustment early warning and adopts a corresponding adaptive interaction strategy according to the current driving scene and interaction state.
Further, the following is an example of simulation implementation of the present technical solution:
The driver, Xiao Wang, is driving. Initially the environment in the vehicle is quiet and interaction is smooth. Subsequently he opens a window, the wind noise increases, and subsequent voice interactions become difficult. This example demonstrates the complete process by which the system goes from a steady state to detecting instability and finally triggering advanced adaptive strategies;
The preset intention classification codes are as follows:
0 = no instruction; 1 = navigate home; 2 = play music; 3 = increase volume; 4 = call wife; 5 = navigate to company;
The sliding window size is preset to 1.5 seconds with a step of 0.5 seconds, the stable window counter threshold is preset to 3, and the command recognition instability threshold Si is preset to 0.5;
The preset threshold Pth of the high interaction friction probability is set to 0.1, and the monitoring period for calculating the high interaction friction probability Pf is preset to 60 seconds, comprising 120 time windows;
Stage one, smooth interaction and instruction execution;
During 0-10 seconds, Xiao Wang says "navigate home"; the environment in the car is quiet at this time;
Table 1 shows the originally output short-time recognition intention sequence:
At this time, through confidence-weighted voting in the first time window at t = 1.2 s, the system determines that the dominant driving intention is "navigate home", i.e. intention classification code 1;
the system confirms that the instruction is stable and broadcasts "OK, planning the route home for you" through the vehicle-mounted audio, while the central control screen starts navigation;
At this stage the dominant intention sequence is {1, 1, ...}, and the value of the command recognition instability Ls is close to 0, far below the command recognition instability threshold Si;
The result of this stage is that the system accurately and stably executes the user's instruction;
Stage two, environmental change and the occurrence of interaction friction;
During 30-40 seconds, Xiao Wang opens the window, the wind noise increases noticeably, and he tries to issue a new instruction, "call wife";
Table 2 shows the dominant driving intention sequence after the environment deteriorates, together with the monitoring of high interaction friction events;
Due to the wind noise interference, the system's recognition results repeatedly jump between "call wife" and "navigate to company"; as shown in Table 2, the system monitors successive high interaction friction events, and the value of the high interaction friction event counter Qhc increases rapidly;
The system detects that the command recognition instability Ls remains above 0.5, which triggers the preliminary first-level adjustment strategy, and broadcasts "I didn't hear that, please say it again". Xiao Wang does not respond immediately and continues driving;
The result of this stage is that the system can no longer stably recognize the instruction and begins to accumulate evidence of "interaction friction";
When the end of the 60-second monitoring period is reached, the system processing flow is started, which comprises the following steps:
high interaction friction probability calculation:
During the past 60-second monitoring period, the total number of time windows is Gt = 120, and it is assumed that the system has monitored Qhc = 14 high interaction friction events in total;
The high interaction friction probability Pf is calculated as:

Pf = Qhc / Gt = 14 / 120 ≈ 0.117

The system then compares the calculated Pf ≈ 0.117 with the preset threshold Pth = 0.1;
Because 0.117 > 0.1, the system immediately triggers the interaction strategy adjustment early warning mechanism;
After the early warning mechanism is triggered, the system analyzes the event pattern that led to the high interaction friction probability Pf, finds that the intention codes mainly jump between 4 and 5, and accordingly diagnoses the current interaction dilemma as an "intention confusion" state;
At this time, the system selects the strategy for the "intention confusion" state from the adaptive interaction strategy library:
actively initiating a clarification inquiry;
While executing the strategy for the "intention confusion" state, the system broadcasts a high cognitive load state signal through the CAN bus; after receiving the signal, the ADAS decides to delay the non-urgent "service area ahead" prompt tone by 30 seconds;
The vehicle-mounted audio clearly broadcasts the clarification inquiry: "Excuse me, do you want to 'call wife' or 'navigate to company'?";
Xiao Wang hears the clear options and easily answers "call wife";
The system receives the clarified instruction and successfully places the call for Xiao Wang.
In the embodiment, through a set of complete 'monitoring-quantification-diagnosis-decision-linkage' processes, not only is the complex interaction dilemma solved, but also the attention of a driver is ensured through cooperative linkage, and the driving safety is improved. In the subsequent interaction, the high interaction friction probability Pf value falls back, and the system releases the high cognitive load state and returns to normal.
It should be noted that for all calculation formulas in this application, the collected relevant parameters are further analyzed using regression analysis, including but not limited to machine learning algorithms, to identify underlying trends and correlations. Mathematical models matching the data are generated automatically using specialized software, such as Python's Scikit-learn library or the R language. The performance of the models is then evaluated objectively by methods such as cross-validation, combined with continuous feedback and optimization, to ensure that the created formulas truly reflect the internal regularities of the data, thereby guaranteeing their validity and accuracy. Normalization techniques include, but are not limited to, Min-Max normalization and Z-Score standardization;
The essence of the technical solution of the present invention, or the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory (FLASH), a hard disk or an optical disk of a computer, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform the methods of the various embodiments of the present invention.
Logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate or transport the program for use by or in connection with the instruction execution system, apparatus or device.
It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered in the scope of the claims of the present invention.