Detailed Description
The following describes embodiments of the present application in detail with reference to the drawings.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, interfaces, techniques, etc., in order to provide a thorough understanding of the present application.
The term "and/or" is herein merely an association information describing an associated object, meaning that three relationships may exist, e.g., a and/or B, and may mean that a alone exists, while a and B exist, and B alone exists. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship. Further, "a plurality" herein means two or more than two. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, may mean including any one or more elements selected from the group consisting of A, B and C.
The following describes a vehicle-mounted sound effect adaptation method provided by the embodiment of the application.
Referring to fig. 1, a schematic diagram of an implementation environment of an embodiment of the present application is shown. The solution implementation environment may include a vehicle 110 and a server 120, with the vehicle 110 and the server 120 being communicatively coupled to each other.
The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), big data and an artificial intelligence platform.
In one example, the vehicle 110 obtains the audio to be played from the server 120, and calculates an output proportionality coefficient according to the emotion recognition result of the riding object, so as to output the audio to be played to the audio player according to the output proportionality coefficient for playing.
For example, referring to fig. 2, fig. 2 is a schematic diagram of vehicle audio playing according to an exemplary embodiment of the present application. As shown in fig. 2, the vehicle 110 uses a QAM8295 chip as a system-on-a-chip (SOC), runs a QNX system on the SOC as a base to handle the hardware drivers, runs a virtual machine on the QNX system and boots an Android system in the virtual machine, and runs audio playing software in the Android system to obtain the audio to be played for playing. It will be appreciated that other types of system host chips and other ways of running the audio playing software may also be selected, as the application is not limited in this regard.
The audio playing software runs in the Android system. When playing audio, it decodes the audio to be played into a two-channel pulse code modulation (Pulse Code Modulation, PCM) data stream, and the PCM data passes through the Android Framework layer, the hardware abstraction layer (HAL) and the audio driver (Tiny Advanced Linux Sound Architecture, tinyALSA) and is then transmitted across systems to the QNX system. In the QNX system, the PCM data is transmitted to an audio digital signal processor (ADSP) using the Qualcomm framework. The ADSP module is externally connected with an AKM (Asahi Kasei Microsystems) audio digital signal processor (DSP) chip. The vehicle-mounted sound effect adaptation algorithm runs in the AKM chip (such as AKM 7709): audio fragments (such as a direct sound fragment and an ambient sound fragment) are separated in the HIFI core, or the data is transmitted to the AK core for active frequency division to obtain audio fragments (such as a treble sound fragment and a bass sound fragment). The AK core may also perform sound effect preprocessing such as gain, delay and equalization on the audio. After the processing is completed, the AK core distributes the output proportionality coefficients of the audio fragments on the audio channels according to the emotion recognition result of the driving object, the result is then input to the audio amplifier (AMP) of the audio power amplifier module, and the amplified analog signals are input to the audio players, wherein different audio players correspond to different audio channels.
It should be noted that fig. 2 is only an exemplary illustration of an application scenario of the vehicle-mounted sound effect adaptation method, and may be flexibly adjusted according to an actual application scenario, which is not limited by the present application.
In the vehicle-mounted sound effect adaptation method provided by the embodiment of the application, the execution subject of each step may be the vehicle 110 or the server 120, or the vehicle 110 and the server 120 may cooperate interactively to execute the method, that is, one part of the steps of the method is executed by the vehicle 110 and the other part is executed by the server 120.
Referring to fig. 3, fig. 3 is a flowchart illustrating a vehicle-mounted sound effect adaptation method according to an exemplary embodiment of the present application. The vehicle-mounted sound effect adaptation method can be applied to the implementation environment shown in fig. 1 and is specifically executed by a vehicle in the implementation environment. It should be understood that the method may be adapted to other exemplary implementation environments and be specifically executed by devices in other implementation environments, and the implementation environments to which the method is adapted are not limited by the present embodiment.
As shown in fig. 3, the vehicle-mounted sound effect adaptation method at least includes steps S310 to S340, and is described in detail as follows:
step S310, audio separation is carried out on the audio to be played to obtain a plurality of audio fragments, and different audio fragments correspond to different types of audio data.
The audio to be played can be stored in a vehicle local memory, and can be directly obtained from the vehicle local memory, or stored in a server, and a request for obtaining the audio to be played is sent to the server to obtain the audio to be played returned by the server.
The audio to be played consists of different types of audio data, and audio separation is carried out on each type of audio data in the audio to be played to obtain a plurality of audio fragments.
The audio data may be classified according to the type of sound source, i.e., direct sound, ambient sound, etc., so that a direct sound segment, an ambient sound segment, etc. may be separated from the audio to be played; the audio data may also be classified according to frequency band, i.e., treble, midrange, bass, etc., so that a treble segment, a midrange segment, a bass segment, etc. may be separated from the audio to be played.
The dividing mode of the audio data types can be flexibly set according to actual conditions, and the application does not limit the dividing mode of the audio data types.
For example, the audio data types into which the audio to be played needs to be divided are determined according to the source of the audio to be played, so that the accuracy of the audio separation is improved and the subsequent sound effect adaptation is improved. For example, if the audio to be played comes from music playing software, the audio to be played needs to be divided into direct sound, ambient sound, treble, midrange and bass; if the audio to be played comes from crosstalk (comic dialogue) broadcasting software, the audio to be played only needs to be divided into direct sound and ambient sound.
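As an illustration only, the source-dependent division described above could be organized as a simple lookup, as in the following Python sketch; the source labels and the fallback behaviour are assumptions made for illustration rather than details of the embodiment.

```python
# Minimal sketch: choosing which audio-data types to separate according to the
# source of the audio to be played. Source labels and type lists are
# illustrative assumptions.
SEPARATION_PLAN = {
    "music_app": ["direct", "ambient", "treble", "midrange", "bass"],
    "crosstalk_app": ["direct", "ambient"],
}

def plan_separation(audio_source: str) -> list[str]:
    """Return the audio-data types to separate for a given source."""
    # Fall back to the simplest division when the source is unknown.
    return SEPARATION_PLAN.get(audio_source, ["direct", "ambient"])

print(plan_separation("music_app"))      # all five types
print(plan_separation("crosstalk_app"))  # direct and ambient only
```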
And step 320, carrying out emotion recognition on the driving object corresponding to the target vehicle to obtain an emotion recognition result, wherein the target vehicle is provided with an audio player.
The driving object corresponding to the target vehicle is the driver or a passenger of the target vehicle.
The number and types of the audio players can be flexibly set according to actual conditions, and the application is not limited to the number and types of the audio players.
And carrying out emotion recognition on the driving object corresponding to the target vehicle to obtain emotion recognition results, such as recognizing that the emotion of the driving object is excited, pleasant, calm, frustrated or sad.
For example, emotion recognition may be performed on the driving object according to one or more parameters of an image of the driving object, a voice of the driving object, a terminal use condition of the driving object, vehicle driving environment information of the target vehicle, a driving destination of the target vehicle, and the like, to obtain an emotion recognition result.
It should be noted that step S310 and step S320 are not limited to a particular execution order; that is, step S310 may be executed first and then step S320, step S320 may be executed first and then step S310, or step S310 and step S320 may be executed simultaneously.
Step S330, calculating the output proportionality coefficient corresponding to each audio clip based on the emotion recognition result.
And adjusting the output proportionality coefficient of each audio fragment according to the emotion recognition result.
The number of the riding objects may be one or more, and the case where the number of the riding objects is one and the number of the riding objects is a plurality will be described respectively:
For example, if the number of the riding objects is one, the output proportionality coefficient of each audio clip is directly adjusted according to the emotion recognition result of the riding object. For example, if the emotion recognition result of the riding object is excited, the output ratio of the direct sound fragment is adjusted to 1, and if the emotion recognition result of the riding object is calm, the output ratio of the direct sound fragment is adjusted to 0.8.
For example, if the number of riding objects is plural, one of the plurality of riding objects may be selected as a target object (e.g., the riding object with the lowest or highest emotion value, the riding object with the highest riding frequency, or the riding object in the primary or secondary driving position), and the output proportionality coefficient of each audio clip may be adjusted according to the emotion recognition result of the target object. The output proportionality coefficient of each audio segment may also be adjusted by combining the emotion recognition results of all riding objects; for example, if the emotion recognition results of riding object 1 and riding object 2 are both excited, the output proportion of the direct sound segment is adjusted to 1, and if the emotion recognition result of riding object 1 is excited while that of riding object 2 is frustrated, the output proportion of the direct sound segment is adjusted to 0.7.
The number of audio players deployed in the target vehicle may be one or more, and the case where the number of audio players is one and the number of audio players is a plurality will be described respectively:
for example, if the number of audio players is one, the output scaling factors corresponding to different audio segments may be queried directly according to the emotion recognition result, and if the emotion recognition result is excited, the output ratio of the direct sound segment of the audio player is adjusted to be 1 and the output ratio of the ambient sound segment is adjusted to be 0.5.
For example, if the number of audio players is plural, the output scaling factors corresponding to different audio segments may be queried according to the emotion recognition result, and the output scaling factors may be unified as the output scaling factor of each audio player. According to the deployment position and/or the player type of each audio player, the output proportionality coefficients corresponding to different audio segments can be queried for each audio player, that is, the obtained output proportionality coefficients corresponding to each audio player may be different, for example, the queried output proportion of the direct sound segment is 0.6 for the audio player deployed in the front left channel, and the queried output proportion of the direct sound segment is 0.8 for the audio player deployed in the middle channel.
Of course, besides the above embodiment, the spatial distance between each riding object and the audio player may be comprehensively referred to, so as to determine the influence degree of the emotion recognition result of the riding object on the audio player according to the spatial distance between the riding object and the audio player, and improve the accuracy of calculating the output proportionality coefficient.
Step S340, according to the output proportionality coefficient, each audio clip is output to the audio player for playing.
After the output proportionality coefficient corresponding to each audio fragment is obtained, the intensity degree of each audio fragment in the audio player can be determined, so that each audio fragment is output to the audio player for playing according to the output proportionality coefficient.
In some embodiments, the output proportionality coefficient may be calculated and modified when an emotion change of the driving object is detected; or the emotion of the driving object may be identified periodically, so that the output proportionality coefficient is calculated and modified according to the identified emotion; or, after the current output proportionality coefficient is calculated, the time interval between the last modification of the output proportionality coefficient and the current time is checked, and if the time interval is greater than a preset interval threshold, the current output proportionality coefficient is modified to a new output proportionality coefficient. After the new output proportionality coefficient is obtained, each audio clip is output to the audio player for playing according to the new output proportionality coefficient.
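A minimal sketch of this refresh policy is given below, assuming a hypothetical interval threshold and a caller-supplied function that computes the coefficients from an emotion recognition result; none of the names are prescribed by the embodiment.

```python
import time

# Sketch: recompute the output proportionality coefficients when an emotion
# change is detected, or when the previous coefficients are older than a
# preset interval threshold (value assumed here for illustration).
INTERVAL_THRESHOLD_S = 30.0

class CoefficientUpdater:
    def __init__(self):
        self.last_update = 0.0
        self.last_emotion = None
        self.coefficients = None

    def maybe_update(self, emotion_result, compute_fn):
        """Recompute coefficients on emotion change or after the interval elapses."""
        now = time.monotonic()
        emotion_changed = emotion_result != self.last_emotion
        interval_elapsed = (now - self.last_update) > INTERVAL_THRESHOLD_S
        if emotion_changed or interval_elapsed or self.coefficients is None:
            self.coefficients = compute_fn(emotion_result)
            self.last_emotion = emotion_result
            self.last_update = now
        return self.coefficients
```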
The application can automatically adjust the audio playing effect according to the emotion of the driving object so as to improve the audio effect of the current playing audio and the adaptation degree of the driving object and improve the driving or riding feeling of the driving object.
Next, some embodiments of the present application will be described in detail.
In some embodiments, the types of the audio data include direct sound, ambient sound and bass, and performing audio separation on the audio to be played in step S310 to obtain a plurality of audio fragments includes: performing a short-time Fourier transform on the audio to be played; separating the direct sound fragment and the ambient sound fragment from the audio to be played based on the short-time Fourier transform result; and filtering the audio to be played with a preset low-pass filter to obtain the bass fragment.
Taking the audio to be played as a stereo source as an example, the audio to be played is composed of left channel audio data and right channel audio data. The left channel audio data is denoted as XL and the right channel audio data as XR; the direct sound fragment in the left channel audio data XL is denoted as PL, the ambient sound fragment in the left channel audio data XL is denoted as UL, the direct sound fragment in the right channel audio data XR is denoted as PR, and the ambient sound fragment in the right channel audio data XR is denoted as UR, giving the following formula 1 and formula 2:
XL=PL+UL (formula 1)
XR=PR+UR (formula 2)
Performing short-time Fourier transform on left channel audio data and right channel audio data in audio to be played to obtain the following formula 3 and formula 4:
XL(i,k) = PL(i,k) + UL(i,k) (formula 3)
XR(i,k) = B(i,k)·PR(i,k) + UR(i,k) (formula 4)
Wherein i is a time frame index of left channel audio data or right channel audio data, k is a frequency point index of the left channel audio data or the right channel audio data, and B is a correlation coefficient between the left channel audio data and the right channel audio data.
Since XL and XR are known, estimating B, PL, UL, PR and UR amounts to separating the direct sound and the ambient sound of the left channel audio data and the right channel audio data. The estimation may proceed as follows:
The short-time energy estimate of XL is expressed as formula 5:

PXL(i,k) = E{ |XL(i,k)|² } (formula 5)

wherein E{·} denotes the mathematical expectation. Similarly, the short-time energy estimate PXR(i,k) = E{ |XR(i,k)|² } of XR is obtained.

In formula 3 and formula 4, the ambient sounds of the left channel audio data and the right channel audio data are assumed to have the same short-time energy, denoted as PA, and the short-time energy of the coherent sound is denoted as PS. Then, in combination with the correlation hypothesis (the coherent sound is uncorrelated with the ambient sound of each channel, and the two ambient sounds are uncorrelated with each other), formula 6 and formula 7 can be derived:

PXL = PS + PA (formula 6)
PXR = B²·PS + PA (formula 7)

Combining the above formulas, the normalized inter-channel cross-correlation coefficient Φ of the audio to be played is defined and expressed as formula 8:

Φ(i,k) = E{ XL(i,k)·XR*(i,k) } / sqrt( E{|XL(i,k)|²}·E{|XR(i,k)|²} ) (formula 8)

Substituting formula 3 and formula 4 into formula 8, and combining formula 6 and formula 7, formula 9 is obtained:

Φ = B·PS / sqrt( (PS + PA)·(B²·PS + PA) ) (formula 9)

B, PS and PA are therefore functions of the known quantities PXL, PXR and Φ, and can be solved for as:

B = T/(2C) + sqrt( T²/(4C²) + 1 ) (formula 10)
PS = C/B (formula 11)
PA = PXL − C/B (formula 12)

wherein:

T = PXR − PXL (formula 13)
C = Φ·sqrt( PXL·PXR ) (formula 14)
After B, PS and PA are calculated according to formulas 10 to 14, the coherent sound S and the ambient sounds UL and UR of the left channel audio data and the right channel audio data are estimated in a least-squares manner. The coherent sound is estimated as a weighted sum of the two channel signals:

Ŝ = WL·XL + WR·XR = WL·(S + UL) + WR·(B·S + UR) (formula 15)

wherein WL and WR are estimation weights, and the estimation error of S is expressed as formula 16:

σS = (1 − WL − WR·B)·S − WL·UL − WR·UR (formula 16)

In the least-squares algorithm, when the estimation error is completely uncorrelated with the audio to be played, the obtained weights are the optimal estimate:

E{ σS·XL } = 0 (formula 17)
E{ σS·XR } = 0 (formula 18)

At this time, the estimation weights of the optimal estimate are:

WL = PS / ( (1 + B²)·PS + PA ) (formula 19)
WR = B·PS / ( (1 + B²)·PS + PA ) (formula 20)

Combining the above formulas, the values of PL, UL, PR and UR are obtained as PL = Ŝ, UL = XL − Ŝ, PR = B·Ŝ and UR = XR − B·Ŝ.
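For illustration, the following Python sketch follows the derivation above (formulas 3 to 20), approximating the short-time expectations E{·} by recursive averaging over time frames; the STFT window length, the smoothing factor and the small constant guarding against division by zero are assumptions of the sketch rather than parameters fixed by the embodiment.

```python
import numpy as np
from scipy.signal import stft, istft

def separate_direct_ambient(x_l, x_r, fs, alpha=0.9, eps=1e-12):
    """Sketch of the direct/ambient separation following formulas 3-20.

    E{.} is approximated by recursive averaging over time frames; returns
    time-domain estimates (PL, UL, PR, UR).
    """
    f, t, XL = stft(x_l, fs=fs, nperseg=1024)
    _, _, XR = stft(x_r, fs=fs, nperseg=1024)

    # Formula 5 and its right-channel counterpart: short-time energy estimates,
    # plus the cross term E{XL * conj(XR)} used by formula 8.
    PXL = np.zeros(XL.shape)
    PXR = np.zeros(XR.shape)
    CLR = np.zeros(XL.shape, dtype=complex)
    for i in range(XL.shape[1]):
        prev = max(i - 1, 0)
        PXL[:, i] = alpha * PXL[:, prev] + (1 - alpha) * np.abs(XL[:, i]) ** 2
        PXR[:, i] = alpha * PXR[:, prev] + (1 - alpha) * np.abs(XR[:, i]) ** 2
        CLR[:, i] = alpha * CLR[:, prev] + (1 - alpha) * XL[:, i] * np.conj(XR[:, i])

    # Formulas 13-14, then formulas 10-12: solve for B, PS, PA.
    C = np.abs(CLR)          # equals B * PS under the correlation hypothesis
    T = PXR - PXL
    B = T / (2 * C + eps) + np.sqrt(T ** 2 / (4 * C ** 2 + eps) + 1.0)
    PS = C / (B + eps)
    PA = np.maximum(PXL - PS, 0.0)

    # Formulas 19-20: least-squares weights; formula 15: coherent-sound estimate.
    denom = (1 + B ** 2) * PS + PA + eps
    WL, WR = PS / denom, B * PS / denom
    S = WL * XL + WR * XR

    # Direct and ambient components of each channel.
    PL_spec, PR_spec = S, B * S
    UL_spec, UR_spec = XL - PL_spec, XR - PR_spec

    back = lambda spec: istft(spec, fs=fs, nperseg=1024)[1]
    return back(PL_spec), back(UL_spec), back(PR_spec), back(UR_spec)
```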
In addition, the audio to be played is filtered with a preset low-pass filter to obtain the bass segment. For example, a preset low-pass filter retains the audio data below 120 Hz (hertz) to obtain the bass segment.
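A minimal sketch of the bass extraction, assuming a Butterworth low-pass design with the 120 Hz cutoff from the example above (the filter order is also an assumption):

```python
from scipy.signal import butter, sosfilt

def extract_bass(audio, fs, cutoff_hz=120.0, order=4):
    """Keep the content below cutoff_hz to form the bass segment."""
    sos = butter(order, cutoff_hz, btype="lowpass", fs=fs, output="sos")
    return sosfilt(sos, audio)

# Usage (mono_mix being, e.g., the average of the left and right channels):
# bass = extract_bass(mono_mix, fs=48000)
```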
In this way, the separated direct sound segment, ambient sound segment and bass segment are obtained.
In some embodiments, performing emotion recognition on the driving object corresponding to the target vehicle in step S320 to obtain an emotion recognition result includes: obtaining an object image acquired by capturing an image of the driving object, object voice acquired by capturing the voice of the driving object, vehicle running environment information of the target vehicle, and the running destination of the target vehicle; and performing emotion recognition on the driving object based on one or more of the object image, the object voice, the vehicle running environment information and the running destination to obtain the emotion recognition result.
An image acquisition device is disposed in the target vehicle, and the image acquisition device is used for acquiring an image of a driving object to obtain an object image, and carrying out emotion recognition on the driving object according to the object image to obtain an emotion recognition result.
For example, the emotion recognition result is obtained by analyzing the expression, limb movement, age, sex, and the like of the riding object in the object image.
Illustratively, an emotion-related image area is extracted from the object image, and emotion recognition is performed on the emotion-related image area by using a preset emotion recognition algorithm to obtain the emotion recognition result.
The emotion-related image area can be a facial image area of the driving object, and the emotion-related image area can also be a facial image area and a body image area of the driving object.
The preset emotion recognition algorithm may be a pre-trained emotion recognition neural network model; the emotion-related image area is input into the emotion recognition neural network model for processing to obtain the emotion recognition result output by the emotion recognition neural network model.
The target vehicle is provided with a voice acquisition device, voice of the driving object is acquired through the voice acquisition device, the object voice is obtained, emotion recognition is carried out on the driving object according to the object voice, and an emotion recognition result is obtained.
For example, according to the voice text content, intonation, volume and the like of the riding object in the object voice, the emotion recognition result is obtained.
Illustratively, an environment sensing device, such as a laser radar, an image acquisition device, an audio acquisition device, a speed sensor and the like, is deployed in the target vehicle, vehicle driving environment information is acquired through the environment sensing device, the vehicle driving environment information includes, but is not limited to, vehicle driving speed, vehicle driving scenes (such as cities, suburbs, deserts and the like), road congestion degree, road flatness degree and the like, and emotion recognition is performed on a driving object according to the vehicle driving environment information, so that emotion recognition results are obtained.
For example, the mood of the riding object may be more pleasant if the vehicle running environment information indicates a lower level of road congestion, and the mood of the riding object may be more frustrated if the vehicle running environment information indicates a higher level of road congestion.
Illustratively, the driving path of the target vehicle is obtained to determine the driving destination of the target vehicle, and emotion recognition is performed on the driving object according to the driving destination to obtain the emotion recognition result.
For example, if the traveling destination is an amusement park, or the like, the mood of the riding object may be more pleasant.
For example, the driving object may be subjected to emotion recognition in combination with one or more of object image, object voice, vehicle driving environment information and driving destination, so as to obtain an emotion recognition result, so as to improve accuracy of the emotion recognition result.
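Purely as an illustration of combining several cues into one emotion value on a 0-1 scale (0 being sad, 1 being excited), the following sketch averages per-modality scores with hypothetical weights; the weights and the averaging scheme are assumptions, since the embodiment does not prescribe how the cues are fused.

```python
# Sketch: weighted average over whichever emotion cues are available.
# The modality weights are illustrative assumptions.
def fuse_emotion_cues(image_score=None, voice_score=None,
                      environment_score=None, destination_score=None,
                      weights=(0.4, 0.3, 0.2, 0.1)):
    cues = [image_score, voice_score, environment_score, destination_score]
    available = [(s, w) for s, w in zip(cues, weights) if s is not None]
    if not available:
        return 0.5  # neutral default when no cue is available
    total_w = sum(w for _, w in available)
    return sum(s * w for s, w in available) / total_w

# e.g. pleasant face, calm voice, light traffic, driving to an amusement park
print(fuse_emotion_cues(0.8, 0.6, 0.7, 0.9))
```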
In some embodiments, the number of the driving objects is plural, and performing emotion recognition on the driving objects corresponding to the target vehicle in step S320 to obtain an emotion recognition result includes: performing emotion recognition on each driving object to obtain an initial recognition result corresponding to each driving object, and combining the initial recognition results corresponding to the driving objects to obtain the emotion recognition result.
When a plurality of riding objects exist in the target vehicle, determining the emotion corresponding to each riding object according to the object image and/or the object voice of each riding object or the combined vehicle driving environment information, the driving destination and the like, and obtaining an initial recognition result.
And obtaining a final emotion recognition result according to the initial recognition result corresponding to each driving object.
Illustratively, combining the initial recognition results corresponding to the driving objects to obtain the emotion recognition result includes: determining the role, riding position and riding frequency of each driving object; calculating a weight parameter of each driving object based on one or more of the role, the riding position and the riding frequency of that driving object; and performing weighted fusion on the initial recognition results corresponding to the driving objects according to the weight parameters of the driving objects to obtain the emotion recognition result.
For example, according to family relationship, the roles of the driving objects can be divided into father, mother, child, etc.; the riding positions can be divided into the primary driving position, the co-driving position, the rear riding positions, etc.; and the riding frequency refers to the number of times the driving object rides in or drives the target vehicle within a preset time period.
And respectively calculating the weight parameter of each riding object according to one or more of the role, riding position and riding frequency of each riding object.
For example, the roles of different riders have different priorities, the higher the priority of the roles, the higher the weight of the corresponding riders, the different riding positions have different priorities, the higher the priority of the positions, the higher the weight of the corresponding riders, and the higher the riding frequency, the higher the weight of the corresponding riders.
And weighting and fusing initial recognition results corresponding to each driving object respectively according to the weight parameters of each driving object to obtain emotion recognition results.
For example, the emotion recognition result is represented by an emotion value between 0 and 1; the closer the emotion value is to 1, the more excited the driving object, and conversely, the closer the emotion value is to 0, the more sad the driving object. If the driving objects of the target vehicle include driving object 1 and driving object 2, the emotion value of driving object 1 is 0.5 with a weight parameter of 0.7, and the emotion value of driving object 2 is 0.8 with a weight parameter of 0.3, then the final emotion value is 0.5×0.7+0.8×0.3=0.59.
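The weighted fusion in the example above can be sketched as follows; normalizing by the sum of the weight parameters is an assumption that matches the example (weights 0.7 and 0.3 summing to 1).

```python
# Sketch: each driving object contributes its emotion value multiplied by its
# weight parameter; the result is normalized by the sum of the weights.
def fuse_emotion_values(emotion_values, weight_params):
    total = sum(weight_params)
    return sum(e * w for e, w in zip(emotion_values, weight_params)) / total

# The example above: emotion values 0.5 and 0.8 with weights 0.7 and 0.3.
print(fuse_emotion_values([0.5, 0.8], [0.7, 0.3]))  # approximately 0.59
```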
In addition to the above-mentioned method of obtaining the emotion recognition result by fusing the initial recognition results corresponding to each riding object, the final emotion recognition result may be determined by other methods.
For example, if the emotion recognition result is represented by an emotion value between 0 and 1, the emotion value having the largest value may be selected from the plurality of initial recognition results as the final emotion recognition result.
For another example, the weight parameter of each riding object is determined, and the initial recognition result corresponding to the riding object with the largest weight parameter is used as the final emotion recognition result.
For example, the emotion recognition results are emotion classification (such as excitement, calm, depression, etc.), the initial recognition results of each riding object are counted to obtain the number of riding objects corresponding to each emotion type, and the emotion type with the largest number is selected as the final emotion recognition result.
And calculating the output proportionality coefficient corresponding to each audio fragment by utilizing the final emotion recognition result determined by the embodiment.
Of course, in another embodiment, after obtaining the initial recognition results corresponding to each driving object respectively, the output scaling factor corresponding to each audio segment is directly determined according to the multiple initial recognition results.
For example, for the audio player 1, the influence of the emotion of each riding object on the audio player 1 is obtained according to the spatial distance between each riding object and the audio player 1, wherein the larger the spatial distance is, the smaller the influence is. Then, the output proportionality coefficient of each audio clip relative to the audio player 1 is calculated in combination with the initial recognition result and the influence degree corresponding to each riding object.
For another example, the output proportionality coefficient of each audio clip relative to the audio player 1 may also be calculated by combining the initial recognition result corresponding to each riding object, the influence degree of the emotion of each riding object on the audio player, and the weight parameter of each riding object.
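A possible sketch of this distance-dependent influence is shown below; the inverse-distance form 1/(1+d) is an assumption chosen only so that the influence decreases as the spatial distance grows, as described above.

```python
# Sketch: per-player emotion value obtained by weighting each riding object's
# initial recognition result by its distance to the player (and optionally by
# its weight parameter). The 1/(1+d) influence form is an assumption.
def per_player_emotion(emotion_values, distances_m, weight_params=None):
    if weight_params is None:
        weight_params = [1.0] * len(emotion_values)
    influences = [w / (1.0 + d) for w, d in zip(weight_params, distances_m)]
    total = sum(influences)
    return sum(e * f for e, f in zip(emotion_values, influences)) / total

# Two occupants: the nearer one dominates the emotion seen by this player.
print(per_player_emotion([0.9, 0.2], distances_m=[0.5, 2.0]))
```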
In some embodiments, the number of the audio players is a plurality, different audio players are deployed at different positions of the target vehicle, and the calculating of the output proportionality coefficient corresponding to each audio clip in step S330 based on the emotion recognition result includes obtaining the deployment position of each audio player in the target vehicle, and calculating the output proportionality coefficient of each audio clip relative to each audio player based on the deployment position of each audio player and the emotion recognition result.
And respectively calculating the output proportionality coefficient of each audio clip relative to each audio player according to the deployment position of each audio player and the emotion recognition result.
For example, if the emotion recognition result is calm, the output proportionality coefficient of the direct sound segment of the audio player deployed in the front left door is calculated to be 0.6, and the output proportionality coefficient of the direct sound segment of the audio player deployed in the middle of the vehicle is calculated to be 0.8.
In some embodiments, a target vehicle is deployed with a plurality of types of audio players, and in step S330, output proportionality coefficients of each audio clip relative to each audio player are calculated based on the deployment position and emotion recognition result of each audio player, respectively, including obtaining a player type to which each audio player belongs, and calculating output proportionality coefficients of each audio clip relative to each audio player based on the deployment position, player type to which each audio player belongs, and emotion recognition result of each audio player, respectively.
Referring to fig. 4, fig. 4 is a schematic deployment diagram of an audio player according to an exemplary embodiment of the present application, and as shown in fig. 4, a target vehicle is deployed with tweeters, midrange speakers, woofers, full-range speakers, D-pillar surround speakers, etc. at different locations. It should be noted that fig. 4 is only one deployment example of a plurality of audio players of various types, and in an actual application scenario, the types and deployment positions of the audio players can be flexibly selected according to the actual application situation, which is not limited by the present application.
And respectively calculating the output proportionality coefficient of each audio clip relative to each audio player by combining the deployment position of each audio player, the player type to which each audio player belongs and the emotion recognition result.
For example, if the emotion recognition result is calm, the output proportionality coefficient of the direct sound segment of the tweeter disposed at the left front door is calculated to be 0.6, and the output proportionality coefficient of the direct sound segment of the woofer disposed at the left front door is calculated to be 0.5.
In some embodiments, calculating the output proportionality coefficient of each audio clip relative to each audio player based on the deployment position of each audio player, the player type to which each audio player belongs and the emotion recognition result includes: querying a matched preset proportionality coefficient calculation formula based on the deployment position of the current audio player, the player type to which the current audio player belongs and the audio type to which the current audio clip belongs; and substituting the emotion recognition result into the preset proportionality coefficient calculation formula to calculate the output proportionality coefficient of the current audio clip relative to the current audio player.
That is, preset proportionality coefficient calculation formulas are configured in advance for audio players at different deployment positions, audio players of different types and audio fragments of different types. The preset proportionality coefficient calculation formula matching the deployment position of the current audio player, the player type to which the current audio player belongs and the audio type to which the current audio fragment belongs is queried, and the emotion recognition result is substituted into the queried formula to calculate the output proportionality coefficient of the current audio fragment relative to the current audio player.
By the method, the output proportionality coefficient of each audio clip relative to each audio player is calculated.
For example, assuming that the emotion recognition result is Y, the left channel direct sound segment is denoted as PL, the left channel ambient sound segment is denoted as UL, the right channel direct sound segment is denoted as PR, the right channel ambient sound segment is denoted as UR, the bass sound segment is D, and the audio player includes a center full-frequency speaker, a left door full-frequency speaker, a right door full-frequency speaker, a left side D-pillar surround speaker, a right side D-pillar surround speaker, and a bass speaker, the preset scaling factor calculation formula of each audio segment relative to each audio player is determined as follows:
1. for a left channel direct sound fragment and a right channel direct sound fragment:
Center full-frequency speaker:
(PL + PR)·(0.8 + (Y − 0.5)×2×0.2) (formula 21)
Left door full-frequency speaker:
Right door full-frequency speaker:
2. For a left channel ambient sound segment and a right channel ambient sound segment:
Center full-frequency speaker:
left door full-frequency speaker:
Right door full-frequency speaker:
The left D-pillar surrounds the speaker:
the right D-pillar surrounds the speaker:
3. for bass segments:
A woofer:
D·(0.6 + (Y − 0.5)×2×0.4) (formula 29)
And calculating the output proportionality coefficient of each audio fragment relative to each audio player through the queried preset proportionality coefficient calculation formula.
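The querying and calculation described above can be sketched as a keyed lookup of per-(deployment position, player type, segment type) formulas; only the center full-frequency entry (formula 21) and the woofer entry (formula 29) are taken from the example, while the key layout, the remaining entries and the default value are assumptions of the sketch.

```python
# Sketch: preset proportionality coefficient formulas keyed by deployment
# position, player type and audio segment type; Y is the emotion value.
SCALE_FORMULAS = {
    ("center", "full_range", "direct"): lambda y: 0.8 + (y - 0.5) * 2 * 0.2,  # formula 21
    ("subwoofer", "woofer", "bass"):    lambda y: 0.6 + (y - 0.5) * 2 * 0.4,  # formula 29
}

def output_scale(position, player_type, segment_type, emotion_value):
    formula = SCALE_FORMULAS.get((position, player_type, segment_type))
    if formula is None:
        return 1.0  # hypothetical default when no preset formula matches
    return formula(emotion_value)

# A calm emotion (Y = 0.5) leaves the base coefficients unchanged.
print(output_scale("center", "full_range", "direct", 0.5))  # 0.8
print(output_scale("subwoofer", "woofer", "bass", 0.9))     # about 0.92
```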
In another embodiment, the output scaling factor of each audio segment relative to each audio player may be determined by a table lookup method, for example, a scaling factor table is pre-stored, where the scaling factor table is used to store output scaling factors corresponding to different audio segments, different audio players, and different emotion recognition results.
For example, referring to table 1 below, table 1 is a table of scale factors shown in an exemplary embodiment:
Table 1
The output scaling factor of each audio clip relative to each audio player can be obtained by looking up table 1.
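Since Table 1 itself is not reproduced in the text, the following sketch only illustrates the table-lookup alternative with placeholder entries; the keys and coefficient values are hypothetical.

```python
# Sketch: output proportionality coefficients stored per (emotion category,
# audio player, segment type). All entries below are illustrative placeholders.
SCALE_TABLE = {
    ("excited", "left_front_door", "direct"):  1.0,
    ("excited", "left_front_door", "ambient"): 0.5,
    ("calm",    "left_front_door", "direct"):  0.8,
}

def lookup_scale(emotion, player_id, segment_type, default=1.0):
    return SCALE_TABLE.get((emotion, player_id, segment_type), default)

print(lookup_scale("calm", "left_front_door", "direct"))  # 0.8
```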
In the vehicle-mounted sound effect adaptation method provided by the application, a plurality of audio fragments are obtained by performing audio separation on the audio to be played; emotion recognition is performed on the driving object corresponding to the target vehicle to obtain an emotion recognition result, wherein the target vehicle is provided with an audio player; the output proportionality coefficient corresponding to each audio fragment is calculated based on the emotion recognition result; and each audio fragment is output to the audio player for playing according to the output proportionality coefficient. The audio playing effect can thus be automatically adjusted according to the emotion of the driving object, so as to improve the degree of adaptation between the sound effect of the currently playing audio and the driving object and improve the driving or riding experience of the driving object.
The application also provides an electronic device, referring to fig. 5, fig. 5 is a schematic structural diagram of an embodiment of the electronic device of the application. The electronic device 500 comprises a memory 501 and a processor 502, the processor 502 being arranged to execute program instructions stored in the memory 501 for implementing the steps of any of the above-described embodiments of the vehicle audio adaptation method. In a specific implementation scenario, electronic device 500 may include, but is not limited to, a microcomputer, a server, and further, electronic device 500 may include a mobile device such as a notebook computer, a tablet computer, etc., without limitation.
In particular, the processor 502 is configured to control itself and the memory 501 to implement the steps of any of the above-described embodiments of the vehicle-mounted sound effect adaptation method. The processor 502 may also be referred to as a central processing unit (Central Processing Unit, CPU). The processor 502 may be an integrated circuit chip with signal processing capabilities. The processor 502 may also be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 502 may be commonly implemented by an integrated circuit chip.
The present application also provides a computer readable storage medium, please refer to fig. 6, fig. 6 is a schematic structural diagram of an embodiment of the computer readable storage medium of the present application. The computer readable storage medium 600 stores program instructions 610 that can be executed by a processor, where the program instructions 610 are configured to implement the steps in any of the above-described embodiments of the vehicle audio adaptation method.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The foregoing description of various embodiments is intended to highlight differences between the various embodiments, which may be the same or similar to each other by reference, and is not repeated herein for the sake of brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical, or other forms.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units. The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to execute all or part of the steps of the methods of the embodiments of the present application. The storage medium includes a U disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.