Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
1. Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of sensing, reasoning and decision-making.
The artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing and machine learning/deep learning.
2. Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include technologies such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration. In the present application, a trained machine learning model is used to process the audio to be matched to obtain feature information and tag information of the audio to be matched, and the target candidate audio of the audio to be matched is determined through the feature information and the tag information.
With the research and progress of artificial intelligence technology, it has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, autonomous driving, smart healthcare and smart customer service. It is believed that, with the development of technology, artificial intelligence technology will be applied in more audio processing fields and play an increasingly important role.
3. Mel (Mel) frequency refers to a non-linear frequency scale determined based on the human ear's perception of equidistant pitch (pitch) changes. When processing a signal, the Mel frequency scale is set artificially so as to fit the change of the auditory perception threshold of the human ear. In the field of audio processing, the frequency features of audio can be calculated on the Mel scale.
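As an illustration of the Mel scale described above, the following sketch (not part of the claimed method) shows the commonly used Hz-to-Mel conversion formula and the computation of log-Mel features; the use of the librosa library, the synthetic tone and the parameter values are assumptions made only for illustration.

```python
import numpy as np
import librosa

def hz_to_mel(f_hz):
    # Common HTK-style Hz-to-Mel conversion formula.
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

sr = 32000
t = np.arange(sr) / sr
waveform = np.sin(2 * np.pi * 440.0 * t)            # 1 second of a 440 Hz tone (illustrative)
mel_spec = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel_spec)              # (n_mels, n_frames) log-Mel features
print(hz_to_mel(440.0), log_mel.shape)
```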
4. The convolutional neural network (Convolutional Neural Network, CNN) is a feed-forward neural network. A convolutional neural network contains artificial neurons that respond to surrounding units within a local receptive field, which improves the correlation between information at different positions. A convolutional neural network consists of at least one convolutional layer and a fully connected layer (corresponding to a classical neural network), and also includes associated weights and pooling layers (pooling layer).
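The following is a minimal, illustrative PyTorch sketch of the structure just described (a convolutional layer, a pooling layer and a fully connected layer); it is not the network used in the embodiments, and all layer sizes and the number of classes are arbitrary example values.

```python
import torch
import torch.nn as nn

class SimpleAudioCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional layer (local receptive field)
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling layer
        )
        self.fc = nn.Linear(16, num_classes)             # fully connected layer

    def forward(self, x):                                # x: (batch, 1, mels, frames)
        h = self.conv(x)
        h = h.mean(dim=(2, 3))                           # global average pooling -> (batch, 16)
        return self.fc(h)

x = torch.randn(2, 1, 64, 100)    # batch of 2 log-Mel spectrograms (illustrative)
print(SimpleAudioCNN()(x).shape)  # torch.Size([2, 10])
```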
5. The pre-trained audio neural network (Pretrained Audio Neural Networks, PANN) is an audio neural network trained on a large-scale audio dataset. The pre-trained audio neural network may be used for audio pattern recognition or for quantizing frame-level audio embeddings (embedding).
The pre-trained audio neural network can be used as a front-end encoding network of a model, extracting waveform pattern features learned from the waveform, for example in an audio feature extraction scheme combining waveform patterns with a log-Mel CNN. The pre-trained audio neural network achieves state-of-the-art performance in audio set tagging. It can be used to handle a wide range of audio pattern recognition tasks with good results, and fine-tuning the pre-trained audio neural network with a small amount of data corresponding to a new task can also achieve a good processing effect.
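The fine-tuning idea can be sketched as follows. The encoder below is a small stand-in defined inline rather than an actual PANN checkpoint, and the data, dimensions, checkpoint path and optimizer settings are illustrative assumptions, not details from this application.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained audio encoder; a real PANN checkpoint would be loaded instead.
class AudioEncoder(nn.Module):
    def __init__(self, embedding_dim=128):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.net = nn.Sequential(nn.Conv1d(1, 16, 9, stride=4), nn.ReLU(),
                                 nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                 nn.Linear(16, embedding_dim))
    def forward(self, x):                    # x: (batch, 1, samples)
        return self.net(x)

encoder = AudioEncoder()
# encoder.load_state_dict(torch.load("pann_pretrained.pt"))  # illustrative checkpoint path

num_new_classes = 12
head = nn.Linear(encoder.embedding_dim, num_new_classes)     # new task-specific head
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()           # multi-label confidences via sigmoid

# One illustrative fine-tuning step on a small synthetic batch of new-task data.
waveforms = torch.randn(4, 1, 32000)
labels = torch.randint(0, 2, (4, num_new_classes)).float()
optimizer.zero_grad()
loss = criterion(head(encoder(waveforms)), labels)
loss.backward()
optimizer.step()
```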
6. Spatial similarity measures the similarity between two vectors by the cosine of the angle between them, and is also known as spatial distance. The cosine of a 0-degree angle is 1, the cosine of any other angle is less than 1, and the minimum cosine value is -1; therefore the spatial similarity between two vectors lies in the range [-1, 1], and in the positive space it lies in [0, 1].
Determining the similarity of two vectors in space according to the cosine of the angle between them is equivalent to determining the spatial included angle between the two vectors, that is, how well their directions coincide. When the two vectors point in the same direction in space, the cosine similarity value is 1; when the spatial included angle between the two vectors is 90 degrees, the cosine similarity value is 0, indicating lower similarity; when the two vectors point in exactly opposite directions in space, the two vectors are not similar at all, and the cosine similarity value is -1. The cosine value between two vectors is related only to the directions of the vectors and is independent of their lengths. Cosine similarity is usually used in the positive space, where the cosine value lies in the range [0, 1].
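A minimal sketch of the cosine (spatial) similarity described above, using NumPy; the vectors are illustrative and the three calls reproduce the same-direction, 90-degree and opposite-direction cases.

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Value depends only on the angle between the vectors, not on their lengths.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 0], [2, 0]))    # same direction -> 1.0
print(cosine_similarity([1, 0], [0, 3]))    # 90 degrees     -> 0.0
print(cosine_similarity([1, 0], [-1, 0]))   # opposite       -> -1.0
```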
7. Musical emotion is defined by a person's subjective perception. After the human ear perceives the audio signal of a piece of music, the audio signal can resonate with the listener's subjective emotions. Music itself does not carry any emotion; emotion is a person's subjective perception of the music. Because of differences between individuals, the same piece of music may evoke different musical emotions in different listeners. Optionally, musical emotion is also referred to as audio emotion.
In the related art, there are the following audio matching methods:
1. In the process of audio matching, a person responsible for product operation manually searches and matches the entire copyrighted music library (corresponding to the audio library in claim 1), evaluates the music in the library through subjective listening, finds the target candidate audio corresponding to the audio to be matched, and replaces the audio to be matched with the target candidate audio.
However, manual audio matching has a number of drawbacks. On the one hand, manual audio matching is inefficient, and when the amount of audio stored in the copyrighted music library is large, a great deal of time is consumed.
On the other hand, a person's hearing is subjective, and the musical emotion that the same song brings to different staff members is not identical. Relying on manual work alone for music matching easily leads to individual differences. That is, when different staff members perform audio matching on the same audio to be matched, affected by the differences in hearing among them, the target candidate audio they respectively determine may differ greatly, and a standardized audio matching process cannot be achieved.
2. Audio matching is performed through a music classification network. The tag corresponding to each audio in the copyrighted music library is determined through the music classification network, the tag corresponding to the audio to be matched is determined through the music classification network, a candidate audio set is screened from the copyrighted music library according to the tag corresponding to the audio to be matched, and then fine matching is performed manually.
However, the tags in the related art generally classify an audio into only a single tag, whereas in practice one audio often corresponds to a plurality of tags. Categorizing the audio into a single tag in the above approach limits the description of the audio.
Meanwhile, in this method, although the audio in the copyrighted music library is initially screened using the audio tag and the range in which the target candidate audio is determined is narrowed, the fine matching process still needs to be performed manually, and problems such as low audio matching efficiency and low audio matching accuracy remain unsolved.
3. Extracting the feature vector of the audio to be matched by using a traditional network model, and then carrying out similarity matching with the feature vector of the audio in the copyright music library to determine target candidate audio.
However, this method consumes a large amount of computing resources in the audio matching process, and because no other information assists the audio matching process, the determined target candidate audio easily has low similarity with the audio to be matched in at least one classification mode.
In view of the problems of low accuracy, low efficiency and the like in the audio matching methods of the related art, the present application provides a new audio matching method that can improve the accuracy of audio matching without consuming a large amount of computing resources.
FIG. 1 is a schematic diagram of an implementation environment for an approach provided by an exemplary embodiment of the present application. The implementation environment of the scheme can comprise: a computer device 10, a terminal device 20, and a server 30.
The computer device 10 includes, but is not limited to, a personal computer (Personal Computer, PC), a cell phone, a tablet computer, and the like. The computer device 10 is capable of providing audio matching services. In some embodiments, the audio matching is accomplished by a machine learning model, which is run on the computer device 10 to enable the audio matching.
The terminal device 20 may be an electronic device such as a personal computer, a tablet computer, a cell phone, a wearable device, a smart home appliance, a vehicle-mounted terminal, etc. A client of a target application runs on the terminal device 20. The target application is capable of playing multimedia files (e.g., audio, video). For example, the target application is an application for processing audio services, where the audio services include, but are not limited to, at least one of: audio playback, audio recognition, and the like. As another example, the target application is an application with a video playback function.
In addition, the target application may be a news application, a shopping application, a social application, an interactive entertainment application, a browser application, a content sharing application, a virtual reality application, an augmented reality application, and the like, which is not limited in the embodiments of the present application. In addition, the types of audio processed by different applications and the corresponding functions may differ, and may be configured in advance according to actual requirements, which is also not limited in the embodiments of the present application.
The terminal device 20 has at least a data receiving function and a storage function. Optionally, the terminal device 20 obtains, through its data receiving function, the audio to be matched input by the user, and provides the audio to be matched to the computer device 10 for audio matching.
The server 30 is used to provide background services for the client of the target application in the terminal device 20. For example, the server 30 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data and artificial intelligence platforms, but is not limited thereto.
The server 30 has at least a data receiving function, a storage function and a computing function. The server 30 corresponds to a video database and an audio database. In the event that audio to be matched in a certain video needs to be replaced, the server 30 requests the computer device 10 to perform audio matching for the audio to be matched.
In some embodiments, a machine learning model for audio matching may be run on the terminal device 20 or the server 30. That is, the computer device 10 may be implemented as the terminal device 20 or the server 30. Of course, the computer device 10 may also be a device other than the terminal device 20 or the server 30.
In one example, the audio matching method provided herein may be used for audio replacement. For example, the audio matching method performs audio replacement for background music used in a movie or television drama whose copyright is about to expire. The copyrights of the drama and of the background music it uses may be independent of each other; to avoid continuing to use the background music in the drama beyond the term of its copyright, audio matching is required when the copyright of the background music is about to expire, and replacement music is selected to be used in the drama in place of the background music.
The background music mentioned in this example corresponds to the audio to be matched in the claims; the mentioned replacement music corresponds to the target candidate audio in the claims.
In another example, the audio matching method provided herein may be used for audio recommendation. For example, the audio matching method provided by the present application is used for creation assistance on a short-video secondary-creation platform. In the process of creating a short video clip, a creator may refer to the works of other creators on the short-video platform and select background music from those works. The creator indicates the selected background music to the terminal device 10, expecting to obtain a set of background music candidates similar to the selected background music.
Through the audio matching method, the creator can be helped to find a background music candidate set similar to the selected background music, which facilitates rapid creation by the creator.
The selected background music mentioned in this example corresponds to the audio to be matched in the claims; the mentioned background music candidate set includes the target candidate audio in the claims.
In another example, the audio matching method provided by the present application may be used for song recognition. For example, the audio matching method provided by the application is used for audio recognition: in the case that a user desires to obtain information about a certain song, the user uses the terminal device 20 to input a piece of audio to be recognized. The server 30 receives the audio to be recognized and requests the computer device 10 to perform audio matching to obtain the target candidate audio corresponding to the audio to be recognized. Optionally, the server 30 obtains the target candidate audio and transmits the target candidate audio to the terminal device 20. The audio to be recognized mentioned in this example corresponds to the audio to be matched in the claims.
Of course, the above several application scenarios are merely exemplary, and the audio matching method provided in the present application may also be applicable to other scenarios, which is not limited in this application.
Fig. 2 is a flowchart of an audio matching method provided in an exemplary embodiment of the present application. By way of example, the subject of execution of the method may be the computer device 10 of Fig. 1. The method may include the following steps (210-240):
step 210, obtaining feature information of the audio to be matched, wherein the feature information of the audio to be matched is used for representing semantic features of the audio to be matched.
The audio to be matched refers to any piece of audio. In some embodiments, the audio to be matched may be from background music, soundtrack in a piece of video. For example, the audio to be matched may be a title, a tail, background music, etc. in the video. In other embodiments, the audio to be matched may be a musical composition, song, or the like that appears independently.
Because the tempo and rhythm of different audio to be matched are not identical, and the instrumentation of the audio to be matched is not identical, different audio to be matched differ in aspects such as audio style and audio emotion. The audio style is used to characterize the kind of music to which the audio belongs, and the audio emotion refers to the subjective emotion generated by the listener while listening to the audio, as mentioned above. In some embodiments, the audio to be matched may also be referred to as the audio to be recognized.
In some embodiments, the audio to be matched belongs to at least one audio style. For example, a certain audio a to be matched belongs to the rock and roll style. For another example, a certain audio B to be matched belongs to both dance music style and classical style. It should be noted that, the audio styles of different audio to be matched are not identical, and the audio styles of the audio to be matched are determined according to the actual situation of the audio to be matched.
In some embodiments, the audio to be matched can cause at least one audio emotion. For example, a certain audio C to be matched can make people feel happy and healed subjectively.
The feature information of the audio to be matched is used to represent semantic features of the audio to be matched. In some embodiments, the semantic features include feature information in the audio to be matched. Optionally, the semantic features relate to characteristics of the audio to be matched in the time domain, the frequency domain and the like, as well as of audio segments within the playing duration. In general, the feature information corresponding to different audio is not identical, and the correlation between different audio can be determined through their feature information.
Alternatively, the feature information of the audio to be matched is represented using a vector form, in which case the feature information may be referred to as a feature vector. In some embodiments, the computer device obtains feature information of the audio to be matched by processing the audio to be matched. For specific steps of obtaining the feature information of the audio to be matched, please refer to the following embodiments.
In some embodiments, a computer device obtains audio to be matched and determines feature information for the audio to be matched. Optionally, the audio to be matched is from a video file.
In one example, the audio to be matched belongs to a long video (e.g., a video whose playing duration exceeds 5 minutes and whose content is strongly connected). For example, the audio to be matched is background music in a movie or television drama. Since the copyright of the drama and the copyright of the audio to be matched are independent of each other, the drama stops using the background music after the copyright of the audio to be matched expires. In this case, the computer device uses the background music as the audio to be matched, performs audio matching on it, and determines at least one target candidate audio; optionally, the target candidate audio replaces the background music whose copyright has expired in the drama.
For another example, the audio to be matched belongs to background music in a short video. In the process of editing the short video by the user, the computer equipment performs audio matching according to the audio to be matched specified by the user, and determines at least one target candidate audio for the user to select and use.
In another example, the audio to be matched refers to a separate audio file. For example, in the case where the user uses the audio recognition function of the target application to listen to songs and recognize songs, the computer device obtains audio to be matched recorded by the user, and performs audio matching on the audio to be matched to obtain at least one target candidate audio.
Step 220, determining the label information of the audio to be matched according to the characteristic information of the audio to be matched, wherein the label information of the audio to be matched comprises the confidence degrees respectively corresponding to the audio to be matched under a plurality of categories, and the confidence degrees are used for representing the correlation degree between the audio to be matched and the categories.
In some embodiments, the tag information of the audio to be matched is used to characterize the confidence that the audio to be matched corresponds to under multiple categories, respectively. Optionally, the tag information of the audio to be matched is represented in a one-dimensional vector form, and values of different positions in the vector are used for representing the correlation degree of the audio to be matched and a certain category.
Alternatively, the tag information of the audio to be matched is referred to as multi-classification confidence of the audio to be matched.
In some embodiments, the classification scheme includes at least one of: audio style, audio emotion, audio tempo, instrument composition in audio, and language to which the audio belongs, etc.
For any one of the plurality of classification schemes, the classification scheme corresponds to at least one category. A category may be understood as one of the categories into which this classification scheme divides the audio. For an explanation of this process, please refer to the following examples.
In some embodiments, the confidence level of the audio to be matched corresponding to a certain category is used to characterize the correlation degree between the audio to be matched and the category. Optionally, for a classification scheme, the audio to be matched is related to multiple categories in the classification scheme at the same time. The label information of the audio to be matched comprises the correlation degree between the audio to be matched and a plurality of categories in at least one classification mode.
The tag information includes the confidences of the audio to be matched corresponding respectively to a plurality of categories, so that the information carried in the tag information of the audio to be matched is richer and the ability of the tag information to describe the categories to which the audio to be matched belongs is improved. Therefore, more information is available in the audio matching process, which improves the accuracy of audio matching.
Alternatively, the tag information of the audio to be matched includes the correlation degrees corresponding to the first t categories with the highest correlation with the audio to be matched under a certain classification mode, t being a positive integer. In this case, the correlation between the audio to be matched and the other categories of that classification mode is not considered in the audio matching process, which helps to reduce the amount of computation when performing audio matching.
In some embodiments, the tag information of the audio to be matched includes the confidences of the audio to be matched corresponding respectively to the plurality of categories. In order to improve the accuracy of audio matching, the tag information of the audio to be matched may include the confidences corresponding respectively to all categories, where all categories are understood as the union of the categories included in each classification scheme.
In some embodiments, the tag information of the audio to be matched consists of confidence levels of the audio to be matched corresponding to the plurality of categories, respectively. Optionally, the computer device arranges the confidence degrees corresponding to the audio to be matched under a plurality of categories according to the arrangement sequence to obtain the label information of the audio to be matched.
The arrangement order can be preset, that is, the same position in the tag information of different audio represents the correlation between the audio and the same category, which helps to simplify the calculation of correlation according to the tag information of the audio.
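The following sketch illustrates, under assumed category names and an assumed fixed order, how confidences under multiple categories might be arranged into tag information so that the same position always refers to the same category; it is an illustration rather than the implementation of the embodiments.

```python
import numpy as np

# Assumed fixed category order; the names are illustrative only.
CATEGORY_ORDER = ["rock", "pop", "classical", "happy", "sad", "healing"]

def build_tag_info(confidences_by_category):
    """confidences_by_category: dict mapping category name -> confidence in [0, 1]."""
    # Missing categories default to 0 so every tag vector has the same layout.
    return np.array([confidences_by_category.get(c, 0.0) for c in CATEGORY_ORDER])

tag_info = build_tag_info({"rock": 0.9, "happy": 0.7, "healing": 0.4})
print(tag_info)   # [0.9 0.  0.  0.7 0.  0.4]
```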
For a method of determining tag information about audio to be matched, please refer to the following embodiments.
The confidence degrees under the multiple categories are included in the tag information of the audio to be matched, so that the description capability of the tag information on the category of the audio to be matched is improved, and the accuracy of subsequent audio matching is improved.
Step 230, selecting at least one candidate audio from the audio library according to the tag information of the audio to be matched and the tag information of each audio contained in the audio library.
After determining the tag information of the audio to be matched, the computer equipment screens a plurality of audios in the audio library according to the tag information of the audio to be matched to obtain at least one candidate audio.
The audio library includes at least one audio. Optionally, the audio in the audio library is within a copyright-allowed lifetime. The subjective feelings brought by a plurality of audios of the audio library to a listener are not identical, and in the process of audio matching, at least one target candidate audio with higher similarity with the audio to be matched needs to be determined from the audio library.
In some embodiments, the computer device filters the audio in the audio library according to the tag information of the audio to be matched and the tag information of each audio contained in the audio library to obtain at least one candidate audio.
The candidate audio is audio with high similarity with the audio to be matched in at least two categories.
In some embodiments, the computer device determines at least one candidate audio by the audio screening conditions, tag information of the audio to be matched, and tag information of each audio in the audio library; the audio screening conditions are used for screening at least one candidate audio from the audio library.
In some embodiments, the computer device calculates a tag similarity between tag information of the audio to be matched and tag information of the audio in the audio library; the computer device treats at least one audio whose tag similarity satisfies the audio screening condition as at least one candidate audio.
For example, the computer device determines the tag similarity between the tag information of the audio to be matched and the tag information of the audio in the audio library by calculating the spatial similarity between the tag information of the audio to be matched and the tag information of the audio in the audio library, and the like.
Optionally, the audio filtering condition is used for indicating that the first n audio frequencies with the highest label similarity in the audio library are used as candidate audio frequencies, and n is a positive integer.
Optionally, the audio screening condition is used to indicate that, in the audio library, the audio whose tag similarity is greater than or equal to a similarity threshold k, k ∈ (0, 1), is taken as the at least one candidate audio. For example, k is equal to 0.75. The audio screening condition is set according to actual conditions, and the present application is not limited thereto.
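A possible sketch of the primary screen described above, combining the top-n condition and the threshold condition on tag similarity; the library contents, the values of n and k, and the tag vectors are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def screen_candidates(query_tag, library_tags, n=100, k=0.75):
    # Tag similarity between the audio to be matched and each audio in the library.
    sims = [(audio_id, cosine(query_tag, tag)) for audio_id, tag in library_tags.items()]
    sims = [(audio_id, s) for audio_id, s in sims if s >= k]   # threshold condition
    sims.sort(key=lambda item: item[1], reverse=True)
    return sims[:n]                                            # top-n condition

library_tags = {"song_a": np.array([0.9, 0.1, 0.8]),
                "song_b": np.array([0.1, 0.9, 0.2]),
                "song_c": np.array([0.8, 0.2, 0.7])}
print(screen_candidates(np.array([0.85, 0.15, 0.75]), library_tags, n=2, k=0.75))
```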
The computer device uses the tag information of the audio to be matched to perform a primary screen on the audio in the audio library; typically, the computer device determines a plurality of candidate audio from the audio library so as to further finely screen out the target candidate audio from the plurality of candidate audio.
On the one hand, the tag information (of the audio to be matched and of each audio in the audio library) includes the confidences of the audio under a plurality of categories, so that, compared with screening using a single tag, the degree of correlation between the screened at least one candidate audio and the audio to be matched under a plurality of categories is improved. Moreover, because the tag information of the audio includes confidences under a plurality of categories, the screening condition is more restrictive and the number of determined candidate audio is reduced.
On the other hand, the number of categories corresponding to the audio is limited, that is, the data amount of the audio tag is limited (for example, the data amount of the tag information of the audio is small compared with the data amount of the feature information of the audio); using the tag information of the audio for the primary screen in the audio library therefore reduces the computing resources consumed in determining the candidate audio from the audio library and increases the speed of determining the candidate audio.
Step 240, determining target candidate audio matching the audio to be matched from at least one candidate audio.
In some embodiments, the target candidate audio refers to audio that is similar to the audio to be matched. Optionally, subjective emotion brought by the target candidate audio to the audience is relatively close to subjective emotion brought by the audio to be matched to the audience; the audio style to which the target candidate audio belongs is similar to the audio style to which the audio to be matched belongs.
Optionally, the number of the determined target candidate audios is determined according to actual needs from at least one candidate audio. For example, in a scenario where audio substitution is performed, 1 target candidate audio may be determined. For example, in a scenario where audio recommendation or audio recognition is performed, at least one target candidate audio may be determined for further selection by the user.
In some embodiments, the computer device performs a fine screen on the at least one candidate audio to determine the target candidate audio that matches the audio to be matched. Optionally, the computer device determines the target candidate audio according to the feature information of the audio to be matched and the feature information respectively corresponding to the at least one candidate audio. In this case, the candidate audio whose feature information has the highest similarity with the feature information of the audio to be matched is the target candidate audio.
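A minimal sketch of this fine screening step, under the assumption that feature similarity is measured by cosine similarity; the candidate identifiers and feature vectors are illustrative.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_target(query_features, candidate_features):
    """candidate_features: dict mapping candidate audio id -> feature vector."""
    # The candidate whose features are most similar to the query's features wins.
    return max(candidate_features, key=lambda cid: cosine(query_features, candidate_features[cid]))

candidates = {"song_a": np.array([0.2, 0.9, 0.4]), "song_c": np.array([0.1, 0.3, 0.9])}
print(pick_target(np.array([0.15, 0.8, 0.5]), candidates))   # -> "song_a"
```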
Since the feature information of the audio to be matched has already been determined in the process of determining its tag information, determining the target candidate audio from the at least one candidate audio through the feature information of the audio to be matched does not require generating other features of the audio to be matched, which helps to streamline the audio matching process. During model training, this reduces the number of models to be selected and trained, and improves the convergence speed of the model training process.
The audio matching method provided by the application is generally described below by way of one embodiment.
Under the condition that audio matching is required, the computer equipment firstly determines the characteristic information of the audio to be matched, and determines the label information of the audio to be matched according to the characteristic information of the audio to be matched so as to grasp the confidence degrees respectively corresponding to the audio to be matched under a plurality of categories.
The computer device performs a primary screen in the audio library according to the tag information of the audio to be matched to obtain at least one candidate audio. Optionally, the correlation between the tag information of the candidate audio and the tag information of the audio to be matched is higher than the correlation between the tag information of the other audio in the audio library and the tag information of the audio to be matched.
Typically, the computer device will select a plurality of candidate audio from an audio library. After the initial screening is completed, a plurality of candidate audios which are similar to the audios to be matched in classification modes such as music styles, music moods and the like can be obtained. The computer device then performs a refinement screen on the plurality of candidate audio to obtain a target candidate audio.
In summary, in the audio matching method provided by the present application, in the process of determining the target candidate audio, tag information including multi-category confidence is first used to select at least one candidate audio from the audio library, and then the target candidate audio is selected from the at least one candidate audio. On the one hand, compared with the audio matching method in the related art, the method avoids the interference influence of subjective factors on the audio matching process, and is beneficial to improving the accuracy of audio matching. By generating the tag information comprising the multi-classification confidence coefficient, the candidate audio is screened from the audio library, so that the correlation degree of the candidate audio and the audio to be matched under a plurality of categories is improved, the number of the candidate audio determined from the audio library is limited, and the time consumption of the whole audio matching process is reduced.
On the other hand, the audio matching method provided by the present application automates audio matching and reduces the labor cost consumed in the audio matching process. Rapid audio matching can be achieved even when the audio library contains audio on the order of millions.
Next, description will be made of a method of acquiring tag information of candidate audio by several embodiments.
In some embodiments, the computer device determines tag information of the audio to be matched according to the feature information of the audio to be matched, including: the computer equipment determines the classification result of the audio to be matched according to the characteristic information of the audio to be matched through a plurality of different classification networks; wherein, different classification networks correspond to different classification modes, and the classification result determined by each classification network comprises: confidence degrees respectively corresponding to the audio to be matched under a plurality of categories of the classification mode corresponding to the classification network; and the computer equipment determines the label information of the audio to be matched according to the classification results respectively determined by the plurality of different classification networks.
In some embodiments, the classification approach refers to an approach that divides the domain to which the audio to be matched belongs. As described above, the classification means includes, but is not limited to, at least one of: audio style, audio emotion, audio tempo, instrument composition in audio, source of human voice in audio, and language to which the audio belongs.
Optionally, the number and the names of the categories respectively included in the different classification modes are not identical.
For example, for the classification mode of audio emotion, the corresponding emotion categories are shown in table 1:
TABLE 1
| Sad | Happy | Inspirational | Healing | Longing | Sweet |
| Cathartic | Sentimental | Tense | Epic | Battle | Funny |
For example, for the classification mode of audio styles, the corresponding style categories are shown in table 2:
TABLE 2
| Rock | Pop | Dance | Classical | Ballad | Electronic | Hip-hop/Rap |
| Blues | Latin | Light music | Country | Punk | Metal | Workout |
The classification method is set according to actual conditions, and the present application is not limited thereto.
In some embodiments, the classification network is configured to determine the classification result of the audio to be matched under the corresponding classification mode, where the classification result of the audio to be matched includes: the confidences of the audio to be matched corresponding respectively to the plurality of categories included in that classification mode. That is, the same classification network is used to determine the confidences of the audio to be matched corresponding to the plurality of categories included in a certain classification mode.
Optionally, the feature information of the audio to be matched is input into a classification network corresponding to a certain classification mode, so that the confidence degrees respectively corresponding to the audio to be matched and a plurality of categories included in the classification mode can be obtained.
In some embodiments, the classification network belongs to a machine learning network and is used to predict, according to the feature information of the audio to be matched, the classification result corresponding to the audio to be matched under a certain classification mode. In some embodiments, the classification network includes at least one pooling layer and an activation layer, which are used to integrate the feature information of the audio to be matched and predict the confidences of the audio to be matched under a plurality of categories.
Optionally, the activation layer in the classification network includes a sigmoid() activation function. Using the sigmoid() activation function makes the classification result determined by the classification network include an independent confidence for each category of the classification mode, which enriches the information carried by the tag information generated from the classification result.
In some embodiments, the classification results determined by any one of the classification networks include: the confidence degrees of the audio to be matched are respectively corresponding to all the categories of the classification modes corresponding to the classification network. For example, 3 categories of a certain classification scheme a are category 1, category 2, and category 3, respectively. The computer equipment uses the classification network 1 to determine the correlation degree 1 of the audio to be matched and the category 1, the correlation degree 2 of the audio to be matched and the category 2, and the correlation degree 3 of the audio to be matched and the category 3, and uses the correlation degree 1, the correlation degree 2 and the correlation degree 3 as classification results corresponding to the audio to be matched in the classification mode A.
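A hedged PyTorch sketch of such a classification network head (pooling followed by a sigmoid-activated layer producing an independent confidence per category); the feature dimension, the number of frames and the three-category scheme are illustrative assumptions, not the networks of the embodiments.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, feature_dim=128, num_categories=3):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_categories)

    def forward(self, features):               # features: (batch, frames, feature_dim)
        pooled = features.mean(dim=1)          # pooling layer: aggregate over frames
        return torch.sigmoid(self.fc(pooled))  # per-category confidences in [0, 1]

features = torch.randn(1, 20, 128)             # feature information of one audio (illustrative)
style_head = ClassificationHead(num_categories=3)   # e.g. scheme A with categories 1, 2, 3
print(style_head(features))                    # e.g. tensor([[0.61, 0.12, 0.78]]) (values vary)
```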
In some embodiments, the computer device uses a plurality of classification networks to respectively determine classification results of the audio to be matched respectively corresponding in different classification modes.
In some embodiments, the computer device determines a classification result corresponding to the audio to be matched using at least one classification network. Different classification networks are used for determining classification results corresponding to the audio to be matched under different classification modes. Optionally, the computer device determines the number of classification networks to be used in the process of generating the label information of the audio to be matched according to the audio matching requirement, and processes the audio to be matched by using the classification networks corresponding to the classification modes.
For example, if the audio matching requirement is high accuracy, more classification networks may be used in determining the tag information of the audio to be matched. For example, if the audio matching requirement is high accuracy, the computer device selects a classification network respectively corresponding to 3 classification modes of audio style, audio emotion and audio rhythm in the process of determining the tag information.
And determining classification results of the audio to be matched, which correspond to the classification modes, by using the classification networks, and determining label information of the audio to be matched according to the classification results. The method is beneficial to improving the description accuracy of the label information to the audio to be matched, and further limiting the number of the candidate audios screened from the audio library according to the label information of the audio to be matched (namely reducing the number of the candidate audios), thereby reducing the calculated amount in the process of finely screening the target candidate audios from at least one candidate audio and being beneficial to accelerating the audio matching speed.
For another example, if the audio matching requirement is high responsiveness, the number of classification networks used in the process of determining the tag information of the audio to be matched can be reduced, improving the speed of audio matching. For example, when the audio matching requirement is high responsiveness, only the classification network corresponding to the audio style is selected in the process of determining the tag information. In this way, the amount of computation consumed in the classification process of the audio to be matched is reduced, and the speed of determining the tag information of the audio to be matched is increased.
In some embodiments, the audio matching requirements may be indicated by the user. For example, if the user indicates to the target application that the target candidate audio and the audio to be matched have a high similarity in a certain classification mode, the computer device determines a classification result corresponding to the audio to be matched in the classification mode according to at least selecting a classification network corresponding to the classification mode indicated by the user.
If the user wants to obtain at least one target candidate audio similar to the audio to be matched under the audio rhythm classification mode, then in the process of determining the tag information of the audio to be matched according to the feature information of the audio to be matched, the computer device at least selects the classification network corresponding to the audio rhythm and uses that classification network to determine a classification result of the audio to be matched.
Optionally, in order to improve accuracy of the target candidate audio, the computer device may further select a classification network corresponding to at least one other classification mode, and determine a classification result of the audio to be matched in the classification mode. And then the computer equipment determines the label information of the audio to be matched according to the classification results corresponding to the audio to be matched under different classification modes.
In this way, in the process of audio matching, different audio matching requirements are customized in a personalized way, and the adaptability of the audio matching method to different audio matching scenes is improved.
In the following, a method for generating a classification result of audio to be matched is described by several embodiments.
In some embodiments, the plurality of different classification networks includes a first classification network and a second classification network, the first classification network corresponds to a classification based on audio style classification, and the second classification network corresponds to a classification based on audio emotion classification; the computer equipment respectively determines the classification result of the audio to be matched according to the characteristic information of the audio to be matched through a plurality of different classification networks, and comprises the following steps: the computer equipment determines a first classification result of the audio to be matched according to the characteristic information of the audio to be matched through a first classification network, wherein the first classification result comprises: confidence degrees of the audio to be matched, which correspond to the audio to be matched under a plurality of categories obtained based on the classification of the audio styles; the computer equipment determines a second classification result of the audio to be matched according to the characteristic information of the audio to be matched through a second classification network, wherein the second classification result comprises: the confidence degrees of the audio to be matched are respectively corresponding to a plurality of categories obtained based on the audio emotion classification.
In some embodiments, the computer device determines a classification result for the audio to be matched using at least two classification networks. Namely, the label information of the audio to be matched is generated according to at least two classification results.
In some embodiments, after obtaining the feature information of the audio to be matched, the computer device inputs the feature information of the audio to be matched into the first classification network and the second classification network respectively, and obtains the first classification result of the audio to be matched and the second classification result of the audio to be matched.
It should be noted that, the present application does not limit the order of occurrence of determining the first classification result using the first classification network and determining the second classification result using the second classification network.
For example, after determining the feature information of the audio to be matched, the computer device first uses the first classification network to determine the first classification result according to the feature information of the audio to be matched, and then uses the second classification network to determine the second classification result according to the feature information of the audio to be matched.
For another example, after determining the feature information of the audio to be matched, the computer device first uses the second classification network to determine the second classification result according to the feature information of the audio to be matched, and then uses the first classification network to determine the first classification result according to the feature information of the audio to be matched.
Because the first classification network and the second classification network are mutually independent, the characteristic information of the audio to be matched can be simultaneously input into the first classification network and the second classification network respectively, and the first classification result output by the first classification network and the second classification result output by the second classification network are obtained respectively.
The confidence degrees respectively corresponding to the audio to be matched under the multiple categories obtained based on the audio style classification and the confidence degrees respectively corresponding to the audio to be matched under the multiple categories obtained based on the audio emotion classification are determined, so that the description capability of the correlation degree between the audio to be matched and the multiple categories is improved.
In some embodiments, the computer device determines tag information of the audio to be matched according to classification results respectively determined by a plurality of different classification networks, including: for each classification network, the computer equipment selects at least one confidence coefficient meeting the result screening condition from the classification results determined by the classification network according to the result screening condition corresponding to the classification network, and obtains a screened classification result corresponding to the classification network; and integrating the filtered classification results corresponding to the classification networks by the computer equipment to obtain the label information of the audio to be matched.
The result screening condition is used to indicate that at least one confidence level is screened from the classification result.
In some embodiments, the confidence between the audio to be matched and a certain category is used to characterize the degree of correlation between the audio to be matched and that category. Optionally, the confidence is positively correlated with the degree of correlation: if the degree of correlation between the audio to be matched and the category is greater, the value of the corresponding confidence is greater; if the degree of correlation between the audio to be matched and the category is smaller, the value of the corresponding confidence is smaller.
In some embodiments, the result screening condition is used for indicating the computer device to select the first i confidence degrees with the largest value from the plurality of confidence degrees included in the classification result, so as to obtain a screened classification result, wherein i is a positive integer; the classified result after screening comprises the first i confidence degrees with the maximum confidence coefficient value.
For example, i is equal to 5 and a certain classification mode includes 20 categories, so the classification result of the audio to be matched under this classification mode includes the confidences of the audio to be matched corresponding respectively to the 20 categories. The computer device selects the 5 confidences with the largest values from the 20 confidences and obtains the screened classification result according to these 5 confidences.
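The result screening condition can be sketched as follows; the number of categories and the value of i follow the example above, and the confidences are randomly generated for illustration only.

```python
import numpy as np

def screen_result(confidences, i=5):
    """Return (category_index, confidence) pairs for the i largest confidences."""
    top = np.argsort(confidences)[::-1][:i]   # indices of the top-i confidences
    return [(int(idx), float(confidences[idx])) for idx in top]

result = np.random.rand(20)          # classification result over 20 categories (illustrative)
print(screen_result(result, i=5))    # the 5 categories most related to the audio
```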
The specific conditions of the result screening conditions are determined according to actual needs, and the present application is not limited thereto.
In this way, the screened classification result includes the confidences corresponding to at least one category strongly correlated with the audio to be matched, and the confidences corresponding to other categories with low correlation to the audio to be matched are removed. Generating the tag information of the audio to be matched according to the screened classification result helps to reduce the amount of invalid information carried in the tag information of the audio to be matched. On the one hand, this helps to reduce the amount of computation in the process of selecting at least one candidate audio according to the tag information of the audio to be matched and the tag information of each audio in the audio library. On the other hand, because the degrees of correlation between different audio and the same category are not completely the same, the screened classification result includes the confidences corresponding to the categories strongly correlated with the audio to be matched, which improves the degree of correlation between the audio to be matched and the candidate audio selected from the audio library according to the tag information of the audio to be matched.
In some embodiments, the method for determining the tag information corresponding to each audio in the audio library is similar to the method for determining the tag information corresponding to the audio to be matched.
In some embodiments, in order to shorten the time consumption of performing audio matching on the audio to be matched, the computer device determines in advance a classification result corresponding to each audio in the audio library in different classification modes, so as to quickly determine tag information of each audio in the audio library according to the classification result. Optionally, when storing the audio in the audio library, the computer device determines a classification result and tag information of the audio corresponding to the audio in at least one classification mode, respectively.
For any usable audio in the audio library, the computer equipment determines the characteristic information of the usable audio and determines the classification result of the usable audio corresponding to a plurality of classification modes respectively according to the characteristic information of the usable audio. Optionally, the computer device uses a plurality of different classification models to respectively determine classification results of the usable audio respectively corresponding under different classification modes.
In some embodiments, the computer device may determine the tag information of the usable audio based on the classification results of the usable audio corresponding respectively to the different classification modes. As described above for the audio to be matched, the tag information can be determined in at least one way. For example, the classification results corresponding to the audio under certain classification modes (such as audio emotion and audio style) are concatenated into the tag information. For another example, the classification results of the audio are processed through the result screening condition to obtain screened classification results, and the tag information is generated according to the screened classification results.
In some embodiments, the computer device determines the tag information corresponding to each audio in the audio library using the same method as that for determining the tag information of the audio to be matched.
Optionally, in order to increase the speed of determining the tag information of each audio in the audio library, the computer device may store the feature information of at least one audio in the audio library (i.e., the usable audio mentioned in the previous embodiments) and the classification results of the at least one audio corresponding respectively to the different classification modes. Then, the computer device determines the tag information corresponding to each audio in the audio library according to the method for the tag information of the audio to be matched.
Optionally, in the case that the classification result needs to be screened by the result screening condition to obtain the filtered classification result, the computer device first determines at least one filtered classification result of the audio to be matched according to the result screening condition. The computer device then determines at least one target category under a certain classification mode according to the filtered classification result of the audio to be matched. For a usable audio, the computer device selects the confidence degrees corresponding to the target categories from the classification result of the usable audio under the same classification mode, and generates a filtered classification result of the usable audio from these confidence degrees. The computer device uses the at least one filtered classification result of the usable audio to determine the tag information of the usable audio.
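As an illustration of the screening logic described above, the following minimal Python sketch assumes that a classification result is a mapping from category names to confidence degrees and that the result screening condition is a simple confidence threshold; both assumptions are illustrative only and not prescribed by the embodiments.

```python
# Illustrative sketch only: classification results as dicts of category -> confidence,
# and a confidence threshold as the assumed result screening condition.

def screen_result(classification_result: dict, threshold: float = 0.5) -> dict:
    """Keep only the categories whose confidence satisfies the screening condition."""
    return {c: conf for c, conf in classification_result.items() if conf >= threshold}

def screen_library_result(library_result: dict, target_categories: list) -> dict:
    """For a usable audio in the library, keep the confidences of the target
    categories determined from the filtered result of the audio to be matched."""
    return {c: library_result.get(c, 0.0) for c in target_categories}

# Example: screen the style classification result of the audio to be matched,
# then align a library audio's result to the same target categories.
matched_style = {"rock": 0.81, "jazz": 0.07, "electronic": 0.64, "folk": 0.12}
filtered = screen_result(matched_style)                          # {'rock': 0.81, 'electronic': 0.64}
library_style = {"rock": 0.15, "jazz": 0.70, "electronic": 0.55, "folk": 0.02}
aligned = screen_library_result(library_style, list(filtered))   # {'rock': 0.15, 'electronic': 0.55}
print(filtered, aligned)
```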
Thereafter, the computer device calculates the correlation degree between the tag information of the audio to be matched and the tag information of each audio in the audio library (e.g. calculates the spatial similarity between the tag information), and selects at least one candidate audio from the audio library. For details of this process, refer to the above embodiments, which are not repeated here.
In some embodiments, the computer device obtains the feature information of the audio to be matched as follows: the computer device obtains a multiband semantic feature sequence of the audio to be matched, where the multiband semantic feature sequence includes the semantic features corresponding to a plurality of audio frames (Chunks) obtained by framing the audio to be matched; and the computer device generates the feature information of the audio to be matched according to the multiband semantic feature sequence.
In some embodiments, in determining the feature information of the audio to be matched, the computer device needs to perform framing processing on the audio to be matched to obtain at least one audio frame.
Macroscopically, the audio to be matched is a non-stationary signal, while microscopically, short pieces of the audio to be matched are stationary. That is, the audio to be matched has short-time stationarity (typically the audio signal is approximately constant, or varies slowly, within 10-30 ms). Based on this characteristic, framing processing can be performed on the audio signal to be matched to obtain a plurality of audio segments with a short duration. Optionally, any one of these audio segments may be referred to as an audio frame.
In some embodiments, the computer device frames the audio to be matched according to a fixed frame length to obtain at least one audio frame.
Optionally, the computer device selects different framing methods for the audio to be matched according to whether the audio frames will be windowed. Windowing refers to processing an audio frame using a window function. Because windowing weakens (distorts) the signals at both edges of an audio frame, if the audio frames need to be windowed after framing, adjacent audio frames should partially overlap, so that the signal weakened by windowing in one audio frame is less affected by the windowing of the other audio frames, and the loss of audio signal is reduced.
For example, for two audio frames adjacent in time sequence, audio frame 1 and audio frame 2, audio frame 1 corresponds to a playing period of 60-90ms in the audio to be matched, and audio frame 2 corresponds to a playing period of 85-115ms in the audio to be matched.
If no windowing of the audio frames is required, there is no overlapping audio signal between two adjacent audio frames after framing. For example, for two audio frames adjacent in time sequence, audio frame 3 and audio frame 4, audio frame 3 corresponds to a play period of 60-90ms in the audio to be matched, and audio frame 4 corresponds to a play period of 90-120ms in the audio to be matched.
The frame length used for framing is set according to actual needs, which is not limited in the present application.
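A minimal framing sketch in Python is given below; the frame length, hop length and sampling rate are illustrative assumptions, not values prescribed by the embodiments.

```python
import numpy as np

def frame_audio(signal: np.ndarray, sample_rate: int,
                frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Split an audio signal into fixed-length frames.

    If hop_ms < frame_ms, adjacent frames overlap (useful when the frames will be
    windowed afterwards); setting hop_ms equal to frame_ms gives non-overlapping frames.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len: i * hop_len + frame_len] for i in range(n_frames)])

# Example: 25 ms frames with a 10 ms hop (overlapping) on 1 second of toy audio at 16 kHz.
audio = np.random.randn(16000)
frames = frame_audio(audio, 16000)
print(frames.shape)   # (98, 400)
```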
The semantic features of an audio frame are used to characterize the feature information carried in the audio frame.
In some embodiments, the computer device sequentially determines semantic features of each audio frame, and arranges the semantic features of the audio frames according to a playing period corresponding to each audio frame in the audio to be matched, to obtain a multiband semantic feature sequence of the audio to be matched. In some embodiments, the multi-band semantic feature sequence is used to characterize semantic features of the complete audio to be matched.
For a specific method of determining semantic features of an audio frame, please refer to the following embodiments.
The playing durations of different audios to be matched are not necessarily the same. By framing the audio to be matched with a fixed frame length, a plurality of audio frames with the same time span are obtained, and the multiband semantic feature sequence is determined from the semantic features of these audio frames. This unifies the processing of audio to be matched with different playing durations, and improves the adaptability of the audio matching method to audio to be matched with different playing durations.
In some embodiments, the computer device obtains the multiband semantic feature sequence of the audio to be matched as follows: the computer device extracts time domain feature information and frequency domain feature information of the audio to be matched, where the time domain feature information is used to characterize the features of the audio to be matched in the time domain dimension, and the frequency domain feature information is used to characterize the features of the audio to be matched in the frequency domain dimension; the computer device performs fusion processing on at least one intermediate time domain feature produced in the time domain feature extraction process and at least one intermediate frequency domain feature produced in the frequency domain feature extraction process to obtain interaction feature information of the audio to be matched, where the interaction feature information is used to characterize the interaction features of the audio to be matched between the time domain dimension and the frequency domain dimension; and the computer device obtains the multiband semantic feature sequence of the audio to be matched according to the time domain feature information, the frequency domain feature information and the interaction feature information.
In some embodiments, the computer device processes the audio to be matched in the time domain dimension to obtain the time domain feature information, and processes the audio to be matched in the frequency domain dimension to obtain the frequency domain feature information. In order to make the time domain and frequency domain descriptions of the audio to be matched complement each other, the computer device also generates the interaction feature information from the intermediate time domain features and the intermediate frequency domain features.
Optionally, the time domain feature information, the frequency domain feature information and the interaction feature information of the audio to be matched are all determined in units of audio frames. After determining a plurality of audio frames of the audio to be matched, the computer device respectively determines the time domain feature information, frequency domain feature information and interaction feature information corresponding to each audio frame, generates the semantic feature of the audio frame from these three kinds of feature information, and obtains the multiband semantic feature sequence of the audio to be matched by splicing the semantic features of the audio frames according to the audio playing order.
Optionally, the time domain feature information is used to characterize how the audio loudness varies with time, and the frequency domain feature information is used to characterize how the amplitude varies with frequency. Either one can therefore represent the audio to be matched from only a single dimension. By determining the interaction feature information, the time domain feature information and the frequency domain feature information complement each other, which improves the ability of the multiband semantic feature sequence to express the audio to be matched.
In some embodiments, the multi-band semantic feature sequences are obtained through a feature sequence extraction network comprising: a time domain processing branch, a frequency domain processing branch and a characteristic interaction branch.
In some embodiments, the time domain processing branch includes a plurality of time domain convolution layers that are connected in sequence: the first time domain convolution layer processes each audio frame of the audio to be matched to obtain an intermediate time domain feature, and each higher time domain convolution layer receives the intermediate time domain feature output by the adjacent lower time domain convolution layer. Optionally, the high-level time domain feature output by the last time domain convolution layer of the time domain processing branch is used as the time domain feature information.
The frequency domain processing branch includes a plurality of frequency domain convolution layers. The frequency domain convolution layers are connected in sequence and convolve the audio to be matched multiple times to obtain high-level frequency domain features, and the high-level frequency domain feature output by the last frequency domain convolution layer of the frequency domain processing branch is used as the frequency domain feature information.
Optionally, the intermediate time domain feature refers to a time domain feature output by any one of the time domain convolutional layers except the last time domain convolutional layer in the time domain processing branch. The intermediate frequency domain features refer to the frequency domain features output by any frequency domain convolution layer except the last frequency domain convolution layer in the frequency domain processing branch.
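The following PyTorch sketch illustrates one possible structure of the two branches, each keeping its intermediate features for the feature interaction branch; the layer counts, kernel sizes and channel numbers are assumptions for illustration only, not the structure prescribed by the embodiments.

```python
import torch
import torch.nn as nn

class TimeDomainBranch(nn.Module):
    """Stacked 1-D convolutions over the raw waveform of one audio frame;
    intermediate features are kept for the feature interaction branch."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv1d(1, 16, 11, padding=5), nn.ReLU(), nn.MaxPool1d(4)),
            nn.Sequential(nn.Conv1d(16, 32, 11, padding=5), nn.ReLU(), nn.MaxPool1d(4)),
            nn.Sequential(nn.Conv1d(32, 64, 11, padding=5), nn.ReLU(), nn.MaxPool1d(4)),
        ])

    def forward(self, waveform):                 # waveform: (batch, 1, samples)
        intermediates, x = [], waveform
        for layer in self.layers:
            x = layer(x)
            intermediates.append(x)
        return x, intermediates[:-1]             # last output = time domain feature information

class FrequencyDomainBranch(nn.Module):
    """Stacked 2-D convolutions over the spectrum (e.g. log-mel) of one audio frame."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
            nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
            nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
        ])

    def forward(self, spectrum):                 # spectrum: (batch, 1, mel bins, time steps)
        intermediates, x = [], spectrum
        for layer in self.layers:
            x = layer(x)
            intermediates.append(x)
        return x, intermediates[:-1]             # last output = frequency domain feature information

wave = torch.randn(2, 1, 16000)                  # two toy audio frames in the time domain
spec = torch.randn(2, 1, 64, 128)                # their spectrum information
t_out, t_mid = TimeDomainBranch()(wave)
f_out, f_mid = FrequencyDomainBranch()(spec)
print(t_out.shape, f_out.shape, len(t_mid), len(f_mid))
```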
The feature interaction branch is used to interact at least one intermediate time domain feature with at least one intermediate frequency domain feature to obtain the interaction feature information. Optionally, the feature interaction branch includes a plurality of feature interaction layers; any one of the feature interaction layers is used to receive one intermediate time domain feature and one intermediate frequency domain feature, and perform fusion processing on them to obtain an intermediate interaction feature output by that feature interaction layer.
In this method, not only are the time domain feature information and the frequency domain feature information of the audio to be matched determined, but at least one intermediate time domain feature and at least one intermediate frequency domain feature are also interacted to obtain the interaction feature information. On the one hand, determining the interaction feature information establishes the relationship between the time domain and frequency domain features of the audio to be matched, so that the frequency domain information and the time domain information complement each other.
On the other hand, the interaction feature information is generated from at least one intermediate time domain feature and at least one intermediate frequency domain feature, so that the low-level features produced by the time domain processing branch and the frequency domain processing branch are retained in the interaction feature information. Generating the interaction feature information helps the high-level layers of the feature sequence extraction network perceive the features extracted by the low-level layers.
The following describes a method for generating interactive feature information through several embodiments.
In some embodiments, the at least one intermediate time domain feature and the at least one intermediate frequency domain feature form at least one feature group, and each feature group includes a corresponding pair of an intermediate time domain feature and an intermediate frequency domain feature. The computer device performs fusion processing on the at least one intermediate time domain feature in the time domain feature extraction process and the at least one intermediate frequency domain feature in the frequency domain feature extraction process to obtain the interaction feature information of the audio to be matched as follows: for each feature group, the computer device splices the corresponding intermediate time domain feature and intermediate frequency domain feature contained in the feature group to obtain the spliced feature corresponding to the feature group; and the computer device performs feature extraction processing on the spliced features corresponding to the feature groups in cascade order to obtain the interaction feature information of the audio to be matched.
In some embodiments, the intermediate time domain feature and the intermediate frequency domain feature included in the same feature group have the same degree of convolution. For example, for a certain feature group, the intermediate time domain feature included in the feature group is output by the second time domain convolution layer in the time domain processing branch, and the intermediate frequency domain feature included in the feature group is output by the second frequency domain convolution layer in the frequency domain processing branch.
In some embodiments, the intermediate time domain features and intermediate frequency domain features included in different feature groups are not the same. That is, any two feature groups differ in the degree of convolution of the intermediate time domain feature and the intermediate frequency domain feature they include.
In some embodiments, for each feature group, the computer device splices the corresponding intermediate time domain feature and intermediate frequency domain feature contained in the feature group as follows: the computer device performs vector deformation (for example, by a reshape function) on the intermediate time domain feature to obtain the deformed intermediate time domain feature; and the computer device performs fusion processing using the intermediate frequency domain feature and the deformed intermediate time domain feature to obtain the spliced feature corresponding to the feature group.
In some embodiments, the dimensions of the deformed intermediate time-domain feature are the same as the dimensions of the intermediate frequency-domain feature.
In some embodiments, the computer device performs fusion processing using the intermediate frequency domain feature and the deformed intermediate time domain feature to obtain the spliced feature corresponding to the feature group as follows: the computer device splices the deformed intermediate time domain feature and the intermediate frequency domain feature (for example, by a concat function) to obtain the spliced feature corresponding to the feature group; and the computer device performs feature extraction processing on the spliced features corresponding to the feature groups in cascade order to obtain the interaction feature information of the audio to be matched.
In some embodiments, the computer device obtains the interactive feature information of the audio to be matched by performing at least one convolution process on the spliced features corresponding to the feature groups according to the cascade order.
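A minimal sketch of one such interaction step is given below, assuming illustrative tensor shapes: it reshapes an intermediate time domain feature, concatenates it with an intermediate frequency domain feature along the channel axis, and applies a convolution to the spliced feature.

```python
import torch
import torch.nn as nn

# Illustrative shapes only; the actual channel and spatial dimensions depend on the network.
time_feat = torch.randn(8, 32, 4096)      # (batch, channels, time steps) from a time domain conv layer
freq_feat = torch.randn(8, 32, 32, 128)   # (batch, channels, mel bins, time steps) from a frequency domain conv layer

# Vector deformation (reshape) of the intermediate time domain feature so its
# dimensions match the intermediate frequency domain feature.
time_feat_2d = time_feat.reshape(8, 32, 32, 128)

# Splicing (concat) along the channel dimension, then feature extraction by convolution.
spliced = torch.cat([time_feat_2d, freq_feat], dim=1)          # (8, 64, 32, 128)
interaction_conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
intermediate_interaction = interaction_conv(spliced)
print(intermediate_interaction.shape)                          # torch.Size([8, 64, 32, 128])
```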
In this way, the intermediate time domain features and the intermediate frequency domain features are used to make the frequency domain features and the time domain features of the audio to be matched complement each other, so that the generated multiband semantic feature sequence describes the semantic information of the audio to be matched more accurately. This improves the ability of the feature information determined from the multiband semantic feature sequence to express the audio to be matched, and further improves the accuracy of audio matching.
In some embodiments, the computer device extracts the time domain feature information and the frequency domain feature information of the audio to be matched as follows: the computer device processes the audio to be matched through a time domain feature extraction network to obtain the time domain feature information of the audio to be matched; the computer device performs time-frequency transformation on the audio to be matched to obtain the spectrum information of the audio to be matched; and the computer device processes the spectrum information through a frequency domain feature extraction network to obtain the frequency domain feature information of the audio to be matched.
In some embodiments, the computer device sequentially inputs the plurality of audio frames of the audio to be matched into the feature sequence extraction network in playing order, determines the semantic features corresponding to the audio frames, and generates the multiband semantic feature sequence of the audio to be matched from the semantic features corresponding to the audio frames.
Optionally, each time domain convolution layer includes at least one convolution unit and at least one pooling unit. For any time domain convolution layer in the time domain processing branch, the convolution unit performs convolution processing on its input to obtain a first time domain feature, and the pooling unit performs pooling processing on the first time domain feature to obtain a pooled first time domain feature, which is transmitted to the adjacent next time domain convolution layer. The pooled time domain feature output by the last time domain convolution layer in the time domain processing branch is the time domain feature information of the audio frame.
In some embodiments, before inputting the audio to be matched into the frequency domain processing branch, time-frequency transformation needs to be performed on the audio to be matched to obtain frequency spectrum information of the audio to be matched, wherein the frequency spectrum information is used for representing the relation between the frequency and the amplitude of the audio to be matched.
In some embodiments, the computer device performs a time-frequency transform on each audio frame to obtain spectral information of the audio frame. And processing the frequency spectrum information of the audio frame through a frequency domain processing branch to determine the frequency domain characteristic information of the audio frame.
Optionally, the method of time-frequency conversion includes, but is not limited to, at least one of: fourier transform, short-time fourier transform, wavelet transform, etc.
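The following sketch computes spectrum information for a single audio frame with a short-time Fourier transform followed by a log-mel projection; the use of librosa and the chosen transform parameters are illustrative assumptions rather than part of the embodiments.

```python
import numpy as np
import librosa

sr = 16000
audio_frame = np.random.randn(sr // 2)                        # a toy 0.5 s audio frame

# Short-time Fourier transform -> magnitude spectrum.
stft = np.abs(librosa.stft(audio_frame, n_fft=512, hop_length=160))

# Mel filter bank + log compression -> LogMel spectrum (frequency vs. time).
mel = librosa.feature.melspectrogram(S=stft**2, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)
print(log_mel.shape)                                          # (64, number of STFT frames)
```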
Optionally, each frequency domain convolution layer includes at least one convolution unit and at least one pooling unit. For any frequency domain convolution layer in the frequency domain processing branch, the convolution unit performs convolution processing on the spectrum information of the audio frame or on the frequency domain feature output by the previous convolution layer to obtain a first frequency domain feature, and the pooling unit performs pooling processing on the first frequency domain feature to obtain a pooled first frequency domain feature, which is transmitted to the adjacent next frequency domain convolution layer. The pooled frequency domain feature generated by the last frequency domain convolution layer in the frequency domain processing branch is the frequency domain feature information of the audio frame.
By determining the frequency domain feature information and the time domain feature information of the audio to be matched, the features of the audio frames are extracted from different dimensions, which improves the accuracy of audio matching.
In some embodiments, the computer device obtains a multiband semantic feature sequence of the audio to be matched according to the time domain feature information, the frequency domain feature information and the interaction feature information, including: the computer equipment splices the time domain feature information, the frequency domain feature information and the interaction feature information to obtain fusion feature information; the computer equipment adopts a plurality of different pooling modes to pool the fusion characteristic information to obtain a plurality of pooling results; the computer device generates a multi-band semantic feature sequence of the audio to be matched according to the plurality of pooling results.
In some embodiments, for any one audio frame, the computer device splices the time domain feature information of the audio frame, the frequency domain feature information of the audio frame, and the interaction feature information of the audio frame to obtain fusion feature information of the audio frame.
In some embodiments, vector deformation of the time domain feature information is required before splicing, so that the deformed time domain feature information has the same dimensions as the frequency domain feature information (or the interaction feature information); the computer device then splices the frequency domain feature information, the interaction feature information, and the deformed time domain feature information.
For example, the computer device uses a concat () function to splice the time domain feature information, the frequency domain feature information of the audio frame, and the interactive feature information of the audio frame in a splicing order, so as to obtain the fusion feature information of the audio frame. The splicing order is set according to actual needs, and is not limited herein.
The computer device pools the fusion feature information in a plurality of different pooling modes to obtain a plurality of pooling results, and generates the multiband semantic feature sequence of the audio to be matched according to the plurality of pooling results.
The pooling process includes, but is not limited to, at least one of: average pooling and maximum pooling.
For example, the computer device splices the time domain feature information, the frequency domain feature information and the interaction feature information of a certain audio frame to obtain the fusion feature information of the audio frame, and performs convolution processing on the fusion feature information to obtain the convolved fusion feature information. The computer device respectively performs average (mean) pooling and maximum (max) pooling on the convolved fusion feature information to obtain two pooling results, adds the two pooling results, and then activates the sum to obtain the activated feature information. The activated feature information is vectorized and classified to obtain the semantic feature corresponding to the audio frame. The vectorization processing converts the activated feature information into feature information in vector form, and the classification performs prediction on the activated feature information.
The computer device splices the semantic features corresponding to the audio frames according to the playing order of the audio to be matched to obtain the multiband semantic feature sequence of the audio to be matched.
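A minimal PyTorch sketch of turning the fused features of one audio frame into its semantic feature is given below; the tensor shapes, layer sizes and projection dimension are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed shapes: concat of time domain, frequency domain and interaction features of one frame batch.
fused = torch.randn(8, 192, 32, 128)

conv = nn.Conv2d(192, 256, kernel_size=3, padding=1)
proj = nn.Linear(256, 512)                    # vectorization / classification layer producing the semantic feature

x = conv(fused)                               # convolved fusion feature information, (8, 256, 32, 128)
x = x.mean(dim=2)                             # collapse the frequency axis -> (8, 256, 128)
mean_pool = x.mean(dim=-1)                    # average pooling over time -> (8, 256)
max_pool = x.max(dim=-1).values               # maximum pooling over time -> (8, 256)
x = F.relu(mean_pool + max_pool)              # add the two pooling results, then activate
semantic_feature = torch.softmax(proj(x), dim=-1)   # (8, 512) semantic feature of the audio frame
print(semantic_feature.shape)
```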
In some embodiments, the computer device generates the feature information of the audio to be matched from the multiband semantic feature sequence as follows: the computer device performs context semantic association processing on the multiband semantic feature sequence in the forward time direction to obtain a forward embedded vector, where the forward embedded vector is used to characterize the semantic features of the audio to be matched in the playing direction; the computer device performs context semantic association processing on the multiband semantic feature sequence in the reverse time direction to obtain a reverse embedded vector, where the reverse embedded vector is used to characterize the semantic features of the audio to be matched in the reverse playing direction; and the computer device splices the forward embedded vector and the reverse embedded vector to obtain the feature information of the audio to be matched.
In some embodiments, the computer device generates the feature information of the audio to be matched from the multiband semantic feature sequence through an embedding generation network.
In some embodiments, the embedding generation network is a temporal network. Optionally, the network structure of the embedding generation network includes, but is not limited to, at least one of: Bidirectional Long Short-Term Memory (BLSTM), Bidirectional Transformers, and the like.
In some embodiments, the embedding generation network includes a forward processing branch and a reverse processing branch; the forward processing branch is used to perform context semantic association processing on the multiband semantic feature sequence in the forward time direction to obtain the forward embedded vector, and the reverse processing branch is used to perform context semantic association processing on the multiband semantic feature sequence in the reverse time direction to obtain the reverse embedded vector.
Optionally, in performing the context semantic association processing, the semantic feature of each audio frame in the multiband semantic feature sequence is used as one context processing unit. That is, the context semantic association processing determines the relationship between the semantic features of the individual audio frames.
In some embodiments, the computer device splices the forward embedded vector and the reverse embedded vector to obtain the feature information of the audio to be matched. That is, the feature information is composed of the forward embedded vector and the reverse embedded vector.
In the feature information, the forward embedded vector may be arranged before the reverse embedded vector, or after it; the splicing order of the forward embedded vector and the reverse embedded vector is not limited in this application.
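The following PyTorch sketch illustrates how a bidirectional LSTM can produce the forward and reverse embedded vectors and splice them into fixed-size feature information; the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

feature_dim, hidden_dim = 512, 256
blstm = nn.LSTM(input_size=feature_dim, hidden_size=hidden_dim,
                batch_first=True, bidirectional=True)

sequence = torch.randn(1, 98, feature_dim)        # (batch, audio frames, semantic feature dim)
_, (h_n, _) = blstm(sequence)                      # h_n: (2, batch, hidden_dim)

forward_embedding = h_n[0]                         # context in the playing direction
reverse_embedding = h_n[1]                         # context in the reverse playing direction
feature_info = torch.cat([forward_embedding, reverse_embedding], dim=-1)   # (1, 512), fixed size
print(feature_info.shape)
```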
Optionally, a method for determining feature information of each audio in the audio library is similar to a method for determining feature information of an audio to be matched, which is not described herein in detail.
Because the playing durations of different audios are not necessarily the same, after audios with different playing durations are framed with the same frame length, the numbers of audio frames obtained may differ, so the data amounts of the multiband semantic feature sequences of different audios are not the same. By performing context association processing on the multiband semantic feature sequence, sequences of different lengths are mapped to feature information of the same data size, so that the similarity between audios can be calculated from their feature information.
The audio matching method provided by this solution is described below through a complete embodiment.
FIG. 3 is a schematic diagram of a multi-band semantic feature vector sequence generation process provided by an exemplary embodiment of the present application.
After obtaining the audio to be matched, the computer device frames the audio to be matched to obtain a plurality of audio frames, and inputs the audio frames into the feature sequence extraction network one by one in playing order.
The computer device generates the time domain feature information of the audio frame through the time domain processing branch in the feature sequence extraction network. Optionally, the time domain processing branch uses a plurality of one-dimensional convolution layers (Conv1D Blocks).
Using multiple one-dimensional convolution layers makes it possible to learn the time domain characteristics of the audio signal directly, including the relationship between audio loudness and amplitude. Optionally, a one-dimensional convolution layer in the time domain processing branch is the convolution unit in the time domain convolution layer mentioned above.
Optionally, a one-dimensional pooling layer (e.g. MaxPooling1D, s=4) is disposed between two adjacent one-dimensional convolution layers, and is used for pooling the convolved intermediate time domain features.
In some embodiments, the computer device reshapes the feature information output by the time domain processing branch to obtain the time domain feature information of the audio frame. Optionally, the time domain feature information is referred to as a two-dimensional map (wavegram).
The computer device determines the frequency domain feature information of the audio frame through the frequency domain processing branch in the feature sequence extraction network. Before this, the spectrum information of the audio frame needs to be determined. For example, the LogMel spectrum of the audio frame is determined using the Mel frequency, and the LogMel spectrum is used as the spectrum information of the audio frame.
The spectrum information of the audio frame is processed through the frequency domain processing branch in the feature sequence extraction network to obtain the frequency domain feature information of the audio frame. Optionally, the frequency domain feature information is referred to as feature maps (Feature Maps).
In some embodiments, the frequency domain processing branch includes a plurality of two-dimensional convolution layers (Conv 2D blocks), and frequency domain feature information of the audio frame is obtained by performing convolution processing on spectrum information of the audio frame. Alternatively, the two-dimensional convolution layer in the frequency domain processing branch is the convolution unit in the frequency domain convolution layer mentioned above.
Optionally, a two-dimensional pooling layer (e.g. MaxPooling2D, s=4) is disposed between two adjacent two-dimensional convolution layers, and is used for pooling the convolved intermediate frequency domain features.
Optionally, the frequency domain feature information of the audio frame has the same dimensions as the time domain feature information of the audio frame obtained after the reshaping described above.
As shown in FIG. 3, a feature interaction branch is provided between the time domain processing branch and the frequency domain processing branch. The feature interaction branch includes at least one feature interaction (Concat) layer.
Any one of the feature interaction layers corresponds to one intermediate time domain feature in the time domain processing branch and one intermediate frequency domain feature in the frequency domain processing branch, so the feature interaction branch can fuse at least one intermediate time domain feature and at least one intermediate frequency domain feature to obtain the interaction feature information.
Vector deformation (reshape) needs to be performed on the intermediate time domain feature before the intermediate time domain feature and the intermediate frequency domain feature are spliced, so that the deformed intermediate time domain feature has the same dimensions as the intermediate frequency domain feature.
Optionally, a two-dimensional convolution layer (Conv 2D Block) is included between two adjacent feature interaction layers, for performing convolution processing on the spliced features obtained by the feature interaction layers.
The time domain feature information, the frequency domain feature information and the interaction feature information of the audio frame are integrated through the feature integration layer of the feature sequence extraction network to obtain the semantic feature of the audio frame.
In some embodiments, the feature integration layer generates the fusion feature information of the audio frame (a set of two-dimensional frequency domain feature maps) from the time domain feature information, the frequency domain feature information and the interaction feature information of the audio frame. The feature integration layer inputs the two-dimensional frequency domain feature maps into a two-dimensional convolution layer to obtain a one-dimensional feature vector, and then performs average (mean) pooling and maximum (max) pooling on the one-dimensional feature vector respectively. The average pooling result and the maximum pooling result are added (sum) to obtain the pooled feature. The pooled feature is activated using a linear rectification (Rectified Linear Unit, ReLU) layer. The activated feature then passes through a vector processing layer to obtain the vectorized feature, and the vectorized feature is classified by a classification layer (for example, using a softmax function) to obtain the semantic feature of the audio frame.
The computer device splices the semantic features of the audio frames in playing order to obtain the multiband semantic feature sequence of the audio to be matched.
It should be noted that the structures of the time domain processing branch, the frequency domain processing branch and the feature interaction branch of the feature sequence extraction network in FIG. 3 are merely examples. The number of functional layers, such as convolution layers, in the feature sequence extraction network is determined according to actual needs, and is not limited herein.
Fig. 4 is a schematic diagram of a method for generating tag information according to an exemplary embodiment of the present application.
In some embodiments, the computer device processes the multiband semantic feature sequence of the audio to be matched using a bidirectional long short-term memory network to obtain a forward embedded vector and a reverse embedded vector, and obtains the feature information of the audio to be matched by splicing the forward embedded vector and the reverse embedded vector.
Then, the computer device inputs the feature information of the audio to be matched into a first classification network and a second classification network respectively, and determines a first classification result and a second classification result, where the first classification result is used to characterize the confidence degrees of the audio to be matched under a plurality of categories obtained based on audio style classification, and the second classification result is used to characterize the confidence degrees of the audio to be matched under a plurality of categories obtained based on audio emotion classification.
Optionally, the first classification network and the second classification network are both sigmoid classifiers. Using sigmoid classifiers, the confidence degrees of the audio to be matched under each audio style and each audio emotion can be obtained, so that the finally obtained tag information of the audio to be matched includes confidence degrees under a plurality of categories, which improves the accuracy of audio classification.
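A minimal sketch of two such sigmoid classification heads is given below; the feature dimension and the numbers of style and emotion categories are assumptions for illustration.

```python
import torch
import torch.nn as nn

feature_dim, n_styles, n_emotions = 512, 10, 6
style_head = nn.Sequential(nn.Linear(feature_dim, n_styles), nn.Sigmoid())      # first classification network
emotion_head = nn.Sequential(nn.Linear(feature_dim, n_emotions), nn.Sigmoid())  # second classification network

feature_info = torch.randn(1, feature_dim)          # feature information of the audio to be matched
style_confidences = style_head(feature_info)         # confidence per audio style category
emotion_confidences = emotion_head(feature_info)     # confidence per audio emotion category
print(style_confidences.shape, emotion_confidences.shape)   # torch.Size([1, 10]) torch.Size([1, 6])
```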
Fig. 5 is a flowchart of an audio matching method provided in an exemplary embodiment of the present application.
Step 510: the computer device determines the tag information of the audio to be matched, and performs similarity calculation between the tag information of the audio to be matched and the tag information of each audio contained in the audio library. The computer device selects at least one candidate audio from the audio library according to the correlation degree between the tag information of each audio and the tag information of the audio to be matched, completing the primary screening.
Step 530: the computer device determines the target candidate audio that matches the audio to be matched from the at least one candidate audio. In this process, the computer device performs correlation calculation between the feature information of the audio to be matched and the feature information of each of the at least one candidate audio. The computer device selects, from the at least one candidate audio, the candidate audio whose feature information has the highest correlation degree with the feature information of the audio to be matched as the target candidate audio, and the audio matching process ends.
In some embodiments, the characteristic information of the audio is represented in the form of a vector, i.e. the characteristic information of the audio is characterized by a characteristic vector. The degree of correlation (also referred to as similarity) between feature information can be measured by the spatial similarity (e.g., cosine distance) between vectors.
Fig. 6 is a schematic diagram of spatial similarity calculation provided in an exemplary embodiment of the present application.
In FIG. 6, the similarity of vector A and vector B in the vector space is measured by calculating the cosine cos θ of the angle between vector A and vector B.
The calculation formula of the cosine of the included angle is as follows:

$$T(x, y) = \cos\theta = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$

where T(x, y) denotes the cosine of the angle between vector x and vector y, x·y denotes the inner product of vector x and vector y, ‖x‖‖y‖ denotes the product of the moduli of vector x and vector y, x_i denotes the i-th element in vector x, y_i denotes the i-th element in vector y, i is a positive integer with i ∈ [1, n], and n denotes the number of elements contained in vector x (or vector y).
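The following NumPy sketch applies this cosine similarity to both matching stages; the tag vectors, feature vectors and the screening threshold are illustrative assumptions.

```python
import numpy as np

def cosine(x: np.ndarray, y: np.ndarray) -> float:
    """T(x, y): cosine of the angle between vector x and vector y."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Primary screening: keep library audios whose tag similarity is high enough.
tag_query = np.array([0.81, 0.64, 0.05, 0.90])
tag_library = {"audio_a": np.array([0.70, 0.55, 0.10, 0.88]),
               "audio_b": np.array([0.05, 0.02, 0.95, 0.10])}
candidates = [name for name, tag in tag_library.items() if cosine(tag_query, tag) > 0.8]

# Final selection: highest feature similarity among the candidate audios.
feat_query = np.random.randn(512)
feat_library = {name: np.random.randn(512) for name in candidates}
target = max(candidates, key=lambda name: cosine(feat_query, feat_library[name]))
print(candidates, target)
```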
According to the audio matching method described above, the feature information of the audio to be matched is used multiple times and no other features need to be determined, which reduces the number of models to be selected and trained. At the same time, the audio is matched using both its tag information and its feature information, so that the feature information generated by the trained model better fits the semantic expression of the original audio, and the similarity between the determined target candidate audio and the audio to be matched is improved.
In addition, in the process of determining the multiband semantic feature sequence, a multi-layer, multi-domain interaction mechanism is introduced, so that the generated interaction feature information retains both the time domain characteristics and the frequency domain characteristics of the audio to be matched. At the same time, the interaction mechanism enables the high-level network to learn the low-level features, which helps improve the semantic representation capability of the feature information and the accuracy of the determined tag information of the audio.
In some embodiments, the audio to be matched is audio in a target video; the audio matching method further includes: replacing the audio to be matched in the target video with the target candidate audio to obtain the processed target video.
In some embodiments, the target video includes the audio to be matched. In one example, the target video is a movie, and the audio to be matched is the audio used in the movie. Optionally, the copyright of the audio to be matched has expired, and the audio to be matched in the target video needs to be replaced so as not to infringe the copyright of the audio to be matched.
In some embodiments, after determining the target candidate audio, the computer device performs duration equalization processing on the target candidate audio according to the playing duration of the audio to be matched, so as to obtain the processed target candidate audio. Optionally, the computer device replaces the audio to be matched in the target video with the processed target candidate audio.
For example, in the case where the play time period of the audio to be matched is longer than the play time period of the target candidate audio, the computer device lengthens the play time period of the target candidate audio. For example, a certain period of time in the target candidate audio is repeated so that the play duration of the extended target candidate audio is equal to the play duration of the audio to be matched.
For another example, in the case where the playing time length of the audio to be matched is smaller than the playing time length of the target candidate audio, the computer device clips the playing time length of the target candidate audio, for example, clips a header or a tail of the target candidate audio, so that the playing time length of the clipped target candidate audio is equal to the playing time length of the audio to be matched.
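A minimal sketch of this duration equalization, operating directly on sample arrays (an assumption for illustration), is given below.

```python
import numpy as np

def equalize_duration(candidate: np.ndarray, target_len: int) -> np.ndarray:
    """Loop or clip the target candidate audio so its play duration equals target_len samples."""
    if len(candidate) < target_len:
        # Lengthen: repeat the audio until it is long enough.
        repeats = int(np.ceil(target_len / len(candidate)))
        candidate = np.tile(candidate, repeats)
    # Clip the tail so the durations are equal.
    return candidate[:target_len]

sr = 16000
audio_to_match = np.random.randn(sr * 10)         # 10 s of audio to be matched
target_candidate = np.random.randn(sr * 7)        # 7 s target candidate audio
processed = equalize_duration(target_candidate, len(audio_to_match))
print(len(processed) / sr)                        # 10.0
```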
This method improves the degree of automation of replacing the audio to be matched in the target video, reduces the labor cost of the replacement process, and increases the replacement speed. At the same time, it avoids the subjectivity of manually selecting the target candidate audio during audio replacement, ensures that the playing effect of the target video after audio replacement is close to that before the replacement, and reduces the impact of audio replacement on the look and feel of the target video.
The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.
Fig. 7 shows a block diagram of an audio matching device according to an exemplary embodiment of the present application. The apparatus 700 may include: a feature acquisition module 710, a tag determination module 720, an audio screening module 730, and an audio determination module 740.
The feature acquisition module 710 is configured to obtain feature information of the audio to be matched, where the feature information of the audio to be matched is used to characterize semantic features of the audio to be matched.
The tag determination module 720 is configured to determine tag information of the audio to be matched according to the feature information of the audio to be matched, where the tag information of the audio to be matched includes confidence degrees corresponding to the audio to be matched under multiple categories, and the confidence degrees are used for characterizing the correlation degrees between the audio to be matched and the categories.
The audio screening module 730 is configured to select at least one candidate audio from the audio library according to the tag information of the audio to be matched and the tag information of each audio contained in the audio library.
The audio determination module 740 is configured to determine a target candidate audio that matches the audio to be matched from the at least one candidate audio.
In some embodiments, the tag determination module includes: a confidence determining unit, configured to determine the classification results of the audio to be matched according to the feature information of the audio to be matched through a plurality of different classification networks, where different classification networks correspond to different classification modes, and the classification result determined by each classification network includes: confidence degrees of the audio to be matched under a plurality of categories of the classification mode corresponding to that classification network; and a tag generation unit, configured to determine the tag information of the audio to be matched according to the classification results respectively determined by the plurality of different classification networks.
In some embodiments, the plurality of different classification networks includes a first classification network and a second classification network, the first classification network corresponds to a classification based on audio styles, and the second classification network corresponds to a classification based on audio moods; the confidence determining unit is configured to determine, through the first classification network, a first classification result of the audio to be matched according to the feature information of the audio to be matched, where the first classification result includes: confidence degrees corresponding to the audio to be matched respectively under a plurality of categories obtained based on audio style classification; determining a second classification result of the audio to be matched according to the characteristic information of the audio to be matched through the second classification network, wherein the second classification result comprises: and the confidence degrees of the audio to be matched, which are respectively corresponding to the audio to be matched, are obtained under a plurality of categories based on the audio emotion classification.
In some embodiments, the tag generation unit is configured to: for each classification network, select, according to the result screening condition corresponding to the classification network, at least one confidence degree meeting the result screening condition from the classification result determined by the classification network, to obtain the filtered classification result corresponding to the classification network; and integrate the filtered classification results respectively corresponding to the classification networks to obtain the tag information of the audio to be matched.
In some embodiments, the feature acquisition module 710 includes: a sequence obtaining unit, configured to obtain a multiband semantic feature sequence of the audio to be matched, where the multiband semantic feature sequence includes semantic features respectively corresponding to a plurality of audio frames obtained by framing the audio to be matched; and a feature generating unit, configured to generate feature information of the audio to be matched according to the multiband semantic feature sequence.
In some embodiments, the sequence acquisition unit comprises: the characteristic extraction subunit is used for extracting time domain characteristic information and frequency domain characteristic information of the audio to be matched, wherein the time domain characteristic information is used for representing the characteristics of the audio to be matched in the time domain dimension, and the frequency domain characteristic information is used for representing the characteristics of the audio to be matched in the frequency domain dimension; the feature interaction subunit is used for carrying out fusion processing on at least one intermediate time domain feature in the time domain feature information extraction process and at least one intermediate frequency domain feature in the frequency domain feature information extraction process to obtain interaction feature information of the audio to be matched, wherein the interaction feature information is used for representing interaction features of the audio to be matched between the time domain dimension and the frequency domain dimension; and the sequence generation subunit is used for obtaining the multiband semantic feature sequence of the audio to be matched according to the time domain feature information, the frequency domain feature information and the interaction feature information.
In some embodiments, the at least one intermediate time-domain feature and the at least one intermediate frequency-domain feature form at least one feature set, the feature set comprising a set of corresponding intermediate time-domain features and intermediate frequency-domain features; the feature interaction subunit is configured to, for each feature group, splice a group of corresponding intermediate time domain features and intermediate frequency domain features included in the feature group, and obtain a spliced feature corresponding to the feature group; and carrying out feature extraction processing on the spliced features respectively corresponding to the feature groups according to a cascading order to obtain the interactive feature information of the audio to be matched.
In some embodiments, the feature extraction subunit is configured to: processing the audio to be matched through a time domain feature extraction network to obtain time domain feature information of the audio to be matched; performing time-frequency conversion on the audio to be matched to obtain frequency spectrum information of the audio to be matched; and processing the frequency spectrum information through a frequency domain feature extraction network to obtain the frequency domain feature information of the audio to be matched.
In some embodiments, the sequence generating subunit is configured to splice the time domain feature information, the frequency domain feature information, and the interaction feature information to obtain fusion feature information; carrying out pooling treatment on the fusion characteristic information by adopting a plurality of different pooling modes to obtain a plurality of pooling results; and generating the multiband semantic feature sequence of the audio to be matched according to the plurality of pooling results.
In some embodiments, the feature generating unit is configured to perform context semantic association processing on the multiband semantic feature sequence according to a positive timing direction to obtain a forward embedded vector, where the forward embedded vector is used to characterize semantic features of the audio to be matched in a playing direction; performing context semantic association processing on the multiband semantic feature sequence according to a time sequence reverse direction to obtain a reverse embedded vector, wherein the reverse embedded vector is used for representing semantic features of the audio to be matched in a reverse play direction; and splicing the forward embedded vector and the reverse embedded vector to obtain the characteristic information of the audio to be matched.
In some embodiments, the audio to be matched is audio in a target video; the apparatus 700 further includes a module configured to replace the audio to be matched in the target video with the target candidate audio to obtain the processed target video.
It should be noted that, when the apparatus provided in the above embodiments implements its functions, the division into the above functional modules is only used as an example. In practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; for their specific implementation processes, refer to the method embodiments, which are not repeated here. For the beneficial effects of the apparatus provided in the above embodiments, also refer to the method embodiments.
Fig. 8 shows a block diagram of a computer device according to an exemplary embodiment of the present application.
In general, the computer device 800 includes: a processor 801 and a memory 802.
The processor 801 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 801 may be implemented in hardware in at least one of digital signal processing (Digital Signal Processing, DSP), field programmable gate array (Field Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA). The processor 801 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also referred to as a central processing unit (Central Processing Unit, CPU), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 801 may be integrated with a graphics processor (Graphics Processing Unit, GPU) responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 801 may also include an AI processor for processing computing operations related to machine learning.
The memory 802 may include one or more computer-readable storage media, which may be tangible and non-transitory. The memory 802 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 802 stores a computer program that is loaded and executed by the processor 801 to implement the audio matching method provided by the above method embodiments.
Those skilled in the art will appreciate that the architecture shown in fig. 8 is not limiting and that more or fewer components than shown may be included or that certain components may be combined or that a different arrangement of components may be employed.
The embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, the computer program being loaded and executed by a processor to implement the audio matching method provided by the above-mentioned method embodiments.
The computer readable medium may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes random access Memory (Random Access Memory, RAM), erasable programmable read-Only Memory (EPROM), electrically erasable programmable read-Only Memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), flash Memory or other solid state Memory technology, high density digital video disk (Digital Video Disc, DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to the ones described above.
The embodiments of the present application also provide a computer program product, which includes a computer program, where the computer program is stored in a computer readable storage medium, and a processor reads and executes the computer program from the computer readable storage medium to implement the audio matching method provided in the foregoing method embodiments.
It should be understood that references herein to "a plurality" are to two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may indicate that: A exists alone, both A and B exist, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
It should be noted that, information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals referred to in this application are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of relevant data is required to comply with relevant laws and regulations and standards of relevant countries and regions. For example, the target voice data referred to in this application are all acquired with sufficient authorization.
The foregoing description of the preferred embodiments is merely illustrative of the present application and is not intended to limit the application to the particular embodiments shown; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the present application.