CN118312638B

Movatterモバイル変換

Info

Publication number: CN118312638B
Application number: CN202410565110.6A
Authority: CN
Inventors: 甘蓓
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2024-05-09
Filing date: 2024-05-09
Publication date: 2024-12-31
Anticipated expiration: 2044-05-09
Also published as: CN118312638A

Abstract

The embodiment of the application discloses an audio retrieval method, an apparatus, an electronic device and a storage medium, wherein the method comprises the steps of obtaining audio to be retrieved, carrying out semantic splitting processing on the audio to be retrieved to obtain a plurality of sub-audios with independent semantics, obtaining a plurality of reference audios corresponding to the sub-audios and first similarity between the sub-audios and each reference audio respectively according to each sub-audio, determining at least one candidate audio matched with the sub-audio in the plurality of reference audios according to the first similarity, determining second similarity of the candidate audio corresponding to the audio to be retrieved, and determining target audio from the candidate audios according to the second similarity. In the embodiment of the application, the audio to be searched is split into a plurality of independent semantic sub-audios, and then the stepwise searching is carried out from the reference audio based on the sub-audios. Therefore, the accuracy of audio retrieval can be improved by the scheme.

Description

Audio retrieval method, device, electronic equipment and storage medium

Technical Field

The present application relates to the field of audio processing technologies, and in particular, to an audio retrieval method, an audio retrieval device, an electronic device, and a storage medium.

Background

Audio retrieval refers to the process of searching and matching audio data using computer technology. With the popularity of content distribution over the internet, the number of audio frequencies has also increased dramatically, making the use of audio retrieval more widespread.

However, current audio retrieval techniques are generally applicable to audio of only a single sound source, and in the face of audio containing a plurality of sound sources (e.g., a multi-person conversation, a plurality of instruments in music, etc.), the accuracy of audio retrieval cannot be ensured.

Disclosure of Invention

The embodiment of the application provides an audio retrieval method, an audio retrieval device, electronic equipment and a storage medium, which can improve the accuracy of audio retrieval.

The embodiment of the application provides an audio retrieval method, which comprises the following steps:

Acquiring audio to be retrieved;

carrying out semantic splitting processing on the audio to be searched to obtain a plurality of independent semantic sub-audios;

For each sub-audio, acquiring a plurality of reference audios corresponding to the sub-audio, and respectively obtaining first similarity between the sub-audio and each reference audio;

Determining at least one candidate audio matched with the sub-audio in the plurality of reference audio according to the first similarity, and determining a second similarity of the candidate audio corresponding to the audio to be retrieved;

And determining target audio from the candidate audio according to the second similarity.

The embodiment of the application also provides an audio retrieval device, which comprises:

the first acquisition unit is used for acquiring the audio to be retrieved;

The splitting unit is used for carrying out semantic splitting processing on the audio to be searched to obtain a plurality of sub-audio with independent semantics;

A second obtaining unit, configured to obtain, for each sub-audio, a plurality of reference audio corresponding to the sub-audio, and a first similarity between the sub-audio and each reference audio, respectively;

A first determining unit, configured to determine at least one candidate audio that matches the sub-audio in the plurality of reference audio according to the first similarity, and determine a second similarity that corresponds to the audio to be retrieved for the candidate audio;

and the second determining unit is used for determining target audio from the candidate audio according to the second similarity.

In some embodiments, the second acquisition unit is specifically configured to:

extracting the characteristics of the sub-audio to obtain audio characteristics;

The method comprises the steps of obtaining an initial feature vector matched with the audio feature from a plurality of preset initial feature vectors as a reference feature vector, wherein the plurality of initial feature vectors are extracted from a plurality of preset initial audio;

performing audio tracing processing on the reference feature vector to obtain initial audio corresponding to the reference feature vector;

And taking the initial audio corresponding to the reference feature vector as the reference audio.

In some embodiments, the second acquisition unit is specifically further configured to:

Sorting the plurality of initial feature vectors according to the sequence of the candidate similarity from large to small to obtain sorted initial feature vectors;

determining the first X initial feature vectors in the ordered initial feature vectors as initial feature vectors matched with the audio features, wherein X is a positive integer;

And taking the initial feature vector matched with the audio feature as the reference feature vector.

For each initial feature vector, carrying out quantization processing on the initial feature vector to obtain a sub-vector space corresponding to the initial feature vector, wherein the sub-vector space comprises a plurality of sub-vectors with the same dimension, and the dimension of the sub-vector is smaller than that of the initial feature vector;

Clustering the sub-vector space by a preset clustering algorithm to obtain a clustering center of the sub-vector space;

coding the sub-vector space based on the clustering center to obtain category codes corresponding to the sub-vector space;

In some embodiments, the second acquisition unit is specifically configured to:

Detecting the vector quantity of the reference feature vectors corresponding to the reference audio aiming at the reference audio;

If the number of the vectors is one, similarity calculation is carried out on the reference feature vector corresponding to the reference audio and the audio feature corresponding to the sub audio, so that first similarity between the sub audio and the reference audio is obtained;

and if the number of the vectors is multiple, respectively carrying out similarity calculation on a plurality of reference feature vectors corresponding to the reference audio and the audio features corresponding to the sub audio to obtain a plurality of initial similarities, and fusing the plurality of initial similarities to obtain a first similarity between the sub audio and the reference audio.

In some embodiments, the first determining unit is specifically configured to:

sequencing the plurality of reference audios according to the sequence from the big similarity to the small similarity to obtain sequenced reference audios;

and determining the first k reference audios in the sorted reference audios as the candidate audios, wherein k is a positive integer.

In some embodiments, the second determining unit is specifically configured to:

And fusing the first similarity of the candidate audio corresponding to each sub-audio aiming at each candidate audio to obtain the second similarity of the candidate audio corresponding to the audio to be searched.

In some embodiments, the second determining unit is specifically further configured to:

Acquiring fusion weights of the candidate audios corresponding to each sub-audio and the number of the plurality of sub-audios;

Based on the fusion weight, carrying out weighted summation on the first similarity of each sub-audio corresponding to the candidate audio to obtain fusion similarity;

and calculating the quotient of the fusion similarity and the number of the plurality of sub-audio frequencies to obtain the second similarity.

In some embodiments, the splitting unit comprises:

The identifying subunit is used for identifying the jump point in the audio to be searched, wherein the jump point is a time node with the semantic change in the audio to be searched;

And the splitting subunit is used for carrying out semantic splitting processing on the audio to be searched based on the jump point to obtain a plurality of independent semantic sub-audios.

In some embodiments, the identification subunit is specifically configured to:

Extracting features of the audio to be searched to obtain a feature sequence, wherein each feature in the feature sequence corresponds to one time node in the audio to be searched;

dividing the feature sequence into a first sub-feature sequence and a second sub-feature sequence;

performing covariance matrix calculation on the first sub-feature sequence, the second sub-feature sequence and the feature sequence to obtain a first covariance matrix corresponding to the first sub-feature sequence, a second covariance matrix corresponding to the second sub-feature sequence and a third covariance matrix corresponding to the feature sequence;

Determining a logarithmic maximum likelihood ratio according to a first covariance matrix, the second covariance matrix and the third covariance matrix, and determining a time node corresponding to the last feature in the first sub-feature sequence as the jump point if the logarithmic maximum likelihood ratio meets a preset condition.

And determining the candidate audio with the largest second similarity in all the candidate audios matched with the plurality of reference audios as the target audio.

The embodiment of the application also provides electronic equipment which comprises a memory, wherein a plurality of instructions are stored in the memory, and the processor loads the instructions from the memory so as to execute the steps in the audio retrieval method provided by the embodiment of the application.

The embodiment of the application also provides a computer readable storage medium, which stores a plurality of instructions, wherein the instructions are suitable for being loaded by a processor to execute the steps in the audio retrieval method provided by the embodiment of the application.

The embodiment of the application also provides a computer program product, which comprises a computer program/instruction, wherein the computer program/instruction realizes the steps in the audio retrieval method provided by the embodiment of the application when being executed by a processor.

The embodiment of the application can split the audio to be searched into a plurality of sub-audios with independent semantics after the audio to be searched is acquired, acquire a plurality of reference audios corresponding to the sub-audios and first similarity between the sub-audios and each reference audio respectively according to each sub-audio, then determine at least one candidate audio matched with the sub-audio in the plurality of reference audios according to the first similarity, determine second similarity of the candidate audio corresponding to the audio to be searched, and finally determine target audio from the candidate audios according to the second similarity.

In the embodiment of the application, for each sub-audio of the audio to be retrieved, some similar reference audio can be recalled preliminarily according to the sub-audio, then similar candidate audio is recalled from a plurality of reference audio according to the first similarity between the sub-audio and each reference audio, on the one hand, the recall efficiency is improved, and on the other hand, as the sub-audio has independent semantics, the semantics of the sub-audio can be captured more accurately, and the recall accuracy is improved. And then, according to the second similarity of the candidate audio corresponding to the audio to be searched, the target audio is screened out from the candidate audio, so that the searching range can be reduced on one hand, and on the other hand, the target audio which is more similar to the audio to be searched in the whole can be screened out from the candidate audio. Therefore, the accuracy of audio retrieval is improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1a is a schematic view of a scenario of an audio retrieval method according to an embodiment of the present application;

FIG. 1b is a flowchart of an audio retrieval method according to an embodiment of the present application;

FIG. 1c is a schematic diagram illustrating splitting of sub-audio in audio to be retrieved according to an embodiment of the present application;

FIG. 1d is a schematic diagram illustrating a first similarity comparison between sub-audio and reference audio according to an embodiment of the present application;

FIG. 1e is a schematic diagram illustrating a second similarity comparison between sub-audio and candidate audio according to an embodiment of the present application;

Fig. 2a is a schematic diagram of an audio retrieval method according to an embodiment of the present application applied in a server scenario;

FIG. 2b is a schematic diagram of an audio retrieval method application framework according to an embodiment of the present application;

fig. 3 is a schematic structural diagram of an audio retrieval device according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.

The embodiment of the application provides an audio retrieval method, an audio retrieval device, electronic equipment and a storage medium.

The audio retrieval device can be integrated in an electronic device, and the electronic device can be a terminal, a server and other devices. The terminal can be a mobile phone, a tablet personal computer, an intelligent Bluetooth device, a notebook computer, a personal computer (Personal Computer, PC) or other devices, and the server can be a single server or a server cluster formed by a plurality of servers.

In some embodiments, the audio retrieval apparatus may also be integrated in a plurality of electronic devices, for example, the audio retrieval apparatus may be integrated in a plurality of servers, and the audio retrieval method of the present application is implemented by the plurality of servers.

In some embodiments, the server may also be implemented in the form of a terminal.

For example, referring to fig. 1a, fig. 1a shows an application scenario schematic diagram of an audio retrieval method according to an embodiment of the present application.

As shown in fig. 1a, the application scenario may include a server 100 and a terminal 200, where the server 100 may be communicatively connected to the terminal 200, in an actual application, an audio database storing a large amount of reference audio may be set in the server 100, the server 100 may receive an audio retrieval request sent by the terminal 200, where the audio retrieval request carries audio to be retrieved, and then the server 100 may retrieve a target audio matching the audio to be retrieved from the audio database and return the target audio to the terminal 200.

It will be appreciated that in the specific embodiment of the present application, related data of audio to be retrieved, audio records, etc. sent by a user are involved, and when the above embodiments of the present application are applied to specific products or technologies, user permission or consent is required to be obtained, and the collection, use and processing of related data are required to comply with related laws and regulations and standards of related countries and regions.

The following will describe in detail. The numbers of the following examples are not intended to limit the preferred order of the examples.

Artificial intelligence (ARTIFICIAL INTELLIGENCE, AI) is a technology that utilizes a digital computer to simulate the human perception environment, acquire knowledge, and use knowledge, which can enable machines to function similar to human perception, reasoning, and decision. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and other directions.

Key technologies for the speech technology (Speech Technology) are an automatic speech recognition technology and a speech synthesis technology, and a voiceprint recognition technology. The method can enable the computer to listen, watch, say and feel, is the development direction of human-computer interaction in the future, and voice becomes one of the best human-computer interaction modes in the future.

Machine learning (MACHINE LEARNING, ML) is a multi-domain interdisciplinary, involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and the like. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, induction learning, and the like.

With research and progress of artificial intelligence technology, research and application of artificial intelligence technology are being developed in various fields, such as common smart home, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned, automatic driving, unmanned aerial vehicle, robot, smart medical treatment, smart customer service, car networking, smart transportation, etc., and it is believed that with the development of technology, artificial intelligence technology will be applied in more fields and become more and more important value.

In this embodiment, an audio retrieval method is provided, as shown in fig. 1b, and the specific flow of the audio retrieval method may be as follows:

101. and acquiring the audio to be retrieved.

The audio to be searched is an audio file to be searched currently, and specifically may be an audio file to be searched and found in an audio database. Alternatively, the audio to be retrieved may include, but is not limited to, music, speech, various types of recordings (e.g., environmental recordings, conference recordings, etc.), audio tracks of specific multimedia (e.g., audio extracted from a movie, television program, or online video), and so forth. For example, in practical applications, the audio to be retrieved may be an unknown audio file, an incomplete audio file, an audio file mixed with noise, or the like.

In some embodiments, the audio to be retrieved may be obtained through uploading by the user, or may be obtained through selecting from a given plurality of audio by the user, which is not limited herein. The reference audio may be obtained through self-recording, or may be obtained through downloading from a network, or may be obtained through self-generating by audio generating software, which is not limited herein.

102. And carrying out semantic splitting processing on the audio to be retrieved to obtain a plurality of sub-audio with independent semantics.

The semantic splitting processing refers to splitting, extracting or separating parts of different semantics in the audio.

Wherein having independent semantics indicates that the finger sub-audio segments are semantically relatively independent, i.e. each sub-audio contains specific semantic information or content, without having to rely on the whole audio to be retrieved or other sub-audio to understand its meaning. Illustratively, for example, the audio to be retrieved is a dialog segment, including sub-audio 1 and sub-audio 2, sub-audio 1 being "today's weather is true and sunny". "sub-audio 2 is" we go to a picnic bar in the open, where you want to go, "where are two sub-audios containing information about weather and planned activities respectively, they are semantically relatively independent, each segment having its own clear meaning, and not requiring the entire dialog content to be understood. For another example, the audio to be searched is a dialogue segment which is a news report segment and comprises sub-audio 3 and sub-audio 4, wherein the sub-audio 3 is "traffic accident occurred in yesterday", and thus a plurality of people are injured. "sub-audio 4 is" the relevant department has initiated rescue action, investigate the cause of the accident. "wherein the two sub-audios relate to the information of traffic accident occurrence and related rescue actions, respectively, which are semantically independent, the meaning of which can be understood independently without depending on other contents.

In some embodiments, in the case that the audio to be retrieved does not have semantics, the audio to be retrieved may be subjected to semantic splitting processing through the audio features to obtain a plurality of sub-audio with independent semantics, that is, different sub-audio in the split plurality of sub-audio may have different audio features, for example, the plurality of sub-audio includes sub-audio 5 and sub-audio 6, the sub-audio 5 corresponds to an audio feature of one musical instrument (such as a piano), and the sub-audio 6 corresponds to an audio feature of another musical instrument (such as a small size).

It is to be understood that the audio features do not fall entirely within a particular class, i.e., energy features, but may be time domain features or frequency domain features, such as, but not limited to, waveforms, average energy, spectrograms, etc.

In some embodiments, in step 102, the step of performing semantic splitting processing on the audio to be retrieved to obtain a plurality of specific embodiments with independent semantic sub-audio may include:

A1, identifying a trip point in the audio to be searched, wherein the trip point is a time node with semantically changed in the audio to be searched.

In some embodiments, the trip points in the band search tones may be identified by a pre-trained trip point prediction model, where the trip point prediction model may output corresponding trip points based on the input tone to be searched. For example, when training the trip point prediction model, a large number of audio samples may be prepared, then the trip points in the audio samples are marked, and then the audio samples marked with the trip points are input into the initial model for model training, so that the initial model can accurately identify the trip points in the audio to obtain the trip point prediction model. Alternatively, the initial model may select a model suitable for audio signal processing, such as a Recurrent Neural Network (RNN), convolutional Neural Network (CNN), or Deep Neural Network (DNN). These models may be used to learn patterns and features in the audio data. Alternatively, a supervised learning approach may be used during the training of the initial model, with the model parameters being adjusted by the training data.

In other embodiments, in step A1, the specific implementation of step "identify a trip point in audio to be retrieved" may include:

And A11a, extracting features of the audio to be searched to obtain a feature sequence, wherein each feature in the feature sequence corresponds to one time node in the audio to be searched.

Illustratively, the audio to be retrieved may be feature extracted by mel-frequency cepstral coefficients (Mel Frequency Cepstrum Coefficient, MFCC) to obtain a feature sequence. Optionally, the feature vector dimension of each feature in the feature sequence and the feature quantity in the feature sequence may be set in a custom manner according to actual requirements, for example, the feature vector dimension may be 128, the feature quantity is N, and then the feature sequence may be obtained as。

And A12a, dividing the characteristic sequence into a first sub-characteristic sequence and a second sub-characteristic sequence.

Illustratively, the process of identifying the trip point in the audio to be retrieved may be converted into a process of speech segmentation, and optionally, a bayesian information criterion (Bayesian Information Criterion, BIC) may be employed to perform a speech segmentation process on the audio to be retrieved according to the feature sequence.

Specifically, assuming that there is one trip point for the time node pair corresponding to the i-th feature x_i in the audio to be retrieved, the problem of finding the trip point can be converted into a problem of selecting the following two hypothesis models:

The model H₀ is a model with a feature sequence x_1,x_2,…,x_N conforming to gaussian distribution, that is, no jump points exist in the audio to be retrieved. Model H₁ is that the feature sequence (x_1,x_2,…,x_i) conforms to a gaussian distribution, and the feature sequence (x_i+1,x_i+2,…,x_N) also conforms to a gaussian distribution model, i.e., the time node corresponding to the spectral feature x_i is the trip point existing in the audio to be retrieved.

Therefore, the hypothesis model H₁ may be established, the feature sequence is divided into a first sub-feature sequence (x_1,x_2,…,x_i) and a second sub-feature sequence (x_i+1,x_i+2,…,x_N), and the two sub-feature sequences are verified to determine whether the time node corresponding to x_i is a trip point in the audio to be retrieved.

And A13a, respectively carrying out covariance matrix calculation on the first sub-feature sequence, the second sub-feature sequence and the feature sequence to obtain a first covariance matrix corresponding to the first sub-feature sequence, a second covariance matrix corresponding to the second sub-feature sequence and a third covariance matrix corresponding to the feature sequence.

And A14a, determining a logarithmic maximum likelihood ratio according to the first covariance matrix, the second covariance matrix and the third covariance matrix, and determining a time node corresponding to the last feature in the first sub-feature sequence as a jump point if the logarithmic maximum likelihood ratio meets a preset condition.

Along the above example, if the time node corresponding to x_i is assumed to be the trip point in the audio to be retrieved. Then the corresponding log maximum likelihood ratio for x_i is:

Wherein R (i) is x_i corresponding to the logarithmic maximum likelihood ratio, N represents the number of features in the feature sequence, N₁ represents the number of features in the first sub-feature sequence, N₂ represents the number of features in the second sub-feature sequence,Is the covariance matrix of the feature sequence x_1,x_2,…,x_N,Is a first covariance matrix of the first sub-feature sequence x_1,x_2,…,x_i,Is the second covariance matrix of the second sub-feature sequence x_i+1,x_i+2,…,x_N.Is a matrixIs a determinant of (2).

Optionally, a penalty term may also be added to this model selection problem, such that the resulting BIC value is:

Wherein P is a penalty value, and the expression is as follows:

Wherein d is the feature dimension of the spectrum feature, and N is the feature sequence feature quantity.

Wherein BIC (i) is an evaluation value corresponding to the candidate spectral feature x_i, and λ is a weight corresponding to the penalty value, optionally, if the penalty value needs to have the greatest balancing effect in model selection, λ may be set to 1.

Finally, the objective of solving the above problem is changed from minimum to maximum of solving BIC value, namely:

Then, toAnd solving to obtain x_i corresponding to the first covariance matrix and the second covariance matrix when the maximum likelihood ratio of the logarithm meets the preset condition, so as to obtain the jump point.

It will be appreciated that the number of components,When two independent voice segments exist, a jump point exists; When it is, it means that there is only one independent speech segment and no trip point.

In still other embodiments of the real first sub-feature sequence, in step A1, a specific embodiment of step "identify a trip point in audio to be retrieved" may include:

A11b, extracting spectral features of the audio to be searched to obtain a plurality of spectral features, wherein each spectral feature in the plurality of spectral features corresponds to a time node in the audio to be searched, and the plurality of spectral features are arranged in sequence from beginning to end according to the time node.

Where a spectral feature is a set of features that describe the characteristics of audio in the frequency domain. Optionally, the audio features include, but are not limited to, a spectrogram, a spectral envelope, a spectral bandwidth, mel-frequency cepstral coefficients, and the like.

The time node may refer to a time in duration of the audio to be retrieved, for example, the duration of the audio to be retrieved is 0 to 30 seconds, and the time node may be a time of 10 th second, 15 th second, or the like. The interval between the time nodes may be set according to actual requirements, for example, determined according to a spectrum feature extraction manner, which is not limited herein.

The number and dimension of the spectrum features can be set in a self-defined manner according to actual requirements, and are not limited herein.

In this embodiment, the frequency spectrum feature extraction can be performed on the audio to be retrieved by the MFCC extraction method to obtain a plurality of frequency spectrum features, and since the MFCC extraction method is to convert the frequency spectrum of the audio into a set of feature vectors by simulating the auditory response of human ears, the frequency spectrum feature extraction method is suitable for voice recognition, audio classification and the like, the extracted frequency spectrum features can be convenient for recognizing sub-audio with independent semantics. Alternatively, the vector dimension of the spectral features may be 128.

Illustratively, the audio to be retrieved is subjected to spectral feature extraction, for example, by an MFCC extraction method, and the plurality of spectral features obtained include x_1,x_2,…,x_i, where x_N represents an nth spectral feature, and N is a positive integer. Wherein, (x_1,x_2,…,x_N) may form a spectrum feature sequence, each spectrum feature in the spectrum feature sequence corresponds to a time node, and the spectrum feature sequences are arranged in sequence from first to last according to the time node, that is, the time node corresponding to x_N is behind the time node corresponding to x_N-1, the sequence of the time node corresponds to the feature sequence number in the spectrum feature sequence, and the feature sequence number increases in sequence along with the time node.

It is to be understood that (x_1,x_2,…,x_N) may be all or part of a plurality of spectral features, not limited herein.

A12b, determining at least one target spectrum characteristic from a plurality of spectrum characteristics, wherein the target spectrum characteristic is used for dividing the plurality of spectrum characteristics into a plurality of spectrum characteristic sets, and the spectrum characteristics in each spectrum characteristic set accord with Gaussian distribution.

In some embodiments, taking an example of a sequence of spectral features constituted by a plurality of spectral features, a specific implementation of determining at least one target spectral feature from the plurality of spectral features may include:

and screening a spectrum characteristic from the spectrum characteristic sequence as a reference spectrum characteristic.

And then taking the reference spectrum characteristic and the spectrum characteristic before the reference spectrum characteristic in the spectrum characteristic sequence as a spectrum characteristic set to be detected, and judging whether the spectrum characteristic set to be detected accords with Gaussian distribution.

If yes, the reference spectrum feature can be determined as a target spectrum feature, the spectrum feature located in front of the reference spectrum feature in the reference spectrum feature and the spectrum feature sequence is deleted from the spectrum feature sequence to obtain a new spectrum feature sequence, and the step of screening one spectrum feature from the spectrum feature sequence as the reference spectrum feature is carried out based on the new spectrum feature sequence until the spectrum feature set to be tested which accords with Gaussian distribution does not exist in the new spectrum feature sequence.

Illustratively, taking the spectrum feature sequence as (x_1,x_2,…,x_N) as an example, the spectrum feature x_i may be screened from the spectrum feature sequence as a reference spectrum feature, and x_i is the ith spectrum feature in the spectrum feature sequence, where i is a positive integer and i is less than N.

Then, whether the spectrum feature sequence (x_1,x_2,…,x_i) to be detected accords with the Gaussian distribution is detected, and if the spectrum feature sequence (x_1,x_2,…,x_i) to be detected accords with the Gaussian distribution, a time node corresponding to x_i is determined to be a jump point. And deleting (x_1,x_2,…,x_i) from (x_1,x_2,…,x_N) to obtain a new spectrum feature sequence (x_i+1,x_i+2,…,x_N), and then screening a reference spectrum feature such as x_j from the new spectrum feature sequence (x_i+1,x_i+2,…,x_N), wherein j is a positive integer, j is greater than i+1 and less than N, detecting whether the new spectrum feature sequence (x_i+1,x_i+2,…,x_j) accords with Gaussian distribution, and the like until no spectrum feature set to be detected which accords with Gaussian distribution exists in the new spectrum feature sequence.

If the sequence of spectral features to be measured (x_1,x_2,…,x_i) does not conform to the gaussian distribution, x_i+1 may be taken as a new reference spectral feature and it may be detected whether the new sequence of spectral features to be measured (x_1,x_2,…,x_i+1) conforms to the gaussian distribution based on the new reference spectral feature, and so on until the set of spectral features to be measured conforms to the gaussian distribution.

It will be appreciated that if the spectral feature sequence (x_1,x_2,…,x_i) conforms to a gaussian distribution, it is indicated that the sub-audio corresponding to the spectral feature sequence (x_1,x_2,…,x_i) has consistent semantics, i.e. has independent semantics.

In this embodiment, the spectral features are screened out from the plurality of spectral features one by one as the reference spectral features, the spectral feature sequence to be detected is constructed based on the reference spectral features, and whether the accurate spectral feature sequence to be detected accords with gaussian distribution is detected to determine whether the reference spectral feature is the target spectral feature, so that the target spectral feature can be accurately determined from the plurality of spectral features, and omission of the target spectral feature is avoided.

In other embodiments, in step a12b, a specific implementation of step "determining at least one target spectral feature from the plurality of spectral features" may include:

And A121b, determining a candidate spectrum characteristic from the plurality of spectrum characteristics.

In some embodiments, one spectral feature may be randomly selected from the plurality of spectral features as a candidate spectral feature, wherein a first spectral feature and a last spectral feature of the plurality of spectral features may be ignored when selecting the candidate spectral feature. Illustratively, taking an example that a plurality of spectral features form a spectral feature sequence, when selecting a candidate spectral feature, a spectral feature near the middle in the spectral feature sequence may be preferentially selected as the candidate spectral feature, for example, for the spectral feature sequence (x_1,x_2,…,x_N), where N is 11, x₆ may be preferentially selected as the candidate spectral feature.

A122b divides the plurality of spectrum features into a first spectrum feature set and a second spectrum feature set according to the candidate spectrum features, wherein the first spectrum feature set comprises the candidate spectrum features and spectrum features of which the time nodes are positioned before the candidate spectrum features, and the second spectrum feature set comprises spectrum features of which the time nodes are positioned after the candidate spectrum features.

Illustratively, taking the sequence of spectral features as (x_1,x_2,…,x_N) and the candidate spectral feature as x_i as an example, the first set of spectral features is (x_1,x_2,…,x_i) and the second set of spectral features is (x_i+1,x_i+2,…,x_N).

And A123b, determining an evaluation value corresponding to the candidate spectrum feature according to the first spectrum feature set, the second spectrum feature set and the plurality of spectrum features, wherein the evaluation value represents the probability that the first spectrum feature set and the second spectrum feature set accord with Gaussian distribution.

Illustratively, in practical applications, multiple spectral features may be modeled as an independent multi-variate gaussian process, i.eWherein, the method comprises the steps of, wherein,Representing a plurality of spectral features conforming to a gaussian distribution, wherein,D is a feature dimension of a spectrum feature, specifically may be d=128, n is a feature number in a plurality of spectrum features, and R represents a real vector space.

Through the modeling process, the process of searching the trip point can be converted into the process of selecting the following two hypothesis models aiming at the spectrum characteristic sequence (x_1,x_2,…,x_N):

The model H₀ is a model with a spectrum feature sequence x_1,x_2,…,x_N conforming to gaussian distribution, that is, no jump points exist in the audio to be retrieved. The model H₁ is a model in which the first set of spectral features (x_1,x_2,…,x_i) conforms to a gaussian distribution, and the second set of spectral features (x_i+1,x_i+2,…,x_N) also conforms to a gaussian distribution, i.e., the time node corresponding to the spectral feature x_i is a trip point in the audio to be retrieved.

Then, a Bayesian information criterion can be utilized to perform model selection so as to realize the searching of the jumping points in the audio to be searched. Wherein the bayesian information criterion is a model selection criterion for comparing goodness of fit and complexity of a plurality of candidate models given these models, thereby selecting an optimal model. The evaluation value corresponding to the candidate spectral feature x_i may be a BIC value obtained at the spectral feature x_i using bayesian information criterion.

Specifically, in step a123b, determining, according to the first spectral feature set, the second spectral feature set, and the plurality of spectral features, a specific embodiment of the evaluation value corresponding to the candidate spectral feature may include:

A1231b, determining a logarithmic maximum likelihood ratio corresponding to the candidate spectrum feature according to the first spectrum feature set, the second spectrum feature set and the plurality of spectrum features, wherein the logarithmic maximum likelihood ratio characterizes the initial probability that the first spectrum feature set and the second spectrum feature set both accord with Gaussian distribution.

In some embodiments, the first spectral feature set, the second spectral feature set, and the plurality of spectral features may be calculated by a calculation formula of the covariance matrix, to obtain a first covariance matrix corresponding to the first spectral feature set, a second covariance matrix corresponding to the second spectral feature set, and a third covariance matrix corresponding to the plurality of spectral features.

And then, respectively carrying out determinant computation on the first covariance matrix, the second covariance matrix and the third covariance matrix to obtain a first determinant corresponding to the first covariance matrix, a second determinant corresponding to the second covariance matrix and a third determinant corresponding to the third covariance matrix.

Then, the product of the logarithm of the first determinant and the number of spectral features in the first set of spectral features is determined as a first calculated value, the product of the logarithm of the second determinant and the number of spectral features in the second set of spectral features is determined as a second calculated value, and the product of the logarithm of the third determinant and the number of spectral features in the plurality of spectral features is determined as a third calculated value.

And finally, subtracting the first calculated value from the third calculated value and then subtracting the second calculated value from the third calculated value to obtain the logarithmic maximum likelihood ratio corresponding to the candidate spectrum characteristic.

Illustratively, the expression of the log maximum likelihood ratio corresponding to the candidate spectral feature x_i may be as follows:

Wherein R (i) is a corresponding logarithmic maximum likelihood ratio of the candidate spectral feature x_i, N represents the number of spectral features in the plurality of spectral features, N₁ represents the number of spectral features in the first set of spectral features, N₂ represents the number of spectral features in the second set of spectral features,Is the covariance matrix of the plurality of spectral features x_1,x_2,…,x_N,Is the covariance matrix of x_1,x_2,…,x_i,Is the covariance matrix of x_i+1,x_i+2,…,x_N.Is a matrixIs a determinant of (2).

And A1232b, acquiring the feature quantity of the plurality of spectrum features and the feature dimension of the spectrum features, and determining a penalty value according to the feature quantity and the feature dimension, wherein the penalty value is used for correcting the initial probability.

Wherein the penalty value is used to balance the relationship between fitting data and preventing overfitting when selecting the model in this embodiment.

In some embodiments, the expression for the penalty value may be as follows:

wherein P is a penalty value, d is a feature dimension of the spectrum feature, and N is a feature quantity of the plurality of spectrum features.

A1233b, determining the evaluation value according to the logarithmic maximum likelihood ratio and the penalty value.

In some embodiments, a weight corresponding to the penalty value may be obtained, then a product of the weight and the penalty value is calculated to obtain a fourth calculated value, and then the fourth calculated value is subtracted from the log maximum likelihood ratio to obtain the evaluation value.

Along with the above example, after obtaining the log maximum likelihood ratio and the penalty value, the expression of the evaluation value can be as follows:

BIC (i) is an evaluation value corresponding to the candidate spectrum feature x_i, and λ is a weight corresponding to the penalty value, where the weight corresponding to the penalty value can be set in a self-defined manner according to the actual requirement, and is not limited herein. For example, if it is desired to have the penalty value play the greatest balance in model selection, λ may be set to 1.

And A124b, if the evaluation value is the maximum value in the value range corresponding to the evaluation value, determining the candidate spectrum characteristic as the target spectrum characteristic.

When the evaluation value is the maximum value in the value range corresponding to the evaluation value, the probability that the first spectrum feature set and the second spectrum feature set both accord with Gaussian distribution is indicated to be maximum, namely, the probability that the time nodes corresponding to the candidate spectrum features of the first spectrum feature set and the second spectrum feature set are segmented to be jump points is maximum.

In some embodiments, for each of the plurality of spectral features, the evaluation value corresponding to the spectral feature may be obtained by the evaluation value calculation method, and then the spectral feature in which the evaluation value is the maximum value of the plurality of spectral features may be determined as the target spectral feature.

For example, the target spectral feature of the plurality of spectral features may be solved by establishing an argmax function, where the argmax function is used to take the value of the argument when a certain function takes the maximum value, for example, in this embodiment, the following formula may be established to obtain the target spectral feature:

In this embodiment, the process of searching the trip point is converted into the process of selecting the following two assumed models, and then the model selection is performed by using the bayesian information criterion, so that the target spectrum characteristic corresponding to the trip point can be quickly found out from the plurality of spectrum characteristics, and the determination efficiency of the target spectrum characteristic is improved.

And A13b, determining a time node corresponding to the target spectrum characteristic as a jump point.

A2, carrying out semantic splitting processing on the audio to be retrieved based on the trip point to obtain a plurality of independent semantic sub-audios.

As shown in fig. 1c, the trip point includes a time node T1 and a time node T2 in the audio to be searched, where the time node T0 is a start node of the audio to be searched, the time node T3 is an end node of the audio to be searched, and the audio to be searched can be split into 3 sub-audio with independent semantics through the time node T1 and the time node T2, which are respectively sub-audio a, sub-audio b and sub-audio c.

103. And acquiring a plurality of reference audios corresponding to each sub-audio according to each sub-audio, and respectively acquiring first similarity between the sub-audio and each reference audio.

The reference audio may be a reference audio for comparison with the audio to be retrieved, and the reference audio may be a known, complete, noise-free audio, for example, compared with the audio to be retrieved. Alternatively, the reference audio may be stored in the above-described audio database in advance. Wherein, the audio database can also store audio information corresponding to the reference audio, wherein, the audio information can be information describing audio attributes, including but not limited to, the name, duration, format, author, etc. of the audio.

Each sub-audio may be pre-mapped with a plurality of reference audio, and the sub-audio corresponding reference audio may be an audio having similar characteristics to the sub-audio.

In some embodiments, in step 103, the specific embodiment of obtaining the plurality of reference audio corresponding to the sub-audio may include:

S1, extracting features of the sub-audio to obtain audio features.

In some implementations, the sub-audio may be feature extracted by a preset audio feature extractor. Optionally, the extracted audio feature may be matched with the above-mentioned reference feature vector in feature dimension, so as to facilitate the similarity calculation between the extracted audio feature and the reference feature vector. Alternatively, the preset audio feature extractor may be CNN14 (a feature extraction network), and the CNN14 network mainly includes 12 convolution layers and 2 fully-connected layers. The preset audio feature extractor may be another feature extraction network (such as CNN10, CNN6, etc.), and the specific type of feature extractor used may be set according to actual requirements, which is not limited herein.

S2, acquiring an initial feature vector matched with the audio feature from a plurality of preset initial feature vectors as a reference feature vector, wherein the plurality of initial feature vectors are extracted from a plurality of preset initial audio.

In some embodiments, the specific implementation of step S2 may include:

s21, candidate similarity between the audio feature and the plurality of initial feature vectors is obtained.

In some embodiments, the audio feature may be directly compared to each of the plurality of initial feature vectors to obtain candidate similarities.

In other embodiments, the reference feature vector may be obtained by a vector recall method (such as violent enumeration), so the specific embodiment of obtaining candidate similarities between the audio feature and the plurality of initial feature vectors in S21 may include:

S211, for each initial feature vector, carrying out quantization processing on the initial feature vector to obtain a sub-vector space corresponding to the initial feature vector, wherein the sub-vector space comprises a plurality of sub-vectors with the same dimension, and the dimension of the sub-vector is smaller than that of the initial feature vector.

Illustratively, when the initial feature vector is quantized, the initial feature vector may be decomposed into a cartesian product of several low-dimensional vectors, for example, the D-dimensional vector is decomposed into j sets of D/j-dimensional sub-vectors, and then the decomposed low-dimensional vector space is quantized, so that the initial feature vector can be represented by the quantization of the low-dimensional vector. Specifically, the high-dimensional initial feature vector may be split into a plurality of sub-vectors by Product Quantization (PQ) to obtain a sub-vector space corresponding to the initial feature vector.

S212, clustering the sub-vector space through a preset clustering algorithm to obtain a clustering center of the sub-vector space.

Along the above example, each sub-vector space may be aggregated into 256 classes by k-means, i.e. each sub-vector space has 256 cluster centers.

S213, coding the sub-vector space based on the clustering center to obtain category codes corresponding to the sub-vector space.

Along with the above example, the cluster center of each sub-vector may be mapped into a category code (hereinafter referred to as category ID) such that each sub-vector space may be represented by a category ID, where each category ID requires only 8 bitsSuch that the original d=2048-dimensional floating point number (32-bit) vector can be compressed into D/j 8-bit integers.

S214, determining candidate similarity between the audio feature and the initial feature vector based on the category codes and the audio feature.

Optionally, the similarity calculation can be performed by selecting the cluster center in the sub-vector space and the audio feature through category coding, and then comparing all the vectors in the sub-vector space to obtain the vector similar to the audio feature, wherein the similarity calculation can be performed by a vector inner product mode.

Optionally, finding the corresponding centroid through category IDs, stitching centroid vectors corresponding to all category IDs to obtain decompressed vectors, and then calculating distances from the audio features to all sub-vectors in each sub-vector space to obtain candidate similarities.

Alternatively, the audio features may also be partitioned into audio sub-vectors, and then the sum of the distances of each audio sub-vector to each sub-vector spatially corresponding quantized vector is calculated to obtain candidate similarities.

S22, sorting the plurality of initial feature vectors according to the sequence of the candidate similarity from large to small to obtain sorted initial feature vectors.

S23, determining the first X initial feature vectors in the ordered initial feature vectors as initial feature vectors matched with the audio features, wherein X is a positive integer.

S24, taking the initial feature vector matched with the audio features as a reference feature vector.

S3, performing audio tracing processing on the reference feature vector to obtain initial audio corresponding to the reference feature vector.

The preset plurality of initial feature vectors can be stored in a preset vector library, the preset vector library stores a plurality of initial feature vectors and a vector mapping relation table in advance, and the vector mapping relation table records initial audio corresponding to each initial feature vector in the plurality of initial feature vectors in advance, and the initial feature vectors are extracted from the corresponding initial audio. Therefore, after the initial audio is determined, the initial feature vector corresponding to the initial audio can be found from the preset vector library according to the vector mapping relation table. Wherein one initial audio may correspond to one or more initial feature vectors.

S4, taking the initial audio corresponding to the reference feature vector as the reference audio.

In some embodiments, in step 103, a specific embodiment of obtaining the first similarity between the sub-audio and each of the reference audio may include:

for reference audio, the vector quantity of the reference feature vector corresponding to the reference audio is detected.

If the number of the vectors is one, similarity calculation is carried out on the reference feature vector corresponding to the reference audio and the audio feature corresponding to the sub audio, so that first similarity between the sub audio and the reference audio is obtained.

If the number of the vectors is multiple, similarity calculation is carried out on the multiple reference feature vectors corresponding to the reference audio and the audio features corresponding to the sub audio respectively to obtain multiple initial similarities, and the multiple initial similarities are fused to obtain the first similarity between the sub audio and the reference audio.

For example, the specific obtaining manner of the first similarity may be as follows:

B1, obtaining the corresponding reference feature vector of each reference audio from a preset vector library.

The method comprises the steps of presetting a vector library, wherein a plurality of reference feature vectors and a vector mapping relation table are prestored in the preset vector library, the vector mapping relation table records reference audio corresponding to each reference feature vector in the plurality of reference feature vectors in advance, and the reference feature vectors are extracted from the corresponding reference audio. Therefore, after the reference audio is determined, the reference feature vector corresponding to the reference audio can be found from the preset vector library according to the vector mapping relation table. Wherein one reference audio may correspond to one or more reference feature vectors.

And B2, extracting the characteristics of the sub-audio to obtain the audio characteristics.

And B3, for each reference audio, carrying out similarity calculation on the corresponding reference feature vector and the audio features of the reference audio to obtain a first similarity between the sub-audio and the reference audio.

Illustratively, as shown in fig. 1d, for example, the plurality of sub-audios includes a sub-audio a and a sub-audio b, the audio feature corresponding to the sub-audio a is the audio feature a, the audio feature corresponding to the sub-audio b is the audio feature b, the plurality of reference audios includes a reference audio Z1 and a reference audio Z2, the reference feature vector corresponding to the reference audio Z1 includes a reference feature vector 1 and a reference feature vector 3, and the reference feature vector corresponding to the reference audio Z2 includes a reference feature vector 2.

For the reference audio Z2, the similarity calculation may be performed on the reference feature vector 2 and the audio feature a, so as to obtain a first similarity between the reference audio Z2 and the sub-audio a, i.e. the similarity 2 in fig. 1 d. The similarity calculation may be performed on the reference feature vector 2 and the audio feature b to obtain a first similarity between the reference audio Z2 and the sub-audio b, i.e. the similarity 5 in fig. 1 d. Alternatively, in the present embodiment, the similarity calculation may be performed by a vector inner product.

When the number of the reference audio corresponding reference feature vectors is plural, in step B3, performing similarity calculation on the reference audio corresponding reference feature vectors and the audio features to obtain the first similarity between the sub-audio and the reference audio may include:

and respectively carrying out similarity calculation on the audio feature and the reference audio corresponding to a plurality of reference feature vectors to obtain a plurality of initial similarities.

And fusing the initial similarities to obtain a first similarity between the sub-audio and the reference audio.

With reference to fig. 1d again, taking reference audio Z1 as an example, the reference feature vector corresponding to reference audio Z1 includes reference feature vector 1 and reference feature vector 3, for example, when calculating the first similarity between reference audio Z1 and sub-audio a, the similarity between reference feature vector 1 and audio feature a may be calculated first to obtain an initial similarity (such as similarity 1 in fig. 1 d), then the similarity between reference feature vector 3 and audio feature a is calculated to obtain another initial similarity (such as similarity 3 in fig. 1 d), then the first similarity between reference audio Z1 and sub-audio a may be the result of the fusion of similarity 1 and similarity 3, alternatively, the result of the fusion may be the sum of similarity 1 and similarity 3, for example, similarity 1 is 0.2, and the first similarity between reference audio Z1 and sub-audio a is 0.5.

For another example, in calculating the first similarity between the reference audio Z1 and the sub-audio b, the similarity between the reference feature vector 1 and the audio feature b may be calculated first to obtain an initial similarity (e.g., similarity 4 in fig. 1 d), then the similarity between the reference feature vector 3 and the audio feature b is calculated to obtain another initial similarity (e.g., similarity 6 in fig. 1 d), then the first similarity between the reference audio Z1 and the sub-audio b is a fusion result of the similarity 4 and the similarity 6, for example, the similarity 4 is 0.1, the similarity 6 is 0.1, and then the first similarity between the reference audio Z1 and the sub-audio b is 0.2. Similarly, the first similarity between each sub-audio and each reference audio can be calculated in the above manner.

104. According to the first similarity, at least one candidate audio matched with the sub-audio in the plurality of reference audio is determined, and a second similarity of the candidate audio corresponding to the audio to be retrieved is determined.

In some embodiments, in step 104, determining at least one candidate audio of the plurality of reference audio that matches the sub-audio according to the first similarity may include:

the reference audio having the first similarity greater than or equal to the similarity threshold value among the plurality of reference audio is determined as the candidate audio, for example, the first similarity between the sub-audio a and the reference audio Z1, the reference audio Z2, and the reference audio Z3 is 0.5,0.2, and 0.3, respectively, and if the similarity threshold value is 0.3, the reference audio Z1 and the reference audio Z3 may be determined as the candidate audio corresponding to the sub-audio a.

In other embodiments, in step 104, determining at least one candidate audio of the plurality of reference audio that matches the sub-audio according to the first similarity may include:

and sequencing the plurality of reference audios according to the sequence from the big similarity to the small similarity to obtain sequenced reference audios.

And determining the first k reference audios in the sorted reference audios as candidate audios, wherein k is a positive integer.

In some implementations, the reference audio may also be directly used as the candidate audio.

In some embodiments, in step 104, determining, according to the first similarity, a second similarity of the candidate audio to the audio to be retrieved may include:

And fusing the first similarity of the candidate audio corresponding to each sub-audio aiming at each candidate audio to obtain the second similarity of the candidate audio corresponding to the audio to be retrieved.

In some embodiments, the step of fusing the first similarity of the candidate audio corresponding to each sub-audio to obtain the second similarity of the candidate audio corresponding to the audio to be retrieved may include:

the method comprises the steps of obtaining fusion weights of candidate audios corresponding to each sub-audio and the number of a plurality of sub-audios, carrying out weighted summation on first similarity of the candidate audios corresponding to each sub-audio based on the fusion weights to obtain fusion similarity, and calculating quotient of the fusion similarity and the number of the plurality of sub-audios to obtain second similarity.

With the above example, please refer to fig. 1e again, the candidate audio may include a reference audio Z1 and a reference audio Z3, and for the reference audio Z1, a first similarity (e.g. 0.5 in fig. 1 e) between the reference audio Z1 and the sub-audio a and a first similarity (e.g. 0.3 in fig. 1 e) between the reference audio Z1 and the sub-audio b may be fused to obtain a second similarity of the reference audio Z1 corresponding to the audio to be retrieved. Then, the first similarity between the reference audio Z2 and the sub-audio a (e.g. 0.6 in fig. 1 e) and the first similarity between the reference audio Z2 and the sub-audio b (e.g. 0.4 in fig. 1 e) may be fused, so as to obtain a second similarity of the reference audio Z2 corresponding to the audio to be retrieved. When the first similarities are fused, the first similarities are accumulated and divided by the number of the plurality of sub-audios, so as to obtain a second similarity, for example, the number of the plurality of sub-audios in fig. 1e is 2, and then the second similarity of the reference audio Z1 corresponding to the audio to be retrieved is (0.5+0.3)/2. The second similarity of the reference audio Z2 to the audio to be retrieved is (0.6+0.4)/2.

105. And determining target audio from the candidate audio according to the second similarity.

In some embodiments, in step 105, determining the target audio from the candidate audio according to the second similarity may include:

And determining the candidate audio with the largest second similarity among all the candidate audios matched with the plurality of reference audios as the target audio.

Along with the above example, referring again to fig. 1e, since the second similarity of the reference audio Z1 to the audio to be retrieved is 0.4 and the second similarity of the reference audio Z2 to the audio to be retrieved is 0.5 in the candidate audio, the reference audio Z2 may be determined as the target audio.

In some embodiments, after the target audio is determined, the target audio may also be output. Alternatively, in addition to outputting the target audio, audio information corresponding to the target audio may be output, and specifically, the target audio and the audio information corresponding to the target audio may be returned to the user's terminal as a result of the retrieval of the audio to be retrieved by the user.

It can be seen that in this embodiment, after the audio to be retrieved is obtained, the audio to be retrieved is split into a plurality of sub-audios with independent semantics, then for each sub-audio, a plurality of reference audios corresponding to the sub-audio are obtained, and a first similarity between the sub-audio and each reference audio is obtained, then at least one candidate audio matched with the sub-audio in the plurality of reference audios is determined according to the first similarity, and a second similarity of the candidate audio corresponding to the audio to be retrieved is determined, and finally, the target audio is determined from the candidate audios according to the second similarity. According to the first similarity between each sub-audio and each reference audio, the similar candidate audio is recalled from the plurality of reference audios according to the first similarity between each sub-audio and each reference audio, so that on one hand, the recall efficiency is improved, on the other hand, the semantics of the sub-audio can be captured more accurately due to the fact that the sub-audio has independent semantics, and the recall accuracy is improved. And then, according to the second similarity of the candidate audio corresponding to the audio to be searched, the target audio is screened out from the candidate audio, so that the searching range can be reduced on one hand, and on the other hand, the target audio which is more similar to the audio to be searched in the whole can be screened out from the candidate audio. Therefore, the accuracy and efficiency of audio retrieval are improved.

The method described in the above embodiments will be described in further detail below.

In this embodiment, a server will be taken as an example, and a method according to an embodiment of the present application will be described in detail.

As shown in fig. 2a, a specific flow of an audio retrieval method is as follows:

201. The server obtains the audio to be retrieved.

202. The server identifies the trip point in the audio to be searched, wherein the trip point is a time node with semantically changed in the audio to be searched.

Illustratively, in practical application, the audio retrieval method of the present embodiment may be applied to a framework as shown in fig. 2 b. In fig. 2b, when receiving the audio to be retrieved, the audio to be retrieved may be subjected to trip point detection, i.e. the trip point in the audio to be retrieved is identified.

In some embodiments, in step 202, the specific implementation of step "identify a trip point in audio to be retrieved" may include:

Respectively carrying out covariance matrix calculation on the first sub-feature sequence, the second sub-feature sequence and the feature sequence to obtain a first covariance matrix corresponding to the first sub-feature sequence, a second covariance matrix corresponding to the second sub-feature sequence and a third covariance matrix corresponding to the feature sequence;

determining a logarithmic maximum likelihood ratio according to the first covariance matrix, the second covariance matrix and the third covariance matrix, and determining a time node corresponding to the last feature in the first sub-feature sequence as a jump point if the logarithmic maximum likelihood ratio meets a preset condition.

In other embodiments, in step 202, the specific implementation of step "identify a trip point in audio to be retrieved" may include:

Extracting spectral features of the audio to be searched to obtain a plurality of spectral features, wherein each spectral feature in the plurality of spectral features corresponds to a time node in the audio to be searched, and the plurality of spectral features are arranged in sequence from first to last according to the time nodes;

Determining at least one target spectral feature from a plurality of spectral features, the target spectral feature being used to partition the plurality of spectral features into a plurality of spectral feature sets and the spectral features in each spectral feature set conforming to a gaussian distribution;

and determining the time node corresponding to the target spectrum characteristic as a jump point.

In some embodiments, a specific implementation of step "determining at least one target spectral feature from a plurality of spectral features" may include:

Determining a candidate spectral feature from the plurality of spectral features;

Dividing the plurality of spectrum features into a first spectrum feature set and a second spectrum feature set according to the candidate spectrum features, wherein the first spectrum feature set comprises the candidate spectrum features and the spectrum features of which the time nodes are positioned before the candidate spectrum features;

Determining evaluation values corresponding to the candidate spectrum features according to the first spectrum feature set, the second spectrum feature set and the plurality of spectrum features, wherein the evaluation values represent the probability that the first spectrum feature set and the second spectrum feature set accord with Gaussian distribution;

And if the evaluation value is the maximum value in the corresponding value range of the evaluation value, determining the candidate spectrum characteristic as the target spectrum characteristic.

In some embodiments, the specific implementation of the step of determining the evaluation value corresponding to the candidate spectral feature according to the first set of spectral features, the second set of spectral features and the plurality of spectral features may include:

Determining a logarithmic maximum likelihood ratio corresponding to the candidate spectrum feature according to the first spectrum feature set, the second spectrum feature set and the plurality of spectrum features, wherein the logarithmic maximum likelihood ratio characterizes the initial probability that the first spectrum feature set and the second spectrum feature set both accord with Gaussian distribution;

Acquiring feature quantity of a plurality of spectrum features and feature dimensions of the spectrum features, and determining a penalty value according to the feature quantity and the feature dimensions, wherein the penalty value is used for correcting the initial probability;

And determining an evaluation value according to the logarithmic maximum likelihood ratio and the penalty value.

203. The server performs semantic splitting processing on the audio to be retrieved based on the trip point to obtain a plurality of sub-audio with independent semantics.

With the above example, please refer to fig. 2b again, after identifying the trip point, semantic splitting processing may be performed on the audio to be retrieved according to the trip point, so as to obtain a plurality of sub-audio segments (hereinafter referred to as sub-audio) with independent semantics.

204. The server acquires a plurality of reference audios corresponding to each sub-audio according to each sub-audio, and the first similarity between the sub-audio and each reference audio respectively.

In some embodiments, in step 204, the specific implementation of obtaining the plurality of reference audio corresponding to the sub-audio may include:

Extracting features of the sub-audio to obtain audio features;

The method comprises the steps of obtaining an initial feature vector matched with audio features from a plurality of preset initial feature vectors as a reference feature vector, wherein the plurality of initial feature vectors are extracted from a plurality of preset initial audio;

and taking the initial audio corresponding to the reference feature vector as reference audio.

The specific implementation manner of the step of obtaining the initial feature vector matched with the audio feature from the preset plurality of initial feature vectors as the reference feature vector may include:

the first X initial feature vectors in the ordered initial feature vectors are determined to be initial feature vectors matched with the audio features, wherein X is a positive integer;

and taking the initial feature vector matched with the audio feature as a reference feature vector.

The specific implementation of the step of obtaining candidate similarity between the audio feature and the plurality of initial feature vectors may include:

clustering the sub-vector space through a preset clustering algorithm to obtain a clustering center of the sub-vector space;

In some embodiments, in step 204, the specific implementation of step "obtaining the first similarity between the sub-audio and each of the reference audio" may include:

Firstly, obtaining a corresponding reference feature vector of each reference audio from a preset vector library;

With the above example, referring again to fig. 2b, the server may obtain a media feature library, which may refer to a database containing various types of media content, including mainly different types of audio media data. The features in the media feature library are then quantized to obtain a pre-set vector library (e.g., the low-dimensional feature library of fig. 2 b). In particular, the media feature library is often large in size, and therefore can be optimized using Product Quantization (PQ) to obtain a preset vector library. Alternatively, a cartesian product may be used to decompose an original vector in the media feature library into a plurality of cartesian products of low-dimensional vectors, and quantize the low-dimensional vector space obtained by the decomposition, so that the original vector can be represented by the quantization of the low-dimensional vector, thereby obtaining the reference feature vector. Specifically, the PQ breaks down the high-dimensional vector into a plurality of sub-vectors, each of which is compressed into a number, so that the high-dimensional vector can be represented by several numbers. Illustratively, the D-dimensional original vectors in the media feature library may be decomposed into j sets of D/j-dimensional sub-vectors, and then k-means may be used to aggregate each set of sub-vectors into 256 classes, i.e., 256 cluster centers for each set of sub-vectors. The sub-vectors are mapped into category IDs, i.e. each vector in each group of sub-vectors is represented by a category center, and the Identification (ID) of each category only requires 8 bits) This means that the original d=2048-dimensional floating point number (32-bit) vector can be compressed into D/j 8-bit integers.

And then, extracting the characteristics of the sub-audio to obtain the audio characteristics.

Along the above example, please refer again to fig. 2b, when extracting features of sub-audio, the sub-audio may be converted from wav format (an audio file format) into mel spectrogram, which can simulate the processing of real world sounds (especially human voice) by human ears, extracting information for pitch sensitivity. Next, the CNN14 pre-trained on the large-scale audio data set may be used as a feature extractor, which may better extract the feature embedding of the audio (embbedding), so in this implementation, the last layer of features D in the CNN14 may be obtained as the audio features of each sub-audio segment (i.e. the segment features in fig. 2 b), where feature dimension d=2048. Through feature extraction, the audio to be retrieved may be represented by a feature vector with dimensions t×d, where T is the number of sub-audios. Wherein, since the pre-trained feature extractor has learned a powerful feature extraction function, the network parameters of the feature extractor can be frozen in order to prevent disruption to the original network performance.

And finally, for each reference audio, carrying out similarity calculation on the corresponding reference feature vector and the audio features of the reference audio to obtain a first similarity between the sub-audio and the reference audio.

In some embodiments, the number of reference audio corresponding reference feature vectors is plural, and the step of performing similarity calculation on the reference audio corresponding reference feature vectors and the audio features to obtain the first similarity between the sub-audio and the reference audio may include:

Respectively carrying out similarity calculation on the audio features and the reference audio corresponding to a plurality of reference feature vectors to obtain a plurality of initial similarities;

205. The server determines at least one candidate audio matching the sub-audio in the plurality of reference audio according to the first similarity, and determines a second similarity of the candidate audio corresponding to the audio to be retrieved.

In some embodiments, in step 205, a specific implementation of step "determining at least one candidate audio matching the sub-audio in the plurality of reference audio according to the first similarity" may include:

Sequencing the plurality of reference audios according to the sequence from the big to the small of the first similarity to obtain sequenced reference audios;

With the above example, please refer to fig. 2b again, after obtaining the audio features of the sub-audio, for each sub-audio, similarity comparison may be performed according to the audio features of the sub-audio and the reference feature vectors in the preset vector library, so as to recall a plurality of reference feature vectors most similar to the audio features from the preset vector library, and then trace back to the corresponding reference audio according to the recalled reference feature vectors. When recall the reference feature vector from the preset vector library, the reference feature vector in the preset vector library can be exemplified first, specifically, a cluster center can be established through k-means, the nearest cluster center is queried first when recall of the audio feature is carried out each time, and then all vectors in the clusters are compared to obtain similar vectors.

In some embodiments, in step 205, a specific embodiment of step "determining, according to the first similarity, a second similarity of the candidate audio corresponding to the audio to be retrieved" may include:

The specific implementation method of the step of fusing the first similarity of the candidate audio corresponding to each sub-audio to obtain the second similarity of the candidate audio corresponding to the audio to be retrieved may include:

acquiring fusion weights of the candidate audios corresponding to each sub-audio and the number of a plurality of sub-audios;

and calculating the quotient of the fusion similarity and the number of the plurality of sub-audios to obtain a second similarity.

Along the above example, please refer to fig. 2b again, after tracing the reference audio corresponding to the reference feature vector, the post-processing may be performed on the reference audio, where the post-processing is the step 205, specifically, the reference feature vector having a higher similarity with the audio feature of the sub-audio may be recalled, then the recalled reference feature vector is traced to obtain the corresponding reference audio, the reference audio obtained by tracing is used as the candidate audio, and finally the similarity comparison is performed on the candidate audio and each sub-audio to obtain the comparison result, that is, the second similarity of each candidate audio corresponding to the audio to be retrieved.

206. The server determines the target audio from the candidate audio according to the second similarity.

In some embodiments, in step 206, the specific implementation of step "determine target audio from candidate audio according to the second similarity" may include:

In order to better implement the method, the embodiment of the application also provides an audio retrieval device which can be integrated in electronic equipment, wherein the electronic equipment can be a terminal, a server and the like. The terminal can be a mobile phone, a tablet personal computer, an intelligent Bluetooth device, a notebook computer, a personal computer and other devices, and the server can be a single server or a server cluster consisting of a plurality of servers.

For example, in the present embodiment, a method according to an embodiment of the present application will be described in detail by taking an example in which an audio search device is specifically integrated in an electronic device.

For example, as shown in fig. 3, the audio retrieval apparatus may include a first acquisition unit 301, a splitting unit 302, a second acquisition unit 303, a first determination unit 304, and a second determination unit 305, as follows:

a first obtaining unit 301, configured to obtain audio to be retrieved;

The splitting unit 302 is configured to perform semantic splitting processing on the audio to be retrieved to obtain a plurality of sub-audio with independent semantics;

A second obtaining unit 303, configured to obtain, for each sub-audio, a plurality of reference audio corresponding to the sub-audio, and a first similarity between the sub-audio and each reference audio, respectively;

A first determining unit 304, configured to determine at least one candidate audio matching with the sub-audio in the plurality of reference audio according to the first similarity, and determine a second similarity of the candidate audio corresponding to the audio to be retrieved;

the second determining unit 305 is configured to determine the target audio from the candidate audio according to the second similarity.

In some embodiments, the second obtaining unit 303 is specifically configured to:

Extracting features of the sub-audio to obtain audio features;

In some embodiments, the second obtaining unit 303 is specifically further configured to:

For reference audio, detecting the vector quantity of the reference feature vectors corresponding to the reference audio;

In some embodiments, the first determining unit 304 is specifically configured to:

In some embodiments, the second determining unit 305 is specifically configured to:

In some embodiments, the second determining unit 305 is specifically further configured to:

In some implementations, the splitting unit 302 includes:

the identifying subunit is used for identifying the jump points in the audio to be searched, wherein the jump points are time nodes with semanteme change in the audio to be searched;

the splitting subunit is used for carrying out semantic splitting processing on the audio to be retrieved based on the trip point to obtain a plurality of independent semantic sub-audios.

In some embodiments, the identification subunit is specifically configured to:

In the implementation, each unit may be implemented as an independent entity, or may be implemented as the same entity or several entities in any combination, and the implementation of each unit may be referred to the foregoing method embodiment, which is not described herein again.

The embodiment of the application also provides electronic equipment which can be a terminal, a server and other equipment. The terminal can be a mobile phone, a tablet computer, an intelligent Bluetooth device, a notebook computer, a personal computer and the like, and the server can be a single server or a server cluster formed by a plurality of servers and the like.

In this embodiment, an electronic device is taken as an example to describe in detail, for example, as shown in fig. 4, which shows a schematic structural diagram of the electronic device according to the embodiment of the present application, specifically:

the electronic device may include one or more processor cores 401, one or more computer-readable storage media memory 402, a power supply 403, an input module 404, and a communication module 405, among other components. Those skilled in the art will appreciate that the electronic device structure shown in fig. 4 is not limiting of the electronic device and may include more or fewer components than shown, or may combine certain components, or may be arranged in different components. Wherein:

The processor 401 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402, and calling data stored in the memory 402, thereby performing overall detection of the electronic device. In some embodiments, processor 401 may include one or more processing cores, and in some embodiments, processor 401 may integrate an application processor primarily processing operating systems, user interfaces, applications, and the like, with a modem processor primarily processing wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.

The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by executing the software programs and modules stored in the memory 402. The memory 402 may mainly include a storage program area that may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), etc., and a storage data area that may store data created according to the use of the electronic device, etc. In addition, memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.

The electronic device also includes a power supply 403 for powering the various components, and in some embodiments, the power supply 403 may be logically connected to the processor 401 by a power management system, such that charge, discharge, and power consumption management functions are performed by the power management system. The power supply 403 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.

The electronic device may also include an input module 404, which input module 404 may be used to receive entered numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

The electronic device may also include a communication module 405, and in some embodiments the communication module 405 may include a wireless module, through which the electronic device may wirelessly transmit over a short distance, thereby providing wireless broadband internet access to the user. For example, the communication module 405 may be used to assist a user in e-mail, browsing web pages, accessing streaming media, and so forth.

Although not shown, the electronic device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 401 in the electronic device loads executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions.

The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.

Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.

To this end, embodiments of the present application provide a computer readable storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform the steps of any of the audio retrieval methods provided by the embodiments of the present application.

The storage medium may include a Read Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, an optical disk, or the like.

According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method provided in the above-described embodiment.

The instructions stored in the storage medium can execute the steps in any audio retrieval method provided by the embodiment of the present application, so that the beneficial effects that any audio retrieval method provided by the embodiment of the present application can be achieved, and detailed descriptions of the foregoing embodiments are omitted herein.

The foregoing describes in detail a method, apparatus, electronic device and computer readable storage medium for audio retrieval, and the principles and embodiments of the present application are described herein by applying specific examples, which are provided for aiding in understanding of the method and core concept of the present application, and meanwhile, to those skilled in the art, according to the concept of the present application, there are variations in the specific embodiments and application ranges, so that the disclosure should not be construed as limiting the application.

Claims

1. An audio retrieval method, comprising:

Acquiring audio to be retrieved;

If the audio to be searched has semantics, identifying a jump point in the audio to be searched, wherein the jump point is a time node of semantic change in the audio to be searched;

If the audio to be searched does not have semantics, splitting the audio to be searched through the audio characteristics of the audio to be searched to obtain a plurality of sub-audios, wherein the plurality of sub-audios correspond to different audio characteristics, and the types of the audio characteristics comprise at least one of energy characteristics, time domain characteristics or frequency domain characteristics;

determining the first k reference audios in the sequenced reference audios as candidate audios, wherein k is a positive integer;

For each candidate audio, fusing the first similarity of the candidate audio corresponding to each sub-audio to obtain the second similarity of the candidate audio corresponding to the audio to be retrieved;

2. The audio retrieval method according to claim 1, wherein the obtaining the plurality of reference audio corresponding to the sub-audio includes:

3. The audio retrieval method according to claim 2, wherein the obtaining, from a plurality of preset initial feature vectors, an initial feature vector matching the audio feature as a reference feature vector, comprises:

4. The audio retrieval method according to claim 3, wherein said obtaining candidate similarities between the audio features and the plurality of initial feature vectors, respectively, comprises:

5. The audio retrieval method according to claim 2, wherein said obtaining a first similarity between the sub-audio and each of the reference audio, respectively, comprises:

6. The audio retrieval method according to claim 1, wherein the fusing the first similarity of the candidate audio corresponding to each sub-audio to obtain the second similarity of the candidate audio corresponding to the audio to be retrieved includes:

7. The audio retrieval method according to claim 1, wherein the identifying the trip point in the audio to be retrieved comprises:

8. The audio retrieval method according to any one of claims 1 to 7, wherein the determining a target audio from the candidate audio according to the second similarity includes:

9. An audio retrieval device, comprising:

the first acquisition unit is used for acquiring the audio to be retrieved;

The splitting unit is used for identifying a jump point in the audio to be searched under the condition that the audio to be searched has semantics, wherein the jump point is a time node of semantic change in the audio to be searched, carrying out semantic splitting processing on the audio to be searched based on the jump point to obtain a plurality of sub-audios with independent semantics, and carrying out splitting processing on the audio to be searched through the audio characteristics of the audio to be searched under the condition that the audio to be searched does not have semantics to obtain a plurality of sub-audios, wherein the plurality of sub-audios correspond to different audio characteristics, and the type of the audio characteristics comprises at least one of energy characteristics, time domain characteristics or frequency domain characteristics;

The first determining unit is used for sequencing the plurality of reference audios according to the sequence from the large similarity to the small similarity to obtain sequenced reference audios, determining the first k reference audios in the sequenced reference audios as candidate audios, wherein k is a positive integer;

10. An electronic device, comprising a processor and a memory, wherein the memory stores a plurality of instructions, and wherein the processor loads instructions from the memory to perform the steps in the audio retrieval method according to any one of claims 1 to 8.

11. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the audio retrieval method of any one of claims 1 to 8.

12. A computer program product comprising computer programs/instructions which when executed by a processor implement the steps of the audio retrieval method of any one of claims 1 to 8.