
Method, device, equipment and storage medium for detecting chorus

Info

Publication number
CN117690420B
Authority
CN
China
Prior art keywords
chorus
song
audio
correlation matrix
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311781810.0A
Other languages
Chinese (zh)
Other versions
CN117690420A (en)
Inventor
罗程方
张超钢
陈传艺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Kugou Computer Technology Co Ltd
Priority to CN202311781810.0A
Publication of CN117690420A
Application granted
Publication of CN117690420B
Legal status: Active
Anticipated expiration

Abstract

The application discloses a method, a device, equipment and a storage medium for detecting a chorus, and relates to the field of audio processing. A melody feature extraction model is called to perform feature extraction on a song to be detected to obtain song features, and the melody feature extraction model is called to perform feature extraction on a reference chorus segment to obtain chorus features. The song features are multiplied by the chorus features to calculate a correlation matrix of the two, where the correlation matrix is a two-dimensional matrix whose number of rows corresponds to the number of audio frames of the chorus segment and whose number of columns corresponds to the number of audio frames of the song to be detected. A target column number interval in which the values of the correlation matrix meet a threshold condition is determined, and the target audio frame interval of the song to be detected corresponding to the target column number interval is determined as a predicted chorus segment of the song to be detected. The method can improve chorus annotation efficiency.

Description

Method, device, equipment and storage medium for detecting chorus
Technical Field
The present application relates to the field of audio processing, and in particular, to a method, apparatus, device, and storage medium for detecting a chorus.
Background
Modern popular music is usually composed of an intro, verses, choruses, interludes and other sections. The chorus, also known as the climax of a song, typically repeats, so a song usually contains multiple chorus segments.
A good chorus detection model can accurately detect the start time and end time of each chorus segment in a song. How to train such a model is a popular research topic in the field of music information retrieval, and preparing its training data is an important part of the work: the quality and quantity of the training data directly determine the performance of the chorus detection model, and a good model can only be trained with a large amount of high-quality data.
However, the labeling of chorus detection training datasets is usually done manually, which is very time-consuming and labor-intensive.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a storage medium for detecting chorus, which can improve chorus annotation efficiency. The technical scheme is as follows.
According to an aspect of the present application, there is provided a chorus detection method, the method comprising:
Calling a melody feature extraction model to perform feature extraction on a song to be detected to obtain song features, and calling the melody feature extraction model to perform feature extraction on a reference chorus segment to obtain chorus features, wherein the reference chorus segment is a calibrated chorus segment in the song to be detected, the song features and the chorus features are two-dimensional matrices, the width of each two-dimensional matrix represents a frequency range, and the length of each two-dimensional matrix represents the number of audio frames;
Multiplying the song characteristics with the chorus characteristics, and calculating a correlation matrix of the song characteristics and the chorus characteristics, wherein the correlation matrix is a two-dimensional matrix, the number of rows of the correlation matrix corresponds to the number of audio frames of the chorus segment, and the number of columns of the correlation matrix corresponds to the number of audio frames of the song to be detected;
determining a target column number interval in which the value in the correlation matrix meets a threshold condition;
And determining the target audio frame interval of the song to be detected corresponding to the target column number interval as a predicted chorus segment of the song to be detected.
According to another aspect of the present application, there is provided a chorus detection apparatus, the apparatus comprising:
A feature extraction module, configured to call a melody feature extraction model to perform feature extraction on a song to be detected to obtain song features, and to call the melody feature extraction model to perform feature extraction on a reference chorus segment to obtain chorus features, wherein the reference chorus segment is a calibrated chorus segment in the song to be detected, the song features and the chorus features are two-dimensional matrices, the width of each two-dimensional matrix represents a frequency range, and the length of each two-dimensional matrix represents the number of audio frames;
The correlation module is used for multiplying the song characteristics with the chorus characteristics and calculating a correlation matrix of the song characteristics and the chorus characteristics, wherein the correlation matrix is a two-dimensional matrix, the number of rows of the correlation matrix corresponds to the number of audio frames of the chorus segment, and the number of columns of the correlation matrix corresponds to the number of audio frames of the song to be detected;
A determining module, configured to determine a target column number interval in the correlation matrix in which the values meet a threshold condition;
The determining module is further configured to determine the target audio frame interval of the song to be detected corresponding to the target column number interval as a predicted chorus segment of the song to be detected.
According to another aspect of the present application there is provided a computer device comprising a processor and a memory having stored therein at least one instruction, at least one program, code set or instruction set loaded and executed by the processor to implement the chorus detection method as described in the above aspect.
According to another aspect of the present application, there is provided a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes or a set of instructions loaded and executed by a processor to implement the chorus detection method as described in the above aspect.
According to another aspect of embodiments of the present disclosure, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the chorus detection method provided in the above alternative implementation.
The technical scheme provided by the embodiment of the application has the beneficial effects that at least:
A reference chorus segment that has already been annotated in a song is utilized to automatically detect the other chorus segments of the song. Features are extracted from the reference chorus segment and from the song respectively, the correlation of the two features is calculated, and segments with high correlation to the reference chorus segment are matched from the song according to the correlation result, so that the chorus segments in the song are labeled. This improves the efficiency of chorus labeling and saves cost. Obtaining the training set of the chorus detection model by the method provided in the embodiment of the application can greatly improve the accuracy of the chorus detection model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block diagram of a computer device provided by an exemplary embodiment of the present application;
Fig. 2 is a method flowchart of a method for detecting a chorus according to another exemplary embodiment of the present application;
fig. 3 is a schematic diagram of a method for detecting a chorus according to another exemplary embodiment of the present application;
fig. 4 is a schematic diagram of a method for detecting a chorus according to another exemplary embodiment of the present application;
Fig. 5 is a schematic diagram of a method for detecting a chorus according to another exemplary embodiment of the present application;
fig. 6 is a method flowchart of a method for detecting a chorus according to another exemplary embodiment of the present application;
Fig. 7 is a method flowchart of a method for detecting a chorus according to another exemplary embodiment of the present application;
Fig. 8 is a schematic diagram of a method for detecting a chorus according to another exemplary embodiment of the present application;
Fig. 9 is a method flowchart of a method for detecting a chorus according to another exemplary embodiment of the present application;
fig. 10 is a method flowchart of a method for detecting a chorus according to another exemplary embodiment of the present application;
fig. 11 is a block diagram of a chorus detection apparatus provided in another exemplary embodiment of the present application;
fig. 12 is a schematic diagram of a server according to another exemplary embodiment of the present application;
fig. 13 is a block diagram of a terminal provided in another exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Fig. 1 shows a schematic diagram of a computer device 101 provided by an exemplary embodiment of the application, which computer device 101 may be a terminal or a server.
The terminals may include smart phones, notebook computers, desktop computers, tablet computers, all-in-one computers, Internet of Things devices, intelligent robot workstations, televisions, set-top boxes, smart glasses, smart watches, digital cameras, MP4 playback terminal devices, MP5 playback terminal devices, learning machines, point-and-read machines, e-book readers, electronic dictionaries, vehicle-mounted terminal devices, Virtual Reality (VR) playback terminal devices, Augmented Reality (AR) playback terminal devices, and the like.
In an optional implementation manner, the method for detecting the chorus can be applied to an application program with a chorus detection function, wherein the application program can be an audio playing application program, an audio processing application program, an audio recognition application program, a video processing application program, an audio publishing application program, a video publishing application program, a social application program, a shopping application program, a live broadcast application program, a forum application program, an information application program, a life application program, an office application program and the like. Optionally, a client of the application program is installed on the terminal.
The terminal and the server are connected with each other through a wired or wireless network.
The terminal includes a first memory and a first processor. The first memory stores a chorus detection algorithm 102, which is called and executed by the first processor to implement the chorus detection method provided by the application. The first Memory may include, but is not limited to, random access Memory (Random Access Memory, RAM), read Only Memory (ROM), programmable Read Only Memory (Programmable Read-Only Memory, PROM), erasable Read Only Memory (Erasable Programmable Read-Only Memory, EPROM), and electrically erasable Read Only Memory (Electric Erasable Programmable Read-Only Memory, EEPROM).
The first processor may be one or more integrated circuit chips. Alternatively, the first processor may be a general purpose processor, such as a central processing unit (Central Processing Unit, CPU) or a network processor (Network Processor, NP). Optionally, the first processor may implement the method for detecting chorus provided by the present application by running a program or code.
The server includes a second memory and a second processor. The second memory stores a chorus detection algorithm 102, which is called by the second processor to implement the chorus detection method provided by the application. Alternatively, the second memory may include, but is not limited to, RAM, ROM, PROM, EPROM, EEPROM. Alternatively, the second processor may be a general purpose processor, such as a CPU or NP.
Fig. 2 shows a flowchart of a method for detecting a chorus according to an exemplary embodiment of the present application. The method may be performed by a computer device, for example, a terminal or server as shown in fig. 1. The method comprises the following steps.
Step 210, invoking a melody feature extraction model to perform feature extraction on the song to be detected to obtain song features, and invoking the melody feature extraction model to perform feature extraction on a reference chorus segment to obtain chorus features, wherein the reference chorus segment is a calibrated chorus segment in the song to be detected, the song features and the chorus features are two-dimensional matrices, the width of the two-dimensional matrices represents a frequency range, and the length of the two-dimensional matrices represents the number of audio frames.
The melody feature extraction model is a neural network model. Optionally, the melody feature extraction model includes a preprocessing layer and a feature extraction layer.
The preprocessing layer is used to preprocess the input data (the song to be detected or the reference chorus segment) to obtain the Mel spectrum features of the input data. The preprocessing may include at least one of time-frequency conversion, filtering, and Mel transformation.
Taking the song to be detected as an example, after the song to be detected is input to the preprocessing layer, the preprocessing layer performs a Short-Time Fourier Transform (STFT) on the song to be detected to obtain at least two frequency spectra corresponding respectively to at least two audio frames, and performs a Mel transform on the at least two frequency spectra to obtain Mel spectrum features, where the Mel spectrum features include the Mel spectra corresponding respectively to the at least two audio frames.
STFT is a method of signal analysis in time and frequency that breaks down a signal into spectral components over different time periods.
The process of STFT may include the following two steps:
1) The audio signal (the song to be detected) is divided into at least two audio frames; the duration of each audio frame is fixed, such as 10 ms or 20 ms, and a certain overlap may exist between adjacent audio frames, commonly 50% or 75%.
2) And performing discrete Fourier transform (Discrete Fourier Transform, DFT) calculation on each audio frame, and converting the time domain signal into a frequency domain signal to obtain the frequency spectrum of each audio frame.
Illustratively, the frequency spectrum of each frame of audio includes amplitude values (or energy values) at least two frequencies. For example, one spectrum includes 256 amplitude values at 256 frequencies.
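As a rough illustration of the two STFT steps above (the frame duration, overlap and FFT size are assumed example values, not taken from the patent), the per-frame magnitude spectrum could be computed as follows:
import numpy as np

def frame_and_fft(signal, sr, frame_ms=20, overlap=0.5, n_fft=512):
    frame_len = int(sr * frame_ms / 1000)               # fixed duration per audio frame
    hop = int(frame_len * (1 - overlap))                 # overlap between adjacent frames
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # DFT of each frame; rfft keeps the non-negative frequency bins (n_fft // 2 + 1 values),
    # converting each time-domain frame into a magnitude spectrum
    return np.abs(np.fft.rfft(np.array(frames), n=n_fft, axis=1))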
The mel-transform may convert the spectrum into a mel-spectrum.
Illustratively, a method of converting a spectrum to a Mel spectrum is presented. For the spectrum of each audio frame, the frequency is mapped to the Mel scale, which has a logarithmic relationship with frequency (in Hertz): a frequency f (in Hertz) is converted to a Mel value m (in Mel) using m = 2595 × log10(1 + f/700), where 700 is a reference frequency that may be adjusted to other values. The amplitude values in the spectrum are then logarithmically transformed to enhance the contrast of the low-amplitude parts, finally yielding the Mel spectrum of each audio frame.
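A minimal sketch of this Hertz-to-Mel conversion (the constants are exactly those given above):
import numpy as np

def hz_to_mel(f_hz):
    # m = 2595 * log10(1 + f / 700), with 700 as the reference frequency
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# e.g. hz_to_mel(440.0) is roughly 549.6 mel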
Illustratively, the preprocessing layer obtains the Mel spectrum features of the song to be detected after preprocessing it. The preprocessing may first divide the audio into frames according to a preset time window, then perform time-frequency conversion and Mel conversion on each frame to obtain the Mel spectrum of each audio frame; the Mel spectra of the multiple audio frames form the Mel spectrum features of the song to be detected. For example, with the time window set to 30 ms (milliseconds), a one-minute song to be detected can be divided into 2000 frames; each audio frame yields one Mel spectrum, giving 2000 Mel spectra in total, i.e. 2000 frames of Mel spectra.
Illustratively, the mel-spectra in the mel-spectrum feature are arranged in chronological order of each frame of audio, i.e., in the order of the mel-spectrum of the first frame of audio, the mel-spectrum of the second frame of audio, the mel-spectrum of the third frame of audio.
A Mel spectrum (Mel-spectrogram) is a spectral representation. It is obtained from the spectrogram produced by analyzing an audio signal in time and frequency, i.e. the spectrogram obtained by performing a Short-Time Fourier Transform (STFT) on the time-domain audio signal. The Mel spectrum adapts better to the characteristics of human auditory perception by applying a series of transformations to that spectrogram: it uses the Mel scale instead of the linear frequency scale of the spectrogram. The Mel scale conforms more closely to human auditory perception and better reflects people's subjective perception of pitch.
Illustratively, the mel-spectrum feature may be data in m x n dimensions, where m is the number of audio frames and n is the amount of data in one mel-spectrum. For example, after the song to be detected is subjected to time-frequency conversion and mel conversion, 2000 frames of 256-dimensional mel spectrum features can be obtained, wherein the song to be detected comprises 2000 frames of audio frames, and 256 frequency spectrum data are obtained after each frame of audio is subjected to mel conversion. 256 spectral data may refer to 256 amplitude data on 256 mel scales (frequencies).
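For illustration only, such an m x n Mel spectrum feature could be computed with librosa (the library choice, file name, sample rate and FFT parameters are assumptions; the patent does not name a toolkit):
import librosa

y, sr = librosa.load("song_to_detect.wav", sr=16000)           # hypothetical input file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=480,            # about a 30 ms hop at 16 kHz
                                     n_mels=256)                # 256 mel bins per frame
log_mel = librosa.power_to_db(mel)                              # log transform of the amplitudes
features = log_mel.T                                            # shape (num_frames, 256)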
The feature extraction layer is used for extracting features of the data (Mel spectrum features) output by the preprocessing layer and extracting melody features.
Alternatively, the feature extraction layer may employ a model architecture of a first RNN (Recurrent Neural Network), a CNN (Convolutional Neural Network), and a second RNN connected in sequence. For example, the first RNN and the second RNN each use a bidirectional LSTM (Long Short-Term Memory) model and the CNN uses a 5-layer CNN network; that is, the feature extraction layer uses a network structure of a first LSTM, a 5-layer CNN, and a second LSTM connected in sequence.
Alternatively, the RNN may employ a GRU (Gated Recurrent Unit) model, and the CNN may employ a ResNet (Residual Network).
In the feature extraction layer, the first LSTM extracts the raw temporal features of the audio, the CNN then models the local structure of these temporal features and extracts the melody features of each time frame, and finally the second LSTM links the local features with contextual features to obtain the final melody features.
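A rough PyTorch sketch of such a feature extraction layer (the framework, hidden sizes and kernel size are assumptions; only the BiLSTM -> 5-layer CNN -> BiLSTM ordering follows the description above):
import torch.nn as nn

class MelodyFeatureExtractor(nn.Module):
    def __init__(self, n_mels=256, hidden=128, out_dim=256):
        super().__init__()
        self.lstm1 = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
        layers = []
        ch = 2 * hidden
        for _ in range(5):                      # 5-layer CNN over the time axis
            layers += [nn.Conv1d(ch, ch, kernel_size=3, padding=1), nn.ReLU()]
        self.cnn = nn.Sequential(*layers)
        self.lstm2 = nn.LSTM(ch, out_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, mel):                     # mel: (batch, frames, n_mels)
        x, _ = self.lstm1(mel)                  # raw temporal features
        x = self.cnn(x.transpose(1, 2)).transpose(1, 2)   # local features per time frame
        x, _ = self.lstm2(x)                    # link local and contextual features
        return x                                # (batch, frames, out_dim) melody features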
The reference chorus segment is a segment of audio in the song to be detected. Illustratively, the reference chorus segment is a manually annotated chorus segment in the song to be detected.
Illustratively, the audio formats of the song to be detected and the reference chorus segment input to the melody feature extraction model are not limited. Optionally, the audio formats of the song to be detected and the reference chorus segment are the same.
Optionally, the melody feature extraction model outputs features that are a two-dimensional matrix, the length of the two-dimensional matrix depends on the audio duration of the input data, and the width of the two-dimensional matrix is a fixed width, wherein the width of the two-dimensional matrix is used for representing the frequency range.
For example, the length of the song feature is equal to the number of audio frames obtained by framing the song to be detected by the preprocessing layer. When the song to be detected is divided into 2000 frames and the frequency range includes amplitude data on 256 mel scales, the song is characterized as a 2000 x 256 dimensional two-dimensional matrix. Where 2000 is the length (number of columns) of the two-dimensional matrix and 256 is the width (number of rows) of the two-dimensional matrix.
For another example, the length of the chorus feature is equal to the number of audio frames obtained by the preprocessing layer framing the reference chorus segment. When the reference chorus segment is divided into 100 frames, the frequency range includes amplitude data on 256 mel scales, the chorus feature is a two-dimensional matrix of 100 x 256 dimensions. Where 100 is the length (number of columns) of the two-dimensional matrix and 256 is the width (number of rows) of the two-dimensional matrix.
Illustratively, the width (number of rows) of the song features and the chorus features are the same, and the length (number of columns) of the song features is greater than the length (number of columns) of the chorus features.
For example, as shown in fig. 3, a song feature 401 is provided, which has a length equal to the number of audio frames of the song to be detected, a width equal to the frequency range, and a color depth representing a numerical value, wherein the highlighted portion may be used to characterize the melody trend of the song to be detected.
Step 220, multiplying the song features by the chorus features, and calculating a correlation matrix of the song features and the chorus features, wherein the correlation matrix is a two-dimensional matrix, the number of rows of the correlation matrix corresponds to the number of audio frames of the chorus segments, and the number of columns of the correlation matrix corresponds to the number of audio frames of the song to be detected.
Illustratively, the two-dimensional matrix of the chorus features is transposed, and matrix multiplication is performed between the transposed chorus features and the song features to obtain the correlation matrix.
For example, the dimensions of the song feature are [A, B], where A is the frequency range and B is the number of audio frames of the song to be detected. The dimensions of the chorus feature are [A, C], where A is the frequency range and C is the number of audio frames of the reference chorus segment. Matrix multiplication of the transposed chorus features with the song features gives a correlation matrix of [C, A] × [A, B] = [C, B]. That is, the length (number of columns) of the correlation matrix equals the number of audio frames B of the song to be detected, and the width (number of rows) equals the number of audio frames C of the reference chorus segment.
For example, the dimension of the song feature is 256×2000, the dimension of the chorus feature is 256×100, and the dimension of the correlation matrix is 100×2000.
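A minimal numpy sketch of this correlation computation with the example dimensions above (the random features are placeholders for real model outputs):
import numpy as np

rng = np.random.default_rng(0)
song_feat = rng.random((256, 2000))        # [A, B]: frequency range x song audio frames
chorus_feat = rng.random((256, 100))       # [A, C]: frequency range x chorus audio frames

corr = chorus_feat.T @ song_feat           # [C, A] x [A, B] = [C, B], here 100 x 2000
# rows index the chorus-segment frames, columns index the frames of the song to be detected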
For example, as shown in fig. 4, a correlation matrix 402 is provided, which has a length equal to the number of frames of the audio frame of the song to be detected and a width equal to the number of frames of the audio frame of the reference chorus segment, and the color depth indicates the correlation size, and the highlighted portion indicates that the correlation is high.
Step 230, determining a target column number interval in which the value in the correlation matrix satisfies the threshold condition.
For example, the magnitude of the values in the correlation matrix characterizes the correlation between the song to be detected and the reference chorus segment at that location. For example, the value at point (x, y) in the correlation matrix represents the correlation between audio frame y of the song to be detected and the melody of audio frame x of the reference chorus segment: a higher value indicates that audio frame y of the song to be detected is similar in melody to audio frame x of the reference chorus segment, and a lower value indicates that they are dissimilar.
Therefore, the segment with high correlation with the reference chorus segment in the song to be detected presents a high-brightness (high-value) diagonal line in the correlation matrix, and the chorus segment in the song to be detected can be determined by identifying the high-brightness diagonal line in the correlation matrix.
Assume that frames 11 to 20 of the song to be detected form a chorus segment: frame 11 of the song to be detected has a high correlation with frame 1 of the reference chorus segment, i.e. the value at position (1, 11) in the correlation matrix is high; frame 12 has a high correlation with frame 2 of the reference chorus segment, i.e. the value at (2, 12) is high; frame 13 has a high correlation with frame 3, i.e. the value at (3, 13) is high; and so on. It can be seen that a highlighted diagonal (from top left to bottom right) in the correlation matrix corresponds to a chorus segment in the song to be detected.
Because the number of columns of the correlation matrix is equal to the number of audio frames of the song to be detected, the number of columns of the diagonal line with the value higher than the threshold value is determined, and the audio frame of the chorus segment in the song to be detected can be determined.
The threshold condition may include that a value on a diagonal within a column number interval in the correlation matrix is higher than a threshold.
Optionally, the width of the column number interval depends on the number of audio frames of the reference chorus segment (i.e. the width of the correlation matrix), i.e. the width of the target column number interval is equal to the width of the correlation matrix. For example, if the dimension of the correlation matrix is 100 x 2000, the width of the target column number interval is 100; for example, the target column number interval may include the 11th to 110th columns of the correlation matrix.
Step 240, determining the target audio frame interval of the song to be detected corresponding to the target column number interval as a predicted chorus segment of the song to be detected.
For example, if the diagonal values from the 11th column to the 20th column in the correlation matrix are higher than the threshold, then frames 11 to 20 of the song to be detected correspondingly form a predicted chorus segment.
Optionally, the correlation matrix may include at least one target column number interval that meets the threshold condition; each target column number interval determines one predicted chorus segment, and a plurality of target column number intervals determine a plurality of predicted chorus segments.
Illustratively, the predicted chorus segments determined in step 240 include the reference chorus segment. For example, if the song to be detected includes 3 chorus segments and the reference chorus segment is the 2nd of them, then in step 240 all 3 chorus segments (predicted chorus segments) of the song to be detected are determined.
The method provided by the embodiment of the application can be used for collecting training sample data of the chorus detection model, and training the chorus detection model by using the training sample data.
The song to be detected is determined as sample input data, and the predicted chorus segments are determined as sample chorus labels; the chorus detection model is called to output predicted chorus intervals according to the sample input data, and the chorus detection model is trained until convergence according to the loss between the predicted chorus intervals and the sample chorus labels.
For example, the audio frame interval in which the predicted chorus segment is located is determined as the sample chorus label, and the chorus detection model is trained to output the audio frame interval in which the chorus segment is located in the song to be detected (the predicted chorus interval).
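A hypothetical sketch of how such a training sample could be assembled (the frame-level 0/1 label encoding is an assumption; the patent only states that the predicted chorus intervals serve as the sample chorus label):
def make_training_sample(song_audio, predicted_intervals, num_frames):
    # predicted_intervals: list of (start_frame, end_frame) pairs, e.g. [(11, 20)]
    label = [0] * num_frames
    for start, end in predicted_intervals:
        for i in range(start, end + 1):
            label[i] = 1                        # frame lies inside a predicted chorus segment
    return song_audio, label                    # sample input data and sample chorus label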
Illustratively, as shown in fig. 5, a song to be detected 501 is input into a melody feature extraction model to obtain song features 503, and a reference chorus segment 502 is input into the melody feature extraction model to obtain chorus features 504.
The song to be detected 501 is an audio signal whose signal length equals the duration of the song to be detected. The song feature 503 is a two-dimensional matrix whose length equals the number of audio frames of the song to be detected (at least one audio frame is obtained by framing the audio signal) and whose width equals the frequency range preset by the model. The reference chorus segment 502 is an audio signal whose signal length equals the duration of the reference chorus segment. The chorus feature 504 is a two-dimensional matrix whose length equals the number of audio frames of the reference chorus segment (at least one audio frame is obtained by framing the audio signal of the reference chorus segment) and whose width equals the frequency range preset by the model. Optionally, assuming that the song to be detected 501 includes two chorus segments, the reference chorus segment may be either of them.
Then, the two-dimensional matrix of the chorus features is transposed, and the transposed chorus features are matrix-multiplied with the song features to obtain the correlation matrix 505. The matrix length of the correlation matrix 505 equals the number of audio frames of the song to be detected, and the matrix width of the correlation matrix 505 equals the number of audio frames of the reference chorus segment.
The column interval of the correlation matrix 505 corresponding to the target window matrix whose values are higher than the threshold gives the audio frames in which a chorus segment of the song to be detected is located; according to the audio frame positions of the target window matrix, the predicted chorus segment 506 can be determined directly from the song to be detected.
In summary, the method provided in this embodiment automatically detects the other chorus segments in a song by using a reference chorus segment that has already been annotated in the song. Features are extracted from the reference chorus segment and from the song respectively, the correlation of the two features is calculated, and segments with high correlation to the reference chorus segment are matched from the song according to the correlation result, so that the chorus segments in the song are labeled. This improves the efficiency of chorus labeling and saves cost. Obtaining the training set of the chorus detection model by the method provided in the embodiment of the application can greatly improve the accuracy of the chorus detection model.
The embodiment of the application provides two exemplary ways for determining the target column number interval from the correlation matrix.
First, referring to fig. 6, a flowchart of a method for detecting a chorus provided in an exemplary embodiment of the present application is shown. The method may be performed by a computer device, for example, a terminal or server as shown in fig. 1. Based on the embodiment shown in fig. 2, step 230 includes step 231.
Step 231, determining values higher than a first threshold in the correlation matrix as candidate points, and determining a column number interval in which the candidate points in the correlation matrix are distributed along a diagonal as the target column number interval.
For example, as shown in fig. 4, candidate points (highlighted points) having values higher than a first threshold in the correlation matrix 402 are selected, and then a target column number section in which the candidate points are distributed diagonally in the correlation matrix is selected, wherein the width of the target column number section is equal to the width of the correlation matrix, that is, the slope of the diagonal is 45 °.
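A sketch of this first approach (the threshold value and scoring are assumptions): values above the first threshold are marked as candidate points, and each candidate column interval is scored by how many candidate points lie on its 45-degree diagonal.
import numpy as np

def score_diagonals(corr, first_threshold=0.8):
    rows, cols = corr.shape                      # rows = chorus frames, cols = song frames
    candidates = corr > first_threshold          # candidate points above the first threshold
    idx = np.arange(rows)
    scores = [candidates[idx, start + idx].sum()      # candidate points on the 45-degree diagonal
              for start in range(cols - rows + 1)]
    return np.array(scores)                      # a high score marks a likely target column interval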
Second, referring to fig. 7, a flowchart of a method for detecting a chorus provided in an exemplary embodiment of the present application is shown. The method may be performed by a computer device, for example, a terminal or server as shown in fig. 1. Based on the embodiment shown in fig. 2, step 230 includes steps 232 through 235.
Step 232, sliding a sliding window across the correlation matrix with a preset stride, and intercepting at least one window matrix from the correlation matrix, wherein the number of rows and the number of columns of the sliding window are the same as the number of rows of the correlation matrix.
Illustratively, both the length and the width of the sliding window are equal to the width (number of rows) of the correlation matrix, i.e. the sliding window is a square of equal length and width.
Taking the dimension of the correlation matrix as 100×2000 and the preset stride as 1 as an example, the dimension of the sliding window is 100×100, sliding is started from the initial position of the correlation matrix by using the sliding window, so that 2000 window matrices can be intercepted altogether, and the dimension of each window matrix is 100×100.
Step 233, for each window matrix in at least one window matrix, calculating a sum of values on a first diagonal line to obtain a reference value corresponding to each window matrix, where the first diagonal line includes a diagonal line from an upper left corner to a lower right corner in the window matrix.
The correlation values on the diagonal from the upper left to the lower right of each window matrix are added to obtain the sum of the diagonal values. If the current window matrix corresponds to a chorus segment, the sum of correlation values on its diagonal will be very large.
After the sum of the diagonal values of each window matrix is calculated, a one-dimensional vector is obtained whose values are the diagonal sums of the window matrices, arranged in the order of the window matrices. A line graph 403 as shown in fig. 8 can be drawn from this one-dimensional vector; the window matrices corresponding to the peak points in the line graph 403 are the intervals corresponding to the chorus segments.
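A sketch of steps 232 and 233 (the stride and indexing convention are assumptions): a square window whose side equals the number of rows of the correlation matrix is slid across it, and the values on the main diagonal of each window matrix are summed.
import numpy as np

def window_diagonal_sums(corr, stride=1):
    rows, cols = corr.shape                      # e.g. (100, 2000)
    idx = np.arange(rows)
    sums = [corr[idx, start + idx].sum()         # sum on the top-left to bottom-right diagonal
            for start in range(0, cols - rows + 1, stride)]
    return np.array(sums)                        # one reference value per window matrix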
Step 234, determining the window matrix corresponding to a target peak point among the at least one reference value as the target window matrix.
For example, as shown in fig. 8, after the reference value of each window matrix is plotted as a line graph 403, three peak points may be obtained, where the window matrix corresponding to each peak point is the position of the chorus segment.
Alternatively, since the number of chorus segments in a song is typically 2 to 4, the chorus segments may be screened according to the number of peak points. The target peak points are determined by acquiring candidate peak points whose reference values are higher than a second threshold among the at least one reference value, and determining the candidate peak points as target peak points when the number of candidate peak points falls within a preset number interval.
The second threshold is used to screen out peak points with small reference values (a small sum of similarity on the diagonal). The preset number interval may be set to 2 to 4, in order to screen out abnormal songs whose number of chorus segments is 1 or fewer, or 5 or more.
Optionally, outputting the song to be detected as an abnormal song when the number of the candidate peak points is not within a preset number interval.
Step 235, determining the column number interval corresponding to the target window matrix in the correlation matrix as the target column number interval.
For example, if the target window matrix occupies columns 10 to 100 of the correlation matrix, the audio segment from frame 10 to frame 100 of the song to be detected is a chorus segment.
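An illustrative sketch of steps 234 and 235 using scipy.signal.find_peaks (the library choice, the normalization, and the interval convention are assumptions; the patent only requires a peak detection step, a second threshold, and the 2-4 count filter):
from scipy.signal import find_peaks

def target_column_intervals(reference_values, second_threshold=0.8, window_width=100):
    # reference_values: 1-D numpy array of the per-window diagonal sums
    v = (reference_values - reference_values.min()) / (reference_values.max() - reference_values.min())
    peaks, _ = find_peaks(v, height=second_threshold)    # candidate peak points
    if not (2 <= len(peaks) <= 4):
        return None                                      # treat the song as abnormal
    # each peak index is taken as the starting column of a target window matrix
    return [(int(p), int(p) + window_width - 1) for p in peaks]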
In summary, the method provided in this embodiment uses the melody feature extraction model to extract the melody features of the reference chorus segment and the song to be detected, and then uses the melody features of the reference chorus segment to match the segment with high correlation from the melody features of the song to be detected, so as to select the predicted chorus segment from the song to be detected. The method can directly determine the audio frame interval where the predicted chorus segment is located from the correlation matrix of the two features, thereby improving the labeling efficiency of the chorus and saving the cost. The training set of the chorus detection model is obtained by the method provided by the embodiment of the application, so that the accuracy of the chorus detection model can be greatly improved.
According to the method provided in this embodiment, the melody features of the reference chorus segment are extracted using the melody feature extraction model, and the other chorus segments in the song to be detected are found from the given reference chorus segment based on the principle of melody similarity, which reduces the labeling cost of the chorus detection training dataset and improves the accuracy of the chorus annotation dataset.
According to the method provided in this embodiment, melody features are extracted separately from the known reference chorus segment of a song and from the whole song to be detected, and the melody correlation is calculated from these two sets of melody features by matrix multiplication to obtain a melody correlation matrix. The melody correlation matrix is mapped to one dimension to obtain a melody correlation vector, which makes it convenient to search for similar segments. A peak detection algorithm detects the peaks in the melody correlation vector, and the peaks are filtered to obtain the other chorus segments in the song. This improves the detection efficiency of chorus segments.
Illustratively, the melody feature extraction model includes a preprocessing layer and a feature extraction layer, and a training method of the melody feature extraction model is given below.
Fig. 9 shows a flowchart of a method for detecting a chorus according to an exemplary embodiment of the present application. The method may be performed by a computer device, for example, a terminal or server as shown in fig. 1. The method comprises the following steps.
In step 310, sample data is obtained, the sample data including sample audio and a sample tag, the sample tag being used to annotate a vocal melody of the sample audio.
By way of example, sample data may be obtained using the following method:
(1) And extracting the fundamental frequency of each frame of audio in the singing song to obtain the singing melody of the singing song, wherein the singing song is the vocal singing audio without accompaniment.
(2) And determining the sound classification corresponding to the fundamental frequency of each frame of audio in the singing melody according to the mapping relation between the fundamental frequency and the sound classification to obtain a sample tag corresponding to the singing song, wherein the sound classification comprises at least one sound classification obtained by dividing a frequency interval of the fundamental frequency into at least one subinterval.
(3) An accompaniment audio library is acquired, wherein the accompaniment audio library comprises at least one section of accompaniment audio.
(4) And mixing the singing song with a section of accompaniment audio in the accompaniment audio library in a weighted manner to obtain sample audio, wherein a sample label corresponding to the sample audio is a sample label corresponding to the singing song.
Illustratively, a large number of singing songs (songs with no accompaniment only) are collected, and a melody is extracted from the singing songs using a signal processing algorithm pYIN, wherein the melody is expressed in hertz (Hz) (the melody in the song refers to a fundamental frequency). The window length used in extracting the melody may be 25ms and the window movement step may be 10ms, so that the resulting melody (base frequency) is a sequence, each value in the sequence representing the melody value of the location.
To facilitate model training, the fundamental frequency range is divided into a number of pitch classes, and the fundamental frequency values of a singing song are mapped to specific pitch classes. For example, the fundamental frequency range recognizable by the model is 49.0 Hz (G1) to 1174.66 Hz (D6), spanning 55 semitones. Each semitone spans 100 cents and is divided into 5 classes, so each class corresponds to 20 cents, giving 275 classes in total; one additional class is reserved to represent no melody (no singing), so the total number of classes is 276.
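An illustrative mapping from fundamental frequency to one of the 276 classes (the 0-based indexing and the handling of unvoiced frames are assumptions; the frequency range and class widths are as stated above):
import numpy as np

F_MIN = 49.0                                    # G1, the lowest recognizable fundamental
NO_MELODY = 275                                 # extra class representing "no melody"

def f0_to_class(f0_hz):
    if f0_hz is None or f0_hz <= 0:             # unvoiced frame reported by pYIN
        return NO_MELODY
    cents = 1200.0 * np.log2(f0_hz / F_MIN)     # distance above G1 in cents
    cls = int(cents // 20)                      # 20 cents per class, 5 classes per semitone
    return min(max(cls, 0), 274)                # clamp into the 275 pitch classes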
Then, a large number of accompaniment (only accompaniment unmanned) and noise common in life are collected, and the accompaniment is used for randomly mixing with the singing song in the training process to increase the generalization performance of the model.
In order to enable the model to accurately extract the melody features of the human voice in original songs (songs with accompaniment), accompaniment is randomly added to the training data (singing songs) during training. Specifically, the singing songs and their corresponding sample labels are cut into equal-length segments of 10 s. During training, for each input segment, one piece of accompaniment audio is randomly selected from the accompaniment audio library, and a 10 s section is randomly cut from it and mixed with the singing segment, where the mixing weight of the singing segment is 1 and the mixing weight of the accompaniment audio is a random value in the range of 0.3 to 0.9.
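A sketch of this random mixing step (the array handling and sample rate are assumptions; the 10 s length and the weights follow the text above):
import random

def mix_with_accompaniment(vocal_clip, accompaniment_library, sr=16000, clip_seconds=10):
    n = sr * clip_seconds                                # 10 s of samples
    accomp = random.choice(accompaniment_library)        # randomly selected accompaniment track
    start = random.randint(0, max(len(accomp) - n, 0))   # random 10 s crop (tracks assumed >= 10 s)
    w = random.uniform(0.3, 0.9)                         # accompaniment mixing weight
    return vocal_clip[:n] * 1.0 + accomp[start:start + n] * w   # vocal weight fixed at 1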
In order to increase the model's noise robustness, the singing segment is replaced by pure accompaniment audio with a probability of 1% during training, and the corresponding label is set to the class representing no melody. This effectively prevents the model from mistakenly extracting a melody in places where there is no singing.
Step 320, call the preprocessing layer to perform time-frequency conversion and filtering on the sample audio, so as to obtain mel spectrum characteristics of the sample audio.
Optionally, the preprocessing layer is processed as described in step 210.
Step 330, calling the feature extraction layer to perform feature extraction on the Mel spectrum features to obtain audio features.
Optionally, the processing of the feature extraction layer is described with reference to step 210.
Step 340, calling a classification layer to classify the melody of the audio features to obtain a predicted melody.
The classification layer is used for outputting the sound classification (fundamental frequency) of each frame of audio in the sample audio according to the audio characteristics to obtain a plurality of sound classifications of multi-frame audio, and further obtain the predicted melody of the sample audio (the sound classification of the multi-frame audio).
Step 350, training the feature extraction layer and the classification layer until convergence according to the loss between the predicted melody and the sample label.
For example, a cross-entropy loss function may be used to calculate the loss between the predicted melody and the sample label; alternatively, the Focal Loss function may be used to calculate this loss.
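A sketch of the Focal Loss variant for the 276-class per-frame labels (PyTorch and the gamma value are assumptions):
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # logits: (num_frames, 276), targets: (num_frames,) class indices in 0..275
    ce = F.cross_entropy(logits, targets, reduction="none")   # per-frame cross entropy
    pt = torch.exp(-ce)                                       # probability of the true class
    return ((1.0 - pt) ** gamma * ce).mean()                  # down-weight easy frames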
In summary, the method provided in this embodiment provides a method for training a melody feature extraction model, and the melody is extracted from a singing song by using a conventional signal processing algorithm pYIN as a training label of the melody feature extraction model. Discretizing the melody into 276 categories facilitates training of the model. And accompaniment audio is randomly added to the singing song in the melody feature extraction model training process, so that the model can accurately extract melody features from the song containing accompaniment, and the robustness and generalization capability of the model are improved.
The embodiment of the application provides a training data set arrangement method of a chorus detection model, which can accurately find out other chorus fragments in each song according to manually marked chorus fragments in the song, and improves the training effect of the chorus detection model. The implementation of how to find other chorus segments in a song from a given chorus segment is described in detail below.
Different chorus segments in the song have the same melody direction characteristic, according to the characteristic, a melody characteristic extraction model can be trained to extract the melody characteristics of the chorus segment, and the known melody characteristics of the chorus segment and the melody characteristics of the whole song are utilized to obtain the correlation, namely the chorus segment with high correlation. Referring to fig. 10, the specific flow is as follows.
Step 510, training a melody feature extraction model.
To find other chorus segments in a song from a given chorus segment, a melody feature extraction model needs to be trained first.
The melody feature extraction model is used to extract melody features from songs, the input is an audio signal, and the output is melody features. After the melody feature extraction model is trained, the features output by the melody feature extraction model are melody features, the melody features are two-dimensional matrixes, the length of each matrix represents the number of input audio frames, and the width of each matrix represents the frequency range of the melody. An example of a melody feature visualization is shown in fig. 3, where fig. 3 is a melody feature (song feature 401) of a piece of audio where the highlighted portion represents the melody trend of the song.
Step 520, calculating the melody correlation.
Computing the melody correlation is actually a multiplication of two melody feature matrices. Assume there are two melody feature segments A and B, where the dimension of A is [Ta, D] and the dimension of B is [D, Tb]; the correlation matrix obtained by computing the correlation of A and B has dimension [Ta, Tb]. This correlation matrix 402 is visualized as shown in fig. 4.
The horizontal axis of the correlation matrix represents the length of segment B and the vertical axis represents the length of segment A. The value at coordinate [ta, tb] in the correlation matrix represents the degree of correlation between the melody at position ta of segment A and the melody at position tb of segment B, expressed as a value from 1 (highest) to 0 (lowest). From the correlation matrix of fig. 4 it can be seen that there are three highlighted diagonals, at positions 700-1000, 1390-1690 and 1850-2150, indicating that these three sections of segment B are consistent with the melody of segment A. If segment A is a chorus segment and B is the whole song, it can be seen from the correlation matrix that the sections 710-1010, 1380-1680 and 1840-2140 of segment B are chorus segments.
In order to improve the accuracy of the chorus segment search, the correlation matrix is further processed by a post-processing step.
Step 530, post-processing of the melody correlation matrix.
In order to accurately find the chorus segments in the melody correlation matrix, the melody correlation matrix is mapped to a one-dimensional vector. The mapping is done by setting up a square matrix whose side length equals the height (vertical axis) of the correlation matrix, sweeping this square across the correlation matrix from left to right along its diagonal, and computing the sum of the square's diagonal values at each position as the output, finally obtaining a one-dimensional vector whose length is consistent with the length of the correlation matrix. After normalizing this one-dimensional vector (maximum 1, minimum 0), a similarity vector is obtained; visualized as shown in fig. 8, the positions of the three peaks in fig. 8 are the positions of the chorus segments.
A peak detection algorithm is used to detect the peaks in the similarity vector; the obtained peaks are the chorus segments being searched for. To improve the accuracy of the chorus segment search, the peak detection algorithm only detects peaks greater than 0.8; in addition, the number of detected peaks must be 2 to 4, and if the number of peaks is less than or equal to 1 or greater than or equal to 5, the chorus segment search is not performed (empirically, the chorus in popular songs is usually repeated 2 to 4 times).
In summary, according to the method provided in this embodiment, other chorus segments in the song can be found according to the labeled chorus segment. By using the method to obtain the training set of the chorus detection model, the accuracy of the chorus detection model can be greatly improved.
The following is an embodiment of the device according to the present application, and details of the embodiment of the device that are not described in detail may be combined with corresponding descriptions in the embodiment of the method described above, which are not described herein again.
Fig. 11 is a schematic structural diagram of a chorus detection device according to an exemplary embodiment of the present application. The apparatus may be implemented as all or part of a computer device by software, hardware, or a combination of both, the apparatus comprising:
The feature extraction module 701 is used to call a melody feature extraction model to perform feature extraction on a song to be detected to obtain song features, and to call the melody feature extraction model to perform feature extraction on a reference chorus segment to obtain chorus features, wherein the reference chorus segment is a calibrated chorus segment in the song to be detected, the song features and the chorus features are two-dimensional matrices, the width of the two-dimensional matrices represents a frequency range, and the length of the two-dimensional matrices represents the number of audio frames;
The correlation module 702 is configured to multiply the song feature with the chorus feature, and calculate a correlation matrix of the song feature and the chorus feature, where the correlation matrix is a two-dimensional matrix, a number of rows of the correlation matrix corresponds to an audio frame number of the chorus segment, and a number of columns of the correlation matrix corresponds to the audio frame number of the song to be detected;
A determining module 703, configured to determine a target column number interval in the correlation matrix in which the values meet a threshold condition;
the determining module 703 is configured to determine the target audio frame interval of the song to be detected, which corresponds to the target column number interval, as a predicted chorus segment of the song to be detected.
In an alternative embodiment, the determining module 703 is configured to determine a value in the correlation matrix above a first threshold as a candidate point, and determine a column number interval in which the candidate points in the correlation matrix are diagonally distributed as the target column number interval.
In an alternative embodiment, the determining module 703 is configured to:
Sliding in the correlation matrix according to a preset stride by using a sliding window, and intercepting at least one window matrix from the correlation matrix, wherein the number of rows and the number of columns of the sliding window are the same as the number of rows of the correlation matrix;
Calculating the sum of values on a first diagonal line for each window matrix in the at least one window matrix to obtain a reference value corresponding to each window matrix, wherein the first diagonal line comprises diagonal lines from the upper left corner to the lower right corner in the window matrix;
Determining a window matrix corresponding to a target peak point in at least one reference value as a target window matrix;
And determining a column number interval corresponding to the target window matrix in the correlation matrix as the target column number interval.
In an alternative embodiment, the determining module 703 is configured to:
Acquiring candidate peak points with reference values higher than a second threshold value in the at least one reference value;
under the condition that the number of the candidate peak points is in a preset number interval, determining the candidate peak points as the target peak points;
The apparatus further comprises:
And outputting the song to be detected as an abnormal song under the condition that the number of the candidate peak points is not in a preset number interval.
In an alternative embodiment, the melody feature extraction model includes a preprocessing layer and a feature extraction layer, and the apparatus further includes:
a training module 704, configured to obtain sample data, where the sample data includes sample audio and a sample tag, and the sample tag is used to label a vocal melody of the sample audio;
The training module 704 is configured to invoke the preprocessing layer to perform time-frequency conversion and filtering on the sample audio to obtain mel spectrum features of the sample audio;
the training module 704 is configured to invoke the feature extraction layer to perform feature extraction on the mel spectrum feature to obtain an audio feature;
the training module 704 is configured to invoke a classification layer to classify the melody of the audio feature to obtain a predicted melody;
The training module 704 is configured to train the feature extraction layer and the classification layer until convergence according to the loss between the predicted melody and the sample label.
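A compact PyTorch sketch of this training setup is shown below. The sample rate, Mel resolution, network width, and number of pitch categories are illustrative assumptions; the embodiment only fixes the preprocessing / feature extraction / classification split and a loss computed against per-frame melody labels.

```python
import torch
import torch.nn as nn
import torchaudio

class MelodyFeatureExtractor(nn.Module):
    def __init__(self, sample_rate=16000, n_mels=128, n_pitch_classes=361):
        super().__init__()
        # Preprocessing layer: time-frequency conversion plus Mel filtering.
        self.pre = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=n_mels)
        # Feature extraction layer: a small CNN over the Mel spectrogram.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Classification layer: per-frame pitch-category logits.
        self.classifier = nn.Linear(64 * n_mels, n_pitch_classes)

    def forward(self, wav):                                  # wav: (batch, samples)
        mel = self.pre(wav).unsqueeze(1)                     # (batch, 1, n_mels, frames)
        feats = self.backbone(mel)                           # (batch, 64, n_mels, frames)
        b, c, f, t = feats.shape
        per_frame = feats.permute(0, 3, 1, 2).reshape(b, t, c * f)
        return self.classifier(per_frame)                    # (batch, frames, n_pitch_classes)

# One illustrative training step against per-frame melody labels.
model = MelodyFeatureExtractor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(wav, frame_labels):                           # frame_labels: (batch, frames)
    logits = model(wav)
    loss = criterion(logits.reshape(-1, logits.shape[-1]), frame_labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```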
In an alternative embodiment, the training module 704 is configured to extract the fundamental frequency of each frame of audio in an a cappella song to obtain the a cappella melody of the song, where the a cappella song is unaccompanied vocal audio;
The training module 704 is configured to determine, according to a mapping relationship between fundamental frequencies and pitch categories, the pitch category corresponding to the fundamental frequency of each frame of audio in the a cappella melody, so as to obtain the sample tag corresponding to the a cappella song;
the training module 704 is configured to obtain an accompaniment audio library, where the accompaniment audio library includes at least one accompaniment audio segment;
the training module 704 is configured to mix the a cappella song with a segment of accompaniment audio in the accompaniment audio library in a weighted manner to obtain sample audio, where the sample tag corresponding to the sample audio is the sample tag corresponding to the a cappella song.
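The sample construction could look like the sketch below, which uses librosa's pYIN estimator for frame-wise fundamental frequency extraction. The semitone-sized pitch categories, the file paths, and the mixing-weight range are assumptions; the embodiment only requires that the fundamental-frequency range be divided into sub-intervals and that the mixing be weighted.

```python
import numpy as np
import librosa

def build_training_sample(vocal_path, accomp_path, sr=16000, fmin=65.0, fmax=1000.0):
    """Build one (sample_audio, sample_label) pair from an a cappella recording
    and an accompaniment clip."""
    vocal, _ = librosa.load(vocal_path, sr=sr)
    accomp, _ = librosa.load(accomp_path, sr=sr)

    # Fundamental frequency of each audio frame of the a cappella vocal.
    f0, voiced, _ = librosa.pyin(vocal, fmin=fmin, fmax=fmax, sr=sr)

    # Map each frame's f0 to a pitch category: class 0 for unvoiced frames, otherwise
    # one class per semitone-sized sub-interval above fmin (assumed granularity).
    labels = np.zeros(len(f0), dtype=np.int64)
    safe_f0 = np.where(voiced, f0, fmin)
    semitone = 12.0 * np.log2(safe_f0 / fmin)
    labels[voiced] = 1 + np.round(semitone[voiced]).astype(np.int64)

    # Weighted mix of the a cappella vocal and the accompaniment; the label of the
    # mixed sample audio stays the label derived from the a cappella song.
    n = min(len(vocal), len(accomp))
    alpha = np.random.uniform(0.4, 0.8)
    sample_audio = alpha * vocal[:n] + (1.0 - alpha) * accomp[:n]
    return sample_audio, labels
```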
In an alternative embodiment, the apparatus further comprises:
The training module 704 is configured to determine the song to be detected as sample input data, and determine the predicted chorus segment as a sample chorus tag;
the training module 704 is configured to invoke a chorus detection model to output a predicted chorus interval according to the sample input data;
the training module 704 is configured to train the chorus detection model until convergence according to the predicted chorus interval and the loss of the sample chorus label.
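A final sketch of this pseudo-label training loop is given below; the model, its interval output format, and the regression loss are placeholders, since the embodiment only states that the detected segments serve as sample chorus labels for a separate chorus detection model.

```python
import torch

def train_chorus_detector(model, songs, pseudo_chorus_intervals, epochs=10, lr=1e-3):
    """Fit a chorus detection model on (song, predicted chorus segment) pairs.

    songs:                   list of song feature tensors (sample input data)
    pseudo_chorus_intervals: list of (start, end) tensors (sample chorus labels)
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.SmoothL1Loss()          # assumed interval regression loss
    model.train()
    for _ in range(epochs):
        for song, label_interval in zip(songs, pseudo_chorus_intervals):
            optimizer.zero_grad()
            pred_interval = model(song)        # predicted chorus interval (start, end)
            loss = loss_fn(pred_interval, label_interval)
            loss.backward()
            optimizer.step()
    return model
```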
Fig. 12 is a schematic structural diagram of a server according to an embodiment of the present application. Specifically, the server 800 includes a central processing unit (CPU) 801, a system memory 804 including a random access memory (RAM) 802 and a read-only memory (ROM) 803, and a system bus 805 connecting the system memory 804 and the central processing unit 801. The server 800 also includes a basic input/output system (I/O system) 806 for facilitating the transfer of information between various devices within the computer, and a mass storage device 807 for storing an operating system 813, application programs 814, and other program modules 815.
The basic input/output system 806 includes a display 808 for displaying information and an input device 809, such as a mouse or a keyboard, for a user to input information. Both the display 808 and the input device 809 are connected to the central processing unit 801 via an input/output controller 810 connected to the system bus 805. The basic input/output system 806 may also include the input/output controller 810 for receiving and processing input from a number of other devices, such as a keyboard, a mouse, or an electronic stylus. Similarly, the input/output controller 810 also provides output to a display screen, a printer, or another type of output device.
The mass storage device 807 is connected to the central processing unit 801 through a mass storage controller (not shown) connected to the system bus 805. The mass storage device 807 and its associated computer-readable media provide non-volatile storage for the server 800. That is, the mass storage device 807 may include a computer-readable medium (not shown) such as a hard disk or a compact disc read-only memory (CD-ROM) drive.
Computer-readable media may include computer storage media and communication media without loss of generality. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid-state memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the above. The system memory 804 and the mass storage device 807 described above may be collectively referred to as memory.
According to various embodiments of the application, the server 800 may also operate as a remote computer connected to a network such as the Internet. That is, the server 800 may be connected to a network 812 through a network interface unit 811 connected to the system bus 805, or may be connected to another type of network or a remote computer system (not shown) using the network interface unit 811.
The application also provides a terminal which comprises a processor and a memory, wherein at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to realize the chorus detection method provided by each method embodiment. It should be noted that the terminal may be a terminal as provided in fig. 13 below.
Fig. 13 shows a block diagram of a terminal 900 according to an exemplary embodiment of the present application. The terminal 900 may be a smartphone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. The terminal 900 may also be referred to as user equipment, a portable terminal, a laptop terminal, a desktop terminal, and the like.
In general, terminal 900 includes a processor 901 and memory 902.
The processor 901 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 901 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 901 may also include a main processor and a coprocessor. The main processor is a processor for processing data in a wake-up state, also referred to as a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 901 may integrate a GPU (Graphics Processing Unit) for rendering and drawing the content that needs to be displayed on the display screen. In some embodiments, the processor 901 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 902 is configured to store at least one instruction, which is executed by the processor 901 to implement the chorus detection method provided by the method embodiments of the present application.
In some embodiments, terminal 900 can optionally further include a peripheral interface 903 and at least one peripheral. The processor 901, memory 902, and peripheral interface 903 may be connected by a bus or signal line. The individual peripheral devices may be connected to the peripheral device interface 903 via buses, signal lines, or circuit boards. Specifically, the peripheral devices include at least one of a radio frequency circuit 904, a display 905, a camera assembly 906, an audio circuit 907, a positioning assembly 908, and a power source 909.
The peripheral interface 903 may be used to connect at least one I/O (Input/Output)-related peripheral device to the processor 901 and the memory 902. In some embodiments, the processor 901, the memory 902, and the peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 901, the memory 902, and the peripheral interface 903 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 904 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 904 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 904 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Illustratively, the radio frequency circuit 904 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 904 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to, the world wide web, metropolitan area networks, intranets, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 904 may further include NFC (Near Field Communication) related circuits, which is not limited by the present application.
The display 905 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 905 is a touch display, the display 905 also has the ability to capture touch signals at or above the surface of the display 905. The touch signal may be input to the processor 901 as a control signal for processing. At this time, the display 905 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display 905, provided on the front panel of the terminal 900; in other embodiments, there may be at least two displays 905, provided on different surfaces of the terminal 900 or in a folded design; in still other embodiments, the display 905 may be a flexible display provided on a curved surface or a folded surface of the terminal 900. The display 905 may even be arranged in an irregular, non-rectangular pattern, i.e., a shaped screen. The display 905 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other materials.
The camera assembly 906 is used to capture images or video. Illustratively, the camera assembly 906 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth camera, a wide-angle camera, and a telephoto camera, so as to implement a background blurring function by fusing the main camera and the depth camera, panoramic and Virtual Reality (VR) shooting functions by fusing the main camera and the wide-angle camera, or other fusion shooting functions. In some embodiments, the camera assembly 906 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash and can be used for light compensation under different color temperatures.
The audio circuit 907 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, convert the sound waves into electrical signals, and input them to the processor 901 for processing, or input them to the radio frequency circuit 904 to implement voice communication. For stereo acquisition or noise reduction, there may be multiple microphones disposed at different portions of the terminal 900. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 901 or the radio frequency circuit 904 into sound waves. The speaker may be a conventional thin-film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert electrical signals not only into sound waves audible to humans but also into sound waves inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 907 may also include a headphone jack.
The positioning component 908 is used to determine the current geographic location of the terminal 900 to enable navigation or LBS (Location Based Service). The positioning component 908 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 909 is used to supply power to the various components in the terminal 900. The power supply 909 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power source 909 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 900 can further include one or more sensors 910. The one or more sensors 910 include, but are not limited to, an acceleration sensor 911, a gyroscope sensor 912, a pressure sensor 913, a fingerprint sensor 914, an optical sensor 915, and a proximity sensor 916.
The acceleration sensor 911 can detect the magnitudes of acceleration on the three coordinate axes of the coordinate system established with the terminal 900. For example, the acceleration sensor 911 may be used to detect the components of gravitational acceleration on the three coordinate axes. The processor 901 may control the display 905 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 911. The acceleration sensor 911 may also be used to collect motion data for games or for the user.
The gyroscope sensor 912 may detect the body direction and the rotation angle of the terminal 900, and may cooperate with the acceleration sensor 911 to collect the user's 3D motion on the terminal 900. Based on the data collected by the gyroscope sensor 912, the processor 901 can implement functions such as motion sensing (e.g., changing the UI according to the user's tilting operation), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 913 may be provided on a side frame of the terminal 900 and/or at a lower layer of the display 905. When the pressure sensor 913 is disposed on the side frame of the terminal 900, the user's grip signal on the terminal 900 can be detected, and the processor 901 performs left/right hand recognition or shortcut operations according to the grip signal collected by the pressure sensor 913. When the pressure sensor 913 is disposed at the lower layer of the display 905, the processor 901 controls the operability controls on the UI according to the user's pressure operation on the display 905. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 914 is used to collect the user's fingerprint, and the processor 901 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 914, or the fingerprint sensor 914 identifies the user's identity according to the collected fingerprint. When the user's identity is recognized as a trusted identity, the processor 901 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 914 may be provided on the front, back, or side of the terminal 900. When a physical key or a vendor logo is provided on the terminal 900, the fingerprint sensor 914 may be integrated with the physical key or the vendor logo.
The optical sensor 915 is used to collect the intensity of ambient light. In one embodiment, the processor 901 may control the display brightness of the display panel 905 based on the intensity of ambient light collected by the optical sensor 915. Specifically, the display luminance of the display panel 905 is turned up when the ambient light intensity is high, and the display luminance of the display panel 905 is turned down when the ambient light intensity is low. In another embodiment, the processor 901 may also dynamically adjust the shooting parameters of the camera assembly 906 based on the ambient light intensity collected by the optical sensor 915.
The proximity sensor 916, also referred to as a distance sensor, is typically provided on the front panel of the terminal 900. The proximity sensor 916 is used to measure the distance between the user and the front face of the terminal 900. In one embodiment, when the proximity sensor 916 detects that the distance between the user and the front face of the terminal 900 gradually decreases, the processor 901 controls the display 905 to switch from the screen-on state to the screen-off state; when the proximity sensor 916 detects that the distance between the user and the front face of the terminal 900 gradually increases, the processor 901 controls the display 905 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the structure shown in fig. 13 is not limiting and that more or fewer components than shown may be included or certain components may be combined or a different arrangement of components may be employed.
The memory also includes one or more programs, which are stored in the memory and include instructions for performing the chorus detection method provided by the embodiments of the present application.
The application also provides a computer device, which includes a processor and a memory, where at least one instruction, at least one program, a code set, or an instruction set is stored in the memory, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the chorus detection method provided by each of the above method embodiments.
The present application also provides a computer readable storage medium, where at least one instruction, at least one program, a code set, or an instruction set is stored, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement a chorus detection method provided by each of the above method embodiments.
The present application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the chorus detection method provided in the above alternative implementation.
It should be understood that references herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate the three cases of A alone, both A and B, and B alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing is illustrative of the present application and is not to be construed as limiting thereof, but rather as various modifications, equivalent arrangements, improvements, etc., which fall within the spirit and principles of the present application.

Claims (10)

Translated from Chinese
1. A chorus detection method, characterized in that the method comprises:
calling a melody feature extraction model to perform feature extraction on a song to be detected to obtain song features; calling the melody feature extraction model to perform feature extraction on a reference chorus segment to obtain chorus features; the reference chorus segment being a calibrated chorus segment in the song to be detected; the song features and the chorus features both being two-dimensional matrices, the width of the two-dimensional matrix representing the frequency range, and the length of the two-dimensional matrix representing the number of audio frames;
multiplying the song features by the chorus features to calculate a correlation matrix of the song features and the chorus features; the correlation matrix being a two-dimensional matrix, the number of rows of the correlation matrix corresponding to the number of audio frames of the chorus segment, and the number of columns of the correlation matrix corresponding to the number of audio frames of the song to be detected;
determining a target column number interval in the correlation matrix whose values meet a threshold condition;
determining the target audio frame interval of the song to be detected corresponding to the target column number interval as the predicted chorus segment of the song to be detected.

2. The method according to claim 1, characterized in that determining the target column number interval in the correlation matrix whose values meet the threshold condition comprises:
determining values in the correlation matrix that are higher than a first threshold as candidate points;
determining a column number interval in which the candidate points in the correlation matrix are diagonally distributed as the target column number interval.

3. The method according to claim 1, characterized in that determining the target column number interval in the correlation matrix whose values meet the threshold condition comprises:
sliding a sliding window in the correlation matrix according to a preset stride and intercepting at least one window matrix from the correlation matrix, the number of rows and the number of columns of the sliding window being the same as the number of rows of the correlation matrix;
calculating, for each window matrix in the at least one window matrix, the sum of the values on a first diagonal to obtain a reference value corresponding to each window matrix; the first diagonal comprising the diagonal from the upper left corner to the lower right corner of the window matrix;
determining the window matrix corresponding to a target peak point among the at least one reference value as a target window matrix;
determining the column number interval corresponding to the target window matrix in the correlation matrix as the target column number interval.

4. The method according to claim 3, characterized in that the target peak point is determined by:
obtaining candidate peak points whose reference values are higher than a second threshold among the at least one reference value;
determining the candidate peak points as the target peak points in a case where the number of the candidate peak points is within a preset number interval;
the method further comprising:
outputting the song to be detected as an abnormal song in a case where the number of the candidate peak points is not within the preset number interval.

5. The method according to any one of claims 1 to 4, characterized in that the melody feature extraction model comprises a preprocessing layer and a feature extraction layer, and the melody feature extraction model is trained by:
acquiring sample data, the sample data comprising sample audio and a sample label, the sample label being used to label the vocal melody of the sample audio;
calling the preprocessing layer to perform time-frequency conversion and filtering on the sample audio to obtain mel spectrum features of the sample audio;
calling the feature extraction layer to perform feature extraction on the mel spectrum features to obtain audio features;
calling a classification layer to perform melody classification on the audio features to obtain a predicted melody;
training the feature extraction layer and the classification layer until convergence according to the loss between the predicted melody and the sample label.

6. The method according to claim 5, characterized in that acquiring the sample data comprises:
extracting the fundamental frequency of each frame of audio in an a cappella song to obtain the a cappella melody of the a cappella song, the a cappella song being unaccompanied vocal audio;
determining, according to a mapping relationship between fundamental frequencies and pitch categories, the pitch category corresponding to the fundamental frequency of each frame of audio in the a cappella melody to obtain the sample label corresponding to the a cappella song; the pitch categories comprising at least one pitch category obtained by dividing the frequency interval of the fundamental frequency into at least one sub-interval;
acquiring an accompaniment audio library, the accompaniment audio library comprising at least one segment of accompaniment audio;
mixing the a cappella song with a segment of accompaniment audio in the accompaniment audio library in a weighted manner to obtain sample audio, the sample label corresponding to the sample audio being the sample label corresponding to the a cappella song.

7. The method according to any one of claims 1 to 4, characterized in that the method further comprises:
determining the song to be detected as sample input data and determining the predicted chorus segment as a sample chorus label;
calling a chorus detection model to output a predicted chorus interval according to the sample input data;
training the chorus detection model until convergence according to the loss between the predicted chorus interval and the sample chorus label.

8. A chorus detection apparatus, characterized in that the apparatus comprises:
a feature extraction module configured to call a melody feature extraction model to perform feature extraction on a song to be detected to obtain song features, and to call the melody feature extraction model to perform feature extraction on a reference chorus segment to obtain chorus features; the reference chorus segment being a calibrated chorus segment in the song to be detected; the song features and the chorus features both being two-dimensional matrices, the width of the two-dimensional matrix representing the frequency range, and the length of the two-dimensional matrix representing the number of audio frames;
a correlation module configured to multiply the song features by the chorus features to calculate a correlation matrix of the song features and the chorus features; the correlation matrix being a two-dimensional matrix, the number of rows of the correlation matrix corresponding to the number of audio frames of the chorus segment, and the number of columns of the correlation matrix corresponding to the number of audio frames of the song to be detected;
a determining module configured to determine a target column number interval in the correlation matrix whose values meet a threshold condition;
the determining module being configured to determine the target audio frame interval of the song to be detected corresponding to the target column number interval as the predicted chorus segment of the song to be detected.

9. A computer device, characterized in that the computer device comprises a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, the at least one instruction, the at least one program, the code set, or the instruction set being loaded and executed by the processor to implement the chorus detection method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, the at least one instruction, the at least one program, the code set, or the instruction set being loaded and executed by a processor to implement the chorus detection method according to any one of claims 1 to 7.
CN202311781810.0A | 2023-12-21 | 2023-12-21 | Method, device, equipment and storage medium for detecting chorus | Active | CN117690420B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202311781810.0A CN117690420B (en) | 2023-12-21 | 2023-12-21 | Method, device, equipment and storage medium for detecting chorus

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202311781810.0A CN117690420B (en) | 2023-12-21 | 2023-12-21 | Method, device, equipment and storage medium for detecting chorus

Publications (2)

Publication Number | Publication Date
CN117690420A (en) | 2024-03-12
CN117690420B (en) | 2025-09-09

Family

ID=90128261

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202311781810.0A (Active, CN117690420B (en)) | Method, device, equipment and storage medium for detecting chorus | 2023-12-21 | 2023-12-21

Country Status (1)

Country | Link
CN (1) | CN117690420B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115114993A (en) * | 2022-07-15 | 2022-09-27 | 腾讯音乐娱乐科技(深圳)有限公司 | Main melody extraction model training method and component, singing voice detection method and component
CN115329125A (en) * | 2022-08-10 | 2022-11-11 | 成都开心音符科技有限公司 | Song medley splicing method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
DE102004047032A1 (en) * | 2004-09-28 | 2006-04-06 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Apparatus and method for designating different segment classes

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115114993A (en) * | 2022-07-15 | 2022-09-27 | 腾讯音乐娱乐科技(深圳)有限公司 | Main melody extraction model training method and component, singing voice detection method and component
CN115329125A (en) * | 2022-08-10 | 2022-11-11 | 成都开心音符科技有限公司 | Song medley splicing method and device

Also Published As

Publication number | Publication date
CN117690420A (en) | 2024-03-12

Similar Documents

Publication | Publication Date | Title
CN110121118B (en) Video clip positioning method, device, computer equipment and storage medium
CN109994127B (en)Audio detection method and device, electronic equipment and storage medium
CN111524501B (en)Voice playing method, device, computer equipment and computer readable storage medium
CN111897996A (en)Topic label recommendation method, device, equipment and storage medium
CN108831423B (en)Method, device, terminal and storage medium for extracting main melody tracks from audio data
CN110867194B (en)Audio scoring method, device, equipment and storage medium
CN113744736B (en)Command word recognition method and device, electronic equipment and storage medium
CN111291200A (en)Multimedia resource display method and device, computer equipment and storage medium
CN113763931B (en)Waveform feature extraction method, waveform feature extraction device, computer equipment and storage medium
CN112667844B (en) Audio retrieval method, device, equipment and storage medium
CN112001442B (en)Feature detection method, device, computer equipment and storage medium
CN115206305B (en)Semantic text generation method and device, electronic equipment and storage medium
CN111428079B (en)Text content processing method, device, computer equipment and storage medium
CN109829067B (en)Audio data processing method and device, electronic equipment and storage medium
CN111081277B (en)Audio evaluation method, device, equipment and storage medium
CN112989198B (en)Push content determination method, device, equipment and computer-readable storage medium
CN112380380B (en)Method, device, equipment and computer readable storage medium for displaying lyrics
CN111212323A (en)Audio and video synthesis method and device, electronic equipment and medium
CN117690420B (en)Method, device, equipment and storage medium for detecting chorus
CN113724739B (en)Method, terminal and storage medium for retrieving audio and training acoustic model
CN116580707A (en)Method and device for generating action video based on voice
CN113593521B (en)Speech synthesis method, device, equipment and readable storage medium
CN110910862B (en)Audio adjustment method, device, server and computer readable storage medium
CN114724588B (en) Voice detection method, device, electronic device, storage medium and product
CN111292773A (en)Audio and video synthesis method and device, electronic equipment and medium

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
