CN104598541A - Identification method and device for multimedia file - Google Patents

Identification method and device for multimedia file

Info

Publication number
CN104598541A
Authority
CN
China
Prior art keywords
data
multimedia file
audio data
audio
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410849018.9A
Other languages
Chinese (zh)
Inventor
王晓萌
谭傅伦
许泽军
王英杰
袁斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
LeTV Information Technology Beijing Co Ltd
Original Assignee
LeTV Information Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by LeTV Information Technology Beijing Co Ltd
Priority to CN201410849018.9A
Publication of CN104598541A
Legal status: Pending

Abstract

The invention discloses a method and a device for identifying a multimedia file. The identification method comprises the following steps: obtaining mixed audio data corresponding to a target multimedia file, wherein the mixed audio data comprises audio data and audio watermark data of the target multimedia file; extracting the audio watermark data from the mixed audio data; matching the audio watermark data with a preset audio watermark sample to obtain a first matching result; determining, in a preset feature sample, a feature sample part corresponding to the first matching result; extracting feature information of the audio data of the target multimedia file from the mixed audio data; matching the feature information with the feature sample part to obtain a second matching result; and identifying the target multimedia file according to the second matching result. The method and device can improve the fineness of audio-based identification.

Description

Multimedia file identification method and device
Technical Field
The present invention relates to the field of multimedia file identification technologies, and in particular, to a method and an apparatus for identifying a multimedia file.
Background
Current video search generally relies on searching by "keyword". This requires not only that the user know relevant information about the video, but also that the search provider maintain, in a timely manner, a database of "keywords" in one-to-one correspondence with videos. In practice, a common situation arises: we encounter an interesting video on the street or on television, but are unfamiliar with, or entirely ignorant of, its details, let alone able to search for it by keyword.
Recognizing video by its audio has therefore been driven by this practical need: it enables a video to be identified from its own sound. Audio-based video recognition mainly comprises two techniques: audio watermark-based video identification and audio fingerprint-based video identification.
Audio watermark-based video identification commonly works as follows: exploiting the insensitivity of the human ear to high-frequency sound, voiceprint codes carrying specific information are added to the high-frequency band of the audio data. After capturing audio that carries a voiceprint code, the identification terminal extracts the code and matches it against voiceprint code samples in a database, thereby identifying the video through its sound. Its advantage is a high recognition speed (millisecond level).
However, this technique relies only on the voiceprint code data to distinguish videos, and therefore cannot distinguish videos to which the same voiceprint code has been added. For example, when the same voiceprint code is added to all episodes of a series, the individual episodes cannot be distinguished: identifying an episode can only establish that it belongs to a certain series, not which episode it is. Likewise, when the same voiceprint code is added throughout a movie, the segments of the movie cannot be distinguished: identifying a segment can only establish that it belongs to a certain movie, not which segment it is.
For the problem of low video identification fineness in the prior art, no effective solution has yet been proposed.
Disclosure of Invention
The invention mainly aims to provide a method and a device for identifying a multimedia file, which aim to solve the problem of low video identification fineness in the prior art.
According to one aspect of the present invention, a method for identifying a multimedia file is provided.
The method for identifying the multimedia file comprises the following steps: acquiring mixed audio data corresponding to a target multimedia, wherein the mixed audio data comprises audio data and audio watermark data of a target multimedia file; extracting audio watermark data in the mixed audio data; matching the audio watermark data with a preset audio watermark sample to obtain a first matching result; determining a characteristic sample part corresponding to a first matching result in a preset characteristic sample; extracting characteristic information of audio data of a target multimedia file in the mixed audio data; matching the characteristic information with the characteristic sample part to obtain a second matching result; and identifying the target multimedia file according to the second matching result.
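The step sequence above can be sketched as a two-stage lookup. The following is an illustrative sketch, not the patent's implementation: the database shapes, the record fields, and the `similarity` callback are all hypothetical.

```python
def identify(watermark, fingerprint, watermark_db, feature_db, similarity):
    """Two-stage match: the watermark narrows the search to a group
    (first matching result); the fingerprint then picks one file
    within that group (second matching result)."""
    group_id = watermark_db.get(watermark)        # first matching result
    if group_id is None:
        return None
    candidates = feature_db.get(group_id, [])     # the "feature sample part"
    if not candidates:
        return None
    # The second stage runs only over the narrowed candidate set.
    return max(candidates,
               key=lambda rec: similarity(fingerprint, rec["fingerprint"]))
```

With, say, a toy dot-product similarity, a watermark shared by a whole series resolves to the series, and the fingerprint then resolves the individual episode within it.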
Further, the mixed audio data further includes user speech data, the method further comprising: extracting user voice data in the mixed audio data; matching the user voice data with a preset voice sample to obtain a third matching result; and selecting a target multimedia file from the target multimedia files identified according to the second matching result according to the third matching result.
Further, extracting the audio watermark data in the mixed audio data includes: extracting audio data of a high frequency part in the mixed audio data; extracting feature information of audio data of a target multimedia file in the mixed audio data includes: extracting feature information of audio data of a low-frequency part in the mixed audio data; extracting the user voice data in the mixed audio data includes: extracting audio data of a low-frequency part in the mixed audio data; and removing the audio data of the target multimedia file in the audio data of the low-frequency part to obtain the user voice data.
Further, extracting the feature information of the audio data of the target multimedia file in the mixed audio data includes: extracting left channel data and right channel data of a low-frequency part in the mixed audio data; combining the left channel data and the right channel data using the following formula to obtain stereo data for the low-frequency part: s = a*l + b*r, where a + b = 1, s is the stereo data of the low-frequency part, l is the left channel data of the low-frequency part, r is the right channel data of the low-frequency part, and a and b are preset parameters; and extracting the time-frequency characteristic data of the stereo data to obtain fingerprint information of the target multimedia file, wherein the fingerprint information forms the feature information of the audio data of the target multimedia file.
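The channel-mixing formula s = a*l + b*r can be written out directly; a minimal NumPy sketch (the function name and array handling are my own, not from the patent):

```python
import numpy as np

def mix_channels(l, r, a, b):
    """Combine left/right low-frequency channel data per s = a*l + b*r.
    The preset parameters are required to satisfy a + b = 1."""
    if abs(a + b - 1.0) > 1e-9:
        raise ValueError("preset parameters must satisfy a + b = 1")
    return a * np.asarray(l, dtype=float) + b * np.asarray(r, dtype=float)
```

For a = b = 0.5 this is the usual downmix to mono; other weight pairs emphasise one channel over the other while preserving overall scale.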
Further, if the target multimedia file is a sub multimedia file of a second multimedia file, the first matching result is identification information of the second multimedia file, the second matching result is identification information of the target multimedia file, the feature sample is at least one multimedia record stored in a preset feature database, and the multimedia record includes fingerprint information of the multimedia file and identification information of the multimedia file corresponding to the fingerprint information, then: determining a feature sample part corresponding to the first matching result in a preset feature sample comprises: locating one or more multimedia records corresponding to the identification information of the second multimedia file in the feature database; matching the feature information with the feature sample portion to obtain a second matching result comprises: the fingerprint information of the target multimedia is matched with the located one or more multimedia records to determine the identification information of the target multimedia.
Further, the stereo data of the low-frequency part is N stereo data, wherein the i-th of the N stereo data is s_i = a_i*l + b_i*r, with a_i + b_i = 1, i = 1, 2, 3, …, N. Matching the fingerprint information of the target multimedia file with the located one or more multimedia records to determine the identification information of the target multimedia file comprises: matching the time-frequency characteristic data of each stereo data with the located one or more multimedia records to obtain a plurality of matching rates corresponding to the stereo data; and determining the identification information of the target multimedia file according to the multimedia record corresponding to the maximum value among the matching rates.
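This multi-mix matching can be sketched as follows: form each s_i, score it against every located record, and keep the record with the highest matching rate. The `match_rate` callback and the record shape are hypothetical stand-ins.

```python
def best_record(l, r, weights, records, match_rate):
    """For each preset pair (a_i, b_i) with a_i + b_i = 1, form
    s_i = a_i*l + b_i*r, score it against every located multimedia
    record, and return the record with the maximum matching rate."""
    best, best_rate = None, float("-inf")
    for a, b in weights:
        s = [a * x + b * y for x, y in zip(l, r)]
        for rec in records:
            rate = match_rate(s, rec["fingerprint"])
            if rate > best_rate:
                best, best_rate = rec, rate
    return best
```

Trying several (a_i, b_i) mixes hedges against the unknown channel balance of the recording: whichever mix best reproduces the stored fingerprint wins.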
According to another aspect of the present invention, there is provided an apparatus for identifying a multimedia file.
The multimedia file recognition device according to the present invention comprises: the system comprises an acquisition module, a storage module and a processing module, wherein the acquisition module is used for acquiring mixed audio data corresponding to target multimedia, and the mixed audio data comprises audio data and audio watermark data of a target multimedia file; the first extraction module is used for extracting audio watermark data in the mixed audio data; the first matching module is used for matching the audio watermark data with a preset audio watermark sample to obtain a first matching result; the determining module is used for determining a characteristic sample part corresponding to the first matching result in a preset characteristic sample; the second extraction module is used for extracting the characteristic information of the audio data of the target multimedia file in the mixed audio data; the second matching module is used for matching the characteristic information with the characteristic sample part to obtain a second matching result; and the identification module is used for identifying the target multimedia file according to the second matching result.
Further, the mixed audio data further includes user voice data, the apparatus further includes: the third extraction module is used for extracting user voice data in the mixed audio data; the third matching module is used for matching the user voice data with a preset voice sample to obtain a third matching result; and the verification module is used for selecting a target multimedia file from the target multimedia files identified according to the second matching result according to the third matching result.
Further, the steps specifically executed by the first extraction module when extracting the audio watermark data are as follows: extracting audio data of a high frequency part in the mixed audio data; the second extraction module specifically executes the following steps when extracting the feature information: extracting feature information of audio data of a low-frequency part in the mixed audio data; the third extraction module specifically executes the following steps when extracting the user voice data: extracting audio data of a low-frequency part in the mixed audio data; and removing the audio data of the target multimedia file in the audio data of the low-frequency part to obtain the user voice data.
Further, the second extraction module comprises: a left and right channel data extraction module for extracting left channel data and right channel data of a low-frequency part in the mixed audio data; a stereo data synthesis module for combining the left channel data and the right channel data using the following formula to obtain the stereo data of the low-frequency part: s = a*l + b*r, where a + b = 1, s is the stereo data of the low-frequency part, l is the left channel data of the low-frequency part, r is the right channel data of the low-frequency part, and a and b are preset parameters; and a fingerprint information extraction module for extracting the time-frequency characteristic data of the stereo data to obtain the fingerprint information of the target multimedia file, wherein the fingerprint information forms the feature information of the audio data of the target multimedia file.
Further, if the target multimedia file is a sub multimedia file of a second multimedia file, the first matching result is identification information of the second multimedia file, the second matching result is identification information of the target multimedia file, the feature sample is at least one multimedia record stored in a preset feature database, and the multimedia record includes fingerprint information of the multimedia file and identification information of the multimedia file corresponding to the fingerprint information, then: the steps specifically executed by the determination module when determining the characteristic sample part are as follows: locating one or more multimedia records corresponding to the identification information of the second multimedia file in the feature database; the steps specifically executed by the second matching module when the second matching result is obtained are as follows: the fingerprint information of the target multimedia is matched with the located one or more multimedia records to determine the identification information of the target multimedia.
Further, the stereo data of the low-frequency part is N stereo data, wherein the i-th of the N stereo data is s_i = a_i*l + b_i*r, with a_i + b_i = 1, i = 1, 2, 3, …, N, and the second matching module comprises: a matching rate determining module for respectively matching the time-frequency characteristic data of each stereo data with the located one or more multimedia records to obtain a plurality of matching rates corresponding to the stereo data; and an identification information determining module for determining the identification information of the target multimedia file according to the multimedia record corresponding to the maximum value among the matching rates.
According to the method and the device, when the multimedia file is identified, the audio watermark data of the target multimedia file is matched with the preset audio watermark sample to obtain a first matching result, the preset characteristic sample is screened according to the first matching result, the characteristic sample part corresponding to the first matching result is screened, the primary identification of the target multimedia file is realized, and the identification range is narrowed; on the basis, the characteristic information of the audio data of the target multimedia file is matched with the characteristic sample part, a second matching result can be obtained, namely, the target multimedia file is further identified in the reduced identification range, and finally, the target multimedia file is identified according to the second matching result.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart of a method according to a first embodiment of the present invention;
FIG. 2 is a flow chart of a method according to a second embodiment of the invention;
FIG. 3 is a flow chart of a method according to a third embodiment of the present invention;
FIG. 4 is a flowchart of a method according to a fourth embodiment of the present invention;
FIG. 5 is a flow chart of a method according to a fifth embodiment of the present invention;
FIG. 6 is a flowchart of a method according to a sixth embodiment of the invention;
fig. 7 is a block diagram of a recording module of a terminal according to a seventh embodiment of the present invention;
fig. 8 is a block diagram of a terminal audio recognition module according to a seventh embodiment of the present invention;
FIG. 9 is a diagram of a server and a database according to a seventh embodiment of the invention;
FIG. 10 is a flow chart of a method according to a seventh embodiment of the invention;
fig. 11 is a block diagram of an apparatus according to an eighth embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and detailed description. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The embodiment of the invention provides a multimedia file identification method, which comprises the steps of firstly, matching audio watermark data of a target multimedia file with a preset audio watermark sample to obtain a first matching result; then obtaining a characteristic sample part corresponding to the first matching result in a preset characteristic sample; and matching the characteristic information of the audio data of the target multimedia file with the characteristic sample part to obtain a second matching result, and finally identifying the target multimedia file according to the second matching result.
It can be seen that the method first exploits the high processing speed of audio watermark-based video identification: identifying the target multimedia file's audio watermark data quickly yields a preliminary result. It then exploits the ability of audio fingerprint-based video identification to recognize audio data from any source, performing further identification based on the feature information of the target file's audio data within the preliminary result. By effectively combining the two techniques, the fineness of identification is improved.
Various embodiments provided by the present invention will be described in detail below.
Example one
An embodiment of the method for identifying a multimedia file is provided, in which audio watermark data is added to audio data of a target multimedia file in advance, and a preset audio watermark sample and a preset feature sample are used when identifying the target multimedia file, and specifically, as shown in fig. 1, the method includes the following steps S102 to S114.
Step S102: and acquiring mixed audio data corresponding to the target multimedia, wherein the mixed audio data comprises audio data and audio watermark data of the target multimedia file.
The target multimedia file can be a video (or an audio file). During playback, the identification device starts a recording device to record sound, obtaining the mixed audio data of the video (or audio); since audio watermark data was added to the video (or audio) in advance, the recorded mixed audio data includes both the audio data of the target multimedia file and the added audio watermark data.
The identification device can be an intelligent mobile communication terminal, such as a mobile phone or a tablet (PAD); a computer; or a separate identification unit embedded in any device where multimedia file identification is required.
Step S104: and extracting audio watermark data in the mixed audio data.
The audio watermark data in the mixed audio data is extracted according to its characteristics. For example, when the audio watermark data is voiceprint code data, which is specific information added in the high-frequency band of the audio data, the voiceprint code data can be obtained by extracting the high-frequency part of the mixed audio data.
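The high/low band separation used throughout these steps can be illustrated with a simple FFT mask. This is only a sketch: the patent does not specify a cutoff frequency or a filtering method, so the 16 kHz cutoff and the FFT approach here are assumptions.

```python
import numpy as np

def split_bands(signal, sample_rate, cutoff_hz=16000.0):
    """Split a recording into low- and high-frequency parts with an FFT
    mask. The watermark (voiceprint code) is expected in the high band,
    the programme audio in the low band. The 16 kHz cutoff is purely
    illustrative -- the patent does not specify one."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    high = np.where(freqs >= cutoff_hz, spectrum, 0)  # watermark band
    low = spectrum - high                             # programme-audio band
    return np.fft.irfft(low, len(signal)), np.fft.irfft(high, len(signal))
```

A real extractor would use a proper filter bank rather than a hard spectral mask, but the principle — the two payloads live in disjoint frequency bands — is the same.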
Step S106: and matching the audio watermark data with a preset audio watermark sample to obtain a first matching result.
The preset audio watermark sample can be stored in an audio watermark database local to the identification device, in which case the identification device itself matches the audio watermark data against the preset sample. Alternatively, it can be stored in an audio watermark database at a remote audio watermark identification server; the identification device then interacts with the server, transmitting the audio watermark data to it, and the server performs the matching. Whichever side performs the matching step, a first matching result is obtained.
The first matching result may be a multimedia file group consisting of a plurality of multimedia files, that is, by this step, it is determined that the target multimedia file is in the multimedia file group.
For example, the same audio watermark data is added in advance to all movies starring the same actor, and the preset audio watermark sample is a set of audio watermark records stored in an audio watermark database, each record consisting of audio watermark data and the name of the lead actor corresponding to that data. If the target multimedia file is a movie starring a certain actor, then matching its audio watermark data against the preset audio watermark sample will necessarily locate an audio watermark record in the database and yield the movie's lead actor; that is, the audio watermark data identifies the target multimedia file as a movie starring actor A.
Step S108: and determining a characteristic sample part corresponding to the first matching result in the preset characteristic sample.
The predetermined feature sample may be stored in an audio fingerprint database local to the identification device, so that the identification device determines the feature sample part locally; or it may be stored in an audio fingerprint database at a remote audio fingerprint identification server, in which case the first matching result is transmitted to the server and the server determines the feature sample part. Whichever side performs this determining step, a part of the preset feature sample is screened out as the feature sample part corresponding to the first matching result.
Through the step, when the audio fingerprints are matched, the feature information of the audio data of the target multimedia file is not required to be matched with the whole preset feature sample, and the feature information is only required to be matched with the feature sample part corresponding to the first matching result.
For example, the preset feature samples are a plurality of multimedia records stored in an audio fingerprint database, and each multimedia record is composed of fingerprint information of a multimedia file, a movie name corresponding to the fingerprint information, and a movie lead actor name. The first matching result obtained in step S106 is: the target multimedia is a movie starring an actor a, and in this step, one or more multimedia records corresponding to the actor a can be located in the audio fingerprint database according to the first matching result.
Step S110: and extracting the characteristic information of the audio data of the target multimedia file in the mixed audio data.
For example, when the audio watermark data is the audio watermark code data, the audio watermark code data is specific information added in a high frequency band of the audio data, and therefore, by extracting data of a low frequency part in the mixed audio data, the audio data of the target multimedia file in the mixed audio data can be obtained.
Further, extracting the characteristic information of the data of the low-frequency part to obtain the characteristic information of the audio data of the target multimedia file. Specifically, any feature extraction method of audio data in the prior art may be adopted, for example, time domain feature data of the audio, specifically, amplitude of an audio segment, may be extracted, and time-frequency feature data of the audio may also be extracted.
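As one concrete realisation of "time-frequency feature data", a windowed short-time Fourier transform is sketched below. The patent does not commit to any particular extractor; the frame and hop sizes here are illustrative defaults, not claimed values.

```python
import numpy as np

def time_frequency_features(signal, frame=256, hop=128):
    """Toy time-frequency feature extractor: log-magnitude STFT frames.
    A windowed short-time Fourier transform is one common realisation
    of 'time-frequency characteristic data', used here purely as an
    illustration of the idea."""
    window = np.hanning(frame)
    starts = range(0, len(signal) - frame + 1, hop)
    frames = np.stack([signal[i:i + frame] * window for i in starts])
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1)))
```

The resulting matrix (frames x frequency bins) is the kind of representation an audio fingerprinting scheme would further reduce to compact fingerprint information.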
Step S112: and matching the characteristic information with the characteristic sample part to obtain a second matching result.
When the preset feature sample is stored in an audio fingerprint database local to the identification device, the identification device matches the feature information against the feature sample part locally. When the preset feature sample is stored in an audio fingerprint database at a remote audio fingerprint identification server, the server performs the matching. Whichever side performs this matching step, a result matching the feature information is obtained.
For example, in the audio fingerprint database, the characteristic sample part is one or more multimedia records corresponding to the actor a, and the characteristic information is respectively matched with the fingerprint information of the one or more multimedia records to obtain fingerprint information successfully matched with the characteristic information, so that the movie name in the multimedia record in which the fingerprint information is located is the movie name of the target multimedia file.
Step S114: and identifying the target multimedia file according to the second matching result.
With the identification method provided by this embodiment, the audio watermark data of the target multimedia file is identified first. Exploiting the high processing speed of audio watermark-based video identification, a preliminary identification result is obtained quickly; its fineness may be low — it may, for example, only place the multimedia file within a broader group — but that range is determined fast. Within this narrowed range, identification then proceeds based on the feature information of the target multimedia file's audio data, exploiting the ability of audio fingerprint-based video identification to identify audio data from any source. The method thus effectively combines audio watermark-based and audio fingerprint-based video identification: compared with using audio watermark-based identification alone, it improves the fineness and the applicable range of video identification; compared with using audio fingerprint-based identification alone, it shortens the identification time.
Example two
The second embodiment provides a method for identifying a multimedia file, as a preferred embodiment based on the first embodiment. In this embodiment, the target multimedia file is a target video; voiceprint code data is added in advance to the audio data of the target video; user voice data is acquired at the same time as the audio data of the target video; on the basis of identifying the target multimedia file as in the first embodiment, the method further verifies the accuracy of the identification result against the user's voice data. A preset voiceprint code sample and a preset feature sample are used when identifying the target video, and a preset voice sample is used when verifying the recognition result against the user's voice data. Specifically, as shown in fig. 2, the method includes the following steps S202 to S212.
Step S202: and acquiring mixed audio data corresponding to the target video, wherein the mixed audio data comprises audio data of the target video, voice print code data and user voice data.
While watching a video, a user may be familiar with specific details of it, such as scenes, actors, or even objects appearing in the video to be recognized, and may speak this familiar content aloud during playback. For example, during playback of the target video, the recognition device starts its recording device and records all sound in the current environment, i.e., the mixed audio data corresponding to the target video; this mixed audio data includes the target video's own audio data, the added audio watermark data, and the user voice data uttered by the user.
Step S204: and extracting audio watermark data in the mixed audio data, and obtaining a first matching result by using the audio watermark data and a preset audio watermark sample.
This step is the same as step S104 and step S106 in the first embodiment, and is not described here again.
Step S206: and determining a characteristic sample part corresponding to the first matching result in a preset characteristic sample, extracting characteristic information of audio data of the target video in the mixed audio data, matching the characteristic information with the characteristic sample part to obtain a second matching result, and identifying the target video according to the second matching result.
The steps are the same as steps S108 to S114 in the first embodiment, and are not described again here.
Step S208: user speech data in the mixed audio data is extracted.
According to the characteristics of the voiceprint code data, the voiceprint code data can be removed from the mixed audio data by extracting the data of the low-frequency part of the mixed audio data. The audio data of the target video is then removed from this low-frequency data, leaving the user voice data.
Specifically, removing the audio data of the target video from the low-frequency data requires first acquiring that audio data. Since the target video has already been identified in step S206, the audio data corresponding to the identified target video is used here. For example, if the second matching result includes URL information of the target video, the audio data of the target video can be obtained according to the URL information, and the obtained audio data is then subtracted from the low-frequency data to yield the user voice data.
Step S210: and matching the user voice data with a preset voice sample to obtain a third matching result.
The preset voice sample can be stored in a voice database local to the recognition device, in which case the recognition device performs the matching of the user voice data against the preset voice sample. Alternatively, it can be stored in a voice database located on a remote voice recognition server; the recognition device then interacts with the voice recognition server, transmits the user voice data to it, and the voice recognition server performs the matching. Whichever party executes the matching step, a third matching result is obtained.
For example, the preset voice sample is a plurality of voice records stored in a voice database, where each voice record consists of voice feature information and a keyword corresponding to that voice feature information. When matching the user voice data against the preset voice sample, the voice feature information of the user's voice is first extracted from the user voice data, and this extracted feature information is then matched against the voice feature information in the voice database. One or more voice records can thereby be located in the voice database, yielding the keywords corresponding to the user voice data. For instance, the keywords obtained for the user voice data may be "shooting location" and "Hainan".
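As an illustration of this matching step, the sketch below locates voice records by similarity of feature vectors and returns their keywords. The feature vectors, database contents, and cosine-similarity measure are all hypothetical; a real system would use proper acoustic features.

```python
import math

# Hypothetical voice database: each record pairs a feature vector with a keyword.
VOICE_DB = [
    ([0.9, 0.1, 0.3], "shooting location"),
    ([0.2, 0.8, 0.5], "Hainan"),
    ([0.4, 0.4, 0.9], "actor"),
]

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def match_keywords(feature_vectors, threshold=0.85):
    """For each extracted feature vector, locate the closest voice record
    and collect its keyword when the similarity clears the threshold
    (the third matching result)."""
    keywords = []
    for fv in feature_vectors:
        best = max(VOICE_DB, key=lambda rec: cosine(rec[0], fv))
        if cosine(best[0], fv) >= threshold:
            keywords.append(best[1])
    return keywords
```

A query whose features closely match the first two records would thus yield the keywords "shooting location" and "Hainan", as in the example above.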
Step S212: identifying the target multimedia file according to the third matching result and the second matching result. The target video is identified by the second matching result, but the second matching result may give a plurality of candidate target videos; the keyword spoken by the user is identified by the third matching result, and the identified target video can be further narrowed down using that keyword.
Specifically, suppose the second matching result identifies the target video as either movie B or movie C, the keywords identified from the user voice data by the third matching result are "shooting location" and "Hainan", the shooting location of movie B is "Hainan", and the shooting location of movie C is "Beijing". In this step, movie B rather than movie C is identified and output, because the shooting location of movie B matches the keyword "Hainan".
For another example, if episodes 7, 8, and 10 of a certain TV drama are identified as the target video according to the second matching result, the keywords identified from the user voice data by the third matching result are "actor" and a certain actor's name, and that actor does not appear in episodes 7 and 8, then in this step it may be determined from the third matching result that the target video file is episode 10 of the drama.
In summary, the target video can be further identified according to the identified keywords, and the accuracy of the identification result of the target video can be verified, so that the identification result with high accuracy can be provided for the user.
By adopting the method for identifying the multimedia file provided by the preferred embodiment, on the basis of the technical effect of the first embodiment, the accuracy of identifying the target multimedia file can be improved by combining the user voice data.
EXAMPLE III
The third embodiment provides an embodiment of a method for identifying a multimedia file, which is another preferred embodiment based on the first embodiment. In the method of this embodiment, audio watermark data is added in advance to the audio data of the target multimedia file, and when the target multimedia file is identified, a preset audio watermark sample and a preset feature sample are used, specifically, as shown in fig. 3, the method includes the following steps S302 to S320.
Step S302: and acquiring mixed audio data corresponding to the target multimedia, wherein the mixed audio data comprises audio data and audio watermark data of the target multimedia file.
Step S304: and extracting audio watermark data in the mixed audio data.
Step S306: and matching the audio watermark data with a preset audio watermark sample to obtain a first matching result.
The steps S302 to S306 correspond to the steps S102 to S106 in the first embodiment one by one, and are not described herein again.
Step S308: and positioning a plurality of multimedia file records corresponding to the first matching result in a preset feature database.
The preset feature database stores at least one multimedia record, the multimedia records stored in the feature database form a preset feature sample, each multimedia record comprises fingerprint information of a multimedia file and identification information of the multimedia file corresponding to the fingerprint information, and the fingerprint information in each multimedia record comprises a plurality of fingerprint values obtained by calculating time-frequency feature data of multimedia audio data.
For example, the same audio watermark data is added in advance to all television programs broadcast by a certain television channel, and the preset audio watermark samples are a plurality of audio watermark records stored in an audio watermark database, each consisting of audio watermark data and the television channel name corresponding to that audio watermark data. If the target multimedia file is a television program broadcast by some television channel, then when the audio watermark data of the target multimedia file is matched against the preset audio watermark samples, an audio watermark record is necessarily located in the audio watermark database, giving the television channel of the program; that is, the audio watermark data identifies the program as one broadcast by television channel A.
The multimedia file identification information corresponding to the fingerprint information in each multimedia record can be the television channel name and television program name of the multimedia file. After the first matching result determines that the target multimedia file is a program broadcast by television channel A, this step locates, in the feature database, the plurality of multimedia records corresponding to television channel A, namely a plurality of television programs broadcast by television channel A.
Step S310: left channel data and right channel data of a low frequency part in the mixed audio data are extracted.
The audio data of the target multimedia file in the mixed audio data can be obtained by extracting the data of the low-frequency part in the mixed audio data, wherein the audio data consists of two parts of data, namely left channel data and right channel data.
Step S312: the left channel data and the right channel data are combined to obtain N stereo data of the low frequency part.
Specifically, the following formula is adopted for merging:
s_i = a_i * l + b_i * r

where a_i + b_i = 1, i = 1, 2, 3, …, N; l and r are the left channel data and the right channel data; s_1 is the first stereo data, s_N is the Nth stereo data, and s_i is the ith stereo data; a_i and b_i are preset weight parameters, and adjusting the sizes of a_i and b_i adjusts the proportion of the left and right channel data in the stereo data.
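The merging of step S312 can be sketched as follows. The particular weight values are illustrative; the only constraint stated above is a_i + b_i = 1 for each pair.

```python
import numpy as np

def merge_channels(left, right, n_groups=3):
    """Combine left/right channel data into N stereo signals
    s_i = a_i * l + b_i * r using weight pairs with a_i + b_i = 1."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    # Evenly spaced example weights; a real system would tune these.
    weights = [(a, 1.0 - a) for a in np.linspace(0.25, 0.75, n_groups)]
    return [a * left + b * right for a, b in weights]
```

Each returned array is one stereo data s_i; varying a_i shifts the balance between the two channels.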
Step S314: and calculating time-frequency characteristic data of each stereo data to obtain a plurality of fingerprint values of each stereo data.
For each stereo data, its fingerprint information consists of a plurality of fingerprint values, and the fingerprint information of the target multimedia file consists of the fingerprint information of the N stereo data.
Specifically, for a certain stereo data, when calculating the time-frequency feature data of the stereo data to obtain a plurality of fingerprint values of the stereo data, the following steps S3142 to S3148 are included:
step S3142: carrying out short-time Fourier transform on the stereo data to obtain a time-frequency distribution map of the stereo data;
step S3144: acquiring an energy maximum value point in a time-frequency distribution graph;
step S3146: constructing a fingerprint value fp[ta, fa, fb, tb-ta] from maximum points A[ta, fa, Va] and B[tb, fb, Vb] at two different moments, and converting the fingerprint value into a hash code fp[hashData, ta], wherein ta is the moment of maximum point A, fa is the frequency of maximum point A, Va is the energy of maximum point A, tb is the moment of maximum point B, fb is the frequency of maximum point B, Vb is the energy of maximum point B, ta < tb, and maximum points A and B are any two adjacent energy maximum points in the time-frequency distribution graph;
step S3148: and combining all the constructed fingerprint values according to the time sequence to obtain a plurality of fingerprint values of the stereo data.
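The peak-pairing scheme of steps S3142 to S3148 can be sketched as follows. This is a simplified stand-in: it takes one energy maximum per frame rather than all local maxima, and uses Python's built-in hash of (fa, fb, tb-ta) in place of the real hashData encoding.

```python
import numpy as np

def fingerprint(samples, frame_len=256, hop=128):
    """Sketch of S3142-S3148: windowed short-time FFT, per-frame energy
    peak, then adjacent peak pairs hashed into (hashData, ta) values."""
    peaks = []  # (time_index, freq_bin, energy)
    for i, start in enumerate(range(0, len(samples) - frame_len + 1, hop)):
        frame = samples[start:start + frame_len] * np.hanning(frame_len)
        spec = np.abs(np.fft.rfft(frame))
        k = int(np.argmax(spec))
        peaks.append((i, k, float(spec[k])))
    fps = []
    for (ta, fa, _va), (tb, fb, _vb) in zip(peaks, peaks[1:]):
        hash_data = hash((fa, fb, tb - ta))  # stands in for fp[ta, fa, fb, tb-ta]
        fps.append((hash_data, ta))
    return fps
```

Combining the pairs in time order, as in step S3148, is what the final list comprehension-style loop produces: fingerprint values sorted by ta.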
Correspondingly, for the fingerprint information in each multimedia record in the feature database, when calculating the plurality of fingerprint values from the time-frequency feature data of the multimedia's audio data, stereo data of the multimedia is preferably used as the audio data, and the fingerprint values are preferably calculated by the same time-frequency method, so that the feature information of the target multimedia file stays consistent with the feature information in the feature database, improving matching accuracy.
Step S316: and respectively matching the fingerprint values of each stereo data with the positioned multimedia records to obtain the matching rate corresponding to each stereo data.
For example: the fingerprint information fp(hashData, t) of a certain first stereo data includes a plurality of fingerprint values, which are in turn: [(10001, 1), (10002, 1), (20001, 2), (30001, 3), ...];

the fingerprint information fp(hashData, t) of a certain second stereo data includes a plurality of fingerprint values, which are in turn: [(10002, 11), (10004, 11), (30001, 14), (30005, 16), ...];

the fingerprint information located in the first multimedia record is [(10003, 10), (10002, 20), (20001, 21), (30001, 31), ...];

the fingerprint information located in the second multimedia record is [(10002, 11), (10004, 11), (30001, 14), (30005, 16), ...].
The matching rate corresponding to the first stereo data includes: a match count of 3 with the first multimedia record and a match count of 2 with the second multimedia record (matching on the hashData values); the matching rate corresponding to the second stereo data includes: a match count of 2 with the first multimedia record and a match count of 4 with the second multimedia record.
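The counting in this example can be sketched as follows; matching is done on the hashData value of each fingerprint, which reproduces the match counts above.

```python
def matching_rates(stereo_fps, records):
    """Count, for each stereo data (steps S316-S318), how many of its
    fingerprint hash values appear in each located multimedia record."""
    rates = {}
    for si, fps in enumerate(stereo_fps, start=1):
        hashes = {h for h, _t in fps}
        for ri, rec in enumerate(records, start=1):
            rates[(si, ri)] = len(hashes & {h for h, _t in rec})
    return rates

def best_record(rates):
    """Return the index of the multimedia record with the maximum matching rate."""
    (_si, ri), _count = max(rates.items(), key=lambda kv: kv[1])
    return ri
```

Run on the fingerprint lists of the example, the maximum rate is 4, between the second stereo data and the second multimedia record, so the second record supplies the identification information.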
Step S318: and determining the identification information corresponding to the target multimedia file according to the multimedia record corresponding to the maximum value in the matching rates.
For example, the maximum of the matching rates is 4, namely the match count between the second stereo data and the second multimedia record, so the identification information determined for the target multimedia file in this step is the identification information in the second multimedia record.
Step S320: and identifying the target multimedia file according to the identification information corresponding to the target multimedia file.
For example, the two multimedia records correspond to two television programs broadcast on television channel A; the television program name in the identification information of the second multimedia record is "See", so the identified target multimedia file is "See" broadcast on television channel A.
By adopting the multimedia file identification method provided by this embodiment, on the basis of the technical effect of the first embodiment, the audio data of the target multimedia file obtained during identification is stereo data formed by combining left channel data and right channel data, and correspondingly the preset feature sample is also a feature of stereo data, so the source data type of the target multimedia file's feature information is consistent with that of the feature sample, which improves identification accuracy. Moreover, when merging the left and right channel data into stereo data, the weight parameters a_i and b_i can be set so that the proportion of the left and right channel data in the stereo data is adjusted according to actual needs.
Furthermore, when the characteristic information of the target multimedia file is constructed, the left and right channel data of the target multimedia file are converted into a plurality of groups of stereo data by setting a plurality of groups of weight parameters, and the fingerprint value corresponding to each group of stereo data is calculated, so that the characteristic information of the target multimedia file comprises a plurality of groups of fingerprint values. When the target multimedia file is identified, each group of fingerprint values are respectively matched with the positioned multimedia file records, and the target multimedia file is identified according to the multimedia file record corresponding to the maximum matching rate, so that the identification accuracy is further improved.
Example four
The fourth embodiment provides an embodiment of a method for searching for a multimedia file, as shown in fig. 4, the method includes the following steps S402 to S406.
Step S402: a search request is received, wherein the search request includes mixed audio data of a target multimedia file to be searched.
Step S404: and identifying the target multimedia file according to the search request.
Step S406: and searching the target multimedia file according to the identification result.
In this embodiment, when searching for a target multimedia file, the multimedia file needs to be identified first, and then the multimedia file is further searched according to the identified identification information of the multimedia file. Wherein, when the target multimedia file is identified, any of the above embodiments can be adopted.
EXAMPLE five
The fifth embodiment provides an embodiment of a method for searching a multimedia file, the execution subject of the method may be any terminal, as shown in fig. 5, and the method includes the following steps S502 to S512.
Step S502: and acquiring mixed audio data corresponding to the target multimedia, wherein the mixed audio data of the target multimedia file comprises audio data and audio watermark data of the target multimedia file.
Step S504: and extracting audio watermark data in the mixed audio data of the target multimedia file.
Step S506: sending the audio watermark data to an audio watermark identification server to obtain a first matching result, wherein the first matching result of the target multimedia file is the result obtained by the audio watermark identification server matching the audio watermark data against a preset audio watermark sample.
In this embodiment, the audio watermark recognition server is provided with an audio watermark database for storing the preset audio watermark samples; a plurality of audio watermark records are stored in the audio watermark database, and each audio watermark record includes audio watermark information and the multimedia file identification information corresponding to that audio watermark information.
After the terminal sends the audio watermark data of the target multimedia file to the audio watermark recognition server, the audio watermark recognition server locates the audio watermark record matched with the audio watermark data of the target multimedia file in the audio watermark database, so as to obtain a first matching result, namely, the multimedia file identification information corresponding to the target multimedia file is obtained from the audio watermark record.
The multimedia file identification information here identifies the target multimedia file at a relatively coarse granularity; that is, it cannot uniquely determine the target multimedia file. For example, the identification information here may identify the target multimedia file as belonging to a certain television series, but cannot determine which episode of the series it is; or it may identify the target multimedia file as a television program of a certain television channel, but cannot determine which program of that channel it is.
Step S508: and extracting the characteristic information of the audio data of the target multimedia file in the mixed audio data of the target multimedia file.
Step S510: and sending the feature information and the first matching result of the audio data of the target multimedia file to an audio fingerprint identification server to obtain a second matching result, wherein the second matching result is obtained by matching the feature information and the feature sample part after the audio fingerprint identification server determines the feature sample part corresponding to the first matching result in the preset feature sample.
In this embodiment, the audio fingerprint identification server is provided with an audio fingerprint database, and the audio fingerprint database is used for storing preset characteristic samples, wherein a plurality of multimedia records are stored in the audio fingerprint database, and each multimedia record is composed of fingerprint information of multimedia and identification information of a multimedia file corresponding to the fingerprint information.
The multimedia file identification information here identifies the target multimedia file at a relatively fine granularity: the target multimedia file can be uniquely determined from the content of the identification information. The identification information may include the multimedia file identification information from the audio watermark database, and may further include information describing the details of the multimedia file, such as the storage location of the multimedia file and the name of the multimedia file.
After the terminal sends the characteristic information of the audio data and the first matching result to the audio fingerprint identification server, the audio fingerprint identification server firstly positions one or more multimedia records corresponding to the first matching result in the audio fingerprint database, and then matches the characteristic information of the audio data with the positioned one or more multimedia records, so that a second matching result is obtained, and the target multimedia file is uniquely identified.
Step S512: and sending the second matching result to the multimedia management server to obtain a target multimedia file, wherein the target multimedia file is the multimedia file acquired by the multimedia management server according to the second matching result.
For example, the URL of the target multimedia file can be obtained from the second matching result; the terminal sends the second matching result to the multimedia management server, and after acquiring the target multimedia file according to the URL in the second matching result, the multimedia management server returns relevant data of the target multimedia file to the terminal. The relevant data can be streaming media data of the target multimedia file, in which case the terminal receives the streaming media data and directly plays the target multimedia file; it can also be a download address of the target multimedia file, in which case the terminal, after receiving the download address, downloads the target multimedia file from the corresponding server for playing.
In a preferred embodiment of the present invention, the mixed audio data of the target multimedia file further includes user voice data, and before step S512 the method further includes the following steps:
step S514: user speech data in the mixed audio data is extracted.
Step S516: and sending the user voice data to a voice recognition server to obtain a third matching result, wherein the third matching result is a matching result obtained by matching the user voice data with a preset voice sample by the voice recognition server.
In the preferred embodiment, the voice recognition server is provided with a voice database for storing preset voice samples, wherein a plurality of voice records are stored in the voice database, and each voice record is composed of voice characteristic information and a keyword corresponding to the voice characteristic information.
After the terminal sends the user voice data to the voice recognition server, the voice recognition server firstly extracts the voice characteristic information of the user voice according to the user voice data, and then matches the extracted voice characteristic information of the user voice with the voice characteristic information in the voice database, so that one or more voice records can be positioned in the voice database, and then keywords corresponding to the user voice data are obtained.
Step S518: and identifying the target multimedia file according to the second matching result, and verifying whether the identification result obtained by identifying the target multimedia file is correct according to the third matching result, wherein when the identification result is correct, the step S512 is executed.
After the second matching result is obtained in step S510 and the target multimedia file is identified according to it, this step verifies the accuracy of the identification result against the third matching result; when the identification result is accurate, the second matching result is sent to the multimedia management server.
For example, the target multimedia file is a movie Q. Movie Q can be identified from the second matching result, together with its description information, which includes the fact that the lead actor of movie Q is Wang XX. The keywords obtained from the third matching result are "lead actor" and "Wang XX", so the third matching result verifies that the identification result is correct; at this point the second matching result is sent to the multimedia management server to obtain movie Q.
In another preferred embodiment provided by the present invention, when step S508 extracts the feature information of the audio data of the target multimedia file from the mixed audio data of the target multimedia file, the feature information extraction method described in the third embodiment may be adopted, and details are not repeated here.
EXAMPLE six
The sixth embodiment provides another embodiment of a method for searching for a multimedia file. In this embodiment, voiceprint code data is added in advance to the audio data of a target video; voice data of a user is acquired when the video clip audio data of the target video is acquired; a preset audio watermark database and a preset audio fingerprint database are used when identifying the target video; and a preset voice database is used when recognizing the voice data of the user. Specifically, as shown in fig. 6, the method includes the following steps S602 to S608.
Step S602: and starting a recording module to acquire mixed audio data of the video clip of the target video, wherein the mixed audio data comprises the audio data of the target video and the voice data of the user.
After the recording module is started, sound information in the current environment is recorded in real time to obtain mixed audio data, and in the process of playing the target video, if user voice exists, the recorded sound information comprises audio data of a video clip of the target video, voice data of the user and some background sound data in the environment.
After the recording module is started, each time the recording duration reaches T2, the sound data of length T2 is packaged; the packaged sound data includes the audio data of the video clip, the voice data of the user, and background sound.
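The packaging of fixed-length chunks might look like the hypothetical helper below, where sr is the sampling rate and t2_seconds is the duration T2.

```python
def package_chunks(samples, sr, t2_seconds):
    """Package recorded samples into consecutive chunks of duration T2;
    each chunk is later preprocessed and identified independently."""
    size = int(sr * t2_seconds)
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]
```

A trailing remainder shorter than T2 is simply held back until enough samples accumulate.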
Step S604: an audio file of the mixed audio data is preprocessed.
The method specifically comprises the following steps:
1. and (5) converting an audio format.
Third-party software (such as ffmpeg) is called to uniformly convert audio files of different formats into PCM-encoded audio data with a duration of T2.
2. Audio data of a high frequency part is extracted.
Using a high-pass filter (whose passband is kept consistent with the frequency range occupied by the voiceprint code, assumed to be H1 Hz to H2 Hz), audio data Music1 with duration T2 and frequency range H1 Hz to H2 Hz is obtained.
3. Audio data of a low frequency part is extracted.
Using a low-pass filter, audio data Music2 having a time length of T2 and a frequency range of L1Hz to L2Hz is acquired.
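The high-/low-frequency separation of steps 2 and 3 can be sketched with an ideal FFT-mask filter. This is a stand-in for the real high-pass and low-pass filters, and the ranges H1-H2 and L1-L2 are collapsed into a single cutoff frequency here.

```python
import numpy as np

def band_split(samples, sr, cutoff_hz):
    """Split audio into low-frequency (Music2-like) and high-frequency
    (Music1-like) parts by zeroing FFT bins above/below the cutoff."""
    spectrum = np.fft.rfft(samples)
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sr)
    low = spectrum.copy()
    low[freqs > cutoff_hz] = 0
    high = spectrum - low
    return np.fft.irfft(low, n=len(samples)), np.fft.irfft(high, n=len(samples))
```

A production system would use proper filter design (e.g. Butterworth high-/low-pass filters) rather than a brick-wall FFT mask, which rings on non-periodic signals.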
Step S606: and identifying according to the audio file of the preprocessed mixed audio data.
Specifically, the content to be identified comprises a video clip of the target video and a keyword corresponding to the user voice data, and the method comprises the following steps:
1. An audio file of the preprocessed mixed audio data is received, i.e., two audio pieces are received: Music1 and Music2.
2. And splicing the acquired audio data of the low-frequency part to prepare data for voice extraction.
Each time one piece of low-frequency audio data Music2 is received, it is spliced in time order into audio data Music3 of duration N × T2, where N is the total number of audio pieces received so far.
3. And locking the target video by using the voice print code.
And identifying the voice print code information carried in the high-frequency part audio data Music1 to obtain an identification Result 1.
For example, the recognition Result1 is the ID number of the target video in the audio fingerprint database (TrackID that uniquely identifies the video fingerprint).
Result1:{TrackID:“……”}。
4. Video clip for accurately positioning target video
And extracting fingerprint information of Music2, and matching the extracted fingerprint information with the fingerprint information in the audio fingerprint database pointed by Result1 to obtain a matching Result 2.
Result2 contains: the index information TrackID of the target video in the audio fingerprint database, the storage location information URL, and the time range of the video clip within the target video, namely timeStart and timeStop.
Result2:{TrackID:“……”,URL:“http://……”,timeStart:“……”,timeStop:“……”};
5. Audio data of a video segment of the target video, i.e., an original sound of the video segment, is extracted.
Result2 is read, and the video file is found according to the storage location information URL; the audio data music of the video file is extracted; and the audio data music_clip of the specific time period (namely timeStart to timeStop) is extracted according to the time information in Result2, where music_clip is the original sound of the video clip.
6. User voice data is extracted.
For the spliced audio data Music3, Music3 essentially consists of the following three parts:

Music3 = a1 × (audio data of the video clip) + a2 × (user speech data) + a3 × (background sound data), where a1, a2, a3 are constants.

Assuming that the recording conditions are good enough, namely a3 = 0, then:

user speech data = b1 × Music3 - b2 × (audio data of the video clip), where b1, b2 are constants.

Thus, the user speech data can be extracted as:

word = b1 × Music3 - b2 × music_clip.
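The weighted subtraction can be sketched directly, assuming a3 = 0 and that Music3 and music_clip are time-aligned; the weights b1 and b2 default to 1 here but are tunable constants as stated above.

```python
import numpy as np

def extract_user_speech(music3, music_clip, b1=1.0, b2=1.0):
    """Recover user speech as word = b1 * Music3 - b2 * music_clip,
    assuming negligible background sound and time-aligned signals."""
    n = min(len(music3), len(music_clip))
    return b1 * np.asarray(music3[:n], dtype=float) - b2 * np.asarray(music_clip[:n], dtype=float)
```

In practice the two signals would also need gain matching (the role of b1 and b2) before the subtraction is meaningful.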
7. The voice instructions are parsed by the user voice data.
The user voice data word is matched against the voice commands in the voice database to obtain the voice Command closest to word:
Command:{index:{“music,title,……”}}
step S608: and returning a retrieval result according to the identification result.
For example, if the index information in Command includes a music name and a singer name, then all information describing the video clip can be obtained through Result2, including article information, scene information, the name of the background music appearing in the video clip, and the singer of that background music. It is then determined whether the content corresponding to the index information in Command appears in this descriptive information obtained through Result2. If it does, the video file can be found through the URL in Result2, and the video file or its link address is returned as the search result.
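The final check of step S608 reduces to a keyword-containment test. The field names and URL below are hypothetical, and Result2's descriptive information is modeled simply as a set of strings.

```python
def return_search_result(command_index, clip_description, url):
    """Return the video URL (or link address) as the search result only if
    every keyword in the Command's index information appears in the
    descriptive information of the video clip obtained through Result2."""
    if set(command_index) <= set(clip_description):
        return url
    return None
```

If any keyword is missing from the clip description, no result is returned and the search fails for this clip.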
EXAMPLE seven
The seventh embodiment provides another embodiment of the search method for multimedia files, and in this embodiment, a search system implementing the search method is described.
Specifically, the system for implementing the method is composed of a terminal, a server and a database, which are respectively described as follows.
Firstly, the terminal comprises a recording module, an audio preprocessing module and an audio identification module. Wherein,
a recording module: for obtaining sound information. The input sound information is composed of two parts: (1) sound data of the video (including the voice print code); (2) voice data of the user. The user can input his voice information at any time during the recording process.
The audio preprocessing module: as shown in fig. 7, the audio format conversion unit performs data conversion on the audio data acquired by the recording module, and performs audio extraction by the high frequency extraction unit and the low frequency extraction unit, respectively, to prepare data for video recognition and voice recognition in the next step.
The audio recognition module: as shown in fig. 8, the audio recognition module receives the preprocessed audio data, including the high-frequency sound data and the low-frequency sound data, and outputs the audio fingerprint search result, that is, the search result of the target video, as well as the voice data of the user. Each unit is described as follows:
a voice print code recognition unit: used to interact with the voice print code recognition server, uploading the high-frequency audio information to it and obtaining the voice print code recognition result it sends back.
Fingerprint identification unit: used to receive the voice print code recognition result, upload it together with the low-frequency sound data to the audio fingerprint recognition server, and receive the audio fingerprint recognition result returned by that server.
An audio splicing unit: the method is used for splicing the segments of the low-frequency sound data into a whole to prepare data for the voice extraction of the user.
A voice recognition unit: first, it receives the audio fingerprint identification result, uploads it to the video management server, and receives the audio clip of the target video sent back by the video management server. Second, it acquires the acoustic audio data according to the audio fingerprint recognition result and extracts the user's voice data from the acoustic audio data and the low-frequency sound data sent by the audio splicing unit. Third, it uploads the voice data to the voice recognition server and receives the voice recognition result returned by that server.
The audio recognition module may further include: and the identification result verification unit is used for judging whether the identification result of the fingerprint identification unit is correct according to the voice data of the user, sending the audio fingerprint identification result as a search intention to the video management server when the identification result is correct, and receiving a return result of the video management server.
The terminal further includes: and the display module is used for displaying the result returned by the video management server received by the identification result verification unit to the user, and is also used for calling various multimedia file resources according to the type of the returned information and displaying the result to the user.
The above describes the terminal in the system, and with reference to fig. 9, the server and database configuration in the system will be described below.
1. A voice recognition server and a voice database.
And the voice recognition server receives the voice data sent by the terminal, recognizes the voice data in the voice database according to the voice data, and returns a voice instruction corresponding to the voice data.
The voice database stores preset voice instructions:
and (2) Command: { "keyword 1", "keyword 2", "keyword 3" … … }.
The instructions may be described in terms of keywords, which may include: video type (e.g., drama, movie, news), information of video content (e.g., actors, items, locations).
2. An audio watermark identification server and an audio watermark database.
The audio watermark identification server receives the high-frequency sound data sent by the terminal, parses out the voice print code carried in it, matches the parsed voice print code against the voice print code data in the database, obtains the matching result with the highest matching degree, and returns that result to the terminal.
In the audio watermark database, the stored data structure may adopt the following structure:
{ "ID", "url", "voice print code", "TrackID" }
The "ID" is the unique identifier of the voice print code in the voice print code database. "url" is the storage location of the video file corresponding to the voice print code. "Voice print code": each video file corresponds to a unique voice print code, a binary sequence of the form 01010101 … ; in use, the voice print code is carried in the high-frequency audio data. The "TrackID" is the identifier, in the audio fingerprint database, of the fingerprint information of the video corresponding to the voice print code.
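As an illustrative sketch (not part of the patent text), the matching step performed over records of this shape can be pictured as a nearest-code search; the field names, toy codes, and the Hamming-distance criterion below are assumptions:

```python
# Hypothetical sketch: matching an extracted voice print code against
# audio watermark database records. The "highest matching degree" is
# modeled here as the fewest differing bits (Hamming distance).

def hamming(a: str, b: str) -> int:
    """Number of differing bits between two equal-length 01-strings."""
    return sum(x != y for x, y in zip(a, b))

def match_voice_print(extracted: str, database: list) -> dict:
    """Return the record whose voice print code matches best."""
    return min(database, key=lambda rec: hamming(rec["voice print code"], extracted))

db = [
    {"ID": 1, "url": "/videos/a.mp4", "voice print code": "01010101", "TrackID": "T1"},
    {"ID": 2, "url": "/videos/b.mp4", "voice print code": "01101001", "TrackID": "T2"},
]

best = match_voice_print("01010111", db)
print(best["TrackID"])  # → T1 (fewest differing bits)
```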
3. The system comprises an audio fingerprint identification server and an audio fingerprint database.
The data structure of the fingerprint data stored in the fingerprint database can adopt the following structure:
{ TrackID: { }, fp: { }, "keyword 1": { }, "keyword 2": { }, … … }
The TrackID is the unique identifier of the fingerprint information in the audio fingerprint database. fp is the audio fingerprint data corresponding to the video file, with the structure { "Hash 1", "time 1", "Hash 2", "time 2", "Hash 3", "time 3", … }. "keyword N": the keywords correspond one-to-one with the keywords in the voice server and have the structure { "content 1", "time 1", "content 2", "time 2", "content 3", "time 3", … } (for example, if the keyword is "actor", each "content" field identifies the names of the actors appearing in the video at the corresponding playing time).
The audio fingerprinting server functions as follows:
It receives the low-frequency sound data sent by the terminal video identification module and the identification result sent by the audio watermark identification server; extracts the fingerprint information of the target audio from the sound data; extracts the TrackID from the identification result and determines the fingerprint retrieval range accordingly; matches the target fingerprint against the fingerprint information pointed to by the TrackID within that range; obtains the time information of the currently played video; and returns the matched audio clip to the video identification module.
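A minimal sketch of this narrowing behavior follows; the hash values, timestamps, and the record layout are hypothetical, and the point is that the TrackID fixes the retrieval range before any hash comparison is done:

```python
# Illustrative sketch of the fingerprint server's narrowing step:
# the TrackID from the watermark result limits matching to one record
# instead of scanning the whole fingerprint database.

fingerprint_db = {
    "T1": {"fp": [("h1", 0.0), ("h2", 1.2), ("h3", 2.4)], "url": "/videos/a.mp4"},
    "T2": {"fp": [("h4", 0.0), ("h5", 1.1)], "url": "/videos/b.mp4"},
}

def match_fingerprint(track_id, query_hashes):
    """Match query hashes only against the record pointed to by TrackID."""
    record = fingerprint_db[track_id]          # retrieval range fixed by TrackID
    stored = {h: t for h, t in record["fp"]}
    hits = [(h, stored[h]) for h in query_hashes if h in stored]
    return {"TrackID": track_id, "URL": record["url"], "hits": hits}

result = match_fingerprint("T1", ["h2", "h3", "hx"])
print(result["hits"])  # → [('h2', 1.2), ('h3', 2.4)]
```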
4. A video management server and a video database.
The video database is used for storing video files.
The function of the video management server is explained as follows:
(1) Receiving the identification result sent by the audio fingerprint identification unit of the terminal video identification module; extracting from the result the video index information (URL) in the database and the time segment information, and retrieving the corresponding video segment; extracting the audio data of the video clip; and returning the audio data to the voice recognition unit of the terminal video identification module.
(2) And receiving the search intention sent by the terminal identification result verification unit and returning the search result to the terminal.
With the terminal and the server described above and referring to fig. 10, the process of implementing the search method of the present embodiment is described as follows:
The method comprises the following steps:
Step one:
Start the terminal recording module to acquire the sound data of the target video and the voice information of the user. Once the recording module is started, the terminal records the sound in the current environment with the recording device, and stops recording when a video retrieval result sent by the server is received or when the total recording time exceeds the preset time T1.
Every time the duration of the recording reaches the time T2 (T2 << T1), the sound data of length T2, comprising the audio data of the video and the voice command of the user, is packaged and uploaded to the audio data preprocessing module.
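The packaging of the recording into T2-length uploads can be sketched as follows; the sample rate and the value of T2 are illustrative assumptions:

```python
# Sketch of packaging the recording into fixed-length chunks of duration T2
# for upload. A real client would stream these as they fill.

SAMPLE_RATE = 16000
T2 = 2.0                                   # seconds per uploaded package (assumed)
CHUNK = int(SAMPLE_RATE * T2)

def package_chunks(samples):
    """Yield successive complete T2-length packages of sound data."""
    for start in range(0, len(samples) - CHUNK + 1, CHUNK):
        yield samples[start:start + CHUNK]

recording = list(range(5 * CHUNK + 123))   # toy recording, not a multiple of CHUNK
packages = list(package_chunks(recording))
print(len(packages))  # → 5 full packages; the trailing partial chunk waits
```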
Step two:
this step prepares the data for further video recognition. The realization process is as follows:
1. receiving an audio file with the time length of T2;
2. and (5) converting an audio format. Calling the software of the third party (such as ffmpeg) to uniformly convert the audio files with different formats into PCM encoded audio data with the time length of T2.
3. The audio data of the high-frequency part is extracted. Using a high-pass filter (the pass band is consistent with the frequency range occupied by the voice print code, assumed here to be H1-H2 Hz), audio data Music1 with the time length T2 and the frequency range H1-H2 Hz is obtained.
4. The audio data of the low-frequency part is extracted. Using a low-pass filter, audio data Music2 with the time length T2 and the frequency range L1-L2 Hz is obtained.
5. Music1 and Music2 are uploaded to the video identification module.
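As a rough sketch of this band split (a real implementation would use properly designed filters for the H1-H2 Hz and L1-L2 Hz bands, e.g. via a DSP library), one PCM sequence can be separated with first-order filters; the smoothing factor alpha below is an illustrative stand-in for a real cutoff design:

```python
# Minimal sketch: split one PCM sample sequence into low- and high-frequency
# parts. Music1 (high) would carry the voice print code; Music2 (low) would
# feed fingerprinting. alpha is an assumed, not designed, coefficient.

def low_pass(samples, alpha=0.2):
    out, prev = [], 0.0
    for s in samples:
        prev = prev + alpha * (s - prev)   # one-pole low-pass
        out.append(prev)
    return out

def high_pass(samples, alpha=0.2):
    lp = low_pass(samples, alpha)
    return [s - l for s, l in zip(samples, lp)]  # residual = high frequencies

pcm = [0, 100, 0, -100, 0, 100, 0, -100]   # toy PCM samples
music1 = high_pass(pcm)   # high-frequency part (voice print code band)
music2 = low_pass(pcm)    # low-frequency part (fingerprint band)
# by construction the two parts sum back to the original signal
assert all(abs(h + l - s) < 1e-9 for h, l, s in zip(music1, music2, pcm))
```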
Step three:
The third step is divided into two stages: the first stage extracts the voice information, and the second stage recognizes the voice. The first stage is carried out at the terminal and requires the video management server to provide the original acoustic audio clip of the target video; the second stage is carried out at the voice recognition server and requires the terminal to provide the user voice data. In this step, the target video is retrieved from the obtained audio data, the retrieval result of the target video and the voice information of the user are obtained, and the voice instruction of the user is then recognized. The two stages are described as follows:
stage one
1. The video identification module receives audio segments (including Music1 and Music2) from the audio pre-processing module.
2. And splicing the acquired low-frequency part audio data to prepare data for voice extraction.
Every time a segment of low-frequency audio data Music2 is received, it is appended in time order to the low-frequency audio data Music3, whose length is N*T2 (N is the total number of audio segments received so far), until a video retrieval result is received.
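The splicing rule above can be sketched as a simple accumulator; the segment contents are toy values:

```python
# Sketch of the audio splicing unit: each received low-frequency segment
# Music2 is appended in time order to form Music3 of length N*T2.

class AudioSplicer:
    def __init__(self):
        self.music3 = []      # spliced low-frequency audio so far
        self.n = 0            # number of segments received (N)

    def add_segment(self, music2):
        self.music3.extend(music2)
        self.n += 1
        return self.n

splicer = AudioSplicer()
for segment in ([1, 2, 3], [4, 5, 6]):    # two toy Music2 segments
    splicer.add_segment(segment)
print(splicer.music3)  # → [1, 2, 3, 4, 5, 6], total length N*T2 with N = 2
```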
3. And locking the target video by using the voice print code.
The high-frequency audio data Music1 is uploaded to a voice print code recognition server, and voice print code information carried in Music1 is extracted and recognized by the voice print code recognition server. And returning the recognition Result1 to the video recognition module of the terminal. The recognition Result1 is the ID number of the target video in the "video fingerprint database" (TrackID that uniquely identifies the video fingerprint).
Result1:{TrackID:“……”};
4. And precisely positioning the playing segment of the playing video.
The low-frequency audio data Music2 is uploaded to the "audio fingerprinting server" together with Result1. The fingerprint identification server extracts the fingerprint information of the target video from Music2, matches the target video fingerprint against the fingerprint pointed to by Result1, and returns the matching Result2 to the video identification module of the terminal.
Result2 contains information including index information TrackID of the search Result in the fingerprint database, index information URL in the video database, and time ranges TimeStart and TimeStop.
Result2:{TrackID:“……”,URL:“http://……”,timeStart:“……”,timeStop:“……”};
5. The video identification module stops receiving the audio clip and simultaneously sends the information of stopping recording to the recording module.
6. Extracting the acoustic audio data of the target video.
Result2 is read and uploaded to the video management server; the video management server finds the video file according to the index information; the audio data music of the video is extracted; the audio data music_clip of the specific time period is extracted according to the time information in Result2; and music_clip is returned to the voice recognition unit of the terminal video identification module.
7. Extracting user speech information
The spliced audio data Music3 consists of the following three parts:
Music3 = a1*(acoustic audio) + a2*(user speech) + a3*(ambient noise), where a1, a2, a3 are constants.
Assuming that the recording conditions are good enough (that is, a3 = 0), the user speech can be obtained simply by:
user speech = b1*Music3 - b2*(acoustic audio), where b1 and b2 are constants,
and the acoustic audio is the music_clip obtained in step 6, so:
user speech (word) = b1*Music3 - b2*music_clip;
stage two
8. Parsing user instructions by user speech
And uploading the word to a voice recognition server, and matching the word with a voice instruction in a voice database. And returning the voice Command closest to the word to the recognition result verification unit.
Command:{index:{“music,title,belowing……”}}
9. Returning a retrieval result according to the user instruction
It is judged whether the content corresponding to the index information in Command appears in the information describing the video clip obtained from Result2; if so, Result2 is uploaded to the video management server. The matching video file is located through the URL in Result2, and the video file is returned.
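A hedged sketch of this check follows; the keyword names and clip description are illustrative:

```python
# Sketch of step 9: test whether every index keyword in Command is
# covered by the clip description obtained from Result2.

def matches_intent(command, clip_info):
    """True if every index keyword in Command is described in the clip info."""
    return all(key in clip_info for key in command["index"])

command = {"index": ["music", "singer"]}
clip_info = {"music": "SongA", "singer": "SingerB", "scene": "street"}

if matches_intent(command, clip_info):
    print("upload Result2 to the video management server")
```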
In this embodiment, the user voice data is recognized by the voice recognition unit, and the target video is further recognized by the recognition result. The search system can return the retrieval result with high accuracy to the user by combining the voice data of the user.
The above is a description of the identification method and the search method of the multimedia file provided by the embodiment of the present invention, and the identification apparatus of the multimedia file provided by the embodiment of the present invention will be described below.
Example eight
The eighth embodiment provides an embodiment of an apparatus for identifying a multimedia file, as shown in fig. 11, the apparatus includes an obtaining module 810, a first extracting module 820, a first matching module 830, a determining module 840, a second extracting module 850, a second matching module 860, and an identifying module 870.
The obtaining module 810 is configured to obtain mixed audio data corresponding to a target multimedia file, where the mixed audio data includes audio data and audio watermark data of the target multimedia file. The target multimedia file can be a video (or an audio); during playback, the identification device starts a recording device to record sound and thereby obtains the mixed audio data, and since the audio watermark data was added to the video (or audio) in advance, the recorded mixed audio data includes both the audio data of the target multimedia file and the added audio watermark data. The identification device can be an intelligent mobile communication terminal, such as a mobile phone or a PAD; a computer; or a separate identification unit embedded in a device that requires identification of multimedia files.
The first extracting module 820 is configured to extract audio watermark data in the mixed audio data, and extract the audio watermark data in the mixed audio data according to characteristics of the audio watermark data, for example, the audio watermark data is voice print code data, and since the voice print code data is specific information added in a high frequency band of the audio data, the voice print code data can be obtained by extracting data of a high frequency portion in the mixed audio data.
The first matching module 830 is configured to match the audio watermark data with a preset audio watermark sample to obtain a first matching result, where the first matching result may be a multimedia file group composed of multiple multimedia files, that is, by using the method, it is determined that the target multimedia file is in the multimedia file group.
The determining module 840 is configured to determine a feature sample portion corresponding to the first matching result in the preset feature sample, and through this module, when performing audio fingerprint matching, it is not necessary to match feature information of audio data of the target multimedia file with the whole preset feature sample, but only needs to match the feature information with the feature sample portion corresponding to the first matching result.
The second extraction module 850 is configured to extract feature information of audio data of a target multimedia file in the mixed audio data, and may specifically adopt any feature extraction method of audio data in the prior art, for example, may extract time-domain feature data of an audio, specifically, extract amplitude of an audio segment, and also may extract time-frequency feature data of the audio.
The second matching module 860 is used for matching the feature information with the feature sample part to obtain a second matching result. The identifying module 870 is configured to identify the target multimedia file according to the second matching result.
By adopting the identification device of the multimedia file provided by the embodiment, the video identification technology based on the audio watermark and the video identification technology based on the audio fingerprint are effectively combined, and compared with a method of simply using the video identification technology based on the audio watermark, the identification fineness and the application range of video identification are improved; compared with the video identification technology based on the audio fingerprint, the method shortens the identification time.
Preferably, the mixed audio data further comprises user voice data, and the device further comprises a third extraction module, a third matching module, and a verification module. The third extraction module is used for extracting the user voice data from the mixed audio data; the third matching module is used for matching the user voice data with a preset voice sample to obtain a third matching result; and the verification module is used for selecting, according to the third matching result, a target multimedia file from the target multimedia files identified according to the second matching result.
By adopting the preferred embodiment, the accuracy of identifying the target multimedia file can be improved by combining the voice data of the user.
Further preferably, the steps specifically executed by the first extraction module when extracting the audio watermark data are as follows: extracting audio data of a high frequency part in the mixed audio data; the second extraction module specifically executes the following steps when extracting the feature information: extracting feature information of audio data of a low-frequency part in the mixed audio data; the third extraction module specifically executes the following steps when extracting the user voice data: extracting audio data of a low-frequency part in the mixed audio data; and removing the audio data of the target multimedia file in the audio data of the low-frequency part to obtain the user voice data.
Preferably, the second extraction module includes a left and right channel data extraction module, a stereo data synthesis module, and a fingerprint information extraction module. The left and right channel data extraction module is used for extracting the left channel data and the right channel data of the low-frequency part in the mixed audio data; the stereo data synthesis module is used for combining the left channel data and the right channel data by the following formula to obtain the stereo data of the low-frequency part: s = a*l + b*r, where a + b = 1, s is the stereo data of the low-frequency part, l is the left channel data of the low-frequency part, r is the right channel data of the low-frequency part, and a and b are preset parameters; and the fingerprint information extraction module is used for extracting the time-frequency characteristic data of the stereo data to obtain the fingerprint information of the target multimedia file, where the fingerprint information forms the characteristic information of the audio data of the target multimedia file.
With this preferred embodiment, when the target multimedia file is identified, the obtained audio data of the target multimedia file is stereo data formed by combining the left channel data and the right channel data; correspondingly, the preset characteristic sample is also a characteristic of stereo data, so the source data type of the characteristic information of the target multimedia file is consistent with that of the characteristic sample, which improves the identification accuracy. Moreover, when the left and right channel data are merged into stereo data, the weight parameters a and b can be set to adjust the proportion of the left and right channel data in the stereo data according to actual needs.
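The synthesis formula s = a*l + b*r with a + b = 1 can be sketched directly; the sample values and equal weights below are illustrative:

```python
# Sketch of the stereo data synthesis module: weighted samplewise merge
# of left and right channel data, with the weights summing to 1.

def merge_channels(left, right, a=0.5, b=0.5):
    assert abs(a + b - 1.0) < 1e-9          # preset parameters must satisfy a + b = 1
    return [a * l + b * r for l, r in zip(left, right)]

left = [0.2, 0.4, 0.6]
right = [0.6, 0.4, 0.2]
print(merge_channels(left, right))  # → [0.4, 0.4, 0.4] with equal weights
```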
Further preferably, if the target multimedia file is a sub multimedia file of a second multimedia file, the first matching result is the identification information of the second multimedia file, the second matching result is the identification information of the target multimedia file, the feature sample is at least one multimedia record stored in a preset feature database, and each multimedia record includes the fingerprint information of a multimedia file and the identification information of the multimedia file corresponding to that fingerprint information. In this case, the determining module, when determining the feature sample portion, specifically executes: locating, in the feature database, the one or more multimedia records corresponding to the identification information of the second multimedia file. The second matching module, when obtaining the second matching result, specifically executes: matching the fingerprint information of the target multimedia file with the located one or more multimedia records to determine the identification information of the target multimedia file.
Further preferably, the stereo data of the low-frequency part is N pieces of stereo data, where the ith piece of the N pieces is s_i = a_i*l + b_i*r, with a_i + b_i = 1 and i = 1, 2, 3 … N, and the second matching module comprises a matching rate determination module and an identification information determination module. The matching rate determination module is used for matching the time-frequency characteristic data of each piece of stereo data against the located one or more multimedia records, obtaining a matching rate for each piece of stereo data; the identification information determination module is used for determining the identification information of the target multimedia file from the multimedia record corresponding to the maximum of these matching rates.
By adopting the preferred embodiment, when the feature information of the target multimedia file is constructed, the left and right channel data of the target multimedia file are converted into a plurality of groups of stereo data by setting a plurality of groups of weight parameters, and the feature information corresponding to each group of stereo data is calculated, so that the feature information of the target multimedia file comprises a plurality of groups of features. When the target multimedia file is identified, each group of characteristic information is respectively matched with the positioned multiple multimedia file records, and the target multimedia file is identified according to the multimedia file record corresponding to the maximum matching rate, so that the identification accuracy is further improved.
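This multi-weighting scheme can be sketched as follows; the weight pairs, records, and the matching-rate function are stand-ins for real fingerprint comparison scores:

```python
# Sketch of the multi-weighting match: several (a_i, b_i) mixes are tried,
# and the record achieving the highest matching rate identifies the file.

weights = [(0.3, 0.7), (0.5, 0.5), (0.7, 0.3)]   # a_i + b_i = 1 for each i

def identify(match_rate, left, right, records):
    """Return the record id with the best rate over all weightings."""
    best_rate, best_id = float("-inf"), None
    for a, b in weights:
        mixed = [a * l + b * r for l, r in zip(left, right)]
        for rec_id, fingerprint in records.items():
            rate = match_rate(mixed, fingerprint)
            if rate > best_rate:
                best_rate, best_id = rate, rec_id
    return best_id

def toy_rate(mixed, fingerprint):
    """Toy matching rate: negative distance between the sequences."""
    return -sum(abs(m - f) for m, f in zip(mixed, fingerprint))

records = {"vidA": [0.5, 0.5], "vidB": [0.9, 0.1]}
print(identify(toy_rate, [1.0, 0.0], [0.0, 1.0], records))  # → vidA
```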
The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

Priority application: CN201410849018.9A, filed 2014-12-29, "Identification method and device for multimedia file".
Publication: CN104598541A, published 2015-05-06.

Cited By (22)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN104936022A (en)*2015-06-032015-09-23无锡天脉聚源传媒科技有限公司Video identification method and apparatus
CN105430494A (en)*2015-12-022016-03-23百度在线网络技术(北京)有限公司Method and device for identifying audio from video in video playback equipment
CN105554590A (en)*2015-12-102016-05-04杭州当虹科技有限公司Live streaming media recognition system based on audio fingerprint
CN105898381A (en)*2015-12-152016-08-24乐视致新电子科技(天津)有限公司Content transmission method, content play method, content server and intelligent equipment
CN105916021A (en)*2015-12-152016-08-31乐视致新电子科技(天津)有限公司Audio and video identification method based on ultrasonic waves and audio and video identification system thereof
CN106055570A (en)*2016-05-192016-10-26中国农业大学Video retrieval device based on audio data and video retrieval method for same
CN107241617A (en)*2016-03-292017-10-10北京新媒传信科技有限公司The recognition methods of video file and device
WO2017181852A1 (en)*2016-04-192017-10-26腾讯科技(深圳)有限公司Song determining method and device, and storage medium
CN109803173A (en)*2017-11-162019-05-24腾讯科技(深圳)有限公司A kind of video transcoding method, device and storage equipment
CN110062291A (en)*2019-04-292019-07-26王子孟A kind of digital watermarking addition and extracting method, apparatus and system
CN110462616A (en)*2017-03-272019-11-15斯纳普公司generating a spliced data stream
CN110891186A (en)*2016-02-292020-03-17格雷斯诺特公司 media presentation equipment
CN112394224A (en)*2020-11-042021-02-23武汉大学Audio file generation time tracing dynamic matching method and system
US11336956B2 (en)2016-02-292022-05-17Roku, Inc.Media channel identification with multi-match detection and disambiguation based on single-match
CN114842877A (en)*2022-03-212022-08-02南京惠积信息科技有限公司Video underwater sound detection method and device based on personnel privacy protection
US11412296B2 (en)2016-02-292022-08-09Roku, Inc.Media channel identification with video multi-match detection and disambiguation based on audio fingerprint
US11558678B2 (en)2017-03-272023-01-17Snap Inc.Generating a stitched data stream
US11783862B2 (en)2014-12-192023-10-10Snap Inc.Routing messages by message parameter
CN117150070A (en)*2023-08-072023-12-01山东浪潮科学研究院有限公司 A multimedia segment search system and method based on large language model
US11902287B2 (en)2015-03-182024-02-13Snap Inc.Geo-fence authorization provisioning
US12236148B2 (en)2014-12-192025-02-25Snap Inc.Gallery of messages from individuals with a shared interest
US12387403B2 (en)2015-12-182025-08-12Snap Inc.Media overlay publication system

Citations (5)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN1728157A (en)*2004-07-302006-02-01英特尔公司Pattern matching architecture
CN102075797A (en)*2010-12-292011-05-25深圳市同洲电子股份有限公司Channel or program voice browsing method and digital television receiving terminal
CN102273200A (en)*2008-11-072011-12-07数字标记公司Content interaction methods and systems employing portable devices
CN203522960U (en)*2013-07-162014-04-02湖南大学 Multimedia player with voice control and humming search function
CN103747277A (en)*2014-01-102014-04-23北京酷云互动科技有限公司Multimedia program identification method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN1728157A (en)*2004-07-302006-02-01英特尔公司Pattern matching architecture
CN102273200A (en)*2008-11-072011-12-07数字标记公司Content interaction methods and systems employing portable devices
CN102075797A (en)*2010-12-292011-05-25深圳市同洲电子股份有限公司Channel or program voice browsing method and digital television receiving terminal
CN203522960U (en)*2013-07-162014-04-02湖南大学 Multimedia player with voice control and humming search function
CN103747277A (en)*2014-01-102014-04-23北京酷云互动科技有限公司Multimedia program identification method and device

Cited By (35)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US12236148B2 (en)2014-12-192025-02-25Snap Inc.Gallery of messages from individuals with a shared interest
US11783862B2 (en)2014-12-192023-10-10Snap Inc.Routing messages by message parameter
US12231437B2 (en)2015-03-182025-02-18Snap Inc.Geo-fence authorization provisioning
US11902287B2 (en)2015-03-182024-02-13Snap Inc.Geo-fence authorization provisioning
CN104936022A (en)*2015-06-032015-09-23无锡天脉聚源传媒科技有限公司Video identification method and apparatus
CN105430494A (en)*2015-12-022016-03-23百度在线网络技术(北京)有限公司Method and device for identifying audio from video in video playback equipment
CN105554590A (en)*2015-12-102016-05-04杭州当虹科技有限公司Live streaming media recognition system based on audio fingerprint
CN105554590B (en)*2015-12-102018-12-04杭州当虹科技有限公司A kind of live broadcast stream media identifying system based on audio-frequency fingerprint
CN105898381A (en)*2015-12-152016-08-24乐视致新电子科技(天津)有限公司Content transmission method, content play method, content server and intelligent equipment
CN105916021A (en)*2015-12-152016-08-31乐视致新电子科技(天津)有限公司Audio and video identification method based on ultrasonic waves and audio and video identification system thereof
US12387403B2 (en)2015-12-182025-08-12Snap Inc.Media overlay publication system
US11206447B2 (en)2016-02-292021-12-21Roku, Inc.Media channel identification with multi-match detection and disambiguation based on time of broadcast
US11617009B2 (en) | 2016-02-29 | 2023-03-28 | Roku, Inc. | Media channel identification and action with multi-match detection and disambiguation based on matching with differential reference-fingerprint feature
US11627372B2 (en) | 2016-02-29 | 2023-04-11 | Roku, Inc. | Media channel identification with multi-match detection and disambiguation based on single-match
CN110891186A (en) * | 2016-02-29 | 2020-03-17 | Gracenote Inc. | Media presentation device
US11463765B2 (en) | 2016-02-29 | 2022-10-04 | Roku, Inc. | Media channel identification and action with multi-match detection based on reference stream comparison
US11432037B2 (en) | 2016-02-29 | 2022-08-30 | Roku, Inc. | Method and system for detecting and responding to changing of media channel
US11290776B2 (en) | 2016-02-29 | 2022-03-29 | Roku, Inc. | Media channel identification and action with multi-match detection and disambiguation based on matching with differential reference-fingerprint feature
US11317142B2 (en) | 2016-02-29 | 2022-04-26 | Roku, Inc. | Media channel identification with multi-match detection and disambiguation based on location
US11336956B2 (en) | 2016-02-29 | 2022-05-17 | Roku, Inc. | Media channel identification with multi-match detection and disambiguation based on single-match
US11412296B2 (en) | 2016-02-29 | 2022-08-09 | Roku, Inc. | Media channel identification with video multi-match detection and disambiguation based on audio fingerprint
CN107241617A (en) * | 2016-03-29 | 2017-10-10 | Beijing Xinmei Chuanxin Technology Co., Ltd. | Video file identification method and device
US10719551B2 (en) | 2016-04-19 | 2020-07-21 | Tencent Technology (Shenzhen) Company Limited | Song determining method and device and storage medium
WO2017181852A1 (en) * | 2016-04-19 | 2017-10-26 | Tencent Technology (Shenzhen) Co., Ltd. | Song determining method and device, and storage medium
CN106055570A (en) * | 2016-05-19 | 2016-10-26 | China Agricultural University | Video retrieval device based on audio data and video retrieval method for same
CN110462616A (en) * | 2017-03-27 | 2019-11-15 | Snap Inc. | Generating a spliced data stream
US11558678B2 (en) | 2017-03-27 | 2023-01-17 | Snap Inc. | Generating a stitched data stream
CN110462616B (en) * | 2017-03-27 | 2023-02-28 | Snap Inc. | Method and server computer for generating a spliced data stream
CN109803173A (en) * | 2017-11-16 | 2019-05-24 | Tencent Technology (Shenzhen) Co., Ltd. | Video transcoding method and device, and storage device
CN109803173B (en) * | 2017-11-16 | 2022-08-19 | Tencent Technology (Shenzhen) Co., Ltd. | Video transcoding method and device, and storage device
CN110062291A (en) * | 2019-04-29 | 2019-07-26 | Wang Zimeng | Digital watermark embedding and extraction method, device and system
CN112394224A (en) * | 2020-11-04 | 2021-02-23 | Wuhan University | Dynamic matching method and system for tracing the generation time of an audio file
CN112394224B (en) * | 2020-11-04 | 2021-08-10 | Wuhan University | Dynamic matching method and system for tracing the generation time of an audio file
CN114842877A (en) * | 2022-03-21 | 2022-08-02 | Nanjing Huiji Information Technology Co., Ltd. | Video underwater sound detection method and device based on personal privacy protection
CN117150070A (en) * | 2023-08-07 | 2023-12-01 | Shandong Inspur Science Research Institute Co., Ltd. | Multimedia segment search system and method based on a large language model

Similar Documents

Publication | Title
CN104598541A (en) | Identification method and device for multimedia file
US11910046B2 (en) | Methods and apparatus to verify and/or correct media lineup information
KR101578279B1 (en) | Methods and systems for identifying content in a data stream
EP2795913B1 (en) | Audio fingerprint for content identification
CN110083714B (en) | Acquisition, recovery, and matching of unique information from file-based media for automatic file detection
US20160132600A1 | Methods and systems for performing content recognition for a surge of incoming recognition queries
US20140278845A1 | Methods and systems for identifying target media content and determining supplemental information about the target media content
US10757468B2 | Systems and methods for performing playout of multiple media recordings based on a matching segment among the recordings
KR102614021B1 | Audio content recognition method and device
KR100676863B1 | System and method for providing music search service
KR20140108180A | Systems and methods for accessing multi-media content
US8453179B2 | Linking real time media context to related applications and services
US9426411B2 | Method and apparatus for generating summarized information, and server for the same
KR20040026634A | Feature quantity extracting apparatus
CN106162321A | Audio signal identification method combining voiceprint features and audio watermarking
US20160203824A1 | Audio signal communication method and system thereof
CA2827514A1 | Methods and systems for identifying content in a data stream by a client device
WO2014043969A1 | Information transmission method and device
CN114329063B | Video clip detection method, device and equipment
KR101351818B1 | Method for playback of contents appropriate to the context of a mobile communication terminal
KR20080107143A | Music and video recommendation system and method based on audio signal processing
KR101608849B1 | Audio signal processing system and method for searching sound sources used in broadcast contents
CN104202628B | Identification system and method for a program played on a client terminal
CN1777953A | Menu generator device and menu generating method for complementing video/audio signals with menu information
KR102798824B1 | Media viewing information collection method for viewer rating calculation, recording medium and media viewing information collection device for performing the same

Legal Events

Code | Title | Description
C06 | Publication
PB01 | Publication
C10 | Entry into substantive examination
SE01 | Entry into force of request for substantive examination
AD01 | Patent right deemed abandoned
AD01 | Patent right deemed abandoned | Effective date of abandoning: 2019-05-07

