Detailed description of the invention
It should be appreciated that the specific embodiments described herein are intended only to explain the present invention and are not intended to limit it.
The primary solution of the embodiments of the present invention is: incorporate the audio information of a video into the video copy detection scheme. By using a method that combines audio and video, the robustness of the video copy detection system can be strengthened; by fusing audio and video features, the execution efficiency of the copy detection system is greatly accelerated; and by jointly analyzing audio and video, the positioning precision of copy fragments is improved.
Specifically, the embodiments of the present invention consider that existing video copy detection schemes that rely only on image features based on video key frames not only weaken the robustness of the video copy detection system, but also provide low positioning accuracy for copy fragments. Schemes that combine audio-based and video-based detection results at the result layer require extracting more features, and most of those features must complete the entire copy detection pipeline, which adds time overhead; moreover, the corresponding algorithmic complexity is linearly correlated with the data to be integrated, which further increases complexity.
The present embodiment incorporates the audio information of a video into the video copy detection scheme. Using a method that combines audio and video, through processing steps such as audio/video decoding and preprocessing, audio and video feature extraction, audio-video feature fusion, and copy judgment and localization, the robustness of the video copy detection system can be strengthened; by fusing audio and video features, the execution efficiency of the copy detection system is greatly accelerated; and by jointly analyzing audio and video, the positioning precision of copy fragments is improved.
Specifically, the hardware configuration of the audio-video copy detection device involved in the audio-video copy detection scheme of the embodiments of the present invention can be as shown in Fig. 1. The detection device can be carried on a PC, or on a mobile terminal such as a mobile phone, tablet computer, or portable handheld device, or on other electronic equipment with an audio-video copy detection function, such as a media playing apparatus.
As shown in Fig. 1, the detection device may include: a processor 1001 such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002, and a camera 1006. The communication bus 1002 is used to realize connection and communication between these components of the detection device. The user interface 1003 can include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 can also include a standard wired interface and a wireless interface. The network interface 1004 can optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 can be a high-speed RAM memory, or a stable non-volatile memory such as a disk memory. The memory 1005 can optionally also be a storage device independent of the aforementioned processor 1001.
Optionally, when the detection device is carried on a mobile terminal, it can also include an RF (Radio Frequency) circuit, sensors, an audio circuit, a WiFi module, and the like. The sensors include, for example, a light sensor, a motion sensor, and other sensors. Specifically, the light sensor can include an ambient light sensor and a proximity sensor; the ambient light sensor can adjust the brightness of the display screen according to the brightness of the ambient light, and the proximity sensor can turn off the display screen and/or the backlight when the mobile terminal is moved to the ear. As one kind of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration in all directions (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications that identify the attitude of the mobile terminal (such as landscape/portrait switching, related games, and magnetometer attitude calibration) and for vibration-recognition-related functions (such as a pedometer and tapping). Of course, the detection device can also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which are not described in detail here.
Those skilled in the art will understand that the device structure shown in Fig. 1 does not constitute a limitation on the detection device, which can include more or fewer components than illustrated, or combine certain components, or adopt a different component arrangement.
As shown in Fig. 1, the memory 1005, as a kind of computer-readable storage medium, can include an operating system, a network communication module, a user interface module, and an audio-video copy detection application program.
In the detection device shown in Fig. 1, the network interface 1004 is mainly used to connect to a background management platform and carry out data communication with it; the user interface 1003 is mainly used to connect to a client and carry out data communication with the client; and the processor 1001 can be used to call the audio-video copy detection application program stored in the memory 1005 and perform the following operations:
obtaining an audio-video image, decoding and preprocessing the audio-video image, and obtaining the audio portion and video frames of the audio-video image;
performing feature extraction on the audio portion and video frames of the audio-video image, and obtaining the audio features and the video-frame image features corresponding to the audio-video image;
fusing the audio features and the video-frame image features corresponding to the audio-video image, and obtaining the audio-video fusion features of the audio-video image;
matching the audio-video fusion features against a feature database of preset reference videos, and obtaining the frame-set matching result of the audio-video image;
performing copy judgment and localization on the audio-video image based on the frame-set matching result of the audio-video image and the reference videos.
In one embodiment, the processor 1001 calls the audio-video copy detection application program stored in the memory 1005, which can perform the following operations:
filtering the audio frames of the audio portion of the audio-video image, and converting them to frequency-domain energy by Fourier transform;
dividing the obtained frequency-domain energy, according to a logarithmic relationship, into several subbands within a predetermined frequency range;
calculating the difference of the absolute values of the energy between adjacent subbands, and obtaining the audio subband energy difference feature of the audio frame;
sampling audio frames at a predetermined interval, and obtaining the audio subband energy difference features of the audio portion of the audio-video image.
In one embodiment, the processor 1001 calls the audio-video copy detection application program stored in the memory 1005, which can perform the following operations:
converting the image of each video frame of the audio-video image into a grayscale image and compressing it;
dividing the compressed grayscale image into several sub-blocks;
calculating the DCT energy value of each sub-block;
comparing the DCT energy values between adjacent sub-blocks, and obtaining the image DCT feature of the video frame;
according to the above process, obtaining the image DCT features of the video frames of the audio-video image.
In one embodiment, the processor 1001 calls the audio-video copy detection application program stored in the memory 1005, which can perform the following operations:
setting the audio features to be M 32-bit features per second, and the video-frame image features to be n 32-bit features per second, where n is the frame rate of the video and n is less than or equal to 60;
performing feature splicing in such a way that one video frame corresponds to several audio frames, and obtaining M 64-bit audio-video fusion features per second, where each audio-video fusion feature corresponds to the audio feature of a single audio frame, and every M/n adjacent audio-video fusion features correspond to the image feature of the same video frame.
In one embodiment, the processor 1001 calls the audio-video copy detection application program stored in the memory 1005, which can perform the following operations:
obtaining a matching table from the feature database of the preset reference videos;
for each audio-video fusion feature, querying the matching table for features whose Hamming distance to the audio-video fusion feature is less than a predetermined threshold, as the similar features of the audio-video fusion feature;
obtaining the similar features of the audio-video fusion features, and obtaining the frame-set matching result of the audio-video image.
In one embodiment, the processor 1001 calls the audio-video copy detection application program stored in the memory 1005, which can perform the following operations:
performing temporal extension on the audio-video frames of the reference video corresponding to the similar features, and obtaining the similar fragment that the corresponding audio-video frames in the audio-video image form with respect to the reference video;
based on the similar fragment, calculating the similarity between the corresponding audio-video frames in the audio-video image and the reference video;
if the similarity is greater than a set threshold, judging that the audio-video image constitutes a copy, and recording the start position and end position of the similar fragment of the audio-video image.
In one embodiment, the processor 1001 calls the audio-video copy detection application program stored in the memory 1005, which can perform the following operation:
creating the matching table in the feature database of the reference videos.
Through the above scheme, the present embodiment obtains an audio-video image, decodes and preprocesses it to obtain its audio portion and video frames; performs feature extraction on the audio portion and video frames to obtain the audio features and the video-frame image features corresponding to the audio-video image; fuses the audio features and the video-frame image features to obtain the audio-video fusion features of the audio-video image; matches the audio-video fusion features against a feature database of preset reference videos to obtain the frame-set matching result of the audio-video image; and performs copy judgment and localization on the audio-video image based on the frame-set matching result and the reference videos. By this method of combining audio and video, the robustness of the video copy detection system is increased; by fusing audio and video features, the execution efficiency of the copy detection system is greatly accelerated; and by jointly analyzing audio and video, the positioning precision of copy fragments is improved.
Based on the above hardware configuration, embodiments of the audio-video copy detection method of the present invention are proposed.
As shown in Fig. 2, the first embodiment of the present invention proposes an audio-video copy detection method, including:
Step S101: obtaining an audio-video image, decoding and preprocessing the audio-video image, and obtaining the audio portion and video frames of the audio-video image;
Specifically, first, the audio-video image that needs copy detection is obtained. This audio-video image can be obtained locally, or can be obtained externally through a network.
The obtained audio-video image is decoded and preprocessed: the audio of the video is extracted and downsampled to mono 5512.5 Hz, and each video frame is extracted frame by frame, thereby obtaining the audio portion of the audio-video image and the individual video frames.
Step S102: performing feature extraction on the audio portion and video frames of the audio-video image, and obtaining the audio features and the video-frame image features corresponding to the audio-video image;
This part performs feature extraction on the audio corresponding to the video and on all video frames. Because audio features are easily represented as binary bits, a binary index or LSH is often used to accelerate queries. The audio feature extracted in the present invention is the audio subband energy difference feature, and the image feature extracted from the video frames is the DCT (Discrete Cosine Transform) feature.
The process of performing feature extraction on the audio portion of the audio-video image to obtain the audio features corresponding to the audio-video image includes:
filtering each audio frame of the audio portion of the audio-video image and converting it to frequency-domain energy by Fourier transform; dividing the obtained frequency-domain energy, according to a logarithmic relationship, into several subbands within a predetermined frequency range; calculating the difference of the absolute values of the energy between adjacent subbands to obtain the audio subband energy difference feature of each audio frame; and sampling audio frames at a predetermined interval to obtain the audio subband energy difference features of the audio portion of the audio-video image.
More specifically, the extraction flow of the audio subband energy difference feature in the present embodiment is shown in Fig. 3:
The main steps of the algorithm for extracting the audio subband energy difference feature are:
First, each 0.37-second segment of time-domain audio waveform information (an audio frame) is filtered through a Hanning window and then transformed into frequency-domain energy by Fourier transform;
Second, the obtained frequency-domain energy is divided, according to a logarithmic relationship (Bark scale), into 33 subbands located in the most sensitive range of human hearing (300 Hz~2000 Hz), and the difference of the absolute values of the energy between adjacent subbands of adjacent frames (spaced 11 milliseconds apart) is calculated, so that a 32-bit audio feature can be obtained for each audio frame.
A "1" therein indicates that the energy difference of two adjacent subbands of the current audio frame is greater than the energy difference of the corresponding adjacent subbands of the next audio frame; otherwise the bit is 0.
The detailed process is as follows:
In Fig. 3, the input is a segment of audio, and the output is the several (n) audio features corresponding to this segment of audio.
Framing: the audio fragment is cut into several (n) audio frames. In this example, M=2048 audio frames are collected per second (in other examples M can be set to other values), and each audio frame contains 0.37 seconds of audio content (adjacent audio frames overlap by 2047/2048).
Fourier Transform: converts the time-domain waveform information (original audio) into the energy information of the different frequency bands in the frequency domain, to facilitate analysis and processing.
ABS: takes the absolute value of the wave energy information (i.e., only the amplitude is considered, not the direction of vibration).
Band Division: the whole frequency domain between 300 Hz and 2000 Hz is divided into 33 mutually non-overlapping frequency bands (divided according to a logarithmic relationship, i.e., the lower the frequency, the smaller the range of the band it belongs to). In this way, the energy of the original audio on these different frequency bands can be obtained.
Energy Computation: calculates the energy value of each audio frame on these 33 frequency bands (each audio frame yields 33 energy values).
Bit Derivation: the 33 energy values above are compared in turn (the energy of the i-th subband is compared with the energy of the (i+1)-th subband) to obtain 32 energy-value differences. The sizes of these 32 energy-value differences are then compared between the current audio frame a and the next audio frame b. If the j-th energy-value difference of a is greater than the j-th energy-value difference of b, then the j-th bit of the feature of a is 1; otherwise, the j-th bit of the feature of a is 0. In this way, the magnitude relationship of the 32 energy-value differences between a and b is obtained, which is the 32-bit feature of audio frame a.
The present invention adopts this audio feature and samples audio frames at intervals of 1/2048 second, so that 2048 32-bit audio features are generated for each second of audio.
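As an illustration, the audio subband energy difference feature described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the invention's implementation: it assumes the 5512.5 Hz mono input and 0.37-second Hanning-windowed frames described above, uses a naive DFT in place of an optimized FFT, and uses simple log-spaced band edges between 300 Hz and 2000 Hz as a stand-in for an exact Bark-scale division; all function names are illustrative.

```python
import cmath
import math

SR = 5512.5              # mono sample rate after downsampling
FRAME = int(SR * 0.37)   # ~0.37 s of audio per frame

def band_edges(n_bands=33, lo=300.0, hi=2000.0):
    """34 log-spaced edges delimiting 33 bands between 300 Hz and 2000 Hz."""
    ratio = (hi / lo) ** (1.0 / n_bands)
    return [lo * ratio ** i for i in range(n_bands + 1)]

def band_energies(samples, sr=SR):
    """Hanning-window one audio frame, take a naive DFT, and sum |X(f)|
    over the bins falling into each of the 33 bands."""
    n = len(samples)
    windowed = [s * 0.5 * (1 - math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(samples)]
    edges = band_edges()
    energies = [0.0] * 33
    # Naive DFT over bins inside 300-2000 Hz (sketch only; a real
    # implementation would use an FFT).
    for k in range(1, n // 2):
        freq = k * sr / n
        if freq < edges[0] or freq >= edges[-1]:
            continue
        x = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
        for b in range(33):
            if edges[b] <= freq < edges[b + 1]:
                energies[b] += abs(x)
                break
    return energies

def frame_feature(cur_energies, nxt_energies):
    """32-bit feature: bit j is 1 iff the j-th adjacent-band energy
    difference of the current frame exceeds that of the next frame."""
    bits = 0
    for j in range(32):
        d_cur = cur_energies[j] - cur_energies[j + 1]
        d_nxt = nxt_energies[j] - nxt_energies[j + 1]
        bits = (bits << 1) | (1 if d_cur > d_nxt else 0)
    return bits
```

In a full pipeline one such 32-bit value would be produced for every 1/2048-second step, yielding the 2048 features per second described above.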
The process of performing feature extraction on the video frames of the audio-video image to obtain the video-frame image features corresponding to the audio-video image may include:
converting the image of each video frame of the audio-video image into a grayscale image and compressing it; dividing the compressed grayscale image into several sub-blocks; calculating the DCT energy value of each sub-block; comparing the DCT energy values between adjacent sub-blocks to obtain the image DCT feature of the video frame; and, according to the above process, obtaining the image DCT features of the video frames of the audio-video image.
More specifically, the flow of extracting the image DCT features of the video frames of the audio-video image in the present embodiment is shown in Fig. 4:
In view of the characteristic that the overall variation amplitude of internet video pictures is small, the embodiment of the present invention selects an efficient global image feature as the image feature of video frames: the DCT feature.
The idea of the DCT feature algorithm is to divide the image into several sub-blocks and compare the energy levels between adjacent sub-blocks, thereby obtaining the energy distribution of the entire image. The concrete algorithm steps are:
First, the color image is converted into a grayscale image and compressed (changing the aspect ratio) to 64 pixels wide and 32 pixels high.
Then, the grayscale image is divided into 32 sub-blocks (numbered 0~31 as shown in Fig. 4), each block containing an image of 8x8 pixels.
For each sub-block, the DCT energy value of the sub-block is calculated. The absolute value of the energy value is taken to represent the energy of the sub-block.
Finally, the relative sizes of the energy values of adjacent sub-blocks are computed to obtain a 32-bit feature. If the energy of the i-th sub-block is greater than the energy of the (i+1)-th sub-block, the i-th bit is 1; otherwise it is 0. In particular, the 31st sub-block is compared with the 0th sub-block.
Through the above process, each video frame yields a 32-bit image DCT feature.
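The 32-bit image DCT feature described above can be sketched as follows. This is a hedged illustration rather than the invention's implementation: it assumes the frame has already been converted to a 64x32 grayscale image, uses a naive 2D DCT-II, and takes the sum of absolute DCT coefficients as the sub-block "DCT energy value"; bit i compares sub-block i with sub-block i+1, with sub-block 31 wrapping around to sub-block 0.

```python
import math

W, H, B = 64, 32, 8   # compressed frame size and sub-block side

def dct2(block):
    """Naive 2D DCT-II of a square block (sketch; real code would use a
    fast DCT). Normalization factors are omitted, which does not affect
    the relative-size comparison below."""
    n = len(block)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = s
    return out

def block_energy(block):
    """'DCT energy' of a sub-block: sum of absolute DCT coefficients
    (an assumption; the patent does not pin down the exact definition)."""
    return sum(abs(c) for row in dct2(block) for c in row)

def frame_signature(gray):
    """32-bit image feature of one 64x32 grayscale frame (rows of pixel
    values). Bit i is 1 iff the energy of sub-block i exceeds that of
    sub-block i+1; sub-block 31 wraps around to sub-block 0."""
    assert len(gray) == H and all(len(r) == W for r in gray)
    blocks = []
    for by in range(H // B):           # 4 rows of blocks
        for bx in range(W // B):       # 8 columns of blocks -> 32 blocks
            blocks.append([row[bx * B:(bx + 1) * B]
                           for row in gray[by * B:(by + 1) * B]])
    energy = [block_energy(b) for b in blocks]
    bits = 0
    for i in range(32):
        nxt = energy[(i + 1) % 32]     # block 31 compares with block 0
        bits = (bits << 1) | (1 if energy[i] > nxt else 0)
    return bits
```

On a uniform frame all sub-block energies are equal and the signature is 0; a frame whose only bright region is sub-block 0 sets exactly the bit for the comparison of block 0 against block 1.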
Step S103: fusing the audio features and the video-frame image features corresponding to the audio-video image, and obtaining the audio-video fusion features of the audio-video image;
After the above process has obtained the audio features corresponding to the video and the image features of the video frames, the obtained image features and audio features are fused. The concrete fusion method is shown in Fig. 5 (where the vertical axis is the time axis).
As shown in Fig. 5, in the present embodiment, the audio features are set to be M=2048 (this value can be configured) 32-bit features per second, and the video-frame image features are n 32-bit features per second (n is the frame rate of the video, usually not more than 60).
Thus, the present embodiment performs feature splicing in such a way that one video frame corresponds to several audio frames, i.e., 2048 64-bit audio-video fusion features are generated per second, where each fusion feature corresponds to a single audio frame, and every 2048/n adjacent audio-video fusion features correspond to the image DCT feature of the same video frame.
By fusing the above audio features corresponding to the audio-video image with the image features of the video frames, the audio-video fusion features of the audio-video image are obtained.
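The splicing of Fig. 5 can be sketched as a simple bit-packing step: each 32-bit audio feature is placed in the low 32 bits of a 64-bit word, and the image feature of the video frame covering that audio frame is placed in the high 32 bits. The index mapping `j * n // m` below is an assumption about how audio frames are assigned to video frames; the description above only specifies that every M/n adjacent fusion features share one video frame's feature.

```python
def fuse(audio_feats, image_feats, m=2048):
    """Splice one second of features: each of the m 32-bit audio features
    is paired with the image feature of the video frame it falls in,
    giving m 64-bit fusion features (image bits high, audio bits low)."""
    n = len(image_feats)   # video frame rate (n <= 60)
    fused = []
    for j, a in enumerate(audio_feats):
        img = image_feats[min(j * n // m, n - 1)]  # frame covering audio frame j
        fused.append((img << 32) | (a & 0xFFFFFFFF))
    return fused
```

With m=8 and two video frames per second, the first four fusion features share the first frame's image feature and the last four share the second's, mirroring the M/n grouping described above.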
Step S104: matching the audio-video fusion features against a feature database of preset reference videos, and obtaining the frame-set matching result of the audio-video image;
The present embodiment presets a feature database of reference videos, and a matching table is created in the feature database of the reference videos to facilitate fast retrieval of the corresponding features of the video to be detected.
When matching the audio-video fusion features, first, the matching table is obtained from the feature database of the preset reference videos; for each audio-video fusion feature, features meeting a preset condition are queried from the matching table as the similar features of the audio-video fusion feature. For example, features whose Hamming distance to the audio-video fusion feature is less than a predetermined threshold (such as 3) are queried from the matching table as the similar features of the audio-video fusion feature. The similar features of all audio-video fusion features are obtained, yielding the frame-set matching result of the audio-video image.
More specifically, the present embodiment considers the following:
For a query video (a video that needs copy detection) and a reference video, if the similarity of their features is compared frame by frame, the required time complexity is proportional to the length of both videos, which is unfavorable for scaling to a large database. Therefore, based on the existing simhash technique, the present invention proposes an indexing and query matching strategy based on audio-video fusion features.
The basic objective of simhash indexing is: in a database of numerous 64-bit features, for a queried 64-bit feature, quickly find all features whose Hamming distance to this 64-bit feature is less than or equal to 3 (i.e., at most 3 of the 64 bits differ from this feature). The principle of this algorithm is illustrated in Fig. 6. For 64-bit data, if the Hamming distance is limited to 3, then when the 64 bits are divided into four 16-bit pieces, there must exist a 16-bit piece that is completely consistent with the query feature. Similarly, in the remaining 48 bits, there must exist a 12-bit piece that is completely consistent with the query feature. After two rounds of index lookup and matching, at most 3 differing bit positions need to be enumerated in the remaining 36 bits, so that the complexity of the original algorithm can be substantially reduced.
The 64-bit audio-video fusion features used in the present invention have the same query characteristic as simhash, i.e., all features differing from a given 64-bit feature by at most 3 bits need to be found (such two features are considered relevant). In addition, the following qualification is imposed: the first 32 bits of the two correlated features differ by at most 2 bits, and the last 32 bits of the two features differ by at most 2 bits. Based on this, the present embodiment follows the approach of simhash but expands the number of index tables to 24; the concrete extension method is shown in Fig. 7:
In the matching algorithm design shown in Fig. 7, consider the case where the last 32 bits differ by at most 1 bit; then the first 32 bits differ by at most 2 bits. For Fig. 7, at least 2 of the blocks A, B, C, D are completely consistent, and at least one of the blocks E, F is completely consistent, so a matching table that agrees on 32 full bits can be built. There are C(4,2)*C(2,1)*2 such query tables, because the first 32 bits may likewise differ by at most 2. Therefore, 24 sub-tables can be constructed in total as the created matching tables, used for fast lookup of audio-video fusion features.
Then, by querying the matching tables constructed above, the similar features of the audio-video fusion features are obtained, yielding the result of the feature retrieval.
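The banding idea behind this matching can be illustrated with the simpler 4-table variant described for plain simhash (Fig. 6): any two 64-bit features within Hamming distance 3 must agree exactly on at least one of their four 16-bit chunks, so candidates can be collected from four exact-match hash tables and then verified. The sketch below implements only this simplified variant under that pigeonhole argument, not the full 24-sub-table construction of Fig. 7; class and function names are illustrative.

```python
from collections import defaultdict

def chunks16(f):
    """Split a 64-bit feature into four 16-bit pieces."""
    return [(f >> (16 * i)) & 0xFFFF for i in range(4)]

def hamming(a, b):
    """Hamming distance between two integer bit patterns."""
    return bin(a ^ b).count("1")

class SimhashIndex:
    """Banding index: features within Hamming distance 3 of a query must
    share at least one exact 16-bit chunk with it (pigeonhole), so
    candidates come from 4 exact-match tables and are then verified."""

    def __init__(self, feats):
        self.feats = list(feats)
        self.tables = [defaultdict(list) for _ in range(4)]
        for idx, f in enumerate(self.feats):
            for t, c in enumerate(chunks16(f)):
                self.tables[t][c].append(idx)

    def query(self, f, max_dist=3):
        """Indices of stored features within max_dist bits of f."""
        cand = set()
        for t, c in enumerate(chunks16(f)):
            cand.update(self.tables[t][c])
        return sorted(i for i in cand
                      if hamming(self.feats[i], f) <= max_dist)
```

The 24-sub-table design of Fig. 7 refines the same idea by exploiting the extra per-half constraints (at most 2 differing bits in each 32-bit half) to match on larger, 32-bit-equivalent keys.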
Step S105: performing copy judgment and localization on the audio-video image based on the frame-set matching result of the audio-video image and the reference videos.
According to the result of the feature retrieval obtained in the above process, combined with a video copy fragment localization method, it is judged whether the query video is a copy video. If the query video is judged to be a copy video, the corresponding copy fragment location is given.
The present embodiment considers: for two videos, if the frame-to-frame similarity between the two videos is calculated, the similarity matrix shown at the far right of Fig. 8 can be obtained. The goal of finding similar fragments of the two videos is thus converted into finding line segments in the similarity matrix whose similarity is higher than a certain threshold; however, this processing method increases time overhead.
The principle of copy judgment and localization of the audio-video image in the present embodiment is: through the above matching algorithm, some bright points (representing the highest similarities) can be found in the similarity matrix, such as the bright points shown at the far left of Fig. 8; these points are extended in time, so that the similar fragments (i.e., possible copy fragments) shown in the middle of Fig. 8 can be obtained. Screening by a threshold is then performed, so that it can be judged whether two videos constitute a copy, and if a copy is constituted, the start and end moments of the similar fragment can be recorded.
Specifically, when performing copy judgment and localization on the audio-video image, first, the audio-video frames of the reference video corresponding to the similar features obtained in the above process (corresponding to the bright points in the leftmost diagram of Fig. 8) are extended in time to obtain the reference video fragment, and the corresponding audio-video frames in the audio-video image are extended in time to obtain the similar fragment that the audio-video image forms with respect to the reference video (as shown in the middle diagram of Fig. 8). The similarity between the similar fragment in the audio-video image and the reference video fragment is then calculated, i.e., the similarity between the audio-video frames corresponding to the similar fragment in the audio-video image and the corresponding audio-video frames of the reference video fragment is calculated, and the similarities obtained for the individual audio-video frames are averaged. If the similarity is greater than a set threshold, it is judged that the audio-video image constitutes a copy, and the start position and end position of the similar fragment of the audio-video image are recorded.
That is to say, when calculating the similarity between the audio-video frames corresponding to the similar fragment in the audio-video image and the reference video, each frame in the similar fragment (including its 64-bit feature) is compared with the corresponding frame in the reference video segment, the per-frame similarity is calculated, and the average is then taken. This average value is compared with the predetermined threshold; if the similarity is greater than the set threshold, it is judged that the audio-video image constitutes a copy, and the start position and end position of the similar fragment of the audio-video image are recorded.
An example is given below:
Suppose the similar fragment contains 100 frames (i.e., one audio-video sequence), and the 100 frames between the 10th and 20th second of the query video correspond to the 100 frames between the 30th and 40th second of the reference video. Then each of the 100 frames between the 10th and 20th second of the query video is compared with the corresponding frame among the 100 frames between the 30th and 40th second of the reference video, and the similarity of each frame is calculated separately. For example, if 50 of the 64 feature bits of the first frame are identical to those of the reference video frame, the similarity of this first frame is S1=50/64≈0.78125. On the same principle, the similarity S2 of the second frame, ..., and the similarity S100 of the 100th frame are obtained. The similarities are averaged to obtain the similarity between the query video and the reference video over the similar fragment; suppose it is 0.95. It is compared with the set threshold (set to 0.9), so it can be determined that the query video constitutes a copy, and the start position and end position of the similar fragment are recorded.
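The per-frame similarity and averaging in this example can be sketched directly: one frame's similarity is the fraction of its 64 feature bits that agree with the reference frame, and the fragment is judged a copy when the average exceeds the set threshold (0.9 in the example). The function names below are illustrative.

```python
def hamming64(a, b):
    """Number of differing bits between two 64-bit features."""
    return bin(a ^ b).count("1")

def segment_similarity(query_feats, ref_feats):
    """Average per-frame similarity between two aligned fusion-feature
    sequences; one frame's similarity is the fraction of its 64 bits
    that agree with the reference frame."""
    assert len(query_feats) == len(ref_feats) and query_feats
    sims = [1.0 - hamming64(q, r) / 64.0
            for q, r in zip(query_feats, ref_feats)]
    return sum(sims) / len(sims)

def is_copy(query_feats, ref_feats, threshold=0.9):
    """Judge a candidate similar fragment: copy iff the averaged
    similarity exceeds the set threshold."""
    return segment_similarity(query_feats, ref_feats) > threshold
```

A single frame with 50 matching bits out of 64 yields the 0.78125 of the example above; fragments whose frames agree on at least 58 of 64 bits on average clear the 0.9 threshold.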
In the above copy judgment and localization process, a query video may contain multiple similar fragments; the multiple similar fragments can be strung together and recorded.
It should be noted that, in the above process of the present embodiment, when judging according to the frame-set matching result whether the query video is a copy of a certain video in the reference video database, other algorithms can also be used, such as: the Hough transform, the Smith-Waterman algorithm, the BLAST algorithm, the temporal pyramid algorithm, and the like. Through these algorithms, the sequence of the query video most similar to a certain reference video is found, and whether a copy is constituted is determined by a threshold. For a video judged to be a copy, the beginning and end of the copy fragment are determined, so that this partial fragment is marked as a copy fragment.
Through the above scheme, the present embodiment uses a method combining audio and video, which not only increases the robustness of the video copy detection system, but also, by fusing audio and video features, greatly accelerates the execution efficiency of the copy detection system, and, by jointly analyzing audio and video, improves the positioning precision of copy fragments.
As shown in Fig. 9, a second embodiment of the present invention proposes an audio-video copy detection method. Based on the above embodiment, before the step of obtaining the audio-video image, the method further includes:

Step S100: creating the matching table in the feature library of the reference videos.

Specifically, the matching table is created so that the relevant features of a video to be detected can be retrieved quickly.

The matching table is created based on the reference videos; the concrete creation process is as follows:
First, reference video fragments are collected and subjected to audio-video decoding and preprocessing to obtain the audio part and the video frames of each reference video.

Then, feature extraction is performed on the audio part and the video frames of the reference video to obtain the audio features of the reference video and the image features of its video frames.

Afterwards, audio-video feature fusion is performed on the reference video to obtain the audio-video fusion features of the reference video.

Finally, the matching table is created based on the audio-video fusion features of the reference videos, for feature-index retrieval and matching of subsequent query videos.

The creation of the matching table based on the audio-video fusion features of the reference videos rests on the following principle:
Consider a query video (the video on which copy detection is to be performed) and a reference video. If their similarity were computed by comparing their features frame by frame, the required time complexity would be proportional to the lengths of both videos, which makes it hard to scale to large databases. Therefore, based on the existing simhash technique, the present invention proposes an indexing and query-matching strategy based on audio-video fusion features.

The basic goal of a simhash index is: in a library of numerous 64-bit features, for a queried 64-bit feature, to quickly find all features whose Hamming distance to it is less than or equal to 3 (i.e., features differing from it in at most 3 of the 64 bits). A schematic of this algorithm is shown in Fig. 6. For 64-bit data, if the Hamming distance is limited to 3 and the 64 bits are divided into four 16-bit blocks, at least one 16-bit block must be entirely identical to the query feature. Similarly, among the remaining 48 bits there must exist a 12-bit block entirely identical to the query feature. After looking up matches through this two-level index, at most 3 differing positions need to be enumerated among the remaining 36 bits, so the complexity of the naive algorithm can be greatly reduced.
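The pigeonhole idea behind this index can be sketched as follows. The class and method names are hypothetical, and only the first-level 16-bit block lookup is shown (the second-level 12-bit refinement is omitted); candidates found via any block are verified by an exact Hamming-distance check.

```python
from collections import defaultdict

def blocks16(x: int):
    """Split a 64-bit feature into its four 16-bit blocks."""
    return [(x >> (16 * i)) & 0xFFFF for i in range(4)]

class SimhashIndex:
    """If two 64-bit codes differ in at most 3 bits, at least one of
    their four 16-bit blocks is identical (pigeonhole principle), so
    indexing by each block finds every near-duplicate candidate."""

    def __init__(self):
        self.tables = [defaultdict(list) for _ in range(4)]

    def add(self, feature: int):
        for i, b in enumerate(blocks16(feature)):
            self.tables[i][b].append(feature)

    def query(self, feature: int, max_dist: int = 3):
        candidates = set()
        for i, b in enumerate(blocks16(feature)):
            candidates.update(self.tables[i][b])
        # Verify candidates with an exact popcount of the XOR.
        return [c for c in candidates
                if bin(c ^ feature).count("1") <= max_dist]
```

Only buckets sharing a full 16-bit block with the query are scanned, which is what reduces the complexity relative to a linear scan of the whole library.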
The 64-bit audio-video fusion features used in the present invention have the same lookup requirement as simhash, namely: all features differing from a given 64-bit feature in at most 3 bits must be found (two such features are considered related). In addition, the following constraint applies: the front 32 bits of two related features differ in at most 2 bits, and their rear 32 bits likewise differ in at most 2 bits. Based on this, the present embodiment follows the simhash approach but expands the number of index tables to 24; the concrete expansion is shown in Fig. 7:

In the matching algorithm design shown in Fig. 7, consider first the case where the rear 32 bits differ in at most 1 bit and the front 32 bits differ in at most 2 bits. Then, in Fig. 7, at least 2 of the blocks A, B, C, D are entirely identical, and at least one of the blocks E, F is entirely identical, so a matching table that agrees on 32 bits can be built. Such lookup tables number C(4,2) * C(2,1) * 2 in total, the factor 2 arising because, symmetrically, it may instead be the front 32 bits that differ in at most 1 bit. Therefore 24 sub-tables can be constructed in total and serve as the created matching table, used for fast retrieval of audio-video fusion features.
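As a quick check of the combinatorics claimed above — under the reading that the factor 2 accounts for the symmetric case in which the front and rear halves swap roles — the sub-table count can be computed directly:

```python
from math import comb

# Choose 2 of the 4 front blocks A-D that must match exactly, and
# 1 of the 2 rear blocks E/F; the factor 2 covers the symmetric
# front/rear case. This interpretation of the "*2" is an assumption.
num_sublists = comb(4, 2) * comb(2, 1) * 2  # 6 * 2 * 2 = 24
```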
Correspondingly, functional-module embodiments of the audio-video copy detection device of the embodiments of the present invention are proposed.

As shown in Fig. 10, a first embodiment of the present invention proposes an audio-video copy detection device, including: a decoding and preprocessing module 201, a feature extraction module 202, a fusion module 203, a matching module 204, and a copy determination module 205, wherein:

the decoding and preprocessing module 201 is configured to obtain an audio-video image, and to decode and preprocess the audio-video image to obtain the audio part and the video frames of the audio-video image;

the feature extraction module 202 is configured to perform feature extraction on the audio part and the video frames of the audio-video image to obtain the audio features corresponding to the audio-video image and the image features of the video frames;

the fusion module 203 is configured to fuse the audio features corresponding to the audio-video image with the image features of the video frames to obtain the audio-video fusion features of the audio-video image;

the matching module 204 is configured to match the audio-video fusion features against a preset feature library of reference videos to obtain the frame-set matching result of the audio-video image;

the copy determination module 205 is configured to perform copy judgment and locating on the audio-video image based on the frame-set matching result of the audio-video image and the reference videos.
Specifically, first, the audio-video image on which copy detection is to be performed is obtained; this audio-video image may be obtained locally or obtained from outside via a network.

The obtained audio-video image is decoded and preprocessed: the audio of the video is extracted and downsampled to mono 5512.5 Hz, and the video frames are extracted frame by frame, thereby obtaining the audio part of the audio-video image and its individual video frames.

Afterwards, feature extraction is performed on the audio part and the video frames of the audio-video image to obtain the audio features corresponding to the audio-video image and the image features of the video frames.

This stage performs feature extraction mainly on the audio corresponding to the video and on all of its video frames. Because audio features lend themselves to a binary-bit representation, binary indexes or LSH are often used to accelerate queries over them. The audio feature extracted in the present invention is the audio sub-band energy difference feature; the image feature extracted from the video frames is the DCT (Discrete Cosine Transform) feature.
The process of performing feature extraction on the audio part of the audio-video image to obtain the audio features corresponding to the audio-video image includes:

filtering each audio frame of the audio part of the audio-video image and transforming it into frequency-domain energy via a Fourier transform; dividing the obtained frequency-domain energy, according to a logarithmic relationship, into several sub-bands each lying in a predetermined frequency range; calculating the differences of the absolute values of the energies between adjacent sub-bands to obtain the audio sub-band energy difference feature of each audio frame; and sampling the audio frames at a predetermined interval to obtain the audio sub-band energy difference features of the audio part of the audio-video image.
More specifically, the extraction flow of the audio sub-band energy difference feature of the present embodiment is shown in Fig. 3.

The main steps of the algorithm involved in extracting the audio sub-band energy difference feature are:

First, the time-domain audio waveform information (each audio frame) of every 0.37 seconds is filtered through a Hanning window and then transformed into frequency-domain energy by a Fourier transform.

Secondly, the obtained frequency-domain energy is divided, according to a logarithmic relationship (Bark scale), into 33 sub-bands lying within the range of human hearing (300 Hz–2000 Hz), and the differences of the absolute values of the energies between adjacent sub-bands of adjacent frames (spaced 11 milliseconds apart) are calculated, so that a 32-bit audio feature is obtained for each audio frame.

A "1" therein indicates that the energy difference of two adjacent sub-bands of the current audio frame is greater than the energy difference of the corresponding adjacent sub-bands of the next audio frame; otherwise the bit is 0.

The detailed process is as follows:
In Fig. 3, the input is a segment of audio; the output is the several (n) audio features corresponding to that segment.

Framing: cutting the audio segment into several (n) audio frames. With 2048 audio frames collected per second as in the example, each audio frame contains 0.37 seconds of audio content (adjacent audio frames overlap by 2047/2048).

Fourier Transform: converting the time-domain waveform information (the original audio) into the energy information of the different frequency bands of the frequency domain, to facilitate analysis.

ABS: taking the absolute value of the wave energy information (i.e., considering only the amplitude, not the direction of vibration).

Band Division: dividing the whole frequency domain between 300 Hz and 2000 Hz into 33 mutually non-overlapping frequency bands (divided according to a logarithmic relationship, i.e., the lower the frequency, the narrower the band it belongs to). In this way, the energy of the original audio on each of these bands can be obtained.

Energy Computation: calculating the energy value of each audio frame on each of these 33 bands (each audio frame yields 33 energy values).

Bit Derivation: comparing the above 33 energy values in turn (the energy of the i-th sub-band against the energy of the (i+1)-th sub-band) to obtain 32 energy-value differences, then comparing the magnitudes of these 32 differences between the current audio frame a and the next audio frame b. If the j-th energy-value difference of a is greater than the j-th energy-value difference of b, the j-th bit of the feature of a is 1; otherwise it is 0. The relative magnitudes of the 32 energy-value differences between a and b thus form the 32-bit feature of audio frame a.
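The Bit Derivation step above can be sketched as follows. Energies are assumed to be given as 33 per-frame values, and the sign convention (a plain difference between adjacent sub-band energies rather than an absolute difference) is an assumption where the text is ambiguous:

```python
def subband_diffs(energies):
    """The 32 differences between the 33 adjacent sub-band energies
    of a single audio frame."""
    return [energies[j] - energies[j + 1] for j in range(32)]

def frame_bits(curr, nxt):
    """32-bit feature of the current frame: bit j is 1 when the j-th
    adjacent-sub-band energy difference of the current frame exceeds
    the corresponding difference of the next frame."""
    bits = 0
    for j, (dc, dn) in enumerate(zip(subband_diffs(curr),
                                     subband_diffs(nxt))):
        if dc > dn:
            bits |= 1 << j
    return bits
```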
The present invention adopts this audio feature and samples the audio frames at intervals of 1/2048 second, so that 2048 32-bit audio features are generated for every second of audio.
The process of performing feature extraction on the video frames of the audio-video image to obtain the image features of the video frames corresponding to the audio-video image may include:

for each video frame of the audio-video image, converting its image into a grayscale image and compressing it; dividing the compressed grayscale image into several sub-blocks; calculating the DCT energy value of each sub-block; comparing the DCT energy values between adjacent sub-blocks to obtain the image DCT feature of the video frame; and, following the above procedure, obtaining the image DCT features of the video frames of the audio-video image.
More specifically, the flow by which the present embodiment extracts the image DCT features of the video frames of the audio-video image is shown in Fig. 4.

In view of the fact that the overall picture of internet videos varies only by small amplitudes, the embodiment of the present invention selects an efficient global image feature as the image feature of the video frames: the DCT feature.

The idea of the DCT feature algorithm is to divide the image into several sub-blocks and, by comparing the energy levels of adjacent sub-blocks, obtain the energy distribution of the whole image. The concrete algorithm steps are:

First, the color image is converted into a grayscale image and compressed (changing the aspect ratio) to 64 pixels wide and 32 pixels high.

Then, the grayscale image is divided into 32 sub-blocks (0 to 31, as shown in Fig. 4), each block containing 8x8 pixels.

For each sub-block, the DCT energy value of that sub-block is calculated, and the absolute value of the energy value is taken to represent the energy of the sub-block.

Finally, the relative magnitudes of the energy values of adjacent sub-blocks are computed to obtain a 32-bit feature: if the energy of the i-th sub-block is greater than the energy of the (i+1)-th sub-block, the i-th bit is 1, and otherwise 0. In particular, the 31st sub-block is compared with the 0th sub-block.

Through the above process, each video frame yields a 32-bit image DCT feature.
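A sketch of the pipeline above, using only the standard library. The choice of "sum of absolute 2-D DCT-II coefficients" as the block energy and the row-major block ordering are assumptions, since the text does not pin them down:

```python
from math import cos, pi

def dct2_energy(block):
    """Sum of absolute 2-D DCT-II coefficients of an 8x8 block --
    one plausible reading of the 'DCT energy' in the text."""
    n = 8
    total = 0.0
    for u in range(n):
        for v in range(n):
            c = sum(block[x][y]
                    * cos((2 * x + 1) * u * pi / (2 * n))
                    * cos((2 * y + 1) * v * pi / (2 * n))
                    for x in range(n) for y in range(n))
            total += abs(c)
    return total

def image_dct_feature(gray):
    """32-bit feature from a 32x64 grayscale image (rows x cols):
    bit i is 1 when block i carries more DCT energy than block i+1,
    with block 31 wrapping around to block 0 as stated in the text."""
    blocks = []
    for by in range(4):          # 4 block-rows x 8 block-cols = 32 blocks
        for bx in range(8):
            blocks.append([[gray[by * 8 + x][bx * 8 + y] for y in range(8)]
                           for x in range(8)])
    e = [dct2_energy(b) for b in blocks]
    bits = 0
    for i in range(32):
        if e[i] > e[(i + 1) % 32]:
            bits |= 1 << i
    return bits
```

A perfectly flat image yields the all-zero feature, since every block carries the same energy.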
After the audio features corresponding to the video and the image features of the video frames have been obtained by the above processes, the obtained image features and audio features are fused. The concrete fusion method is shown in Fig. 5 (in which the vertical axis is the time axis).

As shown in Fig. 5, in the present embodiment the audio features are taken to be M = 2048 (this value is settable) 32-bit features per second, while the image features of the video frames amount to n 32-bit features per second (n being the frame rate of the video, usually no more than 60).

Thus, the present embodiment performs feature splicing by mapping one video frame to several audio frames: 2048 64-bit audio-video fusion features are generated per second, where each fusion feature corresponds to a single audio frame, and every 2048/n adjacent audio-video fusion features correspond to the image DCT feature of one and the same video frame.

By fusing the audio features corresponding to the audio-video image with the image features of the video frames as described above, the audio-video fusion features of the audio-video image are obtained.
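The splicing described above can be sketched as follows. The bit layout (audio feature in the high 32 bits) is an assumption — the text only says the two 32-bit features are spliced into 64 bits — and 2048/fps is assumed to divide evenly:

```python
def fuse(audio_feats, image_feats, audio_rate=2048, fps=32):
    """Concatenate each 32-bit audio feature with the 32-bit image
    feature of the video frame it falls under; audio_rate // fps
    consecutive audio frames share one image feature."""
    per_frame = audio_rate // fps
    fused = []
    for k, a in enumerate(audio_feats):
        img = image_feats[min(k // per_frame, len(image_feats) - 1)]
        fused.append((a << 32) | img)
    return fused
```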
Afterwards, the audio-video fusion features are matched against the preset feature library of reference videos to obtain the frame-set matching result of the audio-video image.

In the present embodiment, a feature library of reference videos is preset, and a matching table has been created in it, so that the relevant features of a video to be detected can be retrieved quickly.

When matching the audio-video fusion features, first, the matching table is obtained from the preset feature library of reference videos. For each audio-video fusion feature, features meeting a preset condition are queried from the matching table as the similar features of that fusion feature; for example, features whose Hamming distance to the audio-video fusion feature is less than a predetermined threshold (such as 3) are queried from the matching table as its similar features. Collecting the similar features of the audio-video fusion features yields the frame-set matching result of the audio-video image.
More specifically, the present embodiment considers the following:

For a query video (the video on which copy detection is to be performed) and a reference video, if the similarity were computed by comparing their features frame by frame, the required time complexity would be proportional to the lengths of both videos, which makes it hard to scale to large databases. Therefore, based on the existing simhash technique, the present invention proposes an indexing and query-matching strategy based on audio-video fusion features.

The basic goal of a simhash index is: in a library of numerous 64-bit features, for a queried 64-bit feature, to quickly find all features whose Hamming distance to it is less than or equal to 3 (i.e., features differing from it in at most 3 of the 64 bits). A schematic of this algorithm is shown in Fig. 6. For 64-bit data, if the Hamming distance is limited to 3 and the 64 bits are divided into four 16-bit blocks, at least one 16-bit block must be entirely identical to the query feature. Similarly, among the remaining 48 bits there must exist a 12-bit block entirely identical to the query feature. After looking up matches through this two-level index, at most 3 differing positions need to be enumerated among the remaining 36 bits, so the complexity of the naive algorithm can be greatly reduced.

The 64-bit audio-video fusion features used in the present invention have the same lookup requirement as simhash, namely: all features differing from a given 64-bit feature in at most 3 bits must be found (two such features are considered related). In addition, the following constraint applies: the front 32 bits of two related features differ in at most 2 bits, and their rear 32 bits likewise differ in at most 2 bits. Based on this, the present embodiment follows the simhash approach but expands the number of index tables to 24; the concrete expansion is shown in Fig. 7:

In the matching algorithm design shown in Fig. 7, consider first the case where the rear 32 bits differ in at most 1 bit and the front 32 bits differ in at most 2 bits. Then, in Fig. 7, at least 2 of the blocks A, B, C, D are entirely identical, and at least one of the blocks E, F is entirely identical, so a matching table that agrees on 32 bits can be built. Such lookup tables number C(4,2) * C(2,1) * 2 in total, the factor 2 arising because, symmetrically, it may instead be the front 32 bits that differ in at most 1 bit. Therefore 24 sub-tables can be constructed in total and serve as the created matching table, used for fast retrieval of audio-video fusion features.
Then, by querying the matching table constructed above, the similar features of the audio-video fusion features are obtained, giving the feature retrieval result.

According to the feature retrieval result obtained in the above process, combined with a video copy fragment locating method, it is judged whether the query video is a copy video; if the query video is judged to be a copy video, the location of the corresponding copy fragment is given.
The present embodiment considers: for two videos, if the similarity between every pair of frames of the two videos is calculated, the similarity matrix shown at the far right of Fig. 8 is obtained. Finding the similar fragments of the two videos is thereby converted into finding line segments in the similarity matrix whose similarity exceeds a certain threshold; however, the time overhead of this direct processing is large.

The principle by which the present embodiment performs copy judgment and locating on the audio-video image is therefore: through the above index algorithm, the brightest points in the similarity matrix (representing the highest similarities) can be found, such as the bright spots shown at the far left of Fig. 8; by extending these points in time, the similar fragments (i.e., possible copy fragments) shown in the middle of Fig. 8 are obtained and then screened by a threshold. It can thus be judged whether two videos constitute a copy and, if so, the start moment and end moment of the similar fragment can be recorded.
Specifically, when performing copy judgment and locating on the audio-video image, first, the audio-video frames of the reference video corresponding to the similar features obtained above (corresponding to the bright spots in the leftmost diagram of Fig. 8) are extended in time to obtain the reference video fragment, and the audio-video frames in the audio-video image corresponding to the similar features are likewise extended in time to obtain the similar fragment that the audio-video image forms with respect to the reference video (as shown in the middle diagram of Fig. 8). The similarity between the similar fragment in the audio-video image and the reference video fragment is then calculated: the similarity of each audio-video frame of the similar fragment to the corresponding audio-video frame of the reference video fragment is computed, and the per-frame similarities are averaged. If this similarity is greater than a set threshold, the audio-video image is judged to constitute a copy, and the start position and end position of the similar fragment of the audio-video image are recorded.

That is to say, when calculating the similarity between the corresponding audio-video frames of the similar fragment in the audio-video image and of the reference video, each frame in the similar fragment (comprising a 64-bit feature) is compared with the corresponding frame of the reference video fragment, the per-frame similarities are calculated and then averaged, and the average is compared with the predetermined threshold; if it is greater than the set threshold, the audio-video image is judged to constitute a copy, and the start position and end position of the similar fragment of the audio-video image are recorded.

An example is as follows:
Suppose a similar fragment consists of 100 frames (i.e., one audio-video sequence), where the 100 frames between the 10th and 20th second of the query video correspond to the 100 frames between the 30th and 40th second of the reference video. Each frame in the query fragment is compared with its corresponding frame in the reference fragment, and a similarity is calculated for each frame. For instance, if 50 of the 64 feature bits of the first frame are identical to those of the corresponding reference video frame, the similarity of that frame is S1 = 50/64 ≈ 0.78125. On the same principle, the similarity S2 of the second frame through the similarity S100 of the 100th frame are obtained, and all the similarities are averaged to give the similarity between the query video and the reference video over the similar fragment. Supposing this average is 0.95, it is compared with a set threshold (say 0.9); since it exceeds the threshold, the query video is determined to constitute a copy, and the start position and end position of the similar fragment are recorded.
During the above copy judgment and locating process, a query video may contain multiple similar fragments; in that case, the multiple similar fragments can be concatenated and recorded together.

It should be noted that in the above process of this embodiment, when judging according to the frame-set matching result whether the query video is a copy of some video in the reference video library, other algorithms may also be used, for example the Hough transform, the Smith-Waterman algorithm, the BLAST algorithm, or a temporal pyramid algorithm. These algorithms find the sequence of the query video most similar to a given reference video, and a threshold determines whether a copy is constituted. For a video judged to be a copy, the head and tail of the copied fragment are determined, and that partial fragment is marked as a copy fragment.

Through the above scheme, the present embodiment uses a method combining audio and video, which not only enhances the robustness of the video copy detection system but also, by fusing audio and video features, greatly accelerates the execution efficiency of the copy detection system; by analyzing audio and video jointly, the locating precision of copy fragments is improved.
As shown in Fig. 11, a second embodiment of the present invention proposes an audio-video copy detection device which, based on the above embodiment, further includes:

a creation module 200, configured to create the matching table in the feature library of the reference videos.
Specifically, the matching table is created so that the relevant features of a video to be detected can be retrieved quickly.

The matching table is created based on the reference videos; the concrete creation process is as follows:

First, reference video fragments are collected and subjected to audio-video decoding and preprocessing to obtain the audio part and the video frames of each reference video.

Then, feature extraction is performed on the audio part and the video frames of the reference video to obtain the audio features of the reference video and the image features of its video frames.

Afterwards, audio-video feature fusion is performed on the reference video to obtain the audio-video fusion features of the reference video.

Finally, the matching table is created based on the audio-video fusion features of the reference videos, for feature-index retrieval of subsequent query videos.

The creation of the matching table based on the audio-video fusion features of the reference videos rests on the following principle:
Consider a query video (the video on which copy detection is to be performed) and a reference video. If their similarity were computed by comparing their features frame by frame, the required time complexity would be proportional to the lengths of both videos, which makes it hard to scale to large databases. Therefore, based on the existing simhash technique, the present invention proposes an indexing and query strategy based on audio-video fusion features.

The basic goal of a simhash index is: in a library of numerous 64-bit features, for a queried 64-bit feature, to quickly find all features whose Hamming distance to it is less than or equal to 3 (i.e., features differing from it in at most 3 of the 64 bits). A schematic of this algorithm is shown in Fig. 6. For 64-bit data, if the Hamming distance is limited to 3 and the 64 bits are divided into four 16-bit blocks, at least one 16-bit block must be entirely identical to the query feature. Similarly, among the remaining 48 bits there must exist a 12-bit block entirely identical to the query feature. After looking up matches through this two-level index, at most 3 differing positions need to be enumerated among the remaining 36 bits, so the complexity of the naive algorithm can be greatly reduced.

The 64-bit audio-video fusion features used in the present invention have the same lookup requirement as simhash, namely: all features differing from a given 64-bit feature in at most 3 bits must be found (two such features are considered related). In addition, the following constraint applies: the front 32 bits of two related features differ in at most 2 bits, and their rear 32 bits likewise differ in at most 2 bits. Based on this, the present embodiment follows the simhash approach but expands the number of index tables to 24; the concrete expansion is shown in Fig. 7:

In the matching algorithm design shown in Fig. 7, consider first the case where the rear 32 bits differ in at most 1 bit and the front 32 bits differ in at most 2 bits. Then, in Fig. 7, at least 2 of the blocks A, B, C, D are entirely identical, and at least one of the blocks E, F is entirely identical, so a matching table that agrees on 32 bits can be built. Such lookup tables number C(4,2) * C(2,1) * 2 in total, the factor 2 arising because, symmetrically, it may instead be the front 32 bits that differ in at most 1 bit. Therefore 24 sub-tables can be constructed in total and serve as the created matching table, used for fast retrieval of audio-video fusion features.
In the audio-video copy detection method and device of the embodiments of the present invention, an audio-video image is obtained, decoded, and preprocessed to obtain the audio part and the video frames of the audio-video image; feature extraction is performed on the audio part and the video frames of the audio-video image to obtain the audio features corresponding to the audio-video image and the image features of the video frames; the audio features corresponding to the audio-video image and the image features of the video frames are fused to obtain the audio-video fusion features of the audio-video image; the audio-video fusion features are matched against a preset feature library of reference videos to obtain the frame-set matching result of the audio-video image; and copy judgment and locating are performed on the audio-video image based on the frame-set matching result of the audio-video image and the reference videos. By thus using a method combining audio and video, the robustness of the video copy detection system is enhanced; by fusing audio and video features, the execution efficiency of the copy detection system is greatly accelerated; and by analyzing audio and video jointly, the locating precision of copy fragments is improved.
It should also be noted that, in this document, the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. In the absence of further restrictions, an element defined by the statement "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes that element.

The serial numbers of the above embodiments of the present invention are for description only and do not represent the superiority or inferiority of the embodiments.

Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on such an understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; this computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not thereby limit the scope of the claims of the present invention. Any equivalent structural or flow transformation made using the contents of the description and drawings of the present invention, or any direct or indirect use in other related technical fields, shall likewise be included in the scope of patent protection of the present invention.