Summary of the invention
The present invention has been made in view of the above problem. The present invention provides a method and apparatus for processing video and related audio, as well as a retrieval method and a retrieval apparatus.
According to an aspect of the present invention, a method for processing video and related audio is provided. The method comprises:
obtaining a video containing one or more faces of one or more objects;
performing face detection on each video frame in the video, to identify the one or more faces;
obtaining an audio, acquired in the same time period as the video, containing the voice of at least some objects among the one or more objects;
for each of at least some faces among the one or more faces:
determining an audio portion in the audio that corresponds to the face; and
associating the face with the corresponding audio portion,
wherein the at least some faces belong respectively to the at least some objects.
Exemplarily, before the determining, for each of the at least some faces among the one or more faces, of the audio portion in the audio corresponding to the face, the method further comprises:
for each of the at least some faces:
segmenting the video according to the mouth movements of the face, to obtain initial video segments corresponding to the face;
segmenting the audio according to speech features in the audio, to obtain initial audio segments corresponding to the face; and
obtaining, according to the initial video segments and initial audio segments corresponding to the face, effective video segments in the video corresponding to the face and effective audio segments in the audio corresponding to the face;
and the determining, for each of the at least some faces among the one or more faces, of the audio portion in the audio corresponding to the face comprises:
for each of the at least some faces, determining the effective audio segments corresponding to the face to be the audio portion corresponding to the face.
Exemplarily, the associating, for each of the at least some faces among the one or more faces, of the face with the corresponding audio portion comprises:
for each of the at least some faces:
for each effective video segment corresponding to the face, selecting, from all video frames of the effective video segment, the video frame with the best face quality; and
associating the selected video frame with the effective audio segment corresponding to the effective video segment, to form a video-audio combination.
Exemplarily, the method further comprises:
for the face corresponding to a particular video-audio combination, performing facial feature extraction on the video frame in the particular video-audio combination, to obtain a particular facial feature, wherein the particular video-audio combination is one of all the video-audio combinations corresponding to the at least some faces;
performing voice feature extraction on the effective audio segment in the particular video-audio combination, to obtain a particular voice feature;
for each of the remaining video-audio combinations among all the video-audio combinations:
computing the face similarity between the particular facial feature and the facial feature corresponding to that video-audio combination;
computing the voice similarity between the particular voice feature and the voice feature corresponding to that video-audio combination;
computing the average of the face similarity and the voice similarity between the particular video-audio combination and that video-audio combination, to obtain the average similarity between the particular video-audio combination and that video-audio combination; and
if the average similarity between the particular video-audio combination and that video-audio combination is greater than a similarity threshold, attributing the particular video-audio combination and that video-audio combination to the same object.
Exemplarily, the obtaining, for each of the at least some faces, according to the initial video segments and initial audio segments corresponding to the face, of the effective video segments in the video corresponding to the face and the effective audio segments in the audio corresponding to the face comprises:
for each of the at least some faces, determining the initial video segments corresponding to the face to be the effective video segments corresponding to the face, and determining the initial audio segments corresponding to the face to be the effective audio segments corresponding to the face.
Exemplarily, the obtaining, for each of the at least some faces, according to the initial video segments and initial audio segments corresponding to the face, of the effective video segments in the video corresponding to the face and the effective audio segments in the audio corresponding to the face comprises:
for each of the at least some faces:
determining unified segmentation times according to the segmentation times of the initial video segments and initial audio segments corresponding to the face; and
performing unified segmentation of the video and the audio according to the unified segmentation times, to obtain the effective video segments and effective audio segments corresponding to the face.
Exemplarily, the audio is acquired by a unified microphone,
and the segmenting, for each of the at least some faces, of the audio according to the speech features in the audio to obtain the initial audio segments corresponding to the face comprises:
segmenting the audio according to the speech features in the audio, to obtain mixed audio pieces; and
for each of the at least some faces, selecting, from the mixed audio pieces, the mixed audio pieces whose acquisition times coincide with those of the initial video segments corresponding to the face, as the initial audio segments corresponding to the face.
Exemplarily, the audio comprises one or more audio channels acquired respectively by one or more directional microphones,
and before the obtaining of the audio, acquired in the same time period as the video, containing the voice of the at least some objects among the one or more objects, the method further comprises:
controlling the one or more directional microphones to point respectively at the at least some objects, so as to acquire the one or more audio channels;
and the segmenting, for each of the at least some faces, of the audio according to the speech features in the audio to obtain the initial audio segments corresponding to the face comprises:
for each of the at least some faces, segmenting the audio channel acquired by the directional microphone pointing at the object to which the face belongs, according to the speech features in that audio channel, to obtain the initial audio segments corresponding to the face.
Exemplarily, the number of the directional microphones is equal to or greater than the number of the one or more faces.
Exemplarily, before the controlling of the one or more directional microphones to point respectively at the at least some objects so as to acquire the one or more audio channels, the method further comprises:
determining the priority of each face according to the facial features and/or movements of the one or more faces; and
determining, according to the priority of each face, the objects at which the one or more directional microphones are to point, as the at least some objects.
Exemplarily, the segmenting, for each of the at least some faces, of the video according to the mouth movements of the face is implemented according to the following rule:
for each of the at least some faces, if the mouth of the face changes from a closed state to an open state at a first moment and has been continuously closed during a first predetermined period before the first moment, the first moment is taken as a video segmentation start time; if the mouth of the face changes from an open state to a closed state at a second moment and remains continuously closed during a second predetermined period after the second moment, the second moment is taken as a video segmentation end time,
wherein the portions of the video between adjacent video segmentation start times and video segmentation end times are the initial video segments.
Exemplarily, the segmenting, for each of the at least some faces, of the audio according to the speech features in the audio is implemented according to the following rule:
if the voice in the audio changes from an unvoiced state to a voiced state at a third moment and has been continuously unvoiced during a third predetermined period before the third moment, the third moment is taken as an audio segmentation start time; if the voice in the audio changes from a voiced state to an unvoiced state at a fourth moment and remains continuously unvoiced during a fourth predetermined period after the fourth moment, the fourth moment is taken as an audio segmentation end time,
wherein the portions of the audio between adjacent audio segmentation start times and audio segmentation end times are the initial audio segments.
Exemplarily, after the determining, for each of the at least some faces among the one or more faces, of the audio portion in the audio corresponding to the face, the method further comprises:
for each of the at least some faces:
performing speech recognition on the audio portion corresponding to the face, to obtain a text file representing the audio portion corresponding to the face; and
associating the text file with the face.
Exemplarily, the method further comprises: outputting desired information,
wherein the desired information includes one or more of the following items: the video, the audio, video frames containing a particular face among the one or more faces, the acquisition times of the video frames containing the particular face, the audio portion corresponding to the particular face, and the acquisition time of the audio portion corresponding to the particular face.
According to a further aspect of the present invention, a retrieval method is provided, comprising:
receiving a retrieval instruction directed at a target face;
searching a database, according to the retrieval instruction, for information related to the target face; and
outputting the information related to the target face;
wherein the database is used to store the video and audio processed according to the above method for processing video and related audio, and/or the audio portions corresponding to each of the at least some faces,
and wherein the information related to the target face includes one or more of the following items: video frames containing the target face, the acquisition times of the video frames containing the target face, the audio portion corresponding to the target face, and the acquisition time of the audio portion corresponding to the target face.
According to a further aspect of the present invention, an apparatus for processing video and related audio is provided. The apparatus comprises:
a first obtaining module for obtaining a video containing one or more faces of one or more objects;
a face detection module for performing face detection on each video frame in the video, to identify the one or more faces;
a second obtaining module for obtaining an audio, acquired in the same time period as the video, containing the voice of at least some objects among the one or more objects;
an audio portion determining module for determining, for each of at least some faces among the one or more faces, the audio portion in the audio corresponding to the face, wherein the at least some faces belong respectively to the at least some objects; and
an audio associating module for associating, for each of the at least some faces, the face with the corresponding audio portion.
Exemplarily, the apparatus further comprises:
a video segmentation module for segmenting, for each of the at least some faces, the video according to the mouth movements of the face, to obtain initial video segments corresponding to the face;
an audio segmentation module for segmenting, for each of the at least some faces, the audio according to the speech features in the audio, to obtain initial audio segments corresponding to the face; and
an effective video and audio obtaining module for obtaining, according to the initial video segments and initial audio segments corresponding to the face, effective video segments in the video corresponding to the face and effective audio segments in the audio corresponding to the face;
and the audio portion determining module includes a determining submodule for determining, for each of the at least some faces, the effective audio segments corresponding to the face to be the audio portion corresponding to the face.
Exemplarily, the audio associating module includes:
a video frame selecting submodule for selecting, for each of the at least some faces and for each effective video segment corresponding to the face, the video frame with the best face quality from all video frames of the effective video segment; and
an associating submodule for associating the selected video frame with the effective audio segment corresponding to the effective video segment, to form a video-audio combination.
Exemplarily, the apparatus further comprises:
a facial feature extraction module for performing, for the face corresponding to a particular video-audio combination, facial feature extraction on the video frame in the particular video-audio combination, to obtain a particular facial feature, wherein the particular video-audio combination is one of all the video-audio combinations corresponding to the at least some faces;
a voice feature extraction module for performing voice feature extraction on the effective audio segment in the particular video-audio combination, to obtain a particular voice feature;
a face similarity computing module for computing, for each of the remaining video-audio combinations among all the video-audio combinations, the face similarity between the particular facial feature and the facial feature corresponding to that video-audio combination;
a voice similarity computing module for computing, for each of the remaining video-audio combinations, the voice similarity between the particular voice feature and the voice feature corresponding to that video-audio combination;
an average similarity computing module for computing, for each of the remaining video-audio combinations, the average of the face similarity and the voice similarity between the particular video-audio combination and that video-audio combination, to obtain the average similarity between the particular video-audio combination and that video-audio combination; and
a classification module for attributing, for each of the remaining video-audio combinations, the particular video-audio combination and that video-audio combination to the same object if the average similarity between them is greater than a similarity threshold.
Exemplarily, the effective video and audio obtaining module includes:
an effective video segment determining submodule for determining, for each of the at least some faces, the initial video segments corresponding to the face to be the effective video segments corresponding to the face; and
an effective audio segment determining submodule for determining, for each of the at least some faces, the initial audio segments corresponding to the face to be the effective audio segments corresponding to the face.
Exemplarily, the effective video and audio obtaining module includes:
a unified segmentation time determining submodule for determining, for each of the at least some faces, unified segmentation times according to the segmentation times of the initial video segments and initial audio segments corresponding to the face; and
a unified segmentation submodule for performing unified segmentation of the video and the audio according to the unified segmentation times, to obtain the effective video segments and effective audio segments corresponding to the face.
Exemplarily, the audio is acquired by a unified microphone,
and the audio segmentation module includes:
a first segmentation submodule for segmenting the audio according to the speech features in the audio, to obtain mixed audio pieces; and
an audio segment selecting submodule for selecting, for each of the at least some faces, from the mixed audio pieces, the mixed audio pieces whose acquisition times coincide with those of the initial video segments corresponding to the face, as the initial audio segments corresponding to the face.
Exemplarily, the audio comprises one or more audio channels acquired respectively by one or more directional microphones,
and the apparatus further comprises:
a control module for controlling the one or more directional microphones to point respectively at the at least some objects, so as to acquire the one or more audio channels;
and the audio segmentation module includes:
a second segmentation submodule for segmenting, for each of the at least some faces, the audio channel acquired by the directional microphone pointing at the object to which the face belongs, according to the speech features in that audio channel, to obtain the initial audio segments corresponding to the face.
Exemplarily, the number of the directional microphones is equal to or greater than the number of the one or more faces.
Exemplarily, the apparatus further comprises:
a priority determining module for determining the priority of each face according to the facial features and/or movements of the one or more faces; and
an object determining module for determining, according to the priority of each face, the objects at which the one or more directional microphones are to point, as the at least some objects.
Exemplarily, the video segmentation module segments the video according to the following rule:
for each of the at least some faces, if the mouth of the face changes from a closed state to an open state at a first moment and has been continuously closed during a first predetermined period before the first moment, the first moment is taken as a video segmentation start time; if the mouth of the face changes from an open state to a closed state at a second moment and remains continuously closed during a second predetermined period after the second moment, the second moment is taken as a video segmentation end time,
wherein the portions of the video between adjacent video segmentation start times and video segmentation end times are the initial video segments.
Exemplarily, the audio segmentation module segments the audio according to the following rule:
if the voice in the audio changes from an unvoiced state to a voiced state at a third moment and has been continuously unvoiced during a third predetermined period before the third moment, the third moment is taken as an audio segmentation start time; if the voice in the audio changes from a voiced state to an unvoiced state at a fourth moment and remains continuously unvoiced during a fourth predetermined period after the fourth moment, the fourth moment is taken as an audio segmentation end time,
wherein the portions of the audio between adjacent audio segmentation start times and audio segmentation end times are the initial audio segments.
Exemplarily, the apparatus further comprises:
a speech recognition module for performing, for each of the at least some faces, speech recognition on the audio portion corresponding to the face, to obtain a text file representing the audio portion corresponding to the face; and
a text associating module for associating the text file with the face.
Exemplarily, the apparatus further comprises an output module for outputting desired information,
wherein the desired information includes one or more of the following items: the video, the audio, video frames containing a particular face among the one or more faces, the acquisition times of the video frames containing the particular face, the audio portion corresponding to the particular face, and the acquisition time of the audio portion corresponding to the particular face.
According to a further aspect of the present invention, a retrieval apparatus is provided, comprising:
a receiving module for receiving a retrieval instruction directed at a target face;
a searching module for searching a database, according to the retrieval instruction, for information related to the target face; and
an output module for outputting the information related to the target face;
wherein the database is used to store the video and audio processed using the above apparatus for processing video and related audio, and/or the audio portions corresponding to each of the at least some faces,
and wherein the information related to the target face includes one or more of the following items: video frames containing the target face, the acquisition times of the video frames containing the target face, the audio portion corresponding to the target face, and the acquisition time of the audio portion corresponding to the target face.
With the method and apparatus for processing video and related audio, and the retrieval method and apparatus, according to embodiments of the present invention, by associating an object's face with its voice, the time at which the object speaks and the content of its speech can be determined, which facilitates later viewing and retrieval of the object's speech content by a user.
Specific embodiments
To make the objects, technical solutions and advantages of the present invention more apparent, example embodiments of the present invention are described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention, and it should be understood that the present invention is not limited by the example embodiments described herein. All other embodiments obtained by those skilled in the art, based on the embodiments described herein and without creative effort, shall fall within the protection scope of the present invention.
First, an exemplary electronic device 100 for implementing the method and apparatus for processing video and related audio according to embodiments of the present invention is described with reference to Fig. 1.
As shown in Fig. 1, the electronic device 100 includes one or more processors 102, one or more storage devices 104, an input device 106, an output device 108, a video capture device 110 and an audio capture device 114, which are interconnected by a bus system 112 and/or another form of connection mechanism (not shown). It should be noted that the components and structure of the electronic device 100 shown in Fig. 1 are illustrative rather than restrictive; the electronic device may have other components and structures as needed.
The processor 102 may be a central processing unit (CPU) or another form of processing unit having data processing capability and/or instruction execution capability, and may control other components in the electronic device 100 to perform desired functions.
The storage device 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 102 may run the program instructions to realize the client functions (implemented by the processor) in the embodiments of the present invention described below and/or other desired functions. Various application programs and various data, such as the various data used and/or generated by the application programs, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions, and may include one or more of a keyboard, a mouse, a microphone, a touch screen and the like.
The output device 108 may output various information (such as images and/or sounds) to the outside (for example, to a user), and may include one or more of a display, a speaker and the like.
The video capture device 110 may capture desired video and store the captured video in the storage device 104 for use by other components. The video capture device 110 may be implemented using any suitable equipment, such as a video camera or the camera of a mobile terminal. The video capture device 110 is optional, and the electronic device 100 need not include it: the electronic device 100 may capture video using the video capture device 110, or may receive video transmitted from other devices via a communication interface (not shown) between devices.
The audio capture device 114 may capture desired audio and store the captured audio in the storage device 104 for use by other components. The audio capture device 114 may be implemented using any suitable recording equipment, such as an independent microphone or the built-in microphone of a mobile terminal. The audio capture device 114 may also be the built-in microphone of a video camera; that is, the audio capture device 114 may be integrated with the video capture device 110. The audio capture device 114 is optional, and the electronic device 100 need not include it: the electronic device 100 may capture audio using the audio capture device 114, or may receive audio transmitted from other devices via a communication interface (not shown) between devices.
Exemplarily, the exemplary electronic device for implementing the method and apparatus for processing video and related audio according to embodiments of the present invention may be realized in equipment such as a personal computer or a remote server.
In the following, a method for processing video and related audio according to an embodiment of the present invention is described with reference to Fig. 2. Fig. 2 shows a schematic flowchart of a method 200 for processing video and related audio according to an embodiment of the present invention. As shown in Fig. 2, the method 200 for processing video and related audio includes the following steps.
In step S210, a video containing one or more faces of one or more objects is obtained.
An "object" as described herein may be any person whose voice needs to be recorded, such as personnel attending a meeting or personnel whose behavior needs to be monitored. The same object has the same face; the position and expression of that face may differ across video frames, and the face of the same object can be tracked across consecutive video frames using face tracking technology.
In a meeting scenario, a camera (such as an independent video camera or the camera of a mobile terminal) may be used to capture the video of the personnel in the meeting room during the meeting. Ideally, the captured video contains the faces of all participants, or at least the faces of all participants who speak. The camera may transmit the captured video in real time to a server, which processes it in real time. Of course, the video capture end may also be implemented together with the processing end, where the processing end is used to process the video captured by the video capture end.
In step S220, face detection is performed on each video frame in the video, to identify the one or more faces.
In this step, it may be determined whether each captured video frame contains a face and, if so, the face region is located in the video frame. A pre-trained face detector may be used to locate the face region in the captured video frame. For example, face detection and recognition algorithms such as the Haar algorithm and the Adaboost algorithm may be used in advance to train a face detector on a large number of pictures; for a single captured video frame, the pre-trained face detector can quickly locate the face region in it. In addition, for multiple consecutively captured video frames (i.e., a piece of video), after the face region is located in the first video frame, the position of the face region in the current video frame can be tracked in real time based on its position in the preceding video frame; that is, face tracking can be realized.
It should be appreciated that the present invention is not limited by the specific face detection method used: both existing face detection methods and face detection methods developed in the future can be applied to the method for processing video and related audio according to embodiments of the present invention, and shall also fall within the protection scope of the present invention.
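As an illustration only, the following is a minimal sketch of per-frame face detection using OpenCV's bundled Haar cascade (one of the detector families named above); the file name and parameter values are conventional defaults, not part of the invention.

```python
import cv2

# Load OpenCV's pretrained frontal-face Haar cascade (a Haar + Adaboost detector).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame):
    """Return the (x, y, w, h) face regions located in one BGR video frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
```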
In step S230, an audio, acquired in the same time period as the video, containing the voice of at least some objects among the one or more objects is obtained.
In a meeting scenario, a microphone (such as an independent microphone or the microphone of a mobile terminal) may be used to capture the audio of the personnel in the meeting room during the meeting, recording what they say, i.e., their voices. In this embodiment, the capture of the audio of the personnel in the meeting room proceeds simultaneously with the capture of their video; that is, the audio and the video need to be acquired simultaneously within the same time period. Ideally, the captured audio contains the voices of all participants, or at least the voices of all participants who speak. It will be appreciated that in some cases, for example when the number of microphones is insufficient or the microphone quality is too poor for the captured audio to be clear, the voices of all participants (or of all participants who speak) may not be obtainable. The microphone may transmit the captured audio in real time to a server, which processes it in real time. Of course, the audio capture end may also be implemented together with the processing end, where the processing end is used to process the audio captured by the audio capture end.
In step S240, for each of at least some faces among the one or more faces, an audio portion in the audio corresponding to the face is determined, wherein the at least some faces belong respectively to the at least some objects.
The video has a time axis, and each video frame has an exact acquisition time. Since changes in a person's face (mainly the mouth) can be detected in the video while the person is speaking, the time at which the person speaks can be judged from the video. Likewise, the audio has a time axis, and the acquisition time of the audio data is also known. When someone speaks, sound-wave changes can be detected in the audio, so the time at which the person speaks can also be judged from the audio. It is appreciated that by combining the video and the audio data, it is relatively easy to know when a person speaks and what he or she says (i.e., his or her voice). Ideally, the faces and voices of all personnel in the meeting room are recorded, and especially the faces and voices of all personnel who have spoken, so that the speech content of each person who has spoken can later be viewed or retrieved by a user. However, it is possible that the video contains the faces of all participants (or of all participants who speak) while the audio does not contain all of their voices, or, conversely, that the audio contains the voices of all participants (or of all participants who speak) while the video does not contain all of their faces. In such cases, the faces of some participants (or of some participants who speak) in the meeting room and the corresponding audio portions can still be determined.
In step S250, for each of the at least some faces, the face is associated with the corresponding audio portion.
After the audio portion corresponding to a certain face is determined, the face may be associated with that audio portion. For example, if, in a video acquired one morning, face detection finds that a certain object speaks in the video frames from 9:00 to 9:10, and voice changes are found in the audio acquired in the same period from 9:00 to 9:10, then the facial image of the detected object (such as an entire video frame containing the object's face, or an image containing only the object's face obtained by cropping the video frame) can be associated with the piece of audio acquired from 9:00 to 9:10. In this way, when a user later views the meeting record, the user can be informed that this object spoke from 9:00 to 9:10 and of the object's speech content during that period. Moreover, this manner of association allows the user to retrieve the meeting record very conveniently.
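Purely as an illustration of this association step (the record type and names below are hypothetical, not the claimed apparatus), the link between a face and its audio portion can be stored as a simple time-stamped record:

```python
from dataclasses import dataclass

@dataclass
class FaceAudioAssociation:
    face_image: bytes    # frame (or cropped face image) showing the speaker
    audio_start: float   # acquisition start of the audio portion, in seconds
    audio_end: float     # acquisition end of the audio portion, in seconds

def associate(face_image, speak_start, speak_end):
    # The audio acquired over the face's detected speaking interval
    # (e.g. 9:00 to 9:10) becomes the audio portion associated with that face.
    return FaceAudioAssociation(face_image, speak_start, speak_end)
```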
It will be appreciated that the order of implementation of the steps shown in Fig. 2 is only illustrative and not restrictive; steps S210 to S250 may be implemented in any suitable order. In one example, steps S210 to S250 are carried out in real time. For example, the acquisition and obtaining of the video and the audio may start simultaneously, i.e., steps S210 and S230 may be implemented simultaneously. More specifically, in a meeting scenario, the camera continuously acquires video frames of the participants and transmits them to the connected local processor or remote server, while the microphone continuously acquires the audio of the participants and transmits the acquired audio data to the connected local processor or remote server. Each time the local processor or remote server receives a new video frame (i.e., implements step S210), it performs face detection on that video frame (i.e., implements step S220). The local processor or remote server also receives new audio data while receiving new video frames (i.e., implements step S230). The local processor or remote server may determine, according to the faces identified in step S220, the corresponding audio portions (i.e., implement step S240), and associate the faces with the corresponding audio portions (i.e., implement step S250). The entire method 200 described above is implemented continuously and in real time. In another example, the camera may store the captured video of the entire meeting, and the microphone may likewise store the captured audio of the entire meeting. After the meeting ends, the camera and the microphone may transmit the complete video and audio to the local processor or remote server, which then processes them. In this case, step S210 may be implemented before, after or at the same time as step S230, and step S220 may be implemented before, after or at the same time as step S230.
Exemplarily, the method for processing video and related audio according to embodiments of the present invention may be implemented in a unit or system having a memory and a processor.
The method for processing video and related audio according to embodiments of the present invention may be deployed at a client. For example, in a meeting scenario, the camera of a mobile terminal (i.e., the video capture device) may capture the video of the participants, the microphone of the mobile terminal (i.e., the audio capture device) may capture their audio, and the processor of the mobile terminal (i.e., the apparatus for processing video and related audio) then processes the video and audio. In another meeting scenario, the video capture device, the audio capture device and the apparatus for processing video and related audio are deployed in the meeting room. For example, an independent video camera (i.e., the video capture device) may capture the video of the participants, and an independent microphone or the built-in microphone of the video camera (i.e., the audio capture device) may capture their audio; the video camera and the microphone then send the captured video and audio to a connected computer, whose processor (i.e., the apparatus for processing video and related audio) processes the video and audio.
Alternatively, the method for processing video and related audio according to embodiments of the present invention may be deployed in a distributed manner between a server end (or cloud) and a client (such as a mobile terminal). For example, in a meeting scenario, a camera (such as an independent video camera or the camera of a mobile terminal) may capture the video of the participants, and a microphone (such as an independent microphone, the built-in microphone of the video camera, or the microphone of a mobile terminal) may capture the audio of the objects; the camera and the microphone transmit the captured video and audio to the server end (or cloud), which processes the video and audio.
With the method for processing video and related audio according to embodiments of the present invention, by associating an object's face with its voice, the time at which the object speaks and the content of its speech can be determined, which facilitates later viewing and retrieval of the object's speech content by a user. The present invention is applicable to any suitable scenario in which an object's voice needs to be recorded, such as a meeting scenario.
Fig. 3 shows a schematic flowchart of a method 300 for processing video and related audio according to another embodiment of the present invention. Steps S310, S320, S330 and S380 of the method 300 shown in Fig. 3 correspond respectively to steps S210, S220, S230 and S250 of the method 200 shown in Fig. 2. Those skilled in the art can understand these steps of Fig. 3 from the above description with reference to Fig. 2, and for brevity they are not described again here. Step S370 shown in Fig. 3 is a specific implementation of step S240 shown in Fig. 2 and is described in detail below. According to this embodiment, before step S370, the method 300 may further include the following steps.
In step S340, for each of the at least some faces, the video is segmented according to the mouth movements of the face, to obtain initial video segments corresponding to the face.
The face detection in step S320 can detect the contour of a face and locate the face region. Facial key points can then be further located within the located face region. Facial key points generally include key points of the face with strong characterization ability, such as the eyes, eye corners, eye centers, eyebrows, nose, nose tip, mouth and mouth corners. In the present invention, it is mainly the mouth key points that need to be located. A pre-trained key point locator may be used to locate the facial key points in the face region. For example, a key point locator may be trained in advance, using a cascaded regression method, on a large number of manually annotated face pictures. Alternatively, a traditional facial key point locating method based on a parametric shape model may be used, which learns a parametric model from the appearance features near the key points, iteratively optimizes the key point positions in use, and finally obtains the key point coordinates.
As described above, in the present invention it is mainly the mouth key points that need to be located; for example, the mouth contour may be located. The mouth movements of a face can be judged from the size changes of the mouth contour of the same face over a period of time (namely, across consecutive video frames). For example, if the mouth of the same face keeps growing larger or smaller over a period of time, the object to which the face belongs can be considered to be speaking. If the mouth of the same face remains continuously closed over a period of time, the object can be considered not to be speaking. Alternatively, if the mouth of the same face remains continuously open over a period of time with very little change in the mouth contour, the object can also be considered not to be speaking (it may, for example, be yawning). The video frames acquired while an object is speaking can be separated, according to the mouth movements, from the video frames acquired while the object is silent; that is, the video is segmented according to whether the object is speaking. The initial video segments thus obtained for a face are the parts of the video acquired while the object is in the speaking state, as determined from the mouth movements of the face.
Although multiple faces may exist in the entire video, or rather in each video frame, the object corresponding to each face may speak only within certain periods. For example, suppose the entire video X records the face of object A and the face of object B, and objects A and B speak in periods a and b respectively. The face of object A can be tracked independently, the entire video X can be segmented according to the mouth movements of object A, and the part of the video acquired in period a can be found, i.e., the initial video segment corresponding to the face of object A. For object B, the entire video X can be segmented according to the mouth movements of object B, and the part acquired in period b can be found, i.e., the initial video segment corresponding to the face of object B. That is, the entire video can be segmented separately according to the mouth movements of each face, to obtain the initial video segments corresponding to each face. A sketch of such a segmentation rule is given below.
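A minimal sketch, under assumed inputs, of the mouth-movement segmentation rule (the first-moment/second-moment rule stated in the summary): `mouth_open` is a hypothetical per-frame boolean derived from the located mouth contour, and `hold` stands in for the predetermined closed periods.

```python
def segment_by_mouth(mouth_open, hold=5):
    """Return (start_frame, end_frame) initial video segments for one face."""
    segments, start = [], None
    for t in range(len(mouth_open)):
        if start is None and mouth_open[t]:
            # Mouth opens after `hold` continuously-closed frames:
            # take t as a video segmentation start time.
            if not any(mouth_open[max(0, t - hold):t]):
                start = t
        elif start is not None and not mouth_open[t]:
            # Mouth closes and stays closed for `hold` frames:
            # take t as a video segmentation end time.
            if not any(mouth_open[t:t + hold]):
                segments.append((start, t))
                start = None
    return segments
```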
In step S350, for each of the at least some faces, the audio is segmented according to the speech features in the audio, to obtain initial audio segments corresponding to the face.
The speech features may include voice fluctuation, that is, the rise and fall of the sound waves coming from the objects in the audio. It is appreciated that when someone speaks, voice fluctuation exists in the audio, whereas when nobody speaks, the audio may contain only background noise and hardly any voice fluctuation can be detected. Therefore, whether someone is speaking can be determined from the voice fluctuation. The speech features may also include other kinds of features, such as speech content. If an object continuously utters meaningless interjections such as "uh" or "um" over a period of time, the object may be considered not to be speaking during that period.
The audio data acquired while an object is speaking is separated from the audio data acquired while the object is silent; that is, the audio is segmented according to whether the object is speaking. The initial audio segments thus obtained for a face are the parts of the audio acquired while the object is in the speaking state, as determined from the speech features in the audio.
If the audio is acquired using a globally unified microphone (which may be called a unified microphone), the voices of all objects may be mixed in one audio channel. In this case, after the audio is segmented according to the speech features, it is further necessary to judge, according to the acquisition times of the divided audio segments and the acquisition times of the initial video segments of each object, which object's face each audio segment should correspond to. If the audio is acquired using directional microphones, the captured audio is divided into multiple channels, each containing the voice of only one object; in this case there is no need to judge the correspondence between audio segments and faces, because the correspondence is already determined when the directional microphones are assigned to the objects. These embodiments will be described in further detail hereinafter and are not repeated here. A sketch of speech-feature-based segmentation follows.
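As an illustration only, the following sketch segments an audio channel by short-term energy, a simple stand-in for the voice fluctuation described above; a production system would use a proper voice activity detector, and all parameter values here are assumptions.

```python
import numpy as np

def segment_by_energy(samples, sr, frame_ms=30, threshold=1e-3, hold=10):
    """Return (start_s, end_s) initial audio segments where speech is present."""
    n = int(sr * frame_ms / 1000)                     # samples per analysis frame
    frames = samples[:len(samples) // n * n].reshape(-1, n)
    voiced = (frames.astype(np.float64) ** 2).mean(axis=1) > threshold
    segments, start = [], None
    for t, v in enumerate(voiced):
        if v and start is None:
            start = t                                 # audio segmentation start
        elif start is not None and not v:
            if not voiced[t:t + hold].any():          # stays silent: segment end
                segments.append((start * n / sr, t * n / sr))
                start = None
    if start is not None:                             # speech runs to the end
        segments.append((start * n / sr, len(voiced) * n / sr))
    return segments
```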
In step S360, for each of the at least some faces, effective video segments in the video corresponding to the face and effective audio segments in the audio corresponding to the face are obtained according to the initial video segments and initial audio segments corresponding to the face.
An effective video segment is a video segment finally determined to have been acquired while the object is in the speaking state, and an effective audio segment is an audio segment finally determined to have been acquired while the object is in the speaking state. The effective video segments and effective audio segments can be used to determine the audio portion corresponding to a face and to associate the face with the corresponding audio portion.
In one example, step S360 may include: for each of the at least some faces, determining the initial video segments corresponding to the face to be the effective video segments corresponding to the face, and determining the initial audio segments corresponding to the face to be the effective audio segments corresponding to the face.
In general, the mouth movements of an object are consistent with its voice; that is to say, an object speaks with its mouth open and makes no sound with its mouth closed. Therefore, the initial video segments divided according to the mouth movements of the face and the initial audio segments divided according to the speech features largely coincide on the time axis. In this case, the initial video segments can be directly regarded as the effective video segments, and the initial audio segments as the effective audio segments. This approach determines the effective video segments and effective audio segments relatively simply and quickly.
In another example, step S360 may include: for each of the at least some faces, determining unified segmentation times according to the segmentation times of the initial video segments and initial audio segments corresponding to the face; and performing unified segmentation of the video and the audio according to the unified segmentation times, to obtain the effective video segments and effective audio segments corresponding to the face.
Since video and audio use different criteria for distinguishing whether an object is speaking, the initial video segments and the initial audio segments may not fully coincide on the time axis, nor is their accuracy in dividing the speaking state necessarily the same. In some cases the initial video segments may divide the speaking state more accurately than the initial audio segments, and in other cases the opposite may hold. Therefore, the segmentation times of the initial video segments and initial audio segments can be considered together to determine more appropriate unified segmentation times, which can be used to uniformly decide whether the object is in the speaking state. The video and the audio are then re-segmented according to the unified segmentation times, yielding the effective video segments and effective audio segments. This approach can improve the accuracy with which the effective audio segments and effective video segments are divided.
For example, suppose that for object A it is found, from its mouth movements in the video, that it is continuously in the speaking state in period 1 from 9:10:20 to 9:11:30, and the piece of video acquired in period 1 is taken as an initial video segment; in addition, it is found, from the speech features in the audio, that object A is continuously in the speaking state in period 2 from 9:10:30 to 9:11:35, and the piece of audio acquired in period 2 is taken as an initial audio segment. Then period 1 and period 2 can be considered together to determine the unified segmentation times. For example, the period from 9:10:20 to 9:11:35 (which may be called "period 3") may be regarded as the time during which object A is actually in the speaking state. In this case, 9:10:20 is taken as the unified segmentation start time and 9:11:35 as the unified segmentation end time; the piece of video acquired in period 3 is taken as an effective video segment, and the piece of audio acquired in period 3 as an effective audio segment. In this example, period 3 is the union of period 1 and period 2. As another example, the period from 9:10:30 to 9:11:30 (which may be called "period 4") may be regarded as the time during which object A is actually in the speaking state. In this case, 9:10:30 is taken as the unified segmentation start time and 9:11:30 as the unified segmentation end time; the piece of video acquired in period 4 is taken as an effective video segment, and the piece of audio acquired in period 4 as an effective audio segment. In this example, period 4 is the intersection of period 1 and period 2. The above ways of determining the unified segmentation times are only examples and not restrictive; the unified segmentation times may also be determined in other suitable ways, which shall all fall within the protection scope of the present invention. A small sketch of the two variants follows.
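A minimal sketch of the union and intersection variants of the unified segmentation times illustrated by the 9:10:20 to 9:11:35 example above; whether the union, the intersection or some other combination is used is a design choice, not fixed by the text.

```python
def unified_times(video_seg, audio_seg, mode="union"):
    """Combine one initial video segment and one initial audio segment
    (each a (start, end) pair in seconds) into unified segmentation times."""
    (vs, ve), (as_, ae) = video_seg, audio_seg
    if mode == "union":                       # period 3: 9:10:20 - 9:11:35
        return min(vs, as_), max(ve, ae)
    return max(vs, as_), min(ve, ae)          # period 4: 9:10:30 - 9:11:30
```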
Compared with segmentation based on speech features alone, segmenting jointly on the basis of facial mouth movements and speech features yields better segmentation results and can better handle situations such as whispering.
In the embodiment shown in Fig. 3, step S370 may specifically include: for each of the at least some faces, determining the effective audio segments corresponding to the face to be the audio portion corresponding to the face.
All effective audio segments corresponding to a certain face can be regarded as the audio portion, corresponding to that face, that needs to be found.
With the embodiment shown in Fig. 3, the video and the audio are segmented according to whether the object is speaking, so that the audio portion corresponding to the object's face can be found more accurately.
According to an embodiment of the present invention, the above step S380 may include: for each of the at least some faces, for each effective video segment corresponding to the face, selecting, from all video frames of the effective video segment, the video frame with the best face quality; and associating the selected video frame with the effective audio segment corresponding to the effective video segment, to form a video-audio combination. The video frame with the best face quality may be the video frame with the highest resolution among all the video frames, or the video frame in which the face is clearest.
An effective audio segment corresponding to an effective video segment is the effective audio segment whose acquisition time coincides, or largely coincides, with that of the effective video segment. Through steps S310 to S370, several effective video segments, and the effective audio segments corresponding to them one-to-one, can be obtained for each face. In one example, the effective video segments can be directly associated with the corresponding effective audio segments to form several video-audio pairs, each of which can be regarded as a video-audio combination. In another example, one or more representative video frames can be selected from each effective video segment, that is, the one or more video frames with the best face quality. The selected video frames are associated with the corresponding effective audio segments, ultimately forming several frame-audio pairs, each of which can also be regarded as a video-audio combination. It is understood that a video frame is a facial image and may contain several faces. The selected video frame may be an original video frame, in which the face corresponding to the effective video segment may be marked (for example, with a box). Alternatively, the selected video frame may be a video frame containing only the face corresponding to the effective video segment. In the latter case, the original video frames in the initial video segments corresponding to a desired face may be converted, in step S340, into new video frames containing only the desired face; or the original video frames in the effective video segments corresponding to the desired face may be so converted in step S360; or the original video frames or selected video frames in the effective video segments corresponding to the desired face may be so converted in step S380. Ideally, each video-audio combination thus formed pairs a facial image containing only a certain face with one effective audio segment corresponding to that face. In this way, in a meeting scenario, when a user wishes to view the meeting record, the record can be presented in the form of a facial image plus an effective audio segment, which is very intuitive and very easy to retrieve.
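Purely illustrative: one simple way to pick the "best face quality" frame of an effective video segment is a sharpness proxy such as Laplacian variance; this criterion is an assumption, since the text allows either resolution or clarity to be used.

```python
import cv2

def sharpness(frame):
    """Variance of the Laplacian: a common focus/clarity measure."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def make_combination(segment_frames, effective_audio_segment):
    """Pair the clearest frame of an effective video segment with its
    corresponding effective audio segment: one video-audio combination."""
    best_frame = max(segment_frames, key=sharpness)
    return {"frame": best_frame, "audio": effective_audio_segment}
```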
According to an embodiment of the present invention, in order to compensate for face detection errors, the method 200 (or 300) may further include a classification step. Fig. 4 shows a schematic flowchart of a classification step according to an embodiment of the present invention. As shown in Fig. 4, the classification step may include the following steps.
In step S410, for face corresponding to particular video frequency audio combination to the video in particular video frequency audio combinationFrame carries out face characteristic extraction, to obtain Given Face feature, wherein particular video frequency audio combination is that at least partly face institute is rightOne of all video/audios combination answered.
In one example, video acquisition device and audio collecting device acquire video and audio in real time, for handling viewThe device of frequency and related audio handles video and audio in real time.That is, video and audio are handled in acquisition.In this case, more and more video/audio combinations can be obtained as time goes by.Whenever one new video sound of acquisitionWhen frequency combines, it can combine the video/audio currently obtained combination with all video/audios previously obtained and compare, such asThe video/audio combination that fruit discovery currently obtains is combined with a certain video/audio previously obtained belongs to the same object, then by twoPerson is referred to same target.For the video/audio combination currently obtained, some video that can be calculated separately and previously obtainThe face of audio combination and the similarity of voice.
For the first complete acquisition of video and audio, the case where then processing, any video/audio combination can chooseAs particular video frequency audio combination, the similarity of face and voice that it is combined with remaining video/audio is calculated.
The face characteristic of Given Face is mainly extracted in step S410.For example, for effective by a video frame and oneThe video/audio combination of audio section composition, video frame may only include the corresponding face of video/audio combination, it is also possible into oneStep includes other faces.When carrying out face characteristic extraction, it is special to need to combine corresponding face progress only for video/audioSign is extracted.
Face characteristic extracts, and also referred to as face characterizes, it is the process that feature modeling is carried out to face.Face characteristic extraction canTo be realized using two class methods: one is the methods based on geometrical characteristic;Another is based on algebraic characteristic or statistical learningMethod.Method based on geometrical characteristic is mainly to pass through to extract face vitals (such as eyes, nose, mouth, chin)Geometry and geometrical relationship are as face characteristic.The positions such as eyes, nose, mouth, the chin of face are properly termed as characteristic point.BenefitThe characteristic component that can measure face characteristic can be constructed with these characteristic points, characteristic component generally includes the Euclidean between characteristic pointDistance, curvature and angle etc..Face characteristic as described herein may include features described above component.Based on algebraic characteristic or statisticsThe method of habit is that video frame is regarded as to a matrix, and by making matrixing or linear projection, the statistics that can extract face is specialSign, this is a kind of thought based on entirety, and entire video frame (i.e. facial image) is regarded as a mode and is identified, therefore thisKind method is also a kind of template matching method.Face characteristic as described herein can also include above-mentioned statistical nature.
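As an illustration of the geometric-feature approach, the following sketch builds a toy feature vector from facial landmark coordinates (the landmark layout and the normalization are assumptions for the example, not prescribed by the text):

```python
import numpy as np

def geometric_face_feature(landmarks: np.ndarray) -> np.ndarray:
    """landmarks: one (x, y) row per feature point (eyes, nose tip,
    mouth corners, chin, ...). Returns all pairwise Euclidean
    distances, normalized by the inter-ocular distance so the feature
    is insensitive to face scale; curvature and angle components could
    be appended in the same way."""
    n = len(landmarks)
    dists = [np.linalg.norm(landmarks[i] - landmarks[j])
             for i in range(n) for j in range(i + 1, n)]
    inter_ocular = np.linalg.norm(landmarks[0] - landmarks[1])  # rows 0 and 1 assumed to be the eyes
    return np.asarray(dists) / inter_ocular
```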
The face feature extraction methods above are merely exemplary and not limiting; any other face feature extraction method, known or developed in the future, can be used to process the video frame in the particular video/audio combination to obtain the particular face feature.
In the above manner, the face feature of the face corresponding to the particular video/audio combination, i.e., the particular face feature, can be obtained.
In step S420, sound feature extraction is performed on the effective audio segment in the particular video/audio combination to obtain a particular sound feature.
Sound feature extraction can be implemented by extracting and selecting acoustic or linguistic features of the speaker's voiceprint that have strong separability and high stability. The extracted sound features may include: (1) acoustic features related to the anatomical structure of the human vocal mechanism (such as spectrum, cepstrum, formants, pitch and reflection coefficients), as well as nasal sounds, breathy voice, hoarseness, laughter and the like; (2) semantics, rhetoric, pronunciation and speech habits influenced by socioeconomic status, education level, birthplace and the like; and (3) features such as personal mannerisms influenced by one's parents, and prosody, rhythm, speech rate, intonation and volume.
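For illustration, the following is a minimal sketch of extracting one of the acoustic features named above (cepstral coefficients) with the librosa library; a production voiceprint system would rather use a dedicated speaker-embedding model:

```python
import numpy as np
import librosa

def sound_feature(wav_path: str, n_mfcc: int = 20) -> np.ndarray:
    """Minimal voiceprint stand-in: the mean MFCC (cepstral) vector of
    one effective audio segment."""
    samples, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, n_frames)
    return mfcc.mean(axis=1)
```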
In step S430, for each of the remaining video/audio combinations among all the video/audio combinations, the face similarity between the particular face feature and the face feature corresponding to that video/audio combination is calculated.
The face feature corresponding to a video/audio combination is the face feature of the face to which that combination corresponds. In the case where the video and audio are acquired and processed in real time, the face feature corresponding to each new video/audio combination can be calculated when the combination is obtained, and the calculated face feature can be stored in a storage device. Meanwhile, the face feature corresponding to the currently obtained video/audio combination (i.e., the particular face feature) can be compared with the stored face feature corresponding to each previously obtained video/audio combination, and the similarity between the two can be calculated.
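The text does not prescribe a similarity measure; a common choice for comparing two feature vectors is cosine similarity, sketched here:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Returns a value in [-1, 1], higher meaning more similar.
    Any other suitable similarity measure could be substituted."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```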
In the case where the video and audio are first acquired in full and then processed, the face features of all the video/audio combinations can be calculated at the same time, and any one of the combinations can be selected as the particular video/audio combination, whose face similarity with each of the remaining video/audio combinations is then calculated.
In step S440, for each of the remaining video/audio combinations, the sound similarity between the particular sound feature and the sound feature corresponding to that video/audio combination is calculated.
In the case where the video and audio are acquired and processed in real time, the sound feature corresponding to each new video/audio combination can be calculated when the combination is obtained, and the calculated sound feature can be stored in a storage device. Meanwhile, the sound feature corresponding to the currently obtained video/audio combination (i.e., the particular sound feature) can be compared with the stored sound feature corresponding to each previously obtained video/audio combination, and the similarity between the two can be calculated.
In the case where the video and audio are first acquired in full and then processed, the sound features of all the video/audio combinations can be calculated at the same time, and any one of the combinations can be selected as the particular video/audio combination, whose sound similarity with each of the remaining video/audio combinations is then calculated.
In step S450, for each of the remaining video/audio combinations, the average of the face similarity and the sound similarity between the particular video/audio combination and that video/audio combination is calculated, so as to obtain the average similarity between the particular video/audio combination and that video/audio combination.
For example, suppose that between a certain video/audio combination x and the particular video/audio combination y the face similarity is 80% and the sound similarity is 90%. The average of the two is 85%; that is, the average similarity between video/audio combination x and the particular video/audio combination y is 85%.
In step S460, for each of the remaining video/audio combinations, if the average similarity between the particular video/audio combination and that video/audio combination is greater than a similarity threshold, the particular video/audio combination and that video/audio combination are classified to the same object.
The similarity threshold can be any suitable value as needed; the present invention does not limit it. Suppose the similarity threshold is 90%; then the average similarity of 85% is below the threshold, and the above video/audio combination x and the particular video/audio combination y may be considered as not belonging to the same object. If face detection had previously, and erroneously, attributed video/audio combination x and the particular video/audio combination y to the same face, this face detection error can thus be corrected. Suppose instead the similarity threshold is 80%; then the average similarity of 85% exceeds the threshold, and the above video/audio combination x and the particular video/audio combination y may be considered as belonging to the same object. If face detection had previously, and erroneously, attributed video/audio combination x and the particular video/audio combination y to different faces, this face detection error can likewise be corrected.
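Putting steps S430 to S460 together, the following sketch (reusing the `cosine_similarity` helper above, and assuming each combination is represented by a face feature and a sound feature) shows one way the grouping decision could run incrementally:

```python
def group_combinations(features, threshold=0.8):
    """features: list of (face_feature, sound_feature) pairs, one per
    video/audio combination, in the order they are obtained. Each new
    combination joins the first stored object whose average of face
    and sound similarity exceeds the threshold; otherwise it starts a
    new object."""
    objects = []  # each object is a list of (face_feature, sound_feature)
    for face_f, sound_f in features:
        for obj in objects:
            face_0, sound_0 = obj[0]  # compare against the object's first combination
            avg = (cosine_similarity(face_f, face_0)
                   + cosine_similarity(sound_f, sound_0)) / 2
            if avg > threshold:
                obj.append((face_f, sound_f))
                break
        else:
            objects.append([(face_f, sound_f)])  # no match: new object
    return objects
```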
Through the classification operation, the faces of the same object in the video can be grouped together, which significantly improves the accuracy with which the audio is classified. In addition, the classification operation makes the method for processing video and related audio more robust to situations such as different tones or volumes of the same object, and reduces the chance that one object speaking in different moods is classified as multiple objects.
It should be understood that Fig. 4 is merely exemplary and not limiting; steps S410 to S460 can be performed in any reasonable order and are not limited to the order shown in Fig. 4.
According to embodiments of the present invention, the audio is acquired by a unified microphone, and step S350 may include: segmenting the audio according to the voice features in the audio to obtain mixed audio pieces; and, for each of the at least some faces, selecting from the mixed audio pieces a mixed audio piece whose acquisition time is consistent with that of the initial video segment corresponding to the face, as the initial audio segment corresponding to the face.
In a conference scenario, one microphone (or possibly several, i.e., the unified microphone) can be used to collect the voices of all participants. In this case, the voices of all participants are contained in a single channel of audio. After the audio is segmented according to voice features, the resulting audio pieces (i.e., the mixed audio pieces) may correspond to different objects. From the initial video segments it is known when each object is speaking. For example, suppose object A has three initial video segments. Combining the segmentation times of these initial video segments, the three mixed audio pieces whose acquisition times are consistent with those initial video segments can be found; these three mixed audio pieces are exactly the sought initial audio segments corresponding to the face of object A. It should be noted that "consistent acquisition times" as described herein may include acquisition times that are synchronized or substantially synchronized, and should not be understood as requiring the acquisition times to be exactly identical.
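A minimal sketch of this time-based selection, assuming each segment is described by its (start, end) acquisition times in seconds and allowing a small tolerance so that "substantially synchronized" segments still match:

```python
def match_mixed_pieces(video_segments, mixed_pieces, tol=0.5):
    """video_segments / mixed_pieces: lists of (start, end) acquisition
    times in seconds. For each initial video segment of a face, selects
    the mixed audio piece whose interval matches within the tolerance.
    Returns the selected pieces, i.e. the initial audio segments for
    that face."""
    selected = []
    for v_start, v_end in video_segments:
        for a_start, a_end in mixed_pieces:
            if abs(v_start - a_start) <= tol and abs(v_end - a_end) <= tol:
                selected.append((a_start, a_end))
                break
    return selected
```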
Acquiring audio with a unified microphone and using face information to find the corresponding voice information is a simple and efficient approach.
According to embodiments of the present invention, the audio includes one or more channels of audio respectively acquired by one or more directional microphones, and before step S330 the method 300 may further include: controlling the one or more directional microphones to point respectively at the at least some objects so as to acquire the one or more channels of audio.
A directional microphone may be a directional microphone mounted on a pan-tilt head. A directional microphone can acquire the voice of the object it points at much more clearly, while picking up almost none of the voices of other objects. Therefore, directional microphones enable audio acquisition with a high signal-to-noise ratio.
In a conference scenario, the video containing the faces of all participants can be acquired first. Then, face detection is performed in real time, and directional microphones are assigned to participants according to the detected faces. Preferably, the number of directional microphones is equal to or greater than the number of the one or more faces described above. In this way, when the one or more objects are all the objects in the venue, it is ensured that every object in the venue is assigned a directional microphone, so that the voices of all objects are recorded and no voice is missed. If the number of directional microphones is less than the number of the one or more faces, the directional microphones can be assigned flexibly. In general, only one object speaks at any given moment. When the object currently assigned a directional microphone falls silent, the directional microphone can be reassigned to the next object that speaks. These operations can be carried out based on the face detection results, as shown in the sketch below.
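A sketch of such flexible reassignment, under the assumption that face detection supplies, per frame, the set of faces currently speaking (the data layout here is illustrative, not part of the invention):

```python
def reassign_microphones(mic_ids, speaking_face_ids, assignment):
    """assignment: dict mapping microphone id -> face id (or None).
    speaking_face_ids: faces currently judged to be speaking from the
    face detection results. A microphone whose current target is silent
    is re-pointed at a speaking face that has no microphone yet."""
    speaking = set(speaking_face_ids)
    unserved = [f for f in speaking if f not in assignment.values()]
    for mic in mic_ids:
        target = assignment.get(mic)
        if (target is None or target not in speaking) and unserved:
            assignment[mic] = unserved.pop(0)  # re-point at a speaking face
    return assignment
```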
Of course, directional microphones may also be fixedly assigned to objects; in this case, if the number of directional microphones is less than the number of the one or more faces, only the voices of some of the objects are collected. Specifically, after the faces in the venue are detected, one or more directional microphones are assigned to at least some of the objects, corresponding to the at least some faces. The number of objects assigned a directional microphone depends on the number of directional microphones. Each directional microphone can acquire one channel of audio, so one or more channels of audio can be obtained.
In this embodiment, step S350 may include: for each of the at least some faces, segmenting the channel of audio acquired by the directional microphone pointed at the object to which the face corresponds, according to the voice features in that channel of audio, so as to obtain the initial audio segments corresponding to the face.
Since the object each directional microphone points at is known, the correspondence between each channel of audio and each face is known. For example, suppose directional microphone m points at object A; then the channel of audio from directional microphone m contains only the voice of object A. By segmenting the channel of audio from directional microphone m, the initial audio segments corresponding to object A can be obtained directly. Of course, when the objects pointed at by the directional microphones are adjusted flexibly, the correspondence between each channel of audio and each face may change. However, such changes are also known, so the correspondence between each channel of audio and each face can be determined period by period, and the initial audio segments corresponding to each object can then be determined.
By combining face detection with directional microphones mounted on pan-tilt heads, audio that is much clearer than that from a wide-range microphone (such as the unified microphone described above) can be obtained, which yields very good gains for the subsequent steps of audio segmentation, classification and speech recognition.
According to embodiments of the present invention, before controlling the one or more directional microphones to point respectively at the at least some objects so as to acquire the one or more channels of audio, the method 300 may further include: determining the priority of each face according to the face features and/or motion of the one or more faces; and determining, according to the priority of each face, the objects at which the one or more directional microphones will point, as the at least some objects.
Directional microphones can be assigned to objects according to priority, which is particularly useful when the number of directional microphones is less than the number of the one or more faces. The priority can be determined according to the face features and/or motion of a face. The face features may include the size of the facial contour. For example, a directional microphone can be placed together with the camera; when a face captured by the camera is large, the object to which the face corresponds can be considered close to the directional microphone, so the priority of that face can be raised, allowing a directional microphone to be preferentially assigned to the corresponding object. The face features may also include the mouth action of the face. For example, if the mouth actions across several consecutive video frames in the video show that the object corresponding to a first face has stopped speaking while the object corresponding to a second face has started speaking, the priority of the first face can be lowered and that of the second face raised, so that the directional microphone originally assigned to the object of the first face can be reassigned to the object of the second face. The motion of a face may include whether the face is stable. For example, if several consecutive video frames in the video show that the object corresponding to a face is relatively stable, the priority of that face can be raised, allowing a directional microphone to be preferentially assigned to the corresponding object. A possible scoring scheme is sketched below.
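A toy scoring function combining these cues (the field names and weights are assumptions for the example only):

```python
def face_priority(face):
    """face: dict with the cues discussed above: 'contour_area'
    (facial contour size), 'mouth_moving' (speaking now) and
    'stable' (steady across consecutive frames)."""
    score = face["contour_area"]      # larger face: closer to the microphone
    if face["mouth_moving"]:
        score *= 2.0                  # currently speaking faces rank higher
    if face["stable"]:
        score *= 1.5                  # stable faces rank higher
    return score

# Microphones would then be pointed at the highest-priority faces, e.g.:
# targets = sorted(faces, key=face_priority, reverse=True)[:num_microphones]
```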
Priorities allow the pointing of the directional microphones to be adjusted more flexibly, ensuring that the voices of as many objects as possible are collected when the number of directional microphones is insufficient.
According to embodiments of the present invention, step S340 can be implemented according to the following rule: for each of the at least some faces, if the mouth of the face changes from a closed state to an open state at a first moment and has been continuously closed during a first predetermined period before the first moment, the first moment is taken as a video segmentation start time; if the mouth of the face changes from an open state to a closed state at a second moment and remains continuously closed during a second predetermined period after the second moment, the second moment is taken as a video segmentation end time, wherein the part of the video located between an adjacent video segmentation start time and video segmentation end time is an initial video segment.
Any two of the first predetermined period, the second predetermined period, and the third and fourth predetermined periods described below may be the same or different, and can be set as needed; the present invention does not limit this.
If the mouth of an object opens suddenly after having been closed for the first predetermined period, the object can be considered to have started speaking, and that point in time can be regarded as a video segmentation start time. If the mouth of an object, having been open, suddenly closes and stays closed for the second predetermined period, the object can be considered to have stopped speaking, and that point in time can be regarded as a video segmentation end time.
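The rule can be sketched as follows, assuming a per-frame open/closed mouth decision is already available from face detection and expressing the predetermined periods in frames:

```python
def segment_by_mouth(mouth_open, t1, t2):
    """mouth_open: per-frame booleans (True = open) for one face;
    t1, t2: first and second predetermined periods, in frames. A
    closed->open transition preceded by at least t1 closed frames
    opens a segment; an open->closed transition followed by at least
    t2 closed frames closes it. Returns (start, end) frame indices,
    i.e. the initial video segments."""
    segments, start = [], None
    for i in range(1, len(mouth_open)):
        if mouth_open[i] and not mouth_open[i - 1]:
            # closed -> open at frame i: require t1 closed frames before it
            if start is None and i >= t1 and not any(mouth_open[i - t1:i]):
                start = i
        elif not mouth_open[i] and mouth_open[i - 1]:
            # open -> closed at frame i: require t2 closed frames after it
            if start is not None and not any(mouth_open[i:i + t2]):
                segments.append((start, i))
                start = None
    return segments
```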
Of course, it should be understood that the video may also be segmented according to mouth actions under other rules, or segmented according to other face features; all such variations fall within the protection scope of the present invention.
According to embodiments of the present invention, step S350 can be implemented according to the following rule: if the voice in the audio changes from an unvoiced state to a voiced state at a third moment and has been continuously unvoiced during a third predetermined period before the third moment, the third moment is taken as an audio segmentation start time; if the voice in the audio changes from a voiced state to an unvoiced state at a fourth moment and remains continuously unvoiced during a fourth predetermined period after the fourth moment, the fourth moment is taken as an audio segmentation end time, wherein the part of the audio located between an adjacent audio segmentation start time and audio segmentation end time is an initial audio segment.
Similarly to the video segmentation, if the voice in the audio suddenly becomes voiced after having been unvoiced for the third predetermined period, an object can be considered to have started speaking, and that point in time can be regarded as an audio segmentation start time. If the voice in the audio suddenly becomes unvoiced after a voiced state and remains so for the fourth predetermined period, the object can be considered to have stopped speaking, and that point in time can be regarded as an audio segmentation end time.
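A sketch of this rule, using a crude energy threshold as the voicing decision (a placeholder only; a real system would use a proper voice activity detector):

```python
import numpy as np

def segment_by_voice(samples, sr, t3, t4, frame_ms=20, energy_thresh=1e-4):
    """samples: mono waveform; sr: sample rate; t3, t4: third and
    fourth predetermined periods in seconds. Returns (start, end)
    times in seconds, i.e. the initial audio segments."""
    hop = max(1, int(sr * frame_ms / 1000))
    frames = [samples[i:i + hop] for i in range(0, len(samples) - hop + 1, hop)]
    voiced = [float(np.mean(np.square(f))) > energy_thresh for f in frames]
    k3 = max(1, int(t3 * 1000 / frame_ms))  # periods expressed in frames
    k4 = max(1, int(t4 * 1000 / frame_ms))
    segments, start = [], None
    for i in range(1, len(voiced)):
        if voiced[i] and not voiced[i - 1]:
            # unvoiced -> voiced: require k3 unvoiced frames before it
            if start is None and i >= k3 and not any(voiced[i - k3:i]):
                start = i * frame_ms / 1000
        elif not voiced[i] and voiced[i - 1]:
            # voiced -> unvoiced: require k4 unvoiced frames after it
            if start is not None and not any(voiced[i:i + k4]):
                segments.append((start, i * frame_ms / 1000))
                start = None
    return segments
```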
Of course, it should be understood that the audio may also be segmented according to voice features under other rules; all such variations fall within the protection scope of the present invention.
Fig. 5 shows a schematic flowchart of a method 500 for processing video and related audio according to another embodiment of the present invention. Steps S510 to S550 of the method 500 shown in Fig. 5 correspond respectively to steps S210 to S250 of the method 200 shown in Fig. 2. Those skilled in the art can understand the above steps in Fig. 5 from Fig. 2 and the description above; for brevity, details are not repeated here. In this embodiment, after step S550, the method 500 may further include the following steps.
In step S560, for each of the at least some faces, speech recognition is performed on the audio part corresponding to the face to obtain a text file representing the audio part corresponding to the face.
In step S570, for each of the at least some faces, the text file is associated with the face.
For a face, after the audio part corresponding to the face is obtained, speech recognition can be performed. Speech recognition can be implemented with conventional techniques, which are not repeated here. The recognized text file is the speech content of the object expressed in written form, and can be associated with the object that spoke. It should be understood that, in embodiments including the classification step, an effective audio segment originally associated with one object may have been re-classified to another object; in this case, speech recognition can be performed on the effective audio segments after classification, and the recognized text files are associated with the correct objects, as sketched below.
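A minimal sketch of attaching transcripts after classification; the recognizer itself is left abstract here, since the text defers it to conventional techniques:

```python
def attach_transcripts(face_to_audio_segments, transcribe):
    """face_to_audio_segments: mapping from a face (or object) id to
    its effective audio segments after the classification step;
    transcribe: any speech-to-text function (a placeholder). Returns
    a mapping from face id to the list of recognized text files."""
    return {face_id: [transcribe(segment) for segment in segments]
            for face_id, segments in face_to_audio_segments.items()}
```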
Speech recognition converts the speech content of an object into text, which facilitates storage of the speech and allows users to conveniently retrieve the speech by keyword.
According to embodiments of the present invention, the method 200 (300 or 500) may further include: outputting desired information. The desired information includes one or more of the following: the video, the audio, a video frame containing a particular face among the one or more faces, the acquisition time of the video frame containing the particular face, the audio part corresponding to the particular face, and the acquisition time of the audio part corresponding to the particular face.
The particular face can be, for example, one of the at least some faces described above. For example, in a conference scenario, after the video and audio acquired during an entire meeting have been processed, it can be reported which participants spoke during the meeting and what they said. The face images of the participants who spoke and their speech content (in audio or text form) can be output and presented to a user who wishes to review the meeting information. Of course, the face images of all participants, together with the speech content (in audio or text form) of those who spoke, may also be output. Furthermore, the entire video or audio acquired during the meeting may be output.
In one example, the relevant information of the particular face can be output using an output device such as the output device 108 shown in Fig. 1. For example, the output device 108 may be an output interface at the server side, which can output the desired information to the user's client. As another example, the output device 108 may be one or more of a display, a loudspeaker and the like, which can display or play the desired information. When the desired information is displayed, it can be presented with time and/or the objects' faces as clues. For example, in a conference scenario, the face images of all participants, or of the participants who spoke, can be displayed along with their speaking times and/or speech content.
By outputting the desired information, a user can learn in a timely manner which objects spoke and what they said; for example, in a conference scenario, a user can learn what happened throughout the meeting.
According to another aspect of the present invention, a retrieval method is provided. Fig. 6 shows a schematic flowchart of a retrieval method 600 according to an embodiment of the present invention. As shown in Fig. 6, the retrieval method 600 includes the following steps.
In step S610, a retrieval instruction for a target face is received.
The retrieval instruction may come from a user who wishes to review recorded audio and/or video. For example, in a conference scenario, the face images of the participants who spoke during the entire meeting can be presented to the user; the user clicks on a face image via an interactive interface, thereby inputting a retrieval instruction for that face. The retrieval method may be implemented at the server side, for example on the electronic device 100 described above, and the user can input the retrieval instruction via the input device 106. In another example, the user's mobile terminal may transmit the retrieval instruction to the server side, and the server side transmits the retrieved information (such as the audio part corresponding to the face) to the user's mobile terminal. In yet another example, the retrieval method may be implemented at the client side, for example on the user's mobile terminal: the server may store, in a storage device, the video and audio processed by the method for processing video and related audio described above together with other information, such as the audio parts corresponding to the respective faces and/or the associations between faces and audio parts, and may transmit this information to the user's mobile terminal, where the user can retrieve the information needed. In a further example, the method for processing video and related audio described above and the retrieval method may both be implemented at the client side; this case is similar to both being implemented at the server side and is not repeated here.
In step S620, the relevant information of the target face is looked up in a database according to the retrieval instruction, wherein the database is used to store the video and audio processed by the method for processing video and related audio described above and/or the audio parts corresponding to each of the at least some faces, and wherein the relevant information of the target face includes one or more of the following: a video frame containing the target face, the acquisition time of the video frame containing the target face, the audio part corresponding to the target face, and the acquisition time of the audio part corresponding to the target face.
As described above, in a conference scenario the face images of the participants who spoke during the entire meeting can be presented to the user, and the user clicks on a face image via the interactive interface to input a retrieval instruction for that face. After the user clicks on a face, the video parts and audio parts of that face during the meeting can be looked up in the database. A video frame containing the face can be a single video frame or consecutive video frames (i.e., a section of video). In addition, the database may also store the text files representing the audio parts corresponding to each of the at least some faces; that is, the speech content of each object during the meeting, expressed as text, can be stored. In this way, the relevant information of the target face may also include the text file corresponding to the target face. A possible lookup is sketched below.
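One possible lookup, assuming for illustration a simple SQLite table holding the stored associations (the actual database layout is not prescribed by the text):

```python
import sqlite3

def find_face_info(db_path, face_id):
    """Looks up the relevant information of a target face, assuming a
    table records(face_id, frame_path, frame_time, audio_path,
    audio_time, transcript)."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(
            "SELECT frame_path, frame_time, audio_path, audio_time, transcript "
            "FROM records WHERE face_id = ?", (face_id,)).fetchall()
```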
In step S630, the relevant information of the target face is output.
A video frame containing the target face can be output via a display interface (such as a display). The audio part corresponding to the target face can be output via a sound playing device (such as a loudspeaker). By outputting the required video frames or audio parts, the user can be provided with information about the objects who spoke during the meeting and what they said.
For video and audio processed by the method for processing video and related audio described above, the association between each face and its voice is known, so the voice corresponding to a face can be retrieved quickly and effectively.
It should be noted that the present invention is not limited to the above retrieval method; any other suitable retrieval method is also feasible. For example, the target face, the video frames containing the target face, the audio parts corresponding to the target face and the like may also be retrieved by time.
Fig. 7 shows a schematic block diagram of a device 700 for processing video and related audio according to an embodiment of the present invention.
As shown in Fig. 7, the device 700 for processing video and related audio according to an embodiment of the present invention includes a first acquisition module 710, a face detection module 720, a second acquisition module 730, an audio part determination module 740 and an audio association module 750.
The first acquisition module 710 is configured to obtain a video including one or more faces of one or more objects. The first acquisition module 710 can be implemented by the processor 102 in the electronic device shown in Fig. 1 running the program instructions stored in the storage device 104.
The face detection module 720 is configured to perform face detection on each video frame in the video to identify the one or more faces. The face detection module 720 can be implemented by the processor 102 in the electronic device shown in Fig. 1 running the program instructions stored in the storage device 104.
The second acquisition module 730 is configured to obtain audio, acquired during the same time period as the video, that includes the voices of at least some of the one or more objects. The second acquisition module 730 can be implemented by the processor 102 in the electronic device shown in Fig. 1 running the program instructions stored in the storage device 104.
The audio part determination module 740 is configured to determine, for each of at least some faces among the one or more faces, the audio part in the audio corresponding to that face, wherein the at least some faces belong respectively to the at least some objects. The audio part determination module 740 can be implemented by the processor 102 in the electronic device shown in Fig. 1 running the program instructions stored in the storage device 104.
The audio association module 750 is configured to associate, for each of the at least some faces, the face with the corresponding audio part. The audio association module 750 can be implemented by the processor 102 in the electronic device shown in Fig. 1 running the program instructions stored in the storage device 104.
Illustratively, the device 700 for processing video and related audio may further include: a video segmentation module, configured to segment the video according to the mouth action of each of the at least some faces to obtain the initial video segments corresponding to that face; an audio segmentation module, configured to segment the audio according to the voice features in the audio, for each of the at least some faces, to obtain the initial audio segments corresponding to that face; and an effective video and audio acquisition module, configured to obtain, according to the initial video segments and initial audio segments corresponding to each face, the effective video segments in the video corresponding to that face and the effective audio segments in the audio corresponding to that face. The audio part determination module 740 may include a determination submodule, configured to determine, for each of the at least some faces, the effective audio segments corresponding to the face as the audio part corresponding to the face.
Illustratively, the audio association module 750 may include: a video frame selection submodule, configured to select, for each of the at least some faces and for each effective video segment corresponding to the face, the video frame with the best face quality from all the video frames of the effective video segment; and an association submodule, configured to associate the selected video frame with the effective audio segment corresponding to the effective video segment to form one video/audio combination.
Illustratively, the device 700 for processing video and related audio may further include: a face feature extraction module, configured to perform, for the face corresponding to a particular video/audio combination, face feature extraction on the video frame in the particular video/audio combination to obtain a particular face feature, wherein the particular video/audio combination is one of all the video/audio combinations corresponding to the at least some faces; a sound feature extraction module, configured to perform sound feature extraction on the effective audio segment in the particular video/audio combination to obtain a particular sound feature; a face similarity calculation module, configured to calculate, for each of the remaining video/audio combinations among all the video/audio combinations, the face similarity between the particular face feature and the face feature corresponding to that video/audio combination; a sound similarity calculation module, configured to calculate, for each of the remaining video/audio combinations, the sound similarity between the particular sound feature and the sound feature corresponding to that video/audio combination; an average similarity calculation module, configured to calculate, for each of the remaining video/audio combinations, the average of the face similarity and the sound similarity between the particular video/audio combination and that video/audio combination, so as to obtain the average similarity between the particular video/audio combination and that video/audio combination; and a classification module, configured to classify, for each of the remaining video/audio combinations, the particular video/audio combination and that video/audio combination to the same object if the average similarity between them is greater than a similarity threshold.
Illustratively, the effective video and audio acquisition module may include: an effective video segment determination submodule, configured to determine, for each of the at least some faces, the initial video segments corresponding to the face as the effective video segments corresponding to the face; and an effective audio segment determination submodule, configured to determine, for each of the at least some faces, the initial audio segments corresponding to the face as the effective audio segments corresponding to the face.
Illustratively, the effective video and audio acquisition module may include: a unified segmentation time determination submodule, configured to determine, for each of the at least some faces, a unified segmentation time according to the segmentation times of the initial video segments and initial audio segments corresponding to the face; and a unified segmentation submodule, configured to segment the video and the audio uniformly according to the unified segmentation time to obtain the effective video segments and effective audio segments corresponding to the face.
Illustratively, the audio is acquired by a unified microphone, and the audio segmentation module includes: a first segmentation submodule, configured to segment the audio according to the voice features in the audio to obtain mixed audio pieces; and an audio piece selection submodule, configured to select, for each of the at least some faces, from the mixed audio pieces a mixed audio piece whose acquisition time is consistent with that of the initial video segment corresponding to the face, as the initial audio segment corresponding to the face.
Illustratively, the audio includes one or more channels of audio respectively acquired by one or more directional microphones. The device 700 for processing video and related audio may further include a control module, configured to control the one or more directional microphones to point respectively at the at least some objects so as to acquire the one or more channels of audio; and the audio segmentation module may include a second segmentation submodule, configured to segment, for each of the at least some faces, the channel of audio acquired by the directional microphone pointed at the object to which the face corresponds, according to the voice features in that channel of audio, so as to obtain the initial audio segments corresponding to the face.
Illustratively, the number of directional microphones is equal to or greater than the number of the one or more faces.
Illustratively, the device 700 for processing video and related audio may further include: a priority determination module, configured to determine the priority of each face according to the face features and/or motion of the one or more faces; and an object determination module, configured to determine, according to the priority of each face, the objects at which the one or more directional microphones will point, as the at least some objects.
Illustratively, the video segmentation module segments the video according to the following rule: for each of the at least some faces, if the mouth of the face changes from a closed state to an open state at a first moment and has been continuously closed during a first predetermined period before the first moment, the first moment is taken as a video segmentation start time; if the mouth of the face changes from an open state to a closed state at a second moment and remains continuously closed during a second predetermined period after the second moment, the second moment is taken as a video segmentation end time, wherein the part of the video located between an adjacent video segmentation start time and video segmentation end time is an initial video segment.
Illustratively, the audio segmentation module segments the audio according to the following rule: if the voice in the audio changes from an unvoiced state to a voiced state at a third moment and has been continuously unvoiced during a third predetermined period before the third moment, the third moment is taken as an audio segmentation start time; if the voice in the audio changes from a voiced state to an unvoiced state at a fourth moment and remains continuously unvoiced during a fourth predetermined period after the fourth moment, the fourth moment is taken as an audio segmentation end time, wherein the part of the audio located between an adjacent audio segmentation start time and audio segmentation end time is an initial audio segment.
Illustratively, the device 700 for processing video and related audio may further include: a speech recognition module, configured to perform, for each of the at least some faces, speech recognition on the audio part corresponding to the face to obtain a text file representing the audio part corresponding to the face; and a text association module, configured to associate the text file with the face.
Illustratively, the device 700 for processing video and related audio may further include an output module, configured to output desired information, wherein the desired information includes one or more of the following: the video, the audio, a video frame containing a particular face among the one or more faces, the acquisition time of the video frame containing the particular face, the audio part corresponding to the particular face, and the acquisition time of the audio part corresponding to the particular face.
According to another aspect of the present invention, a retrieval device is provided. Fig. 8 shows a schematic block diagram of a retrieval device 800 according to an embodiment of the present invention. The retrieval device 800 includes a receiving module 810, a lookup module 820 and an output module 830.
The receiving module 810 is configured to receive a retrieval instruction for a target face.
The lookup module 820 is configured to look up the relevant information of the target face in a database according to the retrieval instruction, wherein the database is used to store the video and audio processed by the device for processing video and related audio described above and/or the audio parts corresponding to each of the at least some faces, and wherein the relevant information of the target face includes one or more of the following: a video frame containing the target face, the acquisition time of the video frame containing the target face, the audio part corresponding to the target face, and the acquisition time of the audio part corresponding to the target face.
The output module 830 is configured to output the relevant information of the target face.
The embodiments of the retrieval method 600 have been described above; those skilled in the art can understand the structure, operation and advantages of the retrieval device 800 from the above description in conjunction with Fig. 6, which are not repeated here.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented as electronic hardware, or as a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may implement the described functions differently for each specific application, but such implementations should not be regarded as going beyond the scope of the present invention.
Fig. 9 shows a schematic block diagram of a system 900 for processing video and related audio according to an embodiment of the present invention. The system 900 for processing video and related audio includes a video acquisition device 910, an audio acquisition device 920, a storage device 930 and a processor 940.
The video acquisition device 910 is configured to acquire a video including the faces of objects. The audio acquisition device 920 is configured to acquire audio including the voices of the objects.
The storage device 930 stores program code for implementing the corresponding steps of the method for processing video and related audio according to embodiments of the present invention.
The processor 940 is configured to run the program code stored in the storage device 930 to perform the corresponding steps of the method for processing video and related audio according to embodiments of the present invention, and to implement the first acquisition module 710, the face detection module 720, the second acquisition module 730, the audio part determination module 740 and the audio association module 750 of the device 700 for processing video and related audio according to embodiments of the present invention.
In one embodiment, when the program code is run by the processor 940, the following steps are performed: obtaining a video including one or more faces of one or more objects; performing face detection on each video frame in the video to identify the one or more faces; obtaining audio, acquired during the same time period as the video, that includes the voices of at least some of the one or more objects; for each of at least some faces among the one or more faces, determining the audio part in the audio corresponding to that face; and associating the face with the corresponding audio part, wherein the at least some faces belong respectively to the at least some objects.
In one embodiment, when the program code is run by the processor 940, the following steps are also performed: for each of the at least some faces, segmenting the video according to the mouth action of the face to obtain the initial video segments corresponding to the face; segmenting the audio according to the voice features in the audio to obtain the initial audio segments corresponding to the face; and obtaining, according to the initial video segments and initial audio segments corresponding to the face, the effective video segments in the video corresponding to the face and the effective audio segments in the audio corresponding to the face. The step, performed when the program code is run by the processor 940, of determining, for each of at least some faces among the one or more faces, the audio part in the audio corresponding to that face includes: for each of the at least some faces, determining the effective audio segments corresponding to the face as the audio part corresponding to the face.
In one embodiment, the step, performed when the program code is run by the processor 940, of associating, for each of at least some faces among the one or more faces, the face with the corresponding audio part includes: for each of the at least some faces and for each effective video segment corresponding to the face, selecting the video frame with the best face quality from all the video frames of the effective video segment; and associating the selected video frame with the effective audio segment corresponding to the effective video segment to form one video/audio combination.
In one embodiment, when the program code is run by the processor 940, the following steps are also performed: for the face corresponding to a particular video/audio combination, performing face feature extraction on the video frame in the particular video/audio combination to obtain a particular face feature, wherein the particular video/audio combination is one of all the video/audio combinations corresponding to the at least some faces; performing sound feature extraction on the effective audio segment in the particular video/audio combination to obtain a particular sound feature; for each of the remaining video/audio combinations among all the video/audio combinations, calculating the face similarity between the particular face feature and the face feature corresponding to that video/audio combination; calculating the sound similarity between the particular sound feature and the sound feature corresponding to that video/audio combination; calculating the average of the face similarity and the sound similarity between the particular video/audio combination and that video/audio combination, so as to obtain the average similarity between the particular video/audio combination and that video/audio combination; and, if the average similarity between the particular video/audio combination and that video/audio combination is greater than a similarity threshold, classifying the particular video/audio combination and that video/audio combination to the same object.
In one embodiment, the step, performed when the program code is run by the processor 940, of obtaining, for each of the at least some faces, according to the initial video segments and initial audio segments corresponding to the face, the effective video segments in the video corresponding to the face and the effective audio segments in the audio corresponding to the face includes: for each of the at least some faces, determining the initial video segments corresponding to the face as the effective video segments corresponding to the face, and determining the initial audio segments corresponding to the face as the effective audio segments corresponding to the face.
In one embodiment, the step, performed when the program code is run by the processor 940, of obtaining, for each of the at least some faces, according to the initial video segments and initial audio segments corresponding to the face, the effective video segments in the video corresponding to the face and the effective audio segments in the audio corresponding to the face includes: for each of the at least some faces, determining a unified segmentation time according to the segmentation times of the initial video segments and initial audio segments corresponding to the face; and segmenting the video and the audio uniformly according to the unified segmentation time to obtain the effective video segments and effective audio segments corresponding to the face.
In one embodiment, the audio is acquired by a unified microphone, and the step, performed when the program code is run by the processor 940, of segmenting the audio according to the voice features in the audio, for each of the at least some faces, to obtain the initial audio segments corresponding to the face includes: segmenting the audio according to the voice features in the audio to obtain mixed audio pieces; and, for each of the at least some faces, selecting from the mixed audio pieces a mixed audio piece whose acquisition time is consistent with that of the initial video segment corresponding to the face, as the initial audio segment corresponding to the face.
In one embodiment, the audio includes one or more channels of audio respectively acquired by one or more directional microphones, and when the program code is run by the processor 940, the following step is also performed: controlling the one or more directional microphones to point respectively at the at least some objects so as to acquire the one or more channels of audio. The step, performed when the program code is run by the processor 940, of segmenting the audio according to the voice features in the audio, for each of the at least some faces, to obtain the initial audio segments corresponding to the face includes: for each of the at least some faces, segmenting the channel of audio acquired by the directional microphone pointed at the object to which the face corresponds, according to the voice features in that channel of audio, so as to obtain the initial audio segments corresponding to the face.
In one embodiment, the number of directional microphones is equal to or greater than the number of the one or more faces.
In one embodiment, when the program code is run by the processor 940, the following steps are also performed: determining the priority of each face according to the face features and/or motion of the one or more faces; and determining, according to the priority of each face, the objects at which the one or more directional microphones will point, as the at least some objects.
In one embodiment, the step, performed when the program code is run by the processor 940, of segmenting the video according to the mouth action of each of the at least some faces is implemented according to the following rule: for each of the at least some faces, if the mouth of the face changes from a closed state to an open state at a first moment and has been continuously closed during a first predetermined period before the first moment, the first moment is taken as a video segmentation start time; if the mouth of the face changes from an open state to a closed state at a second moment and remains continuously closed during a second predetermined period after the second moment, the second moment is taken as a video segmentation end time, wherein the part of the video located between an adjacent video segmentation start time and video segmentation end time is an initial video segment.
In one embodiment, the step, performed when the program code is run by the processor 940, of segmenting the audio according to the voice features in the audio, for each of the at least some faces, is implemented according to the following rule: if the voice in the audio changes from an unvoiced state to a voiced state at a third moment and has been continuously unvoiced during a third predetermined period before the third moment, the third moment is taken as an audio segmentation start time; if the voice in the audio changes from a voiced state to an unvoiced state at a fourth moment and remains continuously unvoiced during a fourth predetermined period after the fourth moment, the fourth moment is taken as an audio segmentation end time, wherein the part of the audio located between an adjacent audio segmentation start time and audio segmentation end time is an initial audio segment.
In one embodiment, when the program code is run by the processor 940, the following steps are also performed: for each of the at least some faces, performing speech recognition on the audio part corresponding to the face to obtain a text file representing the audio part corresponding to the face; and associating the text file with the face.
In one embodiment, when the program code is run by the processor 940, the following step is also performed: outputting desired information, wherein the desired information includes one or more of the following: the video, the audio, a video frame containing a particular face among the one or more faces, the acquisition time of the video frame containing the particular face, the audio part corresponding to the particular face, and the acquisition time of the audio part corresponding to the particular face.
In addition, according to embodiments of the present invention, a storage medium is also provided, on which program instructions are stored. When the program instructions are run by a computer or processor, they perform the corresponding steps of the method for processing video and related audio of the embodiments of the present invention, and implement the corresponding modules in the device for processing video and related audio according to embodiments of the present invention. The storage medium may include, for example, the memory card of a smart phone, the storage component of a tablet computer, the hard disk of a personal computer, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, or any combination of the above storage media.
In one embodiment, the computer program instructions, when run by a computer, can implement the functional modules of the device for processing video and related audio according to embodiments of the present invention, and/or can perform the method for processing video and related audio according to embodiments of the present invention.
In one embodiment, the computer program instructions, when run by a computer, perform the following steps: obtaining a video including one or more faces of one or more objects; performing face detection on each video frame in the video to identify the one or more faces; obtaining audio, acquired during the same time period as the video, that includes the voices of at least some of the one or more objects; for each of at least some faces among the one or more faces, determining the audio part in the audio corresponding to that face; and associating the face with the corresponding audio part, wherein the at least some faces belong respectively to the at least some objects.
In one embodiment, the computer program instructions, when run by a computer, also perform the following steps: for each of the at least some faces, segmenting the video according to the mouth action of the face to obtain the initial video segments corresponding to the face; segmenting the audio according to the voice features in the audio to obtain the initial audio segments corresponding to the face; and obtaining, according to the initial video segments and initial audio segments corresponding to the face, the effective video segments in the video corresponding to the face and the effective audio segments in the audio corresponding to the face. The step, performed when the computer program instructions are run by a computer, of determining, for each of at least some faces among the one or more faces, the audio part in the audio corresponding to that face includes: for each of the at least some faces, determining the effective audio segments corresponding to the face as the audio part corresponding to the face.
In one embodiment, the step, performed when the computer program instructions are run by a computer, of associating, for each of at least some faces among the one or more faces, the face with the corresponding audio part includes: for each of the at least some faces and for each effective video segment corresponding to the face, selecting the video frame with the best face quality from all the video frames of the effective video segment; and associating the selected video frame with the effective audio segment corresponding to the effective video segment to form one video/audio combination.
In one embodiment, the computer program instructions, when run by a computer, also perform the following steps: for the face corresponding to a particular video/audio combination, performing face feature extraction on the video frame in the particular video/audio combination to obtain a particular face feature, wherein the particular video/audio combination is one of all the video/audio combinations corresponding to the at least some faces; performing sound feature extraction on the effective audio segment in the particular video/audio combination to obtain a particular sound feature; for each of the remaining video/audio combinations among all the video/audio combinations, calculating the face similarity between the particular face feature and the face feature corresponding to that video/audio combination; calculating the sound similarity between the particular sound feature and the sound feature corresponding to that video/audio combination; calculating the average of the face similarity and the sound similarity between the particular video/audio combination and that video/audio combination, so as to obtain the average similarity between the particular video/audio combination and that video/audio combination; and, if the average similarity between the particular video/audio combination and that video/audio combination is greater than a similarity threshold, classifying the particular video/audio combination and that video/audio combination to the same object.
In one embodiment, performed at least portion when being run by computer in the computer program instructionsEach of face is divided to obtain in video and the people according to initial video section corresponding with the face and initial audio sectionThe step of effective audio section in the corresponding effective video section of face and audio, corresponding with the face includes: at leastInitial video section corresponding with the face is determined as effective video corresponding with the face by each of part faceSection, and initial audio section corresponding with the face is determined as effective audio section corresponding with the face.
In one embodiment, the step, performed when the computer program instructions are run by the computer, of obtaining, for each of the at least some faces, the effective video section in the video corresponding to the face and the effective audio section in the audio corresponding to the face according to the initial video section and the initial audio section corresponding to the face comprises: for each of the at least some faces, determining a unified split time according to the split times of the initial video section and the initial audio section corresponding to the face, and performing unified segmentation on the video and the audio according to the unified split time, so as to obtain the effective video section and the effective audio section corresponding to the face.
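One plausible reading of the unified split time is sketched below; the embodiment does not fix how the two per-modality split times are reconciled, so averaging them is an assumption made here for illustration only.

```python
def unified_split_times(video_splits, audio_splits):
    # Reconcile each pair of per-modality split times into one unified
    # time (here by averaging, an assumed rule), at which both the
    # video and the audio are then cut.
    return [0.5 * (v + a) for v, a in zip(video_splits, audio_splits)]

def cut(stream_duration, splits):
    # Cut a stream of the given duration at the unified split times,
    # yielding (start, end) sections used as effective sections.
    bounds = [0.0] + sorted(splits) + [stream_duration]
    return list(zip(bounds[:-1], bounds[1:]))
```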
In one embodiment, the audio is captured by a single, shared microphone, and the step, performed when the computer program instructions are run by the computer, of segmenting, for each of the at least some faces, the audio according to the speech features in the audio so as to obtain the initial audio section corresponding to the face comprises: segmenting the audio according to the speech features in the audio, so as to obtain mixed audio pieces; and, for each of the at least some faces, selecting from the mixed audio pieces the mixed audio piece consistent in acquisition time with the initial video section corresponding to the face, as the initial audio section corresponding to the face.
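The time-consistency test can be sketched as below, reading "consistent in acquisition time" as maximal temporal overlap, which is an assumption; sections are (start, end) tuples on a shared acquisition-time axis.

```python
def overlap(a, b):
    # Temporal overlap between two (start, end) sections.
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def pick_initial_audio(mixed_pieces, initial_video):
    # Choose the mixed audio piece most consistent in acquisition time
    # with the face's initial video section (maximal overlap).
    return max(mixed_pieces, key=lambda piece: overlap(piece, initial_video))
```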
In one embodiment, the audio comprises one or more channels of audio respectively captured by one or more directional microphones, and the computer program instructions, when run by the computer, further perform the following step: controlling the one or more directional microphones to respectively face the at least some objects, so as to capture the one or more channels of audio. The step, performed when the computer program instructions are run by the computer, of segmenting, for each of the at least some faces, the audio according to the speech features in the audio so as to obtain the initial audio section corresponding to the face comprises: for each of the at least some faces, segmenting the channel of audio captured by the directional microphone facing the object to which the face belongs, according to the speech features in that channel of audio, so as to obtain the initial audio section corresponding to the face.
In one embodiment, the number of directional microphones is equal to or greater than the number of the one or more faces.
In one embodiment, the computer program instructions, when run by the computer, further perform the following steps: determining the priority of each face according to the face features and/or motion of the one or more faces; and determining, according to the priority of each face, the objects toward which the one or more directional microphones are to be aimed, these objects serving as the at least some objects.
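As an illustration of the three directional-microphone embodiments above, the following sketch ranks faces by an assumed priority score and aims one microphone at each of the highest-priority objects; the priority function itself is application-specific and is an assumption here.

```python
def assign_microphones(faces, microphones, priority):
    # Rank faces by the assumed priority score (derived from face
    # features and/or motion) and aim one directional microphone at
    # each of the highest-priority objects; the targeted objects are
    # the "at least some objects" whose speech is captured per channel.
    ranked = sorted(faces, key=priority, reverse=True)
    return dict(zip(microphones, ranked))  # microphones may outnumber faces
```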
In one embodiment, the step, performed when the computer program instructions are run by the computer, of segmenting, for each of the at least some faces, the video according to the mouth action of the face is implemented according to the following rule: for each of the at least some faces, if the mouth of the face changes from a closed state to an open state at a first moment and has been continuously in the closed state during a first predetermined period before the first moment, the first moment is taken as a video segmentation start time; if the mouth of the face changes from the open state to the closed state at a second moment and remains continuously in the closed state during a second predetermined period after the second moment, the second moment is taken as a video segmentation end time; wherein the part of the video located between an adjacent video segmentation start time and video segmentation end time is an initial video section.
In one embodiment, the step, performed when the computer program instructions are run by the computer, of segmenting, for each of the at least some faces, the audio according to the speech features in the audio is implemented according to the following rule: if the voice in the audio changes from a non-sounding state to a sounding state at a third moment and has been continuously in the non-sounding state during a third predetermined period before the third moment, the third moment is taken as an audio segmentation start time; if the voice in the audio changes from the sounding state to the non-sounding state at a fourth moment and remains continuously in the non-sounding state during a fourth predetermined period after the fourth moment, the fourth moment is taken as an audio segmentation end time; wherein the part of the audio located between an adjacent audio segmentation start time and audio segmentation end time is an initial audio section.
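The two segmentation rules above share the same structure, so a single state-machine sketch covers both; the sampled `(time, active)` representation of the mouth or voicing state is an assumption made here for illustration.

```python
def segment(samples, pre_quiet, post_quiet):
    """`samples` is a chronological list of (time, active) pairs, where
    'active' means mouth-open (video rule) or sounding (audio rule).
    A section starts at a change to active preceded by at least
    `pre_quiet` of inactivity (the first/third predetermined period)
    and ends at a change to inactive followed by at least `post_quiet`
    of inactivity (the second/fourth predetermined period)."""
    def quiet(lo, hi):
        return all(not a for t, a in samples if lo <= t <= hi)

    sections, start = [], None
    for (t0, a0), (t1, a1) in zip(samples, samples[1:]):
        if a1 and not a0 and quiet(t1 - pre_quiet, t0):
            start = t1  # segmentation start time
        elif a0 and not a1 and start is not None and quiet(t1, t1 + post_quiet):
            sections.append((start, t1))  # segmentation end time
            start = None
    return sections
```

Applied to mouth states, `segment` yields initial video sections; applied to voicing states, the same function yields initial audio sections.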
In one embodiment, the computer program instructions, when run by the computer, further perform the following steps: for each of the at least some faces, performing speech recognition on the audio portion corresponding to the face, so as to obtain a text file representing the audio portion corresponding to the face; and associating the text file with the face.
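A sketch of this association step follows; `recognize` stands for any speech recognizer mapping audio to text, and no particular engine is implied by the embodiment.

```python
def attach_transcript(face_id, audio_portion, recognize, index):
    # `recognize` is an assumed callable (audio -> str); the resulting
    # text is stored against the face, realizing the text/face link.
    text = recognize(audio_portion)
    index.setdefault(face_id, []).append(text)
    return text
```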
In one embodiment, the computer program instructions, when run by the computer, further perform the following step: outputting expected information, wherein the expected information includes one or more of the following items: the video; the audio; the video frames that include a specific face among the one or more faces; the acquisition times of the video frames that include the specific face; the audio portion corresponding to the specific face; and the acquisition time of the audio portion corresponding to the specific face.
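Illustratively, the expected information could be served from a per-face record as below; the record layout and field names are assumptions introduced here, mirroring the items listed in this embodiment.

```python
def expected_information(record, wanted):
    # `record` is an assumed per-face store with the fields named in
    # this embodiment; `wanted` selects which items to output.
    fields = ("video", "audio", "frames_with_face", "frame_times",
              "audio_portion", "audio_portion_time")
    return {f: record.get(f) for f in fields if f in wanted}
```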
Each module in the system for processing video and related audio according to an embodiment of the present invention may be implemented by a processor of the electronic device for processing video and related audio according to an embodiment of the present invention running computer program instructions stored in a memory, or may be implemented when computer instructions stored in a computer-readable storage medium of a computer program product according to an embodiment of the present invention are run by a computer.
The method and apparatus for processing video and related audio, the search method and apparatus, and the system and storage medium for processing video and related audio according to embodiments of the present invention associate the face of an object with the object's voice, so that the speaking time and speech content of the object can be determined, thereby making it convenient for a user to later view and retrieve the speech content of the object.
Although example embodiments have been described herein with reference to the accompanying drawings, it should be understood that the above example embodiments are merely exemplary and are not intended to limit the scope of the present invention thereto. Those of ordinary skill in the art may make various changes and modifications therein without departing from the scope and spirit of the present invention. All such changes and modifications are intended to be included within the scope of the present invention as claimed in the appended claims.
Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementation should not be considered to be beyond the scope of the present invention.
In the several embodiments provided herein, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of the units is only a division by logical function, and other division manners are possible in actual implementation; for example, multiple units or components may be combined or integrated into another device, or some features may be omitted or not executed.
In the specification provided here, numerous specific details are set forth. However, it should be understood that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail, so as not to obscure the understanding of this specification.
Similarly, it should be understood that, in order to streamline the present disclosure and aid in the understanding of one or more of the various inventive aspects, in the description of exemplary embodiments of the present invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the corresponding claims reflect, the inventive point lies in that a corresponding technical problem can be solved with fewer than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into that detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that, except where features are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.
In addition, those skilled in the art will appreciate that, although some embodiments described herein include certain features that are included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the claims, any one of the claimed embodiments may be used in any combination.
Various component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some modules of the apparatus according to embodiments of the present invention. The present invention may also be implemented as programs of devices (for example, computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-described embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any ordering; these words may be interpreted as names.
The above description is merely of specific embodiments of the present invention or an explanation thereof, and the protection scope of the present invention is not limited thereto. Any change or replacement that can readily be conceived by a person skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. The protection scope of the present invention shall be subject to the protection scope of the claims.