CN111209897B - Video processing method, device and storage medium - Google Patents

Video processing method, device and storage medium
Download PDF

Info

Publication number
CN111209897B
CN111209897B
Authority
CN
China
Prior art keywords
video
human body
human
body region
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010157708.3A
Other languages
Chinese (zh)
Other versions
CN111209897A (en)
Inventor
吴韬
徐叙远
刘孟洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yayue Technology Co ltd
Original Assignee
Shenzhen Yayue Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Yayue Technology Co ltd
Priority to CN202010157708.3A
Publication of CN111209897A
Application granted
Publication of CN111209897B
Status: Active (current)
Anticipated expiration

Links

Images

Classifications

Landscapes

Abstract

The invention relates to a video processing method, a video processing device and a storage medium. The method comprises the following steps: acquiring a video to be processed and a target human body region; detecting a plurality of human body regions in the video to be processed; inputting the plurality of human body regions into a trained feature extraction network to obtain a plurality of first features respectively describing the plurality of human body regions, and inputting the target human body region into the trained feature extraction network to obtain a second feature describing the target human body region; comparing the plurality of first features with the second feature respectively to obtain at least one first matching feature that matches the second feature; determining the corresponding time points of the at least one first matching feature in the video to be processed; and processing the video to be processed based on the respective time points to obtain video portions associated with the target object. The feature extraction network is trained using a dataset constructed based on human body region sample sets, and the human body region sample sets are generated separately for a plurality of video segments divided by video shots.

Description

Video processing method, device and storage medium
Technical Field
The invention relates to the technical field of deep learning and computer vision, in particular to a video processing method, a video processing device and a storage medium.
Background
With the development of multimedia technology, various images, audio and video add much fun to people's lives. When viewing video files, people typically choose to watch the segments that interest them. Current video clipping generally operates on certain specific categories or specific scenes, for example determining whether a segment is a highlight based on specific shots or text cues in sports and game videos (e.g., a goal or shot in sports video, a kill or penta-kill in game video) and clipping the video accordingly. Users may also wish to view only the passages about a particular person in a video. In this case, the related art generally identifies the person in the video picture through face recognition to complete the clip for that specific person.
Disclosure of Invention
In technical solutions that identify video clips containing a specific person through face recognition, the clips containing that person sometimes cannot be identified, or cannot be identified accurately. For example, when the face of the person is unclear or incomplete, when the person is shown from the side or from behind, or when the person's movements are large (e.g., fighting), clipping for a specific person based on face recognition performs poorly. Embodiments of the present invention address, at least in part, the above-mentioned problems.
According to an aspect of the present invention, a video processing method is presented. The method comprises the following steps: acquiring a video to be processed and a target human body region representing a target object; detecting a plurality of human body regions in the video to be processed; inputting the plurality of human body regions into a trained feature extraction network to obtain a plurality of first features respectively describing the plurality of human body regions, and inputting the target human body region into the trained feature extraction network to obtain a second feature describing the target human body region; comparing the plurality of first features with the second feature respectively to obtain at least one first matching feature among the first features that matches the second feature; determining the corresponding time points of the at least one first matching feature in the video to be processed; and processing the video to be processed based on each time point to obtain a video portion associated with the target object. The feature extraction network is trained using a dataset constructed based on human body region sample sets, and the human body region sample sets are generated separately for a plurality of video segments divided according to video shots.
In some embodiments, the dataset is constructed by: acquiring a training video for a feature extraction network; dividing a training video into a plurality of training video segments according to video shooting shots; creating, for each of a plurality of training video segments, one or more human region sample sets of the training video segments; determining whether one or more human body region sample sets contain human faces; in response to determining that each of the one or more body region sample sets contains a face, the one or more body region sample sets are combined based on features of the face to construct a training dataset.
In some embodiments, for each of the plurality of training video segments, creating one or more human region sample sets of the training video segments comprises: for each of a plurality of training video segments, each training video segment comprising a plurality of video frames belonging to the same video shot, detecting a human body region in the plurality of video frames; judging the similarity between the two or more detected human body regions; two or more human body regions whose similarity meets a predetermined threshold range are added to the same set to generate one or more human body region sample sets of training video segments.
In some embodiments, in response to determining that a face is contained in each of the one or more sets of human region samples, merging the one or more sets of human region samples based on features of the face to construct the training data set comprises: in response to determining that each of the one or more body region sample sets contains a face, respectively selecting the same predetermined number of faces from each of the body region sample sets; comparing the similarity of the faces selected from each human body region sample set; and merging the human body region sample sets with the human face similarity higher than the first preset threshold value to construct a training data set.
In some embodiments, the dataset is further constructed by: determining, using pedestrian re-identification (ReID), human body regions within the same human body region sample set whose similarity to the other human body regions is lower than a second predetermined threshold; and removing those human body regions from the human body region sample set.
In some embodiments, determining the similarity between the two or more detected human body regions comprises: determining the similarity between the two or more detected human body regions based on hand-crafted features.
In some embodiments, the plurality of human body regions in the video to be processed are detected by a single-shot multibox detector (SSD).
In some embodiments, processing the video to be processed based on the respective points in time to obtain video portions associated with the target object includes: and splicing the videos to be processed based on the time stamps of the time points to acquire the video part associated with the target object.
According to another aspect of the invention, a method for constructing a dataset for training a feature extraction network is presented. The method comprises the following steps: acquiring a training video for a feature extraction network; dividing a training video into a plurality of training video segments according to video shooting shots; creating, for each of a plurality of training video segments, one or more human region sample sets of the training video segments; determining whether one or more human body region sample sets contain human faces; in response to determining that each of the one or more body region sample sets contains a face, the one or more body region sample sets are combined based on features of the face to construct a training dataset.
In some embodiments, for each of the plurality of training video segments, creating one or more human region sample sets of the training video segments comprises: for each of a plurality of training video segments, each training video segment comprising a plurality of video frames belonging to the same video shot, detecting a human body region in the plurality of video frames; judging the similarity between the two or more detected human body regions; two or more human body regions whose similarity meets a predetermined threshold range are added to the same set to generate one or more human body region sample sets of training video segments.
In some embodiments, in response to determining that a face is contained in each of the one or more sets of human region samples, merging the one or more sets of human region samples based on features of the face to construct the training data set comprises: in response to determining that each of the one or more body region sample sets contains a face, respectively selecting the same predetermined number of faces from each of the body region sample sets; comparing the similarity of the faces selected from each human body region sample set; and merging the human body region sample sets with the human face similarity higher than the first preset threshold value to construct a training data set.
In some embodiments, the dataset is further constructed by: determining, using pedestrian re-identification (ReID), human body regions within the same human body region sample set whose similarity to the other human body regions is lower than a second predetermined threshold; and removing those human body regions from the human body region sample set.
According to another aspect of the present invention, a training method of a feature extraction network is provided, including: a training video for a feature extraction network is acquired, a training dataset is constructed based on the acquired training video using the method of constructing a dataset as in the previous aspect, and the feature extraction network is trained using the dataset to extract features describing a region of the human body.
According to another aspect of the present invention, a video processing apparatus is presented. The device comprises: the device comprises an acquisition module, a human body detection module, a feature extraction module, a comparison module, a time point determination module and a video processing module. The acquisition module is configured to acquire a video to be processed and a target human body region representing a target object. The human body detection module is configured to detect a plurality of human body regions in the video to be processed. The feature extraction module is configured to input a plurality of human body regions into a trained feature extraction network to obtain a plurality of first features that respectively describe the plurality of human body regions, and input a target human body region into the trained feature extraction network to obtain a second feature that describes the target human body region, wherein the feature extraction network is trained using a dataset constructed based on a set of human body region samples that are respectively generated for a plurality of video segments divided by a video capture lens. The comparison module is configured to compare the plurality of first features with the second features, respectively, resulting in at least one first matching feature of the first features matching the second features. The point in time determination module is configured to determine corresponding respective points in time of the at least one first matching feature in the video to be processed. The video processing module is configured to process the video to be processed based on the respective points in time to obtain video portions associated with the target object.
According to another aspect of the invention, an apparatus for constructing a dataset for training a feature extraction network is presented. The apparatus comprises: an acquisition module, a video segmentation module, a set creation module, a determination module, a set merging module, and a dataset construction module. The acquisition module is configured to acquire training video for the feature extraction network. The video segmentation module is configured to divide the training video into a plurality of training video segments according to video shots. The set creation module is configured to create, for each of the plurality of training video segments, one or more human body region sample sets of the training video segment. The determination module is configured to determine whether a face is contained in the one or more human body region sample sets. The set merging module is configured to merge the one or more human body region sample sets based on features of the face, in response to determining that a face is contained in each of the one or more human body region sample sets, to construct a training data set.
According to another aspect of the present invention, there is provided a training apparatus of a feature extraction network, including: an acquisition module configured to acquire training videos for the feature extraction network, a dataset construction module configured to construct a training dataset using the method of constructing a dataset as above based on the acquired training videos, and a training module configured to train the feature extraction network using the dataset to extract features describing a region of a human body.
According to some embodiments of the present invention, there is provided a computer device comprising: a processor; and a memory having instructions stored thereon that, when executed on the processor, cause the processor to perform any of the methods as above.
According to some embodiments of the present invention, there is provided a computer readable storage medium having instructions stored thereon, which when executed on a processor, cause the processor to perform any of the methods as above.
The video processing method, device and storage medium provided by the invention analyze the personas in video content using deep learning and clip segments of the same persona through a trained feature extraction network. The video processing method can automatically segment the portions of a video (such as films, television dramas and variety shows) featuring the same character, saving a great deal of manpower and time, improving editing efficiency, facilitating later video production, and enhancing user experience.
Drawings
Embodiments of the present invention will now be described in more detail, by way of non-limiting example, with reference to the accompanying drawings, which are merely illustrative, and in which like reference numerals refer to like parts throughout, and in which:
FIG. 1 schematically illustrates a graphical user interface schematic according to one embodiment of the invention;
FIG. 2 schematically illustrates an example application scenario according to one embodiment of the invention;
FIG. 3 schematically illustrates a network framework diagram for target character video processing in accordance with one embodiment of the present invention;
FIG. 4 schematically shows a schematic diagram of the structure of a single-shot multibox detector;
FIG. 5 schematically illustrates a flow chart of a video processing method according to one embodiment of the invention;
FIG. 6 schematically illustrates a flow chart of a method of constructing a dataset according to another embodiment of the invention;
fig. 7 schematically shows a schematic diagram of a video processing apparatus according to an embodiment of the invention;
FIG. 8 schematically illustrates a schematic diagram of an apparatus for constructing a dataset according to another embodiment of the invention; and
fig. 9 schematically shows a schematic diagram of an example computer device for video processing and/or constructing a data set.
Detailed Description
The following description provides specific details for a thorough understanding and implementation of various embodiments of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these details. In some instances, well-known structures and functions have not been shown or described in detail to avoid unnecessarily obscuring the description of the embodiments of the disclosure. The terminology used in the present disclosure is to be understood in its broadest reasonable manner, even though it is being used in conjunction with a particular embodiment of the present disclosure.
First, some terms involved in the embodiments of the present disclosure are explained to facilitate understanding by those skilled in the art:
Deep learning (Deep Learning, DL): a multi-layer perceptron with multiple hidden layers is a typical deep learning structure. Deep learning forms more abstract high-level representations of attribute categories or features by combining low-level features, thereby discovering distributed feature representations of data. The motivation for studying deep learning is to build neural networks that simulate the human brain for analysis and learning, mimicking the mechanisms by which the human brain interprets data such as images, sounds and text.
Computer vision technology (Computer Vision, CV): computer vision is the science of how to make machines "see". More specifically, computer vision uses cameras and computers, instead of human eyes, to identify, detect and measure targets, and further performs graphics processing so that the result is better suited for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
Convolutional neural networks (Convolutional Neural Networks, CNN) are a class of feedforward neural networks that include convolutional computation and have a deep structure, and are one of the representative algorithms of deep learning. A convolutional neural network has feature learning capability and can perform translation-invariant classification of input information according to its hierarchical structure.
The single-shot multibox detector (Single Shot MultiBox Detector, SSD) is a method for detecting objects in an image with a single deep neural network. It discretizes the output space of bounding boxes, placing a series of default bounding boxes with different aspect ratios and different scales at each location of each feature map. At prediction time, the network generates, for each default bounding box, a score for whether it belongs to a given class and a correction to the box so that the border better fits the shape of the object.
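By way of illustration, the snippet below is a minimal sketch of detecting human body regions in a single frame with an off-the-shelf SSD model; the torchvision ssd300_vgg16 model and the 0.5 score threshold are assumptions standing in for the human body detection network described in this document.

```python
# Minimal sketch (assumption): human-region detection with torchvision's SSD,
# standing in for the human body detection network described above.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

model = torchvision.models.detection.ssd300_vgg16(weights="DEFAULT")
model.eval()

def detect_person_boxes(frame_rgb, score_thresh=0.5):
    """Return bounding boxes of persons detected in one RGB frame (H x W x 3)."""
    with torch.no_grad():
        out = model([to_tensor(frame_rgb)])[0]
    keep = (out["labels"] == 1) & (out["scores"] > score_thresh)  # COCO class 1 = person
    return out["boxes"][keep].tolist()
```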
Scale-invariant feature transform (SIFT) is a feature descriptor with scale invariance and illumination invariance, and also a body of feature extraction theory. It was first published by D. G. Lowe in 2004, and has been implemented, extended and used in the open-source algorithm library OpenCV. SIFT features remain unchanged under rotation, scaling, brightness variations, etc., and are very stable local features.
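As a rough illustration, the sketch below compares two human-region crops with OpenCV's SIFT implementation; the ratio-test heuristic and the normalization are assumptions, not the exact similarity measure this description prescribes.

```python
# Sketch (assumption): SIFT-based similarity between two grayscale crops via OpenCV.
import cv2

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def sift_similarity(crop_a, crop_b):
    """Rough similarity score in [0, 1] between two grayscale image crops."""
    _, des_a = sift.detectAndCompute(crop_a, None)
    _, des_b = sift.detectAndCompute(crop_b, None)
    if des_a is None or des_b is None:
        return 0.0
    matches = matcher.knnMatch(des_a, des_b, k=2)
    good = [p for p in matches if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    return len(good) / max(len(des_a), 1)  # fraction of descriptors with a confident match
```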
Pedestrian re-identification (Person Re-identification, ReID) is a technique that uses computer vision to determine whether a particular pedestrian is present in an image or video sequence. It is a sub-problem of image retrieval: given a detected pedestrian image, the same pedestrian is retrieved across devices, for example retrieving images of the same pedestrian under different cameras.
Triplet loss function (Triplet Loss Function): a triplet contains three samples, for example (anchor, pos, neg), where anchor denotes the target, pos a positive sample and neg a negative sample. The triplet loss function is an objective function requiring that the distance from the target to the negative sample be greater than the distance from the target to the positive sample plus a predetermined margin.
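For concreteness, a minimal sketch of this objective using PyTorch's built-in TripletMarginLoss follows; the feature dimension and the margin value are assumed hyperparameters.

```python
# Sketch (assumption): triplet objective with PyTorch's TripletMarginLoss.
import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=0.3, p=2)  # Euclidean distance, assumed margin

anchor   = torch.randn(8, 256, requires_grad=True)  # features of the target regions
positive = torch.randn(8, 256, requires_grad=True)  # features of the same person
negative = torch.randn(8, 256, requires_grad=True)  # features of a different person

loss = triplet_loss(anchor, positive, negative)
loss.backward()  # pushes d(anchor, negative) above d(anchor, positive) + margin
```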
The main purpose of the present invention is to analyze the personas in video content using deep learning and to clip video segments of the same persona through a feature extraction network. Since a human body in a video appears in multiple poses, from multiple angles and at multiple scales, distinguishing the same human body region across a video segment is a complex task. The invention uses a convolutional neural network (e.g., the single-shot multibox detector, SSD) to detect human body regions in the video and thereby extract the corresponding human body features. The invention uses these human body features to locate the same human body in the video, and can automatically and effectively segment the portions of the video featuring the same character.
FIG. 1 schematically illustrates a schematic diagram of a graphical user interface 100 according to one embodiment of the invention. The graphical user interface 100 may be displayed on various user terminals, such as a notebook computer, a personal computer, a tablet computer, a cell phone, a television, and the like. The video 101 is a video viewed by a user through a user terminal. The video 101 can be automatically clipped into a video clip about a selected target object, such as a target person, in the video 101 by the video processing method provided by the embodiment of the invention. The selected target person may be one or more persons. For example, the target person may be a particular star or a particular character. Icons 102 of automatically generated character video clips are also displayed on the graphical user interface 100. When viewing the video 101, the user can easily view a video clip of a corresponding person of interest by clicking on the corresponding icon 102.
FIG. 2 illustrates an example application scenario 200 according to one embodiment of the invention. The server 201 is connected to a user terminal 203 via a network 202. The user terminal 203 may be, for example, a notebook computer, a personal computer, a tablet computer, a mobile phone, a television, etc. Network 202 may include wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, etc. An application program for viewing video is installed on the user terminal 203. While viewing video through the application and desiring to view a video clip of a person of interest, the user may click on an icon of the corresponding person clip presented by the application installed on the user terminal 203. In response to the user clicking on the icon of the corresponding person clip, the application presents a clip of the corresponding person. Notably, clips of the respective persons are obtained at the server 201 or at the user terminal 203 (or at both the user terminal 203 and the server 201) by performing the video processing method proposed by the present invention.
Fig. 3 schematically illustrates a schematic diagram of a network framework 300 for target character video processing according to one embodiment of the invention. First, for a target character to be clipped, the character feature F 304 of the target character is obtained by inputting the target character human body region 302 into the feature extraction network 310 to perform character feature extraction. The video 301 to be processed is input into the human body detection network 309, yielding all human body regions 303 detected in the video 301 to be processed. The human body detection network 309 and the feature extraction network 310 are described in further detail below. Each human body region 303 is then input into the above-described feature extraction network 310, a feature P_i is extracted for each human body region, and the time point T_i of that human body in the video (e.g., a timestamp) is recorded. The features P_i of all human body regions in the video to be processed are then placed into a feature pool 305. The character feature F 304 of the target character and each feature P_i in the feature pool 305 are input into a matching calculation module 311, which performs a similarity calculation to obtain all features P_k 306 that match the character feature. Feature matching may be achieved, for example, by calculating Euclidean distances between features: the distance d between a P_i in the feature pool and the character feature F is computed, and if d is less than a predetermined threshold, that P_i is determined to match the character feature F, i.e., the person in the corresponding human body region corresponds to the target character. The matched features P_k 306 are passed to a temporal aggregation module 312, which finds the time points corresponding to the matched features P_k 306 and aggregates the time points T_k in chronological order, resulting in a plurality of aggregated time points 307. The video clipping module 313 clips the video based on the time stamps of the aggregated time points to form a segment corresponding to the target character, that is, all video frames containing the target character in the video to be processed.
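A minimal sketch of the Euclidean-distance matching step described above is given below; the array layout and the distance threshold are assumptions for illustration only.

```python
# Sketch (assumption): keep the time points whose region features P_i lie within a
# Euclidean distance threshold of the target character feature F.
import numpy as np

def match_features(region_features, time_points, target_feature, dist_thresh=0.8):
    """region_features: (N, D) array of P_i; time_points: length-N list of T_i."""
    dists = np.linalg.norm(region_features - target_feature[None, :], axis=1)
    return [t for t, d in zip(time_points, dists) if d < dist_thresh]
```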
The above describes, for the case where the target character is a single character, how the segments for the target character are obtained through the video processing method provided by the present invention. It should be appreciated that in other embodiments there may be multiple target characters.
Fig. 4 schematically shows a schematic diagram of a structure 300 of a single-shot multibox detector. The human body detection network used herein employs the single-shot multibox detector (SSD) structure. The SSD detection network has very good performance in both detection speed and detection precision. Specifically, the human body detection efficiency of the SSD detection network can reach 100 frames/second on a graphics processor (GPU) while guaranteeing a detection rate higher than 85%. The structure of SSD is based on VGG-16, because VGG-16 provides high-quality image classification and transfer learning to improve results. SSD adapts VGG-16 by replacing the original fully connected layers, starting with the Conv6 layer, with a series of auxiliary convolutional layers. By using auxiliary convolutional layers, features at multiple scales of the image can be extracted, while the size of each convolutional layer is progressively reduced.
Fig. 5 schematically shows a flow chart of a video processing method 500 according to an embodiment of the invention. The method may be executed by a user terminal or a server, or by both; this embodiment is described taking execution by a server as an example. In step 501, a video to be processed and a target human body region representing a target object are acquired. Here, the target human body region may be obtained by inputting an image sample of the target object, or a video sample containing the target object, into a human body detection network (for example, an SSD). In step 502, a plurality of human body regions in the video to be processed are detected using the human body detection network. In step 503, the plurality of human body regions are input into a trained feature extraction network to obtain a plurality of first features respectively describing the plurality of human body regions, and the target human body region is input into the trained feature extraction network to obtain a second feature describing the target human body region. How this feature extraction network is trained will be described in detail below. Note that the feature extraction network is trained using a dataset constructed based on human body region sample sets, and the human body region sample sets are generated separately for a plurality of video segments divided by video shots. In step 504, the plurality of first features are compared with the second feature, respectively, to obtain at least one first matching feature among the first features that matches the second feature. For example, if the first features are P_i and the second feature is F, each P_i in the feature pool composed of the P_i is compared with F to find the P_k that match F. Here, feature matching is achieved by calculating Euclidean distances between features: the distance d between a P_i in the feature pool and the feature F is computed, and if d is less than a predetermined threshold, that P_i is determined to match F, i.e., the person in the corresponding human body region corresponds to the target character. In step 505, the corresponding time points of the at least one first matching feature in the video to be processed are determined. That is, the time points T_k in the video corresponding to the P_k that match F are determined. In step 506, the video to be processed is processed based on the respective time points to obtain video portions associated with the target object. In one embodiment, the time points T_k are aggregated in chronological order, yielding a set of all time points for the same character. In one embodiment, aggregating the finally obtained set of time points T_k in chronological order includes: for any two time points, if the interval between them is less than a certain threshold, they are considered to belong to a continuous segment; otherwise they are considered separate segments. This processing makes the selected video frames more coherent, so the picture does not jump. A plurality of video clips is thus obtained. For the start and end time points of each segment, the optical flow method is used to search from the start time point of each segment for the nearest shot switching point, and to search backward from the end time point of each segment for the nearest scene switching point, to ensure the integrity of the extracted segment.
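A sketch of the time-point aggregation step follows; the 2-second gap threshold is an assumed value, not one stated in this description.

```python
# Sketch (assumption): merge sorted matching time stamps into (start, end) segments
# whenever the gap between consecutive time points stays below a threshold.
def aggregate_time_points(time_points, max_gap=2.0):
    segments = []
    for t in sorted(time_points):
        if segments and t - segments[-1][1] <= max_gap:
            segments[-1][1] = t          # extend the current segment
        else:
            segments.append([t, t])      # start a new segment
    return [tuple(seg) for seg in segments]

# Example: points at 1.0 s and 1.5 s form one segment, the point at 10.0 s another.
print(aggregate_time_points([1.0, 1.5, 10.0]))  # [(1.0, 1.5), (10.0, 10.0)]
```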
Here, optical flow is the instantaneous velocity, on the observation imaging plane, of the pixel motion of a spatially moving object. The optical flow method finds the correspondence between the previous frame and the current frame by using the temporal change of pixels in an image sequence and the correlation between adjacent frames, and thereby computes the motion information of objects between adjacent frames. After this operation is performed on all segments, the different segment clips of the same target object (e.g., the same character) in the video are obtained. The video processing method 500 can automatically segment the portions featuring the same character in videos (such as films, television dramas and variety shows), saving a great deal of labor and time, improving editing efficiency, facilitating later video production, and enhancing user experience.
In the video processing method, the feature extraction network is trained using a dataset constructed based on human body region sample sets. The dataset for training the feature extraction network is constructed by exploiting the temporal and spatial correlation of the video, together with, for example, face recognition and pedestrian re-identification (ReID) techniques. The dataset is constructed by the following steps of the method 600 of constructing a dataset shown in Fig. 6.
In step 601, a training video for the feature extraction network is acquired.
In step 602, the training video is divided into a plurality of training video segments by video shot. Each of the plurality of training video segments contains a plurality of video frames belonging to the same video shot. Illustratively, whether shot switching exists in the training video can be judged through an optical flow method. If shot cuts exist, the video is divided at the video frames where the shot cuts occur, thereby splitting a full training video into segments corresponding to different shots.
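As one possible realization, the sketch below detects candidate shot cuts with OpenCV's dense (Farneback) optical flow; treating a large mean flow magnitude as a cut is an assumed heuristic rather than the exact criterion used here.

```python
# Sketch (assumption): shot-cut candidates from dense optical flow between frames.
import cv2
import numpy as np

def find_shot_cuts(video_path, flow_thresh=8.0):
    cap = cv2.VideoCapture(video_path)
    cuts, prev_gray, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            if np.linalg.norm(flow, axis=2).mean() > flow_thresh:
                cuts.append(idx)          # candidate shot boundary at this frame
        prev_gray, idx = gray, idx + 1
    cap.release()
    return cuts
```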
In step 603, for each of the plurality of training video segments, one or more human body region sample sets of the training video segment are created. In one embodiment, for each training video segment, the human body regions in the video frames it contains are detected; the similarity between two or more detected human body regions is judged; and two or more human body regions whose similarity falls within a predetermined threshold range are added to the same set, to generate one or more human body region sample sets of the training video segment. Detecting human body regions in the video frames is accomplished through the human body detection network SSD. Here, the similarity between human body regions is judged using hand-crafted features; for example, the hand-crafted feature may be the scale-invariant SIFT feature. In one embodiment, the predetermined threshold range is set to be above a first threshold and below a second threshold, and two or more human body regions satisfying this range are added to the same human body region sample set as a set of positive sample pairs. Requiring the similarity to be above the first threshold ensures that the human body regions are highly similar, i.e., that the two human body regions belong to the same character; requiring it to be below the second threshold removes human body regions that are too similar, because two frames that are nearly unchanged contribute little to training the network model. In another embodiment, the predetermined threshold range is set to be below a third threshold, and two or more human body regions satisfying this range are added to the same human body region sample set as a set of negative sample pairs, i.e., such human body regions do not belong to the same character.
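The sketch below illustrates one way such sample sets could be grouped, reusing the sift_similarity helper sketched earlier; the greedy grouping strategy and both thresholds are assumptions.

```python
# Sketch (assumption): greedily group human-region crops of one shot into sample sets,
# keeping pairs that are similar but not near-duplicates.
def build_sample_sets(region_crops, low=0.2, high=0.9):
    sample_sets = []
    for crop in region_crops:
        placed = False
        for s in sample_sets:
            sim = sift_similarity(s[0], crop)   # helper sketched above
            if low < sim < high:                # similar, but not an unchanged frame
                s.append(crop)
                placed = True
                break
        if not placed:
            sample_sets.append([crop])
    return sample_sets
```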
In step 604, it is determined whether a face is contained in the one or more human body region sample sets. This step is implemented by face recognition techniques. In step 605, in response to determining that each of the one or more human body region sample sets contains a face, the one or more human body region sample sets are merged based on features of the face to construct a training data set. In one embodiment, in response to determining that each of the one or more human body region sample sets contains a face, the same predetermined number of faces is selected from each human body region sample set; the similarity of the faces selected from each human body region sample set is compared; and the human body region sample sets whose face similarity meets the predetermined threshold are merged. Specifically, face recognition technology is used to compare the faces in each human body region sample set. For example, in each human body region sample set in which a face is determined to exist, N faces are selected, where N is a positive integer. The selected N faces are then cross-compared. If the proportion of matching faces among the N faces of two or more human body region sample sets exceeds a predetermined threshold (e.g., 50%), then those human body region sample sets are merged into the same human body region sample set. That is, the human body regions in the two sample sets actually belong to the same person; this occurs, for example, when the video switches from a first shot to a second shot and then back again.
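The sketch below shows the shape of such a merge step; face_match is a hypothetical face-comparison function standing in for the face recognition technology mentioned above, and N = 5 with a 50% ratio are assumed values.

```python
# Sketch (assumption): merge sample sets whose sampled faces mostly match.
# face_match(a, b) -> bool is a hypothetical face-recognition comparator.
def merge_sets_by_face(sample_sets, sampled_faces, n=5, ratio_thresh=0.5):
    """sampled_faces[i] holds up to n face crops taken from sample_sets[i]."""
    merged, used = [], set()
    for i in range(len(sample_sets)):
        if i in used:
            continue
        group = list(sample_sets[i])
        for j in range(i + 1, len(sample_sets)):
            if j in used:
                continue
            pairs = list(zip(sampled_faces[i][:n], sampled_faces[j][:n]))
            hits = sum(face_match(a, b) for a, b in pairs)
            if pairs and hits / len(pairs) > ratio_thresh:
                group.extend(sample_sets[j])   # same person seen across two shots
                used.add(j)
        merged.append(group)
    return merged
```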
In one embodiment, the method for constructing a data set further includes: determining, using pedestrian re-identification (ReID), human body regions within the same human body region sample set whose similarity to the other regions is lower than a predetermined threshold; and removing those human body regions from the human body region sample set. Here, the ReID is an open-source pre-trained ReID network, which is used to determine whether dissimilar human body regions exist in a constructed human body region sample set.
In addition, since the same person can appear in the video with various poses, angles and backgrounds, manual screening is needed after the above steps of the data set construction method, to ensure that the human bodies in each set are images of the same person.
The invention also provides a training method of the feature extraction network, which trains the feature extraction network based on the data set obtained by the method 600. Notably, during training, attacks including random cropping, blurring, rotation, etc. are applied to these samples, thereby improving the robustness of the feature extraction network.
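For illustration, these "attacks" could be expressed as standard torchvision transforms as sketched below; the specific parameter values are assumptions.

```python
# Sketch (assumption): training-time augmentations approximating the random cropping,
# blurring and rotation attacks mentioned above (applied to PIL images).
from torchvision import transforms

train_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),       # random cropping
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),  # blurring
    transforms.RandomRotation(degrees=15),                     # rotation
    transforms.ToTensor(),
])
```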
The feature extraction network of the invention is further improved and optimized on the basis of existing deep network structures, so as to improve performance on this task. First, a larger convolution kernel and a larger stride are adopted in the shallow layers of the network, which increases the receptive field and speeds the network up. As the network deepens and the feature dimension keeps increasing, the convolution kernel size is gradually reduced to improve computational efficiency, finally down to 3x3 kernels. In addition, the feature extraction network uses a triplet loss function as the final loss function. This loss function reduces the distance between positive sample pairs while increasing the distance between negative sample pairs, which works very well for the subsequent judgment of whether human bodies are similar. Here, positive samples are pairs of human body regions determined, by the similarity between human body regions, to belong to the same person; negative samples are pairs of human body regions determined to belong to different persons. In addition, the final feature is a superposition of deep features and shallow features. The shallow features of a deep network represent the structural information of the image, while the deep features carry richer semantic information. The invention combines the deep and shallow information of the network using an attention model, which achieves much higher accuracy than using shallow features or deep features alone.
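A toy sketch of these ideas in PyTorch follows: large kernels and strides in the shallow layers, kernels shrinking to 3x3 in the deeper layers, and an attention-weighted fusion of shallow (structural) and deep (semantic) features. All layer sizes and the fusion form are assumptions, not the patent's exact network.

```python
# Sketch (assumption): feature extractor with shallow/deep fusion via a tiny attention head.
import torch
import torch.nn as nn

class BodyFeatureNet(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.shallow = nn.Sequential(                    # larger kernels, larger strides
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        )
        self.deep = nn.Sequential(                       # kernels reduced to 3x3
            nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.attn = nn.Sequential(nn.Linear(128 + 256, 2), nn.Softmax(dim=1))
        self.proj_shallow = nn.Linear(128, feat_dim)
        self.proj_deep = nn.Linear(256, feat_dim)

    def forward(self, x):
        s = self.shallow(x)
        d = self.deep(s)
        s_vec = self.pool(s).flatten(1)                  # structural information
        d_vec = self.pool(d).flatten(1)                  # semantic information
        w = self.attn(torch.cat([s_vec, d_vec], dim=1))  # attention over the two branches
        return w[:, :1] * self.proj_shallow(s_vec) + w[:, 1:] * self.proj_deep(d_vec)

# A triplet loss (as defined earlier) would then be applied to these output features.
```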
Fig. 7 schematically shows a schematic diagram of a video processing apparatus 700 according to an embodiment of the invention. The video processing apparatus 700 includes: an acquisition module 701, a human body detection module 702, a feature extraction module 703, a comparison module 704, a point-in-time determination module 705, and a video processing module 706. The acquisition module 701 is configured to acquire a video to be processed and a target human body region representing a target object. The human body detection module 702 is configured to detect a plurality of human body regions in the video to be processed. The feature extraction module 703 is configured to input the plurality of human body regions into a trained feature extraction network, resulting in a plurality of first features respectively describing the plurality of human body regions, and to input the target human body region into the trained feature extraction network, resulting in a second feature describing the target human body region; the feature extraction network is trained using a dataset constructed based on human body region sample sets generated separately for a plurality of video segments divided by video shots. The comparison module 704 is configured to compare the plurality of first features with the second feature, respectively, resulting in at least one first matching feature of the first features matching the second feature. The point-in-time determination module 705 is configured to determine the corresponding time points of the at least one first matching feature in the video to be processed. The video processing module 706 is configured to process the video to be processed based on the respective time points to obtain video portions associated with the target object. The video processing apparatus 700 can automatically segment the portions featuring the same character in videos (such as films, television dramas and variety shows), saving a great deal of labor and time, improving editing efficiency, facilitating later video production, and enhancing user experience.
Fig. 8 schematically shows a schematic diagram of an apparatus 800 for constructing a data set for training a feature extraction network according to another embodiment of the invention. The dataset constructing apparatus 800 includes: an acquisition module 801, a video segmentation module 802, a set creation module 803, a determination module 804, a set merging module 805, and a data set construction module 806. The acquisition module 801 is configured to acquire training video for the feature extraction network. The video segmentation module 802 is configured to divide the training video into a plurality of training video segments by video shot, each of the plurality of training video segments containing a plurality of video frames belonging to the same video shot. The set creation module 803 is configured to create, for each training video segment, one or more human body region sample sets of the training video segment. The determination module 804 is configured to determine whether a face is contained in the one or more human body region sample sets. The set merging module 805 is configured to merge the one or more human body region sample sets based on features of the face, in response to determining that a face is contained in each of the one or more human body region sample sets.
Fig. 9 schematically illustrates a schematic diagram showing an example computer device 900 for video processing and/or constructing a data set. The computer device 900 may be any of various different types of devices, such as a server computer (e.g., the server 201 shown in Fig. 2), a device associated with an application program (e.g., the user terminal 203 shown in Fig. 2), a system-on-chip, and/or any other suitable computer device or computing system.
Computer device 900 may include at least one processor 902, memory 904, communication interface(s) 906, a display device 908, other input/output (I/O) devices 910, and one or more mass storage devices 912, capable of communicating with each other, such as through a system bus 914 or other suitable connection.
The processor 902 may be a single processing unit or multiple processing units, each of which may include a single or multiple computing units or multiple cores. The processor 902 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor 902 may be configured to obtain and execute computer-readable instructions stored in the memory 904, the mass storage device 912, or other computer-readable media, such as program code of the operating system 916, program code of the application programs 918, program code of other programs 920, etc., to implement the methods for video processing and/or constructing data sets provided by embodiments of the present invention.
Memory 904 and mass storage device 912 are examples of computer storage media for storing instructions that are executed by the processor 902 to implement the various functions described above. For example, the memory 904 may generally include both volatile memory and nonvolatile memory (e.g., RAM, ROM, etc.). In addition, the mass storage device 912 may generally include hard disk drives, solid state drives, removable media including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CDs, DVDs), storage arrays, network attached storage, storage area networks, and the like. Memory 904 and mass storage device 912 may both be referred to herein collectively as memory or computer storage media, and may be non-transitory media capable of storing computer-readable, processor-executable program instructions as computer program code that may be executed by the processor 902 as a particular machine configured to implement the operations and functions described in the examples herein.
A number of program modules may be stored on the mass storage device 912. These programs include an operating system 916, one or more application programs 918, other programs 920, and program data 922, and they may be loaded into the memory 904 for execution. Examples of such application programs or program modules may include, for example, computer program logic (e.g., computer program code or instructions) for implementing the following components/functions: the acquisition module 701, the human body detection module 702, the feature extraction module 703, the comparison module 704, the point-in-time determination module 705 and the video processing module 706, as well as the acquisition module 801, the video segmentation module 802, the set creation module 803, the determination module 804, the set merging module 805 and the dataset construction module 806, and/or further embodiments described herein.
Although illustrated in Fig. 9 as being stored in the memory 904 of the computer device 900, the modules 916, 918, 920, and 922, or portions thereof, may be implemented using any form of computer-readable media accessible by the computer device 900. As used herein, "computer-readable medium" includes at least two types of computer-readable media, namely computer storage media and communication media.
Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information for access by a computer device.
Computer device 900 can also include one or more communication interfaces 906 for exchanging data with other devices, such as via a network, direct connection, or the like, as discussed previously. The one or more communication interfaces 906 may facilitate communication over a variety of networks and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, and so forth. The communication interface 906 may also provide for communication with external storage devices (not shown), such as in a storage array, network attached storage, storage area network, or the like.
In some examples, a display device 908, such as a display, may be included for displaying information and images. Other I/O devices 910 may be devices that take various inputs from a user and provide various outputs to the user, and may include touch input devices, gesture input devices, cameras, keyboards, remote controls, mice, printers, audio input/output devices, and so on.
Variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed subject matter, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" as used herein does not exclude a plurality. Although certain features may be described in mutually different dependent claims, this mere fact is not intended to indicate that a combination of these features cannot be used or practiced.

Claims (13)

CN202010157708.3A | 2020-03-09 | 2020-03-09 | Video processing method, device and storage medium | Active | CN111209897B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010157708.3A | 2020-03-09 | 2020-03-09 | CN111209897B (en) Video processing method, device and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202010157708.3A | 2020-03-09 | 2020-03-09 | CN111209897B (en) Video processing method, device and storage medium

Publications (2)

Publication Number | Publication Date
CN111209897A (en) | 2020-05-29
CN111209897B (en) | 2023-06-20

Family

ID=70788826

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202010157708.3A | Video processing method, device and storage medium (Active, CN111209897B (en)) | 2020-03-09 | 2020-03-09

Country Status (1)

Country | Link
CN (1) | CN111209897B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN112861981B (en) * | 2021-02-22 | 2023-06-20 | 每日互动股份有限公司 | Data set labeling method, electronic equipment and medium
CN113158867B (en) * | 2021-04-15 | 2024-11-19 | 微马科技有限公司 | Method, device and computer-readable storage medium for determining facial features
CN113190713B (en) * | 2021-05-06 | 2024-06-21 | 百度在线网络技术(北京)有限公司 | Video searching method and device, electronic equipment and medium
CN113283381B (en) * | 2021-06-15 | 2024-04-05 | 南京工业大学 | A human motion detection method suitable for mobile robot platform
CN114363720B (en) * | 2021-12-08 | 2024-03-12 | 广州海昇计算机科技有限公司 | Video slicing method, system, equipment and medium based on computer vision
CN114189754B (en) * | 2021-12-08 | 2024-06-28 | 湖南快乐阳光互动娱乐传媒有限公司 | Video scenario segmentation method and system
CN114286198B (en) * | 2021-12-30 | 2023-11-10 | 北京爱奇艺科技有限公司 | Video association method, device, electronic equipment and storage medium
CN114973612A (en) * | 2022-03-28 | 2022-08-30 | 深圳市揽讯科技有限公司 | Automatic alarm monitoring system and method for faults of LED display screen

Citations (9)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN108509457A (en) * | 2017-02-28 | 2018-09-07 | 阿里巴巴集团控股有限公司 | A kind of recommendation method and apparatus of video data
CN109063667A (en) * | 2018-08-14 | 2018-12-21 | 视云融聚(广州)科技有限公司 | A kind of video identification method optimizing and method for pushing based on scene
CN109284729A (en) * | 2018-10-08 | 2019-01-29 | 北京影谱科技股份有限公司 | Method, device and medium for acquiring training data of face recognition model based on video
CN109922373A (en) * | 2019-03-14 | 2019-06-21 | 上海极链网络科技有限公司 | Method for processing video frequency, device and storage medium
CN110087144A (en) * | 2019-05-15 | 2019-08-02 | 深圳市商汤科技有限公司 | Video file processing method, device, electronic equipment and computer storage medium
CN110119711A (en) * | 2019-05-14 | 2019-08-13 | 北京奇艺世纪科技有限公司 | A kind of method, apparatus and electronic equipment obtaining video data personage segment
CN110366050A (en) * | 2018-04-10 | 2019-10-22 | 北京搜狗科技发展有限公司 | Processing method, device, electronic equipment and the storage medium of video data
CN110505498A (en) * | 2019-09-03 | 2019-11-26 | 腾讯科技(深圳)有限公司 | Processing, playback method, device and the computer-readable medium of video
CN110516572A (en) * | 2019-08-16 | 2019-11-29 | 咪咕文化科技有限公司 | Method for identifying sports event video clip, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
WO2012150602A1 (en) * | 2011-05-03 | 2012-11-08 | Yogesh Chunilal Rathod | A system and method for dynamically monitoring, recording, processing, attaching dynamic, contextual & accessible active links & presenting of physical or digital activities, actions, locations, logs, life stream, behavior & status

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN108509457A (en) * | 2017-02-28 | 2018-09-07 | 阿里巴巴集团控股有限公司 | A kind of recommendation method and apparatus of video data
CN110366050A (en) * | 2018-04-10 | 2019-10-22 | 北京搜狗科技发展有限公司 | Processing method, device, electronic equipment and the storage medium of video data
CN109063667A (en) * | 2018-08-14 | 2018-12-21 | 视云融聚(广州)科技有限公司 | A kind of video identification method optimizing and method for pushing based on scene
CN109284729A (en) * | 2018-10-08 | 2019-01-29 | 北京影谱科技股份有限公司 | Method, device and medium for acquiring training data of face recognition model based on video
CN109922373A (en) * | 2019-03-14 | 2019-06-21 | 上海极链网络科技有限公司 | Method for processing video frequency, device and storage medium
CN110119711A (en) * | 2019-05-14 | 2019-08-13 | 北京奇艺世纪科技有限公司 | A kind of method, apparatus and electronic equipment obtaining video data personage segment
CN110087144A (en) * | 2019-05-15 | 2019-08-02 | 深圳市商汤科技有限公司 | Video file processing method, device, electronic equipment and computer storage medium
CN110516572A (en) * | 2019-08-16 | 2019-11-29 | 咪咕文化科技有限公司 | Method for identifying sports event video clip, electronic equipment and storage medium
CN110505498A (en) * | 2019-09-03 | 2019-11-26 | 腾讯科技(深圳)有限公司 | Processing, playback method, device and the computer-readable medium of video

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Online Video Recommendation through Tag-Cloud Aggregation; Jonghun Park et al.; IEEE Computer Society; 2011-03-31; Vol. 18; 78-87 *
一种压缩视频流的视频分段和关键帧提取方法 (A video segmentation and key-frame extraction method for compressed video streams); 王凤领 et al.; 智能计算机与应用; 2017-10-31; Vol. 7, No. 5; 79-82 *

Also Published As

Publication number | Publication date
CN111209897A (en) | 2020-05-29

Similar Documents

PublicationPublication DateTitle
CN111209897B (en)Video processing method, device and storage medium
Liu et al.Weakly-supervised salient object detection with saliency bounding boxes
CN111062871B (en)Image processing method and device, computer equipment and readable storage medium
Gao et al.Counting from sky: A large-scale data set for remote sensing object counting and a benchmark method
Xiao et al.Deep salient object detection with dense connections and distraction diagnosis
Hashmi et al.An exploratory analysis on visual counterfeits using conv-lstm hybrid architecture
CN112101344B (en)Video text tracking method and device
CN112381104B (en)Image recognition method, device, computer equipment and storage medium
CN111754541A (en)Target tracking method, device, equipment and readable storage medium
CN111241345A (en) A video retrieval method, device, electronic device and storage medium
WO2022089170A1 (en)Caption area identification method and apparatus, and device and storage medium
Mahmood et al.Automatic player detection and identification for sports entertainment applications
AU2018202767B2 (en)Data structure and algorithm for tag less search and svg retrieval
CN110619284B (en)Video scene division method, device, equipment and medium
CN112597341A (en)Video retrieval method and video retrieval mapping relation generation method and device
CN112488072B (en) A method, system and device for acquiring face sample set
Yi et al.Motion keypoint trajectory and covariance descriptor for human action recognition
Li et al.Videography-based unconstrained video analysis
CN112818995B (en)Image classification method, device, electronic equipment and storage medium
Nemade et al.Image segmentation using convolutional neural network for image annotation
Fei et al.Creating memorable video summaries that satisfy the user’s intention for taking the videos
CN115935049A (en)Recommendation processing method and device based on artificial intelligence and electronic equipment
Zhou et al.Modeling perspective effects in photographic composition
CN113516735A (en)Image processing method, image processing device, computer readable medium and electronic equipment
CN113395584A (en)Video data processing method, device, equipment and medium

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
TA01 | Transfer of patent application right

Effective date of registration: 2022-11-28

Address after: 1402, Floor 14, Block A, Haina Baichuan Headquarters Building, No. 6, Baoxing Road, Haibin Community, Xin'an Street, Bao'an District, Shenzhen, Guangdong 518133

Applicant after: Shenzhen Yayue Technology Co.,Ltd.

Address before: 518057 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 floors

Applicant before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

GR01 | Patent grant
