Disclosure of Invention
The embodiment of the application provides a behavior recognition method, a behavior recognition device, a terminal device and a computer readable storage medium, which can reduce the data processing amount of video behavior recognition and improve the recognition accuracy.
In a first aspect, an embodiment of the present application provides a behavior identification method, including:
acquiring respective three-dimensional video data of a plurality of video clips in a video to be processed;
converting the three-dimensional video data of each of the video segments into two-dimensional video data;
determining an initial behavior recognition result of each of the plurality of video segments according to the two-dimensional video data of each of the plurality of video segments;
and determining a final behavior recognition result of the video to be processed according to the initial behavior recognition results of the plurality of video segments.
In the embodiment of the application, three-dimensional video data of a video to be processed is converted into two-dimensional video data, which is equivalent to converting a three-dimensional data processing task of the video into a two-dimensional data processing task, so that the data processing amount is greatly reduced; in addition, because the three-dimensional video data contains the time sequence characteristics, the method can not only extract the image characteristic information of the video, but also extract the time sequence characteristic information among the images in the video, comprehensively identify the behavior type in the video according to the image characteristic information and the time-series characteristic information, and effectively improve the accuracy of the identification result.
In a possible implementation manner of the first aspect, the acquiring three-dimensional video data of each of a plurality of video segments in a video to be processed includes:
performing video frame extraction processing on the video to be processed to obtain a plurality of video clips, wherein the video clips comprise a plurality of frame images;
for each video segment, generating the three-dimensional video data of the video segment according to pixels on each frame image contained in the video segment, wherein the size of the three-dimensional video data is H × W × T, H is the number of pixels contained in each frame image in the video segment in the width direction, W is the number of pixels contained in each frame image in the video segment in the length direction, and T is the number of frames of the images contained in the video segment.
In a possible implementation manner of the first aspect, the converting the three-dimensional video data of each of the video segments into two-dimensional video data includes:
for each of the video segments, combining each two-dimensional data of the three-dimensional video data of the video segment into a set of the two-dimensional video data, obtaining three sets of the two-dimensional video data of the video segment.
In a possible implementation manner of the first aspect, the determining an initial behavior recognition result of each of the plurality of video segments according to the two-dimensional video data of each of the plurality of video segments includes:
for each video clip, extracting respective initial characteristic information of three groups of two-dimensional video data of the video clip;
fusing the initial feature information of the three groups of two-dimensional video data of the video clip into fused feature information of the video clip;
and determining an initial behavior recognition result of the video clip according to the fusion characteristic information of the video clip.
In a possible implementation manner of the first aspect, the fusing the initial feature information of each of the three sets of two-dimensional video data of the video segment into fused feature information of the video segment includes:
splicing the initial feature information of the three groups of two-dimensional video data of the video clip into feature splicing vectors;
carrying out average pooling on the feature splicing vectors to obtain pooled feature information;
and converting the size of the pooled feature vector according to the preset behavior category number to obtain the fusion feature information.
In a possible implementation manner of the first aspect, the initial behavior result of the video segment includes a behavior type to which the video segment belongs;
determining a final behavior recognition result of the video to be processed according to the initial behavior recognition result of each of the plurality of video segments, including:
counting the number of the video clips belonging to each behavior type according to the initial behavior results of the plurality of video clips;
determining a target type in the behavior types corresponding to the video clips according to the number of the clips corresponding to the behavior types;
determining the target type as the final behavior recognition result of the video to be processed.
In a second aspect, an embodiment of the present application provides a behavior recognition apparatus, including:
the data acquisition unit is used for acquiring three-dimensional video data of a plurality of video clips in a video to be processed;
a data conversion unit for converting the three-dimensional video data of each of the video clips into two-dimensional video data;
a segment identification unit, configured to determine an initial behavior identification result of each of the plurality of video segments according to the two-dimensional video data of each of the plurality of video segments;
and the identification result unit is used for determining a final behavior identification result of the video to be processed according to the initial behavior identification result of each of the plurality of video segments.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the behavior recognition method according to any one of the above first aspects.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the behavior recognition method according to any one of the foregoing first aspects.
In a fifth aspect, an embodiment of the present application provides a computer program product, which, when run on a terminal device, causes the terminal device to execute the behavior recognition method according to any one of the first aspect.
It is understood that the beneficial effects of the second aspect to the fifth aspect can be referred to the related description of the first aspect, and are not described herein again.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise.
Fig. 1 is a schematic flow chart of a behavior recognition method according to an embodiment of the present application. By way of example and not limitation, the method may include the steps of:
S101, acquiring respective three-dimensional video data of a plurality of video clips in a video to be processed.
Pixels in one image constitute two-dimensional data, and the size of the two-dimensional data is H × W, where H is the number of pixels included in the width direction of the image, and W is the number of pixels included in the length direction of the image. For example, the number of pixels included in an image in the length direction is 10, the number of pixels included in the width direction is 5, the size of the two-dimensional data of the image is 5 × 10, that is, the image includes 50 pixels, the 50 pixels constitute 5 × 10 two-dimensional data, and the set of two-dimensional data includes feature information of the image.
The video is formed by arranging a plurality of frames of images according to time sequence, and compared with the images, the video has more characteristic information of time sequence dimensionality. Namely, one-dimensional time series data is added on the basis of two-dimensional data of an image. Thus, video may be described by three-dimensional data.
Optionally, the three-dimensional video data of the video to be processed may be generated according to the images of all frames in the video to be processed.
In practical applications, adjacent frames of images in a video usually contain the same or similar content. If the pixels of all the frame images in the video to be processed are contained in the three-dimensional video data, a large amount of data redundancy will be caused, and the subsequent data processing amount will be large.
In order to reduce the data redundancy of the video to be processed, optionally, the video to be processed may be sampled, and the pixels of the sampled images may constitute the three-dimensional video data of the video to be processed. For example: the video to be processed comprises 100 frames of images, and sampling is carried out at intervals of 4 frames of images (that is, one frame is kept out of every 5), so that 20 frames of images are extracted, and the pixels of the 20 frames of images form the three-dimensional video data of the video to be processed.
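By way of example and not limitation, the interval sampling described above can be sketched as follows. The function name `sample_frames` and the parameter `gap` are hypothetical, and frames are stood in for by their indices; the sketch assumes that "an interval of 4 frames" means 4 frames are skipped between two sampled frames.

```python
def sample_frames(frames, gap):
    """Keep one frame, skip `gap` frames, and repeat (assumed reading of the interval)."""
    if gap < 0:
        raise ValueError("gap must be non-negative")
    return frames[::gap + 1]


# A 100-frame video, each frame represented by its index.
video = list(range(100))
sampled = sample_frames(video, gap=4)
print(len(sampled))  # 20
```

A larger `gap` lowers the data processing amount at the cost of discarding more image information, which is the trade-off that motivates the clip-based approach.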
If the sampling frequency is higher, the number of the obtained images is larger, and the data processing amount is still larger; if the sampling frequency is low, less image data is obtained, the data processing amount is less, but more image information is lost.
In order to reduce the data processing amount and simultaneously keep more image information, in the embodiment of the present application, the manner of acquiring the three-dimensional video data of each video clip includes:
performing video frame extraction processing on a video to be processed to obtain a plurality of video segments, wherein each video segment comprises a plurality of frame images; for each video segment, three-dimensional video data for the video segment is generated from pixels on frames of images contained in the video segment.
The size of the three-dimensional video data is H multiplied by W multiplied by T, H is the number of pixels contained in each frame image in the width direction of the video clip, W is the number of pixels contained in each frame image in the length direction of the video clip, and T is the number of frames of the images contained in the video clip.
The video frame extraction processing can be that one frame of image is extracted every n frames of images, then the extracted image is divided into a plurality of image groups according to time sequence, and each image group is a video clip; or extracting m frames of images every n frames of images, and determining the m frames of images as a video clip.
By the method, the data redundancy of the video to be processed is reduced through video frame extraction processing, and the extracted images are divided into video segments so as to keep time sequence characteristic information and image related information between adjacent images and provide a reliable data basis for the subsequent identification process.
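The two frame extraction variants described above may, for illustration only, be sketched as follows. The function names are hypothetical; dropping a trailing group that has fewer than the required number of frames is an assumption that the text does not specify.

```python
def clips_by_sampling_then_grouping(frames, n, t):
    """Keep one frame out of every n, then cut the sampled frames into clips of t frames."""
    sampled = frames[::n]
    # Assumption: an incomplete trailing group is discarded.
    return [sampled[i:i + t] for i in range(0, len(sampled) - t + 1, t)]


def clips_by_consecutive_runs(frames, n, m):
    """Every n frames, take m consecutive frames as one clip."""
    if m > n:
        raise ValueError("m must not exceed n, otherwise clips would overlap")
    return [frames[i:i + m] for i in range(0, len(frames) - m + 1, n)]


video = list(range(24))  # frames represented by their indices
grouped = clips_by_sampling_then_grouping(video, n=2, t=4)
runs = clips_by_consecutive_runs(video, n=8, m=4)
```

In both variants every clip keeps its frames in time order, so the time-sequence relationship between the images of a clip is preserved.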
Fig. 2 is a schematic diagram of a data processing flow of behavior recognition according to an embodiment of the present application. As shown in fig. 2, taking the video to be processed as the input video, the size of the input video is 3 × H × W × L, where L is the total number of frames of images contained in the video to be processed, and 3 denotes the three RGB color channels. Since the information of the three color channels can be embodied in the pixel values, the dimension of size 3 can be ignored; that is, the size of the video to be processed is treated as H × W × L. Frame extraction and combination (i.e., the video frame extraction processing, such as the grouping of T frames in fig. 2) is performed on the input video to obtain a plurality of video clips, each of which has a size of H × W × T.
S102, converting the three-dimensional video data of each video clip into two-dimensional video data.
Optionally, one implementation manner of converting three-dimensional video data into two-dimensional video data is as follows: data in one dimension of the three-dimensional video data is added to the other two dimensions to form two-dimensional video data.
For example, refer to fig. 3, which is a schematic diagram of a data conversion process provided in an embodiment of the present application.
As shown in fig. 3 (a), a video segment includes 4 frames of images I, II, III, IV, and 4 frames of images are combined in time series. Now 4 images are stitched into a large stitched image V, and pixels on the large stitched image V constitute the two-dimensional video data of the video clip.
As can be seen from the above example, although the above method can retain image information, it cannot retain timing information between images.
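For illustration, the stitching approach can be sketched as below. The layout of the stitched image is not specified in the text; a left-to-right arrangement is assumed here, and the function name `stitch_side_by_side` is hypothetical.

```python
def stitch_side_by_side(frames):
    """Concatenate equally sized frames left to right, row by row."""
    height = len(frames[0])
    if any(len(frame) != height for frame in frames):
        raise ValueError("all frames must have the same height")
    return [[pixel for frame in frames for pixel in frame[row]]
            for row in range(height)]


# Four 2 x 3 frames; every pixel of frame k carries the value k.
frames = [[[k] * 3 for _ in range(2)] for k in (1, 2, 3, 4)]
stitched = stitch_side_by_side(frames)
```

The result is a single 2 × 12 image. The frame order survives only as a horizontal offset, so a two-dimensional operation applied to the stitched image has no dedicated time axis to work along.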
In order to simultaneously retain image information and timing information, in the embodiment of the present application, one implementation manner of converting three-dimensional video data into two-dimensional video data is as follows:
for each video clip, combining each two-dimensional data in the three-dimensional video data of the video clip into a set of two-dimensional video data, obtaining three sets of two-dimensional video data of the video clip.
Specifically, as shown in fig. 2, the three-dimensional video data H × W × T is divided into three sets of two-dimensional video data H × W, H × T and W × T.
Illustratively, as shown in fig. 3 (b), a video segment includes 4 frames of images I, II, III, and IV. Assuming that each frame of image has a size of 2 × 3 (i.e., H is 2, W is 3, T is 4, and each frame of image includes 6 pixels), the three-dimensional video data of the video segment has a size of 2 × 3 × 4. As shown in fig. 3 (c), the three-dimensional video data may be regarded as a large rectangular parallelepiped, and each pixel in each frame of image may be regarded as a voxel (i.e., a small cube) of the large rectangular parallelepiped. A cube labeled 1 in the figure represents a pixel in the 1st frame of image of the video segment, a cube labeled 2 represents a pixel in the 2nd frame of image of the video segment, and so on, until a cube labeled 4 represents a pixel in the 4th frame of image of the video segment.
As shown in fig. 3 (c), the three-dimensional video data is split into a group of 2 × 3 two-dimensional video data, the group including pixels on an abcd cross section in a rectangular parallelepiped. The three-dimensional video data is split into a set of 2 x 4 two-dimensional video data, the set comprising pixels in a cross-section abef in a cuboid. The three-dimensional video data is split into a set of 3 x 4 two-dimensional video data, the set comprising pixels on a bcge cross-section in a cuboid.
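By way of example and not limitation, the split into three groups of two-dimensional data can be sketched for the 2 × 3 × 4 example as follows. The volume is stored as nested lists indexed `volume[h][w][t]`; the function name and the choice of which cross section to take along each axis (index 0 by default) are assumptions made for illustration.

```python
def split_into_planes(volume, h0=0, w0=0, t0=0):
    """Return one H x W, one H x T and one W x T cross section of an H x W x T volume."""
    H, W, T = len(volume), len(volume[0]), len(volume[0][0])
    plane_hw = [[volume[h][w][t0] for w in range(W)] for h in range(H)]
    plane_ht = [[volume[h][w0][t] for t in range(T)] for h in range(H)]
    plane_wt = [[volume[h0][w][t] for t in range(T)] for w in range(W)]
    return plane_hw, plane_ht, plane_wt


# As in fig. 3 (c), every voxel is labeled with the number of its frame.
H, W, T = 2, 3, 4
volume = [[[t + 1 for t in range(T)] for _ in range(W)] for _ in range(H)]
hw, ht, wt = split_into_planes(volume)
```

The H × W plane contains pixels of a single frame, while each row of the H × T and W × T planes runs through frames 1 to 4, which is how these two planes carry the time-sequence information.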
S103, determining the initial behavior recognition result of each of the plurality of video clips according to the two-dimensional video data of each of the plurality of video clips.
For each video clip, three groups of two-dimensional video data of the video clip can be input into a trained recognition model, and an initial behavior recognition result of the video clip is output.
However, in the above method, the input data of the recognition model is three sets of two-dimensional data, and there are many input data, and the data processing amount is large when the recognition model is trained.
To solve the above problem, in one embodiment, the implementation manner of S103 includes:
for each video clip, extracting respective initial characteristic information of three groups of two-dimensional video data of the video clip; fusing initial characteristic information of three groups of two-dimensional video data of the video clip into fused characteristic information of the video clip; and determining an initial behavior recognition result of the video clip according to the fusion characteristic information of the video clip.
For example, as shown in fig. 2, three sets of two-dimensional video data are respectively subjected to two-dimensional convolution processing to obtain initial feature information of each set of two-dimensional video data. For example: the two-dimensional data H × W is subjected to convolution processing of 3 × 3 × 1 and pooling processing of 3 × 3 × 1, and initial feature information of 1 × 1 is obtained. Since the T-dimension of the two-dimensional data is 1, it is practically equivalent to performing 3 × 3 two-dimensional convolution processing and two-dimensional pooling processing on the two-dimensional data H × W. Similarly, for H × T two-dimensional video data, 3 × 1 × 3 convolution processing and 3 × 1 × 3 pooling processing are used; for W × T two-dimensional video data, 1 × 3 × 3 convolution processing and 1 × 3 × 3 pooling processing are used.
The above is only an example of the manner of acquiring the initial feature information. In practical application, each set of two-dimensional video data may be subjected to convolution processing and pooling processing for multiple times, which is not specifically limited herein.
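As a minimal sketch of extracting initial feature information from one group of two-dimensional data, a single 3 × 3 convolution followed by pooling may look as follows. The kernel values are placeholders rather than trained parameters, and global average pooling is used here as a stand-in for the 3 × 3 pooling so that a single 1 × 1 value results; both are assumptions.

```python
def conv2d_valid(plane, kernel):
    """Plain 2D convolution (cross-correlation) without padding, stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(plane) - kh + 1
    out_w = len(plane[0]) - kw + 1
    if out_h < 1 or out_w < 1:
        raise ValueError("plane is smaller than the kernel")
    return [[sum(plane[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def global_average_pool(feature_map):
    """Reduce a 2D feature map to a single value."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


plane = [[1] * 5 for _ in range(4)]    # a 4 x 5 plane of ones
kernel = [[1] * 3 for _ in range(3)]   # placeholder 3 x 3 kernel
feature_map = conv2d_valid(plane, kernel)
initial_feature = global_average_pool(feature_map)
```

The same two functions apply unchanged to the H × T and W × T planes, since each of them is also plain two-dimensional data.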
Optionally, the process of fusing the initial feature information of each of the three sets of two-dimensional video data into fused feature information includes:
splicing initial feature information of three groups of two-dimensional video data of a video clip into feature splicing vectors; carrying out average pooling on the feature splicing vectors to obtain pooled feature information; and converting the size of the pooled feature vector according to the preset behavior category number to obtain fusion feature information.
As shown in fig. 2, the initial feature information of each of the three sets of two-dimensional video data is a value of size 1 × 1. The initial feature information is spliced together by a Concat operation to obtain a feature concatenation vector of size C_out × 1 × 1 × 3, where C_out = 3, i.e., the three RGB color channels. Then, average pooling is applied along the dimension in which the feature information is spliced, to obtain pooled feature information of size C_out × 1 × 1 × 1. Through a fully connected layer, the C_out dimension is changed to k (where k denotes a preset number of behavior categories, i.e., the number of categories of behaviors that need to be recognized), resulting in fused feature information of size k × 1 × 1 × 1. Finally, the probability values of the fused feature information belonging to each behavior category are calculated through a softmax layer, and the behavior category corresponding to the maximum probability value is determined as the initial behavior recognition result of the video clip.
S104, determining a final behavior recognition result of the video to be processed according to the respective initial behavior recognition results of the plurality of video segments.
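The fusion and classification steps can be sketched, under stated assumptions, as follows. Each group of two-dimensional data is assumed to contribute one value per channel; the weights and bias of the fully connected layer are hypothetical placeholders, not trained parameters, and the function name is illustrative.

```python
import math


def fuse_and_classify(feat_hw, feat_ht, feat_wt, weights, bias):
    """Concat -> average pooling -> fully connected layer -> softmax -> argmax."""
    # Concat: one row per channel, one column per group of 2D data.
    stacked = [list(triple) for triple in zip(feat_hw, feat_ht, feat_wt)]
    # Average pooling along the splicing dimension: one value per channel.
    pooled = [sum(row) / len(row) for row in stacked]
    # Fully connected layer: channel dimension -> k behavior categories.
    logits = [sum(w * x for w, x in zip(w_row, pooled)) + b
              for w_row, b in zip(weights, bias)]
    # Softmax, shifted by the maximum logit for numerical stability.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs


# Three channels, k = 2 categories; all numbers are made up for illustration.
label, probs = fuse_and_classify(
    [0.0, 3.0, 3.0], [3.0, 0.0, 3.0], [0.0, 3.0, 3.0],
    weights=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    bias=[0.0, 0.0],
)
```

Here the pooled values are 1.0, 2.0 and 3.0, the two logits are 1.0 and 3.0, and the category with index 1 is therefore returned as the initial behavior recognition result.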
Wherein, the initial behavior result of the video clip comprises the behavior type of the video clip.
Optionally, a voting manner may be adopted to determine a final behavior recognition result of the video to be processed according to the initial behavior recognition result. Specifically, the method comprises the following steps:
counting the number of the video clips belonging to each behavior type according to the respective initial behavior results of the plurality of video clips; determining a target type in the behavior types corresponding to the video clips according to the number of the clips corresponding to the behavior types; and determining the target type as a final behavior recognition result of the video to be processed.
Illustratively, the video to be processed includes 3 video segments, where a behavior class to which a first video segment belongs is a, a behavior class to which a second video segment belongs is B, and a behavior class to which a third video segment belongs is a. The number of segments of the video segment belonging to the behavior category a is 2, and the number of segments of the video segment belonging to the behavior category B is 1, so that the target type is a, that is, the final behavior result of the video to be processed is a.
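The voting in this example can be sketched as follows. The function name `vote` is hypothetical, and the text does not say how a tie is resolved; this sketch simply keeps the type that appears first among the tied ones.

```python
from collections import Counter


def vote(clip_labels):
    """Return the behavior type with the most clips, plus the per-type counts."""
    if not clip_labels:
        raise ValueError("at least one video clip is required")
    counts = Counter(clip_labels)
    # most_common lists equal counts in order of first appearance.
    target_type, _ = counts.most_common(1)[0]
    return target_type, dict(counts)


final_result, counts = vote(["A", "B", "A"])
```

For the three clips above, the counts are 2 for A and 1 for B, so the final behavior recognition result is A.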
In the embodiment of the application, three-dimensional video data of a video to be processed is converted into two-dimensional video data, which is equivalent to converting a three-dimensional data processing task of the video into a two-dimensional data processing task, so that the data processing amount is greatly reduced; in addition, because the three-dimensional video data contains the time sequence characteristics, the method can not only extract the image characteristic information of the video, but also extract the time sequence characteristic information among the images in the video, comprehensively identify the behavior type in the video according to the image characteristic information and the time-series characteristic information, and effectively improve the accuracy of the identification result.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 4 is a block diagram of a behavior recognition apparatus according to an embodiment of the present application, which corresponds to the behavior recognition method described in the foregoing embodiment, and only a part related to the embodiment of the present application is shown for convenience of description.
Referring to fig. 4, the apparatus includes:
a data obtaining unit 41, configured to obtain three-dimensional video data of each of a plurality of video clips in the video to be processed.
a data conversion unit 42, configured to convert the three-dimensional video data of each of the video segments into two-dimensional video data.
a segment identifying unit 43, configured to determine an initial behavior identification result of each of the plurality of video segments according to the two-dimensional video data of each of the plurality of video segments.
a recognition result unit 44, configured to determine a final behavior recognition result of the video to be processed according to the initial behavior recognition result of each of the plurality of video segments.
Optionally, the data obtaining unit 41 is further configured to:
performing video frame extraction processing on the video to be processed to obtain a plurality of video clips, wherein the video clips comprise a plurality of frame images; for each video segment, generating the three-dimensional video data of the video segment according to pixels on each frame image contained in the video segment, wherein the size of the three-dimensional video data is H × W × T, H is the number of pixels contained in each frame image in the video segment in the width direction, W is the number of pixels contained in each frame image in the video segment in the length direction, and T is the number of frames of the images contained in the video segment.
Optionally, the data conversion unit 42 is further configured to:
for each of the video segments, combining each two-dimensional data of the three-dimensional video data of the video segment into a set of the two-dimensional video data, obtaining three sets of the two-dimensional video data of the video segment.
Optionally, the segment identifying unit 43 is further configured to:
for each video clip, extracting respective initial characteristic information of three groups of two-dimensional video data of the video clip; fusing the initial feature information of the three groups of two-dimensional video data of the video clip into fused feature information of the video clip; and determining an initial behavior recognition result of the video clip according to the fusion characteristic information of the video clip.
Optionally, the segment identifying unit 43 is further configured to:
splicing the initial feature information of the three groups of two-dimensional video data of the video clip into feature splicing vectors; carrying out average pooling on the feature splicing vectors to obtain pooled feature information; and converting the size of the pooled feature vector according to the preset behavior category number to obtain the fusion feature information.
Optionally, the initial behavior result of the video segment includes a behavior type to which the video segment belongs.
Optionally, the recognition result unit 44 is further configured to:
counting the number of the video clips belonging to each behavior type according to the initial behavior results of the plurality of video clips; determining a target type in the behavior types corresponding to the video clips according to the number of the clips corresponding to the behavior types; determining the target type as the final behavior recognition result of the video to be processed.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
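Purely as an illustrative sketch of how the four units may cooperate when implemented as software units, the apparatus can be modeled as a container of four callables. The class name, field names, and the stub functions in the usage lines are all hypothetical and only demonstrate the data flow between the units.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class BehaviorRecognitionApparatus:
    acquire: Callable[[Any], List[Any]]    # data obtaining unit 41
    convert: Callable[[Any], Any]          # data conversion unit 42
    recognize_clip: Callable[[Any], str]   # segment identifying unit 43
    decide: Callable[[List[str]], str]     # recognition result unit 44

    def run(self, video):
        clips = self.acquire(video)
        converted = [self.convert(clip) for clip in clips]
        labels = [self.recognize_clip(data) for data in converted]
        return self.decide(labels)


# Stub units: clips of 2 frames, identity conversion, parity-based "recognition".
apparatus = BehaviorRecognitionApparatus(
    acquire=lambda v: [v[i:i + 2] for i in range(0, len(v), 2)],
    convert=lambda clip: clip,
    recognize_clip=lambda data: "A" if sum(data) % 2 else "B",
    decide=lambda labels: max(set(labels), key=labels.count),
)
result = apparatus.run([1, 2, 3, 5, 7, 8])
```

With these stubs the three clips are labeled A, B and A, and the apparatus returns A. Any unit can be replaced independently, which mirrors the statement that the division into units is only a logical one.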
The behavior recognition device shown in fig. 4 may be a software unit, a hardware unit, or a unit combining software and hardware that is built into an existing terminal device, may be integrated into the terminal device as an independent plug-in, or may exist as an independent terminal device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Fig. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 5, the terminal device 5 of this embodiment includes: at least one processor 50 (only one is shown in fig. 5), a memory 51, and a computer program 52 stored in the memory 51 and executable on the at least one processor 50, wherein the processor 50, when executing the computer program 52, implements the steps of any of the behavior recognition method embodiments described above.
The terminal device may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that fig. 5 is only an example of the terminal device 5 and does not constitute a limitation on the terminal device 5, which may include more or fewer components than those shown, combine some components, or use different components, and may further include, for example, an input-output device, a network access device, and the like.
The processor 50 may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 51 may in some embodiments be an internal storage unit of the terminal device 5, such as a hard disk or an internal memory of the terminal device 5. The memory 51 may also be an external storage device of the terminal device 5 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), and the like, which are provided on the terminal device 5. Further, the memory 51 may also include both an internal storage unit and an external storage device of the terminal device 5. The memory 51 is used for storing an operating system, an application program, a boot loader (BootLoader), data, and other programs, such as program codes of the computer program. The memory 51 may also be used to temporarily store data that has been output or is to be output.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.
The embodiments of the present application further provide a computer program product which, when run on a terminal device, causes the terminal device to implement the steps in the above method embodiments.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and can implement the steps of the method embodiments described above when the computer program is executed by a processor. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or device capable of carrying computer program code to the apparatus/terminal device, a recording medium, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In certain jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals or telecommunications signals.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.