CN113015022A


Info

Publication number: CN113015022A
Application number: CN202110160081.1A
Authority: CN
Inventors: 林灿然; 程骏; 郭渺辰; 邵池; 张惊涛; 钱程浩; 庞建新
Original assignee: Shenzhen Ubtech Technology Co ltd
Current assignee: Shenzhen Ubtech Technology Co ltd
Priority date: 2021-02-05
Filing date: 2021-02-05
Publication date: 2021-06-22
Also published as: WO2022166258A1

Abstract

The application is applicable to the technical field of image processing, and provides a behavior identification method, a behavior identification device, a terminal device and a computer-readable storage medium, wherein the behavior identification method comprises the following steps: acquiring respective three-dimensional video data of a plurality of video clips in a video to be processed; converting the three-dimensional video data of each of the video segments into two-dimensional video data; determining an initial behavior recognition result of each of the plurality of video segments according to the two-dimensional video data of each of the plurality of video segments; and determining a final behavior recognition result of the video to be processed according to the initial behavior recognition results of the plurality of video segments. By the method, the data processing amount in the video behavior recognition task can be effectively reduced, and meanwhile, the accuracy of the behavior recognition result is improved.

Description

Behavior recognition method and device, terminal equipment and computer readable storage medium

Technical Field

The present application belongs to the field of image processing technologies, and in particular, to a behavior recognition method, apparatus, terminal device, and computer-readable storage medium.

Background

The video-based behavior recognition technology is a technology for recognizing the type of behavior in a video through analysis of video data. Since the video is formed by combining a plurality of frames of images according to a time sequence, compared with two-dimensional data of the images, the video is added with data in a time sequence dimension. Therefore, the video-based behavior recognition technology needs to analyze and process three-dimensional data of a video.

With the development of deep learning, deep learning has gradually been applied to video-based behavior recognition. In the prior art, the 2D convolution processing used in image recognition tasks is extended to 3D convolution processing; that is, the three-dimensional data of a video is processed by 3D convolution. However, this approach involves a large data processing amount, and the network is difficult to converge during training, so the accuracy of the recognition result cannot be ensured.

Disclosure of Invention

The embodiment of the application provides a behavior recognition method, a behavior recognition device, a terminal device and a computer readable storage medium, which can reduce the data processing amount of video behavior recognition and improve the recognition accuracy.

In a first aspect, an embodiment of the present application provides a behavior identification method, including:

acquiring respective three-dimensional video data of a plurality of video clips in a video to be processed;

converting the three-dimensional video data of each of the video segments into two-dimensional video data;

determining an initial behavior recognition result of each of the plurality of video segments according to the two-dimensional video data of each of the plurality of video segments;

and determining a final behavior recognition result of the video to be processed according to the initial behavior recognition results of the plurality of video segments.

In the embodiment of the application, three-dimensional video data of a video to be processed is converted into two-dimensional video data, which is equivalent to converting a three-dimensional data processing task of the video into a two-dimensional data processing task, so that the data processing amount is greatly reduced; in addition, because the three-dimensional video data contains the time sequence characteristics, the method can not only extract the image characteristic information of the video, but also extract the time sequence characteristic information among the images in the video, comprehensively identify the behavior type in the video according to the image characteristic information and the time-series characteristic information, and effectively improve the accuracy of the identification result.

In a possible implementation manner of the first aspect, the acquiring three-dimensional video data of each of a plurality of video segments in a video to be processed includes:

performing video frame extraction processing on the video to be processed to obtain a plurality of video clips, wherein the video clips comprise a plurality of frame images;

for each video segment, generating the three-dimensional video data of the video segment according to pixels on each frame image contained in the video segment, wherein the size of the three-dimensional video data is H × W × T, H is the number of pixels contained in each frame image in the video segment in the width direction, W is the number of pixels contained in each frame image in the video segment in the length direction, and T is the number of frames of the images contained in the video segment.

In a possible implementation manner of the first aspect, the converting the three-dimensional video data of each of the video segments into two-dimensional video data includes:

for each of the video segments, combining every two dimensions of the three-dimensional video data of the video segment into one set of the two-dimensional video data, to obtain three sets of the two-dimensional video data of the video segment.

In a possible implementation manner of the first aspect, the determining an initial behavior recognition result of each of the plurality of video segments according to the two-dimensional video data of each of the plurality of video segments includes:

for each video clip, extracting respective initial characteristic information of three groups of two-dimensional video data of the video clip;

fusing the initial feature information of the three groups of two-dimensional video data of the video clip into fused feature information of the video clip;

and determining an initial behavior recognition result of the video clip according to the fusion characteristic information of the video clip.

In a possible implementation manner of the first aspect, the fusing the initial feature information of each of the three sets of two-dimensional video data of the video segment into fused feature information of the video segment includes:

splicing the initial feature information of the three groups of two-dimensional video data of the video clip into feature splicing vectors;

carrying out average pooling on the feature splicing vectors to obtain pooled feature information;

and converting the size of the pooled feature vector according to the preset behavior category number to obtain the fusion feature information.

In a possible implementation manner of the first aspect, the initial behavior result of the video segment includes a behavior type to which the video segment belongs;

determining a final behavior recognition result of the video to be processed according to the initial behavior recognition result of each of the plurality of video segments, including:

counting the number of the video clips belonging to each behavior type according to the initial behavior results of the plurality of video clips;

determining a target type in the behavior types corresponding to the video clips according to the number of the clips corresponding to the behavior types;

determining the target type as the final behavior recognition result of the video to be processed.

In a second aspect, an embodiment of the present application provides a behavior recognition apparatus, including:

the data acquisition unit is used for acquiring three-dimensional video data of a plurality of video clips in a video to be processed;

a data conversion unit for converting the three-dimensional video data of each of the video clips into two-dimensional video data;

a segment identification unit, configured to determine an initial behavior identification result of each of the plurality of video segments according to the two-dimensional video data of each of the plurality of video segments;

and the identification result unit is used for determining a final behavior identification result of the video to be processed according to the initial behavior identification result of each of the plurality of video segments.

In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the behavior recognition method according to any one of the above first aspects.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and where the computer program, when executed by a processor, implements the behavior recognition method according to any one of the foregoing first aspects.

In a fifth aspect, an embodiment of the present application provides a computer program product, which, when run on a terminal device, causes the terminal device to execute the behavior recognition method according to any one of the first aspect.

It is understood that the beneficial effects of the second aspect to the fifth aspect can be referred to the related description of the first aspect, and are not described herein again.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly described below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without inventive effort.

Fig. 1 is a schematic flowchart of a behavior recognition method provided in an embodiment of the present application;

Fig. 2 is a schematic diagram of a data processing flow of behavior recognition provided by an embodiment of the present application;

Fig. 3 is a schematic diagram of a data conversion process provided by an embodiment of the present application;

Fig. 4 is a block diagram illustrating a structure of a behavior recognition apparatus according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present application.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise.

Fig. 1 is a schematic flow chart of a behavior recognition method according to an embodiment of the present application. By way of example and not limitation, the method may include the steps of:

S101, acquiring three-dimensional video data of a plurality of video clips in a video to be processed.

Pixels in one image constitute two-dimensional data, and the size of the two-dimensional data is H × W, where H is the number of pixels included in the width direction of the image, and W is the number of pixels included in the length direction of the image. For example, the number of pixels included in an image in the length direction is 10, the number of pixels included in the width direction is 5, the size of the two-dimensional data of the image is 5 × 10, that is, the image includes 50 pixels, the 50 pixels constitute 5 × 10 two-dimensional data, and the set of two-dimensional data includes feature information of the image.

The video is formed by arranging a plurality of frames of images according to time sequence, and compared with the images, the video has more characteristic information of time sequence dimensionality. Namely, one-dimensional time series data is added on the basis of two-dimensional data of an image. Thus, video may be described by three-dimensional data.

Optionally, the three-dimensional video data of the video to be processed may be generated according to the images of all frames in the video to be processed.

In practical applications, adjacent frames of images in a video usually contain the same or similar content. If the pixels of all the frame images in the video to be processed are contained in the three-dimensional video data, a large amount of data redundancy will be caused, and the subsequent data processing amount will be large.

In order to reduce the data redundancy of the video to be processed, optionally, the video to be processed may be sampled, and the pixels of the sampled images may constitute the three-dimensional video data of the video to be processed. For example: the video to be processed contains 100 frames of images; sampling with an interval of 4 frames between sampled frames (that is, keeping one frame out of every 5) extracts 20 frames of images, and the pixels of these 20 frames form the three-dimensional video data of the video to be processed.

If the sampling frequency is higher, the number of the obtained images is larger, and the data processing amount is still larger; if the sampling frequency is low, less image data is obtained, the data processing amount is less, but more image information is lost.

In order to reduce the data processing amount and simultaneously keep more image information, in the embodiment of the present application, the manner of acquiring the three-dimensional video data of each video clip includes:

performing video frame extraction processing on a video to be processed to obtain a plurality of video segments, wherein each video segment comprises a plurality of frame images; for each video segment, three-dimensional video data for the video segment is generated from pixels on frames of images contained in the video segment.

The size of the three-dimensional video data is H × W × T, where H is the number of pixels contained in each frame image in the width direction of the video clip, W is the number of pixels contained in each frame image in the length direction of the video clip, and T is the number of frames of the images contained in the video clip.
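By way of illustration and not limitation, the H × W × T data of one video clip can be built by stacking its frames along a third axis. The sketch below uses NumPy and assumes single-channel frames that all share one size; the function name `clip_to_volume` is hypothetical and does not appear in the application.

```python
import numpy as np


def clip_to_volume(frames):
    """Stack T single-channel frames of shape (H, W) into one (H, W, T) array."""
    frames = [np.asarray(f) for f in frames]
    if not frames:
        raise ValueError("a clip needs at least one frame")
    if any(f.ndim != 2 or f.shape != frames[0].shape for f in frames):
        raise ValueError("all frames must be 2-D and share one shape")
    # The last axis is the time axis, so volume[:, :, t] is frame t.
    return np.stack(frames, axis=-1)


# Four frames of size 2 x 3; frame t is filled with the value t.
frames = [np.full((2, 3), t) for t in range(1, 5)]
volume = clip_to_volume(frames)
print(volume.shape)  # (2, 3, 4)
```

Placing time on the last axis is only one possible layout; it is chosen here so that the array shape reads in the same H × W × T order as the text.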

The video frame extraction processing can be that one frame of image is extracted every n frames of images, then the extracted image is divided into a plurality of image groups according to time sequence, and each image group is a video clip; or extracting m frames of images every n frames of images, and determining the m frames of images as a video clip.

By the method, the data redundancy of the video to be processed is reduced through video frame extraction processing, and the extracted images are divided into video segments so as to keep time sequence characteristic information and image related information between adjacent images and provide a reliable data basis for the subsequent identification process.
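By way of illustration only, the two frame extraction variants described above can be sketched on frame indices. The function names and parameters (`sample_then_group`, `take_m_every_n`, `step`, `clip_len`) are hypothetical, and dropping an incomplete trailing clip is an assumption that the application does not state.

```python
def sample_then_group(num_frames, step, clip_len):
    """Keep every `step`-th frame index, then cut the kept indices,
    in time order, into clips of `clip_len` frames each."""
    kept = list(range(0, num_frames, step))
    # Assumption: a trailing group with fewer than `clip_len` frames is dropped.
    usable = len(kept) - len(kept) % clip_len
    return [kept[i:i + clip_len] for i in range(0, usable, clip_len)]


def take_m_every_n(num_frames, n, m):
    """Every `n` frames, take `m` consecutive frame indices as one clip."""
    return [list(range(start, start + m))
            for start in range(0, num_frames - m + 1, n)]


clips_a = sample_then_group(num_frames=100, step=5, clip_len=4)
clips_b = take_m_every_n(num_frames=100, n=25, m=4)
print(len(clips_a), clips_a[0])  # 5 [0, 5, 10, 15]
print(len(clips_b), clips_b[1])  # 4 [25, 26, 27, 28]
```

In both variants each clip keeps its frames in time order, which is what allows the timing information between neighbouring frames to be used later.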

Fig. 2 is a schematic diagram of a data processing flow of behavior recognition according to an embodiment of the present application. As shown in fig. 2, the video to be processed is taken as the input video, and its size is 3 × H × W × L, where L is the total number of frames of images contained in the video to be processed and 3 denotes the three RGB color channels. Since the information of the three color channels can be embodied in the pixel values, the dimension of size 3 can be ignored; that is, the size of the video to be processed is regarded as H × W × L. Frame extraction and combination (i.e. the video frame extraction processing, such as grouping every T frames in fig. 2) is performed on the input video to obtain a plurality of video clips, each of which has a size of H × W × T.

S102, converting the three-dimensional video data of each video clip into two-dimensional video data.

Optionally, one implementation manner of converting three-dimensional video data into two-dimensional video data is as follows: data in one dimension of the three-dimensional video data is added to the other two dimensions to form two-dimensional video data.

For example, refer to fig. 3, which is a schematic diagram of a data conversion process provided in an embodiment of the present application.

As shown in fig. 3 (a), a video segment includes 4 frames of images I, II, III, IV, and 4 frames of images are combined in time series. Now 4 images are stitched into a large stitched image V, and pixels on the large stitched image V constitute the two-dimensional video data of the video clip.

As can be seen from the above example, although the above method can retain image information, it cannot retain timing information between images.
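By way of illustration only, the stitching approach can be sketched with NumPy. The 2 × 2 tile layout and the 2 × 3 frame size are assumptions made for the sketch; the text above does not specify how the four frames are arranged in the stitched image V.

```python
import numpy as np

# Frames I..IV of one clip; frame t is filled with the value t.
frames = [np.full((2, 3), t) for t in range(1, 5)]

# Tile the four frames into one large image (assumed 2 x 2 grid).
stitched = np.block([[frames[0], frames[1]],
                     [frames[2], frames[3]]])
print(stitched.shape)  # (4, 6)
```

The stitched array has only two spatial axes. The order of the frames survives only as tile position, so a two-dimensional kernel sliding over it never sees the same pixel location at successive time steps.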

In order to simultaneously retain image information and timing information, in the embodiment of the present application, one implementation manner of converting three-dimensional video data into two-dimensional video data is as follows:

for each video clip, combining every two dimensions of the three-dimensional video data of the video clip into one set of two-dimensional video data, to obtain three sets of two-dimensional video data of the video clip.

Specifically, as shown in fig. 2, the three-dimensional video data H × W × T is divided into three sets of two-dimensional video data H × W, H × T and W × T.

Illustratively, as shown in fig. 3 (b), a video segment includes 4 frames of images I, II, III, and IV. Assuming that each frame of image has a size of 2 × 3 (i.e., H is 2, W is 3, T is 4, and each frame of image includes 6 pixels), the three-dimensional video data of the video segment has a size of 2 × 3 × 4. As shown in fig. 3 (c), the three-dimensional video data may be regarded as a large rectangular parallelepiped, and each pixel in each frame of image may be regarded as a voxel (i.e., a small cube) of that parallelepiped. A cube labeled 1 in the figure represents a pixel in the 1st frame of image of the video segment, a cube labeled 2 represents a pixel in the 2nd frame of image, and so on, up to a cube labeled 4, which represents a pixel in the 4th frame of image.

As shown in fig. 3 (c), the three-dimensional video data is split into three sets of two-dimensional video data: a 2 × 3 set comprising the pixels on the abcd cross-section of the rectangular parallelepiped, a 2 × 4 set comprising the pixels on the abef cross-section, and a 3 × 4 set comprising the pixels on the bcge cross-section.
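By way of illustration and not limitation, the 2 × 3 × 4 example can be reproduced with NumPy slicing. The sketch assumes the data is stored as an array indexed (H, W, T), and it treats each set as the family of parallel cross-sections of the cuboid in one orientation; that reading is an assumption, since the text above names only one cross-section per set.

```python
import numpy as np

H, W, T = 2, 3, 4
volume = np.arange(H * W * T).reshape(H, W, T)

# Cross-sections taken in each of the three orientations of the cuboid.
hw_planes = [volume[:, :, t] for t in range(T)]  # T planes of size H x W
ht_planes = [volume[:, w, :] for w in range(W)]  # W planes of size H x T
wt_planes = [volume[h, :, :] for h in range(H)]  # H planes of size W x T

print(hw_planes[0].shape, ht_planes[0].shape, wt_planes[0].shape)
# (2, 3) (2, 4) (3, 4)
```

The H × W planes carry image information only, while the H × T and W × T planes each pair one spatial axis with the time axis, which is how timing information is kept in two-dimensional form.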

S103, determining the initial behavior recognition result of each of the plurality of video clips according to the two-dimensional video data of each of the plurality of video clips.

For each video clip, three groups of two-dimensional video data of the video clip can be input into a trained recognition model, and an initial behavior recognition result of the video clip is output.

However, in the above method, the input data of the recognition model is three sets of two-dimensional data, and there are many input data, and the data processing amount is large when the recognition model is trained.

To solve the above problem, in one embodiment, the implementation manner of S103 includes:

for each video clip, extracting respective initial characteristic information of three groups of two-dimensional video data of the video clip; fusing initial characteristic information of three groups of two-dimensional video data of the video clip into fused characteristic information of the video clip; and determining an initial behavior recognition result of the video clip according to the fusion characteristic information of the video clip.

For example, as shown in fig. 2, three sets of two-dimensional video data are respectively subjected to two-dimensional convolution processing to obtain initial feature information of each set of two-dimensional video data. For example: the two-dimensional data H × W is subjected to convolution processing of 3 × 3 × 1 and pooling processing of 3 × 3 × 1, and initial feature information of 1 × 1 is obtained. Since the T-dimension of the two-dimensional data is 1, it is practically equivalent to performing 3 × 3 two-dimensional convolution processing and two-dimensional pooling processing on the two-dimensional data H × W. Similarly, for H × T two-dimensional video data, 3 × 1 × 3 convolution processing and 3 × 1 × 3 pooling processing are used; for W × T two-dimensional video data, 1 × 3 × 3 convolution processing and 1 × 3 × 3 pooling processing are used.
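By way of illustration only, the remark that a 3 × 3 × 1 kernel amounts to a 3 × 3 two-dimensional convolution on the H × W data can be checked numerically. The sketch uses NumPy with random data, no padding, and a stride of 1; the helper names `conv3d_valid` and `conv2d_valid` are hypothetical, and the operation is the cross-correlation form commonly used in neural networks.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv3d_valid(volume, kernel):
    """Slide a small 3-D kernel over an (H, W, T) volume without padding."""
    windows = sliding_window_view(volume, kernel.shape)
    return np.einsum("abcijk,ijk->abc", windows, kernel)


def conv2d_valid(image, kernel):
    """Slide a small 2-D kernel over an (H, W) image without padding."""
    windows = sliding_window_view(image, kernel.shape)
    return np.einsum("abij,ij->ab", windows, kernel)


rng = np.random.default_rng(0)
volume = rng.standard_normal((6, 7, 4))  # H x W x T
k2d = rng.standard_normal((3, 3))

# A 3 x 3 x 1 kernel applied to the whole volume ...
out_3d = conv3d_valid(volume, k2d[:, :, None])
# ... matches the 3 x 3 kernel applied to each H x W frame separately.
per_frame = np.stack(
    [conv2d_valid(volume[:, :, t], k2d) for t in range(volume.shape[2])],
    axis=-1,
)
print(out_3d.shape, np.allclose(out_3d, per_frame))  # (4, 5, 4) True
```

The same reasoning applies to the 3 × 1 × 3 and 1 × 3 × 3 kernels, which act as two-dimensional kernels on the H × T and W × T data respectively.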

The above is only an example of the manner of acquiring the initial feature information. In practical application, each set of two-dimensional video data may be subjected to convolution processing and pooling processing for multiple times, which is not specifically limited herein.

Optionally, the process of fusing the initial feature information of each of the three sets of two-dimensional video data into fused feature information includes:

splicing initial feature information of three groups of two-dimensional video data of a video clip into feature splicing vectors; carrying out average pooling on the feature splicing vectors to obtain pooled feature information; and converting the size of the pooled feature vector according to the preset behavior category number to obtain fusion feature information.

As shown in fig. 2, the initial feature information of each of the three sets of two-dimensional video data is a 1 × 1 value. The three pieces of initial feature information are spliced together by a Concat operation to obtain a feature splicing vector of size C_out × 1 × 3, where C_out = 3, i.e. the three RGB color channels. Then, average pooling is applied along the dimension in which the feature information was spliced, to obtain pooled feature information of size C_out × 1 × 1 × 1. Through a fully connected layer, the C_out dimension is changed to k (where k denotes the preset number of behavior categories, i.e., the number of categories of behaviors that need to be recognized), resulting in fused feature information of size k × 1 × 1 × 1. Finally, a softmax layer calculates the probability that the fused feature information belongs to each behavior category, and the behavior category corresponding to the maximum probability value is determined as the initial recognition result of the video clip.
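By way of illustration and not limitation, the fusion steps (splicing, average pooling, fully connected layer, softmax) can be sketched with NumPy. The singleton dimensions are dropped for readability, the weights are random placeholders rather than trained parameters, and the name `fuse_and_classify` and the value k = 5 are assumptions made for the sketch.

```python
import numpy as np


def fuse_and_classify(f_hw, f_ht, f_wt, weight, bias):
    """Fuse three per-plane feature vectors of length C_out and classify them."""
    spliced = np.stack([f_hw, f_ht, f_wt], axis=-1)  # shape (C_out, 3)
    pooled = spliced.mean(axis=-1)                   # average pooling, (C_out,)
    logits = weight @ pooled + bias                  # fully connected, (k,)
    shifted = logits - logits.max()                  # for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()  # softmax over k categories
    return probs, int(np.argmax(probs))


rng = np.random.default_rng(0)
c_out, k = 3, 5                                # k is a hypothetical category count
features = [rng.standard_normal(c_out) for _ in range(3)]
weight = rng.standard_normal((k, c_out))       # placeholder, untrained
bias = np.zeros(k)

probs, label = fuse_and_classify(*features, weight, bias)
print(probs.shape, label)
```

The returned index `label` plays the role of the initial recognition result of one video clip; with trained weights it would identify the most probable behavior category.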

And S104, determining a final behavior recognition result of the video to be processed according to the respective initial behavior recognition results of the plurality of video segments.

Wherein, the initial behavior result of the video clip comprises the behavior type of the video clip.

Optionally, a voting manner may be adopted to determine a final behavior recognition result of the video to be processed according to the initial behavior recognition result. Specifically, the method comprises the following steps:

counting the number of the video clips belonging to each behavior type according to the respective initial behavior results of the plurality of video clips; determining a target type in the behavior types corresponding to the video clips according to the number of the clips corresponding to the behavior types; and determining the target type as a final behavior recognition result of the video to be processed.

Illustratively, the video to be processed includes 3 video segments, where a behavior class to which a first video segment belongs is a, a behavior class to which a second video segment belongs is B, and a behavior class to which a third video segment belongs is a. The number of segments of the video segment belonging to the behavior category a is 2, and the number of segments of the video segment belonging to the behavior category B is 1, so that the target type is a, that is, the final behavior result of the video to be processed is a.
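By way of illustration only, the voting step can be sketched with the Python standard library. The function name `vote` is hypothetical, and the handling of ties (the category that reached the top count first in clip order is returned) is an assumption, since the text above does not say how ties are resolved.

```python
from collections import Counter


def vote(clip_labels):
    """Return the behavior category assigned to the largest number of clips."""
    if not clip_labels:
        raise ValueError("at least one clip result is required")
    counts = Counter(clip_labels)
    # most_common(1) returns [(label, count)] for the most frequent label.
    label, _ = counts.most_common(1)[0]
    return label


print(vote(["A", "B", "A"]))  # A
```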

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.

Fig. 4 is a block diagram of a behavior recognition apparatus according to an embodiment of the present application, which corresponds to the behavior recognition method described in the foregoing embodiment, and only a part related to the embodiment of the present application is shown for convenience of description.

Referring to fig. 4, the apparatus includes:

A data obtaining unit 41, configured to obtain three-dimensional video data of each of a plurality of video clips in the video to be processed.

A data conversion unit 42, configured to convert the three-dimensional video data of each of the video segments into two-dimensional video data.

A segment identifying unit 43, configured to determine an initial behavior identification result of each of the plurality of video segments according to the two-dimensional video data of each of the plurality of video segments.

A recognition result unit 44, configured to determine a final behavior recognition result of the video to be processed according to the initial behavior recognition result of each of the plurality of video segments.
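By way of illustration and not limitation, the cooperation of the four units can be sketched as a small Python class in which each unit is a replaceable callable. The class name `BehaviorRecognizer`, its field names, and the toy stand-in functions are all hypothetical; they show only the order in which data passes through units 41 to 44.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class BehaviorRecognizer:
    acquire: Callable[[Any], Sequence[Any]]  # unit 41: video -> per-clip 3-D data
    convert: Callable[[Any], Any]            # unit 42: 3-D data -> 2-D data
    recognize_clip: Callable[[Any], str]     # unit 43: 2-D data -> clip result
    combine: Callable[[List[str]], str]      # unit 44: clip results -> final result

    def __call__(self, video: Any) -> str:
        clips = self.acquire(video)
        labels = [self.recognize_clip(self.convert(clip)) for clip in clips]
        return self.combine(labels)


# Toy stand-ins so the sketch runs end to end; they are not the real units.
recognizer = BehaviorRecognizer(
    acquire=lambda video: [video[i:i + 2] for i in range(0, len(video), 2)],
    convert=lambda clip: clip,
    recognize_clip=lambda data: "A" if sum(data) > 0 else "B",
    combine=lambda labels: max(set(labels), key=labels.count),
)
print(recognizer([1, 2, -3, -4, 5, 6]))  # A
```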

Optionally, the data obtaining unit 41 is further configured to:

performing video frame extraction processing on the video to be processed to obtain a plurality of video clips, wherein the video clips comprise a plurality of frame images; for each video segment, generating the three-dimensional video data of the video segment according to pixels on each frame image contained in the video segment, wherein the size of the three-dimensional video data is H × W × T, H is the number of pixels contained in each frame image in the video segment in the width direction, W is the number of pixels contained in each frame image in the video segment in the length direction, and T is the number of frames of the images contained in the video segment.

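Optionally, the data conversion unit 42 is further configured to:

for each of the video segments, combining every two dimensions of the three-dimensional video data of the video segment into one set of the two-dimensional video data, to obtain three sets of the two-dimensional video data of the video segment.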

Optionally, the segment identifying unit 43 is further configured to:

for each video clip, extracting respective initial characteristic information of three groups of two-dimensional video data of the video clip; fusing the initial feature information of the three groups of two-dimensional video data of the video clip into fused feature information of the video clip; and determining an initial behavior recognition result of the video clip according to the fusion characteristic information of the video clip.

Optionally, the segment identifying unit 43 is further configured to:

splicing the initial feature information of the three groups of two-dimensional video data of the video clip into feature splicing vectors; carrying out average pooling on the feature splicing vectors to obtain pooled feature information; and converting the size of the pooled feature vector according to the preset behavior category number to obtain the fusion feature information.

Optionally, the initial behavior result of the video segment includes a behavior type to which the video segment belongs.

Optionally, the recognition result unit 44 is further configured to:

counting the number of the video clips belonging to each behavior type according to the initial behavior results of the plurality of video clips; determining a target type in the behavior types corresponding to the video clips according to the number of the clips corresponding to the behavior types; determining the target type as the final behavior recognition result of the video to be processed.

It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.

The behavior recognition device shown in fig. 4 may be a software unit, a hardware unit, or a combined software and hardware unit built into an existing terminal device; it may be integrated into the terminal device as an independent add-on component, or it may exist as an independent terminal device.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

Fig. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 5, the terminal device 5 of this embodiment includes: at least one processor 50 (only one shown in fig. 5), a memory 51, and a computer program 52 stored in the memory 51 and executable on the at least one processor 50, wherein the processor 50 executes the computer program 52 to implement the steps of any of the behavior recognition method embodiments described above.

The terminal device may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that fig. 5 is only an example of the terminal device 5 and does not constitute a limitation on the terminal device 5, which may include more or fewer components than those shown, may combine some components, or may use different components; for example, it may further include an input/output device, a network access device, and the like.

The processor 50 may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

In some embodiments, the memory 51 may be an internal storage unit of the terminal device 5, such as a hard disk or a memory of the terminal device 5. In other embodiments, the memory 51 may be an external storage device of the terminal device 5, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the terminal device 5. Further, the memory 51 may include both an internal storage unit and an external storage device of the terminal device 5. The memory 51 is used for storing an operating system, application programs, a boot loader (BootLoader), data, and other programs, such as the program code of the computer program. The memory 51 may also be used to temporarily store data that has been output or is to be output.

The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.

The embodiments of the present application further provide a computer program product which, when run on a terminal device, causes the terminal device to implement the steps in the above method embodiments.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to an apparatus/terminal device, a recording medium, a computer memory, a Read-Only Memory (ROM), a Random-Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In certain jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals or telecommunications signals.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. The above-described embodiments of the apparatus/terminal device are merely illustrative. For example, the division into modules or units is only a logical division, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in an electrical, mechanical, or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims

1. A behavior recognition method, characterized in that the method comprises:

acquiring respective three-dimensional video data of a plurality of video clips in a video to be processed;

converting the three-dimensional video data of each of the video clips into two-dimensional video data;

determining respective initial behavior recognition results of the plurality of video clips according to the respective two-dimensional video data of the plurality of video clips; and

determining a final behavior recognition result of the video to be processed according to the respective initial behavior recognition results of the plurality of video clips.

2. The behavior recognition method according to claim 1, wherein the acquiring respective three-dimensional video data of a plurality of video clips in a video to be processed comprises:

performing video frame extraction on the video to be processed to obtain the plurality of video clips, wherein each video clip comprises multiple frames of images; and

for each of the video clips, generating the three-dimensional video data of the video clip according to the pixels of each frame of image contained in the video clip, wherein the size of the three-dimensional video data is H×W×T, H is the number of pixels contained in each frame of image of the video clip in the width direction, W is the number of pixels contained in each frame of image of the video clip in the length direction, and T is the number of frames of images contained in the video clip.
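As an illustration of the two steps in claim 2, the following minimal Python sketch splits a video into clips and stacks each clip's frames into a three-dimensional array. It is not taken from the application: the function names `sample_clips` and `clip_to_volume`, the even-spacing sampling strategy, and the use of single-channel (grayscale) frames stored as NumPy arrays are all assumptions made for the example.

```python
import numpy as np

def sample_clips(num_frames, num_clips, frames_per_clip):
    """Return one array of frame indices per clip.

    Assumed strategy: split the video into num_clips equal spans and take
    frames_per_clip evenly spaced frames from each span.
    """
    bounds = np.linspace(0, num_frames, num_clips + 1)
    clips = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        idx = np.linspace(start, end - 1, frames_per_clip).round().astype(int)
        clips.append(idx)
    return clips

def clip_to_volume(frames):
    """Stack T single-channel frames (2D arrays) into a 3D array.

    The two frame axes play the roles of H and W; the new last axis is T.
    """
    return np.stack(frames, axis=-1)
```

With 32 frames, 4 clips, and 8 frames per clip, each clip covers 8 consecutive frames, and `clip_to_volume` returns an array whose last axis has length T = 8.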

3. The behavior recognition method according to claim 1, wherein the converting the three-dimensional video data of each of the video clips into two-dimensional video data comprises:

for each of the video clips, combining every two dimensions of data in the three-dimensional video data of the video clip into one group of two-dimensional video data, to obtain three groups of two-dimensional video data of the video clip.
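Claim 3 does not spell out how each pair of dimensions is combined. One possible reading, sketched below in Python, is that the H×W×T array is viewed as three stacks of two-dimensional planes (H-W, H-T, and W-T), with the remaining axis indexing the planes. The function name `to_plane_groups`, the dictionary keys, and the axis ordering are assumptions for illustration, not the application's stated procedure.

```python
import numpy as np

def to_plane_groups(volume):
    """View an H x W x T array as three stacks of 2D planes.

    In every stack the leftover axis comes first, so stack[i] is one plane.
    """
    return {
        "HW": np.transpose(volume, (2, 0, 1)),  # T planes of size H x W
        "HT": np.transpose(volume, (1, 0, 2)),  # W planes of size H x T
        "WT": volume,                           # H planes of size W x T
    }
```

Because `np.transpose` returns views, no pixel data is copied; each group can then be passed to an ordinary two-dimensional feature extractor.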

4. The behavior recognition method according to claim 3, wherein the determining respective initial behavior recognition results of the plurality of video clips according to the respective two-dimensional video data of the plurality of video clips comprises:

for each of the video clips, extracting respective initial feature information of the three groups of two-dimensional video data of the video clip;

fusing the respective initial feature information of the three groups of two-dimensional video data of the video clip into fused feature information of the video clip; and

determining the initial behavior recognition result of the video clip according to the fused feature information of the video clip.

5. The behavior recognition method according to claim 4, wherein the fusing the respective initial feature information of the three groups of two-dimensional video data of the video clip into fused feature information of the video clip comprises:

splicing the respective initial feature information of the three groups of two-dimensional video data of the video clip into a feature splicing vector;

performing average pooling on the feature splicing vector to obtain pooled feature information; and

converting the size of the pooled feature vector according to a preset number of behavior categories to obtain the fused feature information.
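The three fusion steps of claim 5 can be sketched as follows. This is a hedged example only: it assumes each group's initial feature information is an array of shape (N_i, C) with a shared channel count C, that splicing means concatenation along the first axis, and that the size conversion is a fully connected (linear) layer. The function name `fuse_features` and the parameters `weight` and `bias` are hypothetical.

```python
import numpy as np

def fuse_features(features, weight, bias):
    """Splice, average-pool, and project per-group features.

    features: list of three arrays, each of shape (N_i, C)
    weight:   array of shape (C, num_classes)  (assumed linear layer)
    bias:     array of shape (num_classes,)
    """
    spliced = np.concatenate(features, axis=0)  # (N_1 + N_2 + N_3, C)
    pooled = spliced.mean(axis=0)               # average pooling -> (C,)
    return pooled @ weight + bias               # (num_classes,)
```

The output has one entry per behavior category, so an initial recognition result for the clip could be read off, for example, as the index of the largest entry.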

6. The behavior recognition method according to claim 1, wherein the initial behavior result of the video clip comprises a behavior type to which the video clip belongs;

the determining a final behavior recognition result of the video to be processed according to the respective initial behavior recognition results of the plurality of video clips comprises:

counting, according to the respective initial behavior results of the plurality of video clips, the number of video clips belonging to each behavior type;

determining, according to the number of clips corresponding to each behavior type, a target type among the behavior types respectively corresponding to the plurality of video clips; and

determining the target type as the final behavior recognition result of the video to be processed.
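The counting step of claim 6 can be illustrated with a short Python sketch. The claim states only that the target type is chosen according to the clip counts; selecting the type with the largest count (a majority vote) is an assumption here, as is the function name `final_behavior`.

```python
from collections import Counter

def final_behavior(clip_labels):
    """Pick the behavior type assigned to the most clips.

    Assumption: the target type is the most frequent one. Ties go to the
    type that appears first in clip_labels.
    """
    if not clip_labels:
        raise ValueError("at least one clip label is required")
    counts = Counter(clip_labels)          # clips per behavior type
    label, _ = counts.most_common(1)[0]    # most frequent type
    return label
```

For example, clip-level results of `["run", "walk", "run", "jump"]` give `"run"` as the video-level result under this assumption.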

7. A behavior recognition device, characterized in that the device comprises:

a data acquisition unit, configured to acquire respective three-dimensional video data of a plurality of video clips in a video to be processed;

a data conversion unit, configured to convert the three-dimensional video data of each of the video clips into two-dimensional video data;

a clip recognition unit, configured to determine respective initial behavior recognition results of the plurality of video clips according to the respective two-dimensional video data of the plurality of video clips; and

a recognition result unit, configured to determine a final behavior recognition result of the video to be processed according to the respective initial behavior recognition results of the plurality of video clips.

8. The behavior recognition device according to claim 7, wherein the data acquisition unit is further configured to:

9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the method according to any one of claims 1 to 6.

10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 6.