
Behavior detection method, device and equipment

Info

Publication number
CN111985385B
CN111985385B
Authority
CN
China
Prior art keywords
target
behavior
detected
sample
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010821323.2A
Other languages
Chinese (zh)
Other versions
CN111985385A (en)
Inventor
赵飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202010821323.2A
Publication of CN111985385A
Application granted
Publication of CN111985385B
Legal status: Active (current)
Anticipated expiration

Abstract

The application provides a behavior detection method, a behavior detection device and behavior detection equipment, wherein the method comprises the following steps: acquiring a video to be detected, wherein the video to be detected comprises a plurality of images to be detected; inputting the video to be detected into a trained target image behavior detection model, and outputting the object position in each candidate image to be detected in a plurality of candidate images to be detected by the target image behavior detection model; selecting a plurality of target to-be-detected images of the same target object from the plurality of candidate to-be-detected images based on the object positions of the target objects in the candidate to-be-detected images, and determining the target frame positions of the target objects based on the object positions of the target objects in each target to-be-detected image; acquiring a behavior sequence to be detected according to the position of the target frame; and inputting the behavior sequence to be detected into a trained target behavior sequence recognition model, and outputting a target behavior class corresponding to the behavior sequence to be detected by the target behavior sequence recognition model. By the technical scheme, the video behavior detection accuracy is high, and the detection mode is simple.

Description

Behavior detection method, device and equipment
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a behavior detection method, apparatus, and device.
Background
Video is a sequence of successive images. Due to the persistence-of-vision effect of the human eye, when a video is played at a certain rate, the human eye perceives the image sequence as continuous motion.
Video behavior detection is a technique that locates, from a video, the time interval in which a behavior occurs (when it starts, when it ends, and the like) and its spatial location (for example, where the behavior occurs), and classifies the behavior category. Video behavior detection can be widely applied in fields such as security monitoring, human-computer interaction, intelligent parks, intelligent classrooms and intelligent farms. For example, falling behaviors and climbing behaviors of targets in surveillance video can be detected for security protection; the hand-raising behavior and standing behavior of students in class can be detected to analyze the teacher-student interaction atmosphere in the classroom; and it can be detected whether an industrial production flow meets the standard behavior specification.
At present, the video behavior detection technology has the problems of low detection accuracy, complex detection mode and the like.
Disclosure of Invention
The application provides a behavior detection method, which comprises the following steps:
acquiring a video to be detected, wherein the video to be detected comprises a plurality of images to be detected;
Inputting the video to be detected into a trained target image behavior detection model, and outputting the object position in each candidate image to be detected in a plurality of candidate images to be detected by the target image behavior detection model; the candidate images to be detected are images to be detected of objects in the plurality of images to be detected;
selecting a plurality of target to-be-detected images of the same target object from the plurality of candidate to-be-detected images based on the object positions of the target objects in the candidate to-be-detected images, and determining the target frame positions of the target objects based on the object positions of the target objects in each target to-be-detected image;
acquiring a behavior sequence to be detected according to the target frame position, wherein the behavior sequence to be detected comprises target frame sub-images selected from each target image to be detected based on the target frame position;
and inputting the behavior sequence to be detected into a trained target behavior sequence recognition model, and outputting a target behavior class corresponding to the behavior sequence to be detected by the target behavior sequence recognition model.
The application provides a behavior detection device, comprising: the acquisition module is used for acquiring a video to be detected, wherein the video to be detected comprises a plurality of images to be detected; the input module is used for inputting the video to be detected into a trained target image behavior detection model, and outputting the object position in each candidate image to be detected in a plurality of candidate images to be detected by the target image behavior detection model; wherein the candidate image to be detected is an image to be detected of an object existing in the plurality of images to be detected; the determining module is used for selecting a plurality of target to-be-detected images of the same target object from the plurality of candidate to-be-detected images based on the object positions of the target objects in the candidate to-be-detected images, and determining the target frame positions of the target objects based on the object positions of the target objects in each target to-be-detected image; the acquisition module is further used for acquiring a behavior sequence to be detected according to the target frame position, wherein the behavior sequence to be detected comprises target frame sub-images selected from each target image to be detected based on the target frame position; the input module is further configured to input the behavior sequence to be detected to a trained target behavior sequence recognition model, and output, by the target behavior sequence recognition model, a target behavior class corresponding to the behavior sequence to be detected.
The present application provides a behavior detection apparatus including: a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor;
the processor is configured to execute machine-executable instructions to perform the steps of:
acquiring a video to be detected, wherein the video to be detected comprises a plurality of images to be detected;
inputting the video to be detected into a trained target image behavior detection model, and outputting the object position in each candidate image to be detected in a plurality of candidate images to be detected by the target image behavior detection model; the candidate images to be detected are images to be detected of objects in the plurality of images to be detected;
selecting a plurality of target to-be-detected images of the same target object from the plurality of candidate to-be-detected images based on the object positions of the target objects in the candidate to-be-detected images, and determining the target frame positions of the target objects based on the object positions of the target objects in each target to-be-detected image;
acquiring a behavior sequence to be detected according to the target frame position, wherein the behavior sequence to be detected comprises target frame sub-images selected from each target image to be detected based on the target frame position;
And inputting the behavior sequence to be detected into a trained target behavior sequence recognition model, and outputting a target behavior class corresponding to the behavior sequence to be detected by the target behavior sequence recognition model.
According to the technical scheme, in the embodiment of the application, the target image behavior detection model can be used for extracting the potential behavior targets in the video to be detected, the target tracking association is used for generating the behavior target tracks, a plurality of target images to be detected of the same target object are obtained, the behavior sequence to be detected is obtained based on the target images to be detected, and then the target behavior sequence identification model is used for outputting the target behavior types corresponding to the behavior sequence to be detected, so that the function of behavior classification (or false alarm removal) is completed.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required by the embodiments of the present application or by the description of the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and a person of ordinary skill in the art may obtain other drawings from these drawings.
FIG. 1 is a flow chart of a behavior detection method in one embodiment of the application;
FIG. 2 is a schematic diagram of a model training process in another embodiment of the application;
FIGS. 3A and 3B are schematic diagrams of sample frame positions in one embodiment of the application;
FIG. 4 is a schematic diagram of a deployment detection process in another embodiment of the present application;
FIG. 5 is a block diagram of a behavior detection device in one embodiment of the present application;
fig. 6 is a block diagram of a behavior detection apparatus in one embodiment of the present application.
Detailed Description
The terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to any or all possible combinations including one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in embodiments of the present application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the application. Furthermore, depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining".
Before describing the technical scheme of the application, concepts related to the embodiments of the application are described.
Machine learning: machine learning is a way to implement artificial intelligence to study how computers simulate or implement learning behavior of humans to obtain new knowledge or skills, reorganizing existing knowledge structures to continuously improve their own performance. Deep learning belongs to a subclass of machine learning, and is a process of modeling specific problems in the real world using mathematical models to solve similar problems in the field. Neural networks are implementations of deep learning, and for ease of description, the structure and function of the neural network is described herein by taking neural networks as an example, and for other subclasses of machine learning, the structure and function of the neural network are similar.
Neural network: the neural network includes, but is not limited to, convolutional Neural Network (CNN), cyclic neural network (RNN), fully connected network, etc., and the structural units of the neural network include, but are not limited to, convolutional layer (Conv), pooling layer (Pool), excitation layer, fully connected layer (FC), etc.
In practical applications, the neural network may be constructed by combining one or more convolution layers, one or more pooling layers, one or more excitation layers, and one or more fully-connected layers according to different requirements.
In the convolution layer, the input data features are enhanced by performing a convolution operation with a convolution kernel. The convolution kernel may be an m x n matrix; convolving the input data features of the convolution layer with the convolution kernel yields the output data features of the convolution layer. The convolution operation is essentially a filtering process.
In the pooling layer, operations such as taking the maximum value, the minimum value or the average value are performed on the input data features (such as the output of a convolution layer), so that the input data features are subsampled by utilizing the principle of local correlation; this reduces the processing amount while keeping the features invariant. The pooling layer operation is essentially a downsampling process.
In the excitation layer, the input data features may be mapped using an activation function (e.g., a nonlinear function) to introduce a nonlinear factor such that the neural network enhances expression through nonlinear combinations.
The activation function may include, but is not limited to, a ReLU (Rectified Linear Unit) function, which sets features less than 0 to 0 while keeping features greater than 0 unchanged.
In the fully-connected layer, all data features input to the fully-connected layer are fully-connected, so that a feature vector is obtained, and the feature vector can comprise a plurality of data features.
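As an illustration of how these structural units combine, the following minimal PyTorch sketch stacks a convolution layer, a ReLU excitation layer, a pooling layer and a fully connected layer; the layer sizes and the two-class output are assumptions made for the example only, not parameters from the patent.

import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    # Illustrative only: convolution -> excitation (ReLU) -> pooling -> fully connected.
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution layer: m x n kernel filters the input features
            nn.ReLU(),                                   # excitation layer: features below 0 are set to 0
            nn.MaxPool2d(2),                             # pooling layer: downsampling by taking local maxima
        )
        self.classifier = nn.Linear(16 * 16 * 16, num_classes)  # fully connected layer: outputs a vector of class scores

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)      # e.g. (N, 3, 32, 32) -> (N, 16, 16, 16)
        x = torch.flatten(x, 1)   # flatten into a feature vector
        return self.classifier(x)

logits = TinyConvNet()(torch.randn(1, 3, 32, 32))
print(logits.shape)  # torch.Size([1, 2])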
Training and deployment processes for neural networks (e.g., convolutional neural networks): the sample data may be used to train various neural network parameters within the neural network, such as convolutional layer parameters (e.g., convolutional kernel parameters), pooling layer parameters, excitation layer parameters, full-link layer parameters, etc., without limitation. By training various neural network parameters within the neural network, the neural network can be fitted to the mapping of inputs and outputs.
After the neural network training is completed, the trained neural network can be deployed to each device, so that each device can realize artificial intelligence processing based on the neural network, and the artificial intelligence processing process is not limited.
Image behavior detection model: a network model implemented based on a machine learning algorithm, such as a network model implemented based on a deep learning algorithm, is exemplified by an image behavior detection model implemented based on a neural network in the deep learning algorithm. For convenience of description, the image behavior detection model that has not completed training is referred to as an initial image behavior detection model, and the image behavior detection model that has completed training is referred to as a target image behavior detection model.
Behavior sequence recognition model: a network model implemented based on a machine learning algorithm, such as a network model implemented based on a deep learning algorithm, is exemplified by a behavior sequence recognition model implemented based on a neural network in the deep learning algorithm. For convenience of description, the behavior sequence recognition model that does not complete training is referred to as an initial behavior sequence recognition model, and the behavior sequence recognition model that has completed training is referred to as a target behavior sequence recognition model.
Sample training video: the sample training video is a video in a training process, that is, training is performed based on the sample training video in a training process of an initial image behavior detection model and an initial behavior sequence recognition model. The sample training video includes a plurality of sample training images that are consecutive images, such as sample training video includes consecutive sample training image 1, sample training images 2, …, sample training image m.
Video to be detected: the video to be detected is a video in the detection process, that is, after the target image behavior detection model and the target behavior sequence identification model are deployed to the device, the video to be detected can be detected based on the target image behavior detection model and the target behavior sequence identification model, so that the target behavior category in the video to be detected is detected. The video to be detected comprises a plurality of images to be detected, and the images to be detected are continuous images, for example, the video to be detected comprises a continuous image to be detected 1, an image to be detected 2, … and an image to be detected n.
The technical scheme of the embodiment of the application is described below with reference to specific embodiments.
In the embodiment of the present application, a behavior detection method is provided, and referring to fig. 1, which is a flow chart of the behavior detection method, the method may be applied to any device (such as an analog Camera, an IPC (internet protocol Camera), a background server, an application server, etc.), and the method may include:
Step 101, obtaining a video to be detected, wherein the video to be detected comprises a plurality of images to be detected.
Step 102, inputting the video to be detected into a trained target image behavior detection model, and outputting the object position in each candidate image to be detected in a plurality of candidate images to be detected by the target image behavior detection model. Illustratively, the candidate to-be-detected image is an image to be detected in which an object exists in the plurality of to-be-detected images.
Step 103, selecting a plurality of target to-be-detected images of the same target object from the plurality of candidate to-be-detected images based on the object positions of the candidate to-be-detected images, and determining the target frame position of the target object based on the object position of the target object in each target to-be-detected image.
Illustratively, a tracking algorithm (such as a multi-target tracking algorithm, which is not limited in type) may be used to determine a target object based on the object position in the candidate image to be detected, and the tracking algorithm may be used to determine whether the object position in the candidate image to be detected has the object position of the target object; if yes, the candidate image to be detected can be determined to be the target image to be detected of the target object.
Step 104, obtaining a behavior sequence to be detected according to the target frame position, wherein the behavior sequence to be detected comprises target frame sub-images selected from each target image to be detected based on the target frame position.
Step 105, the behavior sequence to be detected is input into a trained target behavior sequence recognition model, and a target behavior class corresponding to the behavior sequence to be detected is output by the target behavior sequence recognition model.
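To make the flow of steps 101-105 concrete, the following Python sketch strings the steps together. It is a hedged outline: the detector, tracker and sequence recognizer are passed in as callables because the patent does not prescribe specific implementations, the image layout (NumPy arrays in height-width-channel order) is an assumption of the example, and the target frame position is computed here as the circumscribed rectangle of the tracked object positions, mirroring the sample frame computation described later in the training flow.

import numpy as np
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]        # (left_top_x, left_top_y, right_bottom_x, right_bottom_y)
Track = List[Tuple[int, Box]]          # [(frame index, object position), ...] for one target object

def detect_behaviors(
    frames: Sequence[np.ndarray],                            # step 101: images to be detected
    detector: Callable[[np.ndarray], List[Box]],             # step 102: target image behavior detection model
    tracker: Callable[[Dict[int, List[Box]]], List[Track]],  # step 103: tracking association of object positions
    recognizer: Callable[[List[np.ndarray]], str],           # step 105: target behavior sequence recognition model
) -> List[str]:
    detections: Dict[int, List[Box]] = {}
    for idx, frame in enumerate(frames):
        boxes = detector(frame)
        if boxes:                      # keep only candidate images to be detected, i.e. images where an object exists
            detections[idx] = boxes
    results = []
    for track in tracker(detections):  # target images to be detected of the same target object
        boxes = [b for _, b in track]
        # target frame position: circumscribed rectangle over the object positions of the track
        x1 = min(b[0] for b in boxes); y1 = min(b[1] for b in boxes)
        x2 = max(b[2] for b in boxes); y2 = max(b[3] for b in boxes)
        # step 104: behavior sequence to be detected = sub-images cropped at the target frame position
        sequence = [frames[idx][y1:y2, x1:x2] for idx, _ in track]
        results.append(recognizer(sequence))  # step 105: target behavior class
    return results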
In one possible implementation, after step 105, an alarm process may also be performed according to the target behavior class. Or if the target image behavior detection model also outputs an initial behavior category corresponding to the video to be detected, alarm processing can be performed according to the target behavior category and the initial behavior category.
In one possible implementation, the training process of the target image behavior detection model may include, but is not limited to: a sample training video is acquired that includes a plurality of sample training images, which may include a plurality of calibration sample training images in which a specified behavior occurs. And inputting the calibration sample training image and calibration information of the calibration sample training image into an initial image behavior detection model so as to train the initial image behavior detection model through the calibration sample training image and the calibration information, thereby obtaining a trained target image behavior detection model. The calibration information may include at least: and each calibration sample trains the object position of the object with the specified behavior in the image, and specifies the behavior class of the behavior.
In one possible implementation, the training process of the target behavior sequence recognition model may include, but is not limited to: the sample training video is input to a trained target image behavior detection model, and the target image behavior detection model outputs the object position in each of a plurality of candidate sample training images, which are sample training images in which the object exists in the plurality of sample training images. A plurality of target sample training images of the same sample object are selected from the plurality of candidate sample training images based on the object positions in the candidate sample training images, a sample frame position of the sample object is determined based on the object position of the sample object in each target sample training image, and a sample behavior sequence is obtained according to the sample frame position, wherein the sample behavior sequence can comprise a sample frame sub-image selected from each target sample training image based on the sample frame position. And inputting the sample behavior sequence and the calibration information of the sample behavior sequence into an initial behavior sequence recognition model, so as to train the initial behavior sequence recognition model through the sample behavior sequence and the calibration information of the sample behavior sequence, and obtain a trained target behavior sequence recognition model.
Illustratively, selecting a plurality of target sample training images for the same sample object from a plurality of candidate sample training images based on the object position in the candidate sample training images may include, but is not limited to: based on the object position in the candidate sample training image, a tracking algorithm (such as a multi-target tracking algorithm, the type of the tracking algorithm is not limited) can be adopted to determine a target object, and the tracking algorithm is adopted to determine whether the object position in the candidate sample training image exists in the object position of the sample object; if so, the candidate sample training image may be determined as a target sample training image for the sample object.
Exemplary calibration information of a sample behavior sequence may include, but is not limited to, the predicted behavior class of the sample behavior sequence, and the manner of determining the predicted behavior class of the sample behavior sequence may include, but is not limited to: determining the position of a calibration frame of the calibration object based on the object position of the calibration object in each calibration sample training image, and determining the spatial domain matching degree based on the position of the calibration frame and the position of the sample frame; determining the time domain matching degree based on the start time and end time of the plurality of calibration sample training images and the start time and end time of the plurality of target sample training images; and determining the predicted behavior class of the sample behavior sequence according to the spatial domain matching degree, the time domain matching degree and the behavior class of the specified behavior.
For example, if the spatial matching degree is greater than a spatial matching degree threshold and the temporal matching degree is greater than a temporal matching degree threshold, determining that the predicted behavior class of the sample behavior sequence is the same as the behavior class of the specified behavior; otherwise, determining that the predicted behavior class of the sample behavior sequence is opposite to the behavior class of the specified behavior.
The above execution sequence is merely an example given for convenience of description; in practical applications, the execution sequence between the steps may be changed, which is not limited. Moreover, in other embodiments, the steps of the corresponding methods need not be performed in the order shown and described herein, and the methods may include more or fewer steps than described herein. Furthermore, an individual step described in this specification may, in other embodiments, be split into multiple steps, and multiple steps described in this specification may, in other embodiments, be combined into a single step.
According to the technical scheme, in the embodiment of the application, the target image behavior detection model can be used for extracting the potential behavior targets in the video to be detected, the target tracking association is used for generating the behavior target tracks, a plurality of target images to be detected of the same target object are obtained, the behavior sequence to be detected is obtained based on the target images to be detected, and then the target behavior sequence identification model is used for outputting the target behavior types corresponding to the behavior sequence to be detected, so that the function of behavior classification (or false alarm removal) is completed.
The above technical solution of the embodiments of the present application is described below with reference to specific application scenarios.
The embodiment of the application provides an automatic, universal video behavior detection method, which can automatically complete model training and model deployment based on video behavior annotation, locate the time interval in which a behavior occurs (when it starts, when it ends, and the like) and its spatial position (such as where the behavior occurs) from the video, and classify the behavior category. The method can be applied in fields such as security monitoring, human-computer interaction, intelligent parks, intelligent classrooms and intelligent farms; for example, detecting target falling behaviors and target climbing behaviors in video for safety protection, detecting the hand-raising behavior and standing behavior of students in a classroom so as to analyze the teacher-student interaction atmosphere in the classroom, and detecting whether an industrial production flow meets the standard behavior specification.
Embodiments of the present application may relate to model training processes and deployment detection processes. The model training process can be realized through a video behavior calibration module, an image behavior detection data building module, an image behavior detection model automatic training module, a behavior sequence data set building module, a behavior sequence recognition model automatic training module and the like. The deployment detection process can be realized through an automatic reasoning module, a behavior detection result visualization module and the like.
Referring to fig. 2, a schematic diagram of a model training process is shown, through which a trained target image behavior detection model and a trained target behavior sequence recognition model can be obtained.
Video behavior calibration module: sample videos are obtained, and specified behaviors (including but not limited to specified behaviors of pedestrians, vehicles, animals, machines and the like) in the sample videos are calibrated in the following way: during the occurrence of a specified behavior, the spatial position of the behavior is marked by a drawing frame (including but not limited to a rectangular frame, a circular frame, a polygonal frame, etc.) at specific time intervals (including but not limited to fixed time intervals, random time intervals, etc.), and labeling information of the behavior category is given. The input of the video behavior calibration module is a sample video, and the output is calibration information corresponding to the sample video, wherein the calibration information comprises, but is not limited to, time information of a specified behavior in the sample video, spatial information of the specified behavior, and behavior category of the specified behavior.
For example, the user inputs a sample video to the video behavior calibration module, where the sample video includes 100 frames of images. If a specified behavior (such as a falling behavior) occurs in the 10th-19th frame images, the video behavior calibration module may add calibration information to the sample video, where the calibration information includes time information of the specified behavior (such as a time t10 of the 10th frame image and a time t19 of the 19th frame image, indicating that the specified behavior occurs in the time interval between time t10 and time t19), spatial information of the specified behavior (such as the spatial position in each of the 10th-19th frame images, or the spatial position in a part of the 10th-19th frame images, without limitation), and the behavior category of the specified behavior (such as a fall category, indicating that the specified behavior is a falling behavior).
For the spatial position of the 10th frame image, an object (such as a person) performing the falling action in the 10th frame image may be selected by drawing a frame. Taking a rectangular frame as an example, the rectangular frame encloses the object performing the falling action, and the spatial position in the 10th frame image is the object position. The object position may include, but is not limited to, coordinate information of the rectangular frame, such as an upper left corner coordinate (an upper left corner abscissa and an upper left corner ordinate) and a lower right corner coordinate (a lower right corner abscissa and a lower right corner ordinate), or a lower left corner coordinate (a lower left corner abscissa and a lower left corner ordinate) and an upper right corner coordinate (an upper right corner abscissa and an upper right corner ordinate). Of course, these are just two examples of the coordinate information of the rectangular frame, and no limitation is made thereto. For example, the coordinate information may be the upper left corner coordinate together with the width and height of the rectangular frame, and the lower right corner coordinate may be determined from the upper left corner coordinate, the width and the height. For another example, the coordinate information may be the lower left corner coordinate together with the width and height of the rectangular frame, and the upper right corner coordinate may be determined from the lower left corner coordinate, the width and the height. Obviously, the rectangular frame of the object performing the falling action in the 10th frame image, namely the object position, can be determined from the coordinate information. The spatial positions of the 11th to 19th frame images are similar to that of the 10th frame image, and are not described again here.
In summary, for the 10th to 19th frame images, the position of the object where the falling action occurs can be calibrated.
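For concreteness, the calibration information described above could be represented in memory as follows; the field names and coordinate values are assumptions made for illustration, not a format defined by the patent.

calibration_info = {
    "behavior_class": "fall",              # behavior category of the specified behavior
    "start_time": "t10",                   # time of the 10th frame image
    "end_time": "t19",                     # time of the 19th frame image
    "object_positions": {                  # spatial information: rectangular frame per calibrated frame
        10: {"left_top": (412, 208), "right_bottom": (540, 377)},
        11: {"left_top": (415, 214), "right_bottom": (546, 381)},
        # ... frames 12-19 calibrated in the same way
    },
}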
The image behavior detection data building module: the video behavior calibration module can input a large number of sample videos and calibration information corresponding to each sample video into the image behavior detection data construction module, the image behavior detection data construction module can divide the sample videos into sample training videos and sample test videos, and the number of the sample training videos and the number of the sample test videos can be the same or different, so that the method is not limited. For example, the image behavior detection data building module acquires 100 sample videos, takes 70 sample videos as sample training videos, and takes the rest 30 sample videos as sample test videos.
For each sample training video, the sample training video comprises a plurality of sample training images, and the image behavior detection data building module can extract sample training images generating specified behaviors from the sample training video according to a preset strategy (including but not limited to random selection, fixed offset selection and the like), and the extracted sample training images are used as calibration sample training images. Referring to the above embodiment, the sample training image in which the specified behavior occurs has calibration information, and thus, the calibration sample training image also has calibration information.
The image behavior detection data construction module may automatically construct an image behavior detection data set including the calibration sample training image and calibration information of the calibration sample training image.
For example, the sample training video includes 100 frames of images, and the calibration information of the sample training video includes time information of the specified behavior (such as time t10 of the 10th frame image and time t19 of the 19th frame image), spatial information of the specified behavior (such as the object position in each of the 10th-19th frame images), and the behavior category of the specified behavior (such as a fall category). Based on this, the image behavior detection data building module learns from the time information that the 10th-19th frame images are sample training images in which the specified behavior occurs, and extracts all or part of these sample training images according to a preset strategy as calibration sample training images.
For each calibration sample training image, calibration information for the calibration sample training image may include: spatial information of the calibration sample training image, and behavior category of the calibration sample training image. For example, the spatial information of the calibration sample training image may be the object position (i.e., the object position where the falling action occurs, such as coordinate information of the object, etc.) where the object where the specified action (such as the falling action) occurs in the calibration sample training image, where the action category is the action category of the specified action (such as the falling category).
In summary, the input of the image behavior detection data building module is a sample training video, and the output of the image behavior detection data building module is an image behavior detection data set, where the image behavior detection data set includes a plurality of calibration sample training images and calibration information of each calibration sample training image.
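A minimal sketch of this extraction step is shown below, using random selection as one of the preset strategies named above; the function signature and field layout are assumptions for illustration, not an interface defined by the patent.

import random
from typing import Dict, List

def build_image_detection_dataset(
    behavior_frame_indices: List[int],        # frame indices in which the specified behavior occurs, e.g. 10..19
    per_frame_calibration: Dict[int, dict],   # object position and behavior class for each such frame
    num_samples: int = 4,
    seed: int = 0,
) -> List[dict]:
    rng = random.Random(seed)
    chosen = sorted(rng.sample(behavior_frame_indices, min(num_samples, len(behavior_frame_indices))))
    # each entry pairs a calibration sample training image with its calibration information
    return [{"frame_index": idx, **per_frame_calibration[idx]} for idx in chosen]

dataset = build_image_detection_dataset(
    list(range(10, 20)),
    {idx: {"behavior_class": "fall", "object_position": (412, 208, 540, 377)} for idx in range(10, 20)},
)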
The automatic training module of the image behavior detection model: the image behavior detection data construction module may input an image behavior detection data set to the image behavior detection model automatic training module, the image behavior detection data set may include calibration sample training images and calibration information of the calibration sample training images. The automatic image behavior detection model training module can input a calibration sample training image and calibration information of the calibration sample training image into the initial image behavior detection model so as to train the initial image behavior detection model through the calibration sample training image and the calibration information, and a trained target image behavior detection model is obtained.
For example, because the calibration information includes the object position where the object with the specified behavior is located in the calibration sample training image and the behavior class of the specified behavior, the target image behavior detection model is used to fit the mapping relationship between the image feature vector and the behavior class, and the mapping relationship between the image feature vector and the object position.
Illustratively, the image behavior detection model automatic training module loads a preset detector template (including but not limited to Faster RCNN, yolo-V3 and other types of detector templates, which serve as an initial image behavior detection model), and automatically trains the initial image behavior detection model based on the image behavior detection data set. For example, after the image behavior detection data set is input into the initial image behavior detection model, the initial image behavior detection model is automatically trained based on training parameters (such as training iteration times, training optimization strategies, training stopping condition strategies and the like), the training process is not limited, and after the training process is finished, the initial image behavior detection model which has completed training is taken as the target image behavior detection model.
In summary, the input of the image behavior detection model automatic training module is the image behavior detection dataset, and the output is the target image behavior detection model. The output of the image behavior detection model automatic training module may also include training progress of the initial image behavior detection model, key training state information, and the like, which is not limited.
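As a hedged illustration of loading such a detector template, the torchvision library ships a Faster R-CNN implementation that can serve as the initial image behavior detection model; the snippet below shows one training iteration on a single calibration sample training image. The torchvision calls are standard (torchvision >= 0.13 is assumed), while the class count, optimizer settings and box values are assumptions of the example rather than the patent's training parameters.

import torch
import torchvision

# num_classes = background + behavior categories (e.g. "fall")
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None, weights_backbone=None, num_classes=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

images = [torch.rand(3, 480, 640)]                              # one calibration sample training image
targets = [{
    "boxes": torch.tensor([[412.0, 208.0, 540.0, 377.0]]),      # object position where the specified behavior occurs
    "labels": torch.tensor([1]),                                 # behavior class of the specified behavior
}]

model.train()
loss_dict = model(images, targets)   # detection and classification losses
loss = sum(loss_dict.values())
optimizer.zero_grad()
loss.backward()
optimizer.step()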
The behavior sequence data set building module: the behavior sequence data set building module may automatically build a behavior sequence data set, where the behavior sequence data set may include a sample behavior sequence and calibration information of the sample behavior sequence, for example, the input of the behavior sequence data set building module is a sample training video, and the output is a behavior sequence data set, and the building process of the behavior sequence data set is described below with reference to specific steps:
Step a1, inputting the sample training video into a trained target image behavior detection model, and outputting the object position in each candidate sample training image in a plurality of candidate sample training images by the target image behavior detection model, wherein the candidate sample training image is a sample training image with an object in the plurality of sample training images.
For example, the sample training video may be input to the target image behavior detection model, and the sample training video includes a plurality of sample training images. With reference to the above embodiment, the target image behavior detection model is configured to fit a mapping relationship between an image feature vector and a behavior class, and a mapping relationship between an image feature vector and an object position, so that, for each sample training image, the target image behavior detection model may process the sample training image, and the processing manner is not limited, so as to obtain the image feature vector of the sample training image. If the image feature vector of the sample training image corresponds to the behavior type and the object position, the target image behavior detection model takes the sample training image as a candidate sample training image and outputs the behavior type and the object position corresponding to the candidate sample training image. And if the image feature vector of the sample training image does not correspond to the behavior category and the object position, the sample training image is not used as a candidate sample training image.
In summary, the target image behavior detection model may output an object position and a behavior class in a candidate sample training image, where the candidate sample training image is a sample training image in which an object (such as an object that has a falling behavior) exists, the object position represents coordinate information of the object in the candidate sample training image, and the behavior class represents a behavior class of a specified behavior that the object has occurred.
Step a2, selecting a plurality of target sample training images of the same sample object from a plurality of candidate sample training images based on the object positions in the candidate sample training images. For example, based on the object position in each candidate sample training image, determining a sample object by adopting a tracking algorithm, and determining whether the object position in the candidate sample training image exists in the object position of the sample object by adopting the tracking algorithm; if so, the candidate sample training image may be determined as a target sample training image for the sample object.
For each candidate sample training image, at least one object position in the candidate sample training image may be output, where each object position corresponds to an object, and the objects in different candidate sample training images may be the same or different. Based on this, a tracking algorithm (such as MOT (multi-object tracking) algorithm) may be used to determine a plurality of object positions belonging to the same object (the object is marked as a sample object), and the candidate sample training image in which the object positions are located is determined as the target sample training image of the sample object.
For example, based on the object position 11 of the object 1 and the object position 21 of the object 2 in the candidate sample training image 1, the object position 12 of the object 1 and the object position 32 of the object 3 in the candidate sample training image 2, and the object position 33 of the object 3 in the candidate sample training image 3, the object position 11 and the object position 12 belonging to the same object 1 can be determined by using a tracking algorithm, and the tracking process is not limited as long as the object position of the same object can be tracked. Then, the candidate sample training image 1 and the candidate sample training image 2 may be determined as target sample training images of the object 1.
For example, when the multi-target tracking algorithm is used to determine a plurality of object positions belonging to the same sample object, all object positions output by the target image behavior detection model may be input to the multi-target tracking algorithm. Based on this, in one possible implementation, the multi-target tracking algorithm first selects one object as a sample object, identifies a plurality of object positions of the sample object from all object positions, and outputs the plurality of object positions of the sample object, then selects another object as a sample object, identifies a plurality of object positions of the sample object from all object positions, and outputs the plurality of object positions of the sample object, and so on until the plurality of object positions of each sample object are output. In another possible implementation, the multi-target tracking algorithm may track multiple sample objects in all object positions in parallel, i.e., track object positions of multiple sample objects in parallel, the multi-target tracking algorithm being capable of identifying multiple object positions for each sample object and outputting the multiple object positions for each sample object.
In summary, for each sample object, a plurality of object positions of the sample object may be obtained based on the multi-target tracking algorithm, and a candidate sample training image where the plurality of object positions of the sample object are located is determined as the target sample training image of the sample object.
For example, when the multi-target tracking algorithm is adopted to determine the positions of a plurality of objects belonging to the same sample object, the implementation process of the multi-target tracking algorithm is not limited, for example, the multi-target tracking algorithm can match the existing target track according to the detection result of the target in each frame of image; for new emerging targets, new targets need to be generated; for an object that has left, it is necessary to terminate tracking of the trajectory. In this process, matching of the target and the detection may be regarded as re-recognition of the target, for example, when tracking a plurality of pedestrians, a set of pedestrian images of an existing track is regarded as an image library, a detection image is regarded as a query image, and a process of detecting a matching association with the track may be regarded as a process of retrieving the image library from the query image.
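Since the patent leaves the choice of multi-target tracking algorithm open, the following greedy IoU-based association is only a placeholder sketch of how object positions of the same sample object can be grouped across candidate sample training images; it is not the tracker used in the patent.

from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)

def box_iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def associate(detections: Dict[int, List[Box]], iou_thresh: float = 0.3) -> List[List[Tuple[int, Box]]]:
    tracks: List[List[Tuple[int, Box]]] = []
    for frame_idx in sorted(detections):
        for box in detections[frame_idx]:
            # match the detection against the last object position of each existing track
            best, best_iou = None, iou_thresh
            for track in tracks:
                score = box_iou(track[-1][1], box)
                if score > best_iou:
                    best, best_iou = track, score
            if best is None:
                tracks.append([(frame_idx, box)])   # a new target appears: generate a new track
            else:
                best.append((frame_idx, box))       # an existing target: extend its track
    return tracks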
Step a3, determining a sample frame position of the sample object based on the object position of the sample object in each target sample training image, wherein the sample frame position represents a spatial range (including but not limited to an circumscribed rectangular frame, a circumscribed circular frame and a circumscribed polygonal frame) of the object position of the sample object in all target sample training images.
In one possible embodiment, taking the circumscribed rectangular frame as an example, when a coordinate system is established with the upper left corner of the target sample training image as the origin of coordinates, the horizontal rightward direction as the horizontal axis and the vertical downward direction as the vertical axis, the object position may include an upper left corner abscissa, an upper left corner ordinate, a lower right corner abscissa and a lower right corner ordinate. On this basis, the minimum value of the upper left corner abscissa (namely the minimum among the upper left corner abscissas of the sample object in all target sample training images) is selected based on the upper left corner abscissa of the sample object in each target sample training image (namely the upper left corner abscissa of the circumscribed rectangular frame of the object position); the minimum value of the upper left corner ordinate is selected based on the upper left corner ordinate of the sample object in each target sample training image; the maximum value of the lower right corner abscissa is selected based on the lower right corner abscissa of the sample object in each target sample training image; the maximum value of the lower right corner ordinate is selected based on the lower right corner ordinate of the sample object in each target sample training image; and the sample frame position of the sample object is determined from the minimum value of the upper left corner abscissa, the minimum value of the upper left corner ordinate, the maximum value of the lower right corner abscissa and the maximum value of the lower right corner ordinate.
For example, referring to fig. 3A, a coordinate system may be established with the upper left corner of the target sample training image as the origin of coordinates, the horizontal rightward direction as the horizontal axis and the vertical downward direction as the vertical axis. When all object positions belonging to the same sample object are determined by using a tracking algorithm, each object position may include an upper left corner coordinate (upper left corner abscissa left_top_x, upper left corner ordinate left_top_y) and a lower right corner coordinate (lower right corner abscissa right_bottom_x, lower right corner ordinate right_bottom_y). Then, the minimum value of the upper left corner abscissa is selected based on all the upper left corner abscissas and denoted as min({left_top_x}), and the minimum value of the upper left corner ordinate is selected based on all the upper left corner ordinates and denoted as min({left_top_y}). The maximum value of the lower right corner abscissa is selected based on all the lower right corner abscissas and denoted as max({right_bottom_x}), and the maximum value of the lower right corner ordinate is selected based on all the lower right corner ordinates and denoted as max({right_bottom_y}).
Then, min({left_top_x}) and min({left_top_y}) are combined into one coordinate point A1, and max({right_bottom_x}) and max({right_bottom_y}) are combined into one coordinate point A2; the rectangular frame formed by the coordinate point A1 and the coordinate point A2 is the sample frame position of the sample object.
In another possible embodiment, with the lower left corner of the target sample training image as the origin of coordinates, the horizontal rightward direction as the horizontal axis and the vertical upward direction as the vertical axis, the object position includes a lower left corner abscissa, a lower left corner ordinate, an upper right corner abscissa and an upper right corner ordinate. On this basis, the minimum value of the lower left corner abscissa is selected based on the lower left corner abscissa of the sample object in each target sample training image; the minimum value of the lower left corner ordinate is selected based on the lower left corner ordinate of the sample object in each target sample training image; the maximum value of the upper right corner abscissa is selected based on the upper right corner abscissa of the sample object in each target sample training image; the maximum value of the upper right corner ordinate is selected based on the upper right corner ordinate of the sample object in each target sample training image; and the sample frame position of the sample object is determined from the minimum value of the lower left corner abscissa, the minimum value of the lower left corner ordinate, the maximum value of the upper right corner abscissa and the maximum value of the upper right corner ordinate.
For example, referring to fig. 3B, a coordinate system may be established with the lower left corner of the target sample training image as the origin of coordinates, the horizontal rightward direction as the horizontal axis and the vertical upward direction as the vertical axis. When all object positions belonging to the same sample object are determined by using a tracking algorithm, each object position may include a lower left corner coordinate (lower left corner abscissa left_bottom_x, lower left corner ordinate left_bottom_y) and an upper right corner coordinate (upper right corner abscissa right_top_x, upper right corner ordinate right_top_y). Then, the minimum value of the lower left corner abscissa is selected based on all the lower left corner abscissas and denoted as min({left_bottom_x}), and the minimum value of the lower left corner ordinate is selected based on all the lower left corner ordinates and denoted as min({left_bottom_y}). The maximum value of the upper right corner abscissa is selected based on all the upper right corner abscissas and denoted as max({right_top_x}), and the maximum value of the upper right corner ordinate is selected based on all the upper right corner ordinates and denoted as max({right_top_y}).
Then, min({left_bottom_x}) and min({left_bottom_y}) are combined into one coordinate point B1, and max({right_top_x}) and max({right_top_y}) are combined into one coordinate point B2; the rectangular frame formed by the coordinate point B1 and the coordinate point B2 is the sample frame position of the sample object.
Of course, the above manner is merely two examples, and is not limited thereto, as long as the spatial range (e.g., circumscribed rectangular frame) of the object position of the sample object in all the target sample training images can be determined.
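Under the first convention above (origin at the upper left corner, x to the right, y downward), the circumscribed rectangle can be computed directly from the per-image object positions, as in this sketch:

from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (left_top_x, left_top_y, right_bottom_x, right_bottom_y)

def sample_frame_position(object_positions: List[Box]) -> Box:
    min_left_top_x = min(p[0] for p in object_positions)        # min({left_top_x})
    min_left_top_y = min(p[1] for p in object_positions)        # min({left_top_y})
    max_right_bottom_x = max(p[2] for p in object_positions)    # max({right_bottom_x})
    max_right_bottom_y = max(p[3] for p in object_positions)    # max({right_bottom_y})
    # coordinate point A1 = (min x, min y), coordinate point A2 = (max x, max y)
    return (min_left_top_x, min_left_top_y, max_right_bottom_x, max_right_bottom_y)

print(sample_frame_position([(10, 20, 50, 80), (12, 18, 55, 90), (8, 25, 48, 85)]))   # -> (8, 18, 55, 90)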
Step a4, acquiring a sample behavior sequence according to the sample frame position, wherein the sample behavior sequence can comprise a sample frame sub-image selected from each target sample training image based on the sample frame position.
Referring to the above embodiment, a plurality of target sample training images of a sample object and a sample frame position of the sample object may be obtained, and for each target sample training image, a sub-image matching the sample frame position may be truncated from the target sample training image, and the sub-image may be used as a sample frame sub-image. For example, a rectangular frame is determined based on the sample frame position, the upper left-corner abscissa of the rectangular frame may be the minimum value of the upper left-corner abscissa, the upper left-corner ordinate of the rectangular frame may be the minimum value of the upper left-corner ordinate, the lower right-corner abscissa of the rectangular frame may be the maximum value of the lower right-corner abscissa, the lower right-corner ordinate of the rectangular frame may be the maximum value of the lower right-corner ordinate, and after the rectangular frame is obtained, the sub-image matched with the rectangular frame in the target sample training image may be taken as the sample frame sub-image.
After obtaining the sample box sub-images in each target sample training image, the sample box sub-images may be organized into a sample behavior sequence, i.e., the sample behavior sequence may include a plurality of sample box sub-images.
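A sketch of this cropping step, assuming each target sample training image is a NumPy array in height-width-channel layout and that the sample frame position is rounded to integer pixel coordinates:

from typing import List, Tuple
import numpy as np

def build_sample_behavior_sequence(
    target_sample_images: List[np.ndarray],
    sample_frame_position: Tuple[float, float, float, float],   # (x1, y1, x2, y2)
) -> List[np.ndarray]:
    x1, y1, x2, y2 = (int(round(v)) for v in sample_frame_position)
    # one sample frame sub-image per target sample training image
    return [image[y1:y2, x1:x2] for image in target_sample_images]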
And a5, determining calibration information of the sample behavior sequence, such as a predicted behavior category of the sample behavior sequence.
For example, referring to the above embodiment, the sample training video includes a calibration sample training image, and the calibration information of the calibration sample training image includes the object position where the object (marked as a calibration object) of the specified behavior is located, and the behavior class (such as a fall class) of the specified behavior. Based on this, the calibration frame position of the calibration object can be determined based on the object position of the calibration object in each calibration sample training image. The method for determining the position of the calibration frame of the calibration object based on the object position of the calibration object is similar to the method for determining the position of the sample frame of the sample object based on the object position of the sample object, but the sample object in the step a3 is replaced by the calibration object, and the position of the sample frame is replaced by the position of the calibration frame, which is not repeated here.
For example, referring to the above embodiment, a sample frame position and a calibration frame position may be obtained, and the spatial domain matching degree may be determined based on the calibration frame position and the sample frame position. For example, taking the spatial domain matching degree as sIoU, a sample frame may be obtained based on the sample frame position, a calibration frame may be obtained based on the calibration frame position, and the sIoU may be determined using the following formula; of course, this is merely an example and is not limiting.
sIoU = (intersection area of the sample frame and the calibration frame) / (union area of the sample frame and the calibration frame).
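A minimal sketch of this spatial-overlap computation, again assuming (left, top, right, bottom) rectangles; the function name is illustrative:

```python
def siou(sample_box, calib_box):
    """Spatial domain matching degree: intersection area over union area of
    the sample frame and the calibration frame, both (left, top, right, bottom)."""
    ix1 = max(sample_box[0], calib_box[0])
    iy1 = max(sample_box[1], calib_box[1])
    ix2 = min(sample_box[2], calib_box[2])
    iy2 = min(sample_box[3], calib_box[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    def area(box):
        return (box[2] - box[0]) * (box[3] - box[1])

    union = area(sample_box) + area(calib_box) - inter
    return inter / union if union > 0 else 0.0


print(round(siou((35, 58, 125, 210), (40, 60, 120, 200)), 3))  # 0.819
```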
For example, the time domain matching degree may be determined based on the start time and end time of the plurality of calibration sample training images and the start time and end time of the plurality of target sample training images. The acquisition time of each calibration sample training image is determined; the earliest of these acquisition times is taken as the start time of the plurality of calibration sample training images, and the latest as their end time. Likewise, the acquisition time of each target sample training image is determined; the earliest is taken as the start time of the plurality of target sample training images, and the latest as their end time. Taking the time domain matching degree as the tIoU as an example, the tIoU can be determined using the following formula, which is likewise merely an example and is not limiting.
tIoU = (overlap duration of the intervals [ta1, ta2] and [tb1, tb2]) / (union duration of the intervals [ta1, ta2] and [tb1, tb2]), where ta1 denotes the start time of the calibration sample training images, ta2 their end time, tb1 the start time of the target sample training images, and tb2 their end time.
For example, the predicted behavior class of the sample behavior sequence may be determined according to the spatial domain matching degree, the time domain matching degree, and the behavior class of the specified behavior (i.e., the calibration information of the calibration sample training image). If the spatial domain matching degree is greater than a spatial domain matching degree threshold (which may be configured empirically) and the time domain matching degree is greater than a time domain matching degree threshold (which may also be configured empirically), the predicted behavior class of the sample behavior sequence is determined to be the same as the behavior class of the specified behavior; for instance, if the behavior class of the specified behavior is a fall class, the predicted behavior class of the sample behavior sequence is the fall class, i.e., the sample behavior sequence is taken as a positive sample (a real behavior sample). If the spatial domain matching degree is not greater than the spatial domain matching degree threshold, and/or the time domain matching degree is not greater than the time domain matching degree threshold, the predicted behavior class of the sample behavior sequence is determined to be opposite to the behavior class of the specified behavior; for instance, if the behavior class of the specified behavior is a fall class, the predicted behavior class of the sample behavior sequence is a non-fall class, i.e., the sample behavior sequence is taken as a negative sample (a false alarm behavior sample).
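As a hedged sketch of this labeling rule, the tIoU of two [start, end] intervals can be computed and the two thresholds applied as below; the 0.5 thresholds and the precomputed sIoU value of 0.82 are placeholders, since the patent only says the thresholds may be configured empirically:

```python
def tiou(calib_interval, target_interval):
    """Time domain matching degree of two [start, end] time intervals."""
    (a1, a2), (b1, b2) = calib_interval, target_interval
    inter = max(0.0, min(a2, b2) - max(a1, b1))
    union = max(a2, b2) - min(a1, b1)
    return inter / union if union > 0 else 0.0


def label_sample_sequence(siou_value, tiou_value, specified_class,
                          siou_thresh=0.5, tiou_thresh=0.5):
    """Positive sample when both matching degrees exceed their thresholds,
    otherwise a false-alarm (negative) sample."""
    if siou_value > siou_thresh and tiou_value > tiou_thresh:
        return specified_class              # e.g. "fall"
    return "non-" + specified_class         # e.g. "non-fall"


print(label_sample_sequence(0.82, tiou((3.0, 8.0), (4.0, 9.0)), "fall"))  # fall
```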
Step a6, constructing a behavior sequence data set, wherein the behavior sequence data set can comprise the sample behavior sequence and calibration information of the sample behavior sequence, such as a predicted behavior category of the sample behavior sequence.
For example, referring to step a4 and step a5, a sample behavior sequence and calibration information of the sample behavior sequence may be obtained, and the sample behavior sequence and the calibration information may be combined to obtain a behavior sequence data set.
For example, assume that a user marks 100 hand-waving behaviors and 120 drinking behaviors in the sample training videos, and that a large number of hand-waving and drinking tracks are obtained through detection and tracking by the target image behavior detection model, each track containing the temporal and spatial information of a behavior. Through automatic spatio-temporal matching with the user-marked behaviors, at most 100 hand-waving tracks and 120 drinking sub-tracks can be generated, where a sub-track is the portion of an original track remaining after temporal matching. For tracks that fail to match, false alarm behavior samples are automatically constructed; based on the successfully matched tracks and the unmatched tracks, image sequences in the corresponding spatio-temporal ranges are extracted from the sample training videos to form the behavior sequence data set.
The behavior sequence recognition model automatic training module: the behavior sequence data set building module may input a behavior sequence data set to the behavior sequence recognition model auto-training module, the behavior sequence data set may include the sample behavior sequence and calibration information for the sample behavior sequence. The automatic training module of the behavior sequence recognition model can input the sample behavior sequence and the calibration information of the sample behavior sequence into the initial behavior sequence recognition model so as to train the initial behavior sequence recognition model through the sample behavior sequence and the calibration information of the sample behavior sequence to obtain a trained target behavior sequence recognition model.
Illustratively, since the calibration information includes a predicted behavior class of the sample behavior sequence, the target behavior sequence recognition model is used to fit the mapping relationship of the feature vector and the behavior class.
The automatic training module of the behavior sequence recognition model loads a preset behavior recognition model template (including but not limited to a behavior recognition model template of TSN, C3D, P3D, I3D, slowfast-Net and the like, the behavior recognition model template is used as an initial behavior sequence recognition model), and the initial behavior sequence recognition model is automatically trained based on a behavior sequence data set. For example, after the behavior sequence data set is input into the initial behavior sequence recognition model, the initial behavior sequence recognition model is automatically trained based on training parameters (such as training iteration times, training optimization strategies, training stopping condition strategies and the like), the training process is not limited, and after the training process is finished, the initial behavior sequence recognition model which has completed training is used as the target behavior sequence recognition model.
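As a rough illustration of this automatic training step only (a stand-in: the real TSN/C3D/P3D/I3D/SlowFast templates are far larger networks, and the tensor shapes, iteration count and optimizer below are arbitrary assumptions of this sketch), a minimal PyTorch-style loop over a behavior sequence data set might look like this:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy behavior sequence data set: 16 sequences of 8 RGB crops at 64x64 pixels,
# each labelled 0 (false alarm sample) or 1 (real behavior sample).
clips = torch.randn(16, 3, 8, 64, 64)
labels = torch.randint(0, 2, (16,))
loader = DataLoader(TensorDataset(clips, labels), batch_size=4, shuffle=True)

# Stand-in for a loaded recognition template: a tiny 3D CNN classifier.
model = nn.Sequential(
    nn.Conv3d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(8, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

for epoch in range(3):                        # training iteration count
    for batch_clips, batch_labels in loader:  # automatic training loop
        optimizer.zero_grad()
        loss = criterion(model(batch_clips), batch_labels)
        loss.backward()
        optimizer.step()
```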
In summary, the input of the automatic training module of the behavior sequence recognition model is a behavior sequence data set, and the output is a target behavior sequence recognition model. The output of the automatic training module of the behavior sequence recognition model may also include training progress of the initial behavior sequence recognition model, key training state information, and the like, which is not limited.
Referring to fig. 4, a schematic diagram of a deployment detection process is shown, through the deployment detection process, a video to be detected can be detected based on a trained target image behavior detection model and a trained target behavior sequence recognition model, and a target behavior class corresponding to the video to be detected can be obtained. By way of example, the deployment detection process may be implemented by an automatic reasoning module, a behavior detection result visualization module, and the like.
Automatic reasoning module: the automatic reasoning module completes behavior detection of the video to be detected based on the target image behavior detection model and the target behavior sequence recognition model, and is described below in connection with the specific detection steps:
and b1, acquiring a video to be detected, wherein the video to be detected comprises a plurality of images to be detected.
And b2, inputting the video to be detected into a trained target image behavior detection model, and outputting, by the target image behavior detection model, the object position in each candidate to-be-detected image among a plurality of candidate to-be-detected images, wherein a candidate to-be-detected image is a to-be-detected image, among the plurality of to-be-detected images, in which an object is present.
For example, the video to be detected may be input to the target image behavior detection model. Since the target image behavior detection model is used for fitting the mapping relationship between image feature vectors and behavior classes and object positions, the model may, for each image to be detected, process the image to be detected to obtain its image feature vector. If the image feature vector of the image to be detected corresponds to a behavior class and an object position, the target image behavior detection model takes the image to be detected as a candidate to-be-detected image and outputs the initial behavior class and the object position corresponding to that candidate image. If the image feature vector of the image to be detected does not correspond to any behavior class and object position, the image to be detected is not taken as a candidate to-be-detected image.
In summary, the target image behavior detection model may output the object position and the initial behavior class in the candidate to-be-detected image, where the candidate to-be-detected image is the to-be-detected image in which the object (e.g., the object that has the falling behavior) exists, and the object position represents the coordinates of the object in the candidate to-be-detected image.
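For illustration, assuming the trained detector is wrapped in a callable detect(image) that returns a list of (object_position, initial_behavior_class) pairs (empty when nothing is found), candidate selection could be sketched as:

```python
def select_candidate_images(images, detect):
    """Run the target image behavior detection model frame by frame and keep
    only the frames in which an object (and hence a behavior) is detected."""
    candidates = []
    for frame_index, image in enumerate(images):
        detections = detect(image)          # [(box, initial_class), ...]
        if detections:                      # frame contains at least one object
            candidates.append((frame_index, detections))
    return candidates


def dummy_detect(frame):
    # Stand-in detector over stand-in frames (plain integers here):
    # it "finds" a fall in every third frame.
    return [((35, 58, 125, 210), "fall")] if frame % 3 == 0 else []


print(select_candidate_images(range(9), dummy_detect))
```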
And b3, selecting a plurality of target to-be-detected images of the same target object from the plurality of candidate to-be-detected images based on the object positions in the candidate to-be-detected images. For example, based on the object positions in the candidate to-be-detected images, a tracking algorithm may be used to determine a target object, and the tracking algorithm may then be used to determine, for each candidate to-be-detected image, whether its object positions include an object position of that target object; if so, the candidate to-be-detected image is determined to be a target to-be-detected image of the target object.
For example, a tracking algorithm (such as a multi-target tracking algorithm) may be used to determine a plurality of object positions belonging to the same object (the object is marked as a target object) based on the object positions in all the candidate to-be-detected images, and the candidate to-be-detected images where the object positions are located are determined as target to-be-detected images of the target object. The implementation process of step b3 is similar to that of step a2, and will not be described here again.
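The patent leaves the tracking algorithm open (any multi-target tracking algorithm may be used). Purely as a simplified stand-in, a greedy frame-to-frame association by box overlap could group candidate detections into per-object tracks, as sketched below; the overlap measure is passed in (e.g., the siou() sketch above), and the 0.3 threshold is an assumption:

```python
def group_by_target(candidates, overlap, overlap_thresh=0.3):
    """Greedy association: a detection joins an existing track when its box
    overlaps the track's last box enough, otherwise it starts a new track.

    candidates: list of (frame_index, [(box, initial_class), ...]) pairs.
    Returns a list of tracks, each a list of (frame_index, box) pairs that
    belong to one target object.
    """
    tracks = []
    for frame_index, detections in candidates:
        for box, _initial_class in detections:
            best = max(tracks, key=lambda t: overlap(t[-1][1], box), default=None)
            if best is not None and overlap(best[-1][1], box) > overlap_thresh:
                best.append((frame_index, box))      # same target object
            else:
                tracks.append([(frame_index, box)])  # new target object
    return tracks


# Usage: tracks = group_by_target(candidates, overlap=siou)
```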
And b4, determining a target frame position of the target object based on the object position of the target object in each target to-be-detected image, wherein the target frame position represents the spatial range (including but not limited to an external rectangular frame, an external circular frame and an external polygonal frame) of the object position of the target object in all target to-be-detected images.
In one possible implementation manner, taking an external rectangular frame as an example, a coordinate system is established with the upper left corner of the target to-be-detected image as the origin of coordinates, the horizontal rightward direction as the horizontal axis and the vertical downward direction as the vertical axis, the object position comprising an upper left corner abscissa, an upper left corner ordinate, a lower right corner abscissa and a lower right corner ordinate. The minimum value of the upper left corner abscissa is selected based on the upper left corner abscissa of the target object in each target to-be-detected image; the minimum value of the upper left corner ordinate is selected based on the upper left corner ordinate of the target object in each target to-be-detected image; the maximum value of the lower right corner abscissa is selected based on the lower right corner abscissa of the target object in each target to-be-detected image; and the maximum value of the lower right corner ordinate is selected based on the lower right corner ordinate of the target object in each target to-be-detected image. Then, the target frame position of the target object is determined from the minimum value of the upper left corner abscissa, the minimum value of the upper left corner ordinate, the maximum value of the lower right corner abscissa and the maximum value of the lower right corner ordinate.
In another possible implementation manner, again taking an external rectangular frame as an example, a coordinate system is established with the lower left corner of the target to-be-detected image as the origin of coordinates, the horizontal rightward direction as the horizontal axis and the vertical upward direction as the vertical axis, the object position comprising a lower left corner abscissa, a lower left corner ordinate, an upper right corner abscissa and an upper right corner ordinate. The minimum value of the lower left corner abscissa is selected based on the lower left corner abscissa of the target object in each target to-be-detected image; the minimum value of the lower left corner ordinate is selected based on the lower left corner ordinate of the target object in each target to-be-detected image; the maximum value of the upper right corner abscissa is selected based on the upper right corner abscissa of the target object in each target to-be-detected image; and the maximum value of the upper right corner ordinate is selected based on the upper right corner ordinate of the target object in each target to-be-detected image. The target frame position of the target object is then determined from the minimum value of the lower left corner abscissa, the minimum value of the lower left corner ordinate, the maximum value of the upper right corner abscissa and the maximum value of the upper right corner ordinate.
The implementation process of step b4 is similar to that of step a3, and the detailed description is not repeated here.
And b5, acquiring a behavior sequence to be detected according to the target frame position, wherein the behavior sequence to be detected comprises target frame sub-images selected from each target image to be detected based on the target frame position.
For each target to-be-detected image, a sub-image matched with the target frame position is cut from the target to-be-detected image, and the sub-image serves as a target frame sub-image. For example, a rectangular frame is determined based on the target frame position, and after the rectangular frame is obtained, a sub-image matched with the rectangular frame in the target to-be-detected image can be used as a target frame sub-image.
For example, using the target frame position, a region of interest may be cropped in turn from each target to-be-detected image, and these regions of interest form a target behavior spatio-temporal cube, i.e., the above-described behavior sequence to be detected. This way of extracting the behavior sequence to be detected greatly reduces background information without losing target behavior information, which is more conducive to detection of the behavior sequence to be detected and improves detection accuracy.
After obtaining the target frame sub-images in each target to-be-detected image, the target frame sub-images may be formed into a to-be-detected behavior sequence, i.e., the to-be-detected behavior sequence may include a plurality of target frame sub-images.
And b6, inputting the behavior sequence to be detected into a trained target behavior sequence recognition model, and outputting a target behavior class corresponding to the behavior sequence to be detected by the target behavior sequence recognition model.
For example, the behavior sequence to be detected may be input to a target behavior sequence recognition model, and since the target behavior sequence recognition model is used for fitting a mapping relationship between a feature vector (i.e., a feature vector of a sample behavior sequence) and a behavior class, the target behavior sequence recognition model may process the behavior sequence to be detected to obtain the feature vector of the behavior sequence to be detected, and determine the behavior class corresponding to the feature vector, where the behavior class is the target behavior class corresponding to the behavior sequence to be detected. In summary, the target behavior sequence recognition model may output a target behavior class corresponding to the behavior sequence to be detected.
In one possible implementation, the automatic reasoning module may also take the form of a time sliding window, and perform sliding window slicing (including but not limited to non-overlapping sliding window, continuous frame sliding window, changing frame interval sliding window, etc.) according to a certain time window size (including but not limited to fixed frame number, changing frame number, etc.), so as to obtain the behavior sequence to be detected. For example, sliding window slicing is performed according to a certain time window size based on all the target to-be-detected images, and all or part of the target to-be-detected images are selected, such as selecting a first frame of target to-be-detected image, a third frame of target to-be-detected image, a fifth frame of target to-be-detected image, and so on. And intercepting target frame sub-images from each selected target to-be-detected image based on the target frame position, and forming the target frame sub-images into a to-be-detected behavior sequence. Of course, the above manner is merely an example, and is not limited thereto.
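As an illustration of the sliding-window variant, with a purely assumed window size of 8 frames and stride of 2 frames (the patent also allows non-overlapping windows, continuous-frame windows, changing frame intervals, and fixed or changing frame counts):

```python
def sliding_window_indices(num_frames, window=8, stride=2):
    """Slice a track of target to-be-detected images into fixed-size,
    possibly overlapping windows of frame indices; each window then yields
    one behavior sequence to be detected."""
    return [list(range(start, start + window))
            for start in range(0, num_frames - window + 1, stride)]


print(sliding_window_indices(12, window=8, stride=2))
# three overlapping windows starting at frames 0, 2 and 4
```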
And b7, carrying out alarm processing according to the target behavior category. Alternatively, alarm processing is performed based on the target behavior class and the initial behavior class (output by the target image behavior detection model).
In one possible implementation, the alarm processing may be performed according to the target behavior class, for example, assuming that the target behavior class is a class a (such as a fall class), an alarm message for the class a may be generated, where the alarm message may carry information of the class a, and indicates that the behavior of the class a exists in the video to be detected. The alarm message may also carry time information (such as a start time and an end time) of a plurality of target images to be detected of the target object, and represent the behavior of the type a occurring in the images to be detected in the time information. The alert message may also carry a target frame location indicating that the target frame location is subject to category a behavior.
In another possible implementation manner, the alarm processing may be performed according to the target behavior class and the initial behavior class, for example, if the target behavior class is a class a and the initial behavior class is a class a, that is, the target behavior class is the same as the initial behavior class, an alarm message for the class a may be generated, where the alarm message may carry information of the class a, which indicates that the behavior of the class a exists in the video to be detected.
If the target behavior class is a class a and the initial behavior class is a class B, that is, the target behavior class is different from the initial behavior class, an alarm message for the class a (the alarm message may carry information of the class a and indicates that the behavior of the class a exists in the video to be detected) may be generated, an alarm message for the class B (the alarm message may carry information of the class B and indicates that the behavior of the class B exists in the video to be detected) may be generated, or an alarm message for the class a and an alarm message for the class B may not be generated.
By way of example, alarm control strategies (including but not limited to controlling the number of alarms for the same target, controlling the number of alarms for the same behavior, controlling the alarm spatial region, controlling the dwell time of the alarm target, etc.) may also be configured to reduce the number of system alarms. For example, suppose the allowed number of alarms for each behavior is 3: if an alarm message for category A needs to be generated, it is first determined whether the number of alarms already issued for category A has reached 3; if not, an alarm message for category A is generated, and if it has, no alarm message for category A is generated.
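A small sketch of the same-behavior alarm-count strategy described above (the class name, message strings and limit are placeholders; the other listed strategies are omitted):

```python
from collections import Counter


class AlarmController:
    """Suppress further alarms for a behavior category once a configured
    per-category limit has been reached."""

    def __init__(self, max_alarms_per_category=3):
        self.max_alarms = max_alarms_per_category
        self.counts = Counter()

    def try_alarm(self, behavior_class, message):
        if self.counts[behavior_class] >= self.max_alarms:
            return False                     # limit reached, drop the alarm
        self.counts[behavior_class] += 1
        print(f"ALARM[{behavior_class}]: {message}")
        return True


controller = AlarmController(max_alarms_per_category=3)
for _ in range(5):
    controller.try_alarm("fall", "fall behavior detected in video")
# Only the first three calls emit an alarm message.
```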
And a behavior detection result visualization module: personalized display is performed according to the set interesting behavior set (which can comprise various types of target behaviors). After the automatic reasoning module detects the target behavior, the behavior detection result visualization module can analyze whether the target behavior is in the interesting behavior set, if yes, time information (such as time information of a plurality of target images to be detected, such as starting time and ending time) and space information (such as target frame positions) of the target behavior are recorded. When the video to be detected is played to the image to be detected corresponding to the time information, the position of the target frame can be overlapped in the picture of the image to be detected, and information such as the target behavior category, the confidence level and the like can be overlapped, so that a user can conveniently analyze and respond according to the alarm result.
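Assuming OpenCV is used for playback and that the target frame position follows the (left, top, right, bottom) convention, the overlay of the box, behavior category and confidence on one frame could be sketched as follows:

```python
import cv2


def overlay_behavior(frame, target_frame_position, behavior_class, confidence):
    """Draw the target frame position and the detected behavior class (with
    its confidence) onto one frame of the video being played back."""
    left, top, right, bottom = target_frame_position
    cv2.rectangle(frame, (left, top), (right, bottom), (0, 0, 255), 2)
    label = f"{behavior_class} {confidence:.2f}"
    cv2.putText(frame, label, (left, max(top - 8, 12)),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255), 2)
    return frame
```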
In one possible implementation, after the image behavior detection data building module divides the sample video into the sample training video and the sample test video, the sample test video may further be sent to the automatic reasoning module, where the sample test video includes a plurality of sample test images, calibration sample test images exist in the plurality of sample test images, and calibration information of the calibration sample test images includes an actual behavior category.
And the automatic reasoning module completes the behavior detection of the sample test video based on the target image behavior detection model and the target behavior sequence identification model, and obtains the target behavior category of the sample test video. The behavior detection of the sample test video is similar to that of the video to be detected, see steps b1 to b6, and will not be described again here.
After the target behavior category of the sample test video is obtained, whether the target behavior category of the sample test video is the same as the actual behavior category of the calibration sample test image or not can be compared, if so, the behavior detection result of the sample test video is correct, and if not, the behavior detection result of the sample test video is wrong.
After the above processing is performed on a large number of sample test videos, the numbers of correct and incorrect detection results can be counted, and based on these numbers the detection performance (including but not limited to the behavior detection rate and the number of false alarms) of the target image behavior detection model and the target behavior sequence recognition model can be evaluated. If the detection performance is high, the target image behavior detection model and the target behavior sequence recognition model can be deployed, and the video to be detected is detected based on the target image behavior detection model and the target behavior sequence recognition model. If the detection performance is low, the target image behavior detection model and the target behavior sequence recognition model are trained again.
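Purely as an illustration of this evaluation step (the patent does not fix the exact metrics beyond the behavior detection rate and the false alarm count), tallying correct and incorrect results over the sample test videos could look like:

```python
def evaluate(results):
    """Count correct and incorrect detections over the sample test videos.

    results: list of (predicted_class, actual_class) pairs, one pair per
    sample test video.
    """
    correct = sum(1 for predicted, actual in results if predicted == actual)
    wrong = len(results) - correct
    detection_rate = correct / len(results) if results else 0.0
    return {"correct": correct, "wrong": wrong, "detection_rate": detection_rate}


print(evaluate([("fall", "fall"), ("non-fall", "fall"), ("fall", "fall")]))
# 2 correct, 1 wrong, detection rate about 0.667
```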
According to the technical scheme, in the embodiment of the application, the target image behavior detection model can be used for extracting the potential behavior targets in the video to be detected, the target tracking association is used for generating the behavior target tracks, a plurality of target images to be detected of the same target object are obtained, the behavior sequence to be detected is obtained based on the target images to be detected, and then the target behavior sequence identification model is used for outputting the target behavior types corresponding to the behavior sequence to be detected, so that the function of behavior classification (or false alarm removal) is completed.
In the automatic training stage, the user is only required to calibrate the behavior samples in the video; the division into training video and test video is carried out automatically, and the image behavior detection data set is automatically built and used to complete the training of the first-stage image behavior detection model. Extraction of potential behavior samples from the video is then completed based on the image behavior detection model, and through automatic matching with the user calibration, a behavior sequence data set in the scene (e.g., a set of hard-example sequences prone to false alarms) is automatically generated. Training of the second-stage behavior sequence recognition model is automatically completed based on this behavior sequence data set, and the behavior sequence recognition model can significantly reduce false-alarm-prone samples in the scene.
In the automatic reasoning stage, a first-stage image behavior detection model can be used for extracting potential behavior targets in a video to be detected, a behavior target track is generated through target tracking association, continuously existing behavior targets are triggered, a behavior sequence to be detected is extracted, and a second-stage behavior sequence recognition model is used for recognizing the behavior sequence to be detected, so that the function of behavior classification (or false alarm removal) is completed.
In this manner, the user only needs to complete the calibration of the behaviors of interest; training of the image behavior detection model and the behavior sequence recognition model can be completed automatically, potential false alarms are mined from the video, and the behavior sequence data set is built adaptively, ensuring that the image behavior detection model and the behavior sequence recognition model match the false alarm distribution of the current scene, so that false alarms of the whole system can be effectively reduced.
In summary, based on the video behavior calibration, the system can automatically complete the assembly of a plurality of behavior detection data sets, automatically complete the training of an image behavior detection model and a behavior sequence recognition model, automatically complete performance evaluation, reduce the threshold of a user using the system, and facilitate popularization and use under a plurality of tasks in a plurality of scenes. The method for automatically extracting the behavior sequence data set in the scene based on the image behavior detection model completes the self-adaptive extraction of the behavior sample and the background sample in the scene, obviously reduces false alarm in the scene and improves the scene adaptability of the behavior detection system. After the user finishes video uploading and labeling, the user can pay attention to the progress of training and reasoning, and finally, the overall performance evaluation data of the system can be obtained, so that the user experience is improved.
Based on the same application concept as the above method, an apparatus for detecting behavior is further provided in an embodiment of the present application, as shown in fig. 5, which is a structural diagram of the apparatus, where the apparatus includes: the acquiring module 51 is configured to acquire a video to be detected, where the video to be detected includes a plurality of images to be detected; an input module 52, configured to input the video to be detected into a trained target image behavior detection model, and output, by the target image behavior detection model, a position of an object in each candidate image to be detected in a plurality of candidate images to be detected; wherein the candidate image to be detected is an image to be detected of an object existing in the plurality of images to be detected; a determining module 53, configured to select a plurality of target to-be-detected images of the same target object from the plurality of candidate to-be-detected images based on the object positions of the target objects in the candidate to-be-detected images, and determine a target frame position of the target object based on the object position of the target object in each target to-be-detected image; the obtaining module 51 is further configured to obtain a behavior sequence to be detected according to the target frame position, where the behavior sequence to be detected includes a target frame sub-image selected from each target image to be detected based on the target frame position; the input module 52 is further configured to input the behavior sequence to be detected to a trained target behavior sequence recognition model, and output, by the target behavior sequence recognition model, a target behavior class corresponding to the behavior sequence to be detected.
In one possible implementation manner, the determining module 53 is specifically configured to, when selecting a plurality of target to-be-detected images of the same target object from the plurality of candidate to-be-detected images based on the object positions in the candidate to-be-detected images: determining a target object by adopting a tracking algorithm based on the object position in the candidate image to be detected, and determining whether the object position in the candidate image to be detected has the object position of the target object or not; if yes, the candidate image to be detected is determined to be the target image to be detected of the target object.
In one possible embodiment, the device further comprises (not shown in the figures):
the processing module is used for carrying out alarm processing according to the target behavior category; or,
and if the target image behavior detection model also outputs an initial behavior category corresponding to the video to be detected, carrying out alarm processing according to the target behavior category and the initial behavior category.
In one possible embodiment, the device further comprises (not shown in the figures):
the training module is used for training the target image behavior detection model; the training module is specifically used for training the target image behavior detection model: acquiring a sample training video, wherein the sample training video comprises a plurality of sample training images, and the plurality of sample training images comprise a plurality of calibration sample training images with appointed behaviors; inputting the calibration sample training image and the calibration information of the calibration sample training image into an initial image behavior detection model to train the initial image behavior detection model through the calibration sample training image and the calibration information so as to obtain a trained target image behavior detection model; wherein, the calibration information at least comprises: and the calibration sample trains the object position of the object where the specified behavior occurs in the image, and the behavior category of the specified behavior.
In a possible implementation manner, the training module is further configured to train the target behavior sequence recognition model; the training module is specifically used for training the target behavior sequence recognition model: inputting the sample training video to a trained target image behavior detection model, and outputting object positions in each candidate sample training image in a plurality of candidate sample training images by the target image behavior detection model; wherein the candidate sample training image is a sample training image of an object present in the plurality of sample training images; selecting a plurality of target sample training images of the same sample object from the plurality of candidate sample training images based on object positions in the candidate sample training images, and determining a sample frame position of the sample object based on the object position of the sample object in each target sample training image; acquiring a sample behavior sequence according to the sample frame position, wherein the sample behavior sequence comprises a sample frame sub-image selected from each target sample training image based on the sample frame position; and inputting the sample behavior sequence and the calibration information of the sample behavior sequence into an initial behavior sequence recognition model, so as to train the initial behavior sequence recognition model through the sample behavior sequence and the calibration information of the sample behavior sequence, and obtain a trained target behavior sequence recognition model.
In a possible implementation manner, the calibration information of the sample behavior sequence includes a predicted behavior class of the sample behavior sequence, and the training module is further configured to: determine a calibration frame position of a calibration object based on the object position of the calibration object in each calibration sample training image, and determine a spatial domain matching degree based on the calibration frame position and the sample frame position; determine a time domain matching degree based on the start time and end time of the plurality of calibration sample training images and the start time and end time of the plurality of target sample training images; and determine the predicted behavior class of the sample behavior sequence according to the spatial domain matching degree, the time domain matching degree and the behavior class of the specified behavior.
In a possible implementation manner, the training module is specifically configured to determine, according to the spatial matching degree, the temporal matching degree, and the behavior class of the specified behavior, a predicted behavior class of the sample behavior sequence: if the spatial domain matching degree is larger than a spatial domain matching degree threshold, determining that the predicted behavior class is the same as the behavior class of the appointed behavior; otherwise, determining that the predicted behavior class is opposite to the behavior class of the specified behavior.
Based on the same application concept as the above method, the embodiment of the present application further provides a behavior detection device, and in terms of hardware level, a schematic diagram of a hardware architecture of the behavior detection device may be shown in fig. 6. The behavior detection device may include: a processor 61 and a machine-readable storage medium 62, the machine-readable storage medium 62 storing machine-executable instructions executable by the processor 61; the processor 61 is configured to execute machine-executable instructions to implement the methods disclosed in the above examples of the present application. For example, the processor 61 is configured to execute machine executable instructions to implement the steps of:
acquiring a video to be detected, wherein the video to be detected comprises a plurality of images to be detected;
inputting the video to be detected into a trained target image behavior detection model, and outputting the object position in each candidate image to be detected in a plurality of candidate images to be detected by the target image behavior detection model; the candidate images to be detected are images to be detected of objects in the plurality of images to be detected;
selecting a plurality of target to-be-detected images of the same target object from the plurality of candidate to-be-detected images based on the object positions of the target objects in the candidate to-be-detected images, and determining the target frame positions of the target objects based on the object positions of the target objects in each target to-be-detected image;
Acquiring a behavior sequence to be detected according to the target frame position, wherein the behavior sequence to be detected comprises target frame sub-images selected from each target image to be detected based on the target frame position;
and inputting the behavior sequence to be detected into a trained target behavior sequence recognition model, and outputting a target behavior class corresponding to the behavior sequence to be detected by the target behavior sequence recognition model.
Based on the same application concept as the above method, the embodiment of the present application further provides a machine-readable storage medium, where the machine-readable storage medium stores a number of computer instructions, where the computer instructions can implement the method disclosed in the above example of the present application when executed by a processor.
For example, the computer instructions, when executed by a processor, can implement the steps of:
acquiring a video to be detected, wherein the video to be detected comprises a plurality of images to be detected;
inputting the video to be detected into a trained target image behavior detection model, and outputting the object position in each candidate image to be detected in a plurality of candidate images to be detected by the target image behavior detection model; the candidate images to be detected are images to be detected of objects in the plurality of images to be detected;
Selecting a plurality of target to-be-detected images of the same target object from the plurality of candidate to-be-detected images based on the object positions of the target objects in the candidate to-be-detected images, and determining the target frame positions of the target objects based on the object positions of the target objects in each target to-be-detected image;
acquiring a behavior sequence to be detected according to the target frame position, wherein the behavior sequence to be detected comprises target frame sub-images selected from each target image to be detected based on the target frame position;
and inputting the behavior sequence to be detected into a trained target behavior sequence recognition model, and outputting a target behavior class corresponding to the behavior sequence to be detected by the target behavior sequence recognition model.
By way of example, the machine-readable storage medium may be any electronic, magnetic, optical, or other physical storage device that can contain or store information, such as executable instructions, data, and the like. For example, a machine-readable storage medium may be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (e.g., a hard drive), a solid state drive, any type of storage disk (e.g., an optical disk, a DVD, etc.), a similar storage medium, or a combination thereof.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in the same piece or pieces of software and/or hardware when implementing the present application.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the application may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Moreover, these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims (8)

the calibration information of the sample behavior sequence comprises the predicted behavior category of the sample behavior sequence, the calibration frame position of the calibration object is determined based on the object position of the calibration object in each calibration sample training image, and the spatial domain matching degree is determined based on the calibration frame position and the sample frame position; determining a time domain matching degree based on the starting time and the ending time of the plurality of calibration sample training images and the starting time and the ending time of the plurality of target sample training images; and determining the predicted behavior category of the sample behavior sequence according to the spatial domain matching degree, the time domain matching degree and the behavior category of the specified behavior.
establishing a coordinate system by taking the upper left corner position of the target to-be-detected image as an origin of coordinates, taking the horizontal rightward direction as a horizontal axis and the vertical downward direction as a vertical axis, wherein the object position comprises an upper left corner abscissa, an upper left corner ordinate, a lower right corner abscissa and a lower right corner ordinate, and selecting the minimum value of the upper left corner abscissa based on the upper left corner abscissa of the target object in each target to-be-detected image; selecting the minimum value of the upper left corner ordinate based on the upper left corner ordinate of the target object in each target to-be-detected image; selecting the maximum value of the lower right corner abscissa based on the lower right corner abscissa of the target object in each target to-be-detected image; selecting the maximum value of the lower right corner ordinate based on the lower right corner ordinate of the target object in each target to-be-detected image;
establishing a coordinate system by taking the lower left corner position of the target to-be-detected image as an origin of coordinates, taking the horizontal rightward direction as a horizontal axis and the vertical upward direction as a vertical axis, wherein the object position comprises a lower left corner abscissa, a lower left corner ordinate, an upper right corner abscissa and an upper right corner ordinate, and selecting the minimum value of the lower left corner abscissa based on the lower left corner abscissa of the target object in each target to-be-detected image; selecting the minimum value of the lower left corner ordinate based on the lower left corner ordinate of the target object in each target to-be-detected image; selecting the maximum value of the upper right corner abscissa based on the upper right corner abscissa of the target object in each target to-be-detected image; selecting the maximum value of the upper right corner ordinate based on the upper right corner ordinate of the target object in each target to-be-detected image;
the calibration information of the sample behavior sequence comprises the predicted behavior category of the sample behavior sequence, the calibration frame position of the calibration object is determined based on the object position of the calibration object in each calibration sample training image, and the spatial domain matching degree is determined based on the calibration frame position and the sample frame position; determining a time domain matching degree based on the starting time and the ending time of the plurality of calibration sample training images and the starting time and the ending time of the plurality of target sample training images; and determining the predicted behavior category of the sample behavior sequence according to the spatial domain matching degree, the time domain matching degree and the behavior category of the specified behavior.
Priority Applications (1)

Application Number: CN202010821323.2A — Priority Date: 2020-08-14 — Filing Date: 2020-08-14 — Title: Behavior detection method, device and equipment

Applications Claiming Priority (1)

Application Number: CN202010821323.2A — Priority Date: 2020-08-14 — Filing Date: 2020-08-14 — Title: Behavior detection method, device and equipment

Publications (2)

CN111985385A (en) — published 2020-11-24
CN111985385B (en) — granted 2023-08-29

Family
ID=73433922

Family Applications (1)

Application Number: CN202010821323.2A — Title: Behavior detection method, device and equipment — Priority Date: 2020-08-14 — Filing Date: 2020-08-14 — Status: Active

Country Status (1)

Country: CN — CN111985385B (en)




Legal Events

Code: PB01 — Publication
Code: SE01 — Entry into force of request for substantive examination
Code: GR01 — Patent grant
