Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
In the present invention, the AI processing of video proceeds as follows:
1. Video decoding: the video stream is transmitted to a CPU or a GPU for decoding, and the image of each frame is extracted. A video frame rate of 24 frames per second generally meets real-time processing requirements, which corresponds to 24 images per second.
2. Target detection (Detection): detection locates the coordinate position of the target object in a frame of image data, for example by inputting the frame image into an AI model for detection and identification. A common target detection algorithm is You Only Look Once (YOLO), a publicly available algorithm that can be trained to recognize objects according to the actual needs of the user: if persons need to be recognized, a human body detection model is trained with YOLO; if trees need to be recognized, a tree detection model is trained with YOLO. For each target object detected in a frame image, a bounding box (the minimum rectangular frame containing the target image) is output and an effective area image is extracted. This framing process is generally performed on a GPU to accelerate processing.
3. Feature extraction (Extract): feature vector (feature) data of the effective area image is extracted. A feature vector is a mathematical representation of an image's feature values; for example, the feature vector of a person wearing yellow clothes differs from that of a person wearing black clothes. After the effective area image (such as a pedestrian or a tree) is extracted, its feature vector is extracted by the AI model. Feature extraction is generally performed on the GPU to accelerate processing.
4. Tracking (Tracking): effective area images with the same features detected in successive frames are associated. The extracted feature vector data of each frame image are compared, and the matching bounding boxes extracted from preceding and following frames of the video are associated by coordinates and drawn into a continuous line over the time sequence. Tracking can generally be carried out directly on a CPU without GPU computation.
5. Acquiring the required service data according to service requirements. For example, if pedestrian counting is required, the number of pedestrians passing through a specific area is counted: a line is drawn in the video in advance to mark the specific area, and, according to the trajectory result of target tracking, every tracking line that intersects the drawn line is taken as one counting unit. Car detection, dangerous goods detection, and the like can likewise be performed, with the obtained effective area images (cars or dangerous goods) associated across the frame images. A minimal code sketch of this decode, detect, extract, and track pipeline follows this list.
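For concreteness, the following is a minimal sketch of this pipeline, not the implementation claimed here: it assumes the open-source ultralytics YOLO package and OpenCV are installed, uses a color histogram as a stand-in feature vector, and the model file name, video file name, and matching threshold are all illustrative placeholders.

```python
# Minimal sketch of the decode -> detect -> extract -> track pipeline.
# Assumptions (not part of the original disclosure): the ultralytics
# YOLO package and OpenCV are installed; "yolov8n.pt", "input.mp4",
# and the 0.5 matching threshold are illustrative placeholders.
import cv2
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8n.pt")   # hypothetical human body detection model
tracks = {}                  # track id -> feature vector of last sighting
next_id = 0

def extract_feature(crop):
    """Stand-in feature vector: a normalized color histogram."""
    hist = cv2.calcHist([crop], [0, 1, 2], None, [8, 8, 8],
                        [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

cap = cv2.VideoCapture("input.mp4")         # 1. decode: one image per frame
while True:
    ok, frame = cap.read()
    if not ok:
        break
    for box in model(frame)[0].boxes.xyxy:  # 2. detect bounding boxes
        x1, y1, x2, y2 = map(int, box)
        crop = frame[y1:y2, x1:x2]          # effective area image
        feat = extract_feature(crop)        # 3. extract feature vector
        # 4. track: associate with the nearest existing trajectory
        best_id, best_dist = None, 0.5      # 0.5: illustrative threshold
        for tid, prev in tracks.items():
            dist = float(np.linalg.norm(feat - prev))
            if dist < best_dist:
                best_id, best_dist = tid, dist
        if best_id is None:                 # no match: start a new track
            best_id, next_id = next_id, next_id + 1
        tracks[best_id] = feat
cap.release()
```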
Fig. 1 shows a flow chart of an embodiment of the video processing method of the present invention. As shown in Fig. 1, the method comprises the following steps:
S101: acquiring each frame image in the video to be processed.
In this step, the video to be processed is transmitted to the CPU or GPU for decoding, and each frame image is extracted.
S102: extracting corresponding effective area images from each frame image.
In an optional manner, step S102 further includes: detecting a target object contained in each frame image and determining effective area information of the frame image; and extracting the effective area image from the frame image according to the effective area information.
Specifically, for each frame image, recognition processing is performed on the frame image, and a tracking frame corresponding to each target object in the frame image is determined; the tracking frame corresponding to each target object completely encloses the foreground image of that target object in the frame image. A minimum rectangular envelope area enveloping the tracking frames corresponding to all target objects in the frame image is then determined, and the area information of this minimum rectangular envelope area is taken as the effective area information of the frame image.
The effective area images of the video to be processed cover all areas in which the target objects to be detected appear; for pedestrian identification, they are all areas in which pedestrians appear. The tracking frames corresponding to all target objects are obtained by continuously running the detection algorithm on the video to be processed, and the minimum rectangular envelope area enveloping the tracking frames of all target objects in a frame image is used as the effective area information of that frame image. The longer the video to be processed, the higher the accuracy of the effective area image (sampling a full 24-hour day gives better accuracy), and this process may be performed for every frame image or at preset time intervals to obtain a plurality of effective area images.
Fig. 2 is a schematic diagram of extracting a corresponding effective area image from a frame image. As shown in Fig. 2, taking a pedestrian as the target object, the coordinate position at which the pedestrian appears in each frame image is detected, and a tracking frame is output for each detected pedestrian. A minimum rectangular envelope area (bounding box) enveloping the tracking frames corresponding to all target objects in the frame image is then determined, and the area information of this bounding box is used as the effective area information of the frame image. The effective area information is then cropped to obtain the effective area image, and each frame image of the video to be processed, together with the coordinates of its effective area image, is input into the neural network analysis model for computation. It should be noted that the coordinates of the effective area image may be input by manual marking or calculated dynamically by artificial intelligence.
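As a concrete illustration of this envelope computation, the following sketch (an assumed example, not the claimed implementation) computes the minimum rectangular envelope of a set of tracking frames in (x1, y1, x2, y2) form and crops the effective area image from a decoded frame; all coordinates are illustrative.

```python
import numpy as np

def min_envelope(boxes):
    """Minimum rectangle enclosing all tracking frames.

    boxes: iterable of (x1, y1, x2, y2) tracking-frame coordinates.
    Returns the effective area information (x1, y1, x2, y2).
    """
    b = np.asarray(boxes)
    return (int(b[:, 0].min()), int(b[:, 1].min()),
            int(b[:, 2].max()), int(b[:, 3].max()))

def crop_effective_area(frame, area):
    """Cut the effective area image out of a decoded frame (H x W x 3)."""
    x1, y1, x2, y2 = area
    return frame[y1:y2, x1:x2]

# Example: three pedestrian tracking frames in one 1080p frame.
area = min_envelope([(100, 200, 180, 400), (300, 210, 360, 390),
                     (500, 190, 570, 410)])
# area == (100, 190, 570, 410): the frame's effective area information
```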
S103: arranging and combining the effective area images corresponding to the frame images, and determining the coordinate position information of each effective area image in its combined image, to obtain a recombined video containing a plurality of combined images.
In an optional manner, step S103 further includes: adding the effective area image corresponding to each frame image into an initial gallery; arranging and combining the effective area images in the initial gallery according to a preset combination rule to generate a plurality of combined images, and determining the coordinate position information of each effective area image within its combined image; and assembling all the combined images to obtain the recombined video.
Further, since each effective area image is a rectangle, a plurality of effective area images can be combined to generate a combined image, and all combined images are assembled into one video to obtain the recombined video.
In an alternative form, the generation of a combined image comprises steps 1-5 (a code sketch of this procedure is given after step 5):
Step 1: a canvas is created.
In this step, the canvas may be a polygon, preferably a rectangle.
Step 2: determining an effective canvas area in the canvas, and searching for at least one placement point in the effective canvas area; wherein the effective canvas area is the blank area of the canvas, and a placement point is an intersection point of a horizontal edge and a vertical edge in the effective canvas area.
Fig. 3 is a schematic diagram of canvas placement points. As shown in Fig. 3, taking a rectangular canvas as an example, the diagram shows, from left to right, a blank canvas, a canvas filled with one effective area image, and a canvas filled with two effective area images; the positions marked by circles are the placement points. The placement point of the blank canvas is the top-left vertex of the blank area (i.e., the effective canvas area) of the canvas: the canvas is traversed from top to bottom and from left to right, and the intersection of the leftmost vertical edge and the uppermost horizontal edge of the effective canvas area is selected. Similarly, for the canvas filled with one effective area image and the canvas filled with two effective area images, the effective canvas area is traversed from top to bottom and from left to right, and the intersection of the leftmost vertical edge of the effective canvas area with the lower horizontal edge of a filled effective area image, and the intersection of the right vertical edge of a filled effective area image with the uppermost horizontal edge of the effective canvas area, are respectively selected as placement points. In the same way, placement points in the effective canvas area are found for a canvas filled with two or more effective area images.
Step 3: judging whether the initial gallery contains an effective area image that can be used to fill the effective canvas area; if yes, executing step 4; if not, executing step 5.
Step 4: selecting from the initial gallery an effective area image that can be used to fill the effective canvas area, selecting one placement point from the at least one placement point, filling the effective area image at the position corresponding to that placement point, and then jumping back to step 2.
Specifically, it is judged whether the initial gallery contains an effective area image that can be used to fill the effective canvas area. If so, an effective area image is selected from the initial gallery, the first placement point is selected, and filling of the effective canvas area begins; note that the selected effective area image must not exceed the bounds of the effective canvas area. After one picture is filled, step 2 is executed again to recalculate the effective canvas area and all placement points in the canvas. For a canvas with multiple placement points, all placement points can be traversed from top to bottom and from left to right, and the uppermost (or leftmost) one is selected as the first placement point at which to start filling the effective canvas area.
Step 5: generating a combined image from the effective area images filled into the canvas, and determining the coordinate position information of each effective area image in the combined image.
Specifically, it is judged whether the initial gallery contains an effective area image that can be used to fill the effective canvas area. If not, no fillable effective area image can be found for the effective canvas area, and the canvas is considered finished. A new canvas is then filled with the remaining effective area images according to the same procedure, and the arranging-and-combining process ends once all effective area images have been filled.
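The following is a minimal sketch of steps 1-5 under simplifying assumptions: effective area images are given as (width, height) rectangles, the canvas is a fixed W x H rectangle, and the placement-point rule is reduced to a greedy top-to-bottom, left-to-right corner search. It illustrates the described procedure rather than the exact claimed algorithm.

```python
# Minimal sketch of the canvas packing in steps 1-5 (simplified).
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # x, y, w, h

def overlaps(a: Rect, b: Rect) -> bool:
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def pack_canvas(gallery: List[Tuple[int, int]], W: int, H: int):
    """Fill one canvas; returns placements and the images left over.

    gallery: effective area images as (w, h), consumed greedily.
    placements: (gallery_index, x, y, w, h) = coordinate position info.
    """
    placed: List[Rect] = []
    placements = []
    points = [(0, 0)]                    # step 2: placement points
    remaining = list(enumerate(gallery))
    progress = True
    while progress:                      # step 3: does any image still fit?
        progress = False
        points.sort(key=lambda p: (p[1], p[0]))  # top-to-bottom, left-to-right
        for pi, (x, y) in enumerate(points):
            for ri, (idx, (w, h)) in enumerate(remaining):
                cand = (x, y, w, h)
                inside = x + w <= W and y + h <= H
                if inside and not any(overlaps(cand, r) for r in placed):
                    placed.append(cand)  # step 4: fill at this point
                    placements.append((idx, x, y, w, h))
                    del remaining[ri]
                    del points[pi]
                    points += [(x + w, y), (x, y + h)]  # new corner points
                    progress = True
                    break
            if progress:
                break
    # step 5: a combined image is formed from `placements`
    return placements, [g for _, g in remaining]
```

Calling pack_canvas repeatedly on the leftover images until the gallery is empty yields one canvas per combined image, together with the coordinate position information used later in step S105.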
Further, there are many possible permutation-and-combination results: placing a single effective area image on one canvas is itself one such result, while the optimal result places all the effective area images on the same canvas. The smaller the number of combined images, the smaller the amount of computation performed with the neural network analysis model, and the final number of combined images is the result of this permutation and combination. The number of combinations is calculated as follows:

C(n, m) = m! / (n!(m - n)!)

wherein C(n, m) denotes the permutation-combination notation, i.e., the number of combinations of n items selected from m items.
In addition, for any combined images that contain the same effective area images merely in a different arrangement order, the duplicates need to be removed. According to the above formula for the number of combined images, the permutation-and-combination result with the minimum number of combined images is extracted, yielding the coordinate position of each effective area image in its combined image.
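As an illustrative calculation (the values are hypothetical, not from the disclosure), the combination count can be evaluated directly:

```python
# Illustrative evaluation of C(n, m) = m! / (n!(m - n)!), the number of
# ways of selecting n effective area images out of a gallery of m.
import math

m, n = 10, 4            # hypothetical gallery size and selection size
print(math.comb(m, n))  # 210 candidate groupings to deduplicate
```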
S104: analyzing each combined image in the recombined video by using the neural network analysis model to obtain an analysis result of the recombined video.
Training of a neural network analysis model is usually based on a fixed video resolution: video with 720P (1280x720) resolution is generally adopted for close-range cameras, while video with 1080P (1920x1080) or 4K resolution may be adopted for long-range cameras. A model trained on video of a given resolution likewise requires an input video source of the same resolution during inference. After the video to be processed is reduced through steps S101 to S103, the resolution of the resulting images is already smaller than that of the original video to be processed. The obtained effective area images are therefore collected, recombined into images at the video-source resolution (i.e., combined images), and the plurality of combined images are assembled into a virtual camera video stream, i.e., the recombined video, which is output for AI computation.
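Concretely, the recombination into a fixed-resolution combined image can be sketched as pasting each effective area image into a blank canvas at its assigned coordinates; this example assumes the placement tuples produced by the packing sketch above and a 720P video-source resolution.

```python
import numpy as np

def compose(canvas_hw, crops, placements):
    """Paste effective area images into one combined image.

    canvas_hw: (H, W) of the video-source resolution, e.g. (720, 1280).
    crops: list of effective area images (h x w x 3 uint8 arrays).
    placements: (index, x, y, w, h) tuples from the packing step.
    """
    H, W = canvas_hw
    canvas = np.zeros((H, W, 3), dtype=np.uint8)
    for idx, x, y, w, h in placements:
        canvas[y:y + h, x:x + w] = crops[idx]
    return canvas
```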
S105: restoring the analysis result of the recombined video according to the coordinate position information of each effective area image in the combined image and the effective area information of each effective area image in the frame image, to obtain the analysis result of the target video.
For the recombined video obtained in step S104, that is, the virtual camera video stream, the analysis result must be mapped back to the original positions in the video to be processed according to the coordinate position information of each effective area image in its combined image; for example, the position of an effective area image in the video to be processed is restored to the original shooting range, thereby obtaining the target video analysis result. Subsequent business processing is then performed according to the business requirements of the video to be processed, such as line-crossing counting of pedestrians and vehicles.
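The restoration of step S105 reduces to an offset translation, sketched below with hypothetical coordinates: a point detected inside a packed effective area image is shifted by that image's placement offset within the combined image, then by the effective area's offset within the original frame.

```python
def restore_point(px, py, placement, effective_area):
    """Map a point from combined-image coordinates back to the frame.

    placement: (x, y, w, h) of the effective area image in the combined image.
    effective_area: (x1, y1, x2, y2) of that image in the original frame.
    """
    cx, cy, _, _ = placement
    ox, oy, _, _ = effective_area
    return px - cx + ox, py - cy + oy

# Example: a detection at (150, 60) inside an image placed at (100, 0)
# whose effective area starts at (400, 300) in the original frame:
print(restore_point(150, 60, (100, 0, 200, 120), (400, 300, 600, 420)))
# -> (450, 360)
```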
By adopting the method provided by this embodiment, without adding hardware, the effective area of the video to be processed is extracted and dynamically calculated, and the effective area images are then arranged and combined to obtain the recombined video, so that the computer can automatically determine the effective area according to the detection algorithm. Merging multiple combined images into the recombined video through the combination algorithm can double the AI computation efficiency, reducing the number of computations in the video processing procedure, making full use of existing hardware resources, and improving video processing efficiency. Meanwhile, restoring the recombined video to obtain the target video analysis result preserves the precision of video processing; the whole method flow adapts to dynamic and complex execution environments without affecting the precision of the machine learning computation, and thus avoids the added hardware requirements and the risk of reduced precision posed by the various existing methods.
Fig. 4 is a schematic structural diagram of an embodiment of a video processing apparatus according to the present invention. As shown in fig. 4, the apparatus includes: a frame image acquisition module 401, an extraction module 402, a video recombination module 403, an analysis module 404, and a restoration module 405.
The frame image obtaining module 401 is configured to obtain each frame image in the video to be processed.
The extraction module 402 is configured to extract corresponding effective area images from each frame image.
In an optional manner, the extraction module 402 is further configured to: detecting a target object contained in each frame image and determining effective area information of the frame image; and extracting the effective area image from the frame image according to the effective area information.
In an optional manner, the extraction module 402 is further configured to: for each frame image, perform recognition processing on the frame image and determine a tracking frame corresponding to each target object in the frame image, wherein the tracking frame corresponding to each target object completely encloses the foreground image of that target object in the frame image; and determine a minimum rectangular envelope area enveloping the tracking frames corresponding to all target objects in the frame image, taking the area information of the minimum rectangular envelope area as the effective area information of the frame image.
The video recombination module 403 is configured to arrange and combine the effective area images corresponding to each frame image, determine the coordinate position information of each effective area image in its combined image, and obtain a recombined video including multiple combined images.
In an alternative manner, the video recombination module 403 is further configured to: add the effective area image corresponding to each frame image into an initial gallery; arrange and combine the effective area images in the initial gallery according to a preset combination rule to generate a plurality of combined images, and determine the coordinate position information of each effective area image within its combined image; and assemble all the combined images to obtain the recombined video.
In an alternative manner, for the generation process of a combined image, the video recombination module 403 is further configured to perform:
Step 1: creating a canvas;
Step 2: determining an effective canvas area in the canvas, and searching for at least one placement point in the effective canvas area; the effective canvas area is the blank area of the canvas, and a placement point is an intersection point of a horizontal edge and a vertical edge in the effective canvas area;
Step 3: judging whether the initial gallery contains an effective area image that can be used to fill the effective canvas area; if yes, executing step 4; if not, executing step 5;
Step 4: selecting from the initial gallery an effective area image that can be used to fill the effective canvas area, selecting one placement point from the at least one placement point, filling the effective area image at the position corresponding to that placement point, and then jumping back to step 2;
Step 5: generating a combined image from the effective area images filled into the canvas, and determining the coordinate position information of each effective area image in the combined image.
The analysis module 404 is configured to analyze each combined image in the recombined video by using the neural network analysis model to obtain an analysis result of the recombined video.
The restoration module 405 is configured to restore the analysis result of the recombined video according to the coordinate position information of each effective area image in the combined image and the effective area information of each effective area image in the frame image, to obtain the target video analysis result.
By adopting the device provided by the embodiment, each frame image in the video to be processed is obtained; extracting corresponding effective area images from each frame image; arranging and combining the effective area images corresponding to the frame images, and determining the coordinate position information of each effective area image in the combined image to obtain a recombined video containing a plurality of combined images; analyzing each combined image in the recombined video by using a neural network analysis model to obtain an analysis result of the recombined video; and restoring the analysis result of the recombined video according to the coordinate position information of each effective area image in the combined image and the effective area information of each effective area image in the frame image to obtain the analysis result of the target video. According to the device provided by the embodiment, on the premise of not adding hardware, the effective area of the video to be processed is extracted, and the images of the effective area are arranged and combined to obtain the recombined video, so that the calculation times in the video processing process are reduced, the existing hardware resources are fully utilized, and the video processing efficiency is improved; meanwhile, the target video analysis result is obtained by restoring the recombined video, so that the video processing precision is ensured.
An embodiment of the present invention provides a non-volatile computer storage medium, where at least one executable instruction is stored in the computer storage medium, and the computer executable instruction may execute the video processing method in any of the above method embodiments.
The executable instructions may be specifically configured to cause the processor to:
acquiring each frame image in a video to be processed;
extracting corresponding effective area images from each frame image;
arranging and combining the effective area images corresponding to the frame images, and determining the coordinate position information of each effective area image in the combined image to obtain a recombined video containing a plurality of combined images;
analyzing each combined image in the recombined video by using a neural network analysis model to obtain an analysis result of the recombined video;
and restoring the analysis result of the recombined video according to the coordinate position information of each effective area image in the combined image and the effective area information of each effective area image in the frame image to obtain the analysis result of the target video.
Fig. 5 is a schematic structural diagram of an embodiment of a computing device according to the present invention, and a specific embodiment of the present invention does not limit a specific implementation of the computing device.
As shown in fig. 5, the computing device may include: a processor (processor), a Communications Interface (Communications Interface), a memory (memory), and a Communications bus.
Wherein: the processor, the communication interface, and the memory communicate with each other via a communication bus. A communication interface for communicating with network elements of other devices, such as clients or other servers. And the processor is used for executing the program, and specifically can execute the relevant steps in the video processing method embodiment.
In particular, the program may include program code comprising computer operating instructions.
The processor may be a central processing unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The server comprises one or more processors, which may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory is used for storing the program. The memory may comprise high-speed RAM memory, and may also include non-volatile memory, such as at least one disk memory.
The program may specifically be adapted to cause a processor to perform the following operations:
acquiring each frame image in a video to be processed;
extracting corresponding effective area images from each frame image;
arranging and combining the effective area images corresponding to the frame images, and determining the coordinate position information of each effective area image in the combined image to obtain a recombined video containing a plurality of combined images;
analyzing each combined image in the recombined video by using a neural network analysis model to obtain an analysis result of the recombined video;
and restoring the analysis result of the recombined video according to the coordinate position information of each effective area image in the combined image and the effective area information of each effective area image in the frame image to obtain the analysis result of the target video.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specified otherwise.