VIDEO PROCESSING SYSTEM EMPLOYING BEHAVIOR SUBTRACTION BETWEEN REFERENCE AND OBSERVED VIDEO IMAGE SEQUENCES
BACKGROUND

Pervasive, wide-area visual surveillance, unthinkable even 10 years ago, is a reality today due to the advent of the wireless network video camera. Easily deployed and remotely managed, such a camera can generate round-the-clock visual data (including at night) that can be used in access control, vehicle and pedestrian traffic analysis, gait recognition, detection of unattended luggage, etc. However, the sheer amount of data produced by each camera prevents human-operator monitoring. Therefore, automatic algorithms need to be developed for specific tasks. One such task recently gaining prominence is the detection of suspicious behavior, i.e., locating someone or something whose behavior differs from behavior observed in a reference video sequence. Many surveillance methods are based on a general pipeline-based framework. Moving objects are first detected in a motion segmentation step, then classified and tracked over a certain number of frames, and, finally, the resulting paths are used to distinguish "normal" from "suspicious" objects (i.e., objects whose path differs from that of most objects). In general, these methods contain a training phase during which a probabilistic model is built using paths followed by "normal" objects.
SUMMARY
While there are advantages to using object path as a motion attribute, there are several drawbacks as well. Fundamentally, object path discriminant techniques focus on the question of which path is anomalous. However, there is often important information in the dynamics of the path itself, usually ignored by these schemes. The present disclosure addresses a slightly different question: how conventional paths are visited, and whether these visits are statistically anomalous. To illustrate the importance of this approach, the following scenarios can be considered: (a) the lingering time of an object along a normal path (or in some regions) may be statistically significant; (b) an unusually large/small object may travel along a conventional path at normal speed. One novelty in the disclosed approach is the modeling of temporal behavior, and another novelty is in terms of the solution to the problem. While path-based schemes require many stages of processing ranging from low-level detection to high-level inference, and can often fail dramatically because of errors introduced at any of those stages (imprecise motion field, tracking error), the disclosed technique may require only low-level processing and may be efficiently implemented in real-time. Furthermore, on account of its simplicity it may be possible to provide performance guarantees. A third advantage is its generality, i.e., it applies to individual moving objects
(car, truck, person), groups of objects, merging and/or splitting objects. Finally, the disclosed technique is robust against harsh environments (jittery cameras with unknown calibration, highly cluttered scenes, rain/snow/fog, etc.).
This disclosure describes a method and system for the detection and localization of either dissimilarities or similarities between sequences of images based on analysis of patterns of motion and non-motion in pixels of the image sequences. The technique can be used in a security context, for example, to identify anomalies in sequences of images obtained from a security camera. It can also be used in an identification context, for example to identify unknown/suspicious video material by comparison with a database or library of known (e.g., copyrighted) video material.
More particularly, the disclosed methods and apparatus form a compact, low-dimensional representation (referred to as behavior representation or behavior image) to capture temporal activity (or lack thereof) and account for broad temporal variations in size, shape and color characteristics of real-time video sequences. The disclosed system accumulates motion data from multiple image sequences to form corresponding behavior representations. In one class of embodiments one behavior representation is a reference behavior representation obtained by an accumulation applied to a reference (training) image sequence, and the other behavior representation is an observed behavior representation obtained by an accumulation applied to an observed image sequence. The image sequences can be obtained from a visible-light or infra-red video camera (sequence of fields if interlaced scan is used, or sequence of frames if progressive scan is used), both analog and digital, from a web camera, computer camera, surveillance camera, etc.
A comparing function is applied to the behavior representations to detect either dissimilar (anomalous, unusual, abnormal) or similar (highly-correlated) dynamics, or motion patterns, between the image sequences. The detection of unusual motion patterns is an enabling technology for suspicious behavior detection in surveillance, homeland security, military, dynamic data analysis (e.g., bird motion patterns), and other applications. The detection of similar motion patterns, on the other hand, is an enabling technology for the identification of copyrighted video material (against a database of such material) for the purposes of copyright policing/enforcement. The technique is a completely new approach to visual anomaly (or similarity) detection and is generic, with no specific models needed for the analysis of motion patterns of humans, animals, vehicles, etc. The temporal accumulation of images (or video frames) is, in general, non-linear and can take many different forms. In one embodiment, anomaly detection based on motion patterns is performed by applying motion detection first (e.g., by means of background subtraction) in order to obtain per-pixel motion detection labels, e.g., 1 for moving, 0 for stationary pixels, from both observed and reference image sequences. Subsequently, the motion detection labels of the observed image sequence undergo temporal summation over N images (current image and N-1 prior images) to form a cumulative label field, which is the observed behavior representation. A separate cumulative label field is computed similarly from the reference image sequence, but additionally the motion labels undergo a nonlinear "max" operation in time, pixel-by-pixel, i.e., a maximum cumulative label is found at each pixel over all time instants. This produces the reference behavior representation.
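The accumulation just described can be illustrated with a short numpy sketch. This is a minimal illustration under our own assumptions, not the claimed implementation: the function names (motion_labels, observed_behavior, reference_behavior) and the simple thresholded background subtraction are illustrative choices.

```python
import numpy as np

def motion_labels(frames, background, thresh=25):
    """Per-pixel motion labels via simple background subtraction:
    1 where a frame differs from the background by more than thresh."""
    return (np.abs(frames.astype(int) - background.astype(int)) > thresh).astype(np.uint8)

def observed_behavior(labels, N):
    """Observed behavior representation: per-pixel temporal sum of the
    motion labels over the last N images (current image and N-1 prior ones)."""
    return labels[-N:].sum(axis=0)

def reference_behavior(labels, N):
    """Reference behavior representation: sum the labels per pixel over
    every N-frame window of the training labels, then take the pixel-wise
    maximum over all window positions (the nonlinear "max" in time)."""
    T = labels.shape[0]
    # cumulative sums turn each window sum into an O(1) difference
    c = np.concatenate([np.zeros((1,) + labels.shape[1:], int),
                        labels.cumsum(axis=0)])
    windows = c[N:] - c[:T - N + 1]
    return windows.max(axis=0)
```

Here `labels` is a (time, height, width) stack of binary label fields; the cumulative-sum trick merely avoids recomputing overlapping window sums.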
The technique can be used for comparison of a target video sample against representative images (and hence video sequences). The output can be identification of copyrighted video material, detecting and localizing abnormal behavior in surveillance, or enabling detection of illegal activities, for example, and it can be employed either indoors, outdoors or on mobile units. It may be implemented on specialized integrated circuit(s) or a general-purpose processor, and can be employed in a stand-alone scenario such as within a single camera unit or as components within a larger networked system. This technique can also be used to perform detection of "essential" motion in harsh environments (i.e., detection of moving items of interest (vehicles, people, etc.) in environments that also have noisy non-essential motion such as camera jitter, trees rustling, waves, etc.). It can also be used to automatically register multiple cameras by comparing reference behavior representations from multiple cameras. A networked system can then perform local processing and may use these results in combination with advanced fusion technologies for providing area-wide, real-time pervasive surveillance in limited communication environments. Advantages of the disclosed technique are (a) Compactness: typical temporal behavior is projected into a low-dimensional image or feature space; (b) Generality: requires little prior information, training or knowledge of the video environment; (c) Robustness: provides a degree of insensitivity to harsh environments; (d) Real-time processing: can be implemented to provide real-time information with existing hardware in current stand-alone video cameras.
Other embodiments of the accumulation of images are possible. For example, the temporal segmentation of motion detection labels into 1 (motion activity) and 0 (no motion activity) states may be followed by measurement of the average temporal length of the 1 and
0 states ("busy" and "idle" times) over a time window, and forming a scatter plot of these average lengths. Now, the behavior representation is a single two-dimensional histogram of average "busy" versus "idle" times over a time window. Alternatively, a separate two-dimensional histogram of average busy versus idle times over a time window can describe each pixel position separately. Although more memory-consuming, this embodiment allows for finer granularity of anomalous motion pattern detection.
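The busy/idle embodiment can be sketched as follows, assuming binary label sequences are already available. The names run_lengths and busy_idle_histogram, and the bin settings, are illustrative assumptions rather than part of the disclosure.

```python
import numpy as np

def run_lengths(seq):
    """Split a binary 0/1 sequence into runs; return the mean length of
    the 1-runs ("busy" times) and of the 0-runs ("idle" times)."""
    seq = np.asarray(seq)
    change = np.flatnonzero(np.diff(seq)) + 1            # indices where the value flips
    starts = np.concatenate([[0], change])
    lengths = np.diff(np.concatenate([starts, [len(seq)]]))
    values = seq[starts]
    busy = lengths[values == 1]
    idle = lengths[values == 0]
    return (busy.mean() if busy.size else 0.0,
            idle.mean() if idle.size else 0.0)

def busy_idle_histogram(label_sequences, bins=8, max_len=16):
    """Two-dimensional histogram of average busy vs. idle times over many
    pixels' label sequences -- one form of behavior representation."""
    pairs = np.array([run_lengths(s) for s in label_sequences])
    hist, _, _ = np.histogram2d(pairs[:, 0], pairs[:, 1],
                                bins=bins, range=[[0, max_len], [0, max_len]])
    return hist
```

For the per-pixel variant mentioned above, one such histogram would simply be maintained for each pixel position instead of one global histogram.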
The comparing function applied to the behavior representations may be a simple difference operation followed by "floor-to-zero" operation, or it may be a more elaborate scheme. For example, the histograms of average "busy" and "idle" times for the observed and reference image sequences can be compared using histogram equalization. In yet another embodiment, the histograms are transformed into continuous probability density functions by means of adding low-variance random noise to each histogram count, followed by non-parametric kernel estimation to evaluate the likelihood of an observed pixel's busy and idle times being drawn from the estimated non-parametric probability density function. The basic motion information may also be augmented by other information to enhance operation, such as information identifying the size, shape and/or color of the object that each pixel or region is part of.
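The kernel-estimation variant of the comparing function might look like the following sketch. An isotropic Gaussian (Parzen) kernel stands in for whatever estimator an implementation would actually use, and all names and parameter values (bandwidth, noise_std) are illustrative assumptions.

```python
import numpy as np

def kde_likelihood(reference_pairs, query, bandwidth=1.0, noise_std=0.05, seed=0):
    """Likelihood of an observed pixel's (busy, idle) pair under a
    non-parametric density estimated from reference (busy, idle) pairs.
    Low-variance noise is added so that repeated identical counts form
    a continuous density rather than point masses."""
    rng = np.random.default_rng(seed)
    ref = np.asarray(reference_pairs, float)
    ref = ref + rng.normal(0.0, noise_std, ref.shape)    # perturb tied counts
    d = np.asarray(query, float) - ref                   # displacement to each sample
    sq = (d ** 2).sum(axis=1)
    # average of 2-D isotropic Gaussian kernels centered on the reference pairs
    k = np.exp(-sq / (2 * bandwidth ** 2)) / (2 * np.pi * bandwidth ** 2)
    return k.mean()
```

A low likelihood for an observed pixel's busy-idle pair would then flag that pixel as exhibiting an anomalous motion pattern.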
The use of "busy" and "idle" cycles registered at each pixel (or region) can also be used to find correspondences between cameras. Since the busy-idle distributions are unique for each pixel (or region) registering non-zero activity, they can be used as a signature to uniquely identify pixels in the video. Furthermore, since binary motion labels are almost invariant to the position and orientation of the camera (under the assumption that the moving objects' height is significantly smaller than the elevation of the camera), pixels of different cameras looking at the same region will have similar busy-idle distributions. Thus, with a simple distance metric between busy-idle distributions (be it a Euclidean distance, a Kullback-Leibler distance, or a Kolmogorov-Smirnov distance, for example), the correspondences between pixels of different cameras looking at a single scene can be found. Once such pixel-to-pixel (or region-to-region) correspondence has been established, the behavior model learned by one camera can be transferred to other cameras looking at the scene, enabling them to detect abnormal activity.
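A minimal sketch of busy-idle-based correspondence follows, assuming each pixel's behavior is already summarized by a small histogram. The symmetrized Kullback-Leibler distance is one of the metrics named above; the brute-force nearest-neighbor search and the function names are our own illustrative stand-ins for a real correspondence algorithm.

```python
import numpy as np

def kl_distance(p, q, eps=1e-9):
    """Symmetrized Kullback-Leibler distance between two busy-idle
    histograms (flattened to 1-D and normalized; eps avoids log of zero)."""
    p = np.asarray(p, float).ravel() + eps
    q = np.asarray(q, float).ravel() + eps
    p, q = p / p.sum(), q / q.sum()
    return float((p * np.log(p / q)).sum() + (q * np.log(q / p)).sum())

def match_pixels(hists_a, hists_b):
    """For each pixel of camera A, find the index of camera B's pixel whose
    busy-idle histogram is closest -- a crude pixel correspondence map."""
    return [min(range(len(hists_b)), key=lambda j: kl_distance(h, hists_b[j]))
            for h in hists_a]
```

The same distance could equally drive the camera-to-projector correspondence described next for structured-light depth recovery.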
Based on a similar idea, one can use a light projector (be it laser-based, LCD-based or any other) to project patterns of light (typically, patterns containing white and black regions) on a three-dimensional object. If one or more cameras look at that object while the light patterns are projected, those patterns can be interpreted by the cameras as being binary motion labels. Thus, following the same busy-idle distance metric, a correspondence map between the pixels (or the regions) of each camera and the projector can be established. This correspondence map can then be used as a disparity map to build a three dimensional view of the observed object.
One embodiment of the video identification application is as follows. A database of copyrighted video material is assembled by applying one of the behavior representations described above to each of a collection of known copyrighted videos. Then, the same representation is computed for a suspicious video. The two resulting behavior representations are compared using either the difference and floor-to-zero operators, or a histogram-based metric. If these behavior representations are judged very similar, a copyright violation alert is issued. The use of a single-image representation in this case leads to a very compact and fast video identification system. It can be used as a triage system to separate definitive non-violations from possible violations. In another embodiment that requires more memory and computing power but achieves higher robustness, a behavior sequence representation is used. A behavior sequence is a sequence of behavior images, each such image computed from a sub-sequence (temporal segment) of an analyzed video sequence. The sub-sequences can be obtained either by uniform temporal partitioning of the analyzed video sequence (e.g., 100-frame segments) or by a non-uniform temporal partitioning (for example, partitioning based on scene-cut detection). Using the same type of partitioning (uniform or non-uniform), behavior image representations are computed for each sub-sequence, and the comparison metric is applied to both the database and the suspicious video material. However, since the suspicious video material may be cropped (in spatial dimensions and time), the matching of the observed behavior sequence with the video database is preferably performed through a "sliding window" (space and time) approach. The observed behavior sequence is matched with all same-length temporal segments and same-size spatial windows of all behavior sequences in the copyright material database.
Although more demanding computationally and memory-wise, this approach assures robustness to spatial and temporal video cropping.
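The sliding-window matching could be sketched as below. The exhaustive search over temporal and spatial offsets is a deliberately naive stand-in for an optimized implementation, and the floor-to-zero metric follows the simple comparing function described earlier; the function names are illustrative.

```python
import numpy as np

def behavior_distance(a, b):
    """Difference followed by floor-to-zero, summed over all pixels --
    the simple comparing metric applied to two behavior representations."""
    return float(np.maximum(a.astype(int) - b.astype(int), 0).sum())

def sliding_match(observed, reference, step=1):
    """Match an observed behavior sequence (t x h x w) against every
    same-length temporal segment and same-size spatial window of a
    reference behavior sequence (T x H x W); return the best
    (t, y, x) offset and its distance score (lower is more similar)."""
    t, h, w = observed.shape
    T, H, W = reference.shape
    best = None
    for dt in range(0, T - t + 1, step):
        for dy in range(0, H - h + 1, step):
            for dx in range(0, W - w + 1, step):
                win = reference[dt:dt + t, dy:dy + h, dx:dx + w]
                score = behavior_distance(observed, win)
                if best is None or score < best[1]:
                    best = ((dt, dy, dx), score)
    return best
```

A triage system would compare the best score against a similarity threshold before issuing a possible-violation alert.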
The generality of the proposed representations and comparison mechanisms makes the anomaly detection and video identification robust to spatial resolution change (spatial scaling), compression, spatial filtering, and other forms of spatial video manipulation.
BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing and other objects, features and advantages will be apparent from the following description of particular embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of various embodiments of the invention. Figure 1 is a block diagram of an image-processing system; Figure 2 is a flow diagram of operation of the system of Figure 1; Figures 3(a)-3(f) and 4(a)-4(f) are video images and image-like representations illustrating examples of the disclosed method.
DETAILED DESCRIPTION
Figure 1 shows a video processing system 10 including a camera 12, video storage 14, image processor 16, and a database 18. In operation, the camera 12 generates a video signal 20 representing captured video frames, which are stored in the storage 14. The stored video frames are provided to the image processor 16 as image sequences 22. As described in more detail below, the image processing circuitry 16 calculates reference behavior representations 24 and stores these in the database 18, and then later retrieves the stored reference behavior representations 24 as part of processing observed image sequences 22 and generating user output. The nature of the user output varies based on the particular application. In a surveillance application, the user output may include alerts as well as an image on a user output device (e.g., display) indicating the location of anomalous behavior in an imaging field. In an identification application, the user output may include an identification of a reference image sequence that has matched an observed image sequence.
Figure 2 illustrates pertinent operation of the image processor 16. At step 26, sequences of motion labels are calculated from corresponding sequences of images, each sequence comprising motion label values indicating the presence or absence of motion on a region-by-region basis. Various region sizes and motion label types may be employed. As mentioned above, one simple motion label is the presence or absence of motion, i.e., for each pixel in a frame, whether or not that pixel is part of an object in motion. The result of this operation for each video frame is a corresponding frame of per-pixel motion label values, such as "1" for "moving" and "0" for "not moving". In alternative embodiments, regions larger than a pixel may be covered by a label, and the label values may be more complex. In particular, as noted elsewhere herein, the label values may be accompanied by additional information that can enhance operation, such as information identifying other attributes of the moving objects that each pixel/region is part of, such as size, shape, color, etc. Referring again to Figure 2, the image processor 16 at step 32 uses the motion labels for the first and second image sequences to calculate respective first and second "behavior representations", each being a low-dimensional representation of the corresponding image sequence. In general, one or both of the behavior representations is stored in the database 18, especially in applications such as surveillance in which a reference behavior representation is calculated from a reference image sequence that precedes an observed image sequence from which the observed behavior representation is calculated. This operation may be repeated for multiple image sequences. At step 34, the image processor 16 calculates a comparing function of the behavior representations from step 32, which may include retrieving one or more of the subject behavior representations from the database 18 if necessary.
The result of the calculation of step 34 can be used to provide an indication to a user or other application-specific functionality. For example, in the case of video surveillance, the calculation of step 34 may identify anomalous behavior, and in that case appropriate visual and/or other indications can be made to a user. In a video matching application, the calculation of step 34 may identify which of several reference image sequences is matched by the observed image sequence.
Before describing the details of operation, example images/representations are presented in order to briefly illustrate how the process works. Reference is made to Figures 3(a) through 3(f). Figure 3(a) is a representative image from a video of a highway scene near an overpass. It will be appreciated that over a typical short interval (e.g., 15 minutes), many cars will pass by the camera both on the highway (left-to-right) as well as on the overpass
(top right corner of image). Figure 3(b) is a single motion label field derived from the image of Figure 3(a), with the pixels of moving objects (e.g., cars and trucks) showing as white and the pixels of stationary objects showing as black. It will be appreciated that for each pixel, the set of motion labels calculated for that pixel over all images of the sequence forms a respective sequence of motion labels. Figure 3(c) is a reference behavior representation which is calculated from all the per-pixel sequences of motion labels over a typical interval. The behavior representation of Figure 3(c) is taken as a reference behavior representation. The image can be interpreted as indicating (by pixel amplitude) the amount of motion experienced in different areas of the image.
Figures 3(d) through 3(f) illustrate the use of the reference behavior representation of Figure 3(c) to identify anomalous behavior. Figure 3(d) shows the scene at a time when a train is passing by (near the top of the image), and Figure 3(e) shows the corresponding motion label field. Figure 3(f) shows the result of a comparing function applied between the reference behavior representation of Figure 3(c) and an observed behavior representation calculated from the sequences of motion labels for the observed image sequence. As shown in Figure 3(f), only the areas where the train is located show up, indicating that the passing of the train is anomalous based on the training sequence represented by Figure 3(a). Further details of operation are now provided.
In many cases, the goal of a video surveillance system is to detect unusual behavior such as a motor vehicle accident, vehicle breakdown, an intrusion into a restricted area, or someone leaving a suspicious object (e.g., walking away from a suitcase). Thus, the starting point is a definition of "normality". It is proposed that a video sequence exhibiting normal behavior is one whose dynamic content has already been observed in a training sequence.
Consequently, an abnormal sequence is one whose dynamic content has not been observed in the training sequence. Following this definition, let Ī(x, t) be a training video sequence exhibiting normal activity and I(x, t) be an observed video sequence which may contain unusual behavior, where x is a pixel location and t denotes time. Also, let y be an abnormality label field that is sought to be estimated; y(x) is 0 for normal behavior and 1 for abnormal behavior.
To discriminate between normal and abnormal behavior, a measure is used of how likely it is that the dynamic content of an N-frame video sequence {I(x, t_i)}_{i=1}^{N} = {I(x, t_1), I(x, t_2), ..., I(x, t_N)} has already been observed in the M-frame training sequence {Ī(x, t_i)}_{i=1}^{M} = {Ī(x, t_1), Ī(x, t_2), ..., Ī(x, t_M)}. Since the comparison is to be made based on dynamic content in both video sequences, a motion attribute at (x, t) is of interest. Let {L(x, t_i)}_{i=1}^{N} be an observed motion label sequence, i.e., a sequence of motion labels, with value 1 if a pixel belongs to a moving object and 0 otherwise, computed from the observed sequence using motion detection. Similarly, let {L̄(x, t_i)}_{i=1}^{M} be a reference motion label sequence computed similarly from the reference (training) sequence {Ī(x, t_i)}_{i=1}^{M}. Many motion detection techniques have been devised to date, including fixed- and adaptive-threshold hypothesis testing, background subtraction, Markov Random Field (MRF), and level-set inspired methods.
Applying a motion detection algorithm to all frames of the training video sequence and, independently, to all frames of the observed video sequence results in respective temporal series of zeros and ones at each pixel location. Each such binary sequence reflects the amount of activity occurring at a given pixel location during a certain period of time. For instance, a pixel whose zero-one sequence contains many 1's is located in a relatively "busy" area (exhibiting relatively more motion over time), whereas a pixel associated with many 0's is located in a relatively "static" area (exhibiting relatively less motion over time). From the observed and reference motion label sequences {L(x, t_i)}_{i=1}^{N} and {L̄(x, t_i)}_{i=1}^{M}, respectively, observed and reference behavior representations are computed and compared to establish abnormality. More specifically, suppose that abnormality is defined as follows: a series of observations is abnormal if it contains an unusually high amount of activity. Therefore, a W-frame time series {L(x_0, t_i)}_{i=k-W+1}^{k} is abnormal if it contains more 1's than any W-frame time series {L̄(x_0, t_i)}_{i=k-W+1}^{k} in the training sequence, for W ≤ k ≤ M. In other words, a pixel at x_0 is abnormal if the previous W observations {L(x_0, t_i)}_{i=k-W+1}^{k} exhibit more activity than the maximum amount of activity registered at x_0 in the training sequence.
Since the maximum amount of activity in the training sequence is constant, it may be pre-calculated and stored in a background behavior image B:

B(x) = max_k Σ_{j=k-W+1}^{k} L̄(x, t_j), where W ≤ k ≤ M.

The B image succinctly synthesizes the ongoing activity in the training sequence and thus is a form of reference (training) behavior representation. It implicitly includes the paths followed by moving objects as well as the amount of activity registered at every point in the training sequence.
Similarly, the activity in the observed sequence is measured by computing an observed behavior image v:

v(x) = Σ_{j=k-W+1}^{k} L(x, t_j),

which contains the total amount of activity during the last W frames of the observed sequence. Thus, v is a form of observed behavior representation. Once images B and v have been computed, one simple possibility for the comparing function is a distance-measuring function D such as

D(B(x), v(x)) = ⌊v(x) − B(x)⌋_0,

where ⌊a⌋_0 is a "floor-to-zero" operator (0 if a < 0 and a otherwise). With such a distance measure, the behavior detection problem is reduced to background subtraction, i.e., subtraction of image v, containing a snapshot of activity just prior to t, from the background image B, containing an aggregate of long-term activity in the training sequence. Based on this formulation, the method may be seen as involving behavior subtraction.
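In code, the behavior-subtraction comparison reduces to a couple of numpy operations. The sketch below binarizes the floor-to-zero difference with a threshold tau; the use of such a threshold to produce the final abnormality field is our illustrative assumption.

```python
import numpy as np

def behavior_subtraction(B, v, tau=20):
    """Floor-to-zero difference between observed behavior image v and
    background behavior image B, i.e. d(x) = max(v(x) - B(x), 0),
    then thresholded at tau (illustrative) to give a binary abnormality field."""
    d = np.maximum(v.astype(int) - B.astype(int), 0)
    return (d > tau).astype(np.uint8), d
```

Pixels whose recent activity exceeds the maximum training-time activity by more than tau are flagged as abnormal; all others map to zero, just as in background subtraction.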
In another embodiment, average activities can be compared, and the background behavior image computed using an averaging operator over M frames of the reference motion label sequence as follows:

B(x) = (1/M) Σ_{j=1}^{M} L̄(x, t_j),

and similarly the observed behavior image can be computed as a W-frame average over the observed motion label sequence:

v(x) = (1/W) Σ_{j=k-W+1}^{k} L(x, t_j).
The background and observed behavior images obtained by means of averaging are again compared using the distance-measuring function D. Since in this embodiment an average motion activity is compared between the observed and reference sequences, only a departure from average motion is detected. Therefore, this approach is a very effective method for motion detection in the presence of some nominal activity in the scene, such as camera vibrations, animated water, or fluttering leaves.
The behavior subtraction method has been tested on real video sequences from network video cameras. Example results are shown in Figures 3(a)-3(f) and 4(a)-4(f). As noted above, Figures 3(a)-3(f) are of a highway scene with adjacent overpass and train tracks. Figures 4(a)-4(f) are of a busy intersection, and the anomalous behavior identified by Figure 4(f) is the presence of a streetcar visible in the center of the image of Figure 4(d) but absent from the training sequence represented by Figure 4(a). These figures correspond to the above mathematical designations of the images/fields as follows: Figures 3(a) and 4(a) show a frame of the training sequence Ī; Figures 3(b) and 4(b) show the corresponding reference motion label field L̄; Figures 3(c) and 4(c) show the background behavior image B; Figures 3(d) and 4(d) show a frame of the observed sequence I; Figures 3(e) and 4(e) show the corresponding observed motion label field L; and Figures 3(f) and 4(f) show the result of the comparing function D applied to B and v.
The results of Figures 3(a)-3(f) and 4(a)-4(f) were obtained with 1700 and 2000 training frames, respectively. In both cases, values of W = 100, τ = 20, and T = 30 were used. Motion detection was implemented using simple background subtraction with the background derived from a temporal median filter.
While various embodiments of the invention have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.