Abstract
Throughout a patient’s stay in the Intensive Care Unit (ICU), accurate measurement of patient mobility, as part of routine care, is helpful in understanding the harmful effects of bedrest [1]. However, mobility is typically measured through observation by a trained and dedicated observer, which is extremely limiting. In this work, we present a video-based automated mobility measurement system called NIMS: Non-Invasive Mobility Sensor. Our main contributions are: (1) a novel multi-person tracking methodology designed for complex environments with occlusion and pose variations, and (2) an application of human-activity attributes in a clinical setting. We demonstrate NIMS on data collected from an active patient room in an adult ICU and show a high inter-rater reliability, with a weighted Kappa statistic of 0.86, for automatic prediction of the highest level of patient mobility as compared to clinical experts.
1 Introduction
Monitoring human activities in complex environments is attracting increasing interest [2,3]. Our current investigation is driven by automated hospital surveillance, specifically for critical care units that house the sickest and most fragile patients. In 2012, the Institute of Medicine released their landmark report [4] on developing digital infrastructures that enable rapid learning health systems; one of their key postulates is the need for improved technologies for measuring the care environment. Currently, simple measures such as whether the patient has moved in the last 24 h, or whether the patient has gone unattended for several hours, require manual observation by a nurse, which is highly impractical to scale. Early mobilization of critically ill patients has been shown to reduce physical impairments and decrease length of stay [5]; however, the reliance on direct observation limits the amount of data that may be collected [6].
To automate this process, non-invasive low-cost camera systems have begun to show promise [7,8], though current approaches are limited by the unique challenges common to complex environments. First, although person detection in images is an active research area [9,10], significant occlusions present limitations because the expected appearances of people do not match what is observed in the scene. Part-based deformable methods [11] address these issues to some extent and provide support for articulation; however, when deformation is combined with occlusion, these too suffer for similar reasons.
This paper presents two main contributions towards addressing challenges common to complex environments. First, using an RGB-D sensor, we demonstrate a novel methodology for human tracking which accounts for variations in occlusion and pose. We combine multiple detectors and model their deformable spatial relationship with temporal consistency, so that individual parts may be occluded at any given time, even through articulation. Second, we apply an attribute-based framework to supplement the tracking information in order to recognize activities, such as mobility events, in a complex clinical environment. We call this system NIMS: Non-Invasive Mobility Sensor.
1.1 Related Work
Currently, few techniques exist to automatically and accurately monitor ICU patients’ mobility. Accelerometry is one method that has been validated [12], but it has limited use in critically ill inpatient populations [6]. For multi-person tracking, methods have been introduced to leverage temporal cues [13,14]; however, hand-annotated regions are typically required at the onset, limiting automation. To avoid manual initialization, techniques such as [15,16] employ a single per-frame detector with temporal constraints. Because single detectors handle appearance variations poorly, [15] proposes the use of multiple detectors; however, this assumes that the spatial configuration between the detectors is fixed, which does not scale to significant pose variations.
Much activity-analysis research has addressed action classification with bag-of-words approaches. Typically, spatio-temporal features, such as Dense Trajectories [17], are used with a histogram of dictionary elements or a Fisher Vector encoding [17]. Recent work has applied Convolutional Neural Network (CNN) models to the video domain [18,19] by utilizing both spatial and temporal information within the network topology. Other work uses Recurrent Neural Networks with Long Short-Term Memory [20] to model sequences over time. Because the “activities” addressed in this paper are more high-level in nature, traditional spatio-temporal approaches often suffer. Attributes describe high-level properties and have been demonstrated for activities [21], but they tend to ignore contextual information.
The remainder of this paper is organized as follows: first, we describe our multi-person tracking framework, followed by our attributes, and motivate their use in the clinical setting to predict mobility. We then describe our data collection protocol and experimental results, and conclude with discussions and future directions.
2 Methods
Figure 1 shows an overview of our NIMS system. People are localized, tracked, and identified using an RGB-D sensor. We predict the pose of the patient and identify nearby objects to serve as context. Finally, we analyze in-place motion and train a classifier to determine the highest level of patient mobility.
Fig. 1. Flowchart of our mobility prediction framework. Our system tracks people in the patient’s room, identifies the “role” of each (“patient”, “caregiver”, or “family member”) and relevant objects, and builds attribute features for mobility classification.
2.1 Multi-person Tracking by Fusing Multiple Detectors
Our tracking method formulates an energy functional comprising spatial and temporal consistency terms over multiple part-based detectors (see Fig. 2). We model the relationship between detectors within a single frame using a deformable spatial model and then track in an online setting.
Fig. 2. Full-body (red) and head (green) detectors trained by [11]. The head detector may fail with (a) proximity or (d) distance. The full-body detector may also struggle with proximity [(b) and (c)]. (To protect privacy, all images are blurred.) (Color figure online)
Modeling Deformable Spatial Configurations - For objects that exhibit deformation, such as humans, there is an expected spatial structure between regions of interest (ROIs) (e.g., head, hands, etc.) across pose variations. Within each pose (e.g., lying, sitting, or standing), we can infer the location of one ROI (e.g., head) from another (e.g., full body). To model such relationships, we assume that there is a projection matrix \(A_{ll'}^c\) which maps the location of ROI \(l\) to that of \(l'\) for a given pose \(c\). From a training dataset, \(C\) types of poses are determined automatically by clustering location features [10], and each projection matrix \(A_{ll'}^c\) can be learned by solving a regularized least-squares optimization problem.
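As an illustration, a regularized least-squares fit of one such projection matrix might look like the sketch below; the homogeneous bounding-box representation and the regularization weight are assumptions, not the paper's exact choices.

```python
import numpy as np

def fit_projection(X_src, X_dst, lam=1e-2):
    """Fit A such that A @ [x; 1] approximates x' for one pose cluster c.

    X_src, X_dst: (N, 4) arrays of bounding boxes [x, y, w, h] for ROI l
    and ROI l' of the same person (assumed representation).
    Returns A of shape (4, 5) acting on homogeneous box coordinates.
    """
    N = X_src.shape[0]
    S = np.vstack([X_src.T, np.ones((1, N))])    # (5, N) homogeneous sources
    T = X_dst.T                                   # (4, N) targets
    # Ridge-regularized least squares: A = T S^T (S S^T + lam I)^-1.
    return T @ S.T @ np.linalg.inv(S @ S.T + lam * np.eye(S.shape[0]))

# Per-pose matrices: cluster the training boxes into C poses first
# (e.g., k-means on location features), then fit one A per cluster and ROI pair.
```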
To derive the energy function of our deformable model, we denote the number of persons in the \(t\)-th frame as \(M^t\). For the \(m\)-th person, the set of corresponding bounding boxes from the \(L\) ROIs is defined by \(X^t = \{X^t_1(m), \cdots , X^t_L(m)\}\). For any two proposed bounding boxes \(X^{t}_{l'}(m)\) and \(X^{t}_{l}(m)\) at frame \(t\) for individual \(m\), the deviation from the expected spatial configuration is quantified as the error between the expected location of the bounding box for the second ROI conditioned on the first. The total cost is computed by summing, for each of the \(M^t\) individuals, the minimum cost over the \(C\) subcategories (Eq. 1).
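A plausible form of this spatial cost, reconstructed from the description above (the exact formulation is detailed in [22]), is
\[ E^{t}_{spa} = \sum_{m=1}^{M^{t}} \min_{c \in \{1,\dots,C\}} \sum_{l \ne l'} \bigl\Vert A^{c}_{ll'}\, X^{t}_{l}(m) - X^{t}_{l'}(m) \bigr\Vert^{2}. \]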
Grouping Multiple Detectors - Next, we wish to automate the process of detecting people to track by combining multiple part-based detectors. A collection of existing detection methods [11] can be employed to train \(K\) detectors, each geared towards detecting one ROI. Consider two bounding boxes \(D^{t}_{k}(n)\) and \(D^{t}_{k'}(n')\) from any two detectors \(k\) and \(k'\), respectively. If these are from the same person, the overlapping region is large when they are projected to the same ROI using our projection matrix, and the average depths inside the two bounding boxes are similar. We therefore calculate the probability that these are from the same person as a combination of overlap and depth-similarity scores (Eq. 2),
where \(a\) is a positive weight, and \(p_{over}\) and \(p_{depth}\) measure the overlap ratio and depth similarity between the two bounding boxes, respectively.
Here, \(\ell\) maps the \(k\)-th detector to the \(l\)-th region of interest, and \(v\) and \(\sigma\) denote the mean and standard deviation of the depth inside a bounding box, respectively.
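A minimal Python sketch of this grouping criterion is given below; the exact functional forms of \(p_{over}\) and \(p_{depth}\) (here an intersection-over-union ratio and a Gaussian depth-similarity term) and the way they are combined are assumptions, not the paper's definitions.

```python
import numpy as np

def p_over(box_a, box_b):
    """Overlap ratio (IoU) between two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def p_depth(depth_a, depth_b, a=1.0):
    """Depth similarity from the (mean v, std sigma) of depth inside each box.

    'a' plays the role of the positive weight from the text; the Gaussian
    form of the similarity is an assumption.
    """
    (va, sa), (vb, sb) = depth_a, depth_b
    return float(np.exp(-a * (va - vb) ** 2 / (sa ** 2 + sb ** 2 + 1e-6)))

def same_person_prob(proj_box_a, proj_box_b, depth_a, depth_b, a=1.0):
    """Probability that two detections, projected to a common ROI, share a person."""
    return p_over(proj_box_a, proj_box_b) * p_depth(depth_a, depth_b, a)
```

Detection pairs whose probability exceeds a chosen threshold are merged to form the groups \(G^{t}(n)\) used below.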
Using the proximity measure given by (2), we group the detection outputs into \(N^t\) sets of bounding boxes. Within each group \(G^{t}(n)\), the bounding boxes are likely to come from the same person. We then define a cost function (Eq. 5) that represents the matching relationship between the true positions of our tracker and the candidate locations suggested by the individual detectors,
where \(w^{t}_{k}(n)\) is the detection score, acting as a penalty for each detected bounding box.
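A plausible form for this detection-matching cost, consistent with the description (the exact Eq. 5 is given in [22]), is
\[ E^{t}_{det} = \sum_{n=1}^{N^{t}} \; \sum_{D^{t}_{k}(n) \in G^{t}(n)} w^{t}_{k}(n)\, \bigl\Vert X^{t}_{\ell(k)}(n) - D^{t}_{k}(n) \bigr\Vert^{2}. \]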
Tracking Framework - We initialize our tracker at time \(t = 1\) by aggregating the spatial (Eq. 1) and detection-matching (Eq. 5) cost functions. To determine the best bounding-box locations at time \(t\), conditioned on the inferred bounding-box locations at time \(t-1\), we extend the temporal trajectory \(E_{dyn}\) and appearance \(E_{app}\) energy functions from [16] and solve the joint optimization (the definitions of \(E_{exc}, E_{reg}, E_{dyn}, E_{app}\) are left out for space considerations).
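In outline, the joint objective aggregates these energy terms; a sketch of its form (with weights omitted) is
\[ \{X^{t}\}^{*} = \arg\min_{\{X^{t}\}} \; E_{spa} + E_{det} + E_{dyn} + E_{app} + E_{exc} + E_{reg}. \]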
We refer the interested reader to [22] for more details on our tracking framework.
2.2 Activity Analysis by Contextual Attributes
We describe the remaining steps for our NIMS system here.
Patient Identification - We fine-tune a pre-trained CNN [24] based on the architecture in [25], which is initially trained on ImageNet (http://image-net.org/). From our RGB-D sensor, we use the color images to classify images of people into one of the following categories: patient, caregiver, or family-member. Given each track from our multi-person tracker, we extract a small image according to the tracked bounding box to be classified. By understanding the role of each person, we can tune our activity analysis to focus on the patient as the primary “actor” in the scene and treat the caregivers as playing supplementary roles.
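As an illustration of this fine-tuning step, the minimal sketch below uses a torchvision AlexNet as a stand-in for the paper's network; the framework, layer surgery, and hyperparameters are assumptions, not the original implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_ROLES = 3  # patient, caregiver, family-member

# Start from an ImageNet-pretrained AlexNet-style network and replace the
# final classification layer for 3-way role prediction.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(model.classifier[6].in_features, NUM_ROLES)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(crops, labels):
    """One fine-tuning step on person crops extracted from tracked boxes."""
    optimizer.zero_grad()
    loss = criterion(model(crops), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```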
Patient Pose Classification and Context Detection - Next, we seek to estimate the pose of the patient, so we fine-tune a pre-trained network to classify our depth images into one of the following categories: lying-down, sitting, or standing. We choose depth over color because this is a geometric decision. To supplement our final representation, we apply a real-time object detector [24] to localize important objects that supplement the state of the patient, such as: bed upright, bed down, and chair. By combining bounding boxes identified as people with bounding boxes of objects, the NIMS can better ascertain whether a patient is, for example, “lying-down in a bed down” or “sitting in a chair”.
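One simple way to combine the person and object boxes is sketched below; treating the fraction of the patient box covered by an object box as sufficient evidence, along with the 0.3 threshold, is an assumption rather than the paper's rule.

```python
def overlap_fraction(patient_box, object_box):
    """Fraction of the patient box (x, y, w, h) covered by the object box."""
    px, py, pw, ph = patient_box
    ox, oy, ow, oh = object_box
    ix = max(0.0, min(px + pw, ox + ow) - max(px, ox))
    iy = max(0.0, min(py + ph, oy + oh) - max(py, oy))
    return (ix * iy) / (pw * ph) if pw * ph > 0 else 0.0

def patient_context(patient_box, object_detections, thresh=0.3):
    """Return the label of the detected object ('bed upright', 'bed down',
    'chair') that best covers the patient box, or None if no object qualifies."""
    best_label, best_frac = None, thresh
    for label, box in object_detections:      # e.g. [('chair', (x, y, w, h)), ...]
        frac = overlap_fraction(patient_box, box)
        if frac > best_frac:
            best_label, best_frac = label, frac
    return best_label
```

Combined with the pose label, this yields states such as “lying-down in a bed down”.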
Motion Analysis - Finally, we compute in-place body motion. For example, if a patient is lying in bed for a significant period of time, clinicians are interested in how much in-bed exercise occurs [23]. To achieve this, we compute the mean magnitude of motion from a dense optical flow field within the bounding box of the tracked patient between successive frames in the sequence. This statistic indicates how much frame-to-frame in-place motion the patient is exhibiting.
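A minimal sketch of this measurement is given below; OpenCV's Farneback dense flow stands in for whichever optical flow method was actually used.

```python
import cv2
import numpy as np

def in_place_motion(prev_frame, next_frame, box):
    """Mean optical-flow magnitude inside the tracked patient box.

    prev_frame, next_frame: consecutive BGR frames.
    box: (x, y, w, h) integer pixel coordinates of the tracked patient.
    """
    x, y, w, h = box
    prev_roi = cv2.cvtColor(prev_frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    next_roi = cv2.cvtColor(next_frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_roi, next_roi, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Mean per-pixel displacement magnitude between the two frames.
    return float(np.linalg.norm(flow, axis=2).mean())
```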
Mobility Classification - [23] describes a clinically accepted 11-point mobility scale (ICU Mobility Scale), shown on the right of Table 1. We collapsed this into our Sensor Scale (shown on the left), which has 4 discrete categories. The motivation for this collapse is that when a patient walks, it often happens outside the room, where our sensor cannot see.
By aggregating the different sources of information described in the preceding steps, we construct our attribute feature \(F_{t}\) with:
1. Was a patient detected in the image? (0 for no; 1 for yes)
2. What was the patient’s pose? (0 for sitting; 1 for standing; 2 for lying-down; 3 for no patient found)
3. Was a chair found? (0 for no; 1 for yes)
4. Was the patient in a bed? (0 for no; 1 for yes)
5. Was the patient in a chair? (0 for no; 1 for yes)
6. Average patient motion value
7. Number of caregivers present in the scene
We chose these attributes because their combination describes the “state” of the activity. Given a video segment of length \(T\), all attributes \(\mathbf {F} = [F_{1}, F_{2}, \dots , F_{T}]\) are extracted and the mean \(\mathbf {F}_{\mu } = \sum _{t=1}^{T}F_{t}/T\) is used to represent the overall video segment (the mean is used to account for spurious errors that may occur). We then train a Support Vector Machine (SVM) to automatically map each \(\mathbf {F}_{\mu }\) to the corresponding Sensor Scale mobility level from Table 1.
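A minimal sketch of this aggregation and classification step is shown below, with scikit-learn standing in for the SVM implementation; the kernel and regularization parameter are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def segment_feature(frame_attributes):
    """Average the per-frame 7-dimensional attribute vectors F_t over a segment."""
    return np.mean(np.asarray(frame_attributes, dtype=float), axis=0)

def train_mobility_classifier(X, y):
    """X: one averaged attribute vector per video segment; y: Sensor Scale labels."""
    clf = SVC(kernel="rbf", C=1.0)   # kernel choice is an assumption
    clf.fit(X, y)
    return clf
```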
3 Experiments and Discussions
Video data was collected from a surgical ICU at a large tertiary care hospital. All ICU staff and patients consented to participate in our IRB-approved study. A Kinect sensor was mounted on the wall of a private patient room and connected to a dedicated encrypted computer, where data was de-identified and encrypted. We recorded 362 h of video and manually curated 109 video segments covering 8 patients. Of these 8 patients, we used 3 as training data for the NIMS components (Sect. 2) and the remaining 5 for evaluation.
Training - To train the person, patient, pose, and object detectors, we selected 2000 images from the 3 training patients to cover a wide range of appearances. We manually annotated: (1) head and full-body bounding boxes; (2) person identification labels; (3) pose labels; and (4) chairs, upright beds, and down beds.
To train the NIMS Mobility classifier, 83 of the 109 video segments covering the 5 left-out patients were selected, each containing 1000 images. For each clip, a senior clinician reviewed and reported the highest level of patient mobility and we trained our mobility classifier through leave-one-out cross validation.
Tracking, Pose, and Identification Evaluation - We quantitatively compared our tracking framework to the current state of the art. We evaluate with the widely used metric MOTA (Multiple Object Tracking Accuracy) [26], which is defined as 100 % minus three types of errors: the false positive rate, missed detection rate, and identity switch rate. On our ICU dataset, we achieved a MOTA of 29.14 % compared to −18.88 % with [15] and −15.21 % with [16]. On a popular RGB-D pedestrian dataset [27], we achieve a MOTA of 26.91 % compared to 20.20 % [15] and 21.68 % [16]. We believe the difference in improvement is due to there being many more occlusions in our ICU data than in [27]. For person identification and pose classification, we achieved 99 % and 98 % test accuracy, respectively, over 1052 samples. Our tracking framework requires a runtime of 10 s/frame on average, and speeding this up to real time is a point of future work.
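For reference, a minimal sketch of how MOTA aggregates these error types, following the standard CLEAR MOT definition [26]:

```python
def mota(false_positives, missed, id_switches, num_ground_truth):
    """Multiple Object Tracking Accuracy as a percentage.

    Counts are accumulated over all frames after matching tracker output
    to ground-truth annotations.
    """
    errors = false_positives + missed + id_switches
    return 100.0 * (1.0 - errors / float(num_ground_truth))
```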
Mobility Evaluation - Table 2 shows a confusion matrix for the 83 video segments, demonstrating the inter-rater reliability between the NIMS and clinician ratings. We evaluated the NIMS using a weighted Kappa statistic with a linear weighting scheme [28]. The strength of agreement for the Kappa score was qualitatively interpreted as: 0.0–0.20 as slight, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1.0 as perfect [28]. Our weighted Kappa was 0.8616 with a 95 % confidence interval of (0.72, 1.0). To compare to a popular technique, we computed features using Dense Trajectories [17] and trained an SVM (using Fisher Vector encodings with 120 GMMs), achieving a weighted Kappa of 0.645 with a 95 % confidence interval of (0.43, 0.86).
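The linearly weighted Kappa can be computed with, for example, scikit-learn; the toy ratings below are for illustration only, the real inputs being the per-segment Sensor Scale labels from the NIMS and the clinician.

```python
from sklearn.metrics import cohen_kappa_score

# Toy example: one Sensor Scale rating (0-3) per video segment from each rater.
nims_labels = [0, 1, 1, 2, 3, 2]
clinician_labels = [0, 1, 2, 2, 3, 2]

kappa = cohen_kappa_score(nims_labels, clinician_labels, weights="linear")
print(f"Linearly weighted Kappa: {kappa:.3f}")
```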
The main source of disagreement lay in differentiating “A” from “B”. This disagreement highlights a major difference between human and machine observation: the NIMS is a computational method that distinguishes activities containing motion from those that do not using a quantitative, repeatable approach.
4 Conclusions
In this paper, we demonstrated a video-based activity monitoring system called NIMS. With respect to the main technical contributions, our multi-person tracking methodology addresses the real-world problem of tracking humans in complex environments where occlusions and rapidly changing visual information occur. We will continue to develop our attribute-based activity analysis for more general activities, work to apply this technology to rooms with multiple patients, and explore the possibility of quantifying patient/provider interactions.
References
Brower, R.: Consequences of bed rest. Crit. Care Med.37(10), S422–S428 (2009)
Corchado, J., Bajo, J., De Paz, Y., Tapia, D.: Intelligent environment for monitoring Alzheimer patients, agent technology for health care. Decis. Support Syst.44(2), 382–396 (2008)
Hwang, J., Kang, J., Jang, Y., Kim, H.: Development of novel algorithm and real-time monitoring ambulatory system using bluetooth module for fall detection in the elderly. In: IEEE EMBS (2004)
Smith, M., Saunders, R., Stuckhardt, K., McGinnis, J.: Best Care at Lower Cost: the Path to Continuously Learning Health Care in America. National Academies Press, Washington, DC (2013)
Hashem, M., Nelliot, A., Needham, D.: Early mobilization and rehabilitation in the intensive care unit: moving back to the future. Respir. Care61, 971–979 (2016)
Berney, S., Rose, J., Bernhardt, J., Denehy, L.: Prospective observation of physical activity in critically ill patients who were intubated for more than 48 hours. J. Crit. Care30(4), 658–663 (2015)
Chakraborty, I., Elgammal, A., Burd, R.: Video based activity recognition in trauma resuscitation. In: International Conference on Automatic Face and Gesture Recognition (2013)
Lea, C., Facker, J., Hager, G., et al.: 3D sensing algorithms towards building an intelligent intensive care unit. In: AMIA Joint Summits Translational Science Proceedings (2013)
Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE CVPR (2005)
Chen, X., Mottaghi, R., Liu, X., et al.: Detect what you can: detecting and representing objects using holistic models and body parts. In: IEEE CVPR (2014)
Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. PAMI32(9), 1627–1645 (2010)
Verceles, A., Hager, E.: Use of accelerometry to monitor physical activity in critically ill subjects: a systematic review. Respir. Care60(9), 1330–1336 (2015)
Babenko, D., Yang, M., Belongie, S.: Robust object tracking with online multiple instance learning. PAMI33(8), 1619–1632 (2011)
Lu, Y., Wu, T., Zhu, S.: Online object tracking, learning and parsing with and-or graphs. In: IEEE CVPR (2014)
Choi, W., Pantofaru, C., Savarese, S.: A general framework for tracking multiple people from a moving camera. PAMI35(7), 1577–1591 (2013)
Milan, A., Roth, S., Schindler, K.: Continuous energy minimization for multi-target tracking. TPAMI36(1), 58–72 (2014)
Wang, H., Schmid, C.: Action recognition with improved trajectories. In: IEEE ICCV (2013)
Karpathy, A., Toderici, G., Shetty, S., et al.: Large-scale video classification with convolutional neural networks. In: IEEE CVPR (2014)
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NIPS (2014)
Wu, Z., Wang, X., Jiang, Y., Ye, H., Xue, X.: Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In: ACMMM (2015)
Liu, J., Kuipers, B., Savarese, S.: Recognizing human actions by attributes. In: IEEE CVPR (2011)
Ma, A.J., Yuen, P.C., Saria, S.: Deformable distributed multiple detector fusion for multi-person tracking (2015). arXiv:1512.05990 [cs.CV]
Hodgson, C., Needham, D., Haines, K., et al.: Feasibility and inter-rater reliability of the ICU mobility scale. Heart Lung43(1), 19–24 (2014)
Girshick, R.: Fast R-CNN (2015). arXiv:1504.08083
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS (2012)
Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP J. Image Video Process. 2008, 1–10 (2008)
Spinello, L., Arras, K.O.: People detection in RGB-D data. In: IROS (2011)
McHugh, M.: Interrater reliability: the Kappa statistic. Biochemia Med.22(3), 276–282 (2012)
Author information
Authors and Affiliations
The Johns Hopkins University, Baltimore, MD, USA
Austin Reiter, Andy Ma & Suchi Saria
Johns Hopkins Medical Institutions, Baltimore, MD, USA
Nishi Rawat & Christine Shrock
Corresponding author
Correspondence to Austin Reiter.
Editor information
Editors and Affiliations
University College London, London, United Kingdom
Sebastien Ourselin
The Hebrew University of Jerusalem, Jerusalem, Israel
Leo Joskowicz
Harvard Medical School, Boston, Massachusetts, USA
Mert R. Sabuncu
Istanbul Technical University, Istanbul, Turkey
Gozde Unal
Harvard Medical School and Brigham and Women's Hospital, Boston, Massachusetts, USA
William Wells
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Reiter, A., Ma, A., Rawat, N., Shrock, C., Saria, S. (2016). Process Monitoring in the Intensive Care Unit: Assessing Patient Mobility Through Activity Analysis with a Non-Invasive Mobility Sensor. In: Ourselin, S., Joskowicz, L., Sabuncu, M., Unal, G., Wells, W. (eds) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016. Lecture Notes in Computer Science, vol 9900. Springer, Cham. https://doi.org/10.1007/978-3-319-46720-7_56
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46719-1
Online ISBN: 978-3-319-46720-7