1. Introduction
Human interaction recognition (HIR) deals with the understanding of communication taking place between a human and an object or other persons [1]. HIR includes the understanding of various actions, such as social interaction, person-to-person talking, meeting or greeting in the form of a handshake or a hug, and the performance of inappropriate actions, such as fighting, kicking or punching each other. There are many different kinds of interactions that can easily be identified by human observation. However, in many situations, personal human observation of some actions is impractical due to the cost of resources and to hazardous environments. For example, in the case of smart rehabilitation, it is more suitable for a machine to monitor a patient's daily routine than for a human to constantly observe a patient (24/7) [2]. Similarly, in the case of video surveillance, it is more appropriate to monitor human actions via sensor devices, especially in places where risk factors and suspicious activities are involved.
Due to its wide variety of applications, HIR has gained much attention in recent years. These applications include public security surveillance, e-health care, smart homes, assistive robots and sports assistance [3], each of which requires an efficient understanding and identification of discrete human movements [4,5,6,7,8,9]. Many HIR systems have been proposed to tackle problems faced in activity monitoring in healthcare, rehabilitation, surveillance and many other situations. Reliable and accurate monitoring is essential in order to follow the progress of patients in physical therapy and rehabilitation centers, to detect potential and actual dangers, such as falls and thefts, and to prevent mishaps and losses due to lack of attention [10]. HIR systems have also been proposed for security purposes [11]; for example, a fuzzy logic based human activity recognition system was proposed in [12], and a Hidden Markov Model (HMM) based HIR system was proposed for surveillance in [13]. A random forest based HIR system for smart homes and elderly care was proposed by H. Xu et al. [14], and a neural network based HIR system was presented by S. Chernbumroong for assisted living [15]. Clearly, HIR systems are in demand and highly applicable in many daily life domains.
Motivated by the applications of HIR systems in daily life, we propose a robust system which is able to track human interactions and which is easy to deploy in real-world applications [16]. Challenges such as complex and cluttered backgrounds, intra-class variations and inter-class similarity make it difficult to accurately recognize and distinguish between human interactions. Therefore, we aim to increase the recognition rate of human–human interactions and to tackle the challenges faced by recent HIR systems by incorporating depth sensors. The recognition rate of human interactions has been boosted by recent low-cost depth sensor technology [17,18]. Depth imaging technology has received increasing attention in recent years because it provides promising results without the attachment of marker sensors [2,19,20]. HIR systems based on depth sensors are easier to deploy in daily life applications than systems based on wearable or marker sensors [21]. This is because wearable sensors need to be attached to the body of an individual in order to perform well, which creates usability and mobility problems for the wearer. The main purpose of this research work is to propose a multi-vision sensor based HIR system built on a hybrid of four unique features in order to achieve a better performance rate. Our system aims to give computers the sensitivity to automatically monitor, recognize and distinguish between human actions happening in their surroundings.
Basically, HIR can be categorized into four types: human–object interaction (HOI), human–robot interaction (HRI), human–human interaction (HHI) and human–group interaction (HGI). In the case of HOI, humans act, communicate and interact with various objects to perform different daily actions [22,23], such as picking up a glass for drinking, holding a ball for throwing and taking food for eating. During HRI, a robot may be able to perform different postures, such as shaking hands, serving food and waving; robots in HRI can precisely predict a human's future actions and analyze the gestures of the persons that interact with them [24]. Similarly, in HHI and HGI, a system can estimate the trajectory information of human–human interactions or capture the movement patterns of groups of people [25] in crowded or public areas. However, our research work is focused on human-to-human interaction.
In this paper, we propose a novel hybrid HIR system with an entropy Markov model that examines the daily interactions of humans. The proposed model measures the spatio-temporal properties of body posture and estimates an empirical expectation of pattern changes using depth sensors. To filter the vision (RGB or depth) data, we adopted a mean filter, pixel connectivity analysis and Otsu's thresholding method. For the hybrid descriptor, we propose four types of feature characteristics, as follows:
Space and time based, i.e., spatio-temporal, features, in which displacement measurements between key human body points are taken as temporal features and intensity changes along the curved body points of silhouettes are taken as spatial features (a minimal sketch of this descriptor is given after this list).
Motion-orthogonal histograms of oriented gradient (MO-HOG) features, which are based on three different views of the human silhouette; these views are projected as orthogonal planes and then HOG is applied.
Shape based angular and geometric features, which include angular measurements over two types of shapes, i.e., inter-silhouette and intra-silhouette shapes.
Energy based features, which examine the energy distribution of distinct body parts within a silhouette.
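To make the spatio-temporal descriptor concrete, the following minimal Python/NumPy sketch computes frame-to-frame Euclidean displacements of tracked key body points as temporal features, and uses pairwise distances between key points within a frame as a simple stand-in spatial cue (the actual spatial features described above are intensity changes along the silhouette contour, which require the segmented silhouette). The number of frames, the number of key points and the synthetic coordinates are illustrative assumptions, and key point detection itself is not shown.

```python
import numpy as np

def temporal_displacement(key_points):
    """Frame-to-frame Euclidean displacement of each key body point.

    key_points: array of shape (n_frames, n_points, 2) holding (x, y)
    image coordinates of detected key body points (e.g., head, hands, feet).
    Returns an array of shape (n_frames - 1, n_points).
    """
    diffs = np.diff(key_points, axis=0)        # (n_frames - 1, n_points, 2)
    return np.linalg.norm(diffs, axis=-1)      # Euclidean distance per point

def pairwise_distances(frame_points):
    """Distances between all pairs of key points within one frame (spatial cue)."""
    d = frame_points[:, None, :] - frame_points[None, :, :]
    return np.linalg.norm(d, axis=-1)

# Illustrative usage: 30 frames, 5 key body points per silhouette.
rng = np.random.default_rng(1)
pts = rng.random((30, 5, 2)) * 100.0           # synthetic key point tracks
temporal_feat = temporal_displacement(pts).ravel()
spatial_feat = pairwise_distances(pts[0])[np.triu_indices(5, k=1)]
descriptor = np.concatenate([temporal_feat, spatial_feat])
print(descriptor.shape)                        # (29 * 5 + 10,) = (155,)
```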
These hybrid descriptors are fed into a Gaussian mixture model (GMM) with Fisher encoding for codebook generation and for proper discrimination among the various activity classes. We then apply a cross entropy algorithm, which yields an optimized distribution of the feature matrices. Finally, a maximum entropy Markov model (MEMM) is embodied in the proposed HIR system to estimate the empirical expectation and the highest entropy of the different human interactions and thereby achieve significant accuracy. Four experiments were performed using a leave-one-out cross validation method on three well-known datasets, and our proposed method achieved significantly better performance than well-known state-of-the-art methods. The major contributions of this paper can be highlighted as follows:
We proposed hybrid descriptor features that combine spatio-temporal characteristics, invariant properties, view orientation, displacement and intra/inter angular values to distinguish human interactions.
We introduced a combination of a GMM with Fisher encoding for codebook generation and optimal discrimination of features (a sketch of this encoding is given after this list).
We designed cross entropy optimization and an MEMM to analyze contextual information and to classify complex human interactions more effectively.
We performed experiments on three publicly available datasets; the proposed method was fully validated for efficacy and outperformed other state-of-the-art methods, including deep learning approaches.
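As a rough illustration of the GMM plus Fisher encoding step (contribution 2 above), the sketch below fits a diagonal-covariance GMM codebook with scikit-learn and computes a simplified Fisher vector that keeps only the gradients with respect to the component means, followed by power and L2 normalisation. This is not the exact implementation used in the paper; the descriptor dimensionality, the number of components and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_codebook(local_descriptors, n_components=16, seed=0):
    """Fit a diagonal-covariance GMM codebook over local descriptors."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=seed)
    gmm.fit(local_descriptors)
    return gmm

def fisher_vector(gmm, x):
    """Simplified Fisher vector: gradient w.r.t. the GMM means only.

    x: (n_local, dim) descriptors of one video clip.
    Returns a vector of length n_components * dim.
    """
    gamma = gmm.predict_proba(x)                              # (n, K) soft assignments
    mu, sigma, w = gmm.means_, np.sqrt(gmm.covariances_), gmm.weights_
    diff = (x[:, None, :] - mu[None, :, :]) / sigma[None, :, :]
    grad = (gamma[:, :, None] * diff).sum(axis=0)             # (K, dim)
    fv = (grad / (x.shape[0] * np.sqrt(w)[:, None])).ravel()
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                    # power normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)                  # L2 normalisation

# Illustrative usage with synthetic 32-dimensional descriptors.
rng = np.random.default_rng(0)
gmm = fit_codebook(rng.normal(size=(2000, 32)))
clip_encoding = fisher_vector(gmm, rng.normal(size=(120, 32)))
print(clip_encoding.shape)                                    # (16 * 32,) = (512,)
```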
The rest of the paper is organized as follows: Section 2 reviews related work in the field of HIR. Section 3 presents the details of our proposed methodology. Section 4 reports the experimental setup, dataset descriptions and generated results. Section 5 presents a discussion of the overall paper. Finally, Section 6 concludes the proposed research work with some future directions.
4. Experimental Setting and Results
In this section, we report the training/testing experimentation results obtained using the n-fold cross validation method over three publicly available benchmark datasets. The SBU Kinect Interaction and UoL 3D datasets include both RGB and depth image sequences, while the UT-Interaction dataset consists of RGB data only. Complete descriptions of each dataset are given in this section. The proposed model is evaluated on the basis of various performance parameters, i.e., computation time, recognition accuracy, precision, recall, F1 score, number of states and number of observations. A discussion of various well-known classifiers and a comparison of the proposed HIR system with other state-of-the-art HIR methods are also given in this section.
4.1. Datasets Description
4.1.1. SBU Kinect Interaction Dataset
The SBU Kinect interaction dataset [
78] consists of RGB, depth and skeletal information for the two-person performing interactions collected by Microsoft Kinect sensors in an indoor environment. Eight types of interactions including Approaching, Departing, Kicking, Punching, Pushing, Shaking Hands, Exchanging Object and Hugging are performed. The overall dataset is really challenging to interpret due to the similarity or closer proximity of movements in the different interaction classes. The sizes of both RGB and depth images are 649 × 480. Additionally, the dataset has a total of 21 folders, where each folder consists of all eight interaction classes performed by a different combination of seven actors. The ground truth labels of each interaction class are also provided. Videos are segmented at the rate of 15 frames per second (fps).
Figure 13 shows some examples of human interaction classes of the SBU dataset.
4.1.2. UoL 3D Dataset
The UoL 3D dataset [79] contains a combination of three types of interactions: casual daily life, harmful and assisted living interactions. The included interactions, performed by four males and two females, are handshake, hug, help walk, help stand-up, fight, push, conversation and call attention. RGB, depth and skeletal information for each interaction is captured with a Kinect 2 sensor. Each folder contains 24-bit RGB images, depth images at both 8-bit and 16-bit resolution and skeletal information comprising 15 joints. There are ten different sessions of the eight interactions performed by two subjects (in pairs), recorded in an indoor environment with 40–60 repetitions. This is a very challenging dataset consisting of over 120,000 data frames. Some snapshots of interactions from this dataset are shown in Figure 14.
4.1.3. UT Interaction Dataset
The UT-Interaction dataset [80] consists of RGB data only. It has six interaction classes, namely point, push, shake hands, hug, kick and punch, performed by several participants with different appearances. The dataset is divided into two sets, named UT-Interaction Set 1 and UT-Interaction Set 2. The environment of Set 1 is a parking lot, while the environment of Set 2 is a windy lawn. Video is captured at a resolution of 720 × 480 and 30 fps. There are 20 videos per interaction, providing a total of 120 videos over the six interactions. Figure 15 shows some examples of interaction classes from the UT-Interaction dataset.
4.2. Performance Parameters and Evaluation
In order to validate the methodology of the proposed HIR system, four different types of experiments were performed using various performance parameters, i.e., recognition accuracy, precision, recall, F-score, computational time and comparison with state-of-the-art methods. Details and observations for each experiment are discussed in the following sub-sections.
4.2.1. First Experiment
In the first experiment, optimized feature vectors are subjected to the MEMM in order to evaluate the average accuracy of the proposed system. We used the n-fold cross validation method for training/testing over the three benchmark datasets. Table 2 and Table 3 show the accuracies of the interactions of the SBU and UoL datasets in the form of confusion matrices. Similarly, the recognition accuracies of UT-Interaction Set 1 and Set 2 are shown in Table 4 and Table 5, respectively. The mean accuracy over the SBU dataset is 91.25%, the accuracy over UoL is 90.4% and the combined accuracy over UT-Interaction Set 1 and Set 2 is 87.4%.
From the experimental results, it is observed that our hybrid feature methodology, along with cross entropy optimization and the MEMM, clearly recognizes human interactions well. However, some confusion is observed between pairs of similar interactions, such as shaking hands and exchanging object, and punching and pushing, in the SBU dataset. In the UoL dataset, confusion is observed between the handshake and help stand-up interactions. Such confusion is due to the similarity of the body movements involved in these interactions. In the UT-Interaction dataset, there is confusion between shake hands and point, and between push and punch, again due to the similarity of these interactions. In addition, it is observed that when combinations of RGB and depth vectors are fed into the MEMM, we achieve better recognition rates than with RGB alone: the recognition rate on the RGB-only dataset, i.e., UT-Interaction (87.4%), is lower than those on the SBU and UoL datasets, which are 91.25% and 90.4%, respectively. Thus, incorporating depth information improves the accuracy rate.
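The fold protocol used in this experiment can be sketched as follows, assuming one optimized feature vector per video clip and using the SBU folder (actor-pair) index as the group so that each fold leaves one pair out. The feature dimensionality, the synthetic data and the scikit-learn LogisticRegression classifier are only stand-ins for the actual MEMM pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

# Synthetic stand-ins: one feature vector per clip (dimension 512 is an
# assumption), one of 8 interaction labels, and the actor-pair/folder index
# used as the group (21 folders x 8 clips, mirroring the SBU layout).
rng = np.random.default_rng(0)
X = rng.normal(size=(168, 512))
y = rng.integers(0, 8, size=168)
groups = np.repeat(np.arange(21), 8)

accuracies = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    # LogisticRegression is only a stand-in for the MEMM classifier here.
    clf = LogisticRegression(max_iter=500).fit(X[train_idx], y[train_idx])
    accuracies.append(np.mean(clf.predict(X[test_idx]) == y[test_idx]))

print(f"mean accuracy over {len(accuracies)} folds: {np.mean(accuracies):.3f}")
```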
4.2.2. Second Experiment
In the second experiment, the precision, recall and F1 score for each interaction class of the three datasets are evaluated, as shown in Table 6.
It is observed that, in the SBU dataset, the approaching interaction has the lowest precision of 88% and also the highest false positive rate. This is because the periodic actions of many interactions, such as departing, shaking hands and exchanging object, are similar to the approaching interaction. On the other hand, the kicking interaction gives the most precise results, with a low false positive ratio of 3%. In the UoL dataset, the hug interaction gives the most precise result of 95% because the periodic actions performed during a hug are different from those of the other interactions in this dataset. The handshake and conversation interactions have the highest false positive ratios of 13% and 14%, respectively, because the body movements of the silhouettes during these two interactions are similar to many other interactions. Overall, comparing the three datasets, the precision, recall and F1 score ratios of both sets of the UT-Interaction dataset are lower than those of the SBU and UoL datasets.
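Per-class precision, recall and F1 score, as reported in Table 6, can be computed as in the short sketch below; the label values are toy placeholders and only the class names follow the SBU interaction set.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy ground-truth / predicted labels; class names follow the SBU interaction set.
classes = ["approaching", "departing", "kicking", "punching",
           "pushing", "shaking hands", "exchanging object", "hugging"]
y_true = [0, 1, 2, 3, 4, 5, 6, 7, 0, 5, 6, 3]
y_pred = [0, 1, 2, 3, 4, 6, 6, 7, 5, 5, 6, 4]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=list(range(len(classes))), zero_division=0)
for name, p, r, f in zip(classes, precision, recall, f1):
    print(f"{name:>18}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```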
4.2.3. Third Experiment
In the third experiment, nine sub-experiments were performed for each dataset. Different combinations of two parameters, i.e., the number of states and the number of observations, were used to evaluate the performance of the MEMM, and comparisons are made in terms of time complexity and recognition accuracy. In the MEMM, each state transition depends not only on the current observation but also on the previous state. Therefore, increasing the number of states and observations affects the performance rate of HIR.
Table 7, Table 8 and Table 9 show a comparison of the number of states and observations in terms of time complexity and recognition accuracy over the SBU, UoL 3D and UT-Interaction datasets, respectively.
In Table 7, using four states and increasing the number of observations from 10 to 30, the computational time and recognition accuracy gradually increase; these experiments are repeated for five and six states. Similarly, Table 8 uses 4–6 states and shows significant results for computational time and recognition accuracy at 15 to 35 observations. Table 9 presents the results of these experiments on Set 1 of the UT-Interaction dataset.
It is concluded from the third experiment that reducing the number of states to two reduces both recognition accuracy and computational time. On the other hand, increasing the number of states to six increases the computational time with no change in accuracy. Similar patterns are observed across Table 7, Table 8 and Table 9, i.e., increasing the number of states and observations increases both computational time and accuracy.
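To illustrate how such a sweep over the number of states and observation symbols can be organised, the following self-contained toy sketch trains a minimal MEMM (a maximum entropy transition classifier implemented with scikit-learn LogisticRegression and greedy decoding) on synthetic symbol sequences and reports training time and accuracy for each (states, observations) setting. It is not the paper's implementation; the sequence lengths, data generator and parameter grid are illustrative assumptions.

```python
import time
import numpy as np
from sklearn.linear_model import LogisticRegression

def make_sequences(n_seq, seq_len, n_states, n_obs, rng):
    """Synthetic label/observation sequences: each state emits symbols
    from its own band of the observation alphabet."""
    band = max(1, n_obs // n_states)
    X, y = [], []
    for _ in range(n_seq):
        states = rng.integers(0, n_states, size=seq_len)
        obs = (states * band + rng.integers(0, band, size=seq_len)) % n_obs
        X.append(obs)
        y.append(states)
    return X, y

def encode(obs, prev_state, n_states, n_obs):
    """One-hot features over (current observation, previous state)."""
    f = np.zeros(n_obs + n_states)
    f[obs] = 1.0
    f[n_obs + prev_state] = 1.0
    return f

def train_memm(X, y, n_states, n_obs):
    """Maxent classifier P(state | observation, previous state)."""
    feats = [encode(o, 0 if t == 0 else s_seq[t - 1], n_states, n_obs)
             for o_seq, s_seq in zip(X, y)
             for t, o in enumerate(o_seq)]
    labels = np.concatenate(y)
    return LogisticRegression(max_iter=1000).fit(np.array(feats), labels)

def decode(clf, o_seq, n_states, n_obs):
    """Greedy decoding: feed the previously predicted state back in."""
    prev, out = 0, []
    for o in o_seq:
        prev = int(clf.predict(encode(o, prev, n_states, n_obs)[None, :])[0])
        out.append(prev)
    return out

rng = np.random.default_rng(0)
for n_states in (4, 5, 6):
    for n_obs in (10, 20, 30):
        X, y = make_sequences(30, 50, n_states, n_obs, rng)
        t0 = time.perf_counter()
        clf = train_memm(X[:20], y[:20], n_states, n_obs)
        hits = [p == t for o_seq, s_seq in zip(X[20:], y[20:])
                for p, t in zip(decode(clf, o_seq, n_states, n_obs), s_seq)]
        print(f"states={n_states} observations={n_obs} "
              f"time={time.perf_counter() - t0:.2f}s accuracy={np.mean(hits):.2f}")
```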
4.2.4. Fourth Experiment
In the fourth experiment, we compared our proposed system in two parts. In the first part, the hybrid descriptor based MEMM classifier is compared with other commonly used classifiers. In the second part, the proposed system is compared with other well-known state-of-the-art HIR systems.
In the first part, the quantized feature vectors are given to the most commonly used classifiers, i.e., ANN, HMM and Conditional Random Field (CRF), and compared with the MEMM to find the HIR accuracy rates for the interactions of each dataset.
Figure 16 shows a comparison of recognition accuracies for each interaction class of the SBU dataset using all four classifiers.
From Figure 16, it can be seen that the mean recognition accuracy for ANN is 87.3%, for CRF 90%, for HMM 85.3% and for the MEMM 91.25%. It is observed that, in some interactions, such as exchanging object and shaking hands, CRF performed better than the MEMM, and ANN performed better in a few interactions, such as kicking and punching. However, the overall accuracy using the MEMM was higher than that of the other classifiers.
Figure 17 shows the comparison of recognition accuracies for each interaction class using the UoL dataset.
From Figure 17, it can be seen that, on the UoL dataset, the mean recognition accuracy of ANN is 82.75%, of CRF 88.5%, of HMM 86.37% and of the MEMM 90.4%. It is observed that some interactions, such as fight in the case of ANN, handshake in the case of CRF and help walk in the case of HMM, achieved better recognition accuracy than with the MEMM. However, the overall recognition rate was still higher with the MEMM.
Figure 18 shows the comparison of four classifiers over interaction classes of the UT-Interaction Set 1 and Set 2.
From Figure 18a,b, it is observed that the mean recognition accuracy rates of Set 1 and Set 2 of the UT-Interaction dataset are lower than those of the depth datasets. The mean accuracies on Set 1 of the UT-Interaction dataset are 79.16% with ANN, 84.3% with CRF, 82.8% with HMM and 88% with the MEMM classifier. Mean accuracies are further reduced on Set 2 of the UT-Interaction dataset due to the cluttered background of a windy lawn: the mean accuracy for ANN is 77%, for CRF 82.7%, for HMM 80.7% and for the MEMM 86.8%. Meanwhile, it is observed that the patterns of recognition accuracies for Set 1 and Set 2 are similar to those of the UoL and SBU datasets, in that the MEMM has the highest accuracy rate while ANN has the lowest, and the accuracy rates of the MEMM and CRF are comparable. Moreover, CRF, HMM and the MEMM performed better in most of the interaction classes, except for the fight interaction, where ANN has a better or nearly similar recognition rate. Overall, however, the MEMM has the best recognition rates; thus, it is concluded that MEMM based performance is best for HIR.
In the second part of this experiment, the proposed HIR system is compared with other well-known state-of-the-art systems. Table 10 presents a comparison of the results for the SBU, UoL and UT-Interaction datasets.
6. Conclusions and Future Work
In this paper, we have proposed a novel HIR system to recognize human interactions in both RGB and depth environmental settings. The main accomplishments of this research work are: (1) adequate silhouette segmentation; (2) identification of key human body parts; (3) extraction of four novel features, i.e., spatio-temporal, MO-HOG, angular-geometric and energy based features; and (4) cross entropy optimization and recognition of each interaction via the MEMM. In the first phase, RGB and depth silhouettes are identified separately. For RGB silhouette segmentation, skin color detection, connected component analysis and binary thresholding are applied to separate humans from their background. After the extraction of silhouettes, the spatio-temporal features are extracted, in which the displacement between key body points is measured via Euclidean distance. For the angular-geometric features, various geometrical shapes are formed by connecting the extreme points of the silhouettes, and the angles of these shapes are measured for each interaction class. After that, the MO-HOG features are extracted, in which differential silhouettes are projected from three different views and then HOG is applied. Finally, unique energy features are extracted from each interaction class. The hybrid of these feature descriptors results in a very complex vector representation; in order to reduce this complexity, a GMM based Fisher vector codebook (FVC) is applied and then cross entropy optimization is performed.
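A rough sketch of the RGB pre-processing summarised above (mean filtering, binary thresholding via Otsu's method and connected-component analysis) is given below, assuming OpenCV; the skin color step is omitted, and the kernel size and minimum component area are illustrative assumptions rather than the values used in the paper.

```python
import cv2
import numpy as np

def segment_silhouettes(frame_bgr, min_area=500):
    """Rough RGB silhouette mask via mean filtering, Otsu thresholding and
    connected-component filtering; min_area and kernel size are illustrative."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    smoothed = cv2.blur(gray, (5, 5))                    # mean filter
    _, binary = cv2.threshold(smoothed, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    n_labels, labels, stats, _ = cv2.connectedComponentsWithStats(
        binary, connectivity=8)
    mask = np.zeros_like(binary)
    for lbl in range(1, n_labels):                       # label 0 is background
        if stats[lbl, cv2.CC_STAT_AREA] >= min_area:
            mask[labels == lbl] = 255
    return mask

# Hypothetical usage on a single extracted video frame:
# mask = segment_silhouettes(cv2.imread("frame_0001.png"))
```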
During experimental testing, four different types of experiments were conducted on three benchmark datasets in order to validate the performance of the proposed system. In the first experiment, the recognition accuracies for the interaction classes of each dataset were measured. In the second experiment, the F1 score, precision and recall of each interaction class were measured and compared. In the third experiment, computation time and accuracy were measured by changing the number of states and observations of the MEMM classifier. Finally, in the fourth experiment, the recognition accuracies for the interaction classes of each dataset were measured with the most commonly used classifiers, i.e., ANN, HMM and CRF, and compared with the MEMM. The results showed better performance, with average recognition rates of 87.4% for the UT-Interaction, 90.4% for the UoL and 91.25% for the SBU datasets, and validated the efficacy of the proposed system. The proposed system is applicable to various real-life scenarios, such as security monitoring, smart homes, healthcare and content-based video indexing and retrieval.
In the future, we plan to extend the proposed method to interactions within groups of humans as well as to human–object interactions. We also plan to explore entropy based features and to evaluate the system on more challenging datasets.