Author manuscript; available in PMC 2021 Mar 19.

Human Eye Movements Reveal Video Frame Importance

Zheng Ma1, Jiaxin Wu2, Sheng-hua Zhong2, Jianmin Jiang2, Stephen J. Heinen3
1The Smith-Kettlewell Eye Research Institute
2Shenzhen University
3The Smith-Kettlewell Eye Research Institute

Issue date 2019 May.

PMCID: PMC7975628  NIHMSID: NIHMS1529492  PMID: 33746238
The publisher's version of this article is available at Computer (Long Beach Calif)

Abstract

Human eye movements indicate important spatial information in static images as well as videos. Yet videos contain additional temporal information and convey a storyline. Video summarization is a technique that reduces video size, but maintains the essence of the storyline. Here, the authors explore whether eye movement patterns reflect frame importance during video viewing and facilitate video summarization. Eye movements were recorded while subjects watched videos from the SumMe video summarization dataset. The authors find more gaze consistency for selected than unselected frames. They further introduce a novel multi-stream deep learning model for video summarization that incorporates subjects’ eye movement information. Gaze data improved the model’s performance over that observed when only the frames’ physical attributes were used. The results suggest that eye movement patterns reflect cognitive processing of sequential information that helps select important video frames, and provide an innovative algorithm that uses gaze information in video summarization.


Eye movement information is useful for revealing observers’ interests1. This is likely because visual acuity is highest in a small foveal region, and eye movements reorient the fovea to different scene elements that require high resolution viewing2. Therefore, eye movements often reliably indicate informative regions in static images or scenes.

Several studies explored eye movement dynamics during video watching, and found that subjects make similar eye movements while watching commercial films, TV shows or interviews3,4. Furthermore, gaze locations tend to cluster at the center of a screen3, as well as on biologically relevant social stimuli5. Eye movement data has been integrated into computational models to help them locate salient information and objects in video frames6.

Unlike static images, videos contain both spatial and temporal information. Therefore, to understand gaze behavior in videos, it is not only important to characterize where people look, but also which video frames are the most important and informative to the observers. Besides gleaning insights into neural mechanisms for directing gaze, knowing when and where people look has practical implications for research on video summarization. In video summarization, short videos are created from long ones using only those frames that are important for conveying the essence of the long video. Understanding gaze patterns for video summarization is especially important given the recent development of new video technologies, uploading platforms, and the explosive growth of video data. Approximately 400 hours of videos are uploaded to YouTube every minute (https://www.statista.com/topics/2019/youtube/), many of which are poorly edited and have redundant or tediously long segments. Viewers would have a much more efficient viewing experience if video summarization algorithms7 could provide an accurate summary of the original long videos. If human eye movement patterns can help inform the importance of individual video frames, then we can use them to better predict which frames are critically important for understanding a video, and thus improve video summarization algorithms.

In the current study, we investigate whether human eye movements reflect the importance of frames within a video, and whether they can improve computational models that perform video summarization. Our work makes the following contributions:

  • We demonstrate that eye movement patterns of subjects are similar to each other while viewing videos, even when no instructions are given about how to view them.

  • We show that eye movements are more consistent while viewing frames with higher importance scores than those with lower scores, suggesting that human eye movements reliably predict importance judgements.

  • In a computational experiment, we demonstrate that a model that uses gaze information performs better video summarization than a model with only frame-based physical information.

Together, the results demonstrate that human eye movement consistency indicates whether a video frame is important, and suggest that eye movement data facilitates computer vision models of video summarization.

BEHAVIORAL EXPERIMENT

SumMe Video Summarization Dataset

Our experiments were conducted using raw videos from the SumMe video summarization benchmark dataset8. The SumMe dataset contains 25 raw videos together with video summaries that are generated by human observers. The videos depict various events such as cooking, plane landing, and base jumping. The length of the videos varies from one to six minutes. In the manuscript, we refer to the people who created the personal video summaries for the SumMe dataset as “users”, and those whose eye movements we recorded as “subjects.”

A total of 41 users (19 males and 22 females) participated in the video summarization for the SumMe dataset8. The users were instructed to generate video summaries that contain the most important content of the original raw video, but with only 5% to 15% of the original length. The audio was muted to ensure only visual stimuli were used for summarization. Frame-level importance scores were calculated based on the probability that a frame was included in the user summaries. Finally, a group-level summary was generated based on the frames with the highest 15% of the importance scores8.
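The frame-level scoring and group-summary construction described above can be expressed compactly. The following is a minimal sketch, assuming the user selections are available as a boolean users-by-frames matrix (a hypothetical input format; the SumMe release stores this information in its own files).

```python
import numpy as np

def frame_importance(user_selections):
    """user_selections: (n_users, n_frames) boolean array, True where a user
    included the frame in their personal summary (hypothetical input format)."""
    # Importance score = probability that a frame appears in the user summaries.
    scores = user_selections.mean(axis=0)
    # Group-level summary: frames within the top 15% of importance scores.
    threshold = np.percentile(scores, 85)
    group_summary = scores >= threshold
    return scores, group_summary
```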

Human Eye Movement Data Collection

We collected human eye movement data from a separate group of six subjects (two males and four females) while they viewed the raw videos of the SumMe dataset. All subjects had normal or corrected-to-normal vision. The study was approved by the Institutional Review Board at the Smith-Kettlewell Eye Research Institute, and also adhered to the Declaration of Helsinki. Informed consent was obtained for experimentation with human subjects.

Videos were presented on a Samsung screen (resolution: 2560×1440, refresh rate: 60 Hz) and generated by Psychtoolbox-3 for MATLAB on a MacBook Pro computer. Observers’ heads were stabilized by a chin and forehead rest. Viewing distance was 57 cm, and the display was 58.2°×33.3°. Eye movements were recorded with an SR Research EyeLink 1000 video-based eye tracker at 1000 Hz.

All of the original videos from the SumMe dataset were resized to have the same width (1920 pixels; 43.6 degrees of visual angle). The audio was muted. Subjects were asked to watch the entire video without any additional instructions. The experiment was divided into six blocks, each of which lasted approximately 10 minutes. Before each video, observers fixated a blank screen with a red central dot for 2 s.

Since the temporal resolution of the eye movement data (1000 Hz) is higher than the frame rate of the videos (15 to 30 Hz), we first down-sampled the eye movement data by averaging the gaze positions over the samples corresponding to each frame.
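As an illustration of this down-sampling step, the sketch below averages the 1000 Hz samples that fall within each frame interval. The array shapes and the frame-boundary argument are assumptions made for the example, not the authors' actual preprocessing code.

```python
import numpy as np

def downsample_gaze(gaze_xy, sample_times, frame_times):
    """Average high-rate gaze samples over the samples belonging to each frame.
    gaze_xy: (n_samples, 2) gaze positions; sample_times: (n_samples,) seconds;
    frame_times: (n_frames + 1,) frame boundary times (hypothetical format)."""
    n_frames = len(frame_times) - 1
    frame_gaze = np.full((n_frames, 2), np.nan)
    for f in range(n_frames):
        in_frame = (sample_times >= frame_times[f]) & (sample_times < frame_times[f + 1])
        if in_frame.any():
            frame_gaze[f] = gaze_xy[in_frame].mean(axis=0)
    return frame_gaze
```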

Results

Eye movement consistency across subjects

We first tested whether the eye movements were consistent across subjects within the same video. For each subject, we obtained gaze location on each video frame. We then calculated gaze velocity from the gaze location difference between every two contiguous video frames. Figure 1a and b show two example frames with gaze locations and velocities from six subjects superimposed on them.

FIGURE 1.

Example frames from two videos of the SumMe dataset, ‘Jumps’ and ‘Saving dolphins’. The ‘Jumps’ video shows a person traveling down a slide (a) and being launched into the air (b). In (a) and (b), blue dots indicate gaze locations of six subjects, and arrows indicate their eye gaze velocities computed from eye position from the current and next frames. Arrow length denotes eye speed. Note that eye position and velocity are inconsistent in (a), but consistent in (b). (c) A sample frame of the ‘Saving dolphins’ video, which shows people on the beach saving several stranded dolphins. The mixture Gaussian gaze distributions from five subjects (in blue) and the predicted gaze location of the remaining subject (red dot) were superimposed on the original video frame.

Consistency of gaze location.

Guided by previous studies that computed gaze consistency9,10, for each video frame, we used a “leave-one-out” technique to test how well the gaze locations of five subjects predicted the gaze location of the remaining one. For a video frame t and each subject, we first created a mixture of Gaussian distributions based on the gaze locations from the other five subjects. It is defined as

p_t^i(x, y) = \frac{1}{N-1} \sum_{n \backslash i}^{N} \frac{1}{2\pi\sigma^2} \exp\left( -\frac{(x - x_t^n)^2 + (y - y_t^n)^2}{2\sigma^2} \right) \qquad (1)

where n\i denotes the sum over all subjects except for subject i, and σ equals 1 degree of visual angle. Figure 1c shows an example of the Gaussian mixture gaze distributions and the gaze location of the predicted subject. If the probability density at subject i’s gaze location (x_t^i, y_t^i) was within the top 20% probability density of the whole distribution, it was counted as a correct ‘detection’ and suggests that the gaze location of subject i is successfully predicted by that of the other subjects. We used the average detection rate across all six subjects to indicate inter-subject consistency. We also established a random baseline by using the gaze locations of five subjects at another random frame r to predict the gaze location of the remaining subject at frame t. We applied this procedure to 1000 randomly selected video frames. The first 20 frames of each video were not used in this analysis to avoid confounds from the central dot being presented before each video.
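The leave-one-out detection procedure can be sketched as follows. This is a minimal illustration under the stated definitions, assuming gaze coordinates and σ are expressed in the same units (e.g., pixels, with σ converted from 1 degree of visual angle); the function and argument names are ours.

```python
import numpy as np

def detection_rate(gaze_frame, width, height, sigma, top_frac=0.2):
    """Leave-one-out gaze-location consistency for one frame.
    gaze_frame: (n_subjects, 2) gaze positions; width, height: frame size;
    sigma: Gaussian width from Equation (1), all in the same units."""
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    hits = []
    for i in range(len(gaze_frame)):
        others = np.delete(gaze_frame, i, axis=0)
        # Mixture of isotropic Gaussians centred on the other subjects' gaze (Eq. 1).
        dens = np.zeros_like(xs, dtype=float)
        for gx, gy in others:
            dens += np.exp(-((xs - gx) ** 2 + (ys - gy) ** 2) / (2 * sigma ** 2))
        dens /= len(others) * 2 * np.pi * sigma ** 2
        # Density at the held-out subject's gaze location.
        xi, yi = gaze_frame[i]
        d_i = np.exp(-((others[:, 0] - xi) ** 2 + (others[:, 1] - yi) ** 2)
                     / (2 * sigma ** 2)).sum() / (len(others) * 2 * np.pi * sigma ** 2)
        # Correct 'detection' if d_i lies within the top 20% of density values.
        hits.append(d_i >= np.quantile(dens, 1 - top_frac))
    return float(np.mean(hits))
```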

Figure 2a shows the average detection rate from the inter-subject consistency and random baseline analysis. A pair-wise t-test across the 1000 frames shows that the detection rate in the inter-subject consistency analysis is significantly higher than that in the cross-frame control analysis (t(999)=23.5, p<0.001). Note that the detection rate of the random baseline is also above chance. This might reflect a bias of observers to look at the center of the screen in some of the video frames, since this can occur regardless of image content3,10.

FIGURE 2.

Consistency of human eye movements. (a) Average gaze location consistency across 1000 selected frames, and the corresponding random cross-frame control. (b) Average difference of gaze velocity direction across any two continuous frames. (c) Average difference of gaze speed across any two continuous frames. (d) Consistency of gaze locations for selected and unselected frames. (e) Direction and (f) speed difference of eye velocities across two continuous frames for selected and unselected frames. Error bars show standard error of the mean (a, d, e, f) for within-subject designs, calculated following the suggestion by Cousineau (2005)11, and 95% confidence interval of the mean (b, c).

Consistency of gaze velocities.

In this analysis we determined if gaze velocity at each video frame was consistent across each pair of subjects. For each of the 15 pairs of subjects, we calculated their gaze velocity difference at each video frame. We also generated a random baseline to compare with the pairwise data by shuffling the temporal order of subjects’ eye movement sequences within a video. This was done 1000 times to generate baseline distributions of gaze velocity differences.

For each subject pair, we calculated the differences between the velocity directions using the following formula:

g = \arccos\left( \frac{a_i \cdot a_j}{\|a_i\| \, \|a_j\|} \right) \qquad (2)

where a_i indicates the eye velocity from frame t to the next frame t+1 of subject i. For the speed difference, we computed the absolute differences between the lengths of the two vectors using the formula:

b = \bigl|\, \|a_i\| - \|a_j\| \,\bigr| \qquad (3)

The results are shown in Figure 2b and c. One-sample t-tests show that the mean differences in gaze velocity direction and speed from subject pairs are significantly lower than the means of the random baselines (direction: t(14)=31.55, p<0.001; speed: t(14)=18.97, p<0.001). In summary, similar to the results of previous studies3, eye movement patterns were consistent during video watching for both gaze location and gaze velocity.
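For concreteness, the pairwise direction and speed differences of Equations (2) and (3) can be computed as in the sketch below; the array layout and the small epsilon guard against zero-length velocity vectors are our own assumptions.

```python
import numpy as np

def velocity_differences(a_i, a_j, eps=1e-9):
    """Direction (Eq. 2) and speed (Eq. 3) differences between the frame-wise
    gaze velocities of two subjects. a_i, a_j: (n_frames, 2) velocity vectors."""
    dot = (a_i * a_j).sum(axis=1)
    norm_i = np.linalg.norm(a_i, axis=1)
    norm_j = np.linalg.norm(a_j, axis=1)
    cos_angle = np.clip(dot / (norm_i * norm_j + eps), -1.0, 1.0)
    direction_diff = np.arccos(cos_angle)   # g in Equation (2), in radians
    speed_diff = np.abs(norm_i - norm_j)    # b in Equation (3)
    return direction_diff, speed_diff
```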

Eye movement consistency reflects the importance of a frame

Although subjects’ eye movements were generally consistent while watching videos, there were frames in which eye movements were inconsistent (see Figure 1a). Could gaze consistency serve as an indicator of the importance of individual frames? To test this, we compared eye movement consistency between subjects in the frames that were selected for the user summarized videos with those that were excluded. For consistency of gaze location, an independent t-test showed that the consistency level of the selected frames was significantly higher than that of the unselected frames (t(1998)=6.85, p<0.001, Figure 2d). For consistency of gaze velocity, a pair-wise t-test showed that across the 15 pairs of subjects, the difference in eye velocity direction in the selected frames was significantly lower than that in the unselected frames (t(14)=9.96, p<0.001, Figure 2e). Similarly, the difference in eye speed in the selected frames was also significantly lower than that in the unselected frames (t(14)=10.07, p<0.001, Figure 2f). Therefore, subjects’ eye movement patterns were more similar to each other while viewing the frames that were included in the user summaries than those that were excluded. The results suggest that human gaze patterns can indicate whether a video frame is important for summarizing a video, and that people tend to make similar eye movements while viewing the important video frames.

COMPUTATIONAL EXPERIMENT

Our behavioral results provide evidence that eye movement consistency within a given frame reflects whether a video frame is important for video summarization. These results suggest that eye movements indicate informative aspects of videos that are not directly related to their low-level visual features and instead might be cognitively derived, e.g., object-based knowledge or prediction of future frames. If so, adding gaze information to a video-summarization model should improve its performance. To test this, we developed a gaze-aware deep learning model to perform video summarization, and compared its performance to a model that operated with only low-level physical attributes of the frames.

Gaze-Aware Deep Learning Model for Video Summarization

We employed a deep learning model similar to those used in many video-related tasks12. These models rely on multiple layers of nonlinear processing to extract useful features in a task (e.g., object detection, classification, etc.). Since deep learning networks are effective in processing images and videos, they might be well suited to learn the importance of video frames from low-level visual features as well as gaze patterns.

We constructed a novel multi-stream deep learning model for video summarization that we call DLSum (Figure 3a). There are three streams in the model: 1) the spatial stream, 2) the temporal stream, and 3) the gaze stream.

FIGURE 3.

Illustration of the DLSum models and their performance on the SumMe dataset. (a) General model structure. Spatial and temporal information from the video, as well as human gaze data, are transmitted to the convolutional neural network. Later, the fused features are input to the SVR algorithm to predict the importance score for each frame. The final summary is generated based on the scores. (b) A schematic illustration of the Gaze representation in the model. Two sequential video frames t and t+1 with one subject’s gaze position (the red dot) were transformed to the horizontal g_t^x and vertical g_t^y components of the Gaze representation g_t(x, y). The location of the dots corresponds to the subject’s gaze location at frame t. Darker colors correspond to leftward and upward movements. The darker dot in the g_t^y image represents a large upward displacement of gaze location. In the actual model implementation, gaze information from all subjects was combined to form a single Gaze representation. (c) Performance of different versions of the DLSum models in terms of the AMF and (d) AHF scores.

The gaze stream takes the raw eye position and velocity as inputs. The “Gaze” representation of an arbitrary frame t is denoted as g_t. g_t^x and g_t^y are the horizontal and vertical components of g_t. Both g_t^x and g_t^y are matrices that have the same size as the original video frame, and the values are initialized to zero. We then centered a 1° diameter circle on each subject’s gaze location at frame t. The values within each circle reflect gaze velocity from frame t to frame t+1 of that subject. Positive values indicate rightward and downward movements for the horizontal component g_t^x and vertical component g_t^y, respectively. The absolute values reflect horizontal and vertical gaze speeds. Finally, all values were normalized to the (0, 255) range to generate gray-scale g_t representations for each frame. A schematic illustration of the construction of one subject’s Gaze representation is shown in Figure 3b. We stacked multiple Gaze representations to represent the movement of gaze across frames. A 2L input based on gaze positions and velocities for the gaze stream was formed by stacking the Gaze representations from frame t to the next L−1 frames. The final gaze input Y_t ∈ R^{w×h×2L} for frame t is defined as:

Y_t(2i-1) = g^x_{t+i-1}, \quad Y_t(2i) = g^y_{t+i-1}, \qquad 1 \le i \le L \qquad (4)
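A minimal sketch of how such a Gaze representation and its 2L-channel stack could be built is given below; the pixel-based units, the circle-radius argument, and the per-stack normalization are our own assumptions for illustration, not the authors' released code.

```python
import numpy as np

def gaze_representation(frame_shape, gaze_xy, gaze_vel, radius_px):
    """Build g_t^x and g_t^y for one frame (cf. Figure 3b).
    gaze_xy: (n_subjects, 2) gaze locations in pixels; gaze_vel: (n_subjects, 2)
    frame-to-frame gaze displacements; radius_px: radius of the 1-degree circle."""
    h, w = frame_shape
    gx = np.zeros((h, w), dtype=np.float32)
    gy = np.zeros((h, w), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    for (x, y), (vx, vy) in zip(gaze_xy, gaze_vel):
        circle = (xs - x) ** 2 + (ys - y) ** 2 <= radius_px ** 2
        gx[circle] = vx   # positive = rightward gaze movement
        gy[circle] = vy   # positive = downward gaze movement
    return gx, gy

def stack_gaze(g_list, L=10):
    """Stack L consecutive (g^x, g^y) pairs into a 2L-channel input Y_t (Eq. 4)."""
    channels = []
    for gx, gy in g_list[:L]:
        channels.extend([gx, gy])
    Y = np.stack(channels, axis=-1)                     # shape (h, w, 2L)
    # Normalize to the (0, 255) gray-scale range described in the text.
    return 255.0 * (Y - Y.min()) / (Y.max() - Y.min() + 1e-9)
```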

The spatial stream takes the RGB values of each pixel as inputs to represent visual characteristics of the video frames. The temporal stream takes multi-frame motion vectors as inputs to convey the motion information of the video to the model. The motion vectors of an arbitrary frame t are denoted as m_t, which contains the displacement from frame t to frame t+1. m_t^x and m_t^y represent the horizontal and vertical components of m_t. A 2L input was stacked to convey the motion information between frame t and the next L−1 frames, with L representing the stacking length. The multi-frame motion vectors F_t ∈ R^{w×h×2L} for an arbitrary frame t were defined as:

F_t(2i-1) = m^x_{t+i-1}, \quad F_t(2i) = m^y_{t+i-1}, \qquad 1 \le i \le L \qquad (5)

We followed Simonyan and Zisserman (2014)12 to set the stacking length L of the Gaze representation and multi-frame motion vectors equal to 10. In the training stage, each stream is trained separately. In the spatial stream, RGB images with their corresponding labels are input to the convolutional neural network. The temporal stream and gaze stream are trained similarly with different inputs (multi-frame motion vectors and Gaze representations). Then, the outputs of the second fully-connected layer with 4096 neurons of each stream are combined in order to define the features for each video frame. The combined features with their corresponding labels are input to the support vector regression (SVR) algorithm13 to train a regression model. In the test stage, we first use the trained ConvNets to extract features of the test frames. Then, the learned SVR model is used to predict an importance score for each frame. The final summary is composed of the video frames within the top 15% of the model-generated importance scores8,14.
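The fusion and regression stage can be sketched as follows, assuming the 4096-dimensional fully-connected features of each stream have already been extracted by the trained networks; the scikit-learn SVR call and the function names are illustrative choices, not the authors' Caffe-based implementation.

```python
import numpy as np
from sklearn.svm import SVR

def train_and_summarize(spatial_feat, temporal_feat, gaze_feat,
                        importance_scores, train_idx, test_idx):
    """Fuse per-frame stream features, fit an SVR on training frames, and keep
    the top 15% of predicted importance scores as the summary (a sketch)."""
    # Concatenate the 4096-D outputs of the three streams for every frame.
    features = np.concatenate([spatial_feat, temporal_feat, gaze_feat], axis=1)
    reg = SVR(kernel="rbf")                  # hyperparameters tuned by grid search
    reg.fit(features[train_idx], importance_scores[train_idx])
    predicted = reg.predict(features[test_idx])
    # Final summary: frames within the top 15% of model-generated scores.
    selected = predicted >= np.percentile(predicted, 85)
    return predicted, selected
```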

Implementation and Evaluation Details

We compared the gaze-aware model (DLSum-RealGaze) with a variation that only used the video-based spatial and temporal streams as inputs (DLSum-NoGaze). Since collecting eye movement data during video watching requires additional human effort, the DLSum-RealGaze model might be inefficient for practical applications. Therefore, we also tested another variation in which the gaze stream used gaze locations predicted by the class activation mapping (CAM) technique15 as input (DLSum-PredictedGaze). This model uses predicted instead of collected real gaze information.

For each variation of our model, we performed a 3-fold cross validation for training and testing. The VGG-16 deep convolutional neural networks were pre-trained on the ImageNet dataset to avoid over-fitting. We implemented our model on the Caffe deep learning framework with Tesla K80 GPUs. In addition, a grid search was run to optimize the parameters for SVR.

We compared the automatic summary (A) with the human-annotated summarizations (H) and report the F-score to evaluate the performance of the different models8,14. The F-score is defined as:

F = \frac{2 \times p \times r}{p + r} \qquad (6)

r = \frac{\text{number of matched pairs}}{\text{number of frames in } H} \times 100\% \qquad (7)

p = \frac{\text{number of matched pairs}}{\text{number of frames in } A} \times 100\% \qquad (8)

where r is the recall and p is the precision. Following conventions in the video summarization literature8,14, we report the average mean F-score across all users (AMF) and the average highest F-score (AHF) by comparing the model-generated summaries with the summaries generated by all users. The highest F-score was defined as the score of the model when it best matched one of the users’ summaries. AMF reflects how well the model prediction matches average user preferences, and AHF reflects how well the model prediction matches its most similar user’s summary.
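Computed per video, the evaluation of Equations (6)-(8) and the AMF/AHF aggregation can be sketched as below; the boolean frame-mask representation of the summaries is an assumption made for the example.

```python
import numpy as np

def f_score(auto_frames, human_frames):
    """Equations (6)-(8). auto_frames, human_frames: boolean arrays marking
    which frames belong to the automatic (A) and one human (H) summary."""
    matched = np.logical_and(auto_frames, human_frames).sum()
    r = matched / max(human_frames.sum(), 1)   # recall, Eq. (7)
    p = matched / max(auto_frames.sum(), 1)    # precision, Eq. (8)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def mean_and_highest_f(auto_frames, human_summaries):
    """Per-video mean and highest F-score across users; averaging these values
    over videos gives the AMF and AHF measures reported in the text."""
    scores = [f_score(auto_frames, h) for h in human_summaries]
    return float(np.mean(scores)), float(np.max(scores))
```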

Results

Figure 4 shows an example of how the DLSum-RealGaze model operates on a sample video, ‘Cooking’, from the SumMe dataset. ‘Cooking’ demonstrates how a chef constructs an ‘onion volcano’, which is composed of onion slices with a flammable cooking-oil core. The model identified important frames during the video that were similar to those a human user chose, as is evident from the alignment of the three peaks.

FIGURE 4.

A comparison of human selection and model outputs of the video “Cooking” in the SumMe dataset. Top row: average importance score for each frame provided by users that constructed the summarized video in the SumMe dataset. Second row: model-generated importance score of each frame from our DLSum-RealGaze model. Red regions represent frames selected for the summarized video. Third row: representative samples of the video frames selected by the model.

We then compared the performance of our three-stream gaze-aware model with that of the two-stream model that does not use gaze. We found that the gaze-aware model achieved higher AMF and AHF scores than the two-stream model (Figure 3c and d). Across all movies, a pairwise t-test showed that the DLSum-RealGaze model has a marginally significantly higher AMF score than the DLSum-NoGaze model (t(24)=1.98, p=0.059), and performs significantly better than the DLSum-NoGaze model in terms of the AHF measurement (t(24)=2.34, p<0.05). We also tested the DLSum-PredictedGaze model, which uses model-predicted instead of real gaze information. We found that although the DLSum-PredictedGaze model achieved numerically lower AMF and AHF scores, its performance is not statistically different from the performance of the RealGaze model, and is also significantly higher than that of the NoGaze model (Figure 3c and d). These results suggest that gaze pattern information produces better video summarization than when only video-based spatial and temporal information is used.

We also asked how well a model with only gaze information performs video summarization. We tested a DLSum-GazeOnly model that only uses the gaze stream. We compared its performance with a baseline that randomly selected 15% of all frames as the summary, as well as with the performance of our three-stream gaze-aware model. Across all movies, a pairwise t-test showed that the DLSum-GazeOnly model performed significantly better than the Random model in terms of AMF (t(24)=6.96, p<0.001) and AHF scores (t(24)=9.25, p<0.001). However, its performance was significantly lower than that of the DLSum-RealGaze model (AMF: t(24)=5.36, p<0.001; AHF: t(24)=6.19, p<0.001), which has both visual and gaze information. These results suggest that although gaze information is valuable for video summarization, combining it with visual information maximizes its contribution.

We also compared our model to several state-of-the-art video summarization models. The comparison results are shown in Table 1. We found that our model outperformed most existing models, including the Creating Summaries from User Videos (CSUV) model8, the Summarizing Web Videos using Titles (SWVT) model14, and the Video Summarization with Long Short-term Memory (dppLSTM) model19. Although the Unsupervised Video Summarization with Adversarial LSTM Networks (SUM-GANsup)16 model had a higher AMF score than ours, it has a more complex architecture and needs more data to train. Indeed, our models achieve a higher AMF score than their simpler model (SUM-GANw/o-GAN). The results suggest that our model is competitive in the video summarization task.

TABLE 1.

Performance comparison of our proposed methods and the state-of-the-art on the SumMe dataset.

Method                        AMF       AHF
Baseline
  Random                      13.95%    16.79%
Existing methods
  CSUV                        23.40%    39.40%
  SWVT                        26.60%    --
  dppLSTM                     17.70%    42.90%
  SUM-GANsup                  43.60%    --
Proposed methods
  DLSum-NoGaze                35.44%    56.29%
  DLSum-PredictedGaze         35.99%    57.27%
  DLSum-RealGaze              36.00%    57.33%
  DLSum-GazeOnly              19.60%    30.81%

‘--’ means the result is not reported in the published papers.

DISCUSSION AND CONCLUSION

We collected eye movement data from subjects watching 25 videos from the SumMe dataset with no explicit instructions to guide their eye movements. We showed that the consistency of gaze location and velocity across subjects was greater in the frames that human users chose as important for summarizing the videos. We then constructed a gaze-aware multi-stream deep learning model that incorporated subjects’ eye movement information to determine if gaze information can facilitate video summarization. We found that when gaze was taken as an input to the model, it outperformed a version that received only low-level visual features as input. The results suggest that eye movements indicate important information in videos.

Previous studies using static images found that observers’ eye movements reflect the information of different spatial regions in a scene1,17. In the current study, we found that eye movements also reflect the importance of video frames in the temporal domain. We first showed that our subjects’ eye movement patterns were more consistent in frames that were rated as important by the users who summarized the SumMe dataset. The similarity of eye movements may have arisen because the selected important frames contain objects that subjects found important, and thus enabled the eye movements to help predict video frame importance.

We tested a computational model that incorporated eye movement information and found that it facilitated frame selection for video summarization. Most previous video summarization models find important content using low-level visual features, such as color and shape information14, as well as faces and other landmarks8. There are previous methods that use gaze information to facilitate video summarization. Xu et al. used fixation counts as attention scores, and used fixation regions to generate video summaries20. Salehin et al. investigated the connection between video summarization and smooth pursuit to detect important video frames21. In contrast to our method, gaze patterns in both previous studies had to be determined first before they could be incorporated into video summarization models.

Similar to the current study, our previous work18 also investigated whether incorporating gaze information into the spatial stream facilitates video summarization. In that study, gaze information was used to preprocess the raw video; therefore, the gaze information and the spatial information were confounded. Here, we test whether the gaze information by itself facilitates model performance. To do this, we input the gaze information as an independent stream. The higher performance of our gaze-aware model suggests that using gaze information by itself facilitates video summarization, and summarizes videos in a similar fashion to human users. Since eye movements are guided by top-down, memory-based knowledge and the semantic meaning of a scene1, gaze data might reveal information about internal cognitive processes during video watching that is not captured by low-level visual features. For example, a subject may use higher-level object-based information to obtain the storyline of a long video. However, the spatial properties or semantic content that cause consistent eye movement patterns and higher importance judgements remain unknown. Future studies will determine the content in a video frame that leads to consistent eye movement patterns, in order to evaluate the specific cognitive aspects of a video that are important for video summarization.

We also tested a model that used predicted instead of real gaze information. The specific gaze prediction method we used did not consider the temporal structure of videos, but still achieved performance comparable to the DLSum-RealGaze model. Since the temporal stream of our model already takes the temporal structure into account, it may make the temporal information embedded in the gaze stream redundant. Future work is needed to determine the importance of the temporal structure of human eye movements without other confounding factors. Our results could also be used to develop models that predict eye movements in videos, which could then be integrated into video summarization models.

In sum, our study shows that eye movement is a reliable predictor of the importance of temporal information in videos, and other video summarization models may consider using eye movement patterns to obtain better video summaries.

ACKNOWLEDGMENTS

This work was supported by the National Institutes of Health grant 5T32EY025201–03 to Z.M., the Smith-Kettlewell Eye Research Institute grant to S.H., and the National Natural Science Foundation of China grant 61502311 to S.Z.

Biography

Zheng Ma is a postdoctoral research fellow at the Smith-Kettlewell Eye Research Institute. Her research interests include visual cognition and using computational models to understand the human mind. Dr. Ma received a PhD in psychology from the Johns Hopkins University. Contact her at zma@ski.org.

Jiaxin Wu is a graduate student in the College of Computer Science & Software Engineering at Shenzhen University. Her research interests include video content analysis and deep learning. Jiaxin received an MS in computer science from Shenzhen University. Contact her at jiaxin.wu@email.szu.edu.cn.

Sheng-hua Zhong is an assistant professor in the College of Computer Science & Software Engineering at Shenzhen University. Her research interests include multimedia content analysis, neuroscience, and machine learning. Dr. Zhong received a PhD in computer science from the Hong Kong Polytechnic University. Contact her at csshzhong@szu.edu.cn.

Jianmin Jiang is a professor in the College of Computer Science & Software Engineering at Shenzhen University. His research interests include computer vision and machine learning in media processing. Dr. Jiang received a PhD in computer science from the University of Nottingham. Contact him at jianmin.jiang@szu.edu.cn.

Stephen J. Heinen is a senior scientist at the Smith-Kettlewell Eye Research Institute. His research interests include eye movements and human visual cognition. Dr. Heinen received a PhD in experimental psychology from Northeastern University. Contact him at heinen@ski.org.

Contributor Information

Zheng Ma, The Smith-Kettlewell Eye Research Institute.

Stephen J. Heinen, The Smith-Kettlewell Eye Research Institute.

REFERENCES

  • 1. Henderson JM. Human gaze control during real-world scene perception. Trends in Cognitive Sciences 2003;7(11):498–504.
  • 2. Najemnik J, Geisler WS. Optimal eye movement strategies in visual search. Nature 2005;434(7031):387.
  • 3. Goldstein RB, Woods RL, Peli E. Where people look when watching movies: Do all viewers look at the same place? Computers in Biology and Medicine 2007;37(7):957–64.
  • 4. Dorr M, Martinetz T, Gegenfurtner KR, Barth E. Variability of eye movements when viewing dynamic natural scenes. Journal of Vision 2010;10(10):28.
  • 5. Shepherd SV, Steckenfinger SA, Hasson U, Ghazanfar AA. Human-monkey gaze correlations reveal convergent and divergent patterns of movie viewing. Current Biology 2010;20(7):649–56.
  • 6. Kienzle W, Schölkopf B, Wichmann FA, Franz MO. How to find interesting locations in video: a spatiotemporal interest point detector learned from human eye movements. Joint Pattern Recognition Symposium; 2007: Springer.
  • 7. Truong BT, Venkatesh S. Video abstraction: A systematic review and classification. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 2007;3(1):3.
  • 8. Gygli M, Grabner H, Riemenschneider H, Van Gool L. Creating summaries from user videos. European Conference on Computer Vision; 2014: Springer.
  • 9. Mathe S, Sminchisescu C. Dynamic eye movement datasets and learnt saliency models for visual action recognition. Computer Vision–ECCV 2012: Springer; 2012. p. 842–56.
  • 10. Torralba A, Oliva A, Castelhano MS, Henderson JM. Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychological Review 2006;113(4):766.
  • 11. Cousineau D. Confidence intervals in within-subject designs: A simpler solution to Loftus and Masson’s method. Tutorials in Quantitative Methods for Psychology 2005;1(1):42–5.
  • 12. Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems; 2014.
  • 13. Drucker H, Burges CJ, Kaufman L, Smola AJ, Vapnik V. Support vector regression machines. Advances in Neural Information Processing Systems; 1997.
  • 14. Song Y, Vallmitjana J, Stent A, Jaimes A. TVSum: Summarizing web videos using titles. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015.
  • 15. Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning deep features for discriminative localization. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition.
  • 16. Mahasseni B, Lam M, Todorovic S. Unsupervised video summarization with adversarial LSTM networks. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition.
  • 17. Loftus GR, Mackworth NH. Cognitive determinants of fixation location during picture viewing. Journal of Experimental Psychology: Human Perception and Performance 1978;4(4):565.
  • 18. Wu J, Zhong S-h, Ma Z, Heinen SJ, Jiang J. Foveated convolutional neural networks for video summarization. Multimedia Tools and Applications 2018:1–23.
  • 19. Zhang K, Chao WL, Sha F, Grauman K. Video summarization with long short-term memory. European Conference on Computer Vision; 2016.
  • 20. Xu J, Mukherjee L, Li Y, Warner J, Rehg JM, Singh V. Gaze-enabled ego-centric video summarization via constrained submodular maximization. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2235–2244.
  • 21. Salehin MM, Paul M. A novel framework for video summarization based on smooth pursuit information from eye tracker data. Proceedings of the 2017 International Conference on Multimedia Retrieval, pp. 692–697.
