
Visual temporal attention is a special case of visual attention that involves directing attention to specific instants of time. Similar to its spatial counterpart, visual spatial attention, these attention modules have been widely implemented in video analytics in computer vision to provide enhanced performance and human-interpretable explanations[3] of deep learning models.
Just as visual spatial attention mechanisms allow human and/or computer vision systems to focus on semantically more substantial regions in space, visual temporal attention modules enable machine learning algorithms to place more emphasis on critical video frames in video analytics tasks, such as human action recognition. In convolutional neural network-based systems, the prioritization introduced by the attention mechanism is regularly implemented as a linear weighting layer with parameters determined by labeled training data.[3]
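
The linear weighting described above can be illustrated with a minimal sketch. It is not taken from any cited system: the function name `temporal_attention_pool`, the scoring parameters `w` and `b`, and the use of a softmax to normalize the frame scores are all assumptions made for illustration. Each frame's feature vector is scored by a linear layer, the scores are normalized into temporal weights, and the frames are combined as a weighted sum.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention_pool(frame_features, w, b=0.0):
    """Pool per-frame feature vectors into one video-level vector.

    frame_features: list of T feature vectors (each a list of D floats)
    w, b: parameters of a hypothetical linear scoring layer (D weights, 1 bias);
          in a real system these would be learned from labeled data
    Returns (temporal weights over frames, pooled D-dimensional feature).
    """
    # One scalar relevance score per frame from the linear layer.
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in frame_features]
    # Normalize so the temporal weights are positive and sum to 1 (assumed choice).
    alphas = softmax(scores)
    dim = len(frame_features[0])
    # Weighted sum of frame features along the time axis.
    pooled = [sum(a * x[d] for a, x in zip(alphas, frame_features))
              for d in range(dim)]
    return alphas, pooled

# Toy input: three frames with two-dimensional features (made-up values).
frames = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
w = [1.0, 1.0]
alphas, pooled = temporal_attention_pool(frames, w)
```

In this toy input the second frame receives the largest weight because its linear score is highest, so it dominates the pooled representation. The weights themselves can also be inspected as a simple explanation of which frames the model relied on.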

Recent video segmentation algorithms often exploit both spatial and temporal attention mechanisms.[2][4] Research in human action recognition has accelerated significantly since the introduction of powerful tools such as Convolutional Neural Networks (CNNs). However, effective methods for incorporating temporal information into CNNs are still being actively explored. Motivated by the popular recurrent attention models in natural language processing, the Attention-aware Temporal Weighted CNN (ATW CNN) was proposed[4] for action recognition in videos; it embeds a visual attention model into a temporal weighted multi-stream CNN. This attention model is implemented as temporal weighting, and it effectively boosts the recognition performance of video representations. In addition, each stream in the ATW CNN framework is capable of end-to-end training, with both network parameters and temporal weights optimized by stochastic gradient descent (SGD) with back-propagation. Experimental results show that the ATW CNN attention mechanism contributes substantially to the performance gains by focusing on the more relevant, more discriminative video snippets.
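
The idea of learning temporal weights by SGD with back-propagation can be sketched in a few lines. This is a simplified illustration, not the ATW CNN implementation: the per-snippet class scores are held fixed (in the published framework the network parameters are trained as well), the softmax parameterization of the weights is an assumption, and the toy data, learning rate, and function names are hypothetical.

```python
import math
import random

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def forward(logits, snippet_scores):
    """Fuse per-snippet class scores with softmax-normalized temporal weights."""
    alphas = softmax(logits)
    n_classes = len(snippet_scores[0])
    fused = [sum(a * s[c] for a, s in zip(alphas, snippet_scores))
             for c in range(n_classes)]
    return alphas, softmax(fused)

def loss_and_grad(logits, snippet_scores, label):
    """Cross-entropy loss and its gradient w.r.t. the temporal logits."""
    alphas, probs = forward(logits, snippet_scores)
    loss = -math.log(probs[label])
    # Gradient w.r.t. the fused class scores: probs - one_hot(label).
    d_fused = [p - (1.0 if c == label else 0.0) for c, p in enumerate(probs)]
    # Back-propagate to each temporal weight alpha_t.
    d_alpha = [sum(df * s[c] for c, df in enumerate(d_fused))
               for s in snippet_scores]
    # Back-propagate through the softmax that produced the alphas.
    mean = sum(a * g for a, g in zip(alphas, d_alpha))
    d_logits = [a * (g - mean) for a, g in zip(alphas, d_alpha)]
    return loss, d_logits

# Made-up data: 3 snippets per video, 2 classes; the middle snippet is informative.
data = [
    ([[0.0, 0.5], [3.0, 0.0], [0.2, 0.3]], 0),
    ([[0.4, 0.0], [0.0, 3.0], [0.3, 0.1]], 1),
]

def mean_loss(logits):
    return sum(loss_and_grad(logits, s, y)[0] for s, y in data) / len(data)

logits = [0.0, 0.0, 0.0]           # start from uniform temporal weights
loss_before = mean_loss(logits)
rng = random.Random(0)             # fixed seed for reproducibility
for _ in range(200):
    scores, label = rng.choice(data)   # one randomly drawn sample per step
    _, grad = loss_and_grad(logits, scores, label)
    logits = [z - 0.5 * g for z, g in zip(logits, grad)]
loss_after = mean_loss(logits)
temporal_weights = softmax(logits)
```

After training, the temporal weights concentrate on the middle snippet, which is the only one whose scores consistently agree with the labels in this toy data, and the mean loss is lower than with uniform weights.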