Automatic Image Description Generation with Emotional Classifiers

  • Conference paper in Computer Vision (CCCV 2017)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 771)


Abstract

Automatically generating a natural sentence that describes the content of an image is an active topic in artificial intelligence, linking the domains of computer vision and natural language processing. Most existing works leverage large object recognition datasets and external text corpora to integrate knowledge between similar concepts. While current works aim to answer ‘what is it’, ‘where is it’ and ‘what is it doing’ in images, we focus on a less considered but critical question: ‘how it feels’ in the content of images. We propose to express the feelings contained in images in a more direct and vivid way. To achieve this goal, we extend a pre-trained caption model with an emotion classifier that adds abstract knowledge to the original caption. We apply our method to datasets originating from various domains. In particular, we evaluate it on the newly constructed SentiCap dataset with multiple evaluation metrics. The results show that the newly generated descriptions summarize the images vividly.



1 Introduction

Automatically describing the content of an image in the form of a sentence is an important but challenging problem in artificial intelligence. In the last few years, many works have made considerable progress on image captioning. Generating a standard description for a given image requires comprehensive knowledge, including object detection, language processing and so on.

The idea of image captioning is to transfer the content of images from the visual domain to language. In particular, much attention has been paid to objects or regions in the image. Referring expressions and accuracy boosting are two major threads in the development of image captioning. As Fig. 1 shows, recent work on image captioning usually models content in the form of ‘what is it’, ‘where is it’ and ‘what is it doing’ in the image. To some extent, the resulting captions reflect the actual substance of images precisely, and details are expressed either globally or locally. Though these sentences look reasonable, they can hardly be called human-like expressions. Human intuition usually extracts more information. When people look at the left image in Fig. 1, they wonder whether riding a unicycle is fun; the man in the picture looks excited about it. As to the right image, someone is bound to sigh, ‘What a pity, a coyote walking in the snow!’. Beyond the objects that are actually present, images hide sensible meanings that come from human culture, experience and so on.

Fig. 1.

Two images with the corresponding captions published by [14]. The captions precisely reflect ‘what is it’, ‘where is it’ and ‘what is it doing’. But do these captions reflect all the information people can capture from the images? In terms of human intuition, the answer is no.

Concrete captions have been widely studied. This universal form has been adopted not only in works [9,10,16,22,40,41,42] that attempt to model the whole image, but also in research [8,11,13,21,32,37] based on local regions. Besides, many works [5,6,14,28] contribute, in various ways, to extending captions to objects that are missing from current datasets. These works expand the range of what can actually be recognized in images. With more and more images uploaded to all kinds of social networks and photo-sharing sites, is simple detection of visible objects enough to mimic human intuition? In another field of computer vision, researchers argue that images contain multiple emotions. In particular, they point out that abundant human intuition and emotion are hidden in creative applications such as art and music, but existing works lack the ability to model and utilize this information. Although [30] modeled adjective descriptions with the aid of ANPs, sentiments were not presented directly or clearly, since ANPs are essentially auxiliary tools for emotion prediction rather than the goal itself. Strong sentiments embedded in images strengthen the opinion conveyed by the content, which can influence others effectively. Being aware of the emotional components included in a picture will benefit the understanding of creative applications. [39] suggested Emotional Semantic Image Retrieval (ESIR), which is built to mimic human decisions and provide users with tools to integrate emotional components into their creative works. In line with [39], [26] classified images by affect using low-level features extracted from texture and color. [1] crawled large-scale images from the web to train a sentiment detector targeting affective image classification and learning what is in pictures, so that the affective information an image carries can be obtained. Images permeate human life, existing not only as figural narration but also as carriers of affect. Nowadays a kid in a photo is not merely a kid: the mental state of the child is always attractive, and the feeling conveyed by animals in a picture is usually the point. Meanwhile, vividness is a test indicator in the field of mental imagery [33]. As Fig. 2 shows, the traditional caption on the left illustrates what there is in the image; such a description could point to more than one figure. Instead, the caption on the right gives a more vivid description with the affective expression ‘with funny looks’. The caption with the newly added phrase distinguishes the subfigure against the orange background from the others.

Fig. 2.

Left: the traditional caption describes what exists in the image and what it is doing; no emotional feeling is expressed, and the information it conveys could just as well be carried by a black-and-white picture. Right: to some extent, the addition of a single affective phrase brings in colorful components and lives up to human intuition when seeing the picture.

Prior works on captioning images have made significant progress, and it is exciting and critical to push image description further to the affective level, yielding an understanding that is both figural and vivid. To achieve the goal of affective captioning, data with emotions is needed, but constructing new datasets with affective descriptions consumes a great deal of time and human resources. Considering this situation, we design a cross-domain learning process built on data from both image captioning and affective image classification. In summary, we propose to describe images with a novel concept: we mine the existent but abstract information contained in images and express it in the caption, conveying a more vivid story than a flat description. Our contribution is twofold: (i) we introduce novel affective concepts into image captioning to generate more vivid descriptions; (ii) to combine captions with affective concepts, we propose to learn from cross-domain data, utilizing existing datasets designed for image captioning and image sentiment. The interaction between cross-domain data saves human resources in implementing our method.

2 Related Works

Our paper draws on recent works in image captioning. The link to affective image classification enables more vivid descriptions of images: processing both recognition content and affective components allows the creation of affective captions.

2.1 Image Captioning

Describing images with natural language has become a focal task. General paradigms for image captioning can be split into two groups: top-down and bottom-up. The top-down strategy aims to transform images into sentences, whereas the bottom-up strategy picks out words corresponding to different parts of an image and then combines them into sentence form. With the development of recurrent neural networks, the top-down paradigm shows better performance, and deep formulations are widely realized in top-down approaches. On the one hand, the combination of a CNN and an RNN is adopted in many works [9,20,38]: the CNN serves as a high-level feature extractor while the RNN models the sequence of words to generate captions. On the other hand, CNN and RNN models are integrated into a multimodal framework [25,28,29], where they interact in a multimodal space and predict captions word by word. Moreover, deep captioning is no longer restricted to descriptions of the whole image. Many recent works [8,11,13,21,27,32,37] aim to create exclusive descriptions for one exact region of an image; this captioning task is known as ‘referring expression’ generation. Besides, [18] proposed a Fully Convolutional Localization Network (FCLN) for dense captioning based on object detection and language models. The FCLN model obtains a set of regions and associates them with captions, so far richer information can be captured with region captioning.

Apart from contributions to accurate descriptions and precise localization, the coverage of captioning has been expanded as well. [14,28] succeeded in describing objects that never occur in paired training data: [28] constructed three novel-concept datasets to supplement the missing object information, while [14] learned the required knowledge from unpaired image data and text data.

With so much effort devoted to image captioning, we propose to introduce another novel concept for image description. Unlike prior works that focus on the coverage of real recognition, we aim to convey the emotional components hidden in images for a vivid expression. Similarly, [30] proposed to add adjectives to descriptions on the basis of SentiBank [1]. According to [30], emotional information is also what they aim to capture when generating captions; however, the ANPs they depend on are only an auxiliary step toward sentiment prediction, and the images within the scope of some ANPs are crawled from Flickr without any verification.

2.2 Emotion Modeling

Image sentiment attracts more and more attention nowadays, and many researchers attempt to model the emotions contained in images. In particular, affective image classification is the most active topic in emotion modeling. Inspired by psychological and art theories, [26,39] proposed to extract emotions with texture and color features. To address the semantic gap between low-level features and human intuition, mid-level representations have been studied. [46] proposed principles-of-art features to predict which emotion an image evokes. Meanwhile, [1] utilized the rich repository of images on the web to build an emotion detector; the detector has 1,200 dimensions corresponding to Adjective Noun Pairs (ANPs). [3,7,43] tried to improve affective image classification in various ways. More recently, deep neural networks have been introduced extensively to deal with emotion modeling. [2] extends SentiBank to DeepSentiBank based upon Caffe [17]. Besides, deep architectures [43] and a reliable benchmark [44] have been established as well.

Fig. 3.

Pipeline: the generation of affective captions consists of two channels. One is the extraction of traditional captions following the FCLN model; the other prepares affective concepts via emotion classification.

3 Overview

Figure 3 gives an overview of our proposed framework for affective captioning. The process consists of two parallel parts: a caption channel and an affective classification channel.

For the caption channel, the input images are fed into a neural-network-based captioning architecture. This part obtains dense, traditional captions from images; the output captions serve as the foundation to be modified with affective concepts. The affective classification channel, which can run simultaneously with the caption channel, captures the emotion attributes of images. Through the classification channel, images are divided into different sentiment categories. We treat the sentiment categories as baskets, each containing various affective concepts; concepts are then mapped to the images in the corresponding basket as the output of the classification channel. The intermediate results of the two channels are combined via a fusion method, and a novel affective caption for the image emerges. Details of the two channels and the fusion are given in Sect. 4 (Approach).
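
As a rough illustration of how the two channels could be composed, the following Python sketch assumes hypothetical caption_model and emotion_classifier callables and an illustrative concept table; it is a sketch of the pipeline logic under those assumptions, not the implementation described in Sect. 4.

import random

# Illustrative emotion-to-concept table (stand-in values, not Table 1).
CONCEPT_TABLE = {"amusement": ["happily", "with funny looks"],
                 "sadness": ["sadly"]}

def affective_caption(image, caption_model, emotion_classifier,
                      concept_table=CONCEPT_TABLE):
    # Channel 1: traditional (dense) captions for the image or its regions.
    captions = caption_model(image)          # e.g. ["a man riding a unicycle"]
    # Channel 2: emotion category predicted from the image features.
    emotion = emotion_classifier(image)      # e.g. "amusement"
    # Fusion: pick a concept from the predicted category's basket and attach it.
    concept = random.choice(concept_table[emotion])
    return [c.rstrip(".") + " " + concept for c in captions]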

Also, to the best of our knowledge, there is no existing dataset specialized simultaneously for image captioning and affective classification, and collecting and annotating such data would demand a large amount of human effort and time. Given this reality, we argue that it is reasonable to train the emotion classifier with cross-domain data while evaluating our results on datasets designed for captioning. Next, we discuss the selection of data.

3.1 Data Preparation

As mentioned above, we utilize emotion datasets to train the affective classifier separately. The emotion classifier is then applied to caption datasets to extract emotional components.

First, we briefly review several datasets for affective classification. IAPS, proposed by [23], is the earliest and most common stimulus set for affective image classification; its pictures mainly depict complex scenes, and its subset IAPSa is often used in visual sentiment research. Besides, [26] constructed Artphoto and Abstract Paintings, two datasets formed of pictures with artistic attributes: Artphoto comprises works by photographers, while Abstract Paintings is a set of works of art. These three datasets are of modest scale, each containing fewer than a thousand images. [44] established a larger-scale dataset for image emotion recognition, which we refer to as FI; it not only expands the data to a greater magnitude but is also balanced across categories. The category system of Amusement, Anger, Awe, Contentment, Disgust, Excitement, Fear and Sadness proposed by [31] is widely used in emotion models, and all the datasets mentioned above follow this definition; in this paper, we adopt the same standard. Comparing the above emotion datasets, FI is large enough and its Amazon Mechanical Turk (AMT) annotation assures the reliability of the labels, so we train the emotion classifier on FI.

On the other hand, Flickr8k, Flickr30k, and MS COCO are widely used datasets in the image captioning domain. Flickr8k [35] consists of over 8,000 images from the largest photo-sharing site, Flickr; each image is provided with five descriptions, and expert annotations are supplied as well. Flickr30k [45] is similar to Flickr8k, with five sentences attached to each image; both datasets were designed for image-sentence retrieval. MS COCO, proposed by [24], includes more than 80,000 images with five sentence annotations each, a magnitude ten times larger than Flickr8k and Flickr30k. [30] recently annotated the SentiCap dataset, whose images are selected in connection with ANPs; they designed a re-writing task upon the original captions from MS COCO. Though we argue that ANPs are only an auxiliary tool for emotion prediction, the SentiCap annotations are closer to affective captions than traditional annotations. Thus, we evaluate our method on the SentiCap dataset.

3.2 Affective Concepts

Affective concepts are what we add to traditional captions. These concepts can be adverbs, phrases and so on. In the fusion with a traditional caption, an affective concept is treated as a single entity, and it plays an important role in the generation of the affective caption. The selection of affective concepts depends on the sentiment label of each image, and each emotion can be mapped to multiple concepts. In affective image classification, two linguistic resources, SentiWordNet and SentiStrength, are widely utilized; they contain large quantities of sentiment vocabulary. Inspired by these two ontologies, we select concepts for each of the eight emotion categories. Table 1 shows some examples for the fixed categories.
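
As a data-structure sketch, such an ontology can be stored as a mapping from each of the eight emotion categories to a list of candidate concepts; the entries below are illustrative placeholders, not the actual contents of Table 1.

# Hypothetical concept ontology keyed by the eight emotion categories;
# each emotion maps to several candidate adverbs or phrases.
AFFECTIVE_CONCEPTS = {
    "amusement":   ["happily", "with funny looks"],
    "anger":       ["angrily", "looks angry"],
    "awe":         ["amazingly"],
    "contentment": ["contentedly"],
    "disgust":     ["queasily"],
    "excitement":  ["excitedly"],
    "fear":        ["in terror"],
    "sadness":     ["sadly"],
}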

3.3 Model Selection

In the early days, the model of [41] described images in two steps: (a) image parsing and (b) text description; decomposition splits images into parts to find the relations between objects or parts. The way [19] models captions is similar to [41]: a bidirectional image-sentence mapping is proposed, the association is learned from image datasets and contexts, and fragments related to object detections and sentences are aligned. Furthermore, as there is not always a single principal object in a picture, referring expressions started drawing attention. [20] detects regions in images and treats the associated sentence as a rich label space; the model learns correspondences in the training set and then generates new descriptions. Beyond extending the content, novel description concepts were introduced by the model in [28]: new words are added to the original model, datasets with novel concepts are constructed for training, and images with novel concepts can then be described. Remarkably, [14] models unique concepts without the need for additional paired data; simply using unpaired images and text helps the model recognize novel concepts. Besides, [18] proposed DenseCap to capture the rich information contained in images, mining adequate sources from limited data. In that work, an end-to-end framework is constructed with a localization layer, so no extra bounding boxes need to be prepared and sentences are generated in a convenient and direct way.

At first, captions were generated for whole images; over time, captioning has become more and more accurate, and pointed objects or regions can be detected. In our work, emotion datasets such as IAPS are well labeled but limited in scale. By taking advantage of models that excel at region-level description, many more referring sentences and sub-images can be cropped from the original datasets. To gain this benefit, we follow the model released by [18]; the gathered bounding boxes are then complemented with affective concepts.

4 Approach

The goal of this section is to generate affective captions with the aid of emotion prediction and traditional image captioning. Given an image I, we define \(\mathbf X = \{ x_1, x_2, \ldots , x_d \}\) as its visual features, where d is the feature dimension. A sequence of words \(\mathbf Y = \{ y_1, y_2, \ldots , y_n\}\) denotes the original caption. We aim to extend the traditional caption to an affective expression, denoted \( \hat{\mathbf{Y }} = \{ \hat{y}_1, \hat{y}_2, \ldots , \hat{y}_n, \hat{y}_{n+1} \}\), where the extra \(\hat{y}_{n+1}\) is the affective-concept entity. To generate affective captions, we prepare the original caption \(\mathbf Y \) and the affective concept \(\hat{y}_{n+1}\) separately.

4.1 Concrete Modelling

In this section, we aim to obtain general captions of images. We follow the FCLN model proposed by [18], which concentrates on dense captioning. Dense descriptions capture richer information and can also be treated as a multiplier for affective data. In this model, dense information is obtained from bounding boxes computed in a new localization layer: region proposals are predicted by regressing offsets from a set of translation-invariant anchors, adopting the parameterization of [12], with proposals regressed from anchors under certain scale settings. The B most confident bounding boxes are subsampled from the large set of proposals. As the sizes and aspect ratios vary among proposals, the FCLN model applies bilinear interpolation to obtain features of consistent dimension. The features of the chosen bounding boxes are output from the localization layer and serve as input for caption generation (Fig. 4).
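
For concreteness, the anchor-offset parameterization of [12] that the localization layer regresses can be written as the following small sketch; the variable names are ours and the snippet only illustrates how a proposal box is decoded from an anchor.

import math

def decode_box(anchor, offsets):
    # anchor: (cx, cy, w, h); offsets: regressed (tx, ty, tw, th).
    cx, cy, w, h = anchor
    tx, ty, tw, th = offsets
    px = cx + w * tx          # shift the centre proportionally to the anchor size
    py = cy + h * ty
    pw = w * math.exp(tw)     # scale width and height in log space
    ph = h * math.exp(th)
    return (px, py, pw, ph)   # proposal box in (cx, cy, w, h) form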

Fig. 4.

Example images from different domains. The top row shows images from image captioning datasets; the bottom row shows images used for affective image classification.

Table 1. Examples from the affective concept ontology for the eight emotions.

With regions prepared, an RNN language model is used to generate descriptions for the regions. The input to the RNN is a sequence of words \(\{y_0, y_1, \ldots , y_n, y_{n+1}\}\), where the additional tokens \(y_0\) and \(y_{n+1}\) represent the start code and end code. LSTM units are used to model the description sequence, and a series of hidden states \(h_t\) is computed to produce the output \(\hat{\mathbf{Y }} = \{ \hat{y}_1, \hat{y}_2, \ldots ,\hat{y}_n\}\). The recurrent formula is given in Eq. 1.

$$\begin{aligned} \hat{y}_t=f(h_{t-1},x_t) \end{aligned}$$
(1)

The subscript t denotes the timestep in the LSTM model, which follows [15] in the FCLN model. Gates are the main components of the memory cell: whether a value passed between layers is kept or not is determined by the gates, where a gate value of 1 means the value is preserved and 0 means it is forgotten. The input and output values are denoted i and o, respectively. Since \(h_{t-1}\), \(c_{t-1}\) and \(x_t\) are known at time t, the hidden unit \(h_t\) can be computed from \(h_{t-1}\) and \(x_t\). When the end code \(y_{n+1}\) is produced, the process ends. Based on this recurrent neural network formulation, the original dense captions are prepared.
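
A minimal PyTorch sketch of the greedy decoding loop implied by Eq. 1 is given below. It assumes the region feature has already been projected to the LSTM hidden size and uses hypothetical start/end token ids; it is an illustration, not the released FCLN code.

import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(embed_dim, hidden_dim)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def greedy_decode(self, feat, start_id, end_id, max_len=15):
        # feat: (hidden_dim,) visual feature used to initialise the hidden state.
        h = feat.unsqueeze(0)                         # h_0 derived from the region feature
        c = torch.zeros_like(h)                       # initial memory cell c_0
        y = torch.tensor([start_id])                  # y_0: start code
        words = []
        for _ in range(max_len):
            h, c = self.cell(self.embed(y), (h, c))   # h_t = f(h_{t-1}, x_t)
            y = self.proj(h).argmax(dim=-1)           # greedy choice of \hat{y}_t
            if y.item() == end_id:                    # stop at the end code y_{n+1}
                break
            words.append(y.item())
        return words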

4.2 Affective Concepts Predicting

The prediction of affective concepts plays an important role in the generation of affective captions. An affective concept is treated as an entity to be added to a flat caption in order to form an affective caption, and each affective concept is related to an emotion category. We therefore select affective concepts hierarchically: first, the emotion class of the image or region is needed, which an emotion classifier provides naturally; then we scan the concepts belonging to the predicted class and pick a proper one to embed into the caption. Details are given below. Affective concepts are denoted \(\hat{y}_{n+1}\) in the final form of the affective caption. As mentioned before, we use the emotion dataset FI to train an emotion classifier. Features x’ are mapped to one of eight labels \(l_{emo}\). Each label corresponds to a set of affective concepts, \(l_{emo} \rightarrow \{ concept_1, concept_2, \ldots , concept_k \}\), where k denotes the number of concepts selected per category. A random function is used to pick out a feasible concept:

$$\begin{aligned} concept_{select} = rand(set(l_{emo})) \end{aligned}$$
(2)
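
Eq. 2 amounts to a uniform random draw from the concept set of the predicted emotion class. A minimal sketch, with concept_sets standing in for the hypothetical label-to-concept mapping of Sect. 3.2:

import random

def select_concept(emotion_label, concept_sets):
    # Eq. 2: pick one concept uniformly at random from the predicted class's set.
    return random.choice(sorted(concept_sets[emotion_label]))

# Example: select_concept("fear", {"fear": {"in terror", "fearfully"}})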

5 Experiments

Our experiments employ a caption model and an affective classifier in parallel. In the released model from the project page of [18], the maximum number of bounding boxes and captions is set to 1,000. However, such extensive object prediction is unnecessary for our purpose and consumes a lot of time, and it is not needed for information extraction either; the focus should be on generating vivid descriptions. We therefore reduce the maximum number of bounding boxes to 10. In our affective caption generation paradigm, emotional components are the discriminative part, given the novel role they play in the description. For adding emotional components, the eight emotions are used. To satisfy the syntactic structure, we convert the predicted emotion into a corresponding adverb for proper embedding. In the first step, the sentiment of each image is decided via global features: image features are fed into the emotion classifier to obtain the corresponding sentiment. In the baseline experiment, simple concepts are chosen to generate comparable captions. The eight emotions, originally represented by adjectives, are transformed into common adverb or phrase forms, for instance: \(\{Amuse\rightarrow happily,\, Awe\rightarrow amazingly,\, Content\rightarrow contentedly,\, Excite\rightarrow excitedly,\, Anger\rightarrow looks\,angry,\, Disgust\rightarrow queasily,\, Fear\rightarrow in\,terror,\, Sad\rightarrow sadly\}\). Traditional captions appended with these affective concepts thus comprise the novel emotional descriptions.
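
Using the adjective-to-adverb mapping listed above, the baseline fusion reduces to appending the adverb or phrase for the predicted emotion at the end of the traditional caption; a small sketch (the function name is ours):

EMOTION_TO_ADVERB = {
    "Amuse": "happily",    "Awe": "amazingly",     "Content": "contentedly",
    "Excite": "excitedly", "Anger": "looks angry", "Disgust": "queasily",
    "Fear": "in terror",   "Sad": "sadly",
}

def baseline_fuse(caption, emotion):
    # Append the affective concept for the predicted emotion to the caption.
    return caption.rstrip(". ") + " " + EMOTION_TO_ADVERB[emotion] + "."

# e.g. baseline_fuse("a man riding a unicycle", "Amuse")
#      -> "a man riding a unicycle happily."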

Table 2. Evaluations on the SentiCap dataset under six metrics (\(BLEU\_4\), \(BLEU\_3\), \(BLEU\_2\), \(BLEU\_1\), CIDEr, METEOR). The n in BLEU_n refers to n-gram co-occurrences. top1 to top5 correspond to the confidence ranks of the captions. The left-most column lists the model used to generate the captions.

To make our evaluation meaningful, we conduct the experiments on the SentiCap dataset. Its annotations, written around ANPs, provide a comparable benchmark for the affective captions generated by our model: the influence of affective concepts is better judged against references that themselves contain emotional components.

5.1 Implementation

Affective caption generation is implemented in two different ways. Relatively speaking, the first way is simple but rigid: the affective adverb is appended at the end of the sentence, so the emotional component is introduced forcibly.

To keep the emotion classifier applicable to images whose affect may be plain, we adopt VGG-16 for feature extraction in the model \(EmoCap_{vgg}\). In addition, the FI dataset is used to fine-tune the VGG net, and the model using the fine-tuned features is denoted \(EmoCap_{vggft}\). The specially designed emotion detector SentiBank is also applied to sentiment classification; the model aided by SentiBank features is denoted \(EmoCap_{bank}\). Each image from the captioning domain is then tagged with a sentiment. Based on this feature-level classification, the emotion label is fused with the corresponding generated caption, and a novel description is created. Features of cropped regions are extracted with the same feature extractors and fed into the trained emotion classifier to tag their own sentiments, which may differ depending on the features used.
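
A minimal PyTorch sketch of how a VGG-16-based emotion classifier along the lines of \(EmoCap_{vgg}\) could be assembled is shown below; it keeps the ImageNet-pretrained backbone, reuses the fc7 features and adds an 8-way emotion head to be trained (or fine-tuned) on FI. This is our illustration, not the authors' exact configuration.

import torch.nn as nn
import torchvision.models as models

class EmoClassifier(nn.Module):
    def __init__(self, num_emotions=8):
        super().__init__()
        vgg = models.vgg16(pretrained=True)     # ImageNet-pretrained backbone
        self.features = vgg.features
        self.pool = vgg.avgpool
        self.fc7 = nn.Sequential(*list(vgg.classifier.children())[:-1])  # 4096-d features
        self.head = nn.Linear(4096, num_emotions)   # new 8-way emotion head

    def forward(self, x):                       # x: (N, 3, 224, 224) images
        f = self.pool(self.features(x)).flatten(1)
        return self.head(self.fc7(f))           # logits over the 8 emotions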

In Eq. 3, I corresponds to an instance image, \(Score_{mother}\) represents the emotion scores predicted for the mother (whole) image, \(Score_{r_{old}}\) denotes a region's emotion scores computed without global context, and \(Score_{r_{new}}\) is the final score after combining the local and global scores:

$$\begin{aligned} Score_{r_{new}}^I = Score_{r_{old}}^I * \alpha + Score_{mother}^I * \beta \end{aligned}$$
(3)
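
Eq. 3 is a simple weighted combination of a region's local emotion scores and its mother image's global scores; a sketch with placeholder weights (the paper does not state the values of alpha and beta it used):

import numpy as np

def fuse_scores(region_scores, mother_scores, alpha=0.5, beta=0.5):
    # Eq. 3: Score_r_new = alpha * Score_r_old + beta * Score_mother
    region_scores = np.asarray(region_scores, dtype=float)
    mother_scores = np.asarray(mother_scores, dtype=float)
    new_scores = alpha * region_scores + beta * mother_scores
    return new_scores, int(new_scores.argmax())   # fused scores and emotion index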

5.2 Evaluation

To evaluate our system on the generation of novel affective concepts, we adopt common automatic metrics, BLEU, METEOR and CIDEr, computed with the Microsoft COCO evaluation software [4]. These metrics are also utilized in [30].

Table 2 shows the results for region captions on the SentiCap dataset. Affective captions produced with \(EmoCap_{vgg}\), \(EmoCap_{vggft}\) and \(EmoCap_{bank}\) are evaluated and compared with the traditional captions generated by DenseCap. We extract the top five captions predicted by the model and evaluate the captions of each rank separately; the higher the rank, the more precise the caption. From the table, we can see that the top-1 captions perform well in the overall evaluation. In the BLEU metrics, the trailing number refers to the n in n-gram; the differences between ground truth and generated captions become more pronounced as the metric goes from unigram to 4-gram [34]. We use C as the candidate set and S as the reference set. The BLEU score is determined by a modified precision \(P_n\) and a brevity penalty b. \(P_n\) counts the clipped n-gram matches in the candidate set:

$$\begin{aligned} P_n(C,S)=\frac{\sum _{c\in C}\sum _{n\text{-}gram\in c} {Count_{clip}(n\text{-}gram)}}{\sum _{c'\in C}\sum _{n\text{-}gram'\in c'} {Count(n\text{-}gram')}} \end{aligned}$$
(4)

where \(Count_{clip}\) clips the count of each n-gram in the candidate so that it does not exceed the maximum count of that n-gram in any single reference. The subscripts c and c' range over the candidate sentences in C. The brevity penalty is defined as:

$$\begin{aligned} b(C,S)=\left\{ \begin{matrix} 1, \quad \quad \quad if \quad l_c>l_s \\ e^{(1-l_s/l_c)}, \quad if \quad l_c\le l_s, \end{matrix}\right. \end{aligned}$$
(5)

where \(l_c\) is the length of the candidate sentences and \(l_s\) is the length of the references. The final BLEU score is then obtained via:

$$\begin{aligned} BLEU\_N(C,S)=b(C,S)\exp \left( \sum _{n=1}^N {\omega _n \log P_n(C,S)}\right) \end{aligned}$$
(6)
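
To make the metric concrete, the following self-contained sketch computes corpus-level BLEU along the lines of Eqs. 4-6 (clipped n-gram counts, uniform weights \(\omega_n = 1/N\), brevity penalty). It is illustrative only and not the COCO evaluation code [4].

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidates, references, N=4):
    # candidates: list of token lists; references: list of lists of token lists.
    log_p = []
    for n in range(1, N + 1):
        clipped, total = 0, 0
        for cand, refs in zip(candidates, references):
            cand_counts = Counter(ngrams(cand, n))
            max_ref = Counter()
            for ref in refs:                      # per-n-gram maximum over references
                for g, cnt in Counter(ngrams(ref, n)).items():
                    max_ref[g] = max(max_ref[g], cnt)
            clipped += sum(min(cnt, max_ref[g]) for g, cnt in cand_counts.items())
            total += sum(cand_counts.values())
        log_p.append(math.log(clipped / total) if clipped > 0 else float("-inf"))
    l_c = sum(len(c) for c in candidates)         # total candidate length
    l_s = sum(min((len(r) for r in refs), key=lambda L: abs(L - len(c)))
              for c, refs in zip(candidates, references))
    b = 1.0 if l_c > l_s else math.exp(1.0 - l_s / l_c)
    return b * math.exp(sum((1.0 / N) * lp for lp in log_p))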

METEOR is computed on the basis of unigrams, taking precision and the harmonic mean of precision and recall into account. While it does not consider relevance across segments, METEOR does account for the coherence among unigrams, and its correlation with human judgment is higher than that of BLEU.

Fig. 5.

Eight selected images from the SentiCap dataset. Each image is paired with an affective caption corresponding to one particular emotion, and all of the captions are embedded with sentimental expressions.

Apart from the above metrics, [36] proposed CIDEr, which is specially designed for the evaluation of image captions. The consensus between a candidate caption and the references is evaluated via n-grams weighted by TF-IDF: the TF-IDF scheme computes a weight \(g(\cdot )\) for each n-gram and incorporates it into the final evaluation. For all of these metrics, higher scores indicate better results.
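
The TF-IDF weighting behind CIDEr can be sketched as follows: the weight of an n-gram in a caption is its relative frequency in that caption multiplied by the log inverse document frequency over the image corpus. This is a simplified illustration of the weighting g(·), not the official CIDEr implementation.

import math
from collections import Counter

def tfidf_weights(caption_tokens, corpus_doc_freq, num_images, n=1):
    # corpus_doc_freq[g]: number of images whose references contain n-gram g.
    grams = [tuple(caption_tokens[i:i + n])
             for i in range(len(caption_tokens) - n + 1)]
    tf = Counter(grams)
    total = sum(tf.values())
    return {g: (cnt / total) * math.log(num_images / max(1.0, corpus_doc_freq.get(g, 0.0)))
            for g, cnt in tf.items()}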

5.3 Results

As Table 2 shows, the \(BLEU\_1\) to \(BLEU\_4\) scores of affective captions are higher than those of traditional captions. The differences among the three feature types are slight, possibly because the evaluation metrics depend on all words in a caption and the affective concepts account for only a small proportion of them. For the METEOR metric, there is no difference between affective and traditional captions. Traditional captions, however, obtain better scores under CIDEr. Overall, affective captions outperform in most metrics, which suggests that affective captions give more human-like descriptions of images: the introduction of affective components makes sense compared with plain descriptions of objects. The lower performance on CIDEr, which considers the consensus of captions, may indicate that the affective concepts are not yet idiomatic enough. Despite these shortcomings, we are confident that affective concepts are effective: the emotional concepts added to captions bring the descriptions closer to human intuition. Figure 5 presents example results with affective captions. The final emotion of each fragment is determined from the fragment itself together with the weighted emotion of its mother image. In this figure, each description below the corresponding fragment gives a sentimental expression of the content. The positive or negative ANPs used in [30] describe the state of objects, while affective concepts express sentiment directly: the additional adverbs in the novel captions convey the moods of the principal subjects in the photos. Compared with traditional captions, conveying moods resonates with human intuition directly and makes the images easy to relate to.

6 Conclusion

From the results and the example images with descriptions, it can be seen that the addition of affective concepts brightens the moods captured in photos and makes image captions more vivid: affective concepts mine the hidden sentiments in images and give them appropriate expression. What's more, there are still many improvements to be made. Beyond the quantitative evaluations, although emotional components are highlighted in the captions, they are not yet smooth enough. Human language contains all kinds of adverbs and phrases describing emotions, and there is always a more elegant expression for conveying sophisticated sentiments; the prediction of emotions should be more delicate to capture complicated human feelings. Besides, the selection of the position where affective concepts are embedded can be improved. One direction for extension is adaptive addition of affective concepts: an end-to-end framework consisting of the caption channel and the emotion classification channel would enable automatic generation of affective concepts. Moreover, the processing of affective concepts could be approached from various perspectives: emotion classification guides the selection of affective concepts, and further consideration of emotion distributions may lead to better coverage of sentiments and more precise descriptions.

References

  1. Borth, D., Ji, R., Chen, T., Breuel, T., Chang, S.F.: Large-scale visual sentiment ontology and detectors using adjective noun pairs. In: ACM MM (2013)

  2. Chen, T., Borth, D., Darrell, T., Chang, S.F.: Deepsentibank: visual sentiment concept classification with deep convolutional neural networks. In: CVPR (2014)

  3. Chen, T., Yu, F.X., Chen, J., Cui, Y., Chen, Y.Y., Chang, S.F.: Object-based visual sentiment concept analysis and application. In: ACM MM (2014)

  4. Chen, X., Fang, H., Lin, T.Y., Vedantam, R., Gupta, S., Dollar, P., Zitnick, C.L.: Microsoft COCO captions: data collection and evaluation server. In: CVPR (2015)

  5. Chen, X., Zitnick, C.L.: Learning a recurrent visual representation for image caption. In: CoRR (2014)

  6. Chen, X., Zitnick, C.L.: Mind’s eye: a recurrent visual representation for image caption generation. In: CVPR (2015)

  7. Chen, Y.Y., Chen, T., Hsu, W.H., Liao, H.Y.M., Chang, S.F.: Predicting viewer affective comments based on image content in social media. In: ICMR (2014)

  8. Deemter, K.V., Sluis, I.V.D., Gatt, A.: Building a semantically transparent corpus for the generation of referring expressions. In: INLG (2006)

  9. Donahue, J., Hendricks, L.A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Darrell, T., Saenko, K.: Long-term recurrent convolutional networks for visual recognition and description. In: CVPR (2015)

  10. Farhadi, A., Hejrati, M., Sadeghi, M.A., Young, P., Rashtchian, C., Hockenmaier, J., Forsyth, D.: Every picture tells a story: generating sentences from images. In: ECCV (2010)

  11. FitzGerald, N., Artzi, Y., Zettlemoyer, L.S.: Learning distributions over logical forms for referring expression generation. In: EMNLP (2013)

  12. Girshick, R.: Fast R-CNN. In: ICCV (2015)

  13. Golland, D., Liang, P., Dan, K.: A game-theoretic approach to generating spatial descriptions. In: EMNLP (2010)

  14. Hendricks, L.A., Venugopalan, S., Rohrbach, M., Mooney, R., Saenko, K., Darrell, T.: Deep compositional captioning: describing novel object categories without paired training data. In: CVPR (2016)

  15. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

  16. Hu, R., Xu, H., Rohrbach, M., Feng, J., Saenko, K., Darrell, T.: Natural language object retrieval. In: CVPR (2016)

  17. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J.: Caffe: convolutional architecture for fast feature embedding. In: CVPR (2014)

  18. Johnson, J., Karpathy, A., Li, F.F.: Densecap: fully convolutional localization networks for dense captioning. In: CVPR (2016)

  19. Karpathy, A., Joulin, A., Li, F.F.: Deep fragment embeddings for bidirectional image sentence mapping. Adv. Neural Inf. Process. Syst. 3, 1889–1897 (2014)

  20. Karpathy, A., Li, F.F.: Deep visual-semantic alignments for generating image descriptions. In: CVPR (2015)

  21. Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: Referitgame: referring to objects in photographs of natural scenes. In: EMNLP (2014)

  22. Kuznetsova, P., Ordonez, V., Berg, A.C., Berg, T.L., Choi, Y.: Collective generation of natural image descriptions. In: ACL: Long Papers (2012)

  23. Lang, P., Bradley, M., Cuthbert, B.: International affective picture system (IAPS): technical manual and affective ratings. Technical report, University of Florida, Gainesville

  24. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

  25. Lynch, C., Aryafar, K., Attenberg, J.: Unifying visual-semantic embeddings with multimodal neural language models. In: TACL (2015)

  26. Machajdik, J., Hanbury, A.: Affective image classification using features inspired by psychology and art theory. In: ACM MM (2010)

  27. Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: CVPR (2016)

  28. Mao, J., Wei, X., Yang, Y., Wang, J.: Learning like a child: fast novel visual concept learning from sentence descriptions of images. In: ICCV (2015)

  29. Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., Yuille, A.: Deep captioning with multimodal recurrent neural networks (m-RNN). In: ICLR (2015)

  30. Mathews, A., Xie, L., He, X.: SentiCap: generating image descriptions with sentiments. In: AAAI (2016)

  31. Mikels, J.A., Fredrickson, B.L., Larkin, G.R., Lindberg, C.M., Maglio, S.J., Reuter-Lorenz, P.A.: Emotional category data on images from the international affective picture system. Behav. Res. Methods 37(4), 626–630 (2005)

  32. Mitchell, M., Deemter, K.V., Reiter, E.: Natural reference to objects in a visual domain. In: INLG (2010)

  33. Morina, N., Leibold, E., Ehring, T.: Vividness of general mental imagery is associated with the occurrence of intrusive memories. J. Behav. Ther. Exp. Psychiatry 44(2), 221–226 (2013)

  34. Papineni, K.: BLEU: a method for automatic evaluation of machine translation. Wirel. Netw. 4(4), 307–318 (2002)

  35. Rashtchian, C., Young, P., Hodosh, M., Hockenmaier, J.: Collecting image annotations using Amazon’s mechanical turk. In: NAACL HLT Workshop (2010)

  36. Vedantam, R., Zitnick, C.L., Parikh, D.: Cider: consensus-based image description evaluation. In: CoRR (2015)

  37. Viethen, J., Dale, R.: The use of spatial relations in referring expression generation. In: INLG (2010)

  38. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: CVPR (2015)

  39. Wang, W., He, Q.: A survey on emotional semantic image retrieval. In: ICIP (2008)

  40. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: neural image caption generation with visual attention. In: ICCV (2015)

  41. Yao, B.Z., Yang, X., Lin, L., Lee, M.W., Zhu, S.C.: I2T: image parsing to text description. In: Proceedings of the IEEE, vol. 98, no. 8, pp. 1485–1508 (2010)

  42. You, Q., Jin, H., Wang, Z., Fang, C., Luo, J.: Image captioning with semantic attention. In: CVPR (2016)

  43. You, Q., Luo, J., Jin, H., Yang, J.: Robust image sentiment analysis using progressively trained and domain transferred deep networks. In: AAAI (2015)

  44. You, Q., Luo, J., Jin, H., Yang, J.: Building a large scale dataset for image emotion recognition: the fine print and the benchmark. In: AAAI (2016)

  45. Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. In: ACL (2014)

  46. Zhao, S., Gao, Y., Jiang, X., Yao, H., Chua, T.S., Sun, X.: Exploring principles-of-art features for image emotion recognition. In: ACM MM (2014)

Author information

Authors and Affiliations

  1. Nankai University, Tianjin, China

    Yan Sun & Bo Ren

Corresponding author

Correspondence to Bo Ren.

Editor information

Editors and Affiliations

  1. Civil Aviation University of China, Tianjin, China

    Jinfeng Yang

  2. School of Computer Science and Technology, Tianjin University, Tianjin, China

    Qinghua Hu

  3. Nankai University, Tianjin, China

    Ming-Ming Cheng

  4. Institute of Automation, Chinese Academy of Sciences, Beijing, China

    Liang Wang

  5. Information Science and Technology, Nanjing University, Beijing, China

    Qingshan Liu

  6. Huazhong University of Science and Technology, Wuhan, Hubei, China

    Xiang Bai

  7. Xi’an Jiaotong University, Xi’an, China

    Deyu Meng


Copyright information

© 2017 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Sun, Y., Ren, B. (2017). Automatic Image Description Generation with Emotional Classifiers. In: Yang, J., et al. (eds.) Computer Vision. CCCV 2017. Communications in Computer and Information Science, vol 771. Springer, Singapore. https://doi.org/10.1007/978-981-10-7299-4_63



