Automatic Image Description Generation with Emotional Classifiers

  • Conference paper in Computer Vision (CCCV 2017)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 771)


Abstract

Automatically generating a natural sentence that describes the content of an image is an active topic in artificial intelligence, linking the domains of computer vision and natural language processing. Most existing works leverage large object recognition datasets and external text corpora to integrate knowledge between similar concepts. While current works aim to answer ‘what is it’, ‘where is it’ and ‘what is it doing’ in images, we focus on a less considered but critical question: ‘how it feels’ in the content of images. We propose to express the feelings contained in images in a more direct and vivid way. To achieve this goal, we extend a pre-trained caption model with an emotion classifier that adds abstract knowledge to the original caption. We apply our method to datasets originating from various domains. In particular, we evaluate it on the newly constructed SentiCap dataset with multiple evaluation metrics. The results show that the newly generated descriptions summarize the images vividly.



1 Introduction

Automatically describing the content of an image in the form of a sentence is an important but challenging problem in artificial intelligence. In the last few years, many works have made considerable progress on image captioning. Generating a standard description for a given image requires comprehensive knowledge, including object detection, language processing and so on.

The idea of image captioning is to transfer the content of images from the visual domain to language. In particular, much attention has been paid to objects or regions in the image. Referring expressions and accuracy boosting are two major threads in the development of image captioning. As Fig. 1 shows, recent work on image captioning usually models content in the form of ‘what is it’, ‘where is it’ and ‘what is it doing’ in the image. To some extent, the resulting captions reflect the actual substance of images precisely, and details are expressed either globally or locally. Though these sentences look reasonable, they can hardly be called human-like expressions. Human intuition usually extracts more information. When people look at the left image in Fig. 1, they wonder whether riding a unicycle is fun; the man in the picture looks excited about it. As to the right image, someone is bound to sigh, ‘What a pity, a coyote walking in the snow!’. Beyond the objects that are actually present, images hide sensible meanings that come from human culture, experience and so on.

Fig. 1.

Two images with the corresponding captions published by [14]. The captions precisely reflect ‘what is it’, ‘where is it’ and ‘what is it doing’. But do these captions reflect all the information people can capture from the images? In terms of human intuition, the answer is no.

Concrete captions have been widely studied. This universal form has been adopted not only in works [9,10,16,22,40,41,42] that attempt to model the whole image, but also in research [8,11,13,21,32,37] based on local regions. Besides, many works [5,6,14,28] contribute, in various ways, to extending captions to objects that are missing from current datasets. These works expand the range of what can actually be recognized in images. With more and more images uploaded to all kinds of social networks and photo-sharing sites, is simple detection of visible objects enough to mimic human intuition? In another field of computer vision, researchers argue that images contain multiple emotions. In particular, they point out that abundant human intuition and emotion are hidden in creative applications such as art and music, but existing works lack the ability to model and utilize this information. Although [30] modeled adjective descriptions with the aid of ANPs, sentiments were not presented directly or clearly, since ANPs are essentially auxiliary tools for emotion prediction rather than the goal itself. Strong sentiments embedded in images strengthen the opinion conveyed by the content, which can influence others effectively. Being aware of the emotional components included in a picture will benefit the understanding of creative applications. [39] suggested Emotional Semantic Image Retrieval (ESIR), which is built to mimic human decisions and provide users with tools to integrate emotional components into their creative works. In line with [39], [26] classified images by affect using low-level features extracted from texture and color. [1] crawled large-scale images from the web to train a sentiment detector targeting affective image classification and learning what is in pictures, so that the affective information an image carries can be obtained. Images permeate human life, existing not only as figural narration but also as carriers of affect. Nowadays a kid in a photo is not merely a kid: the mental state of the child is always attractive, and the feeling conveyed by animals in a picture is usually the point. Meanwhile, vividness is a test indicator in the field of mental imagery [33]. As Fig. 2 shows, the traditional caption on the left illustrates what there is in the image; such a description could point to more than one figure. Instead, the caption on the right gives a more vivid description with the affective expression ‘with funny looks’. The caption with the newly added phrase distinguishes the subfigure against the orange background from the others.

Fig. 2.

Left: the traditional caption describes what exists in the image and what it is doing; no emotional feeling is expressed, and the information it conveys could just as well be carried by a black-and-white picture. Right: to some extent, the addition of a single affective phrase brings in colorful components and lives up to human intuition when seeing the picture.

Prior works on captioning images have made significant progress, and it is exciting and critical to push image description further to the affective level, yielding an understanding that is both figural and vivid. To achieve the goal of affective captioning, data with emotions is needed, but constructing new datasets with affective descriptions consumes a great deal of time and human resources. Considering this situation, we design a cross-domain learning process built on data from both image captioning and affective image classification. In summary, we propose to describe images with a novel concept: we mine the existent but abstract information contained in images and express it in the caption, conveying a more vivid story than a flat description. Our contribution is twofold: (i) we introduce novel affective concepts into image captioning to generate more vivid descriptions; (ii) to combine captions with affective concepts, we propose to learn from cross-domain data, utilizing existing datasets designed for image captioning and image sentiment. The interaction between cross-domain data saves human resources in implementing our method.

2 Related Works

Our paper draws on recent works in image captioning. The link to affective image classification enables more vivid descriptions of images: processing both recognition content and affective components allows the creation of affective captions.

2.1 Image Captioning

Describing images with natural language has become a focal task. General paradigms for image captioning can be split into two groups: top-down and bottom-up. The top-down strategy aims to transform images into sentences, whereas the bottom-up strategy picks out words corresponding to different parts of an image and then combines them into sentence form. With the development of recurrent neural networks, the top-down paradigm shows better performance, and deep formulations are widely realized in top-down approaches. On the one hand, the combination of a CNN and an RNN is adopted in many works [9,20,38]: the CNN serves as a high-level feature extractor while the RNN models the sequence of words to generate captions. On the other hand, CNN and RNN models are integrated into a multimodal framework [25,28,29], where they interact in a multimodal space and predict captions word by word. Moreover, deep captioning is no longer restricted to descriptions of the whole image. Many recent works [8,11,13,21,27,32,37] aim to create exclusive descriptions for one exact region of an image; this captioning task is known as ‘referring expression’ generation. Besides, [18] proposed a Fully Convolutional Localization Network (FCLN) for dense captioning based on object detection and language models. The FCLN model obtains a set of regions and associates them with captions, so far richer information can be captured with region captioning.

Apart from contributions to accurate descriptions and precise localization, the coverage of captioning has been expanded as well. [14,28] succeeded in describing objects that never occur in paired training data: [28] constructed three novel-concept datasets to supplement the missing object information, while [14] learned the required knowledge from unpaired image data and text data.

With so much effort devoted to image captioning, we propose to introduce another novel concept for image description. Unlike prior works that focus on the coverage of real recognition, we aim to convey the emotional components hidden in images for a vivid expression. Similarly, [30] proposed to add adjectives to descriptions on the basis of SentiBank [1]. According to [30], emotional information is also what they aim to capture when generating captions; however, the ANPs they depend on are only an auxiliary step toward sentiment prediction, and the images within the scope of some ANPs are crawled from Flickr without any verification.

2.2 Emotion Modeling

Image sentiment attracts more and more attention nowadays, and many researchers attempt to model the emotions contained in images. In particular, affective image classification is the most active topic in emotion modeling. Inspired by psychological and art theories, [26,39] proposed to extract emotions with texture and color features. To address the semantic gap between low-level features and human intuition, mid-level representations have been studied. [46] proposed principles-of-art features to predict which emotion an image evokes. Meanwhile, [1] utilized the rich repository of images on the web to build an emotion detector; the detector has 1,200 dimensions corresponding to Adjective Noun Pairs (ANPs). [3,7,43] tried to improve affective image classification in various ways. More recently, deep neural networks have been introduced extensively to deal with emotion modeling. [2] extends SentiBank to DeepSentiBank based upon Caffe [17]. Besides, deep architectures [43] and a reliable benchmark [44] have been established as well.

Fig. 3.

Pipeline: the generation of affective captions consists of two channels. One is the extraction of traditional captions following the FCLN model; the other prepares affective concepts via emotion classification.

3 Overview

Figure 3 gives an overview of our proposed framework for affective captioning. The process consists of two parallel parts: a caption channel and an affective classification channel.

For the caption channel, the input images are fed into a neural-network-based captioning architecture. This part obtains dense, traditional captions from images; the output captions serve as the foundation to be modified with affective concepts. The affective classification channel, which can run simultaneously with the caption channel, captures the emotion attributes of images. Through the classification channel, images are divided into different sentiment categories. We treat the sentiment categories as baskets, each containing various affective concepts; concepts are then mapped to the images in the corresponding basket as the output of the classification channel. The intermediate results of the two channels are combined via a fusion method, and a novel affective caption for the image emerges. Details of the two channels and the fusion are given in Sect. 4 (Approach).
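
As a rough illustration of how the two channels could be composed, the following Python sketch assumes hypothetical caption_model and emotion_classifier callables and an illustrative concept table; it is a sketch of the pipeline logic under those assumptions, not the implementation described in Sect. 4.

import random

# Illustrative emotion-to-concept table (stand-in values, not Table 1).
CONCEPT_TABLE = {"amusement": ["happily", "with funny looks"],
                 "sadness": ["sadly"]}

def affective_caption(image, caption_model, emotion_classifier,
                      concept_table=CONCEPT_TABLE):
    # Channel 1: traditional (dense) captions for the image or its regions.
    captions = caption_model(image)          # e.g. ["a man riding a unicycle"]
    # Channel 2: emotion category predicted from the image features.
    emotion = emotion_classifier(image)      # e.g. "amusement"
    # Fusion: pick a concept from the predicted category's basket and attach it.
    concept = random.choice(concept_table[emotion])
    return [c.rstrip(".") + " " + concept for c in captions]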

Also, to the best of our knowledge, there is no existing dataset specialized simultaneously for image captioning and affective classification, and collecting and annotating such data would demand a large amount of human effort and time. Given this reality, we argue that it is reasonable to train the emotion classifier with cross-domain data while evaluating our results on datasets designed for captioning. Next, we discuss the selection of data.

3.1 Data Preparation

As mentioned above, we utilize emotion datasets to train the affective classifier separately. The emotion classifier is then applied to caption datasets to extract emotional components.

First, we briefly review several datasets for affective classification. IAPS, proposed by [23], is the earliest and most common stimulus set for affective image classification; its pictures mainly depict complex scenes, and its subset IAPSa is often used in visual sentiment research. Besides, [26] constructed Artphoto and Abstract Paintings, two datasets formed of pictures with artistic attributes: Artphoto comprises works by photographers, while Abstract Paintings is a set of works of art. These three datasets are of modest scale, each containing fewer than a thousand images. [44] established a larger-scale dataset for image emotion recognition, which we refer to as FI; it not only expands the data to a greater magnitude but is also balanced across categories. The category system of Amusement, Anger, Awe, Contentment, Disgust, Excitement, Fear and Sadness proposed by [31] is widely used in emotion models, and all the datasets mentioned above follow this definition; in this paper, we adopt the same standard. Comparing the above emotion datasets, FI is large enough and its Amazon Mechanical Turk (AMT) annotation assures the reliability of the labels, so we train the emotion classifier on FI.

On the other hand, Flickr8k, Flickr30k, and MS COCO are widely used datasets in the image captioning domain. Flickr8k [35] consists of over 8,000 images from the largest photo-sharing site, Flickr; each image is provided with five descriptions, and expert annotations are supplied as well. Flickr30k [45] is similar to Flickr8k, with five sentences attached to each image; both datasets were designed for image-sentence retrieval. MS COCO, proposed by [24], includes more than 80,000 images with five sentence annotations each, a magnitude ten times larger than Flickr8k and Flickr30k. [30] recently annotated the SentiCap dataset, whose images are selected in connection with ANPs; they designed a re-writing task upon the original captions from MS COCO. Though we argue that ANPs are only an auxiliary tool for emotion prediction, the SentiCap annotations are closer to affective captions than traditional annotations. Thus, we evaluate our method on the SentiCap dataset.

3.2 Affective Concepts

Affective concepts are what we add to traditional captions. These concepts can be adverbs, phrases and so on. In the fusion with a traditional caption, an affective concept is treated as a single entity, and it plays an important role in the generation of the affective caption. The selection of affective concepts depends on the sentiment label of each image, and each emotion can be mapped to multiple concepts. In affective image classification, two linguistic resources, SentiWordNet and SentiStrength, are widely utilized; they contain large quantities of sentiment vocabulary. Inspired by these two ontologies, we select concepts for each of the eight emotion categories. Table 1 shows some examples for the fixed categories.
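
As a data-structure sketch, such an ontology can be stored as a mapping from each of the eight emotion categories to a list of candidate concepts; the entries below are illustrative placeholders, not the actual contents of Table 1.

# Hypothetical concept ontology keyed by the eight emotion categories;
# each emotion maps to several candidate adverbs or phrases.
AFFECTIVE_CONCEPTS = {
    "amusement":   ["happily", "with funny looks"],
    "anger":       ["angrily", "looks angry"],
    "awe":         ["amazingly"],
    "contentment": ["contentedly"],
    "disgust":     ["queasily"],
    "excitement":  ["excitedly"],
    "fear":        ["in terror"],
    "sadness":     ["sadly"],
}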

3.3 Model Selection

In the early days, the model of [41] described images in two steps: (a) image parsing and (b) text description; decomposition splits images into parts to find the relations between objects or parts. The way [19] models captions is similar to [41]: a bidirectional image-sentence mapping is proposed, the association is learned from image datasets and contexts, and fragments related to object detections and sentences are aligned. Furthermore, as there is not always a single principal object in a picture, referring expressions started drawing attention. [20] detects regions in images and treats the associated sentence as a rich label space; the model learns correspondences in the training set and then generates new descriptions. Beyond extending the content, novel description concepts were introduced by the model in [28]: new words are added to the original model, datasets with novel concepts are constructed for training, and images with novel concepts can then be described. Remarkably, [14] models unique concepts without the need for additional paired data; simply using unpaired images and text helps the model recognize novel concepts. Besides, [18] proposed DenseCap to capture the rich information contained in images, mining adequate sources from limited data. In that work, an end-to-end framework is constructed with a localization layer, so no extra bounding boxes need to be prepared and sentences are generated in a convenient and direct way.

At first, captions were generated for whole images; over time, captioning has become more and more accurate, and pointed objects or regions can be detected. In our work, emotion datasets such as IAPS are well labeled but limited in scale. By taking advantage of models that excel at region-level description, many more referring sentences and sub-images can be cropped from the original datasets. To gain this benefit, we follow the model released by [18]; the gathered bounding boxes are then complemented with affective concepts.

4 Approach

The goal of this section is to generate affective captions with the aid of emotion prediction and traditional image captioning. Given an image I, we define \(\mathbf X = \{ x_1, x_2, \ldots , x_d \}\) as its visual features, where d is the feature dimension. A sequence of words \(\mathbf Y = \{ y_1, y_2, \ldots , y_n\}\) denotes the original caption. We aim to extend the traditional caption to an affective expression, denoted \( \hat{\mathbf{Y }} = \{ \hat{y}_1, \hat{y}_2, \ldots , \hat{y}_n, \hat{y}_{n+1} \}\), where the extra \(\hat{y}_{n+1}\) is the affective-concept entity. To generate affective captions, we prepare the original caption \(\mathbf Y \) and the affective concept \(\hat{y}_{n+1}\) separately.

4.1 Concrete Modelling

In this section, we aim to obtain general captions of images. We follow the FCLN model proposed by [18], which concentrates on dense captioning. Dense descriptions capture richer information and can also be treated as a multiplier for affective data. In this model, dense information is obtained from bounding boxes computed in a new localization layer: region proposals are predicted by regressing offsets from a set of translation-invariant anchors, adopting the parameterization of [12], with proposals regressed from anchors under certain scale settings. The B most confident bounding boxes are subsampled from the large set of proposals. As the sizes and aspect ratios vary among proposals, the FCLN model applies bilinear interpolation to obtain features of consistent dimension. The features of the chosen bounding boxes are output from the localization layer and serve as input for caption generation (Fig. 4).
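
For concreteness, the anchor-offset parameterization of [12] that the localization layer regresses can be written as the following small sketch; the variable names are ours and the snippet only illustrates how a proposal box is decoded from an anchor.

import math

def decode_box(anchor, offsets):
    # anchor: (cx, cy, w, h); offsets: regressed (tx, ty, tw, th).
    cx, cy, w, h = anchor
    tx, ty, tw, th = offsets
    px = cx + w * tx          # shift the centre proportionally to the anchor size
    py = cy + h * ty
    pw = w * math.exp(tw)     # scale width and height in log space
    ph = h * math.exp(th)
    return (px, py, pw, ph)   # proposal box in (cx, cy, w, h) form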

Fig. 4.

Example images from different domains. The top row shows images from image captioning datasets; the bottom row shows images used for affective image classification.

Table 1. Examples from the affective concept ontology for the eight emotions.

With regions prepared, an RNN language model is used to generate descriptions for the regions. The input to the RNN is a sequence of words \(\{y_0, y_1, \ldots , y_n, y_{n+1}\}\), where the additional tokens \(y_0\) and \(y_{n+1}\) represent the start code and end code. LSTM units are used to model the description sequence, and a series of hidden states \(h_t\) is computed to produce the output \(\hat{\mathbf{Y }} = \{ \hat{y}_1, \hat{y}_2, \ldots ,\hat{y}_n\}\). The recurrent formula is given in Eq. 1.

$$\begin{aligned} \hat{y}_t=f(h_{t-1},x_t) \end{aligned}$$
(1)

The subscript t denotes the timestep in the LSTM model, which follows [15] in the FCLN model. Gates are the main components of the memory cell: whether a value passed between layers is kept or not is determined by the gates, where a gate value of 1 means the value is preserved and 0 means it is forgotten. The input and output values are denoted i and o, respectively. Since \(h_{t-1}\), \(c_{t-1}\) and \(x_t\) are known at time t, the hidden unit \(h_t\) can be computed from \(h_{t-1}\) and \(x_t\). When the end code \(y_{n+1}\) is produced, the process ends. Based on this recurrent neural network formulation, the original dense captions are prepared.
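
A minimal PyTorch sketch of the greedy decoding loop implied by Eq. 1 is given below. It assumes the region feature has already been projected to the LSTM hidden size and uses hypothetical start/end token ids; it is an illustration, not the released FCLN code.

import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(embed_dim, hidden_dim)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def greedy_decode(self, feat, start_id, end_id, max_len=15):
        # feat: (hidden_dim,) visual feature used to initialise the hidden state.
        h = feat.unsqueeze(0)                         # h_0 derived from the region feature
        c = torch.zeros_like(h)                       # initial memory cell c_0
        y = torch.tensor([start_id])                  # y_0: start code
        words = []
        for _ in range(max_len):
            h, c = self.cell(self.embed(y), (h, c))   # h_t = f(h_{t-1}, x_t)
            y = self.proj(h).argmax(dim=-1)           # greedy choice of \hat{y}_t
            if y.item() == end_id:                    # stop at the end code y_{n+1}
                break
            words.append(y.item())
        return words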

4.2 Affective Concepts Predicting

The prediction of affective concepts plays an important role in the generation of affective captions. An affective concept is treated as an entity to be added to a flat caption in order to form an affective caption, and each affective concept is related to an emotion category. We therefore select affective concepts hierarchically: first, the emotion class of the image or region is needed, which an emotion classifier provides naturally; then we scan the concepts belonging to the predicted class and pick a proper one to embed into the caption. Details are given below. Affective concepts are denoted \(\hat{y}_{n+1}\) in the final form of the affective caption. As mentioned before, we use the emotion dataset FI to train an emotion classifier. Features x’ are mapped to one of eight labels \(l_{emo}\). Each label corresponds to a set of affective concepts, \(l_{emo} \rightarrow \{ concept_1, concept_2, \ldots , concept_k \}\), where k denotes the number of concepts selected per category. A random function is used to pick out a feasible concept:

$$\begin{aligned} concept_{select} = rand(set(l_{emo})) \end{aligned}$$
(2)
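
Eq. 2 amounts to a uniform random draw from the concept set of the predicted emotion class. A minimal sketch, with concept_sets standing in for the hypothetical label-to-concept mapping of Sect. 3.2:

import random

def select_concept(emotion_label, concept_sets):
    # Eq. 2: pick one concept uniformly at random from the predicted class's set.
    return random.choice(sorted(concept_sets[emotion_label]))

# Example: select_concept("fear", {"fear": {"in terror", "fearfully"}})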

5 Experiments

Our experiments employ a caption model and an affective classifier in parallel. In the released model from the project page of [18], the maximum number of bounding boxes and captions is set to 1,000. However, such extensive object prediction is unnecessary for our purpose and consumes a lot of time, and it is not needed for information extraction either; the focus should be on generating vivid descriptions. We therefore reduce the maximum number of bounding boxes to 10. In our affective caption generation paradigm, emotional components are the discriminative part, given the novel role they play in the description. For adding emotional components, the eight emotions are used. To satisfy the syntactic structure, we convert the predicted emotion into a corresponding adverb for proper embedding. In the first step, the sentiment of each image is decided via global features: image features are fed into the emotion classifier to obtain the corresponding sentiment. In the baseline experiment, simple concepts are chosen to generate comparable captions. The eight emotions, originally represented by adjectives, are transformed into common adverb or phrase forms, for instance: \(\{Amuse\rightarrow happily,\, Awe\rightarrow amazingly,\, Content\rightarrow contentedly,\, Excite\rightarrow excitedly,\, Anger\rightarrow looks\,angry,\, Disgust\rightarrow queasily,\, Fear\rightarrow in\,terror,\, Sad\rightarrow sadly\}\). Traditional captions appended with these affective concepts thus comprise the novel emotional descriptions.
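
Using the adjective-to-adverb mapping listed above, the baseline fusion reduces to appending the adverb or phrase for the predicted emotion at the end of the traditional caption; a small sketch (the function name is ours):

EMOTION_TO_ADVERB = {
    "Amuse": "happily",    "Awe": "amazingly",     "Content": "contentedly",
    "Excite": "excitedly", "Anger": "looks angry", "Disgust": "queasily",
    "Fear": "in terror",   "Sad": "sadly",
}

def baseline_fuse(caption, emotion):
    # Append the affective concept for the predicted emotion to the caption.
    return caption.rstrip(". ") + " " + EMOTION_TO_ADVERB[emotion] + "."

# e.g. baseline_fuse("a man riding a unicycle", "Amuse")
#      -> "a man riding a unicycle happily."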

Table 2. Evaluations on the SentiCap dataset under six metrics (\(BLEU\_4\), \(BLEU\_3\), \(BLEU\_2\), \(BLEU\_1\), CIDEr, METEOR). The n in BLEU_n refers to n-gram co-occurrences. top1 to top5 correspond to the confidence ranks of the captions. The left-most column lists the model used to generate the captions.

To make our evaluation meaningful, we conduct the experiments on the SentiCap dataset. Its annotations, written around ANPs, provide a comparable benchmark for the affective captions generated by our model: the influence of affective concepts is better judged against references that themselves contain emotional components.

5.1 Implementation

Affective caption generation is implemented in two different ways. Relatively speaking, the first way is simple but rigid: the affective adverb is appended at the end of the sentence, so the emotional component is introduced forcibly.

To keep the emotion classifier applicable to images whose affect may be plain, we adopt VGG-16 for feature extraction in the model \(EmoCap_{vgg}\). In addition, the FI dataset is used to fine-tune the VGG net, and the model using the fine-tuned features is denoted \(EmoCap_{vggft}\). The specially designed emotion detector SentiBank is also applied to sentiment classification; the model aided by SentiBank features is denoted \(EmoCap_{bank}\). Each image from the captioning domain is then tagged with a sentiment. Based on this feature-level classification, the emotion label is fused with the corresponding generated caption, and a novel description is created. Features of cropped regions are extracted with the same feature extractors and fed into the trained emotion classifier to tag their own sentiments, which may differ depending on the features used.
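
A minimal PyTorch sketch of how a VGG-16-based emotion classifier along the lines of \(EmoCap_{vgg}\) could be assembled is shown below; it keeps the ImageNet-pretrained backbone, reuses the fc7 features and adds an 8-way emotion head to be trained (or fine-tuned) on FI. This is our illustration, not the authors' exact configuration.

import torch.nn as nn
import torchvision.models as models

class EmoClassifier(nn.Module):
    def __init__(self, num_emotions=8):
        super().__init__()
        vgg = models.vgg16(pretrained=True)     # ImageNet-pretrained backbone
        self.features = vgg.features
        self.pool = vgg.avgpool
        self.fc7 = nn.Sequential(*list(vgg.classifier.children())[:-1])  # 4096-d features
        self.head = nn.Linear(4096, num_emotions)   # new 8-way emotion head

    def forward(self, x):                       # x: (N, 3, 224, 224) images
        f = self.pool(self.features(x)).flatten(1)
        return self.head(self.fc7(f))           # logits over the 8 emotions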

In Eq. 3, I corresponds to an instance image, \(Score_{mother}\) represents the emotion scores predicted for the mother (whole) image, \(Score_{r_{old}}\) denotes a region's emotion scores computed without global context, and \(Score_{r_{new}}\) is the final score after combining the local and global scores:

$$\begin{aligned} Score_{r_{new}}^I = Score_{r_{old}}^I * \alpha + Score_{mother}^I * \beta \end{aligned}$$
(3)
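
Eq. 3 is a simple weighted combination of a region's local emotion scores and its mother image's global scores; a sketch with placeholder weights (the paper does not state the values of alpha and beta it used):

import numpy as np

def fuse_scores(region_scores, mother_scores, alpha=0.5, beta=0.5):
    # Eq. 3: Score_r_new = alpha * Score_r_old + beta * Score_mother
    region_scores = np.asarray(region_scores, dtype=float)
    mother_scores = np.asarray(mother_scores, dtype=float)
    new_scores = alpha * region_scores + beta * mother_scores
    return new_scores, int(new_scores.argmax())   # fused scores and emotion index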

5.2 Evaluation

To evaluate our system on the generation of novel affective concepts, we adopt common automatic metrics, BLEU, METEOR and CIDEr, computed with the Microsoft COCO evaluation software [4]. These metrics are also utilized in [30].

Table 2 shows the results for region captions on the SentiCap dataset. Affective captions produced with \(EmoCap_{vgg}\), \(EmoCap_{vggft}\) and \(EmoCap_{bank}\) are evaluated and compared with the traditional captions generated by DenseCap. We extract the top five captions predicted by the model and evaluate the captions of each rank separately; the higher the rank, the more precise the caption. From the table, we can see that the top-1 captions perform well in the overall evaluation. In the BLEU metrics, the trailing number refers to the n in n-gram; the differences between ground truth and generated captions become more pronounced as the metric goes from unigram to 4-gram [34]. We use C as the candidate set and S as the reference set. The BLEU score is determined by a modified precision \(P_n\) and a brevity penalty b. \(P_n\) counts the clipped n-gram matches in the candidate set:

$$\begin{aligned} P_n(C,S)=\frac{\sum _{c\in C}\sum _{n\text{-}gram\in c} {Count_{clip}(n\text{-}gram)}}{\sum _{c'\in C}\sum _{n\text{-}gram'\in c'} {Count(n\text{-}gram')}} \end{aligned}$$
(4)

where \(Count_{clip}\) clips the count of each n-gram in the candidate so that it does not exceed the maximum count of that n-gram in any single reference. The subscripts c and c' range over the candidate sentences in C. The brevity penalty is defined as:

$$\begin{aligned} b(C,S)=\left\{ \begin{matrix} 1, \quad \quad \quad if \quad l_c>l_s \\ e^{(1-l_s/l_c)}, \quad if \quad l_c\le l_s, \end{matrix}\right. \end{aligned}$$
(5)

where \(l_c\) is the length of the candidate sentences and \(l_s\) is the length of the references. The final BLEU score is then obtained via:

$$\begin{aligned} BLEU\_N(C,S)=b(C,S)\exp \left( \sum _{n=1}^N {\omega _n \log P_n(C,S)}\right) \end{aligned}$$
(6)
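
To make the metric concrete, the following self-contained sketch computes corpus-level BLEU along the lines of Eqs. 4-6 (clipped n-gram counts, uniform weights \(\omega_n = 1/N\), brevity penalty). It is illustrative only and not the COCO evaluation code [4].

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidates, references, N=4):
    # candidates: list of token lists; references: list of lists of token lists.
    log_p = []
    for n in range(1, N + 1):
        clipped, total = 0, 0
        for cand, refs in zip(candidates, references):
            cand_counts = Counter(ngrams(cand, n))
            max_ref = Counter()
            for ref in refs:                      # per-n-gram maximum over references
                for g, cnt in Counter(ngrams(ref, n)).items():
                    max_ref[g] = max(max_ref[g], cnt)
            clipped += sum(min(cnt, max_ref[g]) for g, cnt in cand_counts.items())
            total += sum(cand_counts.values())
        log_p.append(math.log(clipped / total) if clipped > 0 else float("-inf"))
    l_c = sum(len(c) for c in candidates)         # total candidate length
    l_s = sum(min((len(r) for r in refs), key=lambda L: abs(L - len(c)))
              for c, refs in zip(candidates, references))
    b = 1.0 if l_c > l_s else math.exp(1.0 - l_s / l_c)
    return b * math.exp(sum((1.0 / N) * lp for lp in log_p))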

METEOR is computed on the basis of unigrams, taking precision and the harmonic mean of precision and recall into account. While it does not consider relevance across segments, METEOR does account for the coherence among unigrams, and its correlation with human judgment is higher than that of BLEU.

Fig. 5.

Eight selected images from the SentiCap dataset. Each image is paired with an affective caption corresponding to one particular emotion, and all of the captions are embedded with sentimental expressions.

Apart from the above metrics, [36] proposed CIDEr, which is specially designed for the evaluation of image captions. The consensus between a candidate caption and the references is evaluated via n-grams weighted by TF-IDF: the TF-IDF scheme computes a weight \(g(\cdot )\) for each n-gram and incorporates it into the final evaluation. For all of these metrics, higher scores indicate better results.
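
The TF-IDF weighting behind CIDEr can be sketched as follows: the weight of an n-gram in a caption is its relative frequency in that caption multiplied by the log inverse document frequency over the image corpus. This is a simplified illustration of the weighting g(·), not the official CIDEr implementation.

import math
from collections import Counter

def tfidf_weights(caption_tokens, corpus_doc_freq, num_images, n=1):
    # corpus_doc_freq[g]: number of images whose references contain n-gram g.
    grams = [tuple(caption_tokens[i:i + n])
             for i in range(len(caption_tokens) - n + 1)]
    tf = Counter(grams)
    total = sum(tf.values())
    return {g: (cnt / total) * math.log(num_images / max(1.0, corpus_doc_freq.get(g, 0.0)))
            for g, cnt in tf.items()}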

5.3 Results

As Table 2 shows, the \(BLEU\_1\) to \(BLEU\_4\) scores of affective captions are higher than those of traditional captions. The differences among the three feature types are slight, possibly because the evaluation metrics depend on all words in a caption and the affective concepts account for only a small proportion of them. For the METEOR metric, there is no difference between affective and traditional captions. Traditional captions, however, obtain better scores under CIDEr. Overall, affective captions outperform in most metrics, which suggests that affective captions give more human-like descriptions of images: the introduction of affective components makes sense compared with plain descriptions of objects. The lower performance on CIDEr, which considers the consensus of captions, may indicate that the affective concepts are not yet idiomatic enough. Despite these shortcomings, we are confident that affective concepts are effective: the emotional concepts added to captions bring the descriptions closer to human intuition. Figure 5 presents example results with affective captions. The final emotion of each fragment is determined from the fragment itself together with the weighted emotion of its mother image. In this figure, each description below the corresponding fragment gives a sentimental expression of the content. The positive or negative ANPs used in [30] describe the state of objects, while affective concepts express sentiment directly: the additional adverbs in the novel captions convey the moods of the principal subjects in the photos. Compared with traditional captions, conveying moods resonates with human intuition directly and makes the images easy to relate to.

6 Conclusion

From the results and the example images with descriptions, it can be seen that the addition of affective concepts brightens the moods captured in photos and makes image captions more vivid: affective concepts mine the hidden sentiments in images and give them appropriate expression. What's more, there are still many improvements to be made. Beyond the quantitative evaluations, although emotional components are highlighted in the captions, they are not yet smooth enough. Human language contains all kinds of adverbs and phrases describing emotions, and there is always a more elegant expression for conveying sophisticated sentiments; the prediction of emotions should be more delicate to capture complicated human feelings. Besides, the selection of the position where affective concepts are embedded can be improved. One direction for extension is adaptive addition of affective concepts: an end-to-end framework consisting of the caption channel and the emotion classification channel would enable automatic generation of affective concepts. Moreover, the processing of affective concepts could be approached from various perspectives: emotion classification guides the selection of affective concepts, and further consideration of emotion distributions may lead to better coverage of sentiments and more precise descriptions.

References

  1. Borth, D., Ji, R., Chen, T., Breuel, T., Chang, S.F.: Large-scale visual sentiment ontology and detectors using adjective noun pairs. In: ACM MM (2013)

  2. Chen, T., Borth, D., Darrell, T., Chang, S.F.: Deepsentibank: visual sentiment concept classification with deep convolutional neural networks. In: CVPR (2014)

  3. Chen, T., Yu, F.X., Chen, J., Cui, Y., Chen, Y.Y., Chang, S.F.: Object-based visual sentiment concept analysis and application. In: ACM MM (2014)

  4. Chen, X., Fang, H., Lin, T.Y., Vedantam, R., Gupta, S., Dollar, P., Zitnick, C.L.: Microsoft COCO captions: data collection and evaluation server. In: CVPR (2015)

  5. Chen, X., Zitnick, C.L.: Learning a recurrent visual representation for image caption. In: CoRR (2014)

  6. Chen, X., Zitnick, C.L.: Mind’s eye: a recurrent visual representation for image caption generation. In: CVPR (2015)

  7. Chen, Y.Y., Chen, T., Hsu, W.H., Liao, H.Y.M., Chang, S.F.: Predicting viewer affective comments based on image content in social media. In: ICMR (2014)

  8. Deemter, K.V., Sluis, I.V.D., Gatt, A.: Building a semantically transparent corpus for the generation of referring expressions. In: INLG (2006)

  9. Donahue, J., Hendricks, L.A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Darrell, T., Saenko, K.: Long-term recurrent convolutional networks for visual recognition and description. In: CVPR (2015)

  10. Farhadi, A., Hejrati, M., Sadeghi, M.A., Young, P., Rashtchian, C., Hockenmaier, J., Forsyth, D.: Every picture tells a story: generating sentences from images. In: ECCV (2010)

  11. FitzGerald, N., Artzi, Y., Zettlemoyer, L.S.: Learning distributions over logical forms for referring expression generation. In: EMNLP (2013)

  12. Girshick, R.: Fast R-CNN. In: ICCV (2015)

  13. Golland, D., Liang, P., Dan, K.: A game-theoretic approach to generating spatial descriptions. In: EMNLP (2010)

  14. Hendricks, L.A., Venugopalan, S., Rohrbach, M., Mooney, R., Saenko, K., Darrell, T.: Deep compositional captioning: describing novel object categories without paired training data. In: CVPR (2016)

  15. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

  16. Hu, R., Xu, H., Rohrbach, M., Feng, J., Saenko, K., Darrell, T.: Natural language object retrieval. In: CVPR (2016)

  17. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J.: Caffe: convolutional architecture for fast feature embedding. In: CVPR (2014)

  18. Johnson, J., Karpathy, A., Li, F.F.: Densecap: fully convolutional localization networks for dense captioning. In: CVPR (2016)

  19. Karpathy, A., Joulin, A., Li, F.F.: Deep fragment embeddings for bidirectional image sentence mapping. Adv. Neural Inf. Process. Syst. 3, 1889–1897 (2014)

  20. Karpathy, A., Li, F.F.: Deep visual-semantic alignments for generating image descriptions. In: CVPR (2015)

  21. Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: Referitgame: referring to objects in photographs of natural scenes. In: EMNLP (2014)

  22. Kuznetsova, P., Ordonez, V., Berg, A.C., Berg, T.L., Choi, Y.: Collective generation of natural image descriptions. In: ACL: Long Papers (2012)

  23. Lang, P., Bradley, M., Cuthbert, B.: International affective picture system (IAPS): technical manual and affective ratings. Technical report, University of Florida, Gainesville

  24. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

  25. Lynch, C., Aryafar, K., Attenberg, J.: Unifying visual-semantic embeddings with multimodal neural language models. In: TACL (2015)

  26. Machajdik, J., Hanbury, A.: Affective image classification using features inspired by psychology and art theory. In: ACM MM (2010)

  27. Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: CVPR (2016)

  28. Mao, J., Wei, X., Yang, Y., Wang, J.: Learning like a child: fast novel visual concept learning from sentence descriptions of images. In: ICCV (2015)

  29. Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., Yuille, A.: Deep captioning with multimodal recurrent neural networks (m-RNN). In: ICLR (2015)

  30. Mathews, A., Xie, L., He, X.: SentiCap: generating image descriptions with sentiments. In: AAAI (2016)

  31. Mikels, J.A., Fredrickson, B.L., Larkin, G.R., Lindberg, C.M., Maglio, S.J., Reuter-Lorenz, P.A.: Emotional category data on images from the international affective picture system. Behav. Res. Methods 37(4), 626–630 (2005)

  32. Mitchell, M., Deemter, K.V., Reiter, E.: Natural reference to objects in a visual domain. In: INLG (2010)

  33. Morina, N., Leibold, E., Ehring, T.: Vividness of general mental imagery is associated with the occurrence of intrusive memories. J. Behav. Ther. Exp. Psychiatry 44(2), 221–226 (2013)

  34. Papineni, K.: BLEU: a method for automatic evaluation of machine translation. Wirel. Netw. 4(4), 307–318 (2002)

  35. Rashtchian, C., Young, P., Hodosh, M., Hockenmaier, J.: Collecting image annotations using Amazon’s mechanical turk. In: NAACL HLT Workshop (2010)

  36. Vedantam, R., Zitnick, C.L., Parikh, D.: Cider: consensus-based image description evaluation. In: CoRR (2015)

  37. Viethen, J., Dale, R.: The use of spatial relations in referring expression generation. In: INLG (2010)

  38. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: CVPR (2015)

  39. Wang, W., He, Q.: A survey on emotional semantic image retrieval. In: ICIP (2008)

  40. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: neural image caption generation with visual attention. In: ICCV (2015)

  41. Yao, B.Z., Yang, X., Lin, L., Lee, M.W., Zhu, S.C.: I2T: image parsing to text description. In: Proceedings of the IEEE, vol. 98, no. 8, pp. 1485–1508 (2010)

  42. You, Q., Jin, H., Wang, Z., Fang, C., Luo, J.: Image captioning with semantic attention. In: CVPR (2016)

  43. You, Q., Luo, J., Jin, H., Yang, J.: Robust image sentiment analysis using progressively trained and domain transferred deep networks. In: AAAI (2015)

  44. You, Q., Luo, J., Jin, H., Yang, J.: Building a large scale dataset for image emotion recognition: the fine print and the benchmark. In: AAAI (2016)

  45. Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. In: ACL (2014)

  46. Zhao, S., Gao, Y., Jiang, X., Yao, H., Chua, T.S., Sun, X.: Exploring principles-of-art features for image emotion recognition. In: ACM MM (2014)

Author information

Authors and Affiliations

  1. Nankai University, Tianjin, China

    Yan Sun & Bo Ren

Corresponding author

Correspondence to Bo Ren.

Editor information

Editors and Affiliations

  1. Civil Aviation University of China, Tianjin, China

    Jinfeng Yang

  2. School of Computer Science and Technology, Tianjin University, Tianjin, China

    Qinghua Hu

  3. Nankai University, Tianjin, China

    Ming-Ming Cheng

  4. Institute of Automation, Chinese Academy of Sciences, Beijing, China

    Liang Wang

  5. Information Science and Technology, Nanjing University, Beijing, China

    Qingshan Liu

  6. Huazhong University of Science and Technology, Wuhan, Hubei, China

    Xiang Bai

  7. Xi’an Jiaotong University, Xi’an, China

    Deyu Meng


Copyright information

© 2017 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Sun, Y., Ren, B. (2017). Automatic Image Description Generation with Emotional Classifiers. In: Yang, J., et al. (eds.) Computer Vision. CCCV 2017. Communications in Computer and Information Science, vol 771. Springer, Singapore. https://doi.org/10.1007/978-981-10-7299-4_63



