Improving Multi-Modal Representations Using Image Dispersion:
Why Less is Sometimes More

Douwe Kiela*, Felix Hill*, Anna Korhonen and Stephen Clark
University of Cambridge
Computer Laboratory

Models that learn semantic representations from both linguistic and perceptual input outperform text-only models in many contexts and better reflect human concept acquisition. However, experiments suggest that while the inclusion of perceptual input improves representations of certain concepts, it degrades the representations of others. We propose an unsupervised method to determine whether to include perceptual input for a concept, and show that it significantly improves the ability of multi-modal models to learn and represent word meanings. The method relies solely on image data, and can be applied to a variety of other NLP tasks.

1 Introduction

Multi-modal models that learn semantic concept representations from both linguistic and perceptual input were originally motivated by parallels with human concept acquisition, and evidence that many concepts are grounded in the perceptual system [3]. Such models extract information about the perceptible characteristics of words from data collected in property norming experiments [22, 24] or directly from ‘raw’ data sources such as images [11, 6]. This input is combined with information from linguistic corpora to produce enhanced representations of concept meaning. Multi-modal models outperform language-only models on a range of tasks, including modelling conceptual association and predicting compositionality [6, 24, 22].

Despite these results, the advantage of multi-modal over linguistic-only models has only been demonstrated on concrete concepts, such as chocolate or cheeseburger, as opposed to abstract concepts such as guilt or obesity. Indeed, experiments indicate that while the addition of perceptual input is generally beneficial for representations of concrete concepts [13, 7], it can in fact be detrimental to representations of abstract concepts [13]. Further, while the theoretical importance of the perceptual modalities to concrete representations is well known, evidence suggests this is not the case for more abstract concepts [21, 14]. Indeed, perhaps the most influential characterization of the abstract/concrete distinction, the Dual Coding Theory [21], posits that concrete representations are encoded in both the linguistic and perceptual modalities whereas abstract concepts are encoded only in the linguistic modality.

Existing multi-modal architectures generally extract and process all the information from their specified sources of perceptual input. Since perceptual data sources typically contain information about both abstract and concrete concepts, such information is included for both concept types. The potential effect of this design decision on performance is significant because the vast majority of meaning-bearing words in everyday language correspond to abstract concepts. For instance, 72% of word tokens in the British National Corpus [17] were rated by contributors to the University of South Florida dataset (USF) [20] as more abstract than the noun war, a concept that many would consider quite abstract.

In light of these considerations, we propose a novel algorithm for approximating conceptual concreteness. Multi-modal models in which perceptual input is filtered according to our algorithm learn higher-quality semantic representations than previous approaches, resulting in a significant performance improvement of up to 17% in capturing the semantic similarity of concepts. Further, our algorithm constitutes the first means of quantifying conceptual concreteness that does not rely on labor-intensive experimental studies or annotators. Finally, we demonstrate the application of this unsupervised concreteness metric to the semantic classification of adjective-noun pairs, an existing NLP task to which concreteness data has proved valuable previously.

2 Experimental Approach

Our experiments focus on multi-modal models that extract their perceptual input automatically from images. Image-based models more naturally mirror the process of human concept acquisition than those whose input derives from experimental datasets or expert annotation. They are also more scalable since high-quality tagged images are freely available in several web-scale image datasets.

We use Google Images as our image source, and extract the first n image results for each concept word. It has been shown that images from Google yield higher-quality representations than comparable sources such as Flickr [4]. Other potential sources, such as ImageNet [9] or the ESP Game Dataset [30], either do not contain images for abstract concepts or do not contain sufficient images for the concepts in our evaluation sets.

2.1 Image Dispersion-Based Filtering

Following the motivation outlined in Section 1, we aim to distinguish visual input corresponding to concrete concepts from visual input corresponding to abstract concepts. Our algorithm is motivated by the intuition that the diversity of images returned for a particular concept depends on its concreteness (see Figure 1). Specifically, we anticipate greater congruence or similarity among a set of images for, say, elephant than among images for happiness. By exploiting this connection, the method approximates the concreteness of concepts, and provides a basis to filter the corresponding perceptual information.

Figure 1: Example images for a concrete (elephant – little diversity, low dispersion) and an abstract concept (happiness – greater diversity, high dispersion).

Formally, we propose a measure, image dispersion $d$ of a concept word $w$, defined as the average pairwise cosine distance between all the image representations $\{w_1, \dots, w_n\}$ in the set of images for that concept:

$$d(w) = \frac{1}{2n(n-1)} \sum_{i<j\le n} \left(1 - \frac{w_i \cdot w_j}{|w_i|\,|w_j|}\right) \qquad (1)$$

We use an average pairwise distance-based metric because this emphasizes the total variation more than e.g. the mean distance from the centroid. In all experiments we set n=50.
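The dispersion measure can be illustrated with a minimal sketch in plain Python. The function names and the toy vectors are hypothetical and not taken from the paper. The sketch divides by the number of unordered pairs, n(n-1)/2, to obtain a literal average; this normalisation is an assumption, and it differs from the constant printed in Equation 1 only by a fixed factor, which leaves rankings and median-based thresholds unchanged.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def image_dispersion(image_vectors):
    """Mean cosine distance over all unordered pairs of image vectors.

    Assumption: normalise by the number of pairs, n(n-1)/2.
    """
    pairs = list(combinations(image_vectors, 2))
    if not pairs:
        raise ValueError("need at least two image vectors")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy 3-dimensional visual-word histograms (made-up values).
similar_images = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
diverse_images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(image_dispersion(similar_images))  # parallel vectors: low dispersion
print(image_dispersion(diverse_images))  # orthogonal vectors: high dispersion
```

In this toy case the parallel histograms play the role of a concrete concept such as elephant and the orthogonal ones the role of an abstract concept such as happiness.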

Figure 2: Computation of PHOW descriptors using dense SIFT for levels l=0 to l=2 and the corresponding histogram representations [5].

Generating Visual Representations

Visual vector representations for each image were obtained using the well-known bag of visual words (BoVW) approach [25]. BoVW obtains a vector representation for an image by mapping each of its local descriptors to a cluster histogram using a standard clustering algorithm such as k-means.

Previous NLP-related work uses SIFT [11, 6] or SURF [22] descriptors for identifying points of interest in an image, quantified by 128-dimensional local descriptors. We apply Pyramid Histogram Of visual Words (PHOW) descriptors, which are particularly well-suited for object categorization, a key component of image similarity and thus dispersion [5]. PHOW is roughly equivalent to running SIFT on a dense grid of locations at a fixed scale and orientation and at multiple scales (see Figure 2), but is both more efficient and more accurate than regular (dense) SIFT approaches [5]. We resize the images in our dataset to 100x100 pixels and compute PHOW descriptors using VLFeat [29].

The descriptors for the images were subsequently clustered using mini-batch k-means [23] with k=50 to obtain histograms of visual words, yielding 50-dimensional visual vectors for each of the images.
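The histogram step can be sketched as follows, assuming the cluster centroids (the visual-word codebook) have already been learned. The sketch does not compute PHOW descriptors or run mini-batch k-means; the function names, the 2-dimensional descriptors and the 3-word codebook are hypothetical toy stand-ins for the 128-dimensional descriptors and k=50 clusters described above.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_centroid(descriptor, centroids):
    """Index of the centroid closest to the descriptor."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(descriptor, centroids[k]))


def bovw_histogram(descriptors, centroids):
    """Count how many local descriptors fall into each visual-word cluster."""
    histogram = [0] * len(centroids)
    for descriptor in descriptors:
        histogram[nearest_centroid(descriptor, centroids)] += 1
    return histogram


# Toy example with made-up values: one image with four local descriptors.
centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
descriptors = [[0.5, 0.2], [9.0, 1.0], [8.5, -0.5], [1.0, 9.0]]

print(bovw_histogram(descriptors, centroids))  # [1, 2, 1]
```

The resulting histogram has one entry per cluster, so a codebook of 50 centroids would give the 50-dimensional image vectors mentioned in the text.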

Generating Linguistic Representations

We extract continuous vector representations (also of 50 dimensions) for concepts using the continuous log-linear skipgram model of Mikolov et al. (2013a) [18], trained on the 100M word British National Corpus [17]. This model learns high quality lexical semantic representations based on the distributional properties of words in text, and has been shown to outperform simple distributional models on applications such as semantic composition and analogical mapping [19].

2.2 Evaluation Gold-standards

We evaluate models by measuring the Spearman correlation of model output with two well-known gold-standards reflecting semantic proximity – a standard measure for evaluating the quality of representations (see e.g. Agirre et al. (2009) [1]).
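For readers unfamiliar with the evaluation measure, the following is a minimal sketch of Spearman correlation in plain Python: both score lists are converted to ranks (tied values share the mean of their positions) and the Pearson correlation of the ranks is returned. The function names and the toy scores are hypothetical; in practice a statistics library would normally be used instead.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up model similarities and gold-standard ratings for four pairs.
model_scores = [0.9, 0.1, 0.5, 0.7]
gold_scores = [8.0, 1.5, 4.0, 9.0]

print(spearman(model_scores, gold_scores))
```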

To test the ability of our model to capture concept similarity, we measure correlations with WordSim353 [12], a selection of 353 concept pairs together with a similarity rating provided by human annotators. WordSim has been used as a benchmark for distributional semantic models in numerous studies (see e.g. [15, 6]).

As a complementary gold-standard, we use the University of South Florida Norms (USF) [20]. This dataset contains scores for free association, an experimental measure of cognitive association, between over 40,000 concept pairs. The USF norms have been used in many previous studies to evaluate semantic representations [2, 11, 24, 22]. The USF evaluation set is particularly appropriate in the present context because concepts in the dataset are also rated for conceptual concreteness by at least 10 human annotators.

We create a representative evaluation set of USF pairs as follows. We randomly sample 100 concepts from the upper quartile and 100 concepts from the lower quartile of a list of all USF concepts ranked by concreteness. We denote these sets $C$, for concrete, and $A$ for abstract respectively. We then extract all pairs $(w_1, w_2)$ in the USF dataset such that both $w_1$ and $w_2$ are in $A \cup C$. This yields an evaluation set of 903 pairs, of which 304 are such that $w_1, w_2 \in C$ and 317 are such that $w_1, w_2 \in A$.
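The sampling procedure can be sketched as below, assuming the concreteness ratings are available as a dictionary and the association pairs as a list of word tuples. The function name, the data layout, the seed and the toy values are all hypothetical; the sketch uses a sample size of 2 rather than 100 so that the result is easy to check by hand.

```python
import random


def build_evaluation_pairs(concreteness, association_pairs, sample_size, seed=0):
    """Sample abstract and concrete concepts by quartile, then keep the
    association pairs whose two words both fall inside the sample."""
    ranked = sorted(concreteness, key=concreteness.get)  # most abstract first
    quartile = len(ranked) // 4
    rng = random.Random(seed)
    abstract = set(rng.sample(ranked[:quartile], sample_size))
    concrete = set(rng.sample(ranked[-quartile:], sample_size))
    selected = abstract | concrete
    pairs = [(w1, w2) for w1, w2 in association_pairs
             if w1 in selected and w2 in selected]
    return abstract, concrete, pairs


# Toy concreteness ratings and association pairs (made-up values).
concreteness = {"memory": 1.8, "ego": 1.9, "hope": 2.5, "idea": 3.0,
                "road": 5.0, "tree": 5.5, "bed": 5.9, "car": 6.4}
association_pairs = [("ego", "memory"), ("car", "road"), ("bed", "car"),
                     ("memory", "car"), ("hope", "ego")]

abstract, concrete, pairs = build_evaluation_pairs(
    concreteness, association_pairs, sample_size=2)
print(sorted(abstract), sorted(concrete), pairs)
```

With eight toy concepts each quartile holds exactly two words, so the sample is the whole quartile and the output does not depend on the seed.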

The images used in our experiments and the evaluation gold-standards can be downloaded from

3 Improving Multi-Modal Representations

We apply image dispersion-based filtering as follows: if both concepts in an evaluation pair have an image dispersion below a given threshold, both the linguistic and the visual representations are included. If not, in accordance with the Dual Coding Theory of human concept processing [21], only the linguistic representation is used. For both datasets, we set the threshold as the median image dispersion, although performance could in principle be improved by adjusting this parameter. We compare dispersion filtered representations with linguistic, perceptual and standard multi-modal representations (concatenated linguistic and perceptual representations). Similarity between concept pairs is calculated using cosine similarity.
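The filtering rule can be sketched as follows. The function names, the dictionary layout and the toy vectors are hypothetical. Combining modalities by plain list concatenation without any per-modality normalisation is an assumption of the sketch, and the threshold is passed in explicitly rather than computed as the median.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_similarity(w1, w2, linguistic, visual, dispersion, threshold):
    """Use concatenated linguistic+visual vectors only when BOTH concepts
    have image dispersion below the threshold; otherwise linguistic only."""
    if dispersion[w1] < threshold and dispersion[w2] < threshold:
        v1 = linguistic[w1] + visual[w1]  # list concatenation
        v2 = linguistic[w2] + visual[w2]
    else:
        v1, v2 = linguistic[w1], linguistic[w2]
    return cosine_similarity(v1, v2)


# Toy 2-dimensional vectors and dispersion scores (made-up values).
linguistic = {"car": [1.0, 0.0], "bed": [0.0, 1.0], "ego": [1.0, 1.0]}
visual = {"car": [1.0, 0.0], "bed": [1.0, 0.0], "ego": [0.0, 1.0]}
dispersion = {"car": 0.58, "bed": 0.495, "ego": 1.0}
threshold = 0.6

# Both concepts below the threshold: visual information is included.
print(pair_similarity("car", "bed", linguistic, visual, dispersion, threshold))
# One concept above the threshold: linguistic vectors only.
print(pair_similarity("car", "ego", linguistic, visual, dispersion, threshold))
```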

Figure 3: Performance of conventional multi-modal (visual input included for all concepts) vs. image dispersion-based filtering models (visual input only for concepts classified as concrete) on the two evaluation gold-standards.

As Figure 3 shows, dispersion-filtered multi-modal representations significantly outperform standard multi-modal representations on both evaluation datasets. We observe a 17% increase in Spearman correlation on WordSim353 and a 22% increase on the USF norms. Based on the correlation comparison method of Steiger (1980) [26], both represent significant improvements (WordSim353, t=2.42, p<0.05; USF, t=1.86, p<0.1). In both cases, models with the dispersion-based filter also outperform the purely linguistic model, which is not the case for other multi-modal approaches that evaluate on WordSim353 (e.g. Bruni et al. (2012) [6]).

4 Concreteness and Image Dispersion

The filtering approach described thus far improves multi-modal representations because image dispersion provides a means to distinguish concrete concepts from more abstract concepts. Since research has demonstrated the applicability of concreteness to a range of other NLP tasks [28, 16], it is important to examine the connection between image dispersion and concreteness in more detail.

4.1 Quantifying Concreteness

To evaluate the effectiveness of image dispersion as a proxy for concreteness we evaluated our algorithm on a binary classification task based on the set of 100 concrete and 100 abstract concepts $A \cup C$ introduced in Section 2. By classifying concepts with image dispersion below the median as concrete and concepts above this threshold as abstract we achieved an abstract-concrete prediction accuracy of 81%.
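The median-threshold classifier can be sketched in a few lines. The function names and gold labels are hypothetical, and the dispersion values are a small illustrative subset. The text does not say how a concept lying exactly on the median is handled; labelling it abstract is an assumption of the sketch.

```python
from statistics import median


def classify_by_dispersion(dispersion):
    """Label concepts below the median dispersion as concrete, others as
    abstract (assumption: a concept exactly at the median counts as abstract)."""
    threshold = median(dispersion.values())
    return {word: ("concrete" if score < threshold else "abstract")
            for word, score in dispersion.items()}


def accuracy(predicted, gold):
    """Fraction of gold-labelled concepts whose predicted label matches."""
    correct = sum(1 for word in gold if predicted[word] == gold[word])
    return correct / len(gold)


# Illustrative dispersion scores with hypothetical gold labels.
dispersion = {"shirt": 0.488, "bed": 0.495, "knife": 0.560,
              "ego": 1.000, "nonsense": 0.999, "memory": 0.999}
gold = {"shirt": "concrete", "bed": "concrete", "knife": "concrete",
        "ego": "abstract", "nonsense": "abstract", "memory": "abstract"}

predicted = classify_by_dispersion(dispersion)
print(accuracy(predicted, gold))
```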

Figure 4: Visual input is valuable for representing concepts that are classified as concrete by the image dispersion algorithm, but not so for concepts classified as abstract. All correlations are with the USF gold-standard.

While well-understood intuitively, concreteness is not a formally defined notion. Quantities such as the USF concreteness score depend on the subjective judgement of raters and the particular annotation guidelines. According to the Dual Coding Theory, however, concrete concepts are precisely those with a salient perceptual representation. As illustrated in Figure 4, our binary classification conforms to this characterization. The importance of the visual modality is significantly greater when evaluating on pairs for which both concepts are classified as concrete than on pairs of two abstract concepts.

Image dispersion is also an effective predictor of concreteness on samples for which the abstract/concrete distinction is less clear. On a different set of 200 concepts extracted by random sampling from the USF dataset stratified by concreteness rating (including concepts across the concreteness spectrum), we observed a high correlation between abstractness and dispersion (Spearman ρ = 0.61, p < 0.001). On this more diverse sample, which reflects the range of concepts typically found in linguistic corpora, image dispersion is a particularly useful diagnostic for identifying the very abstract or very concrete concepts. As Table 1 illustrates, the concepts with the lowest dispersion in this sample are, without exception, highly concrete, and the concepts of highest dispersion are clearly very abstract.

It should be noted that all previous approaches to the automatic measurement of concreteness rely on annotator ratings, dictionaries or manually-constructed resources. Kwong (2008) proposes a method based on the presence of hard-coded phrasal features in dictionary entries corresponding to each concept. By contrast, Sánchez et al. (2011) present an approach based on the position of word senses corresponding to each concept in the WordNet ontology [10]. Turney et al. (2011) propose a method that extends a large set of concreteness ratings similar to those in the USF dataset. The Turney et al. algorithm quantifies the concreteness of concepts that lack such a rating based on their proximity to rated concepts in a semantic vector space. In contrast to each of these approaches, the image dispersion approach requires no hand-coded resources. It is therefore more scalable, and instantly applicable to a wide range of languages.

Concept     Image Dispersion    Conc. (USF)
shirt       .488                6.05
bed         .495                5.91
knife       .560                6.08
dress       .578                6.59
car         .580                6.35
ego         1.000               1.93
nonsense    .999                1.90
memory      .999                1.78
potential   .997                1.90
know        .996                2.70

Table 1: Concepts with highest and lowest image dispersion scores in our evaluation set, and concreteness ratings from the USF dataset.

4.2 Classifying Adjective-Noun Pairs

Finally, we explored whether image dispersion can be applied to specific NLP tasks as an effective proxy for concreteness. Turney et al. (2011) showed that concreteness is applicable to the classification of adjective-noun modification as either literal or non-literal. By applying a logistic regression with noun concreteness as the predictor variable, Turney et al. achieved a classification accuracy of 79% on this task. This model relies on significant supervision in the form of over 4,000 human lexical concreteness ratings. (The MRC Psycholinguistics concreteness ratings [8] used by Turney et al. (2011) are a subset of those included in the USF dataset.) Applying image dispersion in place of concreteness in an identical classifier on the same dataset, our entirely unsupervised approach achieves an accuracy of 63%. This is a notable improvement on the largest-class baseline of 55%.
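A single-predictor logistic regression of this kind can be sketched in plain Python as below. This is not the classifier used in the paper or by Turney et al.; the function names, the learning rate, the number of steps, the feature standardisation and the toy data are all assumptions made for illustration, and the model is fitted by simple batch gradient descent.

```python
import math


def standardise(values):
    """Shift to zero mean and scale to unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic(features, labels, learning_rate=0.5, steps=500):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(features)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(features, labels)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, features)) / n
        b -= learning_rate * sum(errors) / n
    return w, b


# Hypothetical toy data: dispersion of the noun in each adjective-noun pair,
# with labels 1 = literal and 0 = non-literal.
noun_dispersion = [0.49, 0.50, 0.56, 0.58, 0.996, 0.997, 0.999, 1.0]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

z = standardise(noun_dispersion)
w, b = fit_logistic(z, labels)
predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in z]
print(predictions == labels)
```

On this toy data the fitted weight is negative, reflecting the assumed pattern that nouns with higher image dispersion are more likely to appear in non-literal pairs.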

5 Conclusions

We presented a novel method, image dispersion-based filtering, that improves multi-modal representations by approximating conceptual concreteness from images and filtering model input. The results clearly show that including more perceptual input in multi-modal models is not always better. Motivated by this fact, our approach provides an intuitive and straightforward metric to determine whether or not to include such information.

In addition to improving multi-modal representations, we have shown the applicability of the image dispersion metric to several other tasks. To our knowledge, our algorithm constitutes the first unsupervised method for quantifying conceptual concreteness as applied to NLP, although it does, of course, rely on the Google Images retrieval algorithm. Moreover, we presented a method to classify adjective-noun pairs according to modification type that exploits the link between image dispersion and concreteness. It is striking that this apparently linguistic problem can be addressed solely using the raw data encoded in images.

In future work, we will investigate the precise quantity of perceptual information to be included for best performance, as well as the optimal filtering threshold. In addition, we will explore whether the application of image data, and the interaction between images and language, can yield improvements on other tasks in semantic processing and representation.


DK is supported by EPSRC grant EP/I037512/1. FH is supported by St John’s College, Cambridge. AK is supported by The Royal Society. SC is supported by ERC Starting Grant DisCoTex (306920) and EPSRC grant EP/I037512/1. We thank the anonymous reviewers for their helpful comments.


  • [1] E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Paşca and A. Soroa (2009) A study on similarity and relatedness using distributional and WordNet-based approaches. NAACL ’09, Boulder, Colorado, pp. 19–27. ISBN 978-1-932432-41-1.
  • [2] M. Andrews, G. Vigliocco and D. Vinson (2009) Integrating experiential and distributional data to learn semantic representations. Psychological Review 116 (3), pp. 463.
  • [3] L. W. Barsalou, W. Kyle Simmons, A. K. Barbey and C. D. Wilson (2003) Grounding conceptual knowledge in modality-specific systems. Trends in Cognitive Sciences 7 (2), pp. 84–91.
  • [4] S. Bergsma and R. Goebel (2011) Using visual information to predict lexical preference. pp. 399–405.
  • [5] A. Bosch, A. Zisserman and X. Munoz (2007) Image classification using random forests and ferns.
  • [6] E. Bruni, G. Boleda, M. Baroni and N. Tran (2012) Distributional semantics in technicolor. pp. 136–145.
  • [7] E. Bruni, N. K. Tran and M. Baroni (2014) Multimodal distributional semantics. Journal of Artificial Intelligence Research 49, pp. 1–47.
  • [8] M. Coltheart (1981) The MRC psycholinguistic database. The Quarterly Journal of Experimental Psychology 33 (4), pp. 497–505.
  • [9] J. Deng, W. Dong, R. Socher, L. Li, K. Li and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. pp. 248–255.
  • [10] C. Fellbaum (1999) WordNet. Wiley Online Library.
  • [11] Y. Feng and M. Lapata (2010) Visual information in semantic representation. pp. 91–99.
  • [12] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman and E. Ruppin (2001) Placing search in context: the concept revisited. pp. 406–414.
  • [13] F. Hill, D. Kiela and A. Korhonen (2013) Concreteness and corpora: a theoretical and practical analysis. CMCL 2013.
  • [14] F. Hill, A. Korhonen and C. Bentz (2013) A quantitative empirical analysis of the abstract/concrete distinction. Cognitive Science 38 (1).
  • [15] E. H. Huang, R. Socher, C. D. Manning and A. Y. Ng (2012) Improving word representations via global context and multiple word prototypes. pp. 873–882.
  • [16] O. Y. Kwong (2008) A preliminary study on the impact of lexical concreteness on word sense disambiguation. pp. 235–244.
  • [17] G. Leech, R. Garside and M. Bryant (1994) CLAWS4: the tagging of the British National Corpus. pp. 622–628.
  • [18] T. Mikolov, K. Chen, G. Corrado and J. Dean (2013) Efficient estimation of word representations in vector space. Scottsdale, Arizona, USA.
  • [19] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado and J. Dean (2013) Distributed representations of words and phrases and their compositionality. pp. 3111–3119.
  • [20] D. L. Nelson, C. L. McEvoy and T. A. Schreiber (2004) The University of South Florida free association, rhyme, and word fragment norms. Behavior Research Methods, Instruments, & Computers 36 (3), pp. 402–407.
  • [21] A. Paivio (1990) Mental representations: a dual coding approach. Oxford University Press.
  • [22] S. Roller and S. Schulte im Walde (2013) A multimodal LDA model integrating textual, cognitive and visual modalities. Seattle, Washington, USA, pp. 1146–1157.
  • [23] D. Sculley (2010) Web-scale k-means clustering. pp. 1177–1178.
  • [24] C. Silberer and M. Lapata (2012) Grounded models of semantic representation. pp. 1423–1433.
  • [25] J. Sivic and A. Zisserman (2003) Video Google: a text retrieval approach to object matching in videos. Vol. 2, pp. 1470–1477.
  • [26] J. H. Steiger (1980) Tests for comparing elements of a correlation matrix. 87 (2), pp. 245.
  • [27] D. Sánchez, M. Batet and D. Isern (2011) Ontology-based information content computation. Knowledge-Based Systems 24 (2), pp. 297–303.
  • [28] P. D. Turney, Y. Neuman, D. Assaf and Y. Cohen (2011) Literal and metaphorical sense identification through concrete and abstract context. pp. 680–690.
  • [29] A. Vedaldi and B. Fulkerson (2008) VLFeat: an open and portable library of computer vision algorithms.
  • [30] L. Von Ahn and L. Dabbish (2004) Labeling images with a computer game. pp. 319–326.