Chapter 5 - Combining Collocation Measures and Distributional Semantics to Detect Idioms

Published online by Cambridge University Press

Mikko Laitinen, University of Eastern Finland
Paula Rautionaho, University of Eastern Finland

Summary

Distributional approaches following the Firthian principle have revolutionized linguistics. Firthian approaches in collocation research detect syntagmatic relations and are a key research area in corpus linguistics, whereas Firthian distributional semantics and its neural counterpart, word embeddings, detect paradigmatic relations and have fundamentally impacted computational linguistics. We combine these two closely related approaches. Our hypothesis, following Ricoeur’s view of a metaphor as a clash of two normally distinct semantic fields, is that idioms are collocations whose lexical participants typically have low semantic similarity in the word embedding space, i.e. low values for the cosine metric. We test whether the cosine metric, replaceability with synonyms, and linear combinations with collocation measures improve idiom detection for three constructions: verb-PP, light verbs, and compound nouns. We report improvements in idiom detection of 10 to 80 per cent, and almost half of compound noun non-compositionality is predicted by cosine alone. We also trace how compound nouns are changing in spoken and written English, mirroring digitalisation and the revolution of the internet.
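
The core idea of the summary, a low cosine between the participants of a strong collocation as a signal of idiomaticity, can be illustrated with a minimal sketch. It is not the chapter's implementation: the function names, the choice of pointwise mutual information as the collocation measure, the `weight` parameter, and the toy vectors and counts are all assumptions made for illustration. A realistic setup would also rescale the two components to comparable ranges before combining them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pmi(pair_count, w1_count, w2_count, total):
    """Pointwise mutual information (base 2), used here as one possible
    collocation measure; the choice of measure is an assumption."""
    return math.log2((pair_count * total) / (w1_count * w2_count))


def idiom_score(association, cos, weight=0.5):
    """Hypothetical linear combination: strong association and LOW cosine
    both raise the score. `weight` is an illustrative parameter."""
    return weight * association + (1.0 - weight) * (1.0 - cos)


# Toy, hand-made "embeddings" (hypothetical values, not from any real model).
vec_kick = [0.9, 0.1, 0.0]
vec_bucket = [0.0, 0.2, 0.9]
vec_ball = [0.8, 0.3, 0.1]

# Toy corpus counts (hypothetical).
assoc = pmi(pair_count=10, w1_count=100, w2_count=100, total=10000)

# Same association strength, different semantic similarity of the participants.
score_low_similarity = idiom_score(assoc, cosine(vec_kick, vec_bucket))
score_high_similarity = idiom_score(assoc, cosine(vec_kick, vec_ball))
print(score_low_similarity > score_high_similarity)
```

With equal association strength, the pair whose vectors point in different directions receives the higher score, which is the behaviour the hypothesis predicts for idiom candidates.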

Information

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2025

References

Altenberg, Bengt, and Tapper, Marie (1998). “The use of adverbial connectors in advanced Swedish learners’ written English,” in Granger, Sylviane (ed.), Learner English on Computer. London: Longman, pp. 80–93.
Aston, Guy, and Burnard, Lou (1998). The BNC Handbook: Exploring the British National Corpus with SARA. Edinburgh: Edinburgh University Press.
Baroni, Marco, Dinu, Georgiana, and Kruszewski, German (2014). “Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Baltimore: Association for Computational Linguistics, pp. 238–247.
Baroni, Marco, and Lenci, Alessandro (2010). “Distributional memory: A general framework for corpus-based semantics.” Computational Linguistics, 36(4), 673–721.
Bartsch, Sabine, and Evert, Stefan (2014). “Towards a Firthian notion of collocation,” in Abel, A. and Lemnitzer, L. (eds.), Vernetzungsstrategien, Zugriffsstrukturen und automatisch ermittelte Angaben in Internetwörterbüchern. In OPAL: Online publizierte Arbeiten zur Linguistik. Mannheim: Institut für Deutsche Sprache, pp. 48–61.
Bengio, Yoshua, Ducharme, Réjean, Vincent, Pascal, and Jauvin, Christian (2003). “A neural probabilistic language model.” Journal of Machine Learning Research, 3, 1137–1155.
Bengio, Yoshua, Schwenk, Holger, Senécal, Jean-Sébastien, Morin, Fréderic, and Gauvain, Jean-Luc (2006). “Neural probabilistic language models,” in Holmes, Dawn E., and Jain, Lakhmi C. (eds.), Innovations in Machine Learning. Berlin: Springer, pp. 137–186.
Choueka, Yaacov (1988). “Looking for needles in the haystack or locating interesting collocational expressions in large textual databases,” in RIAO ’88 – Proceedings of RIAO, pp. 609–623. https://dblp.org/rec/conf/riao/Choueka88.html.
Deerwester, Scott, Dumais, Susan T., Furnas, George W., Landauer, Thomas K., and Harshman, Richard (1990). “Indexing by latent semantic analysis.” Journal of the American Society for Information Science, 41(6), 391–407.
Erman, Britt, and Warren, Beatrice (2000). “The idiom principle and the open choice principle.” TEXT, 20(1), 29–62.
Evert, Stefan (2009). “Corpora and collocations.” Corpus Linguistics: An International Handbook, 58, 1212–1248.
Firth, John Rupert (1957). “A synopsis of linguistic theory 1930–1955.” Studies in Linguistic Analysis, 1–32.
Glynn, Dylan (2010). “Corpus-driven cognitive semantics. Introduction to the field,” in Glynn, Dylan, and Fischer, Kerstin (eds.), Quantitative Methods in Cognitive Semantics: Corpus-Driven Approaches. Cognitive Linguistics Research 46. Berlin: Mouton de Gruyter, pp. 1–42.
Goldberg, Adele E. (1995). Constructions: A Construction Grammar Approach to Argument Structure. Chicago: University of Chicago Press.
Goldberg, Adele E. (2006). Constructions at Work: The Nature of Generalization in Language. Oxford: Oxford University Press.
Gries, Stefan (2013). “50-something years of work on collocations.” International Journal of Corpus Linguistics, 18(1), 137–165.
Günther, Fritz, Dudschig, Carolin, and Kaup, Barbara (2014). “LSAfun – An R package for computations based on Latent Semantic Analysis.” Behavior Research Methods, 47, 930–944.
Hilpert, Martin (2005). “Keeping an eye on the data: Metonymies and their patterns,” in Stefanowitsch, Anatol, and Gries, Stefan Thomas (eds.), Corpus-Based Approaches to Metaphor and Metonymy. Berlin: Mouton de Gruyter, pp. 123–152.
Hilpert, Martin (2014). Construction Grammar and Its Application to English. Edinburgh Textbooks on the English Language. Edinburgh: Edinburgh University Press.
Hilpert, Martin (2020). “Constructional approaches,” in Aarts, Bas, Bowie, Jill, and Popova, Gergana (eds.), The Oxford Handbook of English Grammar. Oxford: Oxford University Press, pp. 106–123.
Jurafsky, Dan, and Martin, James H. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall.
Karlgren, Jussi, and Sahlgren, Magnus (2001). “From words to understanding,” in Uesaka, Yoshinori, Kanerva, Pentti, and Asoh, Hideki (eds.), Foundations of Real-World Intelligence. Stanford: Center for the Study of Language and Information, pp. 294–308.
Katz, Jerrold, and Fodor, Jerry A. (1963). “The structure of a semantic theory.” Language, 39, 170–210.
Krennmayr, Tina, and Steen, Gerard J. (2017). “VU Amsterdam Metaphor Corpus,” in Ide, Nancy, and Pustejovsky, James (eds.), Handbook of Linguistic Annotation. Berlin: Springer, pp. 1053–1071.
Landauer, Thomas K., and Dumais, Susan T. (1997). “A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge.” Psychological Review, 104, 211–240.
Leech, Geoffrey, Hundt, Marianne, Mair, Christian, and Smith, Nicholas (2009). Change in Contemporary English. A Grammatical Study. Cambridge: Cambridge University Press.
Lehmann, Hans Martin, and Schneider, Gerold (2011). “A large-scale investigation of verb-attached prepositional phrases,” in Rayson, Paul, Hoffmann, Sebastian, and Leech, Geoffrey (eds.), Methodological and Historical Dimensions of Corpus Linguistics. Helsinki: VARIENG e-series. https://varieng.helsinki.fi/series/volumes/06/lehmann_schneider/.
Ljubešić, Nikola, Logar, Nataša, and Kosem, Iztok (2021). “Collocation ranking: Frequency vs. semantics.” Slovenščina 2.0: Empirical, Applied and Interdisciplinary Research, 9(2), 41–70.
Love, Robbie, Dembry, Claire, Hardie, Andrew, Brezina, Vaclav, and McEnery, Tony (2017). “The Spoken BNC2014: Designing and building a spoken corpus of everyday conversations.” International Journal of Corpus Linguistics, 22(3), 319–344.
Maldonado-Guerra, Alfredo, and Emms, Martin (2011). “Measuring the compositionality of collocations via word co-occurrence vectors: Shared task system description,” in Proceedings of the Workshop on Distributional Semantics and Compositionality (DiSCo ’11). Stroudsburg, PA: Association for Computational Linguistics, pp. 48–53.
Master, Peter (2003). “Noun compounds and compressed definitions.” English Teaching Forum, 41(3), 2–25. http://americanenglish.state.gov/files/ae/resource_files/03-41-3-b.pdf.
Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey (2013). “Efficient estimation of word representations in vector space,” paper presented at the First International Conference on Learning Representations (ICLR), Scottsdale, AZ, https://bibbase.org/network/publication/mikolov-chen-corrado-dean-efficientestimationofwordrepresentationsinvectorspace-2013.
Pecina, Pavel (2009). Lexical Association Measures: Collocation Extraction. Prague: Institute of Formal and Applied Linguistics, Charles University.
Rayson, Paul, Leech, Geoffrey, and Hodges, Mary (1997). “Social differentiation in the use of English vocabulary: Some analyses of the conversational part of the British National Corpus.” International Journal of Corpus Linguistics, 2(1), 133–152.
Reddy, Siva, McCarthy, Diana, and Manandhar, Suresh (2011). “An empirical study on compositionality in compound nouns,” in Proceedings of the Fifth International Joint Conference on Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, pp. 210–218. https://aclanthology.org/I11-1024.pdf.
Ricœur, Paul (1977). The Rule of Metaphor. Translated by Czerny, Robert, McLaughlin, Kathleen, and Costello, John. New York: Routledge.
Ricœur, Paul (1978). “The metaphorical process as cognition, imagination, and feeling.” Critical Inquiry, 5(1), 143–159. www.humanities.uci.edu/poeticshistorytheory/user_files/Ricoeur.pdf.
Ronan, Patricia, and Schneider, Gerold (2015). “Determining light verb constructions in contemporary British and Irish English.” International Journal of Corpus Linguistics, 20(3), 326–354.
Sahlgren, Magnus (2006). “Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces,” PhD thesis, University of Stockholm.
Salehi, Bahar, Cook, Paul, and Baldwin, Timothy (2015). “A word embedding approach to predicting the compositionality of multiword expressions,” in Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, pp. 977–983.
Salton, Gerald (1971). The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, NJ: Prentice Hall.
Schäfer, Roland, and Bildhauer, Felix (2013). Web Corpus Construction. San Francisco: Morgan and Claypool.
Schneider, Gerold (2022). “Recent changes in spoken British English according to spoken BNC2014,” in Flach, Susanne, and Hilpert, Martin (eds.), Broadening the Spectrum of Corpus Linguistics: New Approaches to Variability and Change. Studies in Corpus Linguistics. Amsterdam: John Benjamins, pp. 173–195.
Schneider, Gerold, El-Assady, Menna, and Lehmann, Hans Martin (2017). “Tools and methods for processing and visualizing large corpora,” in Hiltunen, Turo, McVeigh, Joe, and Säily, Tanja (eds.), Big and Rich Data in English Corpus Linguistics: Methods and Explorations – Studies in Variation, Contacts and Change in English, Volume 19. Helsinki, Finland: VARIENG. https://varieng.helsinki.fi/series/volumes/19/schneider_el-assady_lehmann/.
Schneider, Nathan, Hovy, Dirk, Johannsen, Anders, and Carpuat, Marine (2016). “SemEval-2016 Task 10: Detecting Minimal Semantic Units and their Meanings (DiMSUM),” in Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). San Diego: Association for Computational Linguistics, pp. 546–559.
Schütze, Hinrich (1998). “Automatic word sense discrimination.” Computational Linguistics, 24(1), 97–124.
Senaldi, Marco S. G., Lebani, Gianluca E., and Lenci, Alessandro (2016). “Lexical variability and compositionality: Investigating idiomaticity with distributional semantic models,” in Proceedings of the 12th Workshop on Multiword Expressions (MWE 2016). Berlin: Association for Computational Linguistics, pp. 21–31.
Sparck Jones, Karen (1986). Synonymy and Semantic Classification. Edinburgh: Edinburgh University Press. (Republication of a 1964 PhD thesis.)
Sullivan, Karen (2013). Frames and Constructions in Metaphoric Language. Amsterdam: Benjamins.
Steen, Gerard J. (2002). “Identifying metaphor in language: A cognitive approach.” Style, 36(3), 386–406.
Steen, Gerard J. (2014). “The cognitive-linguistic revolution in metaphor studies,” in Littlemore, Jeanette, and Taylor, John R. (eds.), The Bloomsbury Companion to Cognitive Linguistics. London: Bloomsbury, pp. 117–142.
Tognini-Bonelli, Elena (2001). Corpus Linguistics at Work. Amsterdam: Benjamins.
Weichert, Katarzyna (2019). “The role of image and imagination in Paul Ricoeur’s metaphor theory.” Eidos. A Journal for Philosophy of Culture, 3(1), 64–77.
Wilks, Yorick (1975). “A preferential, pattern-seeking, semantics for natural language inference.” Artificial Intelligence, 6, 53–74.
Wilks, Yorick (1978). “Making preferences more active.” Artificial Intelligence, 11, 197–223.
Wulff, Stefanie (2008). Rethinking Idiomaticity. London: Continuum.