Uncovering patterns of semantic predictability in sentence processing

A popular belief holds that true friends can always finish each other’s … sandwiches. Our remarkable ability to predict how others will continue what they are saying is a hallmark of human language comprehension. Indeed, when people are asked to guess the next word of a sentence or to fill in the blank in a cloze task (Taylor, 1953), they often produce words that are related to one another. Despite this systematicity, what we anticipate others will say is not perfectly predictable from the context, and it is not always obvious how closely related participants' responses are. A word-centric use of cloze probability data may obscure the degree to which upcoming material is semantically, rather than lexically, constrained. Thus, a major question in psycholinguistic research is the extent to which semantic information may be easier to predict than wordform information, and the degree to which semantic information facilitates production or creates competition. However, characterizing the semantic structure of cloze responses and organizing them into classes of related words is difficult to do by hand. Similarly, research in language production seeks to understand how the mechanisms that enable lexical selection are affected by the availability of alternatives. A long-standing debate in language production concerns whether different potential responses directly compete with each other for selection, or whether producers simply select the word that first reaches a threshold of activation (Levelt, 1989; Mahon et al., 2007; Oppenheim, 2024; Spalek et al., 2013; Staub et al., 2015).

A major challenge for modeling semantic processing in language comprehension and production is representing the semantics of words and their relationships to one another (Landauer & Dumais, 1997). In this work, we approximate a word's meaning using a distributional semantics approach applied to data collected from cloze tasks. We combine neural language model representations with Bayesian Gaussian mixture modeling to cluster cloze responses to a variety of written English sentences. The resulting clusters act as a proxy for the semantic factors that readers and producers may be sensitive to, and we apply probabilistic measures of semantic predictability derived from the clustering to assess the contribution of semantics to the speed of production. We focus on cloze data for two reasons. First, cloze probabilities have been central to both our early and current understanding of the relationship between predictability and comprehension measures, including reading times and neural responses to written words (Kutas & Hillyard, 1980; see de Varda et al., 2023 for a review). Second, cloze data are also revealing about language production (Staub et al., 2015). For example, cloze responses are systematically biased “away” from word co-occurrence statistics: they are composed of words that are more frequent, concrete, and semantically similar to the context than would be expected from n-gram statistics alone (Smith & Levy, 2011). We first review the cloze task as it has been applied in comprehension and production research and outline our motivation for clustering large language model-based representations of cloze responses to better understand semantic processing in production.
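To make the contrast between word-level and semantic-level predictability concrete, the following minimal Python sketch computes cloze probabilities from a set of responses and then aggregates them over semantic clusters. It is an illustration only, not the method used in this work: the responses are invented, the cluster labels in `cluster_of` are hand-assigned stand-ins for what a clustering model might produce, and summing cloze probability within a cluster (and comparing entropies) is an assumed, simplified measure of semantic predictability.

```python
from collections import Counter
from math import log2


def cloze_probabilities(responses):
    """Proportion of participants who produced each word."""
    counts = Counter(responses)
    total = len(responses)
    return {word: n / total for word, n in counts.items()}


def cluster_probabilities(word_probs, cluster_of):
    """Sum word-level cloze probabilities within each cluster."""
    totals = {}
    for word, p in word_probs.items():
        label = cluster_of[word]
        totals[label] = totals.get(label, 0.0) + p
    return totals


def entropy(probs):
    """Shannon entropy (in bits) of a probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)


# Invented responses from ten hypothetical participants to one sentence frame.
responses = ["sandwich"] * 4 + ["salad"] * 3 + ["burger"] * 2 + ["mess"]

# Hand-assigned cluster labels (assumption); a real analysis would derive
# these from clustering model-based representations of the responses.
cluster_of = {
    "sandwich": "food",
    "salad": "food",
    "burger": "food",
    "mess": "other",
}

word_probs = cloze_probabilities(responses)
cluster_probs = cluster_probabilities(word_probs, cluster_of)
word_entropy = entropy(word_probs.values())
cluster_entropy = entropy(cluster_probs.values())

print(word_probs)
print(cluster_probs)
print(word_entropy, cluster_entropy)
```

In this toy case no single word is highly predictable, yet most of the probability mass falls in one cluster, so uncertainty over clusters is lower than uncertainty over words. This is the kind of pattern that a purely word-centric reading of cloze data can miss.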
