Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Language models perform poorly on quantification
- "Few"-type quantifiers pose a particular challenge for language models
- 960 sentences were presented to 22 autoregressive transformer models of differing sizes
- Performance on few-type quantifiers decreased as model size increased, suggesting that larger models reflect online rather than offline human processing

Paper Content

Introduction
- Quantifiers can change the meaning of an utterance
- Sentences with the same content words can have opposite meanings
- Language models struggle to predict which quantifier is used in a given context
- Language models perform poorly at generating appropriate continuations following logical quantifiers
- Large language models are being used as general systems for multiple tasks
- It is important that language models can distinguish between sentences with different meanings
- This study evaluates how well language models take into account the meaning of a quantifier when generating text
- It investigates whether there is an inverse scaling relationship with model size
- Negation is known to be challenging for language models
- This study focuses on quantifiers indicating typicality, such as "most" and "few"
- It uses stimuli from a previously published N400 study
- It tests whether language models show the same pattern of insensitivity towards the quantifiers

Language models
- Analyzed GPT-2, GPT-3, GPT-Neo, OPT, and InstructGPT language models
- Compared models with different training data and numbers of parameters

Evaluation
- Calculated the surprisal of the critical word in each stimulus sentence, given its preceding context
- Converted the probability of the target word to surprisal using Equation 1
- Included both single-token and multi-token critical words
- Compared which of the two possible critical words had the lower surprisal
- Calculated accuracy as the fraction of stimulus pairs for which the model predicted the appropriate critical word
- Analyzed model sensitivity to the quantifiers
- All code and data will be published online on acceptance

Results
- Accuracy of models increases with size for most-type quantifiers, but decreases for few-type quantifiers
- Small exceptions to this pattern exist
- Sensitivity of models varies, but is generally low
- There is no clear pattern in sensitivity

Discussion

Inverse scaling with quantifiers
- As models increase in size, they tend to improve at predicting words following most-type quantifiers and get worse at predicting words following few-type quantifiers
- Larger models make predictions increasingly in accordance with typicality, overwhelming any sensitivity to quantifier type
- The sensitivity analysis shows that all models have poor and largely invariant sensitivity overall

Further implications
- Models are generally expected to perform better as they get larger and are trained on more data
- Evidence supports this idea
- Predictions of larger models and those trained on more data correlate with human incremental online predictions
- Humans find it easier to process well-formed sentences with plausible semantics
- Predictions of larger models can align less with explicit human judgements
- Language models may struggle to make predictions in line with human offline judgements
- Tailoring training may be necessary to avoid specific known issues
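
The evaluation described above (surprisal of the critical word, then pairwise accuracy) can be illustrated with a minimal sketch. This is not the authors' code. It assumes surprisal is the negative log probability of a word given its context, measured here in bits (the log base is an assumption and only rescales the values), and that a multi-token word's surprisal is the sum of its token surprisals, which follows from the chain rule. The function names, the data layout, and the toy probabilities are all hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def word_surprisal(token_probs):
    """Surprisal (in bits) of a word given its preceding context.

    token_probs holds the conditional probability of each token of the word.
    The word probability is the product of these, so the word surprisal is
    the sum of the per-token surprisals. Using log base 2 is an assumption.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    return sum(-math.log2(p) for p in token_probs)


def accuracy_by_quantifier(pairs):
    """Per quantifier type, the fraction of stimulus pairs in which the
    appropriate critical word has strictly lower surprisal than the
    inappropriate one.

    Each pair is (quantifier_type, surprisal_appropriate, surprisal_inappropriate).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for quantifier_type, s_appropriate, s_inappropriate in pairs:
        total[quantifier_type] += 1
        if s_appropriate < s_inappropriate:
            correct[quantifier_type] += 1
    return {q: correct[q] / total[q] for q in total}


# Toy, made-up token probabilities for illustration only (NOT results from the paper).
toy_pairs = [
    ("most", word_surprisal([0.5]), word_surprisal([0.125])),
    ("most", word_surprisal([0.25, 0.5]), word_surprisal([0.0625])),
    ("few", word_surprisal([0.125]), word_surprisal([0.5])),
    ("few", word_surprisal([0.25]), word_surprisal([0.5, 0.25])),
]
toy_accuracy = accuracy_by_quantifier(toy_pairs)
print(toy_accuracy)
```

In practice the token probabilities would come from a language model's output distribution at each position. Reporting accuracy separately per quantifier type is what makes it possible to see one type improve with model size while the other gets worse.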