Abstract
- Past work has shown that multimodal cues such as images, video and audio can improve unsupervised grammar induction.
- LC-PCFG outperforms previous multimodal methods on unsupervised constituency parsing.
- LC-PCFG results in a 50% reduction in parameter count and speedups in training time.
- Extralinguistic signals may not be needed for unsupervised grammar induction.
Paper Content
Introduction
- Recent work has shown that unsupervised grammar induction can be improved by pairing text data with extralinguistic inputs such as images, videos, audio or facial semantics
- There is limited evidence for the hypothesized import of this grounding signal from other modalities into text
- Large language models (LLMs) have revolutionized natural language processing
- LLMs often have a surprisingly detailed understanding of object-oriented concepts and the physical mechanics of the world
- This work examines whether LLMs obviate the need for extralinguistic data in unsupervised constituency parsing
- LC-PCFG, an LLM-based text-only model, outperforms state-of-the-art multi-modal systems on both image and video benchmarks
- Adding visual signals to LC-PCFG does not further improve performance, suggesting that the benefits of multi-modal signals may be redundant with the benefits of using embeddings learned by LLMs
Unsupervised parsing
- Unsupervised parsing is the task of inducing syntactic structure from text.
- Many methods for unsupervised parsing rely on signals from text alone.
- Recent work suggests that multi-modal signals may be needed for accurate grammar induction.
- Prior work has argued that visual features can facilitate identification of syntactic constituents.
- Subsequent studies showed that adding visual and auditory features to word embeddings can improve model performance.
LLM representations for unsupervised parsing
- Recent advances in large pretrained language models have improved performance on downstream tasks, including syntactic parsing
- Prior work has focused on text-only baselines with weaker lexical representations
- Current pretrained language models could obviate the need for multimodal grounding for unsupervised parsing
- The goal of statistical grammar induction is to automatically induce syntactic structure over a text corpus
- Compound Probabilistic Context-Free Grammars (C-PCFGs) are used as a testbed
- LC-PCFGs use LLM representations to boost C-PCFG performance without multimodal regularization
- LLM features can help improve performance of C-PCFG models, making the addition of multimodal regularization losses redundant
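The summary does not spell out how a PCFG assigns a probability to a sentence, so the following is only a rough illustration. It sketches the standard inside algorithm for a toy grammar in Chomsky normal form with fixed rule probabilities. The grammar, the rule tables and the function name `inside` are assumptions made for this example; in a compound PCFG the rule probabilities would come from a learned model rather than a hand-written table, and that part is omitted here.

```python
from collections import defaultdict


def inside(words, lexical, binary, root="S"):
    """Return p(words) under a PCFG in Chomsky normal form.

    lexical: {(A, word): prob} for rules A -> word
    binary:  {(A, B, C): prob} for rules A -> B C
    """
    n = len(words)
    # chart[i][j][A] = probability that nonterminal A derives words[i:j]
    chart = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]

    # Width-1 spans are filled from the lexical rules.
    for i, w in enumerate(words):
        for (a, word), p in lexical.items():
            if word == w:
                chart[i][i + 1][a] += p

    # Wider spans sum over every split point k and every binary rule.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (a, b, c), p in binary.items():
                    chart[i][j][a] += p * chart[i][k][b] * chart[k][j][c]

    return chart[0][n][root]


# Hypothetical toy grammar, for illustration only.
binary = {("S", "NP", "VP"): 1.0, ("NP", "DT", "NN"): 1.0, ("VP", "VB", "NP"): 1.0}
lexical = {("DT", "the"): 1.0, ("NN", "dog"): 0.5, ("NN", "cat"): 0.5, ("VB", "chased"): 1.0}

prob = inside("the dog chased the cat".split(), lexical, binary)
```

Grammar induction trains the rule probabilities so that this sentence probability is high across the corpus. A parse is then read off by replacing the sum over split points with a max.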
Experiments
Image-assisted parsing
- LC-PCFG compared to VG-NSL and VC-PCFG
- Evaluated on MSCOCO 2014 dataset
- Preprocessing included lowercasing and replacing numbers with “N”
- Captions longer than 45 words were removed
- LC-PCFG used OPT-2.7B backbone to extract token-level embeddings
- LC-PCFG achieved highest overall corpus-level F1 and sentence-level F1
- LC-PCFG maintained comparable runtime and reduced model size by 85%
- LC-PCFG had a right-branching bias
- Adding LLM embeddings to VC-PCFG reduced performance
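The preprocessing steps listed above (lowercasing, replacing numbers with "N", and dropping captions longer than 45 words) could be sketched as follows. This is a minimal sketch, not the authors' pipeline: the whitespace tokenization, the regular expression that decides what counts as a number, and the function names are all assumptions.

```python
import re

MAX_LEN = 45  # captions longer than this many words are dropped

# Assumption: a "number" is a whitespace-delimited token made of digits,
# optionally with "," or "." separators. The source does not define this.
NUMBER = re.compile(r"^\d+(?:[.,]\d+)*$")


def preprocess(caption):
    """Lowercase a caption and replace number tokens with "N"."""
    tokens = caption.lower().split()
    return ["N" if NUMBER.match(tok) else tok for tok in tokens]


def preprocess_corpus(captions):
    """Preprocess every caption and drop those longer than MAX_LEN words."""
    processed = (preprocess(c) for c in captions)
    return [toks for toks in processed if len(toks) <= MAX_LEN]
```

For example, `preprocess("Two dogs and 3 cats")` gives `["two", "dogs", "and", "N", "cats"]`. Spelled-out numbers such as "two" are left alone under this assumption.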
Video-aided parsing
- LLMs help unsupervised parsing compared to image-regularized models
- Images are static and often fail to reflect all constituents in sentences
- Prior work has found greater gains leveraging multiple modalities found in video
- F1 scores were used to compare parsing results
- Three benchmark video datasets were used for experiments
- Results show that learning from large-scale video data enables stronger and more robust performance
- Model trained with captions from HowTo100M dataset
- Model trained with HowTo100M dataset outperforms model trained with YouCook2 dataset
- Model trained with HowTo100M dataset outperforms models trained with multiple modalities
- Model trained with HowTo100M dataset requires less computation than other state-of-the-art methods
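The corpus-level and sentence-level F1 scores used in these comparisons differ only in where the averaging happens. The sketch below assumes each parse is represented as a set of `(start, end)` constituent spans and that any filtering of trivial spans has already been done. The representation and function names are assumptions for illustration, not the paper's evaluation script.

```python
def f1(tp, n_pred, n_gold):
    """F1 from true-positive count and predicted/gold span counts."""
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


def corpus_f1(pred, gold):
    """Pool span counts over all sentences, then compute a single F1."""
    tp = sum(len(p & g) for p, g in zip(pred, gold))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    return f1(tp, n_pred, n_gold)


def sentence_f1(pred, gold):
    """Compute F1 for each sentence, then average over sentences."""
    scores = [f1(len(p & g), len(p), len(g)) for p, g in zip(pred, gold)]
    return sum(scores) / len(scores)


# Two hypothetical sentences: one half-correct parse, one fully correct parse.
pred = [{(0, 2), (2, 4)}, {(1, 3)}]
gold = [{(0, 2), (1, 4)}, {(1, 3)}]
```

On this toy input the sentence-level score is 0.75 and the corpus-level score is about 0.667, because the corpus-level variant weights longer sentences, which have more spans, more heavily.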
Conclusion
- Investigated whether multimodal grounding is necessary for unsupervised constituency parsing
- Compared performance of multi-modal models to LC-PCFG, a text-only model
- LC-PCFG matches or outperforms previous multi-modal models
- The results challenge the notion that multi-modal signals are necessary for unsupervised grammar induction
- The findings suggest replacing paired-modality datasets with large text-only corpora
- Large language model (LLM) sentence embeddings yield significant parsing gains
- State-of-the-art performance achieved by the use of LLM features without requiring multimodal regularization