Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Past work has shown gains from multimodal cues.
LC-PCFG outperforms previous multimodal methods on unsupervised constituency parsing.
LC-PCFG results in a 50% reduction in parameter count and speedups in training time.
Extralinguistic signals may not be needed for unsupervised grammar induction.

Recent work has shown that unsupervised grammar induction can be improved by pairing text data with extralinguistic inputs such as images, videos, audio or facial semantics
There is limited evidence for the hypothesized import of this grounding signal from other modalities into text
Large language models (LLMs) have revolutionized the field of natural language processing tasks
LLMs often have surprisingly more detailed understanding of object-oriented concepts and physical mechanics of the world
In this work, it is examined whether LLMs obviate the need for extralinguistic data in unsupervised constituency parsing
LC-PCFG, an LLM-based text-only model, outperforms state-of-the-art multi-modal systems on both image and video benchmarks
Adding visual signals to LC-PCFG does not further improve performance, suggesting that the benefits of multi-modal signals may be redundant with the benefits of using embeddings learned by LLMs

Unsupervised parsing is the task of inducing syntactic structure from text.
Many methods for unsupervised parsing rely on signals from text alone.
Recent work suggests that multi-modal signals may be needed for accurate grammar induction.
Prior work has argued that visual features can facilitate identification of syntactic constituents.
Subsequent studies showed that adding visual and auditory features to word embeddings can improve model performance.

Recent advances in large pretrained language models have improved performance on downstream tasks, including syntactic parsing
Prior work has focused on text-only baselines with weaker lexical representations
Current pretrained language models could obviate the need for multimodal grounding for unsupervised parsing
Goal of statistical grammar induction is to automatically induce syntactic structure over a text corpus
Compound Probabilistic Context-Free Grammars (C-PCFGs) used as a testbed
LC-PCFGs use LLM representations to boost C-PCFG performance without multimodal regularization
LLM features can help improve performance of C-PCFG models, making the addition of multimodal regularization losses redundant

LLMs help unsupervised parsing compared to image-regularized models
Images are static and often fail to reflect all constituents in sentences
Prior work has found greater gains leveraging multiple modalities found in video
F-1 scores used to compare parsing results
3 benchmark video datasets used for experiments
Results show that learning from large-scale video data enables stronger and more robust performance
Model trained with captions from HowTo100M dataset
Model trained with YouCook2 dataset outperformed by model trained with HowTo100M dataset
Model trained with HowTo100M dataset outperforms models trained with multiple modalities
Model trained with HowTo100M dataset requires less computation than other state-of-the-art methods

Investigated whether multimodal grounding is necessary for unsupervised constituency parsing
Compared performance of multi-modal models to LC-PCFG, a text-only model
LC-PCFG performs as well as previous multi-modal models
Challenging the notion that multi-modal signals are necessary for unsupervised grammar induction
Replacing paired-modality datasets with large text-only corpora
Significant parsing gains with large-language model (LLM) sentence embeddings
State-of-the-art performance achieved by the use of LLM features without requiring multimodal regularization