Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Many tasks can be broken down into subroutines.
- Neural networks can achieve impressive performance on vision and language tasks, but it is not known how they do this.
- One possibility is that neural networks break down tasks into subroutines and compose them into an overall solution.
- Model pruning techniques are used to investigate this question in vision and language tasks.
- Results suggest that neural networks can learn to exhibit compositionality.
Paper Content
Introduction
- Neural networks have come to dominate AI, but much is unknown about the functions they learn
- Debate over role of compositionality
- Compositionality is a key property of human cognition
- Representation system is compositional if it implements discrete constituent functions
- Open question whether neural networks require explicit symbolic mechanisms to implement compositional solutions
- Historically, neural networks have been considered non-compositional
- Modern neural networks have demonstrated successes on complex tasks
- Introduce concept of structural compositionality
- Introduce technique to test for structural compositionality
- Test on models and tasks with odd-one-out tasks
- Non-compositional solution stores prototype that encodes conjunction of two subroutines
- Compositional solution computes subroutines in modular subnetworks
Structural compositionality
- Prior work on compositionality in neural networks has mostly yielded negative results.
- Definitions of compositionality are based on a system’s representations, not its behavior.
- This paper focuses on evaluating the extent to which a model’s representations are structured compositionally.
- There are two ways a network might learn to solve a compositional task.
- If a model exhibits structural compositionality, it should be possible to find a subnetwork that implements one subroutine and not the other.
Experimental design
Preliminaries
- Subroutine: A binary rule
- Compositional Rule: A binary rule that maps input to output using subroutines
- Base Model: A model trained to solve a task defined by a compositional rule
- Subnetwork: A subset of the parameters of a base model, implementing one subroutine
- Ablated Model: A model after ablating a subnetwork
Experimental logic
- Consider a compositional rule, C, with two subroutines
- Train a base model, M C , to solve an odd-one-out task
- Characterize the extent to which M C exhibits structural compositionality
- Learn a binary mask over the weights of M C for each subroutine
- Evaluate the subnetwork on two partitions of the training set
- Ablate the subnetwork from the base model and observe the behavior
- Determine modularity by comparing performance on the two partitions
- Expect a positive difference in performance if model exhibits structural compositionality
Discovering subnetworks
- Use continuous sparsification to discover subnetworks within models
- Optimize a deterministic binary mask over network weights to produce a subnetwork that solves a task
- Employ l0 regularization to encourage sparsity in the binary mask
Vision experiments
- Extended collection of datasets introduced in Zerroug et al. (2022)
- Three basic subroutines: contact, inside, and number
- Three compositional rules: Inside-Contact, Number-Contact, and Inside-Number
- Four types of images, each containing two shapes
- Odd-one-out task in the vision domain defined by a compositional rule
- Base model trained to solve odd-one-out task
- Subnetwork discovered to implement each subroutine
- Base model trained with cross-entropy loss
- Mask training with l0 regularization
- Three backbone architectures: Resnet504, Wide Resnet50, and ViT
- Hyperparameter search over batch size and learning rate
- Continuous sparsification parameters for each subroutine
- Evaluate on Test Target Subroutine and Test Other Subroutine
- M ablatei = M C − Sub i evaluated on Test Target Subroutine and Test Other Subroutine
Language experiments
- We use a subset of data from Marvin & Linzen (2019) to construct odd-one-out tasks for language data.
- We construct rules based on two forms of syntactic agreement: Subject-Verb Agreement and Reflexive Anaphora agreement.
- We partition the Subject-Verb Agreement dataset into two subsets, one that targets singular sentences and one that targets plural sentences.
- We study one architecture, BERT-Small, which is a BERT architecture with 4 hidden layers.
Results
- Structural compositionality is favored for some architecture/task combinations
- Subnetwork performance is higher on the subroutine it was trained to implement
- Ablated model performance is lower on the subroutine it was trained to implement and higher on the other subroutine
Effect of pretraining on structural compositionality
- Finetuning pretrained models improves performance
- Pretrained weights are used for Resnet50 and BERT-Small
- Experiments show that pretraining produces modular subnetworks
- Pretraining makes subnetwork discovery algorithm more stable
Related work
- Most prior work has focused on compositional generalization of standard neural models
- Some prior work has attempted to induce an inductive bias toward compositional generalization from data
- Prior efforts seek to attribute causality to specific components of neural networks’ internal representations
- Present study does not require assumptions about where in the network the subroutine is implemented
- Present study stands adjacent to network pruning research
- Mechanistic interpretability aims to reverse engineer neural networks
- Cammarata et al. (2020) manually inspects individual neurons in InceptionV1
- Present study introduces a method that automatically discovers subnetworks
- Subroutines consist of higher-level features than those in previous work
Discussion
- Neural networks decompose tasks into subtasks and implement solutions in modular subnetworks
- Self-supervised pretraining leads to more consistent structural compositionality in models finetuned on compositional tasks
B. continuous sparsification: extended discussion
- Continuous sparsification attempts to optimize a binary mask that minimizes a loss function.
- The loss function includes a standard loss function and an l0 penalty.
- Optimizing the binary mask is intractable, so a continuous sparsification is used instead.
- During training, a soft mask (σ) and a discrete mask (H) are interpolated.
C. mask hyperparameter search details
- Searched over learning rates, mask parameter initializations, and mask configurations
- Mask configurations based on stages for Resnet models and layers for transformer models
- Best hyperparameter configuration determined based on accuracy and structural compositionality
- Computationally expensive process due to training many separate masks
D. vit hyperparameter search results
- Conducted hyperparameter search on ViT models
- Used batch sizes and learning rates on 6 and 12 layer ViT models
- MLP had a hidden layer of dimensionality 2048 and output dimensionality of 128
- All models failed to solve any of the tasks
- +/-Number subroutine used for training/test examples
- All datasets had training set size of 10000, validation set size of 500, and test set size of 1000
F. language data details
- Generated language data using templates provided by Marvin & Linzen (2019)
- Omitted templates that position noun of interest inside sentential complement or object relative clause
- Nouns of interest always second word of sentence
- Split up datasets into singular and plural partitions
- Dataset statistics identical for singular and plural instances
- Goal is to find subnetworks that implement subroutines
- Avoid potential complication of pretrained model needing to unlearn grammaticality computation
- Experiment to control for randomly-initialized models
- Results indicate that subnetworks are not causally implicated in model behavior
- Ablating subnetworks collapses performance to chance for all tasks
- Ablating subnetworks oftentimes yields high performance on Test Target Subroutine and low performance on Test Other Subroutine
- Results reflect internal mechanisms of trained models, not epiphenomenal artifacts of training binary masks over networks