Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Many tasks can be broken down into subroutines.
Neural networks can achieve impressive performance on vision and language tasks, but it is not known how they do this.
One possibility is that neural networks break down tasks into subroutines and compose them into an overall solution.
Model pruning techniques are used to investigate this question in vision and language tasks.
Results suggest that neural networks can learn to exhibit compositionality.

Paper Content

Introduction

Neural networks have come to dominate AI, but much is unknown about the functions they learn
Debate over role of compositionality
Compositionality is a key property of human cognition
Representation system is compositional if it implements discrete constituent functions
Open question whether neural networks require explicit symbolic mechanisms to implement compositional solutions
Historically, neural networks have been considered non-compositional
Modern neural networks have demonstrated successes on complex tasks
Introduce concept of structural compositionality
Introduce technique to test for structural compositionality
Test on models and tasks with odd-one-out tasks
Non-compositional solution stores prototype that encodes conjunction of two subroutines
Compositional solution computes subroutines in modular subnetworks

Structural compositionality

Prior work on compositionality in neural networks has mostly yielded negative results.
Definitions of compositionality are based on a system’s representations, not its behavior.
This paper focuses on evaluating the extent to which a model’s representations are structured compositionally.
There are two ways a network might learn to solve a compositional task.
If a model exhibits structural compositionality, it should be possible to find a subnetwork that implements one subroutine and not the other.

Experimental design

Preliminaries

Subroutine: A binary rule
Compositional Rule: A binary rule that maps input to output using subroutines
Base Model: A model trained to solve a task defined by a compositional rule
Subnetwork: A subset of the parameters of a base model, implementing one subroutine
Ablated Model: A model after ablating a subnetwork

Experimental logic

Consider a compositional rule, C, with two subroutines
Train a base model, M C , to solve an odd-one-out task
Characterize the extent to which M C exhibits structural compositionality
Learn a binary mask over the weights of M C for each subroutine
Evaluate the subnetwork on two partitions of the training set
Ablate the subnetwork from the base model and observe the behavior
Determine modularity by comparing performance on the two partitions
Expect a positive difference in performance if model exhibits structural compositionality

Discovering subnetworks

Use continuous sparsification to discover subnetworks within models
Optimize a deterministic binary mask over network weights to produce a subnetwork that solves a task
Employ l0 regularization to encourage sparsity in the binary mask

Vision experiments

Extended collection of datasets introduced in Zerroug et al. (2022)
Three basic subroutines: contact, inside, and number
Three compositional rules: Inside-Contact, Number-Contact, and Inside-Number
Four types of images, each containing two shapes
Odd-one-out task in the vision domain defined by a compositional rule
Base model trained to solve odd-one-out task
Subnetwork discovered to implement each subroutine
Base model trained with cross-entropy loss
Mask training with l0 regularization
Three backbone architectures: Resnet504, Wide Resnet50, and ViT
Hyperparameter search over batch size and learning rate
Continuous sparsification parameters for each subroutine
Evaluate on Test Target Subroutine and Test Other Subroutine
M ablatei = M C − Sub i evaluated on Test Target Subroutine and Test Other Subroutine

Language experiments

We use a subset of data from Marvin & Linzen (2019) to construct odd-one-out tasks for language data.
We construct rules based on two forms of syntactic agreement: Subject-Verb Agreement and Reflexive Anaphora agreement.
We partition the Subject-Verb Agreement dataset into two subsets, one that targets singular sentences and one that targets plural sentences.
We study one architecture, BERT-Small, which is a BERT architecture with 4 hidden layers.

Results

Structural compositionality is favored for some architecture/task combinations
Subnetwork performance is higher on the subroutine it was trained to implement
Ablated model performance is lower on the subroutine it was trained to implement and higher on the other subroutine

Effect of pretraining on structural compositionality

Finetuning pretrained models improves performance
Pretrained weights are used for Resnet50 and BERT-Small
Experiments show that pretraining produces modular subnetworks
Pretraining makes subnetwork discovery algorithm more stable

Most prior work has focused on compositional generalization of standard neural models
Some prior work has attempted to induce an inductive bias toward compositional generalization from data
Prior efforts seek to attribute causality to specific components of neural networks’ internal representations
Present study does not require assumptions about where in the network the subroutine is implemented
Present study stands adjacent to network pruning research
Mechanistic interpretability aims to reverse engineer neural networks
Cammarata et al. (2020) manually inspects individual neurons in InceptionV1
Present study introduces a method that automatically discovers subnetworks
Subroutines consist of higher-level features than those in previous work

Discussion

Neural networks decompose tasks into subtasks and implement solutions in modular subnetworks
Self-supervised pretraining leads to more consistent structural compositionality in models finetuned on compositional tasks

B. continuous sparsification: extended discussion

Continuous sparsification attempts to optimize a binary mask that minimizes a loss function.
The loss function includes a standard loss function and an l0 penalty.
Optimizing the binary mask is intractable, so a continuous sparsification is used instead.
During training, a soft mask (σ) and a discrete mask (H) are interpolated.

C. mask hyperparameter search details

Searched over learning rates, mask parameter initializations, and mask configurations
Mask configurations based on stages for Resnet models and layers for transformer models
Best hyperparameter configuration determined based on accuracy and structural compositionality
Computationally expensive process due to training many separate masks

D. vit hyperparameter search results

Conducted hyperparameter search on ViT models
Used batch sizes and learning rates on 6 and 12 layer ViT models
MLP had a hidden layer of dimensionality 2048 and output dimensionality of 128
All models failed to solve any of the tasks
+/-Number subroutine used for training/test examples
All datasets had training set size of 10000, validation set size of 500, and test set size of 1000

F. language data details

Generated language data using templates provided by Marvin & Linzen (2019)
Omitted templates that position noun of interest inside sentential complement or object relative clause
Nouns of interest always second word of sentence
Split up datasets into singular and plural partitions
Dataset statistics identical for singular and plural instances
Goal is to find subnetworks that implement subroutines
Avoid potential complication of pretrained model needing to unlearn grammaticality computation
Experiment to control for randomly-initialized models
Results indicate that subnetworks are not causally implicated in model behavior
Ablating subnetworks collapses performance to chance for all tasks
Ablating subnetworks oftentimes yields high performance on Test Target Subroutine and low performance on Test Other Subroutine
Results reflect internal mechanisms of trained models, not epiphenomenal artifacts of training binary masks over networks

Link to paper#

Abstract#

Paper Content#

Introduction#

Structural compositionality#

Experimental design#

Preliminaries#

Experimental logic#

Discovering subnetworks#

Vision experiments#

Language experiments#

Results#

Effect of pretraining on structural compositionality#

Related work#

Discussion#

B. continuous sparsification: extended discussion#

C. mask hyperparameter search details#

D. vit hyperparameter search results#

F. language data details#

Link to paper

Abstract

Paper Content

Introduction

Structural compositionality

Experimental design

Preliminaries

Experimental logic

Discovering subnetworks

Vision experiments

Language experiments

Results

Effect of pretraining on structural compositionality

Related work

Discussion

B. continuous sparsification: extended discussion

C. mask hyperparameter search details

D. vit hyperparameter search results

F. language data details