Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Presents Recurrent Interface Network (RIN), a neural net architecture that allocates computation adaptively to the input
Hidden units of RINs are partitioned into the interface and latents
RIN block selectively reads from the interface into latents for high-capacity processing
Stacking multiple blocks enables effective routing across local and global levels
Latent self-conditioning technique “warm-starts” the latents at each iteration of the generation process
RINs yield state-of-the-art image and video generation without cascades or guidance
Up to 10$\times$ more efficient compared to specialized 2D and 3D U-Nets

Design of effective neural network architectures is important for deep learning
Convolutional neural networks and Transformers are examples of architectures
Computation is usually allocated in a fixed, uniform manner
It is important to allocate computation in an adaptive manner to improve scalability
Prior work has explored dynamic and input-decoupled computation
Generating images and videos with high-dimensional data requires adaptive computation
Recurrent Interface Networks (RINs) are a new architecture that allocates computation more effectively
RINs outperform U-Net architectures for image and video generation
Latent self-conditioning is proposed to reduce the cost of routing
RINs lead to significant performance and efficiency gains in diffusion models

RINs use tokenization to connect the interface to the input space and learnable embeddings to initialize the latents.
RINs route information between the interface and latents.
The interface grows linearly with input size, while the number of latents is much smaller.
RINs are more efficient than U-Nets, Transformers, and other decoupled architectures.
RINs are especially useful in recurrent settings.
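
To make the scaling argument above concrete, here is a minimal NumPy sketch of patch tokenization plus a rough count of attention query-key pairs. The linear patch embedding, the positional embedding, all sizes, and the helper names `patchify` and `attention_pairs` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into non-overlapping patches, one token per patch."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def attention_pairs(n_interface, n_latents, latent_layers=1):
    """Query-key pairs for read + latent self-attention + write, versus
    full self-attention over the interface tokens."""
    rin = 2 * n_interface * n_latents + latent_layers * n_latents ** 2
    full = n_interface ** 2
    return rin, full


rng = np.random.default_rng(0)
image = rng.normal(size=(64, 64, 3))
tokens = patchify(image, patch_size=4)                     # shape (256, 48)

# Hypothetical embedding weights; in a trained model these would be learned.
w_embed = rng.normal(size=(tokens.shape[1], 32)) * 0.02
pos_embed = rng.normal(size=(tokens.shape[0], 32)) * 0.02
interface = tokens @ w_embed + pos_embed                   # grows with input size
latents = rng.normal(size=(16, 64)) * 0.02                 # fixed, much smaller set

rin_pairs, full_pairs = attention_pairs(len(interface), len(latents))
print(interface.shape, latents.shape, rin_pairs, full_pairs)
```

With 256 interface tokens and 16 latents, the read, process, and write steps touch far fewer query-key pairs than self-attention over all interface tokens, and the gap widens as the input grows while the number of latents stays fixed.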

Diffusion models learn a series of state transitions to map noise from a known prior distribution to data
A forward transition from $x_0$ to $x_t$ is defined
A neural net is learned to predict the noise $\epsilon$ from $x_t$ and then estimate $x_{t-\Delta}$ from the estimated $\tilde{\epsilon}$ and $x_t$
Samples are generated by iteratively applying the denoising function
The network takes as input a noisy image $x_t$, a time step $t$, and an optional conditioning variable
The interface is initialized from an input x, such as an image or video
Latents are initialized as learned embeddings
The RIN block routes information by reading from X into Z, processing Z, and writing updates back to X
Multi-layer perceptrons (MLPs) and multi-head attention (MHA) are used to process information
Readout layer is applied to the corresponding interface tokens to predict local outputs
Local outputs are combined to form the desired output
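
The read, process, and write routing described above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: single-head attention stands in for MHA, layer normalization and time conditioning are omitted, a ReLU MLP is used, and all weights, sizes, and helper names (`attend`, `init_block`, `rin_block`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_attention(rng, d_query, d_context, d_head):
    s = 0.02
    return {"q": rng.normal(size=(d_query, d_head)) * s,
            "k": rng.normal(size=(d_context, d_head)) * s,
            "v": rng.normal(size=(d_context, d_head)) * s,
            "o": rng.normal(size=(d_head, d_query)) * s}


def attend(p, queries, context):
    """Single-head attention: `queries` gather information from `context`."""
    q, k, v = queries @ p["q"], context @ p["k"], context @ p["v"]
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (weights @ v) @ p["o"]


def init_mlp(rng, d, expansion=4):
    return {"w1": rng.normal(size=(d, expansion * d)) * 0.02,
            "w2": rng.normal(size=(expansion * d, d)) * 0.02}


def mlp(p, x):
    return np.maximum(x @ p["w1"], 0.0) @ p["w2"]  # ReLU chosen for brevity


def init_block(rng, d_x, d_z, d_head=16):
    return {"read": init_attention(rng, d_z, d_x, d_head), "read_mlp": init_mlp(rng, d_z),
            "self": init_attention(rng, d_z, d_z, d_head), "self_mlp": init_mlp(rng, d_z),
            "write": init_attention(rng, d_x, d_z, d_head), "write_mlp": init_mlp(rng, d_x)}


def rin_block(p, X, Z):
    # Read: latents Z attend to interface X.
    Z = Z + attend(p["read"], Z, X)
    Z = Z + mlp(p["read_mlp"], Z)
    # Process: self-attention among the (few) latents.
    Z = Z + attend(p["self"], Z, Z)
    Z = Z + mlp(p["self_mlp"], Z)
    # Write: interface X attends to the updated latents.
    X = X + attend(p["write"], X, Z)
    X = X + mlp(p["write_mlp"], X)
    return X, Z


rng = np.random.default_rng(0)
X = rng.normal(size=(64, 32))   # interface tokens (e.g. embedded patches)
Z = rng.normal(size=(8, 48))    # latents (learned embeddings in a real model)
for block in [init_block(rng, 32, 48) for _ in range(2)]:  # stacked blocks
    X, Z = rin_block(block, X, Z)

w_readout = rng.normal(size=(32, 48)) * 0.02  # hypothetical linear readout
patch_predictions = X @ w_readout             # one local output per interface token
print(X.shape, Z.shape, patch_predictions.shape)
```

The per-token outputs in `patch_predictions` would then be rearranged back into the image or video grid to form the final prediction.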

RINs use routing information to allocate compute to parts of the input
Latents are built by reading interface information
Without context, there is a “cold-start” problem
Humans face a similar “cold-start” problem
RINs can amortize the “warm-up” cost in sequential computation settings
The authors propose to “warm-start” the latents using latents from a previous step
Latent self-conditioning conditions on the latent activations of the neural network
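
The warm-start idea can be illustrated with a toy sampling loop that carries latents from one denoising iteration to the next. Everything here is an assumption made for illustration: `toy_network` is a stand-in rather than a RIN, previous latents are simply added to the initial embeddings, the cosine schedule `gamma` and the DDIM-style update are one possible choice of sampler, and the clipping of the data estimate is a common convenience rather than something stated in the summary.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LATENTS, DIM = 4, 8
Z_INIT = rng.normal(size=(NUM_LATENTS, DIM)) * 0.02  # stands in for learned latent embeddings
W_READ = rng.normal(size=(DIM, DIM)) * 0.1           # hypothetical toy weights


def gamma(t):
    """Cosine noise schedule, clipped away from zero so divisions stay finite."""
    return np.clip(np.cos(0.5 * np.pi * t) ** 2, 1e-4, 1.0)


def toy_network(x_t, t, prev_latents=None):
    """Toy stand-in for a RIN: returns (noise estimate, final latents)."""
    # Cold start uses only the initial embeddings; warm start adds previous latents.
    z = Z_INIT if prev_latents is None else Z_INIT + prev_latents
    z = z + np.tanh(x_t.mean(axis=0, keepdims=True) @ W_READ)   # crude "read"
    eps_hat = np.tanh(x_t + t * z.mean(axis=0, keepdims=True))  # crude "write" and readout
    return eps_hat, z


def sample(num_steps=10, shape=(16, DIM)):
    x_t = rng.normal(size=shape)
    latents = None  # only the very first iteration starts cold
    for step in range(num_steps, 0, -1):
        t_now, t_next = step / num_steps, (step - 1) / num_steps
        # Latents from the previous iteration warm-start this one.
        eps_hat, latents = toy_network(x_t, t_now, prev_latents=latents)
        x0_hat = (x_t - np.sqrt(1.0 - gamma(t_now)) * eps_hat) / np.sqrt(gamma(t_now))
        x0_hat = np.clip(x0_hat, -1.0, 1.0)
        x_t = np.sqrt(gamma(t_next)) * x0_hat + np.sqrt(1.0 - gamma(t_next)) * eps_hat
    return x_t, latents


x_final, z_final = sample()
print(x_final.shape, z_final.shape)
```

During training there is no previous iteration to draw on, so a self-conditioning scheme would typically obtain the conditioning latents from an extra forward pass on the same input and treat them as constants; that detail is not shown here.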

Image generation works well with small datasets such as CIFAR-10
RINs obtain 1.81 FID on CIFAR-10 without using an improved sampling procedure
This model has 31M parameters and trains in 3 hours on 8 TPUv3 chips
Video generation works without using guidance
Latent self-conditioning is important for enhanced routing
Stacking blocks enhances global and local processing
Model can handle a wide range of patch sizes
A sigmoid noise schedule with an appropriate temperature works better during training than a cosine schedule
The noise schedule has less impact during sampling
Visualizing read attention reveals which parts of the image are most attended to

RINs bear resemblance to architectures that leverage auxiliary memory
Latent self-conditioning allows RINs to leverage global context
RINs are closely related to recurrent models with input attention
Pixel diffusion models are the predominant approach to image and video generation
Self-conditioning for diffusion models was originally proposed in (Chen et al., 2022c)
RINs outperform U-Nets widely used in recent state-of-the-art image and video diffusion models
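
As a companion to the noise schedule comparison in the experiments, here is a sketch of a sigmoid schedule with a temperature, a cosine schedule, and a forward transition that uses them. The parameterization $x_t = \sqrt{\gamma(t)}\,x_0 + \sqrt{1-\gamma(t)}\,\epsilon$, the defaults `start=-3`, `end=3`, `tau=1.0`, and the function names are assumptions chosen for illustration; the summary does not state the exact settings used.

```python
import numpy as np


def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_schedule(t, start=-3.0, end=3.0, tau=1.0, clip_min=1e-9):
    """gamma(t) from a rescaled sigmoid; tau is the temperature."""
    v_start, v_end = _sigmoid(start / tau), _sigmoid(end / tau)
    out = _sigmoid((t * (end - start) + start) / tau)
    return np.clip((v_end - out) / (v_end - v_start), clip_min, 1.0)


def cosine_schedule(t, clip_min=1e-9):
    """gamma(t) = cos^2(pi * t / 2), clipped away from zero."""
    return np.clip(np.cos(0.5 * np.pi * t) ** 2, clip_min, 1.0)


def forward_transition(x0, t, schedule, rng):
    """Sample x_t from x_0 by mixing in Gaussian noise according to gamma(t)."""
    g = schedule(t)
    eps = rng.normal(size=x0.shape)
    return np.sqrt(g) * x0 + np.sqrt(1.0 - g) * eps, eps


rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 4))
ts = np.linspace(0.0, 1.0, 11)
print(np.round(sigmoid_schedule(ts, tau=0.9), 3))
print(np.round(cosine_schedule(ts), 3))
x_half, eps = forward_transition(x0, 0.5, sigmoid_schedule, rng)
```

Both schedules start at 1 (clean data) and decay toward 0 (pure noise); the temperature changes how quickly the sigmoid schedule moves through intermediate noise levels.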