Link to paper The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract Humans use natural language to refer to 3D locations Language Embedded Radiance Fields (LERFs) is a method for grounding language embeddings into NeRF LERF learns a dense, multi-scale language field inside NeRF LERF can extract 3D relevancy maps for language prompts in real-time LERF enables zero-shot queries on 3D CLIP embeddings without relying on region proposals or masks Paper Content Introduction Neural Radiance Fields (NeRFs) can capture photorealistic digital representations of 3D scenes Natural language is an intuitive interface for interacting with a 3D scene Language Embedded Radiance Fields (LERF) grounds language within NeRF by optimizing embeddings from a vision-language model LERF preserves the integrity of CLIP embeddings at multiple scales, allowing it to handle a broad range of language queries LERF utilizes self-supervised DINO features to regularize the optimized language field LERF can localize both fine-grained and abstract queries across in-the-wild scenes LERF has potential use cases in robotics, analyzing vision-language models, and interacting with 3D scenes Related work Open-Vocabulary Object Detection approaches lie on a spectrum from zero-shot to fully trained on segmentation datasets LSeg trains a 2D image encoder on labeled segmentation datasets CRIS and CLIPSeg train a 2D image decoder to output a relevancy map Common approach for 2D images is a two-stage framework with class-agnostic region or mask proposals OpenSeg and ViLD use CLIP to classify 2D regions from class-agnostic mask proposal networks Detic builds on existing two-stage object detector approaches OWL-ViT attaches lightweight object classification and localization heads after a pre-trained 2D image encoder LERF avoids region proposals by incorporating language embeddings in a dense, 3D, multiscale field Grad-CAM and attention-based methods provide a relevancy mapping between 2D images and text NeRF has an attractive property of averaging information across multiple views Semantic NeRF and Panoptic Lifting embed semantic information from semantic segmentation networks into 3D Distilled Feature Fields and Neural Feature Fusion Fields explore embedding pixel-aligned feature vectors into NeRF LERF embeds feature vectors into NeRF without fine-tuning 3D Language Grounding has been explored in a wide range of contexts VL-Maps and Open-Scene build a 3D volume of language features which can be queried CLIP-Fields and NLMaps-SayCan fuse CLIP embeddings of crops into pointclouds ConceptFusion fuses CLIP features more densely in RGBD pointclouds LERF provides a new dense, volumetric interface for 3D text queries Multi-scale supervision Supervising language field outputs requires querying language embeddings over image patches, not pixels....