Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Face recognition presents challenges to current approaches.
- FaceNet is a system that maps face images to a compact Euclidean space.
- FaceNet uses a deep convolutional network to directly optimize the embedding.
- Triplets of matching/non-matching face patches are used to train the system.
- FaceNet achieves state-of-the-art face recognition performance with 128-bytes per face.
- FaceNet achieves record accuracy of 99.63% on Labeled Faces in the Wild (LFW) dataset.
- FaceNet cuts the error rate by 30% on both LFW and YouTube Faces DB datasets.
- Harmonic embeddings and harmonic triplet loss allow for direct comparison between different networks.
Paper Content
Introduction
- System for face verification, recognition and clustering
- Uses deep convolutional network to learn Euclidean embedding per image
- Squared L2 distances in embedding space correspond to face similarity
- Face verification involves thresholding distance between two embeddings
- Recognition is a k-NN classification problem
- Clustering is achieved using k-means or agglomerative clustering
Related work
- Data driven method that learns representation from pixels of face
- Two different deep network architectures used
- Vast corpus of face verification and recognition works
- Multiple stages combining deep convolutional network with PCA and SVM
- Ensemble of networks used for best performance on LFW
- Triplet loss used to minimize L2 distance between faces of same identity
Method
- FaceNet uses a deep convolutional network
- Two core architectures are discussed: Zeiler&Fergus and Inception
- End-to-end learning of the whole system is employed
- Triplet loss is used to reflect the goal of face verification, recognition and clustering
- Triplet loss encourages faces of one identity to live on a manifold while enforcing distance to other identities
Triplet loss
- Embedding an image into a d-dimensional Euclidean space and constraining it to live on a d-dimensional hypersphere
- Nearest-neighbor classification: an image of a specific person should be closer to other images of the same person than to images of any other person
- Triplet selection: selecting hard triplets that are active and can contribute to improving the model
Triplet selection
- Select triplets that violate the triplet constraint
- Generate triplets online using large mini-batches
- Ensure a minimal number of exemplars of any one identity is present in each mini-batch
- Select semi-hard negatives to avoid bad local minima
Deep convolutional networks
- Trained CNN using Stochastic Gradient Descent (SGD) and AdaGrad
- Learning rate started at 0.05 and decreased to finalize model
- Models initialized from random and trained on CPU cluster for 1,000-2,000 hours
- Margin α set to 0.2
- Two types of architectures explored in experimental section
- Rectified linear units used as non-linear activation function
Datasets and evaluation
- Evaluated method on four datasets
- Evaluated on face verification task
- Used squared L2 distance threshold to determine classification of same and different
- Defined set of true accepts and false accepts
Hold-out test set
- Hold out set of 1 million images with same distribution as training set
- Split into 5 disjoint sets of 200k images each
- FAR and VAL rate computed on 100k x 100k image pairs
- Standard error reported across 5 splits
Personal photos
- Test set has similar distribution to training set
- Test set has been manually verified to have clean labels
- Test set consists of 3 personal photo collections with 12k images
- FAR and VAL rate computed across 12k squared pairs of images
Academic datasets
- LFW is the standard test set for face verification
- Youtube Faces DB is a new dataset used for face recognition
- Both datasets use pairs of images/videos for verification
Experiments
- Training face thumbnails consist of 8M different identities
- Face detector is run on each image to generate a tight bounding box
- Input sizes range from 96x96 pixels to 224x224 pixels
Computation accuracy trade-off
- FLOPS and accuracy have a strong correlation
- Five models (NN1, NN2, NN3, NNS1, NNS2) discussed in experiments
- Performance decreases if number of parameters is reduced further
Effect of cnn model
- Zeiler&Fergus based architecture with 1x1 convolutions and Inception based models both perform comparably
- Inception based models reduce model size and FLOPS
- Image size in pixels affects validation rate
- Embedding dimensionality of model NN1 affects hold-out set
- Largest model achieves dramatic improvement in accuracy
- Tiny NNS2 can be run 30ms/image on a mobile phone and is accurate enough for face clustering
Sensitivity to image quality
- Model is robust across a wide range of image sizes
- Performance remains good even with JPEG compression of quality 20
- Performance remains good even with face thumbnails of size 120x120 and 80x80 pixels
- Training with lower resolution faces could improve performance range
Embedding dimensionality
- 128 dimensional float vector used for training
- 128 dimensional byte vector used for large scale clustering and recognition
- Smaller embeddings possible with minor loss of accuracy
Amount of training data
- Using tens of millions of exemplars results in a 60% reduction in error on a personal photo test set.
- Using hundreds of millions of images gives a small boost, but the improvement tapers off.
Performance on lfw
- Evaluated model on LFW using standard protocol
- Nine training splits used to select L2-distance threshold
- Classification accuracy of 98.87%±0.15 when using fixed center crop
- Record breaking 99.63%±0.09 standard error of the mean when using extra face alignment
- Error reduced by more than a factor of 7 compared to DeepFace in [17] and by 30% compared to DeepId2+ in [15]
Performance on youtube faces db
- Used average similarity of first 100 frames of each video to classify with 95.12% accuracy
- Compared to 91.4% accuracy of 100 frames from [17], error rate reduced by almost half
Face clustering
- Compact embedding can be used to cluster photos of people with the same identity.
- Results of clustering faces are impressive, as shown in Figure 7.
- Clustering is invariant to occlusion, lighting, pose and age.
Summary
- Findings work well
- Future work to explore how far idea can be extended
Appendix: harmonic embedding
- Introduces concept of harmonic embeddings, which are generated by different models but are compatible
- Simplifies upgrade paths, allowing for smooth transition without version incompatibilities
- Figure 8 shows results on 3G dataset, NN2 outperforms NN1, comparison of NN2 to NN1 performs at intermediate level
- To learn harmonic embedding, triplets are generated that mix v1 and v2 embeddings, semihard negatives are selected from both v1 and v2 embeddings
Harmonic triplet loss
- Mix embeddings of v1 and v2 to learn harmonic embedding
- Triplet loss encourages compatibility between different embedding versions
- Visualization of triplet combinations
- Initialize v2 embedding from independently trained NN2
- Retrain last layer of v2 with compatibility encouraging triplet loss
- Perturb incorrectly placed v1 embeddings to improve verification accuracy
- FaceNet output distances between pairs of faces of same and different person in different pose and illumination combinations