Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Implementing face verification and recognition efficiently at scale remains challenging for current approaches.
FaceNet is a system that learns a mapping from face images to a compact Euclidean space in which distances correspond directly to face similarity.
FaceNet uses a deep convolutional network trained to optimize the embedding itself directly.
Triplets of matching/non-matching face patches are used to train the system.
FaceNet achieves state-of-the-art face recognition performance using only 128 bytes per face.
FaceNet achieves a record accuracy of 99.63% on the Labeled Faces in the Wild (LFW) dataset.
FaceNet cuts the error rate by 30% relative to the best published result on both LFW and YouTube Faces DB.
Harmonic embeddings and harmonic triplet loss allow for direct comparison between different networks.

Paper Content

Introduction

System for face verification, recognition and clustering
Uses deep convolutional network to learn Euclidean embedding per image
Squared L2 distances in embedding space correspond to face similarity
Face verification involves thresholding distance between two embeddings
Recognition is a k-NN classification problem
Clustering is achieved using k-means or agglomerative clustering
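
The verification and recognition steps above can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: the 2-D toy embeddings, the names `same_person` and `recognize`, and the threshold value are assumptions made for the example.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def same_person(emb_a, emb_b, threshold):
    """Verification: accept the pair when its distance is within the threshold."""
    return squared_l2(emb_a, emb_b) <= threshold


def recognize(query, gallery):
    """Recognition as 1-NN: return the label of the closest gallery embedding."""
    label, _ = min(gallery, key=lambda item: squared_l2(query, item[1]))
    return label


# Toy 2-D embeddings (assumption); FaceNet embeddings are 128-dimensional.
gallery = [("alice", [1.0, 0.0]), ("bob", [0.0, 1.0])]
query = [0.9, 0.1]
print(recognize(query, gallery))                # closest identity in the gallery
print(same_person(query, gallery[0][1], 1.1))   # threshold value is illustrative
```

In practice the threshold would be chosen on held-out pairs, and k-NN with k > 1 could replace the single nearest neighbor.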

Related work

Data-driven method that learns its representation directly from the pixels of the face
Two different deep network architectures are explored
A vast corpus of prior face verification and recognition work exists
Earlier systems use multiple stages, combining a deep convolutional network with PCA and an SVM
Earlier systems use an ensemble of networks for their best performance on LFW
Triplet loss minimizes the L2 distance between faces of the same identity and enforces a margin to faces of other identities

Method

FaceNet uses a deep convolutional network
Two core architectures are discussed: Zeiler&Fergus and Inception
End-to-end learning of the whole system is employed
Triplet loss is used to reflect the goal of face verification, recognition and clustering
Triplet loss encourages faces of one identity to live on a manifold while enforcing distance to other identities

Triplet loss

The embedding maps an image into a d-dimensional Euclidean space and is constrained to unit norm, so it lives on the d-dimensional hypersphere
Motivated by nearest-neighbor classification: an image of a specific person should be closer to all other images of the same person than to any image of any other person, by a margin α
Triplet selection: selecting hard triplets that are active and can contribute to improving the model
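
As a rough sketch of the loss described above: for an anchor, a positive (same identity) and a negative (different identity), the loss is the hinge of the squared distance to the positive minus the squared distance to the negative, plus the margin α. The function names and the 2-D toy vectors are assumptions for illustration; only the margin value 0.2 comes from the text.

```python
import math


def l2_normalize(v):
    """Scale v to unit length so it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Hinge on d(a, p) - d(a, n) + alpha; zero once the margin is satisfied."""
    return max(0.0, squared_l2(anchor, positive) - squared_l2(anchor, negative) + alpha)


a = l2_normalize([1.0, 0.1])
p = l2_normalize([1.0, 0.2])
n = l2_normalize([0.0, 1.0])
print(triplet_loss(a, p, n))   # inactive triplet: the negative is already far away
print(triplet_loss(a, n, p))   # swapping roles gives an active (violating) triplet
```

A triplet with zero loss contributes no gradient, which is why the choice of triplets matters for training.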

Triplet selection

Select triplets that violate the triplet constraint
Generate triplets online using large mini-batches
Ensure a minimal number of exemplars of any one identity is present in each mini-batch
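Select semi-hard negatives (farther from the anchor than the positive, but still inside the margin), because always picking the hardest negatives can lead to bad local minima early in training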
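
One way to picture semi-hard selection is as a filter over candidate negatives within a mini-batch. The sketch below assumes a negative is semi-hard when it is farther from the anchor than the positive but still within the margin α; the function name and the toy vectors (which are not unit-normalized) are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def semi_hard_negatives(anchor, positive, negatives, alpha=0.2):
    """Return negatives farther than the positive but still inside the margin."""
    d_ap = squared_l2(anchor, positive)
    return [n for n in negatives if d_ap < squared_l2(anchor, n) < d_ap + alpha]


anchor, positive = [0.0, 0.0], [0.3, 0.0]           # d(a, p) = 0.09
candidates = [[0.2, 0.0], [0.4, 0.0], [1.0, 0.0]]   # d(a, n) = 0.04, 0.16, 1.0
print(semi_hard_negatives(anchor, positive, candidates))
```

Here the first candidate is a hard negative (closer than the positive) and the last one already satisfies the margin, so only the middle one is kept.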

Deep convolutional networks

CNN is trained using Stochastic Gradient Descent (SGD) with AdaGrad
Learning rate starts at 0.05 and is lowered to finalize the model
Models are randomly initialized and trained on a CPU cluster for 1,000 to 2,000 hours
Margin α set to 0.2
Two types of architectures explored in experimental section
Rectified linear units used as non-linear activation function
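
For readers unfamiliar with AdaGrad, the update rule can be sketched on a toy problem. This is a generic AdaGrad step, not the paper's training code; the quadratic objective, the `eps` constant and the function name are assumptions, and only the initial learning rate 0.05 is taken from the text.

```python
import math


def adagrad_step(params, grads, accum, lr=0.05, eps=1e-8):
    """One AdaGrad update: each step is scaled by the root of the summed squared gradients."""
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return params, accum


# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3); not a face model.
w, acc = [0.0], [0.0]
errors = []
for _ in range(100):
    w, acc = adagrad_step(w, [2.0 * (w[0] - 3.0)], acc)
    errors.append(abs(w[0] - 3.0))
print(w[0], errors[0], errors[-1])
```

Because the accumulated squared gradients only grow, the effective step size shrinks over time without a hand-tuned schedule.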

Datasets and evaluation

Evaluated method on four datasets
Evaluated on face verification task
A squared L2 distance threshold determines whether a pair is classified as same or different
True accepts are same-identity pairs within the threshold; false accepts are different-identity pairs within the threshold
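
The two rates used throughout the evaluation follow from these sets: the validation rate (VAL) is the fraction of same-identity pairs accepted at a threshold, and the false accept rate (FAR) is the fraction of different-identity pairs accepted. A minimal sketch, with made-up distances and a hypothetical function name:

```python
def val_far(same_dists, diff_dists, d):
    """Validation rate and false accept rate at squared-L2 threshold d."""
    true_accepts = sum(1 for x in same_dists if x <= d)    # same-identity pairs accepted
    false_accepts = sum(1 for x in diff_dists if x <= d)   # different-identity pairs accepted
    return true_accepts / len(same_dists), false_accepts / len(diff_dists)


same = [0.2, 0.5, 0.9, 1.3]        # toy distances for same-identity pairs
diff = [0.8, 1.4, 1.9, 2.5, 3.0]   # toy distances for different-identity pairs
val, far = val_far(same, diff, d=1.0)
print(val, far)
```

Raising the threshold increases both rates, so results are typically reported as VAL at a fixed FAR.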

Hold-out test set

Hold-out set of around 1 million images with the same distribution as the training set
Split into 5 disjoint sets of 200k images each
FAR and VAL rate computed on 100k × 100k image pairs
Standard error reported across 5 splits
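
The standard error across splits is the sample standard deviation of the per-split metric divided by the square root of the number of splits. A small sketch, where the five per-split values are hypothetical and not taken from the paper:

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical validation rates from five disjoint splits (not from the paper).
split_val = [0.871, 0.869, 0.874, 0.868, 0.873]
print(statistics.mean(split_val), standard_error(split_val))
```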

Personal photos

Test set has similar distribution to training set
Test set has been manually verified to have clean labels
Test set consists of 3 personal photo collections with 12k images
FAR and VAL rate computed across 12k squared pairs of images

Academic datasets

LFW is the de facto academic test set for face verification
YouTube Faces DB is a newer dataset that has gained popularity in the face recognition community
Both datasets use pairs of images/videos for verification

Experiments

Training uses 100M-200M face thumbnails covering about 8M different identities
Face detector is run on each image to generate a tight bounding box
Input sizes range from 96x96 pixels to 224x224 pixels

Computation accuracy trade-off

FLOPS and accuracy have a strong correlation
Five models (NN1, NN2, NN3, NNS1, NNS2) discussed in experiments
Performance is expected to decrease at some point if the number of parameters is reduced further

Effect of CNN model

Zeiler&Fergus based architecture with 1x1 convolutions and Inception based models both perform comparably
Inception based models reduce model size and FLOPS
Image size in pixels affects the validation rate
Embedding dimensionality of model NN1 affects the validation rate on the hold-out set
Largest model achieves a dramatic improvement in accuracy over the tiny NNS2
Tiny NNS2 runs at 30 ms per image on a mobile phone and is still accurate enough for face clustering

Sensitivity to image quality

Model is robust across a wide range of image sizes
Performance remains good down to a JPEG quality of 20
Performance drops only slightly for face thumbnails down to 120x120 pixels and remains acceptable at 80x80 pixels
Training with lower-resolution faces could extend this range further

Embedding dimensionality

128-dimensional float vector is used during training; it can be quantized to 128 bytes without loss of accuracy
128-dimensional byte vector is used for large-scale clustering and recognition
Smaller embeddings possible with minor loss of accuracy
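
The text does not say how the float embedding becomes a byte vector. One plausible scheme, shown purely as an assumption, is a linear mapping of each component from [-1, 1] (the range of a unit-norm vector's components) to an integer in 0..255. The function names are hypothetical.

```python
def quantize(embedding):
    """Map floats in [-1, 1] to integers in 0..255 (one byte per dimension)."""
    return bytes(min(255, max(0, round((x + 1.0) * 127.5))) for x in embedding)


def dequantize(code):
    """Approximate inverse of quantize."""
    return [b / 127.5 - 1.0 for b in code]


emb = [0.5, -0.5, 0.5, -0.5]   # toy unit-norm vector; FaceNet uses 128 dimensions
code = quantize(emb)
print(len(code), list(code))
print(dequantize(code))
```

With this scheme the round-trip error per component is at most half a quantization step, which is small compared with typical distance thresholds.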

Amount of training data

Using tens of millions of exemplars instead of only millions gives a 60% relative reduction in error on the personal photo test set.
Using hundreds of millions of images still gives a small boost, but the improvement tapers off.

Performance on LFW

Evaluated model on LFW using standard protocol
Nine training splits used to select L2-distance threshold
Classification accuracy of 98.87% ± 0.15 when using a fixed center crop
Record-breaking 99.63% ± 0.09 (standard error of the mean) when using extra face alignment
Error reduced by more than a factor of 7 compared to DeepFace [17] and by 30% compared to DeepId2+ [15]

Performance on YouTube Faces DB

Average similarity over the first 100 frames of each video gives a classification accuracy of 95.12%
Compared to the 91.4% accuracy reported in [17], which also evaluates 100 frames per video, the error rate is reduced by almost half
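
Both points can be made concrete with a short sketch. It assumes that "average similarity" means the mean squared L2 distance over all cross-video pairs of frame embeddings, which is one reading of the text; the function names are hypothetical. The second function reproduces the "almost half" claim from the two accuracies quoted above.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def video_distance(frames_a, frames_b, max_frames=100):
    """Mean squared L2 distance over all cross-video pairs of the first frames."""
    a, b = frames_a[:max_frames], frames_b[:max_frames]
    return sum(squared_l2(x, y) for x in a for y in b) / (len(a) * len(b))


def relative_error_reduction(old_acc, new_acc):
    """Fraction of the old error rate removed; accuracies are percentages."""
    old_err, new_err = 100.0 - old_acc, 100.0 - new_acc
    return (old_err - new_err) / old_err


print(video_distance([[0.0, 1.0], [0.0, 1.0]], [[1.0, 0.0]]))
print(relative_error_reduction(91.4, 95.12))   # about 0.43, i.e. "almost half"
```

The error rate falls from 8.6% to 4.88%, a relative reduction of roughly 43%.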

Face clustering

Compact embedding can be used to cluster photos of people with the same identity.
Results of clustering faces are impressive, as shown in Figure 7.
Clustering is invariant to occlusion, lighting, pose and age.
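
Because distances in the embedding space reflect identity, a simple distance-based clustering is enough to group faces. The sketch below uses single-linkage agglomerative clustering cut at a fixed threshold, implemented with union-find; the linkage choice, the threshold and the toy vectors are assumptions, not details from the paper.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster(embeddings, threshold):
    """Single-linkage agglomerative clustering: merge any pair within threshold."""
    parent = list(range(len(embeddings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving keeps trees shallow
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if squared_l2(embeddings[i], embeddings[j]) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(embeddings)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


faces = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0], [0.1, 0.99]]
print(cluster(faces, threshold=0.2))   # indices grouped by presumed identity
```

The pairwise loop is quadratic in the number of faces, so a large photo collection would need an approximate nearest-neighbor index instead.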

Summary

Directly learning a Euclidean embedding for face verification works well
Future work can explore how far the idea can be extended

Appendix: harmonic embedding

Introduces concept of harmonic embeddings, which are generated by different models but are compatible
Simplifies upgrade paths, allowing for smooth transition without version incompatibilities
Figure 8 shows results on the 3G dataset: NN2 clearly outperforms NN1, while comparing NN2 embeddings against NN1 embeddings performs at an intermediate level
To learn the harmonic embedding, triplets are generated that mix v1 and v2 embeddings; semi-hard negatives are selected from the combined set of v1 and v2 embeddings
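
To illustrate what "mixing" can mean, the sketch below enumerates which embedding version supplies each member of a triplet. It rests on an assumption that is not stated in the text: v1 embeddings are treated as frozen, so a triplet built only from v1 embeddings is skipped because it has nothing to train. The function name and the identifiers are hypothetical.

```python
from itertools import product


def mixed_triplets(anchor_id, positive_id, negative_id):
    """Enumerate v1/v2 assignments for a triplet, skipping the all-v1 case.

    Assumption: v1 embeddings are frozen, so a triplet made only of v1
    embeddings has nothing to train.
    """
    ids = (anchor_id, positive_id, negative_id)
    return [
        tuple(zip(versions, ids))
        for versions in product(("v1", "v2"), repeat=3)
        if "v2" in versions
    ]


for triplet in mixed_triplets("a", "p", "n"):
    print(triplet)
```

Each printed triplet would then be scored with the ordinary triplet loss, looking up every member in the embedding version it is tagged with.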

Harmonic triplet loss

Mix embeddings of v1 and v2 to learn harmonic embedding
Triplet loss encourages compatibility between different embedding versions
Visualization of triplet combinations
Initialize v2 embedding from independently trained NN2
Retrain last layer of v2 with compatibility encouraging triplet loss
Perturb incorrectly placed v1 embeddings to improve verification accuracy
FaceNet outputs distances between pairs of faces of the same person and of different people under different pose and illumination combinations
