Face Recognition

Face recognition was part of the Human-Robot Interaction work in my graduation project, which is to build a reception robot. In a home environment, a smart home robot must accurately detect and identify every family member in order to perform tasks for specific household members. This blog therefore focuses on implementing facial recognition and analyzing the related components.

Face recognition tasks can be broadly divided into two parts: face detection and face recognition. Face detection refers to locating all faces in an input image and processing them, such as aligning the faces, generating bounding boxes, and resizing or cropping the detected faces. However, due to the potential challenges posed by the robot’s field of vision, such as incomplete or peripheral facial appearances, face detection can become more complex. Additional complications arise from variations in lighting, occlusions, differences in the same individual’s pose or expressions, and inter-individual appearance differences.

To address these challenges, accurate facial recognition in complex environments requires precise localization of faces, followed by the detection of facial feature points, such as facial contours, eye outlines, nose, and mouth contours. These feature points are then used to align the faces uniformly, eliminating errors caused by position or pose variations.

Face recognition can be described as comparing a given photo to a dataset, calculating the facial similarity, and outputting a similarity score or confidence level. Alternatively, it involves identifying which individual in the dataset the photo belongs to and retrieving the relevant information about the individual from the database.

A representative algorithm in this domain is FaceNet, which builds on the GoogLeNet architecture that won the 2014 ILSVRC (ImageNet Large Scale Visual Recognition Challenge) with a Top-5 classification error rate of 6.67%. FaceNet itself reached 99.63% accuracy on the LFW face recognition benchmark. Since then, facial recognition algorithms have continued to evolve; the latest implementation on GitHub recommends using the MTCNN (Multi-task Cascaded Convolutional Networks) algorithm for face detection and alignment, which raises recognition accuracy to 99.65%.

Therefore, the primary focus of this blog is the introduction and implementation of the MTCNN and FaceNet network architectures.

Face Detection MTCNN

Why MTCNN rather than OpenCV?

There are several methods for face detection, such as the Haar feature-based cascade classifier provided by OpenCV and the face detector in dlib. OpenCV’s method is simple and fast, but its detection performance is poor: it works well for frontal, upright faces in good lighting, yet fails on faces that are turned sideways, tilted, or poorly lit, so it is not suitable for real-world applications. dlib’s face detector performs better than OpenCV’s, but its detection capability still falls short of what real-world applications require.

In this study, we use the MTCNN-based deep learning approach for face detection (MTCNN: Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Neural Networks). The MTCNN face detection method is more robust to variations in lighting, angles, and facial expressions in natural environments, offering better face detection results. Additionally, it has a relatively low memory consumption, enabling real-time face detection. The MTCNN method used in this blog is implemented in Python and TensorFlow.
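
As a concrete starting point, the snippet below is a minimal detection sketch that assumes the third-party `mtcnn` Python package rather than this project’s exact code; the `MTCNN` class, the `detect_faces` method, and the returned dictionary keys follow that package’s interface as I understand it.

```python
# Minimal MTCNN detection sketch (assumes `pip install mtcnn opencv-python`).
import cv2
from mtcnn import MTCNN

detector = MTCNN()
# The detector expects an RGB image; OpenCV loads BGR, so convert first.
image = cv2.cvtColor(cv2.imread("family_photo.jpg"), cv2.COLOR_BGR2RGB)

for face in detector.detect_faces(image):
    x, y, w, h = face["box"]          # bounding box (top-left corner, width, height)
    score = face["confidence"]        # probability that this box contains a face
    keypoints = face["keypoints"]     # left_eye, right_eye, nose, mouth_left, mouth_right
    print(f"face at ({x}, {y}), size {w}x{h}, confidence {score:.3f}")
    print("landmarks:", keypoints)
```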

How does MTCNN work?

Face detection is a critical step in the face recognition process, and its performance significantly affects both the video processing speed and the recognition accuracy of a robotic system. The method adopted in this blog, MTCNN, integrates face detection and face alignment and also detects facial key points. The core of the algorithm is three deep convolutional neural networks arranged in a cascaded, multi-stage structure that together form the face detection framework.

Figure 1 illustrates the flowchart of the MTCNN face detection process. Initially, candidate boxes are generated by the P-Net (Proposal Network). P-Net performs the first-level analysis of the image and outputs candidate objects. In the next phase, the R-Net (Refinement Network) further analyzes and refines these candidate objects. In the final stage, the O-Net (Output Network) generates the final bounding boxes and the locations of facial key points.

Thus, the MTCNN architecture can be understood as a cascaded structure involving an image pyramid and the networks P-Net, R-Net, and O-Net, which together contribute to the accurate detection and alignment of faces.

Figure 1: MTCNN face detection process framework

The MTCNN model consists of three processing stages, which ultimately produce results for both face detection and face alignment. Before using the three cascaded networks, the input image is first processed through an image pyramid, which scales the image at different resolutions. This approach is necessary because faces can appear at various sizes within an image, and scaling the image at multiple resolutions helps detect faces at different scales. This method allows for a rough initial classification of the image, removing non-face regions and identifying face-containing areas. The advantage of this approach is its ability to quickly detect faces of different sizes, enabling multi-scale target detection and recognition.
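
To make the pyramid step concrete, here is a small sketch of how the scale factors are typically generated; the 20-pixel minimum face size and the 0.709 scale factor are conventions from common MTCNN implementations, not values stated in this blog.

```python
# Sketch: compute the scales at which the input image is resized before P-Net.
def pyramid_scales(height, width, min_face_size=20, factor=0.709, net_size=12):
    """Return scale factors so that faces down to `min_face_size` map to the
    12x12 window that P-Net was trained on."""
    scales = []
    m = net_size / min_face_size          # scale that maps the smallest face to 12 px
    min_side = min(height, width) * m
    while min_side >= net_size:
        scales.append(m * factor ** len(scales))
        min_side *= factor
    return scales

print(pyramid_scales(480, 640))  # e.g. [0.6, 0.425, 0.302, ...]
```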

Below is an introduction to the three processing stages of MTCNN face detection, illustrated by Figure 1.

First Stage: P-Net

In the first stage, P-Net generates feature maps through forward propagation, where each location is represented by a 32-dimensional feature vector used to determine whether a face is present in that region. As shown in Figure 2(a), P-Net consists of three 3x3 convolutional layers and one 3x3 max-pooling layer. Its input is a processed 12x12x3 RGB image. The network produces the following outputs:

  1. A face classification score derived from the 32-dimensional feature vector: because a Softmax layer is applied, this output is the probability that a face is present in the region.
  2. If a face is detected, a bounding box indicating the face is returned, which constitutes the second part of the output.

The bounding boxes from P-Net may not always align perfectly with the actual face positions. For example, sometimes the face may be only partially within the bounding box or outside the box. In such cases, the region is adjusted to more accurately reflect the desired area. Once the bounding box with the highest probability of containing a face is identified, Non-Maximum Suppression (NMS) is applied to retain the box with the highest score and remove other less probable boxes.
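
Below is a minimal NumPy sketch of the NMS step described above; boxes are assumed to be in [x1, y1, x2, y2] form and the 0.5 IoU threshold is illustrative.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes and drop boxes that overlap them too much."""
    order = scores.argsort()[::-1]                    # best boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection of the current best box with all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_threshold]            # discard heavily overlapping boxes
    return keep
```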

The third part of P-Net’s output is the coordinates of five key facial points: the positions of the left eye, right eye, nose, left corner of the mouth, and right corner of the mouth. Although P-Net provides the key point coordinates, there may still be many false positive face boxes, so the focus at this stage is on selecting the correct face box.

Second Stage: R-Net

As shown in Figure 2(b), R-Net consists of two 3x3 convolutional layers, one 2x2 convolutional layer, two 3x3 max-pooling layers, and a 128-dimensional fully connected layer. Since the candidate boxes generated by P-Net are still rough, R-Net is used to further optimize the results. Before inputting the data, the face bounding box coordinates from P-Net undergo a transformation to map the image size to 24x24x3.

The output of R-Net is similar to that of P-Net, but its main purpose is to refine the results by removing irrelevant face boxes—non-face boxes or partially incorrect face boxes—and passing the refined results to the third stage (O-Net) for final selection.

Third Stage: O-Net

In Figure 2(c), O-Net consists of three 3x3 convolutional layers, one 2x2 convolutional layer, two 3x3 and one 2x2 max-pooling layers, and a 256-dimensional fully connected layer. The input to O-Net is the output from R-Net, scaled to an image size of 48x48x3. The output of O-Net includes:

  • The coordinates of the final bounding box
  • The confidence score
  • The coordinates of the facial key points

This final stage provides the precise face detection and alignment results, refining the bounding boxes and facial key points obtained from the previous stages.

Figure 2: MTCNN subnetwork framework

Loss Function

Since MTCNN is a cascaded structure consisting of three sub-networks, each sub-network has its own loss function. In the task of distinguishing between faces and non-faces, which is a binary classification problem, the cross-entropy loss function is used. For bounding box regression and key point localization, the network predicts the offsets between the predicted values and the true values, so the Euclidean distance is used as the loss function for these tasks.

Finally, the total loss function is obtained by summing the individual losses of the three sub-networks, with each loss weighted according to the goals of each stage. The training focuses of the three sub-networks vary, so different weightings are assigned to each loss function. During the training of P-Net and R-Net, the main focus is on the accuracy of the bounding boxes, so the contribution of the key point loss is relatively small. In contrast, during the training of O-Net, which emphasizes key point selection, the weight of the key point loss is larger compared to the other components.

Thus, the total loss function for MTCNN is a weighted sum of three parts:

  1. The cross-entropy loss for face vs. non-face classification.

Since face classification is a binary classification problem, the cross-entropy loss function is used. For each sample $x_i$, the loss function is defined as:

$L_i^{\text{det}} = - \left( y_i^{\text{det}} \log(P_i) + (1 - y_i^{\text{det}}) \log(1 - P_i) \right)$

Where:

  • $P_i$ represents the probability output by the network that the sample is a face.
  • $y_i^{\text{det}} \in \{0,1\}$ indicates the ground truth label for whether the sample is a face (1) or not (0).
  2. The Euclidean distance loss for bounding box regression.

For each candidate bounding box, the loss function measures the offset between the predicted values and the ground truth. For each sample $x_i$, the Euclidean distance is used as the loss function, defined as:

$L_i^{\text{box}} = \| \hat{y}_i^{\text{box}} - y_i^{\text{box}} \|_2^2$

Where:

  • $\hat{y}_i^{\text{box}}$ is the network’s prediction for the bounding box, which includes four values: the top-left corner coordinates, width, and height of the box.
  • $y_i^{\text{box}}$ represents the ground truth for each sample’s bounding box.
  3. The Euclidean distance loss for key point localization.

For key point localization, a loss function similar to bounding box regression is used. The Euclidean distance measures the offset between the predicted and true key point coordinates, defined as:

$L_i^{\text{landmark}} = \| \hat{y}_i^{\text{landmark}} - y_i^{\text{landmark}} \|_2^2$

Where:

  • $\hat{y}_i^{\text{landmark}}$ is the network’s prediction for the key points, including 10 values corresponding to the coordinates of the left eye, right eye, nose, left mouth corner, and right mouth corner.
  • $y_i^{\text{landmark}}$ is the ground truth for the facial key points.

Each of these components is weighted appropriately to reflect the emphasis of the corresponding stage in the network.

  4. Total Loss Function

The total loss combines the three individual loss functions with different weights, formulated as:

$L = \min \sum_{i=1}^N \sum_{j \in \{\text{det}, \text{box}, \text{landmark}\}} \alpha_j \beta_i^j L_i^j$

Where:

  • N is the total number of samples.
  • $\alpha_j$ represents the weight for each task. For P-Net and R-Net, the focus is on face classification $(\alpha_{\text{det}} = 1)$ and bounding box regression $(\alpha_{\text{box}} = 0.5)$, with less emphasis on key point localization $(\alpha_{\text{landmark}} = 0.5)$. For O-Net, the weights are adjusted to focus more on key point localization $(\alpha_{\text{landmark}} = 1)$.
  • $\beta_i^j \in \{0,1\}$ indicates whether the sample participates in the loss computation for a specific task. If $\beta_i^j = 0$, the corresponding loss for that task is not computed.

The loss is minimized using stochastic gradient descent (SGD), and $L_i^j$ represents the computed loss for sample $x_i$ in task $j$. The total loss function reflects the combined optimization of face classification, bounding box regression, and key point localization.
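
Putting the three terms together, the following NumPy sketch computes the weighted per-sample loss described above; the function and argument names are illustrative, and the beta switches simply zero out tasks that do not apply to a given training sample.

```python
import numpy as np

def sample_loss(p_face, y_det, box_pred, box_true, lmk_pred, lmk_true,
                alpha_det=1.0, alpha_box=0.5, alpha_lmk=0.5,
                beta_det=1, beta_box=1, beta_lmk=0):
    eps = 1e-12
    # Face / non-face cross-entropy.
    L_det = -(y_det * np.log(p_face + eps) + (1 - y_det) * np.log(1 - p_face + eps))
    # Bounding-box regression: squared Euclidean distance over the 4 box values.
    L_box = np.sum((box_pred - box_true) ** 2)
    # Landmark regression: squared Euclidean distance over the 10 coordinates.
    L_lmk = np.sum((lmk_pred - lmk_true) ** 2)
    # alpha weights differ per sub-network; beta turns a task off for this sample.
    return (alpha_det * beta_det * L_det
            + alpha_box * beta_box * L_box
            + alpha_lmk * beta_lmk * L_lmk)

# Example: a positive face sample with box labels but no landmark labels.
print(sample_loss(p_face=0.9, y_det=1,
                  box_pred=np.zeros(4), box_true=np.array([0.1, 0.0, -0.05, 0.02]),
                  lmk_pred=np.zeros(10), lmk_true=np.zeros(10)))
```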

Face Recognition FaceNet

After completing the face detection process, the next step is face recognition. Here, face recognition serves as a general term for face-related tasks, which can be divided into the following four categories:

  1. Face Tracking: Face tracking refers to following an individual within a room after detecting a family member. This is essential for providing targeted services to the individual.
  2. Face Verification: Face verification determines whether two faces appearing sequentially in the field of view belong to the same family member. This functionality is crucial for enabling directed services by confirming identity consistency.
  3. Face Identification: Face identification focuses on determining which member of the family (stored in the robot’s facial database) a detected face belongs to. It facilitates member identification. Additionally, when a new face (such as a visiting relative or friend) is detected, it allows the system to add the new face to the database and carry out subsequent actions accordingly.
  4. Face Clustering: Face clustering enhances the robustness of the robot. By collecting multiple facial images of family members under different angles, lighting conditions, and occlusions in the home environment, the robot can classify and organize these images into distinct clusters for more reliable recognition.

These functions collectively enable the home service robot to handle a variety of face-related tasks, ensuring it operates effectively and adapts to dynamic household scenarios.
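
As an illustration of the face identification task above, the sketch below compares a query embedding against a small database of family-member embeddings and falls back to "unknown" when no stored face is close enough. The 128-dimensional embeddings and the distance threshold anticipate the FaceNet discussion later in this blog and are assumptions here.

```python
import numpy as np

def identify(query_embedding, database, threshold=1.1):
    """Return the name of the closest database entry, or "unknown" if too far away."""
    best_name, best_dist = None, float("inf")
    for name, emb in database.items():
        dist = np.sum((query_embedding - emb) ** 2)   # squared Euclidean distance
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < threshold else "unknown"

# Toy database of L2-normalized embeddings for two family members.
db = {name: np.random.randn(128) for name in ["alice", "bob"]}
db = {name: emb / np.linalg.norm(emb) for name, emb in db.items()}
print(identify(db["alice"] + 0.01, db))   # a near-copy of alice's embedding -> "alice"
```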

Why FaceNet?

Before the introduction of FaceNet, traditional face recognition methods based on convolutional neural networks (CNNs) primarily utilized Siamese networks for feature extraction, followed by classification using methods like SVM (Support Vector Machines).

The advent of FaceNet marked a significant milestone in face recognition. It achieved an impressive accuracy of 99.63% on the widely used LFW (Labeled Faces in the Wild) dataset, setting a new record at the time, and reached 95.12% accuracy on the YouTube Faces database, cutting the error rate on both datasets by 30% compared with the best previously published results.

The core idea behind FaceNet is to directly learn the mapping of facial images into a multi-dimensional Euclidean space. The similarity between two facial images is then determined by the distance between their respective feature points in this Euclidean space. As shown in Figure 3, the numerical value between two face images represents the Euclidean distance between their feature points in the multi-dimensional space.

Figure 3: FaceNet Model result

Face recognition models often face challenges due to variations in lighting, pose, and occlusion, which can significantly affect model performance. In Figure 3:

  • The left and right images show two photos of the same person from different angles.
  • The top and bottom images represent photos of two different individuals.

From the figure, it is evident that the Euclidean distance between images of the same person is less than 1.1, while the distance between images of different individuals exceeds 1.1. In this case, the intra-class distance is clearly smaller than the inter-class distance, allowing a threshold of approximately 1.1 to be set for determining whether two images belong to the same person.

This threshold-based approach significantly improves the robustness of face recognition, making it possible to handle variations in lighting, pose, and occlusion more effectively.
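
A minimal sketch of this threshold rule is shown below, assuming 128-dimensional, L2-normalized embeddings; the value 1.1 is simply the threshold read off Figure 3 and would normally be tuned on a validation set.

```python
import numpy as np

def same_person(emb_a, emb_b, threshold=1.1):
    """Decide whether two embeddings belong to the same person."""
    distance = np.sum((emb_a - emb_b) ** 2)   # squared Euclidean distance
    return distance < threshold

# Two unrelated random unit vectors are usually far apart, so this prints False.
a = np.random.randn(128); a /= np.linalg.norm(a)
b = np.random.randn(128); b /= np.linalg.norm(b)
print(same_person(a, b))
```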

How does FaceNet work?

Previous face recognition methods based on convolutional neural networks (CNNs) trained on facial images typically connected a classification layer at the end. For recognizing untrained faces, these methods relied on the bottleneck layer. However, this approach had significant drawbacks, including low efficiency and poor performance of the bottleneck layer when applied to new, unseen face images.

To improve the accuracy of face recognition, FaceNet introduced three key advantages:

  1. Compact Feature Representation:
    FaceNet represents facial images as a 128-dimensional vector. This dramatically reduces the dimensionality of the feature vector after global pooling, thereby significantly reducing the computational cost.
  2. Similarity Measurement via Euclidean Distance:
    The similarity between different face images is measured using the Euclidean distance between their feature vectors. A smaller Euclidean distance indicates higher similarity between the two faces, while a larger distance indicates lower similarity.
  3. Triplet Loss for Classification:
    FaceNet replaces the traditional Softmax layer with Triplet Loss, which is applied directly to the face embeddings. Based on a given distance threshold, the system determines whether two face images belong to the same person.

Figure 4 illustrates the architecture of FaceNet. The pipeline is as follows:

  1. Input Processing:
    Facial images detected and cropped to a specified size by MTCNN are used as inputs to the model.
  2. Deep Learning Architecture:
    The cropped images are passed into a deep learning architecture, such as GoogleNet or Zeiler&Fergus networks, to extract features.
  3. Feature Normalization (L2 Normalization):
    The features output by the deep learning architecture are normalized using L2 normalization, mapping all feature vectors onto a hypersphere.
  4. Feature Embedding:
    The normalized feature vectors, called embeddings, represent the input facial image.
  5. Triplet Loss for Classification:
    Finally, the model uses Triplet Loss to classify the embeddings and determine whether two facial images represent the same person based on a predefined threshold.

This approach enables FaceNet to perform face recognition with high efficiency and accuracy, making it robust in handling new facial data.

Figure 4: FaceNet framework
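
The sketch below mirrors the inference pipeline in Figure 4, with the deep architecture replaced by a placeholder function; in a real system `backbone` would be the trained GoogLeNet/Inception model restored from a checkpoint, so treat the names and shapes as assumptions.

```python
import numpy as np

def backbone(face_crop):
    """Placeholder for the deep architecture: 160x160x3 crop -> raw 128-d features."""
    rng = np.random.default_rng(0)
    return rng.standard_normal(128)

def embed(face_crop):
    features = backbone(face_crop)                  # deep features
    return features / np.linalg.norm(features)      # L2 normalization onto the hypersphere

embedding = embed(np.zeros((160, 160, 3)))
print(embedding.shape, np.linalg.norm(embedding))   # (128,) and a norm of ~1.0
```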

Triplet Loss Function

Similarity via Euclidean Distance

The similarity between two images is determined by the Euclidean distance between their feature vectors, as shown:
$d(x_1, x_2) = \| f(x_1) - f(x_2) \|_2^2$

Where:

  • $f(x_1)$ and $f(x_2)$ represent the mappings of images $x_1$ and $x_2$ onto a hypersphere in a $d$-dimensional Euclidean space.
  • $d(x_1, x_2)$ denotes the Euclidean distance between the two feature vectors.

Purpose of Triplet Loss

The Triplet Loss function calculates the loss for three input images at once. As illustrated in Figure 5, each calculation involves three samples:

  1. Anchor $x_a$: The target image.
  2. Positive $x_p$: An image of the same class as the Anchor.
  3. Negative $x_n$: An image of a different class from the Anchor.

Figure 5: Triplet Loss example

Objectives of Triplet Loss

  1. Minimize the distance between the Anchor and Positive pair:
    $d(x_a, x_p)$
  2. Maximize the distance between the Anchor and Negative pair:
    $d(x_a, x_n)$

The Triplet Loss function ensures that:

  • Images of the same class are pushed closer together in the embedding space.
  • Images of different classes are pulled farther apart.

This design enables the model to effectively distinguish between individuals by producing meaningful embeddings, making Triplet Loss a cornerstone of modern face recognition systems like FaceNet.

Triplet Loss Function Analysis: Intra-Class and Inter-Class Distance

Due to variations in pose, lighting, and occlusion, different photos of the same person may sometimes result in the Euclidean distance between the Anchor and Positive being greater than the distance between the Anchor and Negative. This is an undesirable situation.

As shown in Figure 6, before training, the distance between the Anchor and Negative might be smaller than the distance between the Anchor and Positive. After training, we aim to ensure that the intra-class distance (distance between Anchor and Positive) is significantly smaller than the inter-class distance (distance between Anchor and Negative). This relationship can be expressed mathematically as follows:

$$
\| f(x_i^a) - f(x_i^p) \|_2^2 + \alpha < \| f(x_i^a) - f(x_i^n) \|_2^2, \quad \forall (x_i^a, x_i^p, x_i^n) \in T
$$

Where:

  • $T$ is the set of all triplet combinations.
  • $x_i^a$, $x_i^p$, and $x_i^n$ represent the Anchor, Positive, and Negative samples, respectively, within the set $T$.
  • $\alpha$ is the margin between the intra-class and inter-class distances.

Figure 6: Triplet Loss Function

Final Triplet Loss Function

To ensure that all triplet combinations satisfy the above constraint, it can be rewritten as the final Triplet Loss function, in which only triplets that violate the margin contribute to the loss:

$$
L = \sum_{i=1}^N \left[ \| f(x_i^a) - f(x_i^p) \|_2^2 - \| f(x_i^a) - f(x_i^n) \|_2^2 + \alpha \right]_+
$$

Where:

  • $N$ is the total number of triplets.

Explanation

The Triplet Loss function:

  1. Penalizes cases where the intra-class distance ($d(x_a, x_p)$) is not sufficiently smaller than the inter-class distance ($d(x_a, x_n)$).
  2. Ensures that the model learns embeddings where samples of the same class are closer together and samples of different classes are farther apart, with a margin of at least $\alpha$.

This formulation helps improve the robustness and accuracy of the model in face recognition tasks.
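
A NumPy sketch of this loss over a batch of triplets is shown below; the hinge $[\cdot]_+$ appears as `np.maximum(..., 0)`, and the margin of 0.2 follows the value used in the FaceNet paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Triplet loss over (N, d) arrays of L2-normalized embeddings."""
    pos_dist = np.sum((anchor - positive) ** 2, axis=1)   # intra-class distances
    neg_dist = np.sum((anchor - negative) ** 2, axis=1)   # inter-class distances
    # Only triplets that violate the margin contribute to the loss.
    return np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0))

rng = np.random.default_rng(0)
a, p, n = (rng.standard_normal((4, 128)) for _ in range(3))
a, p, n = (x / np.linalg.norm(x, axis=1, keepdims=True) for x in (a, p, n))
print(triplet_loss(a, p, n))
```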

Efficient Triplet Selection During Training

During the training phase, it is necessary to continuously search for triplet combinations to compute the loss. However, in large datasets, exhaustively enumerating all triplets and calculating the triplet loss requires substantial computational resources. Additionally, many easily distinguishable images do not contribute meaningfully to the model’s convergence.

To address this problem, the authors of FaceNet (Schroff F., Kalenichenko D., Philbin J. "FaceNet: A Unified Embedding for Face Recognition and Clustering." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015: 815-823) proposed an efficient solution:

  1. For a given image, identify the Hard Positive:
    The least similar image belonging to the same person (i.e., the hardest to recognize as the same person).

  2. Identify the Hard Negative:
    The most similar image belonging to a different person (i.e., the hardest to distinguish as a different person).

Benefits of Hard Example Mining

By focusing on Hard Positives and Hard Negatives, the training process:

  • Reduces computational overhead by avoiding redundant comparisons with easily distinguishable images.
  • Enhances the model’s ability to handle challenging cases, improving its robustness and convergence efficiency.

This method ensures that the triplet loss focuses on the most informative and challenging samples, driving the model to better distinguish between subtle intra-class variations and inter-class similarities.
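
The sketch below mines the hardest positive and hardest negative for each anchor within a mini-batch; FaceNet's actual strategy (online selection of semi-hard negatives within large mini-batches) is more nuanced, so this simplified version is an assumption for illustration.

```python
import numpy as np

def mine_triplets(embeddings, labels):
    """For each anchor: hardest positive (same label, farthest) and
    hardest negative (different label, closest)."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.sum(diff ** 2, axis=-1)                 # pairwise squared distances
    triplets = []
    for i, label in enumerate(labels):
        same = (labels == label) & (np.arange(len(labels)) != i)
        other = labels != label
        if not same.any() or not other.any():
            continue
        hard_pos = np.where(same)[0][np.argmax(dists[i][same])]
        hard_neg = np.where(other)[0][np.argmin(dists[i][other])]
        triplets.append((i, hard_pos, hard_neg))
    return triplets

emb = np.random.randn(6, 128)
labels = np.array([0, 0, 1, 1, 2, 2])
print(mine_triplets(emb, labels))
```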

Result

In Figure 4, the term DEEP ARCHITECTURE refers to the deep learning framework. In the referenced paper, two neural network architectures were primarily used: Zeiler&Fergus and GoogleNet (Inception).

Face Database: LFW and Recognition Results

The well-known LFW (Labeled Faces in the Wild) face database is used for evaluation in this part. The paper reports recognition rates for various network architectures and input dimensions, as shown in Table 1:

Table 1. Recognition Accuracy for Different Network Configurations

Architecture             Input Dimension    Recognition Accuracy (VAL)
NN1 (Zeiler&Fergus)      220×220            87.9% ± 1.9
NN2 (Inception)          224×224            89.4% ± 1.6
NN3 (Inception)          160×160            88.3% ± 1.7
NN4 (Inception)          96×96              82.0% ± 2.3
NNS1 (Mini Inception)    —                  82.4% ± 2.4
NNS2 (Tiny Inception)    —                  51.9% ± 2.9

Preprocessing with MTCNN and Training Configuration

In this study:

  1. Preprocessing:
    The images in the LFW dataset were first processed using the MTCNN model. The output consisted of 160×160 cropped facial images, which were used as inputs for training the FaceNet model.

  2. Updates to GoogleNet:
    Over time, Google continued improving the GoogleNet architecture. This study primarily used Inception V4 to process the dataset.

  3. Training Parameters:

    • Learning rate: $0.1$
    • Exponential decay rate: $0.999$
    • Epochs: $400$

Key Takeaways

  • Larger input dimensions (e.g., 224×224 vs. 96×96) and more advanced architectures (e.g., Inception vs. Mini Inception) resulted in higher recognition accuracy.
  • Processing the dataset with MTCNN ensured high-quality cropped facial images, which improved the performance of the FaceNet model.
  • Inception V4 and well-tuned hyperparameters significantly enhanced the model’s robustness and accuracy on the LFW dataset.

The following screenshots show the outputs of the models.

Figure 7: MTCNN for face detection on LFW

From Figure 7, it can be observed that MTCNN successfully detected faces in all 13,233 images of the LFW dataset. The detected faces were then scaled and cropped to the required dimensions for subsequent processing.

Figure 8: Facenet training results map on LFW dataset

Figure 8 summarizes the performance of FaceNet on the LFW dataset, with the following key metrics:

  • Accuracy: $98.50%$
  • Validation Rate: $90.06%$
  • AUC (Area Under the Curve): $0.998$
  • EER (Equal Error Rate): $0.016$

Insights:

  • An AUC value close to $1$ indicates excellent model performance.
  • A low EER of $0.016$ further confirms that FaceNet effectively meets the requirements of face recognition tasks.
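
For reference, the sketch below shows how metrics such as AUC and EER can be computed from pairwise embedding distances; it uses synthetic scores and scikit-learn's `roc_curve`, which is an assumption about tooling rather than the evaluation code used here.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Synthetic evaluation pairs: `issame` marks same-person pairs and `distances`
# are their squared Euclidean distances (same-person pairs should be closer).
rng = np.random.default_rng(0)
issame = rng.integers(0, 2, size=1000).astype(bool)
distances = np.where(issame, rng.normal(0.7, 0.2, 1000), rng.normal(1.4, 0.2, 1000))

# Smaller distance means "same person", so use the negated distance as the score.
fpr, tpr, _ = roc_curve(issame, -distances)
roc_auc = auc(fpr, tpr)
eer = fpr[np.nanargmin(np.abs(fpr - (1 - tpr)))]   # point where FPR ~= FNR
print(f"AUC = {roc_auc:.3f}, EER = {eer:.3f}")
```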

Figure 9: Euclidean distance matrix between three pictures

Figure 9 presents a comparison of non-database facial images:

  • Images 0.jpg and 1.jpg: Two images of the same person taken from different angles.
  • Image 11.jpg: An image of a different individual.

All the images are of Asian faces. The results demonstrate:

  • Intra-class distances (same person) are consistently smaller than $0.8$.
  • Inter-class distances (different people) are significantly larger than $1.0$.

Conclusion:

These results show that FaceNet performs well even on Asian faces, maintaining clear separations between intra-class and inter-class distances. This indicates that the model is robust across diverse facial datasets.
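
Below is a small sketch of how a distance matrix like the one in Figure 9 is produced; synthetic unit vectors stand in for the real embeddings of 0.jpg, 1.jpg, and 11.jpg.

```python
import numpy as np

rng = np.random.default_rng(0)
names = ["0.jpg", "1.jpg", "11.jpg"]
embeddings = rng.standard_normal((3, 128))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Pairwise squared Euclidean distances between the three embeddings.
diff = embeddings[:, None, :] - embeddings[None, :, :]
dist_matrix = np.sum(diff ** 2, axis=-1)

for name, row in zip(names, dist_matrix):
    print(f"{name:>7}", np.round(row, 3))
```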
