We have just experienced a very special Spring Festival, and we wish everyone a belated Happy New Year. We believe that, as the technical backbone of their companies, many readers are also working alongside their colleagues to keep every part of the business running, standing together in the same boat to fight the epidemic. Please take good care of personal and family protection, keep exercising, and build up your immunity. Let us cheer for Wuhan; the epidemic will soon be over!

Face detection in natural scenes poses great technical challenges. The Vision Intelligence Center of Meituan's AI Platform has made improvements on two fronts, the underlying algorithm model and the system architecture, and developed the high-precision face detection model VICFACE. The model has been deployed in Meituan's business lines and fully meets their performance requirements.

I. Background

Face detection automatically analyzes an image with artificial-intelligence techniques and returns the coordinates and size of every face in it. It is a core component of intelligent face-analysis applications and has broad academic and commercial value, for example in face recognition, face attribute analysis (age estimation, gender recognition, attractiveness scoring, and expression recognition), face beautification, intelligent video surveillance, face image filtering, smart image cropping, face AR games, and so on. Because natural-scene environments are complicated and highly variable, illumination is uncontrollable, and faces themselves appear in many poses, the detection task is very challenging (as shown in Figure 1). Over the past twenty years it has remained a research hotspot in both academia and industry.

Natural-scene face detection is also widely used across Meituan's businesses. To cope with the technical challenges of natural scenes while meeting the performance requirements of the business, the Vision Intelligence Center (VIC) of Meituan improved both the underlying algorithm model and the system architecture, and developed the high-precision face detection model VICFACE. Moreover, VICFACE reaches the leading mainstream level of the industry on the internationally renowned public benchmark WIDER FACE.

Figure 1 Examples of face detection in natural scenes

II. Current State of the Technology

Unlike deep-learning approaches, traditional methods for natural-scene face detection focus on designing two components: the feature representation and the classifier. The most representative work is the Viola-Jones algorithm [2], which uses hand-crafted Haar-like features together with the AdaBoost algorithm for model training. Traditional methods run fast on CPUs, produce interpretable results, and can achieve good performance in relatively controlled environments. However, as the scale of training data grows exponentially, their performance improves only to a limited extent, and in some complex scenes they cannot meet application needs at all.

With the growth of computing power and of training data, deep learning has achieved breakthrough progress on the face detection task and now holds an overwhelming advantage in detection performance. Deep-learning-based face detection algorithms can be roughly divided into three categories:

1) Cascade-based face detection algorithms.

2) Two-stage face detection algorithms.

3) Single-stage face detection algorithms.

Among these, the first category, cascade-based methods (such as Cascade CNN [3] and MTCNN [4]), runs fast with moderate detection performance and is suitable for scenes with limited computing power, simple backgrounds, and few faces. The second category, two-stage methods, is generally built on the Faster R-CNN [6] framework: the first stage generates candidate regions and the second stage classifies and regresses them. These methods achieve high detection accuracy, but their drawback is slow detection speed; representative methods include Face R-CNN [9], ScaleFace [10], and FDNet [11]. The third category, single-stage methods, is mainly based on anchor classification and regression, usually optimized on top of classic frameworks such as SSD [12] or RetinaNet [13]. Their detection speed is faster than two-stage methods and their detection performance is better than cascade-based methods, striking a balance between accuracy and speed; this is also the mainstream direction of current face detection research.

III. Optimization Approach and Business Applications

To meet the accuracy requirements of natural-scene applications while remaining practical to deploy, the Vision Intelligence Center (VIC) of Meituan adopted the mainstream anchor-based single-stage face detection scheme and optimized it in terms of data augmentation and sampling strategy, model structure design, and loss functions, developing the high-precision face detection model VICFACE. The relevant technical details are introduced below.

1. Data Augmentation and Sampling Strategy

Single-stage generic object detection algorithms are quite sensitive to data augmentation. For example, the classic SSD algorithm improves its mAP on the VOC2007 [50] dataset by 6.7 points through data augmentation. The classic single-stage face detector S3FD [17] also designed its own augmentation strategy, including random cropping, fixed-aspect-ratio scaling, color jittering, and horizontal flipping.

PyramidBox [18], published by Baidu at ECCV 2018, proposed the Data-Anchor sampling method: a face randomly selected from the image is resized to the size of a smaller, randomly chosen anchor, and the whole training image is scaled accordingly. The benefit is that larger faces are used to generate more small faces, raising the proportion of small faces in training; on the Easy, Medium, and Hard subsets of the WIDER FACE [1] dataset this improves AP by 0.4 (94.3 -> 94.7), 0.4 (93.3 -> 93.7), and 0.6 (86.1 -> 86.7) respectively. ISRN [19] combines the SSD augmentation scheme with Data-Anchor sampling, further improving detection performance.

On top of the ISRN augmentation scheme, VICFACE filters out semantically ambiguous, extremely small faces. Mixup [22], which has proved effective in image classification and object detection, is applied here to face detection and effectively alleviates model overfitting. Considering that business data contain many faces with extreme poses, occlusion, and blur, and that such samples are rare in training and difficult to detect, these hard samples are dynamically given higher weights during training, which improves their recall.
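As a rough illustration of how mixup carries over to detection, the sketch below blends two equal-sized training images and keeps the boxes of both, weighting each box's contribution by its image's mixing coefficient. The function name and the box-weighting convention are our own assumptions for illustration, not VICFACE's actual implementation.

```python
import numpy as np

def mixup_detection(img_a, boxes_a, img_b, boxes_b, lam):
    """Blend two equal-sized training images with coefficient lam.

    In practice lam is typically drawn from a Beta(alpha, alpha)
    distribution, e.g. lam = np.random.beta(1.5, 1.5).
    """
    mixed = lam * img_a + (1.0 - lam) * img_b
    # A common convention for detection: keep the boxes of both images
    # and weight each box's loss by its image's mixing coefficient.
    boxes = boxes_a + boxes_b
    box_weights = [lam] * len(boxes_a) + [1.0 - lam] * len(boxes_b)
    return mixed, boxes, box_weights
```

Compared with classification-style mixup, no labels are interpolated; instead every face box from either source image stays a positive sample with a softened loss weight.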

2. Model structure design

The structure of a face detection model comprises four main parts: the detection framework, the backbone network, the prediction module, and the anchor settings together with the division of positive and negative samples. These are the core of single-stage face detection optimization.

Detection framework

In recent years, single-stage face detection frameworks have developed considerably. Representative structures include the SSD used in S3FD [17], the RetinaNet used in SFDet [25], the Selective Refinement Network (SRN) [23], and the dual-shot structure used in DSFD [24], as shown in Figure 2 below. Among them, SRN is a single-stage, two-step face detection method: it uses the first-step detection results to filter out easy negative samples for small faces, improving the balance between positive and negative samples, and it localizes large faces in a cascaded fashion, improving their localization accuracy. Both refinements raise the overall accuracy of face detection. Our evaluation shows that SRN achieves the best detection results on WIDER FACE (measured by AP under the standard protocol), as shown in Table 1.





Figure 2 four detection structures

Table 1 Results of the four detection structures on WIDER FACE with a ResNet50 backbone

VICFACE inherits the SRN detection structure, the best-performing framework at present, and, in order to fuse the bottom-up and top-down features better, assigns different learnable weights to different feature maps. Taking P4 as an example:

P4 = w_c4 ⊙ conv(C4) + w_p4 ⊙ upsample(P5)

where the number of elements of the vector w_c4 equals the number of channels of the conv(C4) feature, the length of w_p4 equals the number of channels of upsample(P5), both w_c4 and w_p4 are learnable with all elements greater than 0, and each pair of corresponding elements of w_c4 and w_p4 sums to 1. The structure is shown in Figure 3.
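A minimal NumPy sketch of this weighted fusion follows, with the per-channel weights normalized by a pairwise softmax so each weight is positive and each pair sums to 1. The parameter names (theta_c4, theta_p5) are our own; the article does not specify how the constraint is enforced, so the softmax parameterization is an assumption.

```python
import numpy as np

def fuse_p4(conv_c4, upsample_p5, theta_c4, theta_p5):
    """Weighted fusion: P4 = w_c4 * conv(C4) + w_p4 * upsample(P5).

    conv_c4, upsample_p5: feature maps of shape (C, H, W).
    theta_c4, theta_p5: learnable per-channel parameters of shape (C,).
    A pairwise softmax keeps each weight positive and each
    corresponding pair summing to 1, as the constraint requires.
    """
    e_c4, e_p5 = np.exp(theta_c4), np.exp(theta_p5)
    w_c4 = e_c4 / (e_c4 + e_p5)
    w_p5 = 1.0 - w_c4
    # Broadcast the (C,) weight vectors over the spatial dimensions.
    return (w_c4[:, None, None] * conv_c4
            + w_p5[:, None, None] * upsample_p5)
```

With equal parameters the fusion degenerates to a plain average of the two feature maps; during training the weights drift toward whichever pathway is more informative per channel.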

Figure 3 Overall network structure of VICFACE (Vision Intelligence Center)

Backbone network

The backbone of a single-stage face detection model typically uses a classic structure from the image classification task (such as VGG [26] or ResNet [27]). In general, the higher the backbone's classification accuracy on the ImageNet dataset, the higher its face detection performance on WIDER FACE, as shown in Table 2. To ensure a high recall for the detection network, VICFACE uses ResNet152, whose top-1 classification accuracy on ImageNet is 80.26, as its backbone, with two modifications: the 7×7 stride-2 convolution module is replaced with three 3×3 convolution modules, of which the first has stride 2 and the others stride 1; and the 1×1 stride-2 convolution module is replaced with a stride-2 AvgPool module.

Table 2 Performance of different backbone networks on ImageNet and their detection accuracy under the RetinaNet framework
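To see why the stem replacement preserves the downsampling behavior, the small arithmetic check below compares the output size of a single 7×7 stride-2 convolution with that of three stacked 3×3 convolutions whose first layer has stride 2. Both halve the resolution, while the stacked version adds two extra nonlinearities. The input size of 224 is just the standard ImageNet resolution used for illustration.

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution layer."""
    return (size + 2 * pad - kernel) // stride + 1

# Original ResNet stem: a single 7x7 convolution with stride 2.
orig_out = conv_out(224, kernel=7, stride=2, pad=3)

# Modified stem: three 3x3 convolutions, the first with stride 2.
size = 224
for stride in (2, 1, 1):
    size = conv_out(size, kernel=3, stride=stride, pad=1)

# Both stems halve the 224x224 input to 112x112.
assert orig_out == size == 112
```

The same arithmetic explains the AvgPool swap in the downsampling branch: a stride-2 AvgPool halves the resolution without discarding three quarters of the activations the way a stride-2 1×1 convolution does.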


Prediction module


Using context information can further improve the detection performance of the model. SSH [36] was an early single-stage face detection scheme of this kind, and PyramidBox, SRN, DSFD and others also designed their own context modules. As shown in Figure 4, the context module of SRN uses 1×k and k×1 convolutions to provide receptive fields of several rectangular shapes, which helps detect faces in extreme poses, while DSFD uses multiple dilated convolutions to greatly enlarge the receptive field.

Figure 4 Context modules in different network structures


In VICFACE, ordinary convolution modules are combined with 1×k and k×1 convolution modules to form the context module, which likewise enriches the shapes of the receptive field and helps detect faces in extreme poses, while a Maxout module is used to raise the recall rate and reduce false detections. VICFACE also uses the face positions predicted by the Cn layers to calibrate the regions that the Pn layer features correspond to, as shown in Figure 5: the offset of the predicted face position relative to the feature position serves as the offset input of a deformable convolution, and the Pn layer feature serves as its data input. The calibrated features align better with the face regions and are therefore more representative, which improves the performance of the face detection model.
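The appeal of mixing 1×k and k×1 convolutions is that different branches see differently shaped regions of the image. The small helper below (our own illustration, not code from the article) makes the receptive-field shapes explicit for stacks of stride-1 convolutions:

```python
def receptive_field(stack):
    """Receptive field (height, width) of a stack of stride-1
    convolutions, each given as a (kernel_h, kernel_w) pair."""
    rf_h = rf_w = 1
    for kh, kw in stack:
        rf_h += kh - 1
        rf_w += kw - 1
    return rf_h, rf_w

# A square 3x3 branch sees a square region...
assert receptive_field([(3, 3)]) == (3, 3)
# ...while 1xk and kx1 branches see wide and tall rectangles,
# which helps cover strongly rotated or profile faces.
assert receptive_field([(1, 7)]) == (1, 7)
assert receptive_field([(7, 1)]) == (7, 1)
# Stacking 1x7 then 7x1 recovers a large square field at lower
# cost than a single 7x7 convolution.
assert receptive_field([(1, 7), (7, 1)]) == (7, 7)
```

A context module typically concatenates the outputs of several such branches, so each spatial position aggregates evidence from square, wide, and tall neighborhoods at once.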

Figure 5 Prediction module in the self-developed detection model

Anchor settings and positive/negative sample division

In the self-developed scheme, the anchor sizes at the C3 and P3 layers are 2S and 4S, and those of the other layers are 4S (S denotes the stride of the corresponding layer). This anchor setting guarantees the recall rate of faces while reducing the number of negative samples, mitigating the sample imbalance to a certain extent. Based on statistics of face aspect ratios, the anchor aspect ratio is set to 0.8. At the Cn layers, samples whose IoU is greater than 0.7 are labeled positive and those below 0.3 negative; at the Pn layers, samples whose IoU is greater than 0.5 are labeled positive and those below 0.4 negative.
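The IoU-based sample division described above can be sketched as follows (NumPy, boxes in (x1, y1, x2, y2) form, shown with the Cn-layer thresholds 0.7/0.3; anchors whose IoU falls between the two thresholds are ignored during training). Function names are ours for illustration.

```python
import numpy as np

def iou(gt, anchors):
    """IoU between one ground-truth box and an (N, 4) array of anchors,
    all boxes given as (x1, y1, x2, y2)."""
    x1 = np.maximum(gt[0], anchors[:, 0])
    y1 = np.maximum(gt[1], anchors[:, 1])
    x2 = np.minimum(gt[2], anchors[:, 2])
    y2 = np.minimum(gt[3], anchors[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    area_an = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    return inter / (area_gt + area_an - inter)

def assign_labels(anchors, gt, pos_thr=0.7, neg_thr=0.3):
    """Label anchors: 1 = positive, 0 = negative, -1 = ignored."""
    overlaps = iou(gt, anchors)
    labels = np.full(len(anchors), -1)
    labels[overlaps < neg_thr] = 0
    labels[overlaps > pos_thr] = 1
    return labels
```

For the Pn layers the same routine would simply be called with pos_thr=0.5 and neg_thr=0.4.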

3. Loss function

The optimization target of face detection must not only distinguish faces from background (classification) but also localize the position and size of each face (regression). S3FD uses the cross-entropy loss for classification and Smooth L1 loss for localization, and applies hard negative mining to address the imbalance between positive and negative samples. Focal Loss [13] offers a more direct way to alleviate this imbalance. UnitBox [41] proposes that IoU Loss can alleviate the performance loss caused by localization losses at different scales. AInnoFace [40] uses Focal Loss and IoU Loss together to improve the performance of the face detection model. Introducing related auxiliary tasks can also improve face detection: RetinaFace [42] introduces a keypoint localization task to improve localization accuracy, while DFS [43] introduces a face segmentation task to improve the feature representation.

Drawing on the strengths of the above methods, VICFACE exploits the complementary information between face detection and related tasks, training the face detection model in a multi-task fashion. Focal Loss is used in face classification to alleviate sample imbalance, while face keypoint localization and face segmentation serve as auxiliary tasks for the classification target, improving overall classification accuracy. For face localization, Complete IoU Loss [47] is used: taking the discrepancy between the target and the predicted box as the loss alleviates the difference among losses for faces of different scales, and simultaneously accounting for the center-point distance and the aspect-ratio difference between the target and the predicted box yields better overall detection performance.
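As a sketch of the classification term, the binary focal loss down-weights easy examples with a (1 - p_t)^gamma modulating factor. The NumPy version below uses alpha = 0.25 and gamma = 2, the defaults from the Focal Loss paper, not values disclosed for VICFACE.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Per-anchor binary focal loss.

    p: predicted face probability in (0, 1); y: 1 for face anchors,
    0 for background. Easy examples (p_t near 1) are suppressed by
    the (1 - p_t)**gamma modulating factor, so the flood of easy
    background anchors no longer dominates the gradient.
    """
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

A well-classified positive (p = 0.9) contributes orders of magnitude less loss than a hard one (p = 0.1), which is exactly the rebalancing effect described above.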

4. Optimization Results and Business Applications

With the support of the company's cluster platform, the natural-scene face detection model of the Vision Intelligence Center was compared with the existing mainstream schemes and took the lead on all three validation subsets, Easy, Medium, and Hard, of the international open face detection benchmark WIDER FACE (AP is average precision; the higher, the better), as shown in Figure 6 and Table 3.




Figure 6 Evaluation results of VICFACE and current mainstream face detection methods on WIDER FACE

Table 3 Results of VICFACE and current mainstream face detection methods on WIDER FACE

Note: SRN is a method proposed by the Chinese Academy of Sciences at AAAI 2019; DSFD is a method proposed at CVPR 2019; PyramidBox++ and AInnoFace are methods proposed in 2019; RetinaFace was the runner-up of the ICCV 2019 WIDER Challenge.

In business applications, the natural-scene face detection service has been integrated into multiple Meituan business lines and meets their performance requirements in UGC image intelligent filtering and advertising POI image display. The former protects user privacy and prevents infringement of users' portrait rights; the latter effectively prevents faces from being cropped apart in displayed images, improving the user experience. In addition, VICFACE provides a core base model for other intelligent face analysis applications, such as automatically checking whether kitchen staff are dressed properly (wearing hats and masks), adding a safeguard for food safety.

In future work, to give users a better experience, we will further explore and optimize for latency and high-concurrency requirements. On the algorithm side, anchor-free single-stage detection methods have shown great potential in generic object detection in recent years and are also an important future direction for the Vision Intelligence Center.


References

1. Yang S, Luo P, Loy C C, et al. WIDER FACE: A face detection benchmark[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 5525-5533.

2. Viola P, Jones M J. Robust real-time face detection[J]. International Journal of Computer Vision, 2004, 57(2): 137-154.

3. Li H, Lin Z, Shen X, et al. A convolutional neural network cascade for face detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 5325-5334.

4. Zhang K, Zhang Z, Li Z, et al. Joint face detection and alignment using multitask cascaded convolutional networks[J]. IEEE Signal Processing Letters, 2016, 23(10): 1499-1503.

5. Hao Z, Liu Y, Qin H, et al. Scale-aware face detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 6186-6195.

6. Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//Advances in Neural Information Processing Systems. 2015: 91-99.

7. Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2117-2125.

8. Jiang H, Learned-Miller E. Face detection with the Faster R-CNN[C]//2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017). IEEE, 2017: 650-657.

9. Wang H, et al. Face R-CNN[J]. arXiv preprint arXiv:1706.01061, 2017.

10. Yang S, Xiong Y, Loy C C, et al. Face detection through scale-friendly deep convolutional networks[J]. arXiv preprint arXiv:1706.02863, 2017.

11. Zhang C, Xu X, Tu D. Face detection using improved Faster RCNN[J]. arXiv preprint arXiv:1802.02142, 2018.

12. Liu W, Anguelov D, Erhan D, et al. SSD: Single shot multibox detector[C]//European Conference on Computer Vision. Springer, Cham, 2016: 21-37.

13. Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 2980-2988.

14. Huang L, Yang Y, Deng Y, et al. DenseBox: Unifying landmark localization with end to end object detection[J]. arXiv preprint arXiv:1509.04874, 2015.

15. Liu W, Liao S, Ren W, et al. High-level semantic feature detection: A new perspective for pedestrian detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 5187-5196.

16. Zhang Z, He T, Zhang H, et al. Bag of freebies for training object detection neural networks[J]. arXiv preprint arXiv:1902.04103, 2019.

17. Zhang S, Zhu X, Lei Z, et al. S3FD: Single shot scale-invariant face detector[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 192-201.

18. Tang X, Du D K, He Z, et al. PyramidBox: A context-assisted single shot face detector[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 797-813.

19. Zhang S, Zhu R, Wang X, et al. Improved selective refinement network for face detection[J]. arXiv preprint arXiv:1901.06651, 2019.

20. Li Z, Tang X, Han J, et al. PyramidBox++: High performance detector for finding tiny face[J]. arXiv preprint arXiv:1904.00386, 2019.

21. Zhang S, Zhu X, Lei Z, et al. FaceBoxes: A CPU real-time face detector with high accuracy[C]//2017 IEEE International Joint Conference on Biometrics (IJCB). IEEE, 2017: 1-9.

22. Zhang H, Cisse M, Dauphin Y N, et al. mixup: Beyond empirical risk minimization[J]. arXiv preprint arXiv:1710.09412, 2017.

23. Chi C, Zhang S, Xing J, et al. Selective refinement network for high performance face detection[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33: 8231-8238.

24. Li J, Wang Y, Wang C, et al. DSFD: Dual shot face detector[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 5060-5069.

25. Zhang S, Wen L, Shi H, et al. Single-shot scale-aware network for real-time face detection[J]. International Journal of Computer Vision, 2019, 127(6-7): 537-559.

26. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.

27. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.

28. Xie S, Girshick R, Dollár P, et al. Aggregated residual transformations for deep neural networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 1492-1500.

29. Iandola F, Moskewicz M, Karayev S, et al. DenseNet: Implementing efficient ConvNet descriptor pyramids[J]. arXiv preprint arXiv:1404.1869, 2014.

30. Howard A G, Zhu M, Chen B, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications[J]. arXiv preprint arXiv:1704.04861, 2017.

31. Sandler M, Howard A, Zhu M, et al. MobileNetV2: Inverted residuals and linear bottlenecks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 4510-4520.

32. Bazarevsky V, Kartynnik Y, Vakunov A, et al. BlazeFace: Sub-millisecond neural face detection on mobile GPUs[J]. arXiv preprint arXiv:1907.05047, 2019.

33. He Y, Xu D, Wu L, et al. LFFD: A light and fast face detector for edge devices[J]. arXiv preprint arXiv:1904.10633, 2019.

34. Zhu R, Zhang S, Wang X, et al. ScratchDet: Exploring to train single-shot object detectors from scratch[J]. arXiv preprint arXiv:1810.08425, 2018.

35. Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: Common objects in context[C]//European Conference on Computer Vision. Springer, Cham, 2014: 740-755.

36. Najibi M, Samangouei P, Chellappa R, et al. SSH: Single stage headless face detector[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 4875-4884.

37. Earp S A, Noinongyao P, Cairns J, et al. Face detection with feature pyramids and landmarks[J]. arXiv preprint arXiv:1912.00596, 2019.

38. Goodfellow I J, Warde-Farley D, Mirza M, et al. Maxout networks[J]. arXiv preprint arXiv:1302.4389, 2013.

39. Zhu C, Tao R, Luu K, et al. Seeing small faces from robust anchor's perspective[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 5127-5136.

40. Zhang F, Fan X, Ai G, et al. Accurate face detection for high performance[J]. arXiv preprint arXiv:1905.01585, 2019.

41. Yu J, Jiang Y, Wang Z, et al. UnitBox: An advanced object detection network[C]//Proceedings of the 24th ACM International Conference on Multimedia. ACM, 2016: 516-520.

42. Deng J, Guo J, Zhou Y, et al. RetinaFace: Single-stage dense face localisation in the wild[J]. arXiv preprint arXiv:1905.00641, 2019.

43. Tian W, Wang Z, Shen H, et al. Learning better features for face detection with feature fusion and segmentation supervision[J]. arXiv preprint arXiv:1811.08557, 2018.

44. Zhang Y, Xu X, Liu X. Robust and high performance face detector[J]. arXiv preprint arXiv:1901.02350, 2019.

45. Zhang S, Chi C, Lei Z, et al. RefineFace: Refinement neural network for high performance face detection[J]. arXiv preprint arXiv:1909.04376, 2019.

46. Wang J, Yuan Y, Li B, et al. SFace: An efficient network for face detection in large scale variations[J]. arXiv preprint arXiv:1804.06559, 2018.

47. Zheng Z, Wang P, Liu W, et al. Distance-IoU loss: Faster and better learning for bounding box regression[J]. arXiv preprint arXiv:1911.08287, 2019.

48. Bay H, Tuytelaars T, Van Gool L. SURF: Speeded up robust features[C]//European Conference on Computer Vision. Springer, Berlin, Heidelberg, 2006: 404-417.

49. Yang B, Yan J, Lei Z, et al. Aggregate channel features for multi-view face detection[C]//IEEE International Joint Conference on Biometrics. IEEE, 2014: 1-8.

50. Everingham M, Van Gool L, Williams C K I, et al. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results[J]. 2007.

51. Redmon J, Farhadi A. YOLOv3: An incremental improvement[J]. arXiv preprint arXiv:1804.02767, 2018.

About the Author

Zhenhua, Huanhuan, and Xiaolin are all engineers at the Vision Intelligence Center.

Job Openings


The Basic Vision Group of Meituan's Vision Intelligence Center is mainly responsible for consolidating the core basic technologies underlying visual intelligence and providing platform-level vision solutions for the company's businesses. Its main directions include basic model optimization, large-scale distributed training, server-side inference efficiency optimization, mobile-side adaptation optimization, and the incubation of innovative products.

We welcome colleagues in computer vision and related fields to join us. Resumes can be sent to Tech@meituan.com (please note in the subject line: Meituan Vision Intelligence Center Basic Vision Group).