Most state-of-the-art vision models are still trained with supervised learning which requires a large corpus of labeled images to work well. This limits ourselves from making use of unlabeled images available in much larger quantities to improve accuracy and robustness of state-of-the-art models.
In this paper, the authors use unlabeled images to improve the state-of-the-art ImageNet accuracy and show that the accuracy gain has an outsized impact on robustness.
The authors conduct experiments on ImageNet 2012 ILSVRC challenge prediction task.
The authors use JFT dataset, which has around 300M images. Although the images in the dataset have labels, they ignored the labels and treated them as unlabeled data.
They selected images that have confidence of the label higher than 0.3. For each class, they selected at most 130K images that have the highest confidence. Finally, for classes that have less than 130K images, they duplicated some images at random so that each class can have 130K images. Consequently, total number of images that we use for training a student model is 130M(80M unique images).
3. Algorithm: Self-training with Noisy Student
The algorithm is basically self-training, a method in semi-supervised learning, but there are several specific methods that the authors propose.
1. Adding noise to the student
We should add more sources of noise to the student while removing the noise in the teacher when the teacher generates the pseudo labels. Without noise, the student will be likely to just mimic teacher to zero the loss of the pseudo samples. Also, noise enforces local smoothness in the decision function on both labeled and unlabeled data.
The authors added 3 sources of noise.
- Dropout: applied to the final classification layer with a dropout rate of 0.5
- Stochastic Depth: Stochastic depth is a technique like dropout but applied to blocks rather than layers. The authors set the survival probability to 0.8 for the final layer and follow the linear decay rule for other layers.
- RandAugment: applied two random operations within the magnitude set to 27
Below experiment shows that adding noise to the student is important when self-training.
2. Big student
The architectures for the student and teacher models can be the same of different, but student model needs to be sufficiently large to fit more data. So to enable the student to learn a more powerful model, the authors made the student model larger than the teacher model.
3. Data balancing
We duplicate images in classes where there are not enough images. For classes where we have too many images, we take the images with the highest confidence.
4. Soft pseudo labels
The authors observed that soft pseudo labels are usually more stable and lead to faster convergence, especially when the teacher model has low accuracy.
5. Larger EfficientNet
The authors used EfficientNets as their baseline models because they provide better capacity for more data.
The authors scaled up EfficientNet-B7 in 3 different ways.
- EfficientNet-L0: wider and deeper but uses a lower resolution
- EfficientNet-L1: increased width from L0
- EfficientNet-L2: applied compound scaling (scaled widht, depth, resolution simultaneously)
6. Fix train-test resolution discrepancy
Following https://arxiv.org/abs/1906.06423 , the authors first trained with a smaller resolution for 359 epochs, fixed the shallow layers, then finetuneed the model with a larger resolution for 1.5 epochs on unaugmented labeled images.
EfficientNet-L2 needs to be trained for 3.5 days on a Cloud TPU v3 Pod, which has 2048 cores.
7. Iterative training
The authors ﬁrst improved the accuracy of EfﬁcientNet-B7 using EfﬁcientNet-B7 as both the teacher and the student. Then by using the improved B7 model as the teacher, they trained an EfﬁcientNet-L0 student model. Next, with the EfﬁcientNet-L0 as the teacher, we trained a student model EfﬁcientNet-L1, a wider model than L0. Afterward, we further increased the student model size to EfﬁcientNet-L2, with the EfﬁcientNet-L1 as the teacher. Lastly, they trained another EfﬁcientNet-L2 student by using the EfﬁcientNet-L2 model as the teacher.
4. ImageNet Results
Noisy Student with EfficientNet-L2 improved 2.4% from EfficientNetB7. 0.5% came from making the model larger and 1.9% came from Noisy Student.
Noisy Student outperforms the state-of-the-art accuracy of 86.4% by FixRes ResNeXt-101 WSL that requires 3.5 Billion Instagram images labeled with tags. As a comparison, Noisy Student only requires 300M unlabeled images, which is perhaps more easy to collect. Also, EfficientNet-L2 is approximately twice as small in the number of parameters compared to FixRes ResNeXt-101 WSL.
Noisy Student for EfficientNet B0-B7 without Iterative Training
It leads to a consistent improvement of around 0.8% for all model sizes. The results confirm that vision models can benefit from Noisy Student even without iterative training.
5. Robustness Results
Noisy student significantly improves robustness tested on ImageNet-A/C/P and FGSM adversarial attack.
ImageNet-A contains hard examples.
ImageNet-C contains images under severe corruptions such as snow, motion, blur and fog.
ImageNet-P contains images with different perturbations.
4. Adversarial Robustness against an FGSM attack
FGSM attack performs one gradient descent step on the input image with the update on each pixel set to $\epsilon$.