Understanding The Robustness in Vision Transformers
Abstract
Recent studies show that Vision Transformers (ViTs) exhibit strong robustness against various corruptions. Although this property is partly attributed to the self-attention mechanism, a systematic understanding is still lacking. In this paper, we examine the role of self-attention in learning robust representations. Our study is motivated by the intriguing properties of the emerging visual grouping in Vision Transformers, which indicates that self-attention may promote robustness through improved mid-level representations. We further propose a family of fully attentional networks (FANs) that strengthen this capability by incorporating an attentional channel processing design. We validate the design comprehensively on various hierarchical backbones. Our model achieves a state-of-the-art 87.1% accuracy on ImageNet-1k and 35.8% mCE on ImageNet-C with 76.8M parameters. We also demonstrate state-of-the-art accuracy and robustness in two downstream tasks: semantic segmentation and object detection. Code will be available at this https URL.
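To make the "attentional channel processing" idea concrete, below is a minimal sketch assuming a pre-norm ViT-style block in which the usual MLP channel mixer is replaced by self-attention applied across the channel dimension. All names here (`ChannelAttention`, `FANBlockSketch`) are illustrative placeholders, not the authors' released implementation, and details of the paper's actual channel-attention design may differ.

```python
# A minimal sketch (not the authors' released code) of a "fully attentional"
# block: standard token self-attention followed by self-attention applied
# across channels in place of the usual MLP channel mixer.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Self-attention over the channel axis: transposing (B, N, C) -> (B, C, N)
    yields a C x C attention map, so channels aggregate from other channels."""

    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)            # each (B, N, C)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # each (B, C, N)
        attn = (q @ k.transpose(-2, -1)) / N ** 0.5       # (B, C, C)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2)  # back to (B, N, C)
        return self.proj(out)


class FANBlockSketch(nn.Module):
    """Pre-norm transformer block with attentional channel processing."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.channel_attn = ChannelAttention(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.token_attn(h, h, h, need_weights=False)[0]  # tokens mix
        x = x + self.channel_attn(self.norm2(x))                 # channels mix
        return x


if __name__ == "__main__":
    blk = FANBlockSketch(dim=384)           # hypothetical width
    tokens = torch.randn(2, 196, 384)       # (batch, tokens, channels)
    print(blk(tokens).shape)                # torch.Size([2, 196, 384])
```

Note the design choice in the sketch: transposing before the attention makes the attention matrix C x C rather than N x N, so the channel mixer's cost grows with network width instead of sequence length.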
Additional Information
© 2022 by the author(s).
Attached Files
Published - zhou22m.pdf
Submitted - 2204.12451.pdf
Files
Name | Size
---|---
zhou22m.pdf (md5:094b4131866249faf90be08e5f6b5f4d) | 5.1 MB
2204.12451.pdf (md5:374df7200aea6db2276e43ecec7000f1) | 5.2 MB
Additional details
- Eprint ID: 115585
- Resolver ID: CaltechAUTHORS:20220714-212518736
- Created: 2022-07-15 (from EPrint's datestamp field)
- Updated: 2023-06-02 (from EPrint's last_modified field)