Understanding The Robustness in Vision Transformers
Abstract
Recent studies show that Vision Transformers (ViTs) exhibit strong robustness against various corruptions. Although this property is partly attributed to the self-attention mechanism, a systematic understanding is still lacking. In this paper, we examine the role of self-attention in learning robust representations. Our study is motivated by the intriguing properties of the emerging visual grouping in Vision Transformers, which indicates that self-attention may promote robustness through improved mid-level representations. We further propose a family of fully attentional networks (FANs) that strengthen this capability by incorporating an attentional channel processing design. We validate the design comprehensively on various hierarchical backbones. Our model achieves a state-of-the-art 87.1% accuracy on ImageNet-1k and 35.8% mCE on ImageNet-C with 76.8M parameters. We also demonstrate state-of-the-art accuracy and robustness in two downstream tasks: semantic segmentation and object detection. Code will be available at this https URL.
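To make the "attentional channel processing" idea concrete, below is a minimal sketch assuming a pre-norm ViT-style block in which the usual MLP channel mixer is replaced by self-attention applied across the channel dimension. All names here (`ChannelAttention`, `FANBlockSketch`) are illustrative placeholders, not the authors' released implementation, and details of the paper's actual channel-attention design may differ.

```python
# A minimal sketch (not the authors' released code) of a "fully attentional"
# block: standard token self-attention followed by self-attention applied
# across channels in place of the usual MLP channel mixer.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Self-attention over the channel axis: transposing (B, N, C) -> (B, C, N)
    yields a C x C attention map, so channels aggregate from other channels."""

    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)            # each (B, N, C)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # each (B, C, N)
        attn = (q @ k.transpose(-2, -1)) / N ** 0.5       # (B, C, C)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2)  # back to (B, N, C)
        return self.proj(out)


class FANBlockSketch(nn.Module):
    """Pre-norm transformer block with attentional channel processing."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.channel_attn = ChannelAttention(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.token_attn(h, h, h, need_weights=False)[0]  # tokens mix
        x = x + self.channel_attn(self.norm2(x))                 # channels mix
        return x


if __name__ == "__main__":
    blk = FANBlockSketch(dim=384)           # hypothetical width
    tokens = torch.randn(2, 196, 384)       # (batch, tokens, channels)
    print(blk(tokens).shape)                # torch.Size([2, 196, 384])
```

Note the design choice in the sketch: transposing before the attention makes the attention matrix C x C rather than N x N, so the channel mixer's cost grows with network width instead of sequence length.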
Additional Information
© 2022 by the author(s).
Attached Files
Published - zhou22m.pdf
Submitted - 2204.12451.pdf
Files
Name | Size
---|---
zhou22m.pdf (md5:094b4131866249faf90be08e5f6b5f4d) | 5.1 MB
2204.12451.pdf (md5:374df7200aea6db2276e43ecec7000f1) | 5.2 MB
Additional details
- Eprint ID: 115585
- Resolver ID: CaltechAUTHORS:20220714-212518736
- Created: 2022-07-15 (from EPrint's datestamp field)
- Updated: 2023-06-02 (from EPrint's last_modified field)