1. This paper presents a new vision Transformer, called Swin Transformer, that is capable of serving as a general-purpose backbone for computer vision.
2. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.
3. Swin Transformer has the flexibility to model at various scales, has linear computational complexity with respect to image size, and surpasses the previous state-of-the-art by a large margin.
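To make the windowing idea in points 2 and 3 concrete, here is a minimal sketch of window partitioning and the cyclic shift that enables cross-window connections. This is an illustrative NumPy reconstruction, not the authors' implementation; the function names `window_partition` and `cyclic_shift` and the toy shapes are assumptions for the example.

```python
import numpy as np

def window_partition(x, window_size):
    """Split a (H, W, C) feature map into non-overlapping windows.

    Self-attention is then computed inside each window, so the cost grows
    linearly with the number of windows (i.e. with image size) rather than
    quadratically with the total number of tokens H*W.
    """
    H, W, C = x.shape
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    # -> (num_windows, window_size * window_size, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size * window_size, C)

def cyclic_shift(x, window_size):
    """Roll the feature map by half a window before partitioning, so the
    next attention layer mixes tokens across the previous window borders."""
    shift = window_size // 2
    return np.roll(x, shift=(-shift, -shift), axis=(0, 1))

# Toy feature map: 4x4 spatial grid with 1 channel, window size 2.
x = np.arange(16, dtype=np.float32).reshape(4, 4, 1)
windows = window_partition(x, window_size=2)           # 4 windows of 4 tokens
shifted_windows = window_partition(cyclic_shift(x, 2), window_size=2)
```

Alternating plain and shifted windows across consecutive layers is what lets information propagate between neighboring windows while keeping each attention computation local.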
The article appears reliable and trustworthy in its content and claims. It provides detailed information about the proposed Swin Transformer, including its hierarchical design, its shifted window approach, and its performance on image classification, object detection, and semantic segmentation. The authors support their claims with benchmark results: top-1 accuracy on ImageNet-1K, box and mask AP on COCO test-dev, and mIoU on ADE20K val. They also make their code and models publicly available, which further adds to the credibility of the work.
However, some potential biases should be noted. The authors do not discuss possible risks of adopting the transformer, nor do they address counterarguments that could be raised against it. They also do not present both sides equally: the discussion focuses on the advantages over existing methods without giving comparable attention to the drawbacks or limitations of their approach.