1. This paper presents a simple but strong baseline to efficiently adapt a pre-trained image-based visual-language (I-VL) model for resource-hungry video understanding tasks.
2. The proposed method optimises a small number of randomly initialised vectors, termed continuous prompt vectors, to convert video-related tasks into the same format as the pre-training objectives (see the illustrative sketch after this list).
3. Experiments on 10 public benchmarks of action recognition, action localisation, and text-video retrieval show competitive or state-of-the-art performance despite optimising significantly fewer parameters.
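To make the idea in point 2 concrete, below is a minimal sketch of continuous prompt tuning around a frozen text encoder: learnable prefix/suffix vectors are wrapped around class-name token embeddings, and only those vectors receive gradients while the pre-trained backbone stays fixed. All names here (`PromptedClassifier`, `FrozenTextEncoder`, the dimensions and hyper-parameters) are illustrative assumptions, not the authors' released code.

```python
# Sketch of continuous prompt tuning: only the prompt vectors are trained,
# the pre-trained I-VL backbone (stand-in below) remains frozen.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenTextEncoder(nn.Module):
    """Stand-in for the text tower of a pre-trained I-VL model (e.g. CLIP)."""
    def __init__(self, dim=512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        for p in self.parameters():           # backbone weights stay frozen
            p.requires_grad_(False)

    def forward(self, x):                     # x: (C, T, dim) token embeddings
        return self.encoder(x).mean(dim=1)    # pooled text embedding (C, dim)

class PromptedClassifier(nn.Module):
    """Learnable prompt vectors wrapped around class-name token embeddings."""
    def __init__(self, class_embeds, n_prefix=8, n_suffix=8, dim=512):
        super().__init__()
        self.register_buffer("class_embeds", class_embeds)   # (C, L, dim), fixed
        self.prefix = nn.Parameter(torch.randn(n_prefix, dim) * 0.02)
        self.suffix = nn.Parameter(torch.randn(n_suffix, dim) * 0.02)
        self.text_encoder = FrozenTextEncoder(dim)

    def class_features(self):
        C = self.class_embeds.shape[0]
        pre = self.prefix.unsqueeze(0).expand(C, -1, -1)
        suf = self.suffix.unsqueeze(0).expand(C, -1, -1)
        tokens = torch.cat([pre, self.class_embeds, suf], dim=1)
        return F.normalize(self.text_encoder(tokens), dim=-1)  # (C, dim)

    def forward(self, video_feats):           # video_feats: (B, dim), from a frozen visual encoder
        v = F.normalize(video_feats, dim=-1)
        return v @ self.class_features().t()  # cosine-similarity logits (B, C)

# Only the prompt vectors are optimised; everything else is kept fixed.
model = PromptedClassifier(class_embeds=torch.randn(10, 4, 512))
optim = torch.optim.AdamW([model.prefix, model.suffix], lr=1e-3)
logits = model(torch.randn(2, 512))
loss = F.cross_entropy(logits, torch.tensor([3, 7]))
loss.backward()
optim.step()
```

Because the trainable state is just the handful of prompt vectors, the per-task parameter count is tiny compared with fine-tuning the full model, which is what underlies the efficiency claim in point 3.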
The article is generally trustworthy in its presentation of the research findings. The authors give detailed descriptions of the proposed method and its components, together with extensive ablation studies of the critical design choices. Experiments span 10 public benchmarks across closed-set, few-shot, and zero-shot scenarios, providing evidence for the claims made. The writing contains no promotional content or evident partiality; competing considerations are presented even-handedly and possible risks are noted where appropriate. Overall, the article can be considered a reliable account of the work.