Briefing Doc: ByteDance Tarsier – A Large-Scale Video Description Model
Source: Wang, J., Yuan, L., Zhang, Y., & Sun, H. (2023). Tarsier: Recipes for Training and Evaluating Large Video Description Models. arXiv preprint arXiv:2312.00846.
Date: 2023-12-01
Summary: This paper introduces Tarsier, a family of large-scale video-language models (LVLMs) designed for fine-grained video description. Tarsier leverages the power of CLIP-ViT for visual encoding and a large …