Source: Wang, J., Yuan, L., Zhang, Y., & Sun, H. (2024). Tarsier: Recipes for Training and Evaluating Large Video Description Models. arXiv preprint arXiv:2407.00634.
Date: 2024-06-30
Summary:
This paper introduces Tarsier, a family of large-scale video-language models (LVLMs) designed for fine-grained video description. Tarsier pairs a CLIP-ViT visual encoder with a large language model (LLM) that performs temporal relationship modeling and text generation.
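As a rough illustration of this design, the sketch below encodes each frame with a frozen CLIP-ViT, projects the patch tokens into the LLM's embedding space, and prepends them to the text prompt. The VideoDescriber class, checkpoint names, and shapes are illustrative assumptions, not the actual Tarsier implementation.

```python
# Hypothetical sketch of a Tarsier-style describer: frames are encoded
# independently by a frozen CLIP-ViT, projected into the LLM embedding space,
# and prepended to the prompt; the LLM does temporal modeling and generation.
# Checkpoint names are stand-ins (GPT-2's 1,024-token context only fits a few
# frames; the real models use much larger LLMs with longer contexts).
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, AutoModelForCausalLM

class VideoDescriber(nn.Module):
    def __init__(self, vit_name="openai/clip-vit-large-patch14", llm_name="gpt2"):
        super().__init__()
        self.vit = CLIPVisionModel.from_pretrained(vit_name)
        self.vit.requires_grad_(False)  # visual encoder kept frozen
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        # Map the ViT hidden size to the LLM's embedding size
        self.proj = nn.Linear(self.vit.config.hidden_size, self.llm.config.hidden_size)

    def encode_frames(self, frames):
        # frames: (num_frames, 3, 224, 224) -- each frame is encoded separately
        patch_tokens = self.vit(pixel_values=frames).last_hidden_state  # (T, P, d_vit)
        tokens = self.proj(patch_tokens)                                # (T, P, d_llm)
        return tokens.flatten(0, 1).unsqueeze(0)                        # (1, T*P, d_llm)

    def forward(self, frames, text_ids):
        # Concatenate visual tokens with the embedded text, then run the LLM
        visual = self.encode_frames(frames)
        text = self.llm.get_input_embeddings()(text_ids)                # (1, L, d_llm)
        return self.llm(inputs_embeds=torch.cat([visual, text], dim=1)).logits
```

In the released models the LLM is far larger (7B/34B parameters) and, per the quotes below, all LLM parameters are tuned during training; the GPT-2 stand-in here only keeps the sketch small and self-contained.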
Key Findings:
- Superior Performance: Tarsier models demonstrate significantly stronger video description capabilities than existing open-source models and perform comparably to state-of-the-art proprietary models like GPT-4V and Gemini 1.5 Pro.
- Two-Stage Training: A meticulously designed two-stage training procedure, multi-task pre-training followed by instruction tuning, is key to Tarsier's success (a minimal training sketch follows this list).
- High-Quality Data: The quality, diversity, and scale of training data play a crucial role in Tarsier’s performance. The model benefits from:
  - High-dynamic videos with accurately matched text
  - A multi-task pre-training dataset encompassing diverse tasks such as video captioning, VQA, and action recognition
  - Instruction tuning data featuring multi-grained descriptions, camera motion descriptions, and creative writing tasks
- New Benchmark: The paper introduces DREAM-1K, a challenging new benchmark of 1,000 video clips for evaluating video description models. The clips come from diverse sources and vary in complexity, requiring an understanding of multiple frames, subtle motions, and semantic reasoning.
- AutoDQ: The paper also proposes AutoDQ, an automatic evaluation method that assesses description quality through event extraction and entailment analysis, offering a more interpretable alternative to n-gram metrics.
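As a rough sketch of how the two-stage recipe noted above could be wired up, the following reuses the hypothetical VideoDescriber from the summary; the dataloaders, loss-masking convention, and hyperparameters are placeholders, not the paper's actual settings.

```python
# Hypothetical two-stage training loop for the VideoDescriber sketch above:
# stage 1 = multi-task pre-training, stage 2 = instruction tuning on
# multi-grained descriptions. The visual encoder stays frozen; the LLM and
# projection are fully tuned. Loaders and hyperparameters are placeholders.
import torch
import torch.nn.functional as F

def train_stage(model, dataloader, epochs, lr):
    trainable = [p for p in model.parameters() if p.requires_grad]  # LLM + projection only
    optimizer = torch.optim.AdamW(trainable, lr=lr)
    model.train()
    for _ in range(epochs):
        for frames, input_ids, labels in dataloader:
            # input_ids: prompt + reference description (teacher forcing), shape (1, L);
            # labels: same shape, with prompt positions set to -100 so they are ignored
            logits = model(frames, input_ids)
            n_text = input_ids.shape[1]
            loss = F.cross_entropy(
                logits[:, -n_text:-1, :].reshape(-1, logits.shape[-1]),  # predict next token
                labels[:, 1:].reshape(-1),
                ignore_index=-100,
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def train_two_stage(model, pretrain_loader, instruct_loader):
    train_stage(model, pretrain_loader, epochs=1, lr=1e-4)  # Stage 1: multi-task pre-training
    train_stage(model, instruct_loader, epochs=1, lr=2e-5)  # Stage 2: instruction tuning
```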
Important Ideas/Facts:
- Model Architecture: Tarsier utilizes a frozen CLIP-ViT to encode video frames separately. These visual tokens are then fed into an LLM for temporal modeling and autoregressive description generation. This simple yet effective architecture leverages the LLM’s inherent reasoning abilities.
- Multi-task Pre-training: Tarsier is pre-trained on a massive dataset (13.6M video-text pairs) across diverse tasks. This approach allows the model to develop a robust understanding of video content from different perspectives.
- Instruction Tuning: Fine-tuning the model on a curated instruction tuning dataset focusing on multi-grained video descriptions, camera motions, and creative writing significantly enhances its descriptive capabilities.
- DREAM-1K Dataset: DREAM-1K addresses the limitations of existing video captioning datasets, which often feature simplistic videos and short captions. The new dataset includes complex videos with multiple events, subjects, and shots, along with detailed manual annotations.
- Evaluation Metrics: The paper highlights the shortcomings of traditional n-gram based metrics like CIDEr for evaluating long video descriptions. AutoDQ provides a more nuanced and interpretable evaluation by focusing on event extraction and entailment.
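Since AutoDQ relies on model-based event extraction and entailment, the following is only a minimal sketch of the scoring loop: event extraction (done with an LLM in the paper) is stubbed out, and a generic off-the-shelf NLI checkpoint stands in for the entailment model. The helper names (extract_events, autodq_style_score) are hypothetical.

```python
# Sketch of an AutoDQ-style scorer: extract events from the reference and the
# generated description, then use textual entailment to check which reference
# events are covered by the generated text (recall) and which generated events
# are supported by the reference (precision). Event extraction (an LLM in the
# paper) is stubbed out; the NLI checkpoint and helper names are stand-ins.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def extract_events(description):
    # Placeholder: the paper uses an LLM to split a description into atomic events.
    return [s.strip() for s in description.split(".") if s.strip()]

def entailed(premise, hypothesis):
    result = nli({"text": premise, "text_pair": hypothesis})[0]
    return result["label"].upper() == "ENTAILMENT"

def autodq_style_score(reference, generated):
    ref_events = extract_events(reference)
    gen_events = extract_events(generated)
    recall = sum(entailed(generated, e) for e in ref_events) / max(len(ref_events), 1)
    precision = sum(entailed(reference, e) for e in gen_events) / max(len(gen_events), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: autodq_style_score("A man opens a door. He walks into the room.",
#                             "A man enters a room.")
```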
Key Quotes:
- “Despite its simple architecture, we demonstrate that with a meticulously designed two-stage training procedure, the Tarsier models exhibit substantially stronger video description capabilities than any existing open-source model, showing a +51.4% advantage in human side-by-side evaluation over the strongest model.”
- “Our second contribution is the introduction of a new benchmark – DREAM-1K for evaluating video description models, consisting of a new challenging dataset featuring videos from diverse sources and varying complexity, along with an automatic method specifically designed to assess the quality of fine-grained video descriptions.”
- “Our findings indicate that several factors contribute to the model’s strong performance: conducting large-scale multi-task pre-training, scaling up the LLM, tuning all LLM parameters, and fine-tuning the model with carefully annotated multi-grained video descriptions.”
Future Directions:
The authors suggest several avenues for further research, including scaling up the pre-training data, exploring larger visual encoders and LLMs, and refining the model’s ability to handle complex instructions.
Conclusion:
Tarsier represents a significant advancement in open-source video description models. Its strong performance, coupled with the introduction of a challenging new benchmark and evaluation method, paves the way for further research and development in the field of video understanding.