Authors:
(1) Shehan Munasinghe, Mohamed bin Zayed University of AI (Equal Contribution);
(2) Rusiru Thushara, Mohamed bin Zayed University of AI (Equal Contribution);
(3) Muhammad Maaz, Mohamed bin Zayed University of AI;
(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Mubarak Shah, University of Central Florida;
(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.
Editor's Note: This is Part 5 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.
Table of Links
- Abstract and 1 Introduction
- 2. Related Works
- 3. PG-Video-LLaVA
- 3.1. Overview
- 3.2. Architecture
- 4. Experiments
- 4.1. Implementation Details
- 4.2. Stronger Baseline
- 4.3. Spatial Grounding in Videos
- 4.4. Zero-Shot Visual Question Answering
- 5. Conclusion and References
Supplementary Material
- A. Audio Modality Integration
- B. Visual Grounding: Quantitative Evaluation
- C. Qualitative Results for Visual Grounding
- D. Quantitative Evaluations of Video-based Conversation Performance
4.1. Implementation Details
For audio transcript extraction, the base Whisper model is used. Our grounding module is built on the GroundingDINO-T variant and CLIP ViT-B/32. For the image-tagging model, we use the RAM Swin-Large variant (input size 384). The DEVA tracker is applied in the online setting in our experiments.
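For reference, the snippet below is a minimal sketch of how these off-the-shelf components might be wired together for a single clip. The Whisper and CLIP calls follow those libraries' public Python APIs; `tag_frame_with_ram`, `detect_with_grounding_dino`, and `track_with_deva` are hypothetical wrapper functions standing in for the RAM, GroundingDINO-T, and DEVA interfaces, and the overall flow is an illustration under those assumptions rather than the authors' exact pipeline.

```python
# Sketch of the per-video grounding pipeline described above.
# Only the Whisper and CLIP calls use real public APIs; the RAM,
# GroundingDINO, and DEVA wrappers below are assumed placeholders.
import torch
import whisper  # pip install openai-whisper
import clip     # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Audio transcript with the base Whisper model.
asr_model = whisper.load_model("base", device=device)
transcript = asr_model.transcribe("video.mp4")["text"]

# 2) CLIP ViT-B/32 for matching tracked regions to a referring expression.
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)

def ground_video(frames, referring_expression):
    """Tag frames, detect candidate boxes, track them, and rank tracks with CLIP."""
    per_frame_detections = []
    for frame in frames:
        tags = tag_frame_with_ram(frame)                 # RAM Swin-Large, 384 input (assumed wrapper)
        boxes = detect_with_grounding_dino(frame, tags)  # GroundingDINO-T (assumed wrapper)
        per_frame_detections.append(boxes)

    # DEVA tracker in the online setting links detections into object tracks (assumed wrapper).
    tracks = track_with_deva(frames, per_frame_detections)

    # CLIP scores each track against the referring expression; keep the best match.
    with torch.no_grad():
        text_feat = clip_model.encode_text(clip.tokenize([referring_expression]).to(device))
        scores = []
        for track in tracks:
            crop = clip_preprocess(track.representative_crop()).unsqueeze(0).to(device)  # assumed attribute
            image_feat = clip_model.encode_image(crop)
            scores.append(torch.cosine_similarity(image_feat, text_feat).item())
    return tracks[scores.index(max(scores))]
```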
The Vicuna-13b-v1.5 model is used for video-based conversational benchmarking, zero-shot question-answering evaluation, and for extracting the key noun or referring expression from the model output in the quantitative evaluation of the spatial grounding task. Further, Vicuna-13b-v1.5 is used to implement the entity matching as in [49].
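As a rough illustration of the key-phrase extraction step, the sketch below prompts the public `lmsys/vicuna-13b-v1.5` checkpoint via Hugging Face Transformers to return the noun phrase that should be grounded. The prompt wording and the `extract_key_phrase` helper are illustrative assumptions, not the exact prompt or code used in the paper.

```python
# Minimal sketch: asking Vicuna-13b-v1.5 for the key noun / referring
# expression in a model answer before passing it to the grounding module.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "lmsys/vicuna-13b-v1.5"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def extract_key_phrase(answer: str) -> str:
    """Ask Vicuna for the single noun phrase to be spatially grounded (illustrative prompt)."""
    prompt = (
        "USER: Extract the single key noun phrase (the object being referred to) "
        f"from the following answer. Reply with the phrase only.\nAnswer: {answer}\n"
        "ASSISTANT:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    generated = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True).strip()

# Example: extract_key_phrase("The man in the red jacket is riding a bicycle.")
# would be expected to return something like "the man in the red jacket".
```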
This paper is available on arxiv under CC BY 4.0 DEED license.