Authors:
(1) Shehan Munasinghe, Mohamed bin Zayed University of AI (Equal Contribution);
(2) Rusiru Thushara, Mohamed bin Zayed University of AI (Equal Contribution);
(3) Muhammad Maaz, Mohamed bin Zayed University of AI;
(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Mubarak Shah, University of Central Florida;
(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.
Editor's Note: This is Part 3 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.
Table of Links
- Abstract and 1 Introduction
- 2. Related Works
- 3. PG-Video-LLaVA
- 3.1. Overview
- 3.2. Architecture
- 4. Experiments
- 4.1. Implementation Details
- 4.2. Stronger Baseline
- 4.3. Spatial Grounding in Videos
- 4.4. Zero-Shot Visual Question Answering
- 5. Conclusion and References
Supplementary Material
- A. Audio Modality Integration
- B. Visual Grounding: Quantitative Evaluation
- C. Qualitative Results for Visual Grounding
- D. Quantitative Evaluations of Video-based Conversation Performance
3.1. Overview
In this paper, we introduce PG-Video-LLaVA, a novel Large Multimodal Model (LMM) designed to align video and audio representations with a Large Language Model (LLM). This integration equips PG-Video-LLaVA to handle both video and audio data proficiently in conversational contexts. Additionally, our method integrates a specialized plug-and-play module for effective video grounding (see Figure 2).
In constructing PG-Video-LLaVA, our approach integrates sophisticated mechanisms for aligning video and audio signals with language processing capabilities, thereby facilitating a comprehensive multimodal analysis. Central to our model is an advanced CLIP-based video encoder, which has been specifically adapted to process both spatial and temporal dimensions of video data. This adaptation enables a deeper understanding of video content, setting PG-Video-LLaVA apart from conventional image-centric models.
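To make the spatio-temporal adaptation concrete, the sketch below shows one common way to aggregate per-frame CLIP patch features into video tokens for an LLM, in the spirit of Video-ChatGPT-style pooling. The tensor shapes, class name, and projection dimensions are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SpatioTemporalPooling(nn.Module):
    """Hedged sketch: fuse per-frame CLIP patch features into a single
    spatio-temporal token sequence. Names and dimensions are assumptions
    for illustration, not PG-Video-LLaVA's released code."""

    def __init__(self, clip_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Linear adapter mapping pooled visual features into the LLM's
        # embedding space (dimension chosen here only for illustration).
        self.proj = nn.Linear(clip_dim, llm_dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T, N, D) = (frames, patches per frame, CLIP dim)
        temporal = frame_feats.mean(dim=1)   # (T, D): one token per frame
        spatial = frame_feats.mean(dim=0)    # (N, D): one token per patch
        video_tokens = torch.cat([temporal, spatial], dim=0)  # (T + N, D)
        return self.proj(video_tokens)       # (T + N, llm_dim)

# Usage: 100 sampled frames, 256 CLIP patches each, 1024-dim features.
feats = torch.randn(100, 256, 1024)
tokens = SpatioTemporalPooling()(feats)      # -> (356, 4096)
```

Pooling along each axis separately keeps both a per-frame (temporal) and a per-patch (spatial) view of the clip while keeping the token count small enough to fit in the LLM context.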
For training, PG-Video-LLaVA utilizes the VideoInstruct100K [22] dataset comprising 100K video instructions derived from ActivityNet-200 [11]. This diverse dataset ensures that the model is well-equipped to handle a broad spectrum of video contexts with high accuracy. In addition to visual processing, PG-Video-LLaVA incorporates state-of-the-art audio analysis by leveraging advanced audio transcription techniques, similar to those employed in WhisperX [2] and Whisper-AT [10]. This integration allows the model to process and understand audio inputs effectively, enhancing its overall multimodal interpretation capabilities.
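As a rough illustration of the audio branch, the snippet below transcribes a video's audio track so the resulting text can be added to the conversation prompt. It uses the openai-whisper package as a stand-in; the paper's pipeline builds on WhisperX and Whisper-AT, which add word-level alignment and audio tagging on top of plain transcription. The file name and prompt format are hypothetical.

```python
# Minimal sketch (assumption: openai-whisper installed, ffmpeg available).
# Extract a speech transcript from a video file so it can accompany the
# visual tokens in the LLM prompt.
import whisper

def transcribe_video_audio(video_path: str) -> str:
    model = whisper.load_model("base")     # small model, for illustration only
    result = model.transcribe(video_path)  # ffmpeg decodes the audio track
    return result["text"]                  # plain transcript string

# transcript = transcribe_video_audio("example_video.mp4")  # hypothetical file
# prompt = f"Audio transcript: {transcript}\nUser question: ..."
```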
While PG-Video-LLaVA builds on the LLaVA-1.5 [18] framework, it extends it to videos by incorporating spatio-temporal representations, audio understanding, and visual grounding capabilities. Its unique combination of enhanced video encoding, an extensive training dataset, integrated audio processing, and grounding capability marks a step forward in the field of LMMs.
This paper is available on arxiv under CC BY 4.0 DEED license.