Authors:
(1) Hanoona Rasheed, Mohamed bin Zayed University of AI (equally contributing first author);
(2) Muhammad Maaz, Mohamed bin Zayed University of AI (equally contributing first author);
(3) Sahal Shaji, Mohamed bin Zayed University of AI;
(4) Abdelrahman Shaker, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Hisham Cholakkal, Mohamed bin Zayed University of AI;
(7) Rao M. Anwer, Mohamed bin Zayed University of AI and Aalto University;
(8) Eric Xing, Mohamed bin Zayed University of AI and Carnegie Mellon University;
(9) Ming-Hsuan Yang, University of California - Merced and Google Research;
(10) Fahad S. Khan, Mohamed bin Zayed University of AI and Linköping University.
Editor's Note: This is Part 10 of 10 of a study detailing the development of an AI model that is designed to describe images to users. Read the rest below.
Table of Links
- Abstract and 1 Introduction
- 2. Related Work
- 3. Method
- 4. Data Annotation Pipeline
- 5. Experiments
- 6. Conclusion and References
Supplementary Material (Part 1)
- A. Additional Implementation Details
- B. Additional Downstream Tasks
- C. Additional Qualitative Results
Supplementary Material (Part 2)
D. Dataset Visualization
In this section, we provide additional samples from our GranD and GranDf datasets to better illustrate the functionalities they offer. Please see Fig. 15 and Fig. 14.
E. Limitations and Future Work
The large-scale automated pipeline provides dense labels that are important for our pretraining, but these labels still contain some noise. A high-quality, clean dataset could further improve the pretrained representations, although this comes at a significantly higher annotation cost. A potential research direction is to develop a cost-effective annotation pipeline aimed at reducing noise in dense labeling. Additionally, expanding the GLaMM framework to incorporate modalities such as video and 3D is another avenue for future research.
F. Ethics and Societal Impact
Our Grounding-anything Dataset (GranD) uses SAM images in which personal information has been de-identified, with all faces and license plates obscured. To the best of our knowledge, the dataset does not exhibit any strong biases or discrimination. We urge responsible use of GranD and GLaMM, promoting research progress while safeguarding privacy.
This paper is available on arxiv under CC BY 4.0 DEED license.