New Dimensions in Text-to-Image Model Evaluation

cover
12 Oct 2024

Authors:

(1) Tony Lee, Stanford with Equal contribution;

(2) Michihiro Yasunaga, Stanford with Equal contribution;

(3) Chenlin Meng, Stanford with Equal contribution;

(4) Yifan Mai, Stanford;

(5) Joon Sung Park, Stanford;

(6) Agrim Gupta, Stanford;

(7) Yunzhi Zhang, Stanford;

(8) Deepak Narayanan, Microsoft;

(9) Hannah Benita Teufel, Aleph Alpha;

(10) Marco Bellagente, Aleph Alpha;

(11) Minguk Kang, POSTECH;

(12) Taesung Park, Adobe;

(13) Jure Leskovec, Stanford;

(14) Jun-Yan Zhu, CMU;

(15) Li Fei-Fei, Stanford;

(16) Jiajun Wu, Stanford;

(17) Stefano Ermon, Stanford;

(18) Percy Liang, Stanford.

Abstract and 1 Introduction

2 Core framework

3 Aspects

4 Scenarios

5 Metrics

6 Models

7 Experiments and results

8 Related work

9 Conclusion

10 Limitations

Author contributions, Acknowledgments and References

A Datasheet

B Scenario details

C Metric details

D Model details

E Human evaluation procedure

Holistic benchmarking. Benchmarks drive the advancements of AI by orienting the directions for the community to improve upon [20, 56, 57, 58]. In particular, in natural language processing (NLP), the adoption of meta-benchmarks [59, 60, 61, 62] and holistic evaluation [1] across multiple scenarios or tasks has allowed for comprehensive assessments of models and accelerated model improvements. However, despite the growing popularity of image generation and the increasing number of models being developed, a holistic evaluation of these models has been lacking. Furthermore, image generation encompasses various technological and societal impacts, including alignment, quality, originality, toxicity, bias, and fairness, which necessitate comprehensive evaluation. Our work fills this gap by conducting a holistic evaluation of image generation models across 12 important aspects.

Benchmarks for image generation. Existing benchmarks primarily focus on assessing image quality and alignment, using automated metrics [23, 36, 24, 63, 64, 65]. Widely used benchmarks such as MS-COCO [21] and ImageNet [20] have been employed to evaluate the quality and alignment of generated images. Metrics like Fréchet Inception Distance (FID) [23, 66], Inception Score [36], and CLIPScore [24] are commonly used for quantitative assessment of image quality and alignment.

To better capture human perception in image evaluation, crowdsourced human evaluation has been explored in recent years [25, 6, 35, 67]. However, these evaluations have been limited to assessing aspects such as alignment and quality. Building upon these crowdsourcing techniques, we extend the evaluation to include additional aspects such as aesthetics, originality, reasoning, and fairness.

As the ethical and societal impacts of image generation models gain prominence [19], researchers have also started evaluating these aspects [33, 29, 8]. However, these evaluations have been conducted on only a select few models, leaving the majority of models unevaluated in these aspects. Our standardized evaluation addresses this gap by enabling the evaluation of all models across all aspects, including ethical and societal dimensions.

Art and design. Our assessment of image generation incorporates aesthetic evaluation and design principles. Aesthetic evaluation considers factors like composition, color harmony, balance, and visual complexity [68, 69]. Design principles, such as clarity, legibility, hierarchy, and consistency in design elements, also influence our evaluation [70]. Combining these insights, we aim to determine whether generated images are visually pleasing, with thoughtful compositions, harmonious colors, balanced elements, and an appropriate level of visual complexity. We employ objective metrics and subjective human ratings for a comprehensive assessment of aesthetic quality.

This paper is available on arxiv under CC BY 4.0 DEED license.