Evaluating AI Models with HEIM Metrics for Fairness, Robustness, and More

To evaluate the 12 aspects (§3), we also curate a diverse and realistic set of metrics. Table 3 presents an overview of all the metrics and their descriptions.

Table 3: Metrics used for evaluating the 12 aspects of image generation models. We use realistic human metrics as well as automated, commonly used existing metrics.

Table 4: Models evaluated in the HEIM effort.

Compared with metrics used in prior work, ours are more realistic and broader in scope. First, in addition to automated metrics, we use human metrics (top rows in Table 3) to perform realistic evaluation that reflects human judgment [25, 26, 27]. Specifically, we employ human metrics for overall text-image alignment and photorealism, which are used for many evaluation aspects, including alignment, quality, knowledge, reasoning, fairness, robustness, and multilinguality. We also employ human metrics for overall aesthetics and originality, for which capturing the nuances of human judgment is important. To conduct human evaluation, we employ crowdsourcing following the methodology described in Otani et al. [35]. Concrete English definitions are provided for each human evaluation question and rating choice, and a minimum of 5 participants evaluate each image. We use at least 100 image samples for each aspect. For more details about the crowdsourcing procedure, please refer to Appendix E.
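As a rough illustration of how crowdsourced ratings of this kind could be turned into a single score, the sketch below checks the two thresholds stated above (at least 5 raters per image, at least 100 images per aspect) and then averages. The function name, the input layout, and the choice of a mean-of-means aggregation are assumptions made for this example; the exact aggregation used in the evaluation is described in Appendix E, not here.

```python
from statistics import mean

MIN_RATERS_PER_IMAGE = 5     # from the text: at least 5 participants per image
MIN_IMAGES_PER_ASPECT = 100  # from the text: at least 100 image samples per aspect


def aggregate_human_metric(ratings_by_image,
                           min_raters=MIN_RATERS_PER_IMAGE,
                           min_images=MIN_IMAGES_PER_ASPECT):
    """Hypothetical helper: average ratings per image, then across images.

    ratings_by_image maps an image id to the list of numeric ratings that
    crowdworkers gave that image for one question (e.g. photorealism).
    The mean-of-means aggregation is an assumption of this sketch.
    """
    if len(ratings_by_image) < min_images:
        raise ValueError(
            f"need at least {min_images} images, got {len(ratings_by_image)}"
        )

    per_image_scores = []
    for image_id, ratings in ratings_by_image.items():
        if len(ratings) < min_raters:
            raise ValueError(
                f"image {image_id!r} has {len(ratings)} ratings, "
                f"need at least {min_raters}"
            )
        per_image_scores.append(mean(ratings))

    return mean(per_image_scores)
```

Averaging per image first keeps an image with extra raters from carrying more weight than the others; a different scheme, such as a majority vote per image, would slot into the same loop.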

Second, we introduce new metrics for aspects that have received limited attention in existing evaluation efforts, namely fairness, robustness, multilinguality, and efficiency, as discussed in §3. These new metrics aim to close the resulting evaluation gaps.

This paper is available on arXiv under the CC BY 4.0 DEED license.
