Authors:
(1) Hoon Kim, Beeble AI, and contributed equally to this work;
(2) Minje Jang, Beeble AI, and contributed equally to this work;
(3) Wonjun Yoon, Beeble AI, and contributed equally to this work;
(4) Jisoo Lee, Beeble AI, and contributed equally to this work;
(5) Donghyun Na, Beeble AI, and contributed equally to this work;
(6) Sanghyun Woo, New York University, and contributed equally to this work.
Editor's Note: This is Part 9 of 14 of a study introducing a method for improving how light and shadows can be applied to human portraits in digital images. Read the rest below.
Table of Links
- Abstract and 1. Introduction
- 2. Related Work
- 3. SwitchLight and 3.1. Preliminaries
- 3.2. Problem Formulation
- 3.3. Architecture
- 3.4. Objectives
- 4. Multi-Masked Autoencoder Pre-training
- 5. Data
- 6. Experiments
- 7. Conclusion
Appendix
- A. Implementation Details
- B. User Study Interface
- C. Video Demonstration
- D. Additional Qualitative Results & References
6. Experiments
This section details our experimental results. We begin with a comparative evaluation of our method against state-ofthe-art approaches using the OLAT dataset. We also employ images from the FFHQ–test [25] for user studies. For qualitative analysis, we utilize copyright-free portrait images from Pexels [1]. Additionally, we conduct ablation studies to validate the efficacy of our pre-training framework and architectural design choice. Subsequently, we detail the additional features and conclude by discussing its limitations. Our evaluation uses the OLAT test set, comprising 35 subjects and 11 lighting environments, ensuring no overlap with the train set.
Evaluation metrics. We employ several key metrics for evaluating the prediction accuracy; Mean Absolute Error (MAE), Mean Squared Error (MSE), Structural Similarity Index Measure (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS). While these metrics offer valuable quantitative insights, they do not fully capture the subtleties of visual quality enhancement. Therefore, we emphasize the importance of qualitative evaluations to gain a comprehensive understanding of model performance.
Baselines. We compared our approach with three state-of-the-art baselines: Single Image Portrait Relighting (SIPR) [45], which uses a single neural network for relighting; Total Relight (TR) [34], employing multiple neural networks that incorporate the Phong reflectance model; and Lumos [52], a TR adaptation for synthetic datasets. Due to the lack of publicly available code or model from these methods, we either integrated their techniques into our framework or requested the original authors to process our inputs with their models and share the results.
Quantitative Comparisons. The results in Table. 1 shows our method outperforming SIPR and TR baselines, demonstrating the significance of incorporating advanced rendering physics and reflectance models. The transition from SIPR to TR emphasizes the value of physics-based design, while the shift from TR to our approach underscores the importance of transitioning from the empirical Phong model to the more accurate Cook-Torrance model. Additionally, pretraining contributes to further enhancements, as evidenced by the improved image details, depicted in Fig 6.
Qualitative Comparisons. Our relighting method exhibits several key advantages over previous approaches, as showcased in Fig. 7. It effectively harmonizes light direction and softness, avoiding harsh highlights and inaccurate lighting that are commonly observed in other methods. A notable strength of our approach lies in its ability to capture highfrequency details like specular highlights and hard shadows. Additionally, as shown in the second row, it preserves facial details and identity, ensuring high fidelity to the subject’s original features and mitigating common distortions seen in previous approaches. Moreover, our approach excels in handling skin tones, producing natural and accurate results under various lighting conditions. This is clearly demonstrated in the fourth row, where our method contrasts sharply with the over-saturated or pale tones from previous methods. Finally, the nuanced treatment of hair is highlighted in the sixth row, where our approach maintains luster and detail, unlike the flattened effect typical in other methods. More qualitative results are available in our supplementary video demonstration.
User Study. We conducted a human subjective test to evaluate the visual quality of relighitng results, summarized in Table. 2. In each test case, workers were presented with an input image and an environment map. They were asked to compare the relighting results from three methods–Ours, Lumos, and TR–based on three criteria: 1) consistency of lighting with the environment map, 2) preservation of facial details, and 3) maintenance of the original identity. To ensure unbiased evaluations, the order of the methods presented was randomized. To aid in understanding the concept of consistent lighting, relit balls were displayed alongside the images. The study included a total of 256 images, consisting of 32 portraits each illuminated with 8 different HDRIs. Each worker was tasked with selecting the best image for each specific criterion, randomly assessing 30 samples. A total of 47 workers participated in the study. The results indicate a strong preference for our results over the baseline methods across all evaluated metrics.
Ablation Studies. We analyze our two major design choices in Table. 3: the MMAE pre-training framework and DiffuseNet. The MMAE, which integrates dynamic masking with generative objectives, outperforms MAE. This superiority is mainly due to the incorporation of challenging masks and global coherence objectives, enabling the model to learn richer features during pre-training. Furthermore, our method of predicting diffuse render demonstrates superiority over direct albedo prediction. Firstly, we see it simplifies the learning process, as predicting diffuse render is more closely related to the original image. Secondly, our approach effectively distinguishes between the influences of illumination (diffuse shading) and surface properties (diffuse render). This distinction is crucial for accurately modeling the intrinsic color of surfaces, as it enables independent and precise evaluation of each element (see Eqn. 9). In contrast, methods that predict albedo directly often struggle to differentiate between these factors, leading to significant inaccuracies in color constancy, as shown in Fig. 8.
Applications. We present two applications using predicted intrinsics in Fig. 9. First, real-time PBR via Cook-Torrance
components in Three.js graphics library. Second, switching the lighting environment between the source and reference images. Further details are in the supplementary video.
Limitations. We identified a few failure cases in Fig. 10. First, our model struggles with neutralizing strong shadows, which leads to inaccurate facial geometry and residual shadow artifacts in both albedo and relit images. Incorporating shadow augmentation techniques [16, 54] during training could mitigate this issue. Second, the model incorrectly interprets reflective surfaces, such as sunglasses, as opaque objects in the normal image. This error prevents the model from properly removing reflective highlights in the albedo and relit images. Lastly, the model inaccurately predicts the albedo for face paint. Implementing a semantic mask [52] to distinguish different semantic regions separately from the skin could help resolving these issues.
This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.