Authors:
(1) Hoon Kim, Beeble AI, and contributed equally to this work;
(2) Minje Jang, Beeble AI, and contributed equally to this work;
(3) Wonjun Yoon, Beeble AI, and contributed equally to this work;
(4) Jisoo Lee, Beeble AI, and contributed equally to this work;
(5) Donghyun Na, Beeble AI, and contributed equally to this work;
(6) Sanghyun Woo, New York University, and contributed equally to this work.
Editor's Note: This is Part 11 of 14 of a study introducing a method for improving how light and shadows can be applied to human portraits in digital images. Read the rest below.
Table of Links
- Abstract and 1. Introduction
- 2. Related Work
- 3. SwitchLight and 3.1. Preliminaries
- 3.2. Problem Formulation
- 3.3. Architecture
- 3.4. Objectives
- 4. Multi-Masked Autoencoder Pre-training
- 5. Data
- 6. Experiments
- 7. Conclusion
Appendix
- A. Implementation Details
- B. User Study Interface
- C. Video Demonstration
- D. Additional Qualitative Results & References
A. Implementation Details
During multi-masked autoencoder pre-training (Section 4), we pre-train a single U-Net. In the subsequent fine-tuning stage, the weights of this pre-trained model are transferred to multiple U-Nets: NormalNet, DiffuseNet, SpecularNet, and RenderNet. In contrast, IllumNet, which does not follow the U-Net architecture, is initialized with random weights. To ensure compatibility with the varying number of input channels of each network, we modify the weights as necessary. For example, weights pre-trained for RGB channels are copied and adapted to fit networks with 6 or 9 input channels.
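The channel adaptation described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `inflate_input_channels` is hypothetical, the first-layer kernel is assumed to be stored as nested lists shaped `[out][in][kh][kw]`, and dividing by the number of copies is an assumption, since the text says only that the RGB weights are "copied and adapted".

```python
def inflate_input_channels(weight, new_in):
    """Tile pre-trained RGB input kernels to cover `new_in` input channels.

    weight: nested lists shaped [out][in][kh][kw], with in == 3 (RGB).
    new_in: target number of input channels, e.g. 6 or 9.
    """
    old_in = len(weight[0])
    if new_in % old_in != 0:
        raise ValueError("new_in must be a multiple of the pre-trained channel count")
    copies = new_in // old_in
    # Assumption: rescale by the number of copies so that the layer's initial
    # response to a repeated RGB input matches the pre-trained response.
    return [
        [
            [[v / copies for v in row] for row in out_filter[c % old_in]]
            for c in range(new_in)
        ]
        for out_filter in weight
    ]


# Toy 1x1 kernels: one output filter, three RGB input channels.
rgb_weight = [[[[1.0]], [[2.0]], [[3.0]]]]
six_channel = inflate_input_channels(rgb_weight, 6)
print(len(six_channel[0]))  # 6
```

Whether to rescale the copied weights is a design choice: without it, a 6-channel input made of two identical RGB images would produce activations twice as large as those seen during pre-training.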
Data To generate the relighting training pairs, we randomly sample images from the OLAT dataset. Two randomly chosen HDRI lighting environment maps are then projected onto each sampled image to form a training pair. All images are processed in linear space. To manage the dynamic range effectively, we apply logarithmic normalization using the log(1 + x) function.
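The log(1 + x) normalization can be illustrated with a short sketch. The function names are hypothetical, and the inverse mapping is included only to show that the transform is reversible; the text does not say where, or whether, the authors invert it.

```python
import math


def log_normalize(x):
    """Compress a non-negative linear-space value with log(1 + x)."""
    return math.log1p(x)


def log_denormalize(y):
    """Invert log_normalize, recovering the linear-space value."""
    return math.expm1(y)


# Linear-space intensities spanning a wide dynamic range.
hdr_pixels = [0.0, 0.5, 1.0, 100.0]
compressed = [log_normalize(v) for v in hdr_pixels]
restored = [log_denormalize(v) for v in compressed]
```

Zero maps to zero, small values stay close to linear, and very bright values are compressed, which keeps HDR highlights from dominating the training signal.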
Architecture SwitchLight employs a U-Net-based architecture, applied consistently across NormalNet, DiffuseNet, SpecularNet, and RenderNet. This design is inspired by recent advancements in diffusion-based models [12]. Unlike standard diffusion models, we omit the temporal embedding layer. The architecture is characterized by several hyperparameters: the number of input channels, a base channel count, and channel multipliers that determine the channel count at each stage. Each downsampling stage features two residual blocks, with attention mechanisms integrated at certain resolutions. The key hyperparameters and their values are summarized in Table 4.
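To make the role of these hyperparameters concrete, the sketch below shows how a base channel count and a list of multipliers determine the channel count at each stage. The class name `UNetConfig`, its field names, and the numeric values are hypothetical placeholders, not the settings reported in Table 4.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class UNetConfig:
    in_channels: int                              # number of input channels
    base_channel: int                             # channel count before multipliers
    channel_multipliers: Tuple[int, ...]          # one multiplier per stage
    res_blocks_per_stage: int = 2                 # two residual blocks per downsampling stage
    attention_resolutions: Tuple[int, ...] = ()   # feature resolutions that use attention

    def stage_channels(self):
        """Channel count at each stage: base_channel * multiplier."""
        return [self.base_channel * m for m in self.channel_multipliers]


# Placeholder values for illustration only.
example = UNetConfig(
    in_channels=3,
    base_channel=64,
    channel_multipliers=(1, 2, 4, 8),
    attention_resolutions=(16,),
)
print(example.stage_channels())  # [64, 128, 256, 512]
```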
IllumNet is composed of two projection layers, one for transforming the Phong lobe features and another for image features, with the latter using normal bottleneck features as a compact form of image representation. Following this, a cross-attention layer is employed, wherein the Phong lobe serves as the query and the image features function as both key and value. Finally, an output layer generates the final convolved source HDRI.
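The cross-attention step can be sketched in plain Python as below. This assumes single-head scaled dot-product attention and omits the learned projection and output layers; the paper does not specify the number of heads or the feature dimensions, so the function names and the toy inputs are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: projected Phong lobe features, one vector per lobe.
    keys, values: projected image features, one vector per image token.
    Returns one output vector per query.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity between this lobe and every image token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy example: one query and two image tokens with identical keys,
# so the attention weights are uniform and the output is the mean value.
result = cross_attention(
    queries=[[1.0, 0.0]],
    keys=[[0.0, 0.0], [0.0, 0.0]],
    values=[[1.0, 3.0], [3.0, 5.0]],
)
print(result)  # [[2.0, 4.0]]
```

Because the Phong lobe features act as queries, the output has one vector per lobe, which is then passed to the output layer that produces the convolved HDRI.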
The Discriminator network is used during both the pre-training and fine-tuning stages with the same architectural design, although the weights are not shared between these stages. The network is composed of a series of residual blocks, each containing two 3×3 convolution layers interspersed with Leaky ReLU activations. The number of filters increases progressively across these layers: 64, 128, 256, and 512. As the filter count increases, the resolution of the features decreases. Finally, the network compresses its output into a single channel with a 3×3 convolution, yielding a probability value.
Regarding the activation functions across different networks: NormalNet processes its outputs through ℓ2 normalization, ensuring they are unit normal vectors. IllumNet, DiffuseNet, and RenderNet use a softplus activation (with β = 20) to generate non-negative pixel values. SpecularNet employs a sigmoid activation function, ensuring that both the roughness parameter and the Fresnel reflectance values fall within the range of 0 to 1.
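The three output activations can be written out as a minimal sketch on scalars and short vectors. The function names, the overflow guard in `softplus`, and the `eps` term in `l2_normalize` are assumptions added for numerical safety; only β = 20 comes from the text.

```python
import math


def softplus(x, beta=20.0, threshold=20.0):
    """(1 / beta) * log(1 + exp(beta * x)); the result is never negative."""
    if beta * x > threshold:
        # For large inputs softplus is effectively the identity;
        # returning x avoids overflow in exp().
        return x
    return math.log1p(math.exp(beta * x)) / beta


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit length, as done for predicted surface normals."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / max(norm, eps) for c in v]


def sigmoid(x):
    """Map any real value into the interval from 0 to 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # stable branch for negative inputs
    return e / (1.0 + e)


normal = l2_normalize([3.0, 0.0, 4.0])   # unit-length normal vector
pixel = softplus(-0.1)                   # small non-negative pixel value
roughness = sigmoid(0.0)                 # value between 0 and 1
```

With β = 20, softplus stays very close to ReLU except in a narrow band around zero, so it enforces non-negativity while remaining smooth.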
This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.