PSDiffusion: Harmonized Multi-Layer Image Generation via
Layout and Appearance Alignment

WACV 2026


Dingbang Huang1  Wenbo Li1  Yifei Zhao1  Xinyu Pan2  Chun Wang3  Yanhong Zeng4  Bo Dai5

1Shanghai Jiao Tong University   2The Chinese University of Hong Kong   
3Zhejiang University   4Shanghai AI Laboratory   5The University of Hong Kong

TL;DR


We present PSDiffusion, an end-to-end diffusion framework for simultaneous multi-layer image generation. Leveraging a global-layer interactive mechanism, our model generates layered images concurrently and collaboratively, ensuring not only high quality and completeness for each layer, but also plausible spatial and visual interactions among layers for global coherence.


Abstract


Transparent image layer generation plays a significant role in digital art and design workflows. Existing methods typically decompose transparent layers from a single RGB image using a set of tools or generate multiple transparent layers sequentially. Despite some promising results, these methods are often limited in their ability to model global layouts, physically plausible interactions, and visual effects such as shadows and reflections with high alpha quality, because the layers share little global context. To address this issue, we propose PSDiffusion, a unified diffusion framework that leverages image composition priors from a pre-trained image diffusion model for simultaneous multi-layer text-to-image generation. Specifically, our method introduces a global-layer interaction mechanism to generate layered images collaboratively, ensuring both individual layer quality and coherent spatial and visual relationships across layers. Extensive experiments on benchmark datasets demonstrate that PSDiffusion outperforms existing methods in generating multi-layer images with plausible structure and enhanced visual fidelity.


Method


Overview of our PSDiffusion. Our framework produces multi-layer images through three key components: (1) a transparent VAE encoder that preserves alpha channels; (2) layer cross-attention reweighting for plausible layouts; (3) partial joint self-attention for inter-layer context modeling.
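To make the first component concrete, below is a minimal PyTorch sketch of an alpha-preserving encoder. This is an illustration under our own assumptions, not the released implementation: the class name TransparentVAEEncoder, the alpha_proj branch, and the additive latent fusion are hypothetical choices for carrying alpha information alongside a frozen pre-trained RGB VAE encoder.

import torch
import torch.nn as nn

class TransparentVAEEncoder(nn.Module):
    """Hypothetical sketch: a frozen pre-trained RGB VAE encoder plus a
    small learned branch that projects the alpha channel into the same
    latent space, so transparency survives encoding."""

    def __init__(self, rgb_encoder: nn.Module, latent_dim: int = 4):
        super().__init__()
        self.rgb_encoder = rgb_encoder
        for p in self.rgb_encoder.parameters():
            p.requires_grad_(False)  # keep the pre-trained encoder frozen
        # Downsample the 1-channel alpha map to the latent resolution
        # (1/8 for SD-style VAEs) and project it to latent_dim channels.
        self.alpha_proj = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=8, padding=1),
            nn.SiLU(),
            nn.Conv2d(32, latent_dim, kernel_size=3, padding=1),
        )

    def forward(self, rgba: torch.Tensor) -> torch.Tensor:
        rgb, alpha = rgba[:, :3], rgba[:, 3:4]
        z_rgb = self.rgb_encoder(rgb)      # (B, latent_dim, H/8, W/8)
        z_alpha = self.alpha_proj(alpha)   # alpha cues at latent resolution
        return z_rgb + z_alpha             # alpha-aware latent

# Quick shape check with a stand-in encoder:
# enc = TransparentVAEEncoder(nn.Conv2d(3, 4, 3, stride=8, padding=1))
# enc(torch.randn(1, 4, 512, 512)).shape  # -> torch.Size([1, 4, 64, 64])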


Overview of our global-layer interactive mechanism, which is composed of a layer cross-attention reweighting module and a partial joint self-attention module. The layer cross-attention reweighting module extracts the cross-attention map from the global branch and uses it to reweight the cross-attention map of the layer branch, guiding the placement of foreground layers. The partial joint self-attention module shares attention across the global and layer branches to facilitate context-aware feature modeling.
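A minimal PyTorch sketch of these two modules follows. It is an assumption-labeled illustration rather than the paper's exact formulation: the function names, tensor shapes, and the specific reweighting rule (normalizing the global cross-attention map into a per-text-token spatial mask and applying it elementwise to the layer map) are our own choices.

import torch
import torch.nn.functional as F

def layer_cross_attn_reweight(attn_layer, attn_global, eps=1e-8):
    """Reweight the layer branch's text-to-image cross-attention using the
    global branch's map, steering foreground layers toward the global layout.
    attn_*: (B, heads, image_tokens, text_tokens), softmaxed over text tokens."""
    # Turn the global map into a per-text-token spatial mask in [0, 1].
    mask = attn_global / (attn_global.amax(dim=2, keepdim=True) + eps)
    reweighted = attn_layer * mask
    # Renormalize so each image token's attention again sums to 1.
    return reweighted / (reweighted.sum(dim=-1, keepdim=True) + eps)

def partial_joint_self_attention(q_layer, k_layer, v_layer, k_global, v_global):
    """Layer tokens attend over concatenated layer + global tokens, giving
    each layer access to the shared global context during generation."""
    k = torch.cat([k_layer, k_global], dim=2)  # (B, heads, N_l + N_g, d)
    v = torch.cat([v_layer, v_global], dim=2)
    return F.scaled_dot_product_attention(q_layer, k, v)

In this reading, the global branch acts as a layout prior for each layer, while the shared attention lets layers pick up shadows, reflections, and other cross-layer cues from the common context.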


Comparisons



Citation


@article{huang2025psdiffusion,
  title={{PSDiffusion}: Harmonized Multi-Layer Image Generation via Layout and Appearance Alignment},
  author={Huang, Dingbang and Li, Wenbo and Zhao, Yifei and Pan, Xinyu and Wang, Chun and Zeng, Yanhong and Dai, Bo},
  journal={arXiv preprint arXiv:2505.11468},
  year={2025}
}