PSDiffusion: Harmonized Multi-Layer Image Generation via
Layout and Appearance Alignment

WACV 2026


Dingbang Huang1  Wenbo Li1  Yifei Zhao1  Xinyu Pan2  Chun Wang3  Yanhong Zeng4  Bo Dai5

1Shanghai Jiao Tong University   2The Chinese University of Hong Kong   
3Zhejiang University   4Shanghai AI Laboratory   5The University of Hong Kong

TL;DR


We present PSDiffusion, an end-to-end diffusion framework for simultaneous multi-layer image generation. Leveraging a global-layer interactive mechanism, our model generates layered images concurrently and collaboratively, ensuring not only high quality and completeness for each layer, but also plausible spatial and visual interactions among layers for global coherence.


Abstract


Transparent image layer generation plays a significant role in digital art and design workflows. Existing methods typically decompose transparent layers from a single RGB image using a set of tools or generate multiple transparent layers sequentially. Despite some promising results, these methods are often limited in their ability to model global layouts, physically plausible interactions, and visual effects such as shadows and reflections with high alpha quality, because the layers share little global context. To address this issue, we propose PSDiffusion, a unified diffusion framework that leverages image composition priors from a pre-trained image diffusion model for simultaneous multi-layer text-to-image generation. Specifically, our method introduces a global-layer interaction mechanism to generate layered images collaboratively, ensuring both individual layer quality and coherent spatial and visual relationships across layers. Extensive experiments on benchmark datasets demonstrate that PSDiffusion outperforms existing methods in generating multi-layer images with plausible structure and enhanced visual fidelity.


Method


Overview of our PSDiffusion. Our framework produces multi-layer images through three key components: (1) a transparent VAE encoder that preserves alpha channels; (2) layer cross-attention reweighting for plausible layouts; (3) partial joint self-attention for inter-layer context modeling.
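To make the first component concrete, below is a minimal PyTorch sketch of an alpha-preserving encoder. This is an illustration under our own assumptions, not the released implementation: the class name TransparentVAEEncoder, the alpha_proj branch, and the additive latent fusion are hypothetical choices for carrying alpha information alongside a frozen pre-trained RGB VAE encoder.

import torch
import torch.nn as nn

class TransparentVAEEncoder(nn.Module):
    """Hypothetical sketch: a frozen pre-trained RGB VAE encoder plus a
    small learned branch that projects the alpha channel into the same
    latent space, so transparency survives encoding."""

    def __init__(self, rgb_encoder: nn.Module, latent_dim: int = 4):
        super().__init__()
        self.rgb_encoder = rgb_encoder
        for p in self.rgb_encoder.parameters():
            p.requires_grad_(False)  # keep the pre-trained encoder frozen
        # Downsample the 1-channel alpha map to the latent resolution
        # (1/8 for SD-style VAEs) and project it to latent_dim channels.
        self.alpha_proj = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=8, padding=1),
            nn.SiLU(),
            nn.Conv2d(32, latent_dim, kernel_size=3, padding=1),
        )

    def forward(self, rgba: torch.Tensor) -> torch.Tensor:
        rgb, alpha = rgba[:, :3], rgba[:, 3:4]
        z_rgb = self.rgb_encoder(rgb)      # (B, latent_dim, H/8, W/8)
        z_alpha = self.alpha_proj(alpha)   # alpha cues at latent resolution
        return z_rgb + z_alpha             # alpha-aware latent

# Quick shape check with a stand-in encoder:
# enc = TransparentVAEEncoder(nn.Conv2d(3, 4, 3, stride=8, padding=1))
# enc(torch.randn(1, 4, 512, 512)).shape  # -> torch.Size([1, 4, 64, 64])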


Overview of our global-layer interactive mechanism, which is composed of a layer cross-attention reweighting module and a partial joint self-attention module. The layer cross-attention reweighting module extracts the cross-attention map from the global branch and uses it to reweight the cross-attention map of the layer branch, guiding the placement of foreground layers. The partial joint self-attention module shares attention across the global and layer branches to facilitate context-aware feature modeling.
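A minimal PyTorch sketch of these two modules follows. It is an assumption-labeled illustration rather than the paper's exact formulation: the function names, tensor shapes, and the specific reweighting rule (normalizing the global cross-attention map into a per-text-token spatial mask and applying it elementwise to the layer map) are our own choices.

import torch
import torch.nn.functional as F

def layer_cross_attn_reweight(attn_layer, attn_global, eps=1e-8):
    """Reweight the layer branch's text-to-image cross-attention using the
    global branch's map, steering foreground layers toward the global layout.
    attn_*: (B, heads, image_tokens, text_tokens), softmaxed over text tokens."""
    # Turn the global map into a per-text-token spatial mask in [0, 1].
    mask = attn_global / (attn_global.amax(dim=2, keepdim=True) + eps)
    reweighted = attn_layer * mask
    # Renormalize so each image token's attention again sums to 1.
    return reweighted / (reweighted.sum(dim=-1, keepdim=True) + eps)

def partial_joint_self_attention(q_layer, k_layer, v_layer, k_global, v_global):
    """Layer tokens attend over concatenated layer + global tokens, giving
    each layer access to the shared global context during generation."""
    k = torch.cat([k_layer, k_global], dim=2)  # (B, heads, N_l + N_g, d)
    v = torch.cat([v_layer, v_global], dim=2)
    return F.scaled_dot_product_attention(q_layer, k, v)

In this reading, the global branch acts as a layout prior for each layer, while the shared attention lets layers pick up shadows, reflections, and other cross-layer cues from the common context.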


Comparisons



Citation


@article{huang2025psdiffusion,
  title={{PSDiffusion}: Harmonized Multi-Layer Image Generation via Layout and Appearance Alignment},
  author={Huang, Dingbang and Li, Wenbo and Zhao, Yifei and Pan, Xinyu and Wang, Chun and Zeng, Yanhong and Dai, Bo},
  journal={arXiv preprint arXiv:2505.11468},
  year={2025}
}