Article: AnyDoor: Zero-shot Image Customization with Region-to-region Reference
| Title | AnyDoor: Zero-shot Image Customization with Region-to-region Reference |
|---|---|
| Authors | Chen, Xi; Huang, Lianghua; Liu, Yu; Shen, Yujun; Zhao, Deli; Zhao, Hengshuang |
| Keywords | Diffusion Model; Image Composition; Image Customization; Image Editing; Image Generation |
| Issue Date | 25-Apr-2025 |
| Publisher | Institute of Electrical and Electronics Engineers |
| Citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025 |
| Abstract | This work presents AnyDoor, a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations with desired shapes. Instead of tuning parameters for each object, our model is trained only once and effortlessly generalizes to diverse object-scene combinations at the inference stage. Such a challenging zero-shot setting requires an adequate characterization of a certain object. To this end, we leverage the powerful self-supervised image encoder (i.e., DINOv2) to extract the discriminative identity feature of the target object. Besides, we complement the identity feature with detail features, which are carefully designed to maintain appearance details yet allow versatile local variations (e.g., lighting, orientation, posture, etc.), supporting the object in favorably blending with different surroundings. We further propose to borrow knowledge from video datasets, where we can observe various forms (i.e., along the time axis) of a single object, leading to stronger model generalizability and robustness. Starting from the task of object insertion, we further extend the framework of AnyDoor to a general solution with region-to-region image reference. With the different definitions of the source region and target region, the tasks of object insertion, object removal, and image variation could be integrated into one model without introducing extra parameters. In addition, we investigate incorporating other conditions like the mask, pose skeleton, and depth map as additional guidance to achieve more controllable generation. |
| Persistent Identifier | http://hdl.handle.net/10722/362092 |
| ISSN | 0162-8828 |
| Journal Metrics | 2023 Impact Factor: 20.8; 2023 SCImago Journal Rank: 6.158 |
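
The abstract describes extracting a discriminative identity feature for the target object with the self-supervised DINOv2 encoder. As a rough illustration only, the sketch below loads the publicly released DINOv2 backbone and computes such a global embedding for a cropped object image; the model variant, preprocessing, and file name are assumptions for illustration, not the authors' pipeline.

```python
# A minimal sketch (not the authors' released code): computing a DINOv2
# identity embedding for a cropped target object. The model variant,
# preprocessing, and file name below are assumptions for illustration.
import torch
from PIL import Image
from torchvision import transforms

# Load a pretrained DINOv2 ViT-B/14 backbone from the official hub.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
model.eval()

# Standard ImageNet normalization; input side must be a multiple of 14.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# "object_crop.png" is a hypothetical segmented crop of the target object.
image = Image.open("object_crop.png").convert("RGB")
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    # The global class-token embedding acts as the identity feature.
    identity_feature = model(batch)  # shape: (1, 768) for ViT-B/14

print(identity_feature.shape)
```

In the paper's framing, such an identity embedding conditions the diffusion generator, while separate detail features preserve appearance under local variations; the sketch covers only the encoder step.
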
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Chen, Xi | - |
| dc.contributor.author | Huang, Lianghua | - |
| dc.contributor.author | Liu, Yu | - |
| dc.contributor.author | Shen, Yujun | - |
| dc.contributor.author | Zhao, Deli | - |
| dc.contributor.author | Zhao, Hengshuang | - |
| dc.date.accessioned | 2025-09-19T00:31:50Z | - |
| dc.date.available | 2025-09-19T00:31:50Z | - |
| dc.date.issued | 2025-04-25 | - |
| dc.identifier.citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025 | - |
| dc.identifier.issn | 0162-8828 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/362092 | - |
| dc.description.abstract | This work presents AnyDoor, a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations with desired shapes. Instead of tuning parameters for each object, our model is trained only once and effortlessly generalizes to diverse object-scene combinations at the inference stage. Such a challenging zero-shot setting requires an adequate characterization of a certain object. To this end, we leverage the powerful self-supervised image encoder (i.e., DINOv2) to extract the discriminative identity feature of the target object. Besides, we complement the identity feature with detail features, which are carefully designed to maintain appearance details yet allow versatile local variations (e.g., lighting, orientation, posture, etc.), supporting the object in favorably blending with different surroundings. We further propose to borrow knowledge from video datasets, where we can observe various forms (i.e., along the time axis) of a single object, leading to stronger model generalizability and robustness. Starting from the task of object insertion, we further extend the framework of AnyDoor to a general solution with region-to-region image reference. With the different definitions of the source region and target region, the tasks of object insertion, object removal, and image variation could be integrated into one model without introducing extra parameters. In addition, we investigate incorporating other conditions like the mask, pose skeleton, and depth map as additional guidance to achieve more controllable generation. | - |
| dc.language | eng | - |
| dc.publisher | Institute of Electrical and Electronics Engineers | - |
| dc.relation.ispartof | IEEE Transactions on Pattern Analysis and Machine Intelligence | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject | Diffusion Model | - |
| dc.subject | Image Composition | - |
| dc.subject | Image Customization | - |
| dc.subject | Image Editing | - |
| dc.subject | Image Generation | - |
| dc.title | AnyDoor: Zero-shot Image Customization with Region-to-region Reference | - |
| dc.type | Article | - |
| dc.identifier.doi | 10.1109/TPAMI.2025.3562237 | - |
| dc.identifier.scopus | eid_2-s2.0-105003681546 | - |
| dc.identifier.eissn | 1939-3539 | - |
| dc.identifier.issnl | 0162-8828 | - |
