Postgraduate thesis: Towards object segmentation in the open world
| Title | Towards object segmentation in the open world |
|---|---|
| Authors | Wu, Jiannan (吳劍南) |
| Advisors | Luo, P; Wang, WP |
| Issue Date | 2024 |
| Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
| Citation | Wu, J. [吳劍南]. (2024). Towards object segmentation in the open world. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
| Abstract | Object segmentation is a fundamental capability in computer vision for pixel-level understanding of digital images and videos. Recent advancements in deep learning algorithms have enabled numerous applications across various domains, including autonomous driving, video surveillance, and medical diagnosis. However, existing object segmentation methods often assume that objects belong to predefined categories, which limits their capabilities in open-world scenarios. Therefore, developing robust and accurate object segmentation algorithms for more challenging and complex environments is crucial. This thesis addresses three key vision problems: (i) open-world instance segmentation using transformers, (ii) reference-based object segmentation tasks, and (iii) learning multimodal large language models for object segmentation and beyond. The first part of this thesis addresses the problem of open-world instance segmentation. We explore the architectures of recent DETR-based models and propose a network named SWORD. We analyze the limitations of existing closed-world and open-world models in open-world settings and provide solutions to address these issues. Specifically, we introduce a stop-gradient operation before the classification head, which not only enhances the network’s ability to identify novel objects but also integrates seamlessly with DETR-based models. Additionally, we present a novel contrastive learning framework designed to learn discriminative representations between objects and the background, which is crucial for achieving high precision. The second part of this thesis addresses the problem of referring video object segmentation. Previous methods either fail to consider object association across frames or require complex, multi-stage pipelines for object tracking. In contrast, we propose a simple, end-to-end network, termed ReferFormer, that simultaneously segments and tracks the referred objects in videos. Specifically, we leverage language expressions as the conditions for object queries to find the target object. These conditional queries are shared across all frames in the initial state and can be easily linked for object association. In the third part of this thesis, we study the use of both text and mask prompts for object segmentation and introduce the unified model UniRef++. By treating text and mask prompts as reference information, we design the UniFusion module to incorporate these reference features into the network using a unified attention mechanism. A unified transformer architecture is then adopted for instance-level segmentation. Our UniRef++ model is jointly trained on a wide range of image and video benchmarks, achieving competitive performance across four reference-based object segmentation tasks: referring image segmentation, few-shot segmentation, referring video object segmentation, and video object segmentation. The fourth part of this thesis considers using multimodal large language models (MLLMs) for a broad range of vision-language tasks beyond object segmentation. We introduce VisionLLM v2 and present a novel transmission mechanism called “super link” to connect the MLLM with task-specific decoders. This mechanism facilitates efficient information transfer and resolves training conflicts in multitask scenarios. VisionLLM v2 significantly broadens the application scope beyond visual question answering, excelling in open-ended, cross-domain vision tasks such as object localization, pose estimation, and image generation and editing. |
| Degree | Doctor of Philosophy |
| Subject | Computer vision |
| Dept/Program | Computer Science |
| Persistent Identifier | http://hdl.handle.net/10722/354722 |
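
The stop-gradient design described in the abstract is easy to illustrate. Below is a minimal, hypothetical PyTorch sketch of a DETR-style prediction head that detaches the query embeddings before classification; the module names, dimensions, and binary objectness setup are assumptions for illustration, not the thesis's released SWORD code.

```python
# Minimal sketch of a stop-gradient before the classification head
# (assumptions: PyTorch, class-agnostic objectness, hidden_dim=256).
import torch
import torch.nn as nn

class DETRStyleHead(nn.Module):
    """DETR-style prediction head with a stop-gradient before classification."""

    def __init__(self, hidden_dim: int = 256, num_classes: int = 1):
        super().__init__()
        self.class_head = nn.Linear(hidden_dim, num_classes)  # object vs. background
        self.box_head = nn.Linear(hidden_dim, 4)              # (cx, cy, w, h)

    def forward(self, queries: torch.Tensor):
        # queries: (batch, num_queries, hidden_dim) decoder outputs.
        # Stop-gradient: the classification loss sees detached features, so
        # it cannot reshape decoder features toward suppressing objects that
        # happen to be unlabeled in the training set.
        logits = self.class_head(queries.detach())
        boxes = self.box_head(queries).sigmoid()  # localization trains end-to-end
        return logits, boxes

head = DETRStyleHead()
queries = torch.randn(2, 100, 256, requires_grad=True)
logits, _ = head(queries)
logits.sum().backward()
print(queries.grad)  # None: no gradient flows back through the classifier
```

With the detach in place, gradients reach the decoder only through the box branch, which is consistent with the abstract's point that the operation drops into DETR-based models without changing anything else in the pipeline.
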
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | Luo, P | - |
| dc.contributor.advisor | Wang, WP | - |
| dc.contributor.author | Wu, Jiannan | - |
| dc.contributor.author | 吳劍南 | - |
| dc.date.accessioned | 2025-03-04T09:30:53Z | - |
| dc.date.available | 2025-03-04T09:30:53Z | - |
| dc.date.issued | 2024 | - |
| dc.identifier.citation | Wu, J. [吳劍南]. (2024). Towards object segmentation in the open world. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
| dc.identifier.uri | http://hdl.handle.net/10722/354722 | - |
| dc.description.abstract | Object segmentation is a fundamental capability in computer vision for pixel-level understanding of digital images and videos. Recent advancements in deep learning algorithms have enabled numerous applications across various domains, including autonomous driving, video surveillance, and medical diagnosis. However, existing object segmentation methods often assume that objects belong to predefined categories, which limits their capabilities in open-world scenarios. Therefore, developing robust and accurate object segmentation algorithms for more challenging and complex environments is crucial. This thesis addresses three key vision problems: (i) open-world instance segmentation using transformers, (ii) reference-based object segmentation tasks, and (iii) learning multimodal large language models for object segmentation and beyond. The first part of this thesis addresses the problem of open-world instance segmentation. We explore the architectures of recent DETR-based models and propose a network named SWORD. We analyze the limitations of existing closed-world and open-world models in open-world settings and provide solutions to address these issues. Specifically, we introduce a stop-gradient operation before the classification head, which not only enhances the network’s ability to identify novel objects but also integrates seamlessly with DETR-based models. Additionally, we present a novel contrastive learning framework designed to learn discriminative representations between objects and the background, which is crucial for achieving high precision. The second part of this thesis addresses the problem of referring video object segmentation. Previous methods either fail to consider object association across frames or require complex, multi-stage pipelines for object tracking. In contrast, we propose a simple, end-to-end network, termed ReferFormer, that simultaneously segments and tracks the referred objects in videos. Specifically, we leverage language expressions as the conditions for object queries to find the target object. These conditional queries are shared across all frames in the initial state and can be easily linked for object association. In the third part of this thesis, we study the use of both text and mask prompts for object segmentation and introduce the unified model UniRef++. By treating text and mask prompts as reference information, we design the UniFusion module to incorporate these reference features into the network using a unified attention mechanism. A unified transformer architecture is then adopted for instance-level segmentation. Our UniRef++ model is jointly trained on a wide range of image and video benchmarks, achieving competitive performance across four reference-based object segmentation tasks: referring image segmentation, few-shot segmentation, referring video object segmentation, and video object segmentation. The fourth part of this thesis considers using multimodal large language models (MLLMs) for a broad range of vision-language tasks beyond object segmentation. We introduce VisionLLM v2 and present a novel transmission mechanism called “super link” to connect the MLLM with task-specific decoders. This mechanism facilitates efficient information transfer and resolves training conflicts in multitask scenarios. VisionLLM v2 significantly broadens the application scope beyond visual question answering, excelling in open-ended, cross-domain vision tasks such as object localization, pose estimation, and image generation and editing. | - |
| dc.language | eng | - |
| dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
| dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
| dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject.lcsh | Computer vision | - |
| dc.title | Towards object segmentation in the open world | - |
| dc.type | PG_Thesis | - |
| dc.description.thesisname | Doctor of Philosophy | - |
| dc.description.thesislevel | Doctoral | - |
| dc.description.thesisdiscipline | Computer Science | - |
| dc.description.nature | published_or_final_version | - |
| dc.date.hkucongregation | 2025 | - |
| dc.identifier.mmsid | 991044911104303414 | - |
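
As a second illustration, the abstract's ReferFormer mechanism of language-conditioned object queries shared across frames can be sketched as follows; the pooling, projection, and tensor shapes here are illustrative assumptions rather than the thesis implementation.

```python
# Hypothetical sketch of language-conditioned queries shared across frames.
import torch
import torch.nn as nn

class ConditionalQueries(nn.Module):
    def __init__(self, hidden_dim: int = 256, num_queries: int = 5):
        super().__init__()
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        self.lang_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, lang_tokens: torch.Tensor, num_frames: int) -> torch.Tensor:
        # lang_tokens: (batch, num_tokens, hidden_dim) from any text encoder.
        sentence = self.lang_proj(lang_tokens.mean(dim=1))       # pooled expression
        queries = self.query_embed.weight.unsqueeze(0)           # (1, nq, hidden)
        cond = queries + sentence.unsqueeze(1)                   # condition on language
        # The same initial queries are shared by every frame, so the decoder
        # output at query index i in each frame refers to one instance and
        # per-frame predictions can be linked by index for tracking.
        return cond.unsqueeze(1).expand(-1, num_frames, -1, -1)  # (b, T, nq, hidden)

cq = ConditionalQueries()
tokens = torch.randn(2, 12, 256)       # e.g., 12 word tokens per expression
print(cq(tokens, num_frames=8).shape)  # torch.Size([2, 8, 5, 256])
```

Because association falls out of the shared query indices, no separate multi-stage tracking pipeline is needed, which is the simplification the abstract emphasizes.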
