Postgraduate thesis: Towards object segmentation in the open world
| Title | Towards object segmentation in the open world |
|---|---|
| Authors | Wu, Jiannan (吳劍南) |
| Advisors | Luo, P; Wang, WP |
| Issue Date | 2024 |
| Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
| Citation | Wu, J. [吳劍南]. (2024). Towards object segmentation in the open world. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
| Abstract | Object segmentation is a fundamental capability in computer vision for pixel-level understanding of digital images and videos. Recent advancements in deep learning algorithms have enabled numerous applications across various domains, including autonomous driving, video surveillance, and medical diagnosis. However, existing object segmentation methods often assume that objects belong to predefined categories, which limits their capabilities in open-world scenarios. Therefore, developing robust and accurate object segmentation algorithms for more challenging and complex environments is crucial. This thesis addresses three key vision problems: (i) open-world instance segmentation using transformers, (ii) reference-based object segmentation tasks, and (iii) learning multimodal large language models for object segmentation and beyond. The first part of this thesis addresses the problem of open-world instance segmentation. We explore the architectures of recent DETR-based models and propose a network named SWORD. We analyze the limitations of existing closed-world and open-world models in open-world settings and provide solutions to address these issues. Specifically, we introduce a stop-gradient operation before the classification head, which not only enhances the network’s ability to identify novel objects but also integrates seamlessly with DETR-based models. Additionally, we present a novel contrastive learning framework designed to learn discriminative representations between objects and the background, which is crucial for achieving high precision. The second part of this thesis addresses the problem of referring video object segmentation. Previous methods either fail to consider object association across frames or require complex, multi-stage pipelines for object tracking. In contrast, we propose a simple, end-to-end network, termed ReferFormer, that simultaneously segments and tracks the referred objects in videos. Specifically, we leverage language expressions as the conditions for object queries to find the target object. These conditional queries are shared across all frames in the initial state and can be easily linked for object association. In the third part of this thesis, we study the use of both text and mask prompts for object segmentation and introduce the unified model UniRef++. By treating text and mask prompts as reference information, we design the UniFusion module to incorporate these reference features into the network using a unified attention mechanism. A unified transformer architecture is then adopted for instance-level segmentation. Our UniRef++ model is jointly trained on a wide range of image and video benchmarks, achieving competitive performance across four reference-based object segmentation tasks: referring image segmentation, few-shot segmentation, referring video object segmentation, and video object segmentation. The fourth part of this thesis considers using multimodal large language models (MLLMs) for a broad range of vision-language tasks beyond object segmentation. We introduce VisionLLM v2 and present a novel transmission mechanism called “super link” to connect the MLLM with task-specific decoders. This mechanism facilitates efficient information transfer and resolves training conflicts in multitask scenarios. VisionLLM v2 significantly broadens the application scope beyond visual question answering, excelling in open-ended, cross-domain vision tasks such as object localization, pose estimation, and image generation and editing. |
| Degree | Doctor of Philosophy |
| Subject | Computer vision |
| Dept/Program | Computer Science |
| Persistent Identifier | http://hdl.handle.net/10722/354722 |
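
The stop-gradient design described in the abstract is easy to illustrate. Below is a minimal, hypothetical PyTorch sketch of a DETR-style prediction head that detaches the query embeddings before classification; the module names, dimensions, and binary objectness setup are assumptions for illustration, not the thesis's released SWORD code.

```python
# Minimal sketch of a stop-gradient before the classification head
# (assumptions: PyTorch, class-agnostic objectness, hidden_dim=256).
import torch
import torch.nn as nn

class DETRStyleHead(nn.Module):
    """DETR-style prediction head with a stop-gradient before classification."""

    def __init__(self, hidden_dim: int = 256, num_classes: int = 1):
        super().__init__()
        self.class_head = nn.Linear(hidden_dim, num_classes)  # object vs. background
        self.box_head = nn.Linear(hidden_dim, 4)              # (cx, cy, w, h)

    def forward(self, queries: torch.Tensor):
        # queries: (batch, num_queries, hidden_dim) decoder outputs.
        # Stop-gradient: the classification loss sees detached features, so
        # it cannot reshape decoder features toward suppressing objects that
        # happen to be unlabeled in the training set.
        logits = self.class_head(queries.detach())
        boxes = self.box_head(queries).sigmoid()  # localization trains end-to-end
        return logits, boxes

head = DETRStyleHead()
queries = torch.randn(2, 100, 256, requires_grad=True)
logits, _ = head(queries)
logits.sum().backward()
print(queries.grad)  # None: no gradient flows back through the classifier
```

With the detach in place, gradients reach the decoder only through the box branch, which is consistent with the abstract's point that the operation drops into DETR-based models without changing anything else in the pipeline.
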
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | Luo, P | - |
| dc.contributor.advisor | Wang, WP | - |
| dc.contributor.author | Wu, Jiannan | - |
| dc.contributor.author | 吳劍南 | - |
| dc.date.accessioned | 2025-03-04T09:30:53Z | - |
| dc.date.available | 2025-03-04T09:30:53Z | - |
| dc.date.issued | 2024 | - |
| dc.identifier.citation | Wu, J. [吳劍南]. (2024). Towards object segmentation in the open world. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
| dc.identifier.uri | http://hdl.handle.net/10722/354722 | - |
| dc.description.abstract | Object segmentation is a fundamental capability in computer vision for pixel-level understanding of digital images and videos. Recent advancements in deep learning algorithms have enabled numerous applications across various domains, including autonomous driving, video surveillance, and medical diagnosis. However, existing object segmentation methods often assume that objects belong to predefined categories, which limits their capabilities in open-world scenarios. Therefore, developing robust and accurate object segmentation algorithms for more challenging and complex environments is crucial. This thesis addresses three key vision problems: (i) open-world instance segmentation using transformers, (ii) reference-based object segmentation tasks, and (iii) learning multimodal large language models for object segmentation and beyond. The first part of this thesis addresses the problem of open-world instance segmentation. We explore the architectures of recent DETR-based models and propose a network named SWORD. We analyze the limitations of existing closed-world and open-world models in open-world settings and provide solutions to address these issues. Specifically, we introduce a stop-gradient operation before the classification head, which not only enhances the network’s ability to identify novel objects but also integrates seamlessly with DETR-based models. Additionally, we present a novel contrastive learning framework designed to learn discriminative representations between objects and the background, which is crucial for achieving high precision. The second part of this thesis addresses the problem of referring video object segmentation. Previous methods either fail to consider object association across frames or require complex, multi-stage pipelines for object tracking. In contrast, we propose a simple, end-to-end network, termed ReferFormer, that simultaneously segments and tracks the referred objects in videos. Specifically, we leverage language expressions as the conditions for object queries to find the target object. These conditional queries are shared across all frames in the initial state and can be easily linked for object association. In the third part of this thesis, we study the use of both text and mask prompts for object segmentation and introduce the unified model UniRef++. By treating text and mask prompts as reference information, we design the UniFusion module to incorporate these reference features into the network using a unified attention mechanism. A unified transformer architecture is then adopted for instance-level segmentation. Our UniRef++ model is jointly trained on a wide range of image and video benchmarks, achieving competitive performance across four reference-based object segmentation tasks: referring image segmentation, few-shot segmentation, referring video object segmentation, and video object segmentation. The fourth part of this thesis considers using multimodal large language models (MLLMs) for a broad range of vision-language tasks beyond object segmentation. We introduce VisionLLM v2 and present a novel transmission mechanism called “super link” to connect the MLLM with task-specific decoders. This mechanism facilitates efficient information transfer and resolves training conflicts in multitask scenarios. VisionLLM v2 significantly broadens the application scope beyond visual question answering, excelling in open-ended, cross-domain vision tasks such as object localization, pose estimation, and image generation and editing. | - |
| dc.language | eng | - |
| dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
| dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
| dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject.lcsh | Computer vision | - |
| dc.title | Towards object segmentation in the open world | - |
| dc.type | PG_Thesis | - |
| dc.description.thesisname | Doctor of Philosophy | - |
| dc.description.thesislevel | Doctoral | - |
| dc.description.thesisdiscipline | Computer Science | - |
| dc.description.nature | published_or_final_version | - |
| dc.date.hkucongregation | 2025 | - |
| dc.identifier.mmsid | 991044911104303414 | - |
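
As a second illustration, the abstract's ReferFormer mechanism of language-conditioned object queries shared across frames can be sketched as follows; the pooling, projection, and tensor shapes here are illustrative assumptions rather than the thesis implementation.

```python
# Hypothetical sketch of language-conditioned queries shared across frames.
import torch
import torch.nn as nn

class ConditionalQueries(nn.Module):
    def __init__(self, hidden_dim: int = 256, num_queries: int = 5):
        super().__init__()
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        self.lang_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, lang_tokens: torch.Tensor, num_frames: int) -> torch.Tensor:
        # lang_tokens: (batch, num_tokens, hidden_dim) from any text encoder.
        sentence = self.lang_proj(lang_tokens.mean(dim=1))       # pooled expression
        queries = self.query_embed.weight.unsqueeze(0)           # (1, nq, hidden)
        cond = queries + sentence.unsqueeze(1)                   # condition on language
        # The same initial queries are shared by every frame, so the decoder
        # output at query index i in each frame refers to one instance and
        # per-frame predictions can be linked by index for tracking.
        return cond.unsqueeze(1).expand(-1, num_frames, -1, -1)  # (b, T, nq, hidden)

cq = ConditionalQueries()
tokens = torch.randn(2, 12, 256)       # e.g., 12 word tokens per expression
print(cq(tokens, num_frames=8).shape)  # torch.Size([2, 8, 5, 256])
```

Because association falls out of the shared query indices, no separate multi-stage tracking pipeline is needed, which is the simplification the abstract emphasizes.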
