
Postgraduate thesis: Towards object segmentation in the open world

Title: Towards object segmentation in the open world
Authors: Wu, Jiannan (吳劍南)
Advisors: Luo, P; Wang, WP
Issue Date: 2024
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Wu, J. [吳劍南]. (2024). Towards object segmentation in the open world. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Object segmentation is a fundamental capability in computer vision for pixel-level understanding of digital images and videos. Recent advances in deep learning have enabled numerous applications across various domains, including autonomous driving, video surveillance, and medical diagnosis. However, existing object segmentation methods often assume that objects belong to predefined categories, which limits their applicability in open-world scenarios. Developing robust and accurate object segmentation algorithms for more challenging and complex environments is therefore crucial. This thesis addresses three key vision problems: (i) open-world instance segmentation using transformers, (ii) reference-based object segmentation tasks, and (iii) learning multimodal large language models (MLLMs) for object segmentation and beyond.

The first part of this thesis addresses open-world instance segmentation. We explore the architecture of recent DETR-based models and propose a network named SWORD. We analyze the limitations of existing closed-world and open-world models in open-world settings and provide solutions to these issues. Specifically, we introduce a stop-gradient operation before the classification head, which not only enhances the network's ability to identify novel objects but also integrates seamlessly with DETR-based models. Additionally, we present a novel contrastive learning framework that learns discriminative representations between objects and the background, which is crucial for achieving high precision.

The second part addresses referring video object segmentation. Previous methods either fail to consider object association across frames or require complex, multi-stage pipelines for object tracking. In contrast, we propose a simple, end-to-end network, termed ReferFormer, that simultaneously segments and tracks the referred objects in videos. Specifically, we use language expressions as conditions for object queries to find the target object. These conditional queries are shared across all frames in the initial state and can be easily linked for object association.

In the third part, we study the use of both text and mask prompts for object segmentation and introduce the unified model UniRef++. Treating text and mask prompts as reference information, we design the UniFusion module to incorporate these reference features into the network through a unified attention mechanism. A unified transformer architecture is then adopted for instance-level segmentation. UniRef++ is jointly trained on a wide range of image and video benchmarks, achieving competitive performance across four reference-based object segmentation tasks: referring image segmentation, few-shot segmentation, referring video object segmentation, and video object segmentation.

The fourth part considers multimodal large language models for a broad range of vision-language tasks beyond object segmentation. We introduce VisionLLM v2 and present a novel transmission mechanism called "super link" to connect the MLLM with task-specific decoders. This mechanism enables efficient information transfer and resolves training conflicts in multi-task scenarios. VisionLLM v2 significantly broadens the application scope beyond visual question answering, excelling in open-ended, cross-domain vision tasks such as object localization, pose estimation, and image generation and editing.
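
The stop-gradient operation described in the first part is simple to illustrate. Below is a minimal PyTorch sketch, not the thesis code: module and parameter names are assumptions, and the head is reduced to two linear layers. The point is that the classification branch receives detached query embeddings, so classification gradients never flow back into the shared decoder representation, while the localization branch still trains it.

```python
import torch
import torch.nn as nn

class StopGradHead(nn.Module):
    """Sketch of a DETR-style prediction head with stop-gradient
    before classification (names are illustrative, not SWORD's)."""

    def __init__(self, hidden_dim: int = 256, num_classes: int = 1):
        super().__init__()
        self.cls_head = nn.Linear(hidden_dim, num_classes)  # class-agnostic objectness
        self.box_head = nn.Linear(hidden_dim, 4)            # box regression

    def forward(self, queries: torch.Tensor):
        # queries: (batch, num_queries, hidden_dim) from a DETR-style decoder
        logits = self.cls_head(queries.detach())   # stop-gradient before classification
        boxes = self.box_head(queries).sigmoid()   # gradients flow normally here
        return logits, boxes
```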
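The contrastive learning framework of the first part can likewise be sketched generically. The following is a standard InfoNCE-style formulation, assumed here for illustration rather than taken from the thesis: object embeddings are pulled toward an object prototype and pushed away from background embeddings, which encourages the discriminative object-vs-background representations the abstract describes.

```python
import torch
import torch.nn.functional as F

def object_background_contrast(obj_emb: torch.Tensor,
                               bg_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Generic InfoNCE-style loss separating objects from background."""
    obj = F.normalize(obj_emb, dim=-1)                          # (n_obj, dim)
    bg = F.normalize(bg_emb, dim=-1)                            # (n_bg, dim)
    proto = F.normalize(obj.mean(dim=0, keepdim=True), dim=-1)  # (1, dim) object prototype
    pos = obj @ proto.t() / temperature                         # (n_obj, 1) similarity to prototype
    neg = obj @ bg.t() / temperature                            # (n_obj, n_bg) similarity to background
    logits = torch.cat([pos, neg], dim=1)                       # positive sits at index 0
    labels = torch.zeros(obj.size(0), dtype=torch.long, device=obj.device)
    return F.cross_entropy(logits, labels)
```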
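For the second part, the idea of language-conditioned object queries can be made concrete with a short sketch in the spirit of ReferFormer; the class, projection, and shapes below are assumptions, not the released implementation. A pooled sentence embedding conditions a small set of learnable queries, and the same conditional queries are repeated for every frame, so per-frame outputs of the same query index can be linked directly for association.

```python
import torch
import torch.nn as nn

class ConditionalQueries(nn.Module):
    """Illustrative language-conditioned object queries shared across frames."""

    def __init__(self, hidden_dim: int = 256, num_queries: int = 5):
        super().__init__()
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        self.lang_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, sentence_feat: torch.Tensor, num_frames: int):
        # sentence_feat: (batch, hidden_dim) pooled language feature
        cond = self.lang_proj(sentence_feat).unsqueeze(1)      # (batch, 1, dim)
        queries = self.query_embed.weight.unsqueeze(0) + cond  # (batch, num_queries, dim)
        # Repeat the same conditional queries for every frame of the clip.
        return queries.unsqueeze(1).expand(-1, num_frames, -1, -1)
```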
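The UniFusion module of the third part can be sketched as a single shared cross-attention; the module below is a hedged reconstruction from the abstract's description, not UniRef++'s code. Visual features attend to reference features, whether those come from text tokens or mask-pooled features of a reference frame, so both prompt types enter the network through the same path.

```python
import torch
import torch.nn as nn

class UniFusion(nn.Module):
    """Sketch: fuse text or mask reference features into visual features
    with one shared cross-attention (names are assumptions)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, reference: torch.Tensor):
        # visual:    (batch, hw, dim) flattened image/frame features
        # reference: (batch, n,  dim) text tokens or mask-pooled features
        fused, _ = self.attn(query=visual, key=reference, value=reference)
        return self.norm(visual + fused)  # residual fusion of reference information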
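Finally, the "super link" of the fourth part can be read as a routing mechanism between the MLLM and task decoders. The sketch below is speculative beyond what the abstract states: it assumes hidden states emitted at special routing tokens are projected per task and handed to the matching decoder, so each task's gradients reach the MLLM only through its own link.

```python
import torch
import torch.nn as nn

class SuperLink(nn.Module):
    """Hypothetical routing from MLLM hidden states to task decoders."""

    def __init__(self, llm_dim: int = 4096, dec_dim: int = 256):
        super().__init__()
        self.link_proj = nn.ModuleDict({
            "segmentation": nn.Linear(llm_dim, dec_dim),
            "pose": nn.Linear(llm_dim, dec_dim),
        })

    def forward(self, hidden_states: torch.Tensor,
                link_positions: torch.Tensor, task: str):
        # hidden_states:  (batch, seq, llm_dim) MLLM outputs
        # link_positions: (batch, num_links) indices of the routing tokens
        idx = link_positions.unsqueeze(-1).expand(-1, -1, hidden_states.size(-1))
        link_states = hidden_states.gather(1, idx)  # (batch, num_links, llm_dim)
        return self.link_proj[task](link_states)    # decoder-ready embeddings
```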
Degree: Doctor of Philosophy
Subject: Computer vision
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/354722

DC Field: Value
dc.contributor.advisor: Luo, P
dc.contributor.advisor: Wang, WP
dc.contributor.author: Wu, Jiannan
dc.contributor.author: 吳劍南
dc.date.accessioned: 2025-03-04T09:30:53Z
dc.date.available: 2025-03-04T09:30:53Z
dc.date.issued: 2024
dc.identifier.citation: Wu, J. [吳劍南]. (2024). Towards object segmentation in the open world. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/354722
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Computer vision
dc.title: Towards object segmentation in the open world
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2025
dc.identifier.mmsid: 991044911104303414
