
Postgraduate thesis: Deep learning for dense visual predictions

Title: Deep learning for dense visual predictions
Authors: Xie, Enze [谢恩泽]
Advisor(s): Luo, P; Wang, WP
Issue Date: 2022
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Xie, E. [谢恩泽]. (2022). Deep learning for dense visual predictions. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract:
Understanding the content of digital images is important to human life and has a wide range of real-world applications. In particular, understanding each pixel of an image is essential for applications such as autonomous driving, face verification, and home robotics. It is therefore crucial to develop accurate, efficient, and robust methods for dense visual prediction beyond image-level classification, including semantic segmentation, object detection, and instance segmentation. This thesis tackles three dense visual prediction problems: polar representation for instance segmentation, self-supervised object detection, and robust semantic segmentation with Transformers.

The first part addresses instance segmentation. Existing approaches often solve it in "two stages", detecting bounding boxes in the first stage and performing pixel-level segmentation inside each box in the second, which results in complicated designs and low efficiency. In contrast, we solve instance segmentation in polar coordinates and propose a deep learning framework, PolarMask, that formulates an instance mask with a polar representation. We define the center of mass of each object as the origin of the polar coordinate system and emit a set of rays from the center to the contour at evenly spaced angles. During training, PolarMask learns the location of the object center and the length of each ray; during testing, the mask prediction is obtained by assembling centers and rays. PolarMask is a single-shot, anchor-free instance segmentation framework and is much faster than previous two-stage methods.

The second part addresses self-supervised pre-training and representation learning for instance-level detection tasks. Unlike most recent methods, which focus on improving image classification accuracy, we present a novel contrastive learning approach, DetCo, which fully exploits the contrasts between the global image and local image patches to learn discriminative representations for object detection. DetCo learns powerful general feature representations from massive unlabeled image data, substantially boosts downstream tasks such as object detection and multi-person pose estimation, and improves label efficiency.

The third part considers efficient, strong, and robust semantic segmentation with Transformers, in a framework termed SegFormer. SegFormer has two appealing features: 1) it comprises a novel hierarchically structured Transformer encoder that outputs multi-scale features and requires no positional encoding, thereby avoiding the interpolation of positional codes that degrades performance when the testing resolution differs from the training resolution; 2) it avoids complex decoders: a lightweight All-MLP decoder directly fuses the multi-level features and predicts the semantic segmentation mask. As a result, SegFormer sets a new state of the art in efficiency, accuracy, and robustness on several benchmarks. We are also the first to verify that Transformers have a much larger effective receptive field than ConvNets, and the first to demonstrate the excellent zero-shot robustness of Transformers on out-of-distribution data.
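To make the polar representation from the first part concrete, the following is a minimal sketch of PolarMask's mask-assembly step: given a predicted center and K ray lengths at evenly spaced angles, the ray endpoints form a contour polygon whose filled interior is the instance mask. The function name, the NumPy/OpenCV dependencies, and the rasterization via cv2.fillPoly are illustrative assumptions, not the thesis implementation.

import numpy as np
import cv2

def assemble_polar_mask(center, ray_lengths, image_shape):
    # PolarMask-style assembly (sketch): K rays leave the object center
    # at evenly spaced angles (2*pi/K apart); their endpoints form a
    # polygon whose filled interior is the binary instance mask.
    cx, cy = center
    rays = np.asarray(ray_lengths, dtype=np.float32)
    angles = np.linspace(0.0, 2.0 * np.pi, num=len(rays), endpoint=False)
    xs = cx + rays * np.cos(angles)   # contour x-coordinates
    ys = cy + rays * np.sin(angles)   # contour y-coordinates
    contour = np.stack([xs, ys], axis=1).round().astype(np.int32)
    mask = np.zeros(image_shape, dtype=np.uint8)
    cv2.fillPoly(mask, [contour], 1)  # rasterize the polygon interior
    return mask

# Example: 36 rays of length 40 around center (64, 64) give a near-circular mask.
mask = assemble_polar_mask((64, 64), [40.0] * 36, (128, 128))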
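The global-versus-local contrast in the second part can be pictured with a standard InfoNCE objective. The sketch below is a generic contrastive loss under assumed names and embedding sizes; DetCo's actual multi-level objective is richer, so treat this only as an illustration of the core idea.

import torch
import torch.nn.functional as F

def info_nce(queries, keys, temperature=0.07):
    # Generic InfoNCE: row i of `queries` should match row i of `keys`;
    # every other row in the batch acts as a negative.
    q = F.normalize(queries, dim=1)
    k = F.normalize(keys, dim=1)
    logits = q @ k.t() / temperature              # (B, B) cosine similarities
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

# DetCo-style pairing (illustrative): pull each image's global-view
# embedding toward an aggregate embedding of local patches cropped from
# the same image, so global and local features of one image agree.
global_emb = torch.randn(8, 128)  # hypothetical encoder outputs
local_emb = torch.randn(8, 128)
loss = info_nce(global_emb, local_emb)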
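For the third part, the lightweight All-MLP decoder in point 2) can be sketched as follows: each multi-scale encoder feature is linearly projected to a common width, upsampled to the finest (1/4) resolution, concatenated, fused by one more linear layer, and classified. The channel widths, the class count, and the use of 1x1 convolutions as per-pixel linear layers are assumptions for illustration, not the released SegFormer code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPDecoderSketch(nn.Module):
    def __init__(self, in_dims=(64, 128, 320, 512), embed_dim=256, num_classes=19):
        super().__init__()
        # 1x1 convolutions act as per-pixel linear (MLP) layers.
        self.proj = nn.ModuleList(nn.Conv2d(d, embed_dim, 1) for d in in_dims)
        self.fuse = nn.Conv2d(len(in_dims) * embed_dim, embed_dim, 1)
        self.classify = nn.Conv2d(embed_dim, num_classes, 1)

    def forward(self, feats):
        # feats: the four encoder feature maps at strides 4, 8, 16, 32.
        target = feats[0].shape[2:]  # spatial size of the stride-4 map
        ups = [F.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
               for p, f in zip(self.proj, feats)]
        fused = self.fuse(torch.cat(ups, dim=1))  # fuse all levels with one linear layer
        return self.classify(fused)               # per-class logits at 1/4 resolution

decoder = AllMLPDecoderSketch()
# Hypothetical stage outputs for a 256x256 input:
feats = [torch.randn(1, c, s, s) for c, s in zip((64, 128, 320, 512), (64, 32, 16, 8))]
logits = decoder(feats)  # shape: (1, 19, 64, 64)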
Degree: Doctor of Philosophy
Subjects: Deep learning (Machine learning); Digital images; Computer vision
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/322927

DC Field: Value
dc.contributor.advisor: Luo, P
dc.contributor.advisor: Wang, WP
dc.contributor.author: Xie, Enze
dc.contributor.author: 谢恩泽
dc.date.accessioned: 2022-11-18T10:41:50Z
dc.date.available: 2022-11-18T10:41:50Z
dc.date.issued: 2022
dc.identifier.citation: Xie, E. [谢恩泽]. (2022). Deep learning for dense visual predictions. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/322927
dc.description.abstract: (identical to the Abstract above)
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Deep learning (Machine learning)
dc.subject.lcsh: Digital images
dc.subject.lcsh: Computer vision
dc.title: Deep learning for dense visual predictions
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2022
dc.identifier.mmsid: 991044609104903414
