
Postgraduate thesis: Studies on attention modeling for visual understanding

Title: Studies on attention modeling for visual understanding
Authors: Chen, Runnan (陈润楠)
Advisors: Komura, T; Wang, WP
Issue Date: 2022
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Chen, R. [陈润楠]. (2022). Studies on attention modeling for visual understanding. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: The attention mechanism is an essential and complex cognitive ability of human beings: the capacity to focus selectively on the critical portion of information while ignoring what is irrelevant. Inspired by the human visual system, integrating attention mechanisms into deep learning is a promising way to improve performance and generalization. This thesis studies attention modeling for visual understanding tasks across different data modalities, including landmark detection in 2D X-ray images and 3D CBCT volumes, personalized highlight detection in videos, and zero-shot semantic segmentation of point clouds.

The first part addresses anatomical landmark detection in 2D X-ray images. Since different landmarks require different levels of resolution and semantics, the attention mechanism amplifies the features that matter for each individual landmark and suppresses the irrelevant ones. To this end, we propose a novel attentive feature pyramid fusion (AFPF) module that explicitly shapes high-resolution, semantically enhanced fusion features. Moreover, we combine heat maps and offset maps to perform pixel-wise regression voting. Extensive experiments show that our method achieves significantly higher accuracy than existing methods, and we demonstrate its generalization capability and stability on unseen data from diverse devices.

In the second part, we propose a novel Structure-Aware Long Short-Term Memory framework (SA-LSTM) for efficient and accurate 3D landmark detection, where the attention mechanism captures global-local dependence among cropped 3D patches. On the one hand, a novel graph attention module implicitly encodes the landmarks' global structure to rationalize the predicted positions. On the other hand, a novel attention-gated module recursively filters out irrelevant local features and retains high-confidence local predictions for aggregating the final result. Extensive experiments show that our method outperforms state-of-the-art methods in both accuracy and efficiency.

The third part explores how attention accesses referring information for personalized video highlight detection (P-VHD). Current methods treat a user's history as holistic information when predicting the user's preference, neglecting the inherent diversity of the user's interests and thus producing vague preference representations. We propose an attention-guided preference reasoning framework (PR-Net) that explicitly takes these diverse interests into account for frame-level highlight prediction. In addition, a new bi-directional contrastive loss is proposed to ensure a compact and differentiable metric space. With this design, our method achieves the best performance among the compared approaches.

The last part formulates attention weights as a new form of feature representation to handle zero-shot point cloud segmentation. Specifically, we propose a novel approach that learns the geometric primitives shared by objects of seen and unseen classes, transferring knowledge from seen to unseen categories. Each point's visual representation is formulated as the attention weights of its feature over the learnable geometric primitives. We then formulate the semantic representation as mixture-distributed embeddings for fine-grained matching with the visual representation. In this way, our method recognizes novel objects represented with geometric primitives under the guidance of the semantic representation. Our method significantly outperforms other methods on the S3DIS, ScanNet and SemanticKITTI datasets.
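The first part combines heat maps and offset maps for pixel-wise regression voting. The record contains no code, so the following is only a minimal illustrative sketch of that voting idea; the function name, tensor shapes, and confidence weighting are assumptions, not the thesis implementation.

import torch

def vote_landmark(heatmap, offset_x, offset_y):
    # heatmap: (H, W) confidence that the landmark lies near each pixel.
    # offset_x, offset_y: (H, W) regressed displacement (in pixels) from
    # each pixel to the landmark position.
    H, W = heatmap.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    # Each pixel casts a vote for an absolute landmark coordinate.
    vote_x = xs.float() + offset_x
    vote_y = ys.float() + offset_y
    # Weight every vote by heat-map confidence and aggregate.
    w = heatmap.clamp(min=0).flatten()
    w = w / (w.sum() + 1e-8)
    return ((vote_x.flatten() * w).sum().item(),
            (vote_y.flatten() * w).sum().item())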
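For the second part's graph attention module, which encodes the global structure among landmarks, a toy self-attention over K landmark embeddings conveys the idea; the class name, dimensions, and the residual update are assumptions rather than the SA-LSTM design.

import torch
import torch.nn as nn

class LandmarkGraphAttention(nn.Module):
    # Each landmark embedding attends to all others, so a landmark's
    # predicted position can be adjusted to stay consistent with the
    # global anatomical structure.
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, feats):  # feats: (K, dim), one row per landmark
        attn = torch.softmax(
            self.q(feats) @ self.k(feats).T / feats.shape[-1] ** 0.5, dim=-1)
        return feats + attn @ self.v(feats)  # residual structure-aware update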
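The third part's bi-directional contrastive loss is not defined in the abstract; one plausible reading is an InfoNCE-style loss applied in both directions between matched user-preference and highlight-frame embeddings. The sketch below is written under that assumption only.

import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(pref, frame, tau=0.07):
    # pref:  (B, D) user-preference embeddings.
    # frame: (B, D) highlight-frame embeddings; row i matches pref row i.
    pref = F.normalize(pref, dim=-1)
    frame = F.normalize(frame, dim=-1)
    logits = pref @ frame.T / tau  # (B, B); the diagonal holds positives
    target = torch.arange(pref.shape[0], device=pref.device)
    # Contrast in both directions: preference-to-frame and frame-to-preference.
    return 0.5 * (F.cross_entropy(logits, target) +
                  F.cross_entropy(logits.T, target))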
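Finally, the last part represents each point by its attention weights over learnable geometric primitives. A minimal sketch of that formulation follows; the shapes and the scaled dot-product scoring are assumptions for illustration.

import torch
import torch.nn.functional as F

def primitive_attention_representation(point_feats, primitives):
    # point_feats: (N, D) per-point backbone features.
    # primitives:  (M, D) learnable primitive embeddings, e.g. created as
    #              nn.Parameter(torch.randn(M, D)) and trained end to end.
    scores = point_feats @ primitives.T / primitives.shape[-1] ** 0.5  # (N, M)
    attn = F.softmax(scores, dim=-1)
    # The attention weights themselves serve as the visual representation;
    # because the primitives are shared across classes, this representation
    # can be matched against semantic embeddings of unseen classes.
    return attn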
Degree: Doctor of Philosophy
Subject: Computer vision; Image processing
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/327627


DC Field: Value
dc.contributor.advisor: Komura, T
dc.contributor.advisor: Wang, WP
dc.contributor.author: Chen, Runnan
dc.contributor.author: 陈润楠
dc.date.accessioned: 2023-04-04T03:02:42Z
dc.date.available: 2023-04-04T03:02:42Z
dc.date.issued: 2022
dc.identifier.citation: Chen, R. [陈润楠]. (2022). Studies on attention modeling for visual understanding. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/327627
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Computer vision
dc.subject.lcsh: Image processing
dc.title: Studies on attention modeling for visual understanding
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2023
dc.identifier.mmsid: 991044657077703414
