Title | Strengthening cross-interaction learning for vision networks |
---|---|
Authors | Fang, Yanwen (方艷雯) |
Advisors | Li, G |
Issue Date | 2023 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Fang, Y. [方艷雯]. (2023). Strengthening cross-interaction learning for vision networks. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | In recent years, the field of computer vision has advanced remarkably, driven by the success of vision networks such as CNNs and vision Transformers. A vision network is generally designed to learn various interactions between objects for different tasks; for example, learning the temporal interaction between different time steps is key to modeling time series data for prediction tasks. This thesis studies strengthening cross-interaction learning for vision networks in three aspects: cross-layer interaction in backbone models, intraperiod and intratrend temporal interactions in human motion, and person-person interaction in multi-person poses. To this end, the thesis proposes three approaches, all of which enhance the representation power of the networks and deliver notable performance.
Firstly, a new cross-layer attention mechanism, called multi-head recurrent layer attention (MRLA), is proposed to strengthen layer-wise interactions by retrieving query-related information from previous layers. To reduce the quadratic computation cost inherited from vanilla attention, a lightweight version of MRLA with linear complexity is further proposed, making cross-layer attention feasible for deeper networks. MRLA is devised as a plug-and-play module compatible with the two mainstream families of vision networks: CNNs and vision Transformers. Remarkable improvements brought by MRLA in image classification, object detection and instance segmentation on benchmark datasets demonstrate its effectiveness, showing that MRLA can enrich the representation power of many state-of-the-art vision networks by linking fine-grained features to global ones.
Secondly, this thesis explores intraperiod and intratrend interactions for human motion prediction. A new periodic-trend pose decomposition (PTPDecomp) block is proposed to decompose hidden pose sequences into period and trend components so that the temporal dependencies within each can be modeled separately. The PTPDecomp block cooperates with spatial and temporal GCNs, leading to an encoder-decoder framework called the Periodic-Trend Enhanced GCN (PTE-GCN). The encoder and decoder progressively eliminate or refine the long-term trend pattern while focusing on modeling the period pattern, which facilitates learning the intricate temporal relationships entangled in pose sequences. Experimental results on three benchmark datasets demonstrate that PTE-GCN surpasses state-of-the-art methods in both short-term and long-term prediction, especially for periodic actions such as walking in long-term forecasting.
Lastly, this thesis studies the interactions between the motion trajectories of highly interactive persons in the task of multi-person extreme motion prediction. A novel cross-query attention (XQA) module is proposed to bilaterally learn the cross-dependencies between two pose sequences. Additionally, a proxy unit is introduced to bridge the involved persons; it cooperates with the XQA module and subtly controls the bidirectional information flows. These designs are integrated into a Transformer-based architecture, yielding an end-to-end framework called the proxy-bridged game Transformer (PGformer) for multi-person motion prediction. Its effectiveness is evaluated on the challenging ExPI dataset, where PGformer consistently outperforms state-of-the-art methods in both short-term and long-term prediction. PGformer also works well with the weakly interacting CMU-Mocap and MuPoTS-3D datasets and achieves encouraging results. (Illustrative sketches of MRLA, PTPDecomp and XQA follow this table.) |
Degree | Doctor of Philosophy |
Subject | Computer vision; Neural networks (Computer science) |
Dept/Program | Statistics and Actuarial Science |
Persistent Identifier | http://hdl.handle.net/10722/335946 |
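The abstract's first contribution, MRLA, is described as cross-layer attention in which the current layer queries information retrieved from previous layers. Below is a minimal, illustrative PyTorch sketch of that idea only; the module name, projections, head count and residual wiring are assumptions for illustration, not the thesis's actual implementation, and the linear-complexity lightweight variant is not shown.

```python
# Illustrative sketch of cross-layer ("layer attention") retrieval: the current
# layer's features act as queries over keys/values gathered from earlier layers.
# All names and shapes are assumptions, not the thesis's code.
import torch
import torch.nn as nn


class CrossLayerAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj_q = nn.Linear(dim, dim)
        self.proj_kv = nn.Linear(dim, dim)

    def forward(self, current: torch.Tensor, previous: list) -> torch.Tensor:
        # current:  (B, N, C) features of the current layer
        # previous: list of (B, N, C) features from earlier layers
        q = self.proj_q(current)
        # Concatenate earlier layers along the token axis so the query can
        # retrieve information from any of them (quadratic in network depth).
        mem = self.proj_kv(torch.cat(previous + [current], dim=1))
        out, _ = self.attn(q, mem, mem)
        return current + out  # residual connection keeps the backbone behaviour


if __name__ == "__main__":
    layer_attn = CrossLayerAttention(dim=64)
    earlier = [torch.randn(2, 49, 64) for _ in range(3)]  # three earlier layers
    fused = layer_attn(torch.randn(2, 49, 64), earlier)
    print(fused.shape)  # torch.Size([2, 49, 64])
```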
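The second contribution decomposes hidden pose sequences into period and trend components. As a hedged sketch, the snippet below uses a simple moving-average split (trend = smoothed signal, period = residual); whether PTPDecomp uses this exact scheme is not stated in the record, and the function name and kernel size are hypothetical.

```python
# Illustrative periodic-trend decomposition of a pose feature sequence using a
# moving average; an assumed scheme, not necessarily the thesis's PTPDecomp.
import torch
import torch.nn.functional as F


def periodic_trend_decompose(x: torch.Tensor, kernel: int = 5):
    # x: (B, T, C) hidden pose sequence over T time steps
    pad = kernel // 2
    xt = x.transpose(1, 2)  # (B, C, T) for 1-D pooling along time
    # Average-pool along time with replicate padding to estimate the trend.
    trend = F.avg_pool1d(F.pad(xt, (pad, pad), mode="replicate"), kernel, stride=1)
    trend = trend.transpose(1, 2)  # back to (B, T, C)
    period = x - trend             # residual is taken as the periodic part
    return period, trend


if __name__ == "__main__":
    seq = torch.randn(2, 50, 66)  # e.g. 22 joints x 3 coordinates
    period, trend = periodic_trend_decompose(seq)
    print(period.shape, trend.shape)  # both torch.Size([2, 50, 66])
```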
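The third contribution, XQA with a proxy unit, is described as bilateral cross-attention between two persons' pose sequences mediated by a proxy. The sketch below assumes the proxy is a learnable embedding appended to each person's keys and values; all names and shapes are illustrative and may differ from PGformer's actual design.

```python
# Illustrative bilateral cross-attention between two persons' motion features,
# with a learnable "proxy" mediating the exchange; an assumed design sketch.
import torch
import torch.nn as nn


class CrossQueryAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.a_to_b = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.b_to_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Assumed proxy: a learnable token blended into each person's keys/values.
        self.proxy = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, xa: torch.Tensor, xb: torch.Tensor):
        # xa, xb: (B, T, C) motion features of the two interacting persons
        proxy = self.proxy.expand(xa.size(0), -1, -1)
        mem_a = torch.cat([xa, proxy], dim=1)
        mem_b = torch.cat([xb, proxy], dim=1)
        ya, _ = self.a_to_b(xa, mem_b, mem_b)  # person A queries person B (+ proxy)
        yb, _ = self.b_to_a(xb, mem_a, mem_a)  # person B queries person A (+ proxy)
        return xa + ya, xb + yb


if __name__ == "__main__":
    xqa = CrossQueryAttention(dim=64)
    a, b = torch.randn(2, 30, 64), torch.randn(2, 30, 64)
    out_a, out_b = xqa(a, b)
    print(out_a.shape, out_b.shape)  # both torch.Size([2, 30, 64])
```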
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Li, G | - |
dc.contributor.author | Fang, Yanwen | - |
dc.contributor.author | 方艷雯 | - |
dc.date.accessioned | 2023-12-29T04:05:04Z | - |
dc.date.available | 2023-12-29T04:05:04Z | - |
dc.date.issued | 2023 | - |
dc.identifier.citation | Fang, Y. [方艷雯]. (2023). Strengthening cross-interaction learning for vision networks. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/335946 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Computer vision | - |
dc.subject.lcsh | Neural networks (Computer science) | - |
dc.title | Strengthening cross-interaction learning for vision networks | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Statistics and Actuarial Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2024 | - |
dc.identifier.mmsid | 991044751040203414 | - |