Title | Strengthening cross-interaction learning for vision networks |
---|---|
Authors | Fang, Yanwen (方艷雯) |
Advisors | Li, G |
Issue Date | 2023 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Fang, Y. [方艷雯]. (2023). Strengthening cross-interaction learning for vision networks. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | In recent years, the field of computer vision has advanced remarkably, driven by the success of vision networks such as CNNs and vision Transformers. A vision network is generally designed to learn various interactions between objects for different tasks; for example, learning the temporal interaction between different time steps is key to modeling time series data for prediction tasks. This thesis studies strengthening cross-interaction learning for vision networks in three aspects: cross-layer interaction in backbone models, intraperiod and intratrend temporal interactions in human motion, and person-person interaction in multi-person poses. To this end, the thesis proposes three approaches, all of which enhance the representation power of the networks and deliver notable performance.
Firstly, a new cross-layer attention mechanism, called multi-head recurrent layer attention (MRLA), is proposed to strengthen layer-wise interactions by retrieving query-related information from previous layers. To reduce the quadratic computation cost inherited from vanilla attention, a lightweight version of MRLA with linear complexity is further proposed, making cross-layer attention feasible for deeper networks. MRLA is devised as a plug-and-play module compatible with the two mainstream families of vision networks: CNNs and vision Transformers. Remarkable improvements brought by MRLA in image classification, object detection and instance segmentation on benchmark datasets demonstrate its effectiveness, showing that MRLA can enrich the representation power of many state-of-the-art vision networks by linking fine-grained features to global ones.
Secondly, this thesis explores intraperiod and intratrend interactions for human motion prediction. A new periodic-trend pose decomposition (PTPDecomp) block is proposed to decompose hidden pose sequences into period and trend components so that the temporal dependencies within each can be modeled separately. The PTPDecomp block cooperates with spatial and temporal GCNs, leading to an encoder-decoder framework called the Periodic-Trend Enhanced GCN (PTE-GCN). The encoder and decoder progressively eliminate or refine the long-term trend pattern while focusing on modeling the period pattern, which facilitates learning the intricate temporal relationships entangled in pose sequences. Experimental results on three benchmark datasets demonstrate that PTE-GCN surpasses state-of-the-art methods in both short-term and long-term prediction, especially for periodic actions such as walking in long-term forecasting.
Lastly, this thesis studies the interactions between the motion trajectories of highly interactive persons in the task of multi-person extreme motion prediction. A novel cross-query attention (XQA) module is proposed to bilaterally learn the cross-dependencies between two pose sequences. Additionally, a proxy unit is introduced to bridge the involved persons; it cooperates with the XQA module and subtly controls the bidirectional information flows. These designs are integrated into a Transformer-based architecture, yielding an end-to-end framework called the proxy-bridged game Transformer (PGformer) for multi-person motion prediction. Its effectiveness is evaluated on the challenging ExPI dataset, where PGformer consistently outperforms state-of-the-art methods in both short-term and long-term prediction. PGformer also works well with the weakly interacting CMU-Mocap and MuPoTS-3D datasets and achieves encouraging results. (Illustrative sketches of MRLA, PTPDecomp and XQA follow this table.) |
Degree | Doctor of Philosophy |
Subject | Computer vision; Neural networks (Computer science) |
Dept/Program | Statistics and Actuarial Science |
Persistent Identifier | http://hdl.handle.net/10722/335946 |
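The abstract's first contribution, MRLA, is described as cross-layer attention in which the current layer queries information retrieved from previous layers. Below is a minimal, illustrative PyTorch sketch of that idea only; the module name, projections, head count and residual wiring are assumptions for illustration, not the thesis's actual implementation, and the linear-complexity lightweight variant is not shown.

```python
# Illustrative sketch of cross-layer ("layer attention") retrieval: the current
# layer's features act as queries over keys/values gathered from earlier layers.
# All names and shapes are assumptions, not the thesis's code.
import torch
import torch.nn as nn


class CrossLayerAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj_q = nn.Linear(dim, dim)
        self.proj_kv = nn.Linear(dim, dim)

    def forward(self, current: torch.Tensor, previous: list) -> torch.Tensor:
        # current:  (B, N, C) features of the current layer
        # previous: list of (B, N, C) features from earlier layers
        q = self.proj_q(current)
        # Concatenate earlier layers along the token axis so the query can
        # retrieve information from any of them (quadratic in network depth).
        mem = self.proj_kv(torch.cat(previous + [current], dim=1))
        out, _ = self.attn(q, mem, mem)
        return current + out  # residual connection keeps the backbone behaviour


if __name__ == "__main__":
    layer_attn = CrossLayerAttention(dim=64)
    earlier = [torch.randn(2, 49, 64) for _ in range(3)]  # three earlier layers
    fused = layer_attn(torch.randn(2, 49, 64), earlier)
    print(fused.shape)  # torch.Size([2, 49, 64])
```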
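The second contribution decomposes hidden pose sequences into period and trend components. As a hedged sketch, the snippet below uses a simple moving-average split (trend = smoothed signal, period = residual); whether PTPDecomp uses this exact scheme is not stated in the record, and the function name and kernel size are hypothetical.

```python
# Illustrative periodic-trend decomposition of a pose feature sequence using a
# moving average; an assumed scheme, not necessarily the thesis's PTPDecomp.
import torch
import torch.nn.functional as F


def periodic_trend_decompose(x: torch.Tensor, kernel: int = 5):
    # x: (B, T, C) hidden pose sequence over T time steps
    pad = kernel // 2
    xt = x.transpose(1, 2)  # (B, C, T) for 1-D pooling along time
    # Average-pool along time with replicate padding to estimate the trend.
    trend = F.avg_pool1d(F.pad(xt, (pad, pad), mode="replicate"), kernel, stride=1)
    trend = trend.transpose(1, 2)  # back to (B, T, C)
    period = x - trend             # residual is taken as the periodic part
    return period, trend


if __name__ == "__main__":
    seq = torch.randn(2, 50, 66)  # e.g. 22 joints x 3 coordinates
    period, trend = periodic_trend_decompose(seq)
    print(period.shape, trend.shape)  # both torch.Size([2, 50, 66])
```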
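The third contribution, XQA with a proxy unit, is described as bilateral cross-attention between two persons' pose sequences mediated by a proxy. The sketch below assumes the proxy is a learnable embedding appended to each person's keys and values; all names and shapes are illustrative and may differ from PGformer's actual design.

```python
# Illustrative bilateral cross-attention between two persons' motion features,
# with a learnable "proxy" mediating the exchange; an assumed design sketch.
import torch
import torch.nn as nn


class CrossQueryAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.a_to_b = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.b_to_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Assumed proxy: a learnable token blended into each person's keys/values.
        self.proxy = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, xa: torch.Tensor, xb: torch.Tensor):
        # xa, xb: (B, T, C) motion features of the two interacting persons
        proxy = self.proxy.expand(xa.size(0), -1, -1)
        mem_a = torch.cat([xa, proxy], dim=1)
        mem_b = torch.cat([xb, proxy], dim=1)
        ya, _ = self.a_to_b(xa, mem_b, mem_b)  # person A queries person B (+ proxy)
        yb, _ = self.b_to_a(xb, mem_a, mem_a)  # person B queries person A (+ proxy)
        return xa + ya, xb + yb


if __name__ == "__main__":
    xqa = CrossQueryAttention(dim=64)
    a, b = torch.randn(2, 30, 64), torch.randn(2, 30, 64)
    out_a, out_b = xqa(a, b)
    print(out_a.shape, out_b.shape)  # both torch.Size([2, 30, 64])
```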
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Li, G | - |
dc.contributor.author | Fang, Yanwen | - |
dc.contributor.author | 方艷雯 | - |
dc.date.accessioned | 2023-12-29T04:05:04Z | - |
dc.date.available | 2023-12-29T04:05:04Z | - |
dc.date.issued | 2023 | - |
dc.identifier.citation | Fang, Y. [方艷雯]. (2023). Strengthening cross-interaction learning for vision networks. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/335946 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Computer vision | - |
dc.subject.lcsh | Neural networks (Computer science) | - |
dc.title | Strengthening cross-interaction learning for vision networks | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Statistics and Actuarial Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2024 | - |
dc.identifier.mmsid | 991044751040203414 | - |