Conference Paper: Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos
Field | Value |
---|---|
Title | Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos |
Authors | Wen, Yilin; Pan, Hao; Yang, Lei; Pan, Jia; Komura, Taku; Wang, Wenping |
Issue Date | 18-Jun-2023 |
Abstract | Understanding dynamic hand motions and actions from egocentric RGB videos is a fundamental yet challenging task due to self-occlusion and ambiguity. To address occlusion and ambiguity, we develop a transformer-based framework to exploit temporal information for robust estimation. Noticing the different temporal granularity of and the semantic correlation between hand pose estimation and action recognition, we build a network hierarchy with two cascaded transformer encoders, where the first one exploits the short-term temporal cue for hand pose estimation, and the latter aggregates per-frame pose and object information over a longer time span to recognize the action. Our approach achieves competitive results on two first-person hand action benchmarks, namely FPHA and H2O. Extensive ablation studies verify our design choices. |
Persistent Identifier | http://hdl.handle.net/10722/333846 |
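
The abstract above describes two cascaded transformer encoders: a short-window encoder that produces per-frame 3D hand pose, followed by a long-span encoder that aggregates per-frame pose and object information into an action label. The PyTorch sketch below is only a rough illustration of that cascade; the module names, dimensions, window length, and pooling choices are assumptions, the object cue is omitted for brevity, and this is not the authors' released implementation.

```python
# Illustrative sketch of a two-level temporal hierarchy: a short-window
# transformer for per-frame 3D hand pose, cascaded with a clip-level
# transformer over per-frame pose tokens for action recognition.
# All dimensions and layer counts are assumed, not taken from the paper.
import torch
import torch.nn as nn


class HierarchicalTemporalSketch(nn.Module):
    def __init__(self, feat_dim=256, num_joints=21, num_actions=45,
                 short_window=16, num_heads=8):
        super().__init__()
        self.short_window = short_window
        # Block 1: short-term temporal encoder -> per-frame 3D hand pose.
        pose_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.pose_encoder = nn.TransformerEncoder(pose_layer, num_layers=2)
        self.pose_head = nn.Linear(feat_dim, num_joints * 3)
        # Block 2: long-span encoder over per-frame pose tokens -> action.
        self.token_proj = nn.Linear(num_joints * 3, feat_dim)
        act_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.action_encoder = nn.TransformerEncoder(act_layer, num_layers=2)
        self.action_head = nn.Linear(feat_dim, num_actions)

    def forward(self, frame_feats):
        # frame_feats: (B, T, feat_dim) per-frame features from an image backbone.
        B, T, C = frame_feats.shape
        # Encode short non-overlapping windows to exploit local temporal cues.
        w = self.short_window
        windows = frame_feats.reshape(B * (T // w), w, C)
        encoded = self.pose_encoder(windows).reshape(B, T, C)
        poses = self.pose_head(encoded)              # (B, T, num_joints * 3)
        # Aggregate per-frame pose tokens over the whole clip for the action.
        tokens = self.token_proj(poses)              # (B, T, feat_dim)
        clip = self.action_encoder(tokens).mean(dim=1)
        return poses, self.action_head(clip)         # per-frame pose, action logits


# Example: a 2-clip batch of 64 frames with 256-D per-frame features.
poses, action_logits = HierarchicalTemporalSketch()(torch.randn(2, 64, 256))
```

The key design point mirrored here is the different temporal granularity: pose is estimated from short windows, while the action is recognized from the full clip of pose-derived tokens.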
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Wen, Yilin | - |
dc.contributor.author | Pan, Hao | - |
dc.contributor.author | Yang, Lei | - |
dc.contributor.author | Pan, Jia | - |
dc.contributor.author | Komura, Taku | - |
dc.contributor.author | Wang, Wenping | - |
dc.date.accessioned | 2023-10-06T08:39:34Z | - |
dc.date.available | 2023-10-06T08:39:34Z | - |
dc.date.issued | 2023-06-18 | - |
dc.identifier.uri | http://hdl.handle.net/10722/333846 | - |
dc.description.abstract | Understanding dynamic hand motions and actions from egocentric RGB videos is a fundamental yet challenging task due to self-occlusion and ambiguity. To address occlusion and ambiguity, we develop a transformer-based framework to exploit temporal information for robust estimation. Noticing the different temporal granularity of and the semantic correlation between hand pose estimation and action recognition, we build a network hierarchy with two cascaded transformer encoders, where the first one exploits the short-term temporal cue for hand pose estimation, and the latter aggregates per-frame pose and object information over a longer time span to recognize the action. Our approach achieves competitive results on two first-person hand action benchmarks, namely FPHA and H2O. Extensive ablation studies verify our design choices. | - |
dc.language | eng | - |
dc.relation.ispartof | The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) (18/06/2023-22/06/2023, Vancouver) | - |
dc.title | Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos | - |
dc.type | Conference_Paper | - |
dc.identifier.doi | 10.1109/CVPR52729.2023.02035 | - |