Links for fulltext (may require subscription):
- Publisher Website: 10.1109/TCSVT.2013.2244794
- Scopus: eid_2-s2.0-84881455497
- WOS: WOS:000322671600013
Article: Action recognition using multilevel features and latent structural SVM
Title | Action recognition using multilevel features and latent structural SVM |
---|---|
Authors | Wu, Xinxiao; Xu, Dong; Duan, Lixin; Luo, Jiebo; Jia, Yunde |
Keywords | Action recognition; action-scene interaction; latent structural SVM; multilevel features |
Issue Date | 2013 |
Citation | IEEE Transactions on Circuits and Systems for Video Technology, 2013, v. 23, n. 8, p. 1422-1431 |
Abstract | We first propose a new low-level visual feature, called spatio-temporal context distribution feature of interest points, to describe human actions. Each action video is expressed as a set of relative XYT coordinates between pairwise interest points in a local region. We learn a global Gaussian mixture model (GMM) (referred to as a universal background model) using the relative coordinate features from all the training videos, and then we represent each video as the normalized parameters of a video-specific GMM adapted from the global GMM. In order to capture the spatio-temporal relationships at different levels, multiple GMMs are utilized to describe the context distributions of interest points over multiscale local regions. Motivated by the observation that some actions share similar motion patterns, we additionally propose a novel mid-level class correlation feature to capture the semantic correlations between different action classes. Each input action video is represented by a set of decision values obtained from the pre-learned classifiers of all the action classes, with each decision value measuring the likelihood that the input video belongs to the corresponding action class. Moreover, human actions are often associated with some specific natural environments and also exhibit high correlation with particular scene classes. It is therefore beneficial to utilize the contextual scene information for action recognition. In this paper, we build the high-level co-occurrence relationship between action classes and scene classes to discover the mutual contextual constraints between action and scene. By treating the scene class label as a latent variable, we propose to use the latent structural SVM (LSSVM) model to jointly capture the compatibility between multilevel action features (e.g., low-level visual context distribution feature and the corresponding mid-level class correlation feature) and action classes, the compatibility between multilevel scene features (i.e., SIFT feature and the corresponding class correlation feature) and scene classes, and the contextual relationship between action classes and scene classes. Extensive experiments on UCF Sports, YouTube and UCF50 datasets demonstrate the effectiveness of the proposed multilevel features and action-scene interaction based LSSVM model for human action recognition. Moreover, our method generally achieves higher recognition accuracy than other state-of-the-art methods on these datasets. © 1991-2012 IEEE. |
Persistent Identifier | http://hdl.handle.net/10722/321522 |
ISSN | 1051-8215 (2023 Impact Factor: 8.3; 2023 SCImago Journal Rankings: 2.299) |
ISI Accession Number ID | WOS:000322671600013 |
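
The abstract sketches two concrete computations: a low-level video feature built by adapting a global Gaussian mixture model (universal background model) of relative XYT interest-point coordinates to each video, and a latent structural SVM that jointly scores (action, scene) pairs while treating the scene label as latent. The Python snippets below are minimal sketches for orientation only, not the authors' implementation; the variable names (`rel_xyt_all`, `w_action`, `w_scene`, `w_pair`), the relevance factor `r`, the number of mixture components, and the normalization scheme are all assumptions following standard GMM-supervector practice.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Sketch of the low-level spatio-temporal context distribution feature:
# fit a global GMM (UBM) on relative XYT coordinates pooled from all training
# videos, then MAP-adapt its means to one video and stack the normalized means.

def fit_ubm(rel_xyt_all, n_components=64):
    """rel_xyt_all: (N, 3) relative XYT coordinates from all training videos."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag",
                           random_state=0).fit(rel_xyt_all)

def video_feature(rel_xyt_video, ubm, r=16.0):
    """Normalized MAP-adapted means of a video-specific GMM (supervector style)."""
    post = ubm.predict_proba(rel_xyt_video)             # (n, K) responsibilities
    n_k = post.sum(axis=0)                               # soft count per component
    e_k = (post.T @ rel_xyt_video) / np.maximum(n_k, 1e-8)[:, None]
    alpha = (n_k / (n_k + r))[:, None]                    # MAP adaptation weights
    adapted = alpha * e_k + (1.0 - alpha) * ubm.means_    # video-specific means
    norm = np.sqrt(ubm.weights_)[:, None] / np.sqrt(ubm.covariances_)
    return (norm * adapted).ravel()                       # video-level feature vector
```

For the joint model, the prediction rule described in the abstract can be read as maximizing a sum of three compatibility terms over every candidate action class and latent scene class:

```python
def predict_action(x_action, x_scene, w_action, w_scene, w_pair):
    """
    x_action: multilevel action feature of the video (context distribution plus
    mid-level class correlation); x_scene: multilevel scene feature (SIFT plus
    scene class correlation). w_action / w_scene map each class to a learned
    weight vector, and w_pair[(a, s)] is the learned action-scene co-occurrence
    compatibility. All names here are hypothetical, for illustration only.
    """
    best_a, best_s, best_score = None, None, -np.inf
    for a in w_action:
        for s in w_scene:                    # scene label is latent: max it out
            score = (w_action[a] @ x_action  # action-feature compatibility
                     + w_scene[s] @ x_scene  # scene-feature compatibility
                     + w_pair[(a, s)])       # action-scene contextual term
            if score > best_score:
                best_a, best_s, best_score = a, s, score
    return best_a, best_s                    # predicted action, inferred scene
```
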
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Wu, Xinxiao | - |
dc.contributor.author | Xu, Dong | - |
dc.contributor.author | Duan, Lixin | - |
dc.contributor.author | Luo, Jiebo | - |
dc.contributor.author | Jia, Yunde | - |
dc.date.accessioned | 2022-11-03T02:19:30Z | - |
dc.date.available | 2022-11-03T02:19:30Z | - |
dc.date.issued | 2013 | - |
dc.identifier.citation | IEEE Transactions on Circuits and Systems for Video Technology, 2013, v. 23, n. 8, p. 1422-1431 | - |
dc.identifier.issn | 1051-8215 | - |
dc.identifier.uri | http://hdl.handle.net/10722/321522 | - |
dc.description.abstract | We first propose a new low-level visual feature, called spatio-temporal context distribution feature of interest points, to describe human actions. Each action video is expressed as a set of relative XYT coordinates between pairwise interest points in a local region. We learn a global Gaussian mixture model (GMM) (referred to as a universal background model) using the relative coordinate features from all the training videos, and then we represent each video as the normalized parameters of a video-specific GMM adapted from the global GMM. In order to capture the spatio-temporal relationships at different levels, multiple GMMs are utilized to describe the context distributions of interest points over multiscale local regions. Motivated by the observation that some actions share similar motion patterns, we additionally propose a novel mid-level class correlation feature to capture the semantic correlations between different action classes. Each input action video is represented by a set of decision values obtained from the pre-learned classifiers of all the action classes, with each decision value measuring the likelihood that the input video belongs to the corresponding action class. Moreover, human actions are often associated with some specific natural environments and also exhibit high correlation with particular scene classes. It is therefore beneficial to utilize the contextual scene information for action recognition. In this paper, we build the high-level co-occurrence relationship between action classes and scene classes to discover the mutual contextual constraints between action and scene. By treating the scene class label as a latent variable, we propose to use the latent structural SVM (LSSVM) model to jointly capture the compatibility between multilevel action features (e.g., low-level visual context distribution feature and the corresponding mid-level class correlation feature) and action classes, the compatibility between multilevel scene features (i.e., SIFT feature and the corresponding class correlation feature) and scene classes, and the contextual relationship between action classes and scene classes. Extensive experiments on UCF Sports, YouTube and UCF50 datasets demonstrate the effectiveness of the proposed multilevel features and action-scene interaction based LSSVM model for human action recognition. Moreover, our method generally achieves higher recognition accuracy than other state-of-the-art methods on these datasets. © 1991-2012 IEEE. | - |
dc.language | eng | - |
dc.relation.ispartof | IEEE Transactions on Circuits and Systems for Video Technology | - |
dc.subject | Action recognition | - |
dc.subject | action-scene interaction | - |
dc.subject | latent structural SVM | - |
dc.subject | multilevel features | - |
dc.title | Action recognition using multilevel features and latent structural SVM | - |
dc.type | Article | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1109/TCSVT.2013.2244794 | - |
dc.identifier.scopus | eid_2-s2.0-84881455497 | - |
dc.identifier.volume | 23 | - |
dc.identifier.issue | 8 | - |
dc.identifier.spage | 1422 | - |
dc.identifier.epage | 1431 | - |
dc.identifier.isi | WOS:000322671600013 | - |