Conference Paper: Visual event recognition in videos by learning from web data

Title: Visual event recognition in videos by learning from web data
Authors: Duan, Lixin; Xu, Dong; Tsang, Ivor W.; Luo, Jiebo
Issue Date: 2010
Citation: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, p. 1959-1966
Abstract: We propose a visual event recognition framework for consumer domain videos by leveraging a large amount of loosely labeled web videos (e.g., from YouTube). First, we propose a new aligned space-time pyramid matching method to measure the distances between two video clips, where each video clip is divided into space-time volumes over multiple levels. We calculate the pair-wise distances between any two volumes and further integrate the information from different volumes with Integer-flow Earth Mover's Distance (EMD) to explicitly align the volumes. Second, we propose a new cross-domain learning method in order to 1) fuse the information from multiple pyramid levels and features (i.e., space-time feature and static SIFT feature) and 2) cope with the considerable variation in feature distributions between videos from two domains (i.e., web domain and consumer domain). For each pyramid level and each type of local features, we train a set of SVM classifiers based on the combined training set from two domains using multiple base kernels of different kernel types and parameters, which are fused with equal weights to obtain an average classifier. Finally, we propose a cross-domain learning method, referred to as Adaptive Multiple Kernel Learning (A-MKL), to learn an adapted classifier based on multiple base kernels and the prelearned average classifiers by minimizing both the structural risk functional and the mismatch between data distributions from two domains. Extensive experiments demonstrate the effectiveness of our proposed framework that requires only a small number of labeled consumer videos by leveraging web data. ©2010 IEEE.
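The aligned space-time pyramid matching described in the abstract integrates pair-wise volume distances via an Earth Mover's Distance. As a rough illustration only (not the authors' implementation), the alignment at one pyramid level can be posed as a small transportation problem; the distance matrix `D`, the uniform volume weights, and the helper name `emd_align` are assumptions for this sketch:

```python
import numpy as np
from scipy.optimize import linprog


def emd_align(D):
    """Solve the transportation problem between two sets of space-time
    volumes with uniform weights.

    D is the pair-wise distance matrix between the volumes of clip A
    (rows) and clip B (columns). Returns the EMD value and the flow
    matrix, whose large entries indicate which volumes are aligned.
    """
    r, c = D.shape
    # Uniform weights: each volume carries an equal share of its clip.
    a = np.full(r, 1.0 / r)
    b = np.full(c, 1.0 / c)

    # Decision variables are the flattened flows f_ij; the objective is
    # sum_ij f_ij * D_ij subject to row/column mass constraints.
    A_eq, b_eq = [], []
    for i in range(r):  # total flow leaving volume i of clip A
        row = np.zeros(r * c)
        row[i * c:(i + 1) * c] = 1.0
        A_eq.append(row)
        b_eq.append(a[i])
    for j in range(c):  # total flow arriving at volume j of clip B
        col = np.zeros(r * c)
        col[j::c] = 1.0
        A_eq.append(col)
        b_eq.append(b[j])

    res = linprog(D.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(r, c)
```

With equal numbers of volumes and uniform weights, an optimal flow can be chosen to concentrate on a one-to-one matching, which gives the explicit volume alignment the abstract refers to; the paper's integer-flow formulation enforces this directly.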
Persistent Identifier: http://hdl.handle.net/10722/321229
ISSN: 1063-6919
2020 SCImago Journal Rankings: 4.658
ISI Accession Number ID: WOS:000287417502002

 

DC Field: Value

dc.contributor.author: Duan, Lixin
dc.contributor.author: Xu, Dong
dc.contributor.author: Tsang, Ivor W.
dc.contributor.author: Luo, Jiebo
dc.date.accessioned: 2022-11-03T02:17:31Z
dc.date.available: 2022-11-03T02:17:31Z
dc.date.issued: 2010
dc.identifier.citation: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, p. 1959-1966
dc.identifier.issn: 1063-6919
dc.identifier.uri: http://hdl.handle.net/10722/321229
dc.language: eng
dc.relation.ispartof: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
dc.title: Visual event recognition in videos by learning from web data
dc.type: Conference_Paper
dc.description.nature: link_to_subscribed_fulltext
dc.identifier.doi: 10.1109/CVPR.2010.5539870
dc.identifier.scopus: eid_2-s2.0-77956003629
dc.identifier.spage: 1959
dc.identifier.epage: 1966
dc.identifier.isi: WOS:000287417502002
