Article: Object-centric Representation Learning for Video Scene Understanding

Title: Object-centric Representation Learning for Video Scene Understanding
Authors: Zhou, Yi; Zhang, Hui; Park, Seung In; Yoo, Byung In; Qi, Xiaojuan
Keywords: Depth estimation; Estimation; Feature extraction; Generators; IP networks; object-centric representation; Pipelines; scene understanding; Semantics; Task analysis; tracking; video panoptic segmentation
Issue Date: 15-May-2024
Publisher: Institute of Electrical and Electronics Engineers
Citation: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, p. 1-13
Abstract

Depth-aware Video Panoptic Segmentation (DVPS) is a challenging task that requires predicting the semantic class and 3D depth of each pixel in a video, while also segmenting and consistently tracking objects across frames. Predominant methodologies treat this as a multi-task learning problem, tackling each constituent task independently, thus restricting their capacity to leverage interrelationships amongst tasks and requiring parameter tuning for each task. To surmount these constraints, we present Slot-IVPS, a new approach employing an object-centric model to acquire unified object representations, thereby facilitating the model's ability to simultaneously capture semantic and depth information. Specifically, we introduce a novel representation, Integrated Panoptic Slots (IPS), to capture both semantic and depth information for all panoptic objects within a video, encompassing background semantics and foreground instances. Subsequently, we propose an integrated feature generator and enhancer to extract depth-aware features, alongside the Integrated Video Panoptic Retriever (IVPR), which iteratively retrieves spatial-temporal coherent object features and encodes them into IPS. The resulting IPS can be effortlessly decoded into an array of video outputs, including depth maps, classifications, masks, and object instance IDs. We undertake comprehensive analyses across four datasets, attaining state-of-the-art performance in both Depth-aware Video Panoptic Segmentation and Video Panoptic Segmentation tasks. Codes will be available at https://github.com/SAITPublic/.
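To make the pipeline sketched in the abstract concrete, the following PyTorch-style code is a minimal, hypothetical illustration of an object-centric model of the kind described: an iterative, slot-attention-style retriever (loosely mirroring the role the abstract assigns to the IVPR) encodes flattened spatio-temporal features into a fixed set of slots, and lightweight heads decode those slots into class logits, mask logits, and a coarse per-object depth signal. All names, shapes, and heads here (SlotIVPSSketch, mask_proj, depth_head, and so on) are assumptions for illustration only, not the authors' implementation; their code is announced for release at https://github.com/SAITPublic/.

import torch
import torch.nn as nn


class SlotIVPSSketch(nn.Module):
    """Toy slot-attention retriever with per-slot decoding heads (hypothetical)."""

    def __init__(self, dim=256, num_slots=100, num_classes=19, iters=3):
        super().__init__()
        self.iters = iters
        self.scale = dim ** -0.5
        # Learned initial slots; the refined slots returned by forward() play
        # the role of unified per-object representations.
        self.slots0 = nn.Parameter(torch.randn(1, num_slots, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.cls_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.mask_proj = nn.Linear(dim, dim)             # slot -> mask embedding
        self.depth_head = nn.Linear(dim, 1)              # coarse per-slot depth

    def forward(self, feats):
        # feats: (B, T*H*W, D) flattened spatio-temporal features.
        B, N, D = feats.shape
        slots = self.slots0.expand(B, -1, -1)
        k, v = self.to_k(feats), self.to_v(feats)
        # Iterative retrieval: slots compete for input features (softmax over
        # the slot axis), then each slot is updated recurrently.
        for _ in range(self.iters):
            q = self.to_q(slots)
            attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=1)
            attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)
            updates = attn @ v                                  # (B, S, D)
            slots = self.gru(updates.reshape(-1, D),
                             slots.reshape(-1, D)).view(B, -1, D)
        # Decode the unified slots into per-task outputs.
        logits = self.cls_head(slots)                           # (B, S, C+1)
        masks = self.mask_proj(slots) @ feats.transpose(1, 2)   # (B, S, T*H*W)
        depth = self.depth_head(slots)                          # (B, S, 1)
        return logits, masks, depth


# Example usage: two clips, four 32x32 feature maps each, 256-dim features.
model = SlotIVPSSketch()
feats = torch.randn(2, 4 * 32 * 32, 256)
logits, masks, depth = model(feats)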


Persistent Identifier: http://hdl.handle.net/10722/350740
ISSN: 0162-8828
2023 Impact Factor: 20.8
2023 SCImago Journal Rankings: 6.158

 

DC Field                  Value
dc.contributor.author     Zhou, Yi
dc.contributor.author     Zhang, Hui
dc.contributor.author     Park, Seung In
dc.contributor.author     Yoo, Byung In
dc.contributor.author     Qi, Xiaojuan
dc.date.accessioned       2024-11-02T00:36:48Z
dc.date.available         2024-11-02T00:36:48Z
dc.date.issued            2024-05-15
dc.identifier.citation    IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, p. 1-13
dc.identifier.issn        0162-8828
dc.identifier.uri         http://hdl.handle.net/10722/350740
dc.description.abstract   Depth-aware Video Panoptic Segmentation (DVPS) is a challenging task that requires predicting the semantic class and 3D depth of each pixel in a video, while also segmenting and consistently tracking objects across frames. Predominant methodologies treat this as a multi-task learning problem, tackling each constituent task independently, thus restricting their capacity to leverage interrelationships amongst tasks and requiring parameter tuning for each task. To surmount these constraints, we present Slot-IVPS, a new approach employing an object-centric model to acquire unified object representations, thereby facilitating the model's ability to simultaneously capture semantic and depth information. Specifically, we introduce a novel representation, Integrated Panoptic Slots (IPS), to capture both semantic and depth information for all panoptic objects within a video, encompassing background semantics and foreground instances. Subsequently, we propose an integrated feature generator and enhancer to extract depth-aware features, alongside the Integrated Video Panoptic Retriever (IVPR), which iteratively retrieves spatial-temporal coherent object features and encodes them into IPS. The resulting IPS can be effortlessly decoded into an array of video outputs, including depth maps, classifications, masks, and object instance IDs. We undertake comprehensive analyses across four datasets, attaining state-of-the-art performance in both Depth-aware Video Panoptic Segmentation and Video Panoptic Segmentation tasks. Codes will be available at https://github.com/SAITPublic/.
dc.language               eng
dc.publisher              Institute of Electrical and Electronics Engineers
dc.relation.ispartof      IEEE Transactions on Pattern Analysis and Machine Intelligence
dc.rights                 This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject                Depth estimation
dc.subject                Estimation
dc.subject                Feature extraction
dc.subject                Generators
dc.subject                IP networks
dc.subject                object-centric representation
dc.subject                Pipelines
dc.subject                scene understanding
dc.subject                Semantics
dc.subject                Task analysis
dc.subject                tracking
dc.subject                video panoptic segmentation
dc.title                  Object-centric Representation Learning for Video Scene Understanding
dc.type                   Article
dc.identifier.doi         10.1109/TPAMI.2024.3401409
dc.identifier.scopus      eid_2-s2.0-85193287200
dc.identifier.spage       1
dc.identifier.epage       13
dc.identifier.eissn       1939-3539
dc.identifier.issnl       0162-8828
