Conference Paper: Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning
Field | Value |
---|---|
Title | Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning |
Authors | Chen, Z; Mao, J; Wu, J; Wong, KKY; Tenenbaum, JB; Gan, C |
Keywords | Concept Learning; Neuro-Symbolic Learning; Video Reasoning; Visual Reasoning |
Issue Date | 2021 |
Citation | The 9th International Conference on Learning Representations (ICLR 2021), Virtual Event, Austria, 3-7 May 2021 |
Abstract | We study the problem of dynamic visual reasoning on raw videos. This is a challenging problem; currently, state-of-the-art models often require dense supervision on physical object properties and events from simulation, which is impractical to obtain in real life. In this paper, we present the Dynamic Concept Learner (DCL), a unified framework that grounds physical objects and events from video and language. DCL first adopts a trajectory extractor to track each object over time and to represent it as a latent, object-centric feature vector. Building upon this object-centric representation, DCL learns to approximate the dynamic interactions among objects using graph networks. DCL further incorporates a semantic parser to parse questions into semantic programs and, finally, a program executor to run the programs and answer the questions, leveraging the learned dynamics model. After training, DCL can detect and associate objects across frames, ground visual properties and physical events, understand the causal relationships between events, make future and counterfactual predictions, and leverage these extracted representations to answer queries. DCL achieves state-of-the-art performance on CLEVRER, a challenging causal video reasoning dataset, even without using ground-truth attributes and collision labels from simulation for training. We further test DCL on a newly proposed video-retrieval and event-localization dataset derived from CLEVRER, demonstrating its strong generalization capacity. (A hypothetical code sketch of this pipeline follows the table.) |
Description | Poster Presentation |
Persistent Identifier | http://hdl.handle.net/10722/301145 |
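The abstract above describes a four-stage pipeline: trajectory extraction, dynamics prediction, question parsing, and program execution. The following minimal Python sketch mirrors those stages for orientation only; every class, function, signature, and program token below is a hypothetical placeholder assumed for illustration, not the authors' actual implementation or released API.

```python
# Minimal sketch of the DCL pipeline as described in the abstract.
# ASSUMPTION: all names and signatures below are hypothetical placeholders;
# they do not correspond to the authors' code.
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class ObjectTrack:
    """One object's trajectory: a latent, object-centric feature per frame."""
    object_id: int
    features: List[List[float]] = field(default_factory=list)

def extract_trajectories(video_frames: List[Any]) -> List[ObjectTrack]:
    """Stage 1: track each object over time and embed it as latent features."""
    raise NotImplementedError  # placeholder for a detector + tracker

def learn_dynamics(tracks: List[ObjectTrack]) -> Any:
    """Stage 2: graph network approximating object interactions, used for
    future and counterfactual rollouts."""
    raise NotImplementedError  # placeholder for a graph neural network

def parse_question(question: str) -> List[str]:
    """Stage 3: semantic parser mapping a question to a program, e.g.
    ['filter_moving', 'filter_collision', 'query_cause'] (illustrative)."""
    raise NotImplementedError  # placeholder for a sequence-to-sequence parser

def execute_program(program: List[str], tracks: List[ObjectTrack],
                    dynamics: Any) -> str:
    """Stage 4: run the program over the grounded objects and events,
    leveraging the learned dynamics model, and return an answer."""
    raise NotImplementedError  # placeholder for a neuro-symbolic executor

def answer_question(video_frames: List[Any], question: str) -> str:
    """End-to-end inference: raw video plus a question yields an answer."""
    tracks = extract_trajectories(video_frames)
    dynamics = learn_dynamics(tracks)
    program = parse_question(question)
    return execute_program(program, tracks, dynamics)
```

The point the abstract emphasizes is that this pipeline is trained without ground-truth attribute or collision labels from simulation; the stages are supervised only through the question-answering objective.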
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Chen, Z | - |
dc.contributor.author | Mao, J | - |
dc.contributor.author | Wu, J | - |
dc.contributor.author | Wong, KKY | - |
dc.contributor.author | Tenenbaum, JB | - |
dc.contributor.author | Gan, C | - |
dc.date.accessioned | 2021-07-27T08:06:47Z | - |
dc.date.available | 2021-07-27T08:06:47Z | - |
dc.date.issued | 2021 | - |
dc.identifier.citation | The 9th International Conference on Learning Representations (ICLR 2021), Virtual Event, Austria, 3-7 May 2021 | - |
dc.identifier.uri | http://hdl.handle.net/10722/301145 | - |
dc.description | Poster Presentation | - |
dc.description.abstract | We study the problem of dynamic visual reasoning on raw videos. This is a challenging problem; currently, state-of-the-art models often require dense supervision on physical object properties and events from simulation, which is impractical to obtain in real life. In this paper, we present the Dynamic Concept Learner (DCL), a unified framework that grounds physical objects and events from video and language. DCL first adopts a trajectory extractor to track each object over time and to represent it as a latent, object-centric feature vector. Building upon this object-centric representation, DCL learns to approximate the dynamic interactions among objects using graph networks. DCL further incorporates a semantic parser to parse questions into semantic programs and, finally, a program executor to run the programs and answer the questions, leveraging the learned dynamics model. After training, DCL can detect and associate objects across frames, ground visual properties and physical events, understand the causal relationships between events, make future and counterfactual predictions, and leverage these extracted representations to answer queries. DCL achieves state-of-the-art performance on CLEVRER, a challenging causal video reasoning dataset, even without using ground-truth attributes and collision labels from simulation for training. We further test DCL on a newly proposed video-retrieval and event-localization dataset derived from CLEVRER, demonstrating its strong generalization capacity. | -
dc.language | eng | - |
dc.relation.ispartof | International Conference on Learning Representations (ICLR) 2021 | - |
dc.subject | Concept Learning | - |
dc.subject | Neuro-Symbolic Learning | - |
dc.subject | Video Reasoning | - |
dc.subject | Visual Reasoning | - |
dc.title | Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning | - |
dc.type | Conference_Paper | - |
dc.identifier.email | Wong, KKY: kykwong@cs.hku.hk | - |
dc.identifier.authority | Wong, KKY=rp01393 | - |
dc.identifier.hkuros | 323466 | - |
dc.publisher.place | Vienna, Austria | - |