File Download
There are no files associated with this item.
Links for fulltext (May Require Subscription)
- Publisher Website: https://doi.org/10.1007/s11263-024-02261-x
- Scopus: eid_2-s2.0-105001485512

Citations:
- Scopus: 0
Article: Audio-Visual Segmentation with Semantics

| Title | Audio-Visual Segmentation with Semantics |
|---|---|
| Authors | Zhou, Jinxing; Shen, Xuyang; Wang, Jianyuan; Zhang, Jiayi; Sun, Weixuan; Zhang, Jing; Birchfield, Stan; Guo, Dan; Kong, Lingpeng; Wang, Meng; Zhong, Yiran |
| Keywords | Audio-visual learning; Audio-visual segmentation; AVSBench; Multi-modal segmentation; Semantic segmentation; Video segmentation |
| Issue Date | 1-Apr-2025 |
| Publisher | Springer |
| Citation | International Journal of Computer Vision, 2025, v. 133, n. 4, p. 1644-1664 |
| Abstract | We propose a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark, i.e., AVSBench, providing pixel-wise annotations for sounding objects in audible videos. It contains three subsets: AVSBench-object (Single-source subset, Multi-sources subset) and AVSBench-semantic (Semantic-labels subset). Accordingly, three settings are studied: 1) semi-supervised audio-visual segmentation with a single sound source; 2) fully-supervised audio-visual segmentation with multiple sound sources; and 3) fully-supervised audio-visual semantic segmentation. The first two settings need to generate binary masks of sounding objects indicating pixels corresponding to the audio, while the third setting further requires generating semantic maps indicating the object category. To deal with these problems, we propose a new baseline method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage audio-visual mapping during training. Quantitative and qualitative experiments on the AVSBench dataset compare our approach to several existing methods for related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code can be found at https://github.com/OpenNLPLab/AVSBench. |
| Persistent Identifier | http://hdl.handle.net/10722/362445 |
| ISSN | 0920-5691 (2023 Impact Factor: 11.6; 2023 SCImago Journal Rankings: 6.668) |

| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Zhou, Jinxing | - |
| dc.contributor.author | Shen, Xuyang | - |
| dc.contributor.author | Wang, Jianyuan | - |
| dc.contributor.author | Zhang, Jiayi | - |
| dc.contributor.author | Sun, Weixuan | - |
| dc.contributor.author | Zhang, Jing | - |
| dc.contributor.author | Birchfield, Stan | - |
| dc.contributor.author | Guo, Dan | - |
| dc.contributor.author | Kong, Lingpeng | - |
| dc.contributor.author | Wang, Meng | - |
| dc.contributor.author | Zhong, Yiran | - |
| dc.date.accessioned | 2025-09-24T00:51:36Z | - |
| dc.date.available | 2025-09-24T00:51:36Z | - |
| dc.date.issued | 2025-04-01 | - |
| dc.identifier.citation | International Journal of Computer Vision, 2025, v. 133, n. 4, p. 1644-1664 | - |
| dc.identifier.issn | 0920-5691 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/362445 | - |
| dc.description.abstract | We propose a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark, i.e., AVSBench, providing pixel-wise annotations for sounding objects in audible videos. It contains three subsets: AVSBench-object (Single-source subset, Multi-sources subset) and AVSBench-semantic (Semantic-labels subset). Accordingly, three settings are studied: 1) semi-supervised audio-visual segmentation with a single sound source; 2) fully-supervised audio-visual segmentation with multiple sound sources; and 3) fully-supervised audio-visual semantic segmentation. The first two settings need to generate binary masks of sounding objects indicating pixels corresponding to the audio, while the third setting further requires generating semantic maps indicating the object category. To deal with these problems, we propose a new baseline method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage audio-visual mapping during training. Quantitative and qualitative experiments on the AVSBench dataset compare our approach to several existing methods for related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code can be found at https://github.com/OpenNLPLab/AVSBench. | - |
| dc.language | eng | - |
| dc.publisher | Springer | - |
| dc.relation.ispartof | International Journal of Computer Vision | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject | Audio-visual learning | - |
| dc.subject | Audio-visual segmentation | - |
| dc.subject | AVSBench | - |
| dc.subject | Multi-modal segmentation | - |
| dc.subject | Semantic segmentation | - |
| dc.subject | Video segmentation | - |
| dc.title | Audio-Visual Segmentation with Semantics | - |
| dc.type | Article | - |
| dc.identifier.doi | 10.1007/s11263-024-02261-x | - |
| dc.identifier.scopus | eid_2-s2.0-105001485512 | - |
| dc.identifier.volume | 133 | - |
| dc.identifier.issue | 4 | - |
| dc.identifier.spage | 1644 | - |
| dc.identifier.epage | 1664 | - |
| dc.identifier.eissn | 1573-1405 | - |
| dc.identifier.issnl | 0920-5691 | - |
