3D Inception-ResUNet: Deep spatial feature learning for multichannel singing voice separation with distortion
DaDong Wang, Jie Wang and MingChen Sun
PLOS ONE, 2024, vol. 19, issue 1, 1-17
Abstract:
Singing voice separation on robots faces the problem of interpreting ambiguous auditory signals. The acoustic signal that the humanoid robot perceives through its onboard microphones is a mixture of singing voice, music, and noise, subject to distortion, attenuation, and reverberation. In this paper, we used the 3D Inception-ResUNet structure in the U-shaped encoding and decoding network to improve the utilization of the spatial and spectral information of the spectrogram. Multiple objectives were used to train the model: magnitude consistency loss, phase consistency loss, and magnitude correlation consistency loss. We recorded the singing voice and accompaniment derived from the MIR-1K dataset with NAO robots and synthesized the 10-channel dataset for training the model. The experimental results show that the proposed model trained with multiple objectives reaches an average NSDR of 11.55 dB on the test dataset, outperforming the comparison model.
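The abstract describes training with a sum of three objectives. The sketch below shows one plausible way such a combined loss could be formed; the exact term definitions and weights here are illustrative assumptions, not the paper's formulations.

```python
import numpy as np

def multiobjective_loss(est_mag, ref_mag, est_phase, ref_phase,
                        w_mag=1.0, w_phase=0.5, w_corr=0.5):
    """Hedged sketch of a multi-objective separation loss.

    The three terms mirror the abstract's magnitude consistency,
    phase consistency, and magnitude correlation consistency;
    the specific formulas and weights are assumptions.
    """
    # Magnitude consistency: L1 distance between spectrogram magnitudes.
    l_mag = np.mean(np.abs(est_mag - ref_mag))
    # Phase consistency: cosine distance on the phase difference,
    # which is zero when estimated and reference phases agree.
    l_phase = np.mean(1.0 - np.cos(est_phase - ref_phase))
    # Magnitude correlation consistency: 1 - Pearson correlation
    # between the flattened magnitude spectrograms.
    e, r = est_mag.ravel(), ref_mag.ravel()
    l_corr = 1.0 - np.corrcoef(e, r)[0, 1]
    return w_mag * l_mag + w_phase * l_phase + w_corr * l_corr
```

When the estimate matches the reference exactly, all three terms vanish and the loss is zero; mismatches in magnitude, phase, or correlation each raise it independently, which is the usual motivation for combining complementary objectives.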
Date: 2024
Downloads:
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0289453 (text/html)
https://journals.plos.org/plosone/article/file?id= ... 89453&type=printable (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:plo:pone00:0289453
DOI: 10.1371/journal.pone.0289453
More articles in PLOS ONE from Public Library of Science