Deep Learning Based Array Processing for Speech Separation, Localization, and Recognition


2020, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Microphone arrays are widely deployed in modern speech communication systems. With multiple microphones, spatial information is available in addition to spectral cues to improve speech enhancement, speaker separation, and robust automatic speech recognition (ASR) in noisy-reverberant environments. Conventionally, multi-microphone beamforming followed by monaural post-filtering has been the dominant approach to multi-channel speech enhancement. This approach requires accurate estimates of the target direction and of the power spectral density and covariance matrices of speech and noise, and such estimation algorithms usually cannot achieve satisfactory accuracy in noisy and reverberant conditions. Recently, building on the development of deep neural networks (DNNs), time-frequency (T-F) masking and spectral mapping based approaches have been established as the mainstream methodology for monaural (single-channel) speech separation, including speech enhancement and speaker separation. This dissertation investigates deep learning based microphone array processing and its application to speech separation, localization, and robust ASR.

We start our work by exploring various ways of integrating speech enhancement and acoustic modeling for single-channel robust ASR. We propose a framework that jointly trains enhancement frontends, filterbanks, and backend acoustic models. We also apply sequence-discriminative training for sequence modeling and run-time unsupervised adaptation to deal with mismatches between training and testing.

One essential aspect of multi-channel processing is sound localization. We utilize deep learning based T-F masking to identify T-F units dominated by the target speaker and use only these units for speaker localization, as they contain much cleaner phase information. This approach dramatically improves the robustness of conventional cross-correlation, beamforming, and subspace based approaches to speaker localization in noisy-reverberant environments.

Building upon speaker localization, we next tightly integrate complementary spectral and spatial cues for deep learning based multi-channel speaker separation in reverberant environments. The key idea is to localize individual speakers and use the localization results to design spatial features that indicate whether each T-F unit is dominated by speech arriving from the estimated speaker direction. These spatial features are combined with spectral features in an enhancement network that extracts the speaker from the estimated direction while exploiting learned spectral structure. Strong separation performance has been observed on reverberant talker-independent speaker separation tasks.

Before addressing multi-channel speech enhancement, we explore various magnitude based phase reconstruction algorithms for monaural speaker separation. We also study complex spectral mapping based phase estimation, where we directly predict the real and imaginary components of target speech. We find that deep learning based magnitude estimates clearly benefit phase reconstruction, and that complex spectral mapping leads to better phase estimation. We then apply complex spectral mapping to multi-channel speech dereverberation and enhancement, where phase estimation is used to improve sound localization, time-invariant and time-varying beamforming, and post-filtering. State-of-the-art performance has been obtained on the enhancement and recognition tasks of the REVERB corpus and the CHiME-4 dataset.
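
The mask-informed localization idea described above can be illustrated with a short sketch: an estimated target-speech T-F mask weights the GCC-PHAT cross-power spectrum so that only target-dominated units contribute to the delay estimate. The function name, the SciPy-based STFT, and all parameter values below are illustrative assumptions rather than the dissertation's implementation; in practice the mask would come from a trained DNN, not the all-ones placeholder used in the demo.

    import numpy as np
    from scipy.signal import stft

    def masked_gcc_phat(x1, x2, mask, fs=16000, n_fft=512, hop=128):
        """Estimate the time difference of arrival (in seconds) between two
        channels with GCC-PHAT, weighting each T-F unit by an estimated
        target-speech mask of shape (n_fft // 2 + 1, n_frames)."""
        _, _, X1 = stft(x1, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
        _, _, X2 = stft(x2, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
        # Cross-power spectrum with phase-transform (PHAT) normalization.
        cps = X1 * np.conj(X2)
        cps /= np.abs(cps) + 1e-8
        # Emphasize target-dominated T-F units (mask near 1), then pool the
        # weighted cross-spectra over frames.
        pooled = (mask * cps).sum(axis=1)
        # Back to the lag domain; the peak location is the delay in samples.
        cc = np.fft.irfft(pooled, n=n_fft)
        cc = np.concatenate((cc[-(n_fft // 2):], cc[:n_fft // 2 + 1]))
        lags = np.arange(-(n_fft // 2), n_fft // 2 + 1)
        return lags[np.argmax(cc)] / fs

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        fs, delay = 16000, 12
        s = rng.standard_normal(fs)
        x1, x2 = s, np.roll(s, delay)      # x2 lags x1 by 12 samples
        # An all-ones "mask" reduces to plain GCC-PHAT; a DNN-estimated mask
        # would instead down-weight noise- and reverberation-dominated units.
        _, _, X = stft(x1, fs=fs, nperseg=512, noverlap=384)
        tau = masked_gcc_phat(x1, x2, np.ones(X.shape), fs=fs)
        print(round(tau * fs))             # -12: the 12-sample shift is
                                           # recovered (negative, x1 leads x2)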
Finally, for fixed-geometry arrays, we propose multi-microphone complex spectral mapping for speech dereverberation, where DNNs are used for time-varying non-linear beamforming. We find that concatenating multiple microphone signals for complex spectral mapping is a simple and effective way of integrating spectral and spatial information for fixed-geometry arrays.
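
The final point, multi-microphone complex spectral mapping, can likewise be sketched: the real and imaginary STFT components of all microphones are concatenated along the input channel axis, and a DNN predicts the real and imaginary components of the target speech at a reference microphone. The module name, layer sizes, and the small convolutional stack below are assumptions for illustration only and do not reproduce the dissertation's actual network architectures.

    import torch
    import torch.nn as nn

    class MultiMicComplexSpectralMapper(nn.Module):
        """Toy multi-microphone complex spectral mapping network.
        Input:  complex STFTs of shape (batch, num_mics, frames, freq_bins).
        Output: complex STFT estimate of the target speech at the reference
                microphone, shape (batch, frames, freq_bins)."""

        def __init__(self, num_mics: int, hidden: int = 64):
            super().__init__()
            # Real and imaginary parts of every microphone are concatenated,
            # giving 2 * num_mics input feature maps.
            self.net = nn.Sequential(
                nn.Conv2d(2 * num_mics, hidden, kernel_size=3, padding=1),
                nn.ELU(),
                nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
                nn.ELU(),
                nn.Conv2d(hidden, 2, kernel_size=3, padding=1),
            )

        def forward(self, mixture_stfts: torch.Tensor) -> torch.Tensor:
            x = torch.cat([mixture_stfts.real, mixture_stfts.imag], dim=1)
            real, imag = self.net(x).chunk(2, dim=1)
            return torch.complex(real, imag).squeeze(1)

    if __name__ == "__main__":
        model = MultiMicComplexSpectralMapper(num_mics=4)
        mix = torch.randn(1, 4, 100, 257, dtype=torch.complex64)
        print(model(mix).shape)   # torch.Size([1, 100, 257])

Training such a model would typically minimize a distance between the predicted and clean real and imaginary components (possibly together with magnitudes); that detail is outside this sketch.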
DeLiang Wang (Advisor)
Eric Fosler-Lussier (Committee Member)
Mikhail Belkin (Committee Member)
Robert Agunga (Other)
211 p.

Recommended Citations


  • Wang, Z.-Q. (2020). Deep Learning Based Array Processing for Speech Separation, Localization, and Recognition [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1587640748354459

    APA Style (7th edition)

  • Wang, Zhong-Qiu. Deep Learning Based Array Processing for Speech Separation, Localization, and Recognition. 2020. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1587640748354459.

    MLA Style (8th edition)

  • Wang, Zhong-Qiu. "Deep Learning Based Array Processing for Speech Separation, Localization, and Recognition." Doctoral dissertation, Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1587640748354459

    Chicago Manual of Style (17th edition)