Convolutional and recurrent neural networks for real-time speech separation in the complex domain

Abstract Details

2021, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Speech signals are usually distorted by acoustic interference in daily listening environments. Such distortions severely degrade speech intelligibility and quality for human listeners, and make many speech-related tasks, such as automatic speech recognition and speaker identification, very difficult. The use of deep learning has led to tremendous advances in speech enhancement over the last decade. Developing deep learning based real-time speech enhancement systems has become increasingly important, as many modern smart devices require real-time processing. The objective of this dissertation is to develop real-time speech enhancement algorithms that improve the intelligibility and quality of noisy speech.

Our study starts by developing a strong convolutional neural network (CNN) for monaural speech enhancement. The key idea is to systematically aggregate temporal contexts through dilated convolutions, which significantly expand receptive fields (see the first sketch below the abstract). Our experimental results suggest that the proposed model consistently outperforms a feedforward deep neural network (DNN), a unidirectional long short-term memory (LSTM) model, and a bidirectional LSTM model in terms of objective speech intelligibility and quality metrics.

Although significant progress has been made in deep learning based speech enhancement, most existing studies exploit only magnitude-domain information and enhance only the magnitude spectra. We propose to perform complex spectral mapping with a gated convolutional recurrent network (GCRN). Such an approach simultaneously enhances the magnitude and phase of speech (see the second sketch below). Evaluation results show that the proposed GCRN substantially outperforms an existing CNN for complex spectral mapping. Moreover, the proposed approach yields significantly better results than magnitude spectral mapping and complex ratio masking.

Achieving strong enhancement performance typically requires a large DNN, which makes it difficult to deploy such speech enhancement systems on devices with limited hardware resources or in applications with strict latency requirements. We propose two compression pipelines to reduce the model size for DNN-based speech enhancement, systematically investigate the techniques they comprise, and evaluate the resulting pipelines (see the pruning sketch below). Experimental results demonstrate that our approach reduces the sizes of four different models by large margins without significantly sacrificing their enhancement performance.

An important application of real-time speech enhancement lies in mobile speech communication. We propose a deep learning based real-time enhancement algorithm for dual-microphone mobile phones. The proposed algorithm employs a new densely-connected convolutional recurrent network to perform dual-channel complex spectral mapping. By compressing the model with a structured pruning technique, we derive an efficient system amenable to real-time processing. Experimental results suggest that the proposed algorithm consistently outperforms an earlier algorithm for dual-channel speech enhancement in mobile phone communication, as well as a deep learning based beamformer.

Multi-channel complex spectral mapping (CSM) has proven effective in speech separation, assuming a fixed microphone-array geometry. We comprehensively investigate this approach and find that multi-channel CSM achieves separation performance better than or comparable to conventional and masking-based beamforming across different array geometries and speech separation tasks. Our investigation demonstrates that this all-neural approach is a general and effective spatial filter for multi-channel speech separation.
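
To make the dilated-convolution idea concrete, here is a minimal PyTorch sketch of causal dilated 1-D convolutions aggregating temporal context over spectrogram frames. The layer sizes, block structure, and names are illustrative assumptions, not the dissertation's exact architecture.

```python
# A minimal sketch (not the dissertation's exact model) of stacked causal
# dilated 1-D convolutions for frame-level speech enhancement.
import torch
import torch.nn as nn

class DilatedContextBlock(nn.Module):
    """Causal dilated conv over time; doubling the dilation per layer makes
    the receptive field grow exponentially with depth."""
    def __init__(self, channels: int, dilation: int, kernel_size: int = 3):
        super().__init__()
        # Left-pad only, so no future frames leak in (real-time / causal).
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, channels, frames)
        y = nn.functional.pad(x, (self.pad, 0))
        return self.act(self.conv(y)) + x  # residual connection

class DilatedCNN(nn.Module):
    def __init__(self, feat_dim: int = 161, channels: int = 64, n_layers: int = 6):
        super().__init__()
        self.inp = nn.Conv1d(feat_dim, channels, 1)
        self.blocks = nn.Sequential(
            *[DilatedContextBlock(channels, dilation=2 ** i) for i in range(n_layers)]
        )
        self.out = nn.Conv1d(channels, feat_dim, 1)  # frame-wise estimate

    def forward(self, mag: torch.Tensor) -> torch.Tensor:
        return self.out(self.blocks(self.inp(mag)))

# With kernel 3 and dilations 1, 2, ..., 32, the causal receptive field
# covers 1 + 2*(1+2+4+8+16+32) = 127 frames.
mag = torch.randn(1, 161, 100)   # (batch, frequency bins, frames)
print(DilatedCNN()(mag).shape)   # torch.Size([1, 161, 100])
```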
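
The second sketch shows the input-output convention of complex spectral mapping: the network takes the real and imaginary STFT components of noisy speech and predicts those of clean speech, so magnitude and phase are enhanced jointly. The tiny per-frame network below is a stand-in for the GCRN, whose architecture is not reproduced here; the FFT size and hop length are assumptions.

```python
# A minimal sketch of complex spectral mapping: predict clean real/imaginary
# spectra from noisy ones, enhancing magnitude and phase together.
import torch
import torch.nn as nn

N_FFT, HOP = 320, 160
F = N_FFT // 2 + 1  # 161 frequency bins

net = nn.Sequential(  # illustrative stand-in, not the GCRN
    nn.Linear(2 * F, 512), nn.PReLU(),
    nn.Linear(512, 2 * F),
)

def enhance(noisy: torch.Tensor) -> torch.Tensor:
    window = torch.hann_window(N_FFT)
    spec = torch.stft(noisy, N_FFT, HOP, window=window, return_complex=True)
    # Stack real and imaginary parts as per-frame features: (frames, 2F).
    feats = torch.cat([spec.real, spec.imag], dim=0).transpose(0, 1)
    out = net(feats).transpose(0, 1)        # (2F, frames)
    est = torch.complex(out[:F], out[F:])   # re-assemble the complex spectrum
    return torch.istft(est, N_FFT, HOP, window=window, length=noisy.shape[-1])

enhanced = enhance(torch.randn(16000))      # one second at 16 kHz
print(enhanced.shape)                       # torch.Size([16000])
```

Because the estimate is a full complex spectrum rather than a magnitude mask applied to the noisy phase, inverting it with the iSTFT yields a waveform with enhanced phase as well as magnitude.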
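
Finally, a sketch of structured pruning, one representative ingredient of the kind of model compression pipeline the abstract describes; the dissertation's actual pipelines may combine different techniques. Removing entire rows of a weight matrix, rather than scattered individual weights, yields a smaller dense model amenable to real-time processing.

```python
# A minimal sketch of structured pruning with torch.nn.utils.prune.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 256)

# Zero out the 50% of output neurons (rows of the weight matrix) with the
# smallest L2 norm; dim=0 makes the pruning structured over whole rows.
prune.ln_structured(layer, name="weight", amount=0.5, n=2, dim=0)
prune.remove(layer, "weight")  # make the sparsity permanent

kept = (layer.weight.abs().sum(dim=1) != 0).sum().item()
print(f"{kept}/256 output neurons kept")  # 128/256

# In a full pipeline, the zeroed rows (and the matching input columns of the
# next layer) would be physically removed to shrink the model, followed by
# fine-tuning to recover any lost enhancement performance.
```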
DeLiang Wang (Advisor)
Eric Fosler-Lussier (Committee Member)
Eric Healy (Committee Member)
206 p.

Recommended Citations

  • Tan, K. (2021). Convolutional and recurrent neural networks for real-time speech separation in the complex domain [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1626983471600193

    APA Style (7th edition)

  • Tan, Ke. Convolutional and recurrent neural networks for real-time speech separation in the complex domain. 2021. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1626983471600193.

    MLA Style (8th edition)

  • Tan, Ke. "Convolutional and recurrent neural networks for real-time speech separation in the complex domain." Doctoral dissertation, Ohio State University, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=osu1626983471600193

    Chicago Manual of Style (17th edition)