Robust Automatic Speech Recognition By Integrating Speech Separation

Wang, Peidong, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Automatic speech recognition (ASR) has been used in many real-world applications such as smart speakers and meeting transcription. It converts speech waveforms to text, making it possible for computers to understand and process human speech. When deployed in scenarios with severe noise or multiple speakers, however, the performance of ASR degrades by large margins. Robust ASR refers to the research field that addresses such performance degradation. Conventionally, the robustness of ASR models to background noise is improved by cascading a speech enhancement frontend with an ASR backend. This approach introduces distortions into the speech signal that can render speech enhancement useless or even harmful for ASR. As for the robustness of ASR models to overlapped speech, traditional frontends cannot use speaker profiles efficiently. In this dissertation, we investigate the integration of ASR backends with speech separation (including speech enhancement and speaker separation) frontends.

We start by improving the performance of acoustic models in ASR. We propose an utterance-wise recurrent dropout method for a recurrent neural network (RNN) based acoustic model. With utterance-wise context better exploited, the word error rate (WER) is reduced substantially. We also propose an iterative speaker adaptation method that adapts the acoustic model to different speakers using the ASR output from the previous iteration.

To obtain a better trade-off between noise reduction and speech distortion for robust monaural (i.e., single-channel) ASR, we train the acoustic model with a large variety of enhanced speech generated by a monaural speech enhancement model. In this way, the influence of speech distortion on ASR can be alleviated. We then investigate the use of different types of enhanced features for distortion-independent acoustic modeling. Using distortion-independent acoustic modeling with magnitude features as input, we obtain state-of-the-art results on the second CHiME speech separation and recognition (CHiME-2) corpus.

Multi-channel speech enhancement typically introduces less distortion than monaural speech enhancement. We first substitute the summation operation in beamforming with a learnable complex-domain convolutional layer. Operations in the complex domain leverage both magnitude and phase information. We then combine this complex-domain idea with a two-stage beamforming approach. The first stage extracts spatial features, and the second stage uses both the extracted spatial features and the original spectral features as input. In this way, the second stage exploits spatial and spectral features explicitly. Using the proposed method, we achieve the state-of-the-art result on the 4th CHiME speech separation and recognition challenge (CHiME-4) corpus.

While the enhancement of noisy speech leverages the differences between speech and noise in time-frequency (T-F) patterns, the separation of overlapped speech needs to use speaker-related information. We investigate speaker separation using an inventory of speaker profiles containing speaker identity information. We first select the speaker profiles involved in the overlapped speech using an attention-based method. The selected speaker profiles are then used together with the original overlapped speech as input for speaker separation. To alleviate the problem caused by incorrect speaker profile selection, we propose using the output of speaker separation as the selected speaker profiles for further iterations of speaker separation.
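
To make the speaker profile selection step concrete, the following is a minimal sketch of an attention-based selection mechanism, assuming fixed-dimensional profile embeddings and a scaled dot-product score; the function names, dimensions, and scoring function here are illustrative assumptions, not the dissertation's exact design.

    # Minimal sketch (assumed design, not the dissertation's implementation):
    # compare an embedding of the overlapped speech against every profile in
    # the inventory and pass the most strongly attended profiles to the
    # downstream speaker separation model.
    import torch
    import torch.nn.functional as F

    def select_profiles(mixture_emb: torch.Tensor,
                        profile_inventory: torch.Tensor,
                        num_speakers: int = 2):
        """
        mixture_emb:       (dim,)               embedding of the overlapped utterance
        profile_inventory: (num_profiles, dim)  one embedding per enrolled speaker
        Returns the num_speakers profiles with the highest attention weights.
        """
        # Scaled dot-product attention scores between the mixture and each profile.
        scores = profile_inventory @ mixture_emb / mixture_emb.numel() ** 0.5
        weights = F.softmax(scores, dim=0)
        top = torch.topk(weights, k=num_speakers)
        selected = profile_inventory[top.indices]   # profiles fed to the separator
        return selected, top.indices, weights

    # Hypothetical usage with a 10-speaker inventory of 128-dimensional profiles.
    inventory = torch.randn(10, 128)
    mixture = torch.randn(128)
    profiles, ids, w = select_profiles(mixture, inventory, num_speakers=2)

In this sketch, an incorrect selection simply means the wrong rows of the inventory are returned, which is the failure mode the iterative re-selection scheme described above is meant to mitigate.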
Finally, speech contains sensitive personal data that users may not want to send to cloud-based servers for processing. Next-generation ASR systems should therefore not only be robust to adverse conditions but also lightweight enough to be deployed on-device. We investigate model compression methods for ASR that do not require model retraining. Our proposed weight-sharing based model compression method achieves 9-fold compression with negligible performance degradation.
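
As a rough illustration of the weight-sharing idea (not the dissertation's specific method or its 9-fold figure), the sketch below quantizes a weight matrix to a small codebook of shared values via k-means, so that only per-weight indices plus the codebook need to be stored; the helper names and sizes are hypothetical.

    # Sketch of weight sharing for compression: cluster the weights of one
    # matrix into a few shared values and keep only the codebook and indices.
    import numpy as np
    from sklearn.cluster import KMeans

    def share_weights(w: np.ndarray, n_codes: int = 16):
        """Quantize a weight matrix to n_codes shared values."""
        flat = w.reshape(-1, 1)
        km = KMeans(n_clusters=n_codes, n_init=10, random_state=0).fit(flat)
        codebook = km.cluster_centers_.reshape(-1)   # shared weight values
        indices = km.labels_.astype(np.uint8)        # per-weight code index
        return codebook, indices

    def reconstruct(codebook: np.ndarray, indices: np.ndarray, shape):
        """Rebuild an approximate weight matrix from the shared values."""
        return codebook[indices].reshape(shape)

    w = np.random.randn(256, 256).astype(np.float32)
    codebook, idx = share_weights(w, n_codes=16)
    w_hat = reconstruct(codebook, idx, w.shape)
    # Storage drops from 32 bits per weight to about 4 bits (16 codes)
    # plus a tiny codebook, with no retraining involved.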
DeLiang Wang (Advisor)
Eric Fosler-Lussier (Committee Member)
Wei-Lun Chao (Committee Member)

Recommended Citations


  • Wang, P. (n.d.). Robust Automatic Speech Recognition By Integrating Speech Separation [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1619099401042668

    APA Style (7th edition)

  • Wang, Peidong. Robust Automatic Speech Recognition By Integrating Speech Separation. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1619099401042668.

    MLA Style (8th edition)

  • Wang, Peidong. "Robust Automatic Speech Recognition By Integrating Speech Separation." Doctoral dissertation, Ohio State University. Accessed APRIL 16, 2024. http://rave.ohiolink.edu/etdc/view?acc_num=osu1619099401042668

    Chicago Manual of Style (17th edition)