Robust Speech Enhancement in the Time Domain


2022, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Speech is the primary mode of human communication and a natural interface for human-machine interaction. However, background noise in the real world creates difficulty for both human and machine listeners. Speech enhancement aims at removing or attenuating background noise from degraded speech. In contrast to the widely used time-frequency (T-F) based methods, time-domain speech enhancement estimates clean speech samples directly from noisy speech samples. Time-domain speech enhancement with deep neural networks (DNNs) is an exciting research direction because it can jointly enhance the spectral magnitude and phase by exploiting the strong modeling capabilities of DNNs. This dissertation presents a systematic effort to develop monaural time-domain speech enhancement systems using DNNs.

We start by developing a novel framework for time-domain speech enhancement. It combines a convolutional neural network (CNN) for time-domain enhancement with a spectral magnitude based loss for supervised training. CNNs are well suited to learning representations from raw waveforms because they exploit local correlations, while the loss over spectral magnitudes helps supervised learning capture the discriminative patterns of speech and noise across frequency bands. The proposed framework significantly outperforms a strong T-F based gated residual network (GRN) model for spectral magnitude enhancement.

Many real-world applications, such as hearing aids and teleconferencing, require real-time speech enhancement. Next, we develop a real-time speech enhancement system called TCNN (temporal convolutional neural network), a novel utterance-based CNN built from causal and dilated temporal convolutions. Causal convolutions are crucial for low algorithmic latency, and dilated convolutions help aggregate long-range context. TCNN is shown to outperform T-F based baseline models, including a unidirectional long short-term memory (LSTM) model and a convolutional recurrent network (CRN) model.

Further, we advance time-domain enhancement by improving both the CNN architecture and the training loss. The architecture is improved by adding densely connected blocks and using self-attention, which aggregates context better than dilated convolutions. We propose a new training loss, the phase constrained magnitude (PCM) loss, which is measured not only for the enhanced speech but also for the removed noise. The PCM loss improves phase enhancement and, as a result, yields better SNR improvements and removes an undesired artifact introduced by the spectral magnitude loss.

Next, we systematically investigate the cross-corpus generalization of DNN based speech enhancement. We observe that DNNs suffer from a corpus fitting problem, in which a DNN trained on one corpus fails to generalize to other corpora. We propose several techniques to address it, including channel normalization, a smaller frame shift, and a more comprehensive training corpus. To further improve cross-corpus generalization, we propose a novel attentive recurrent network (ARN) for time-domain speech enhancement, whose key ingredients are recurrent layers, self-attention, a smaller frame shift, and a larger training corpus. ARN exhibits superior speech enhancement across multiple tasks.
The causal version of ARN is the first system trained in a speaker-, noise-, and corpus-independent way to exhibit substantial intelligibility improvements for both normal-hearing and hearing-impaired listeners in low SNR conditions. Finally, we propose a novel training framework called attentive training for supervised speech enhancement that removes not only background noise but also interfering speech. The main idea of attentive training is to attend to the stream of a single speaker in the mixture over the speech signals of other talkers and background noise. A DNN model is trained to sequentially attend to, and thereby extract, the speech of the first speaker and ignore the rest, where the onset time of the first speaker serves as an intrinsic mechanism for speaker selection. Attentive training outperforms the widely used permutation invariant training for speaker separation.
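To ground the spectral magnitude based loss described above, here is a minimal PyTorch sketch. The STFT settings, the L1 distance, and all names are illustrative assumptions, not the dissertation's exact configuration.

```python
import torch


def spectral_magnitude_loss(enhanced: torch.Tensor,
                            clean: torch.Tensor,
                            n_fft: int = 512,
                            hop_length: int = 256) -> torch.Tensor:
    """Mean L1 distance between STFT magnitudes of two waveform batches.

    Both inputs have shape (batch, samples). The network operates in the
    time domain, but the loss is measured on spectral magnitudes, which
    exposes the frequency structure of speech and noise to training.
    """
    window = torch.hann_window(n_fft, device=enhanced.device)
    mag_e = torch.stft(enhanced, n_fft, hop_length, window=window,
                       return_complex=True).abs()
    mag_c = torch.stft(clean, n_fft, hop_length, window=window,
                       return_complex=True).abs()
    return torch.mean(torch.abs(mag_e - mag_c))
```

Because the gradient flows through the STFT back to the waveform, the network still shapes the time-domain signal, including its phase.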
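The causal, dilated temporal convolutions at the heart of a TCNN-style model can be sketched as follows; the module name, channel handling, and kernel size are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalDilatedConv1d(nn.Module):
    """A 1-D convolution that is both causal and dilated.

    Padding only on the left guarantees that output frame t never depends
    on input frames later than t, which is what keeps algorithmic latency
    low; dilation widens the temporal context without extra parameters.
    """

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames); pad past context only, so the
        # output has the same number of frames as the input.
        return self.conv(F.pad(x, (self.left_pad, 0)))
```

Stacking such blocks with dilations 1, 2, 4, 8, and so on grows the receptive field exponentially while every output frame still depends only on present and past inputs.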
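The PCM loss can be illustrated by reusing the magnitude loss sketched above. Since the noisy mixture is the sum of speech and noise in the time domain, penalizing the magnitudes of both the enhanced speech and the implied noise estimate indirectly constrains their phases. The equal weighting and the function names here are assumptions, not the published formulation.

```python
def pcm_loss(enhanced: torch.Tensor,
             clean: torch.Tensor,
             noisy: torch.Tensor) -> torch.Tensor:
    # The noise the model implicitly removed, and the true noise component.
    noise_est = noisy - enhanced
    noise_ref = noisy - clean
    # Measure the spectral magnitude loss for the enhanced speech *and*
    # the removed noise (reusing spectral_magnitude_loss from above).
    return 0.5 * (spectral_magnitude_loss(enhanced, clean)
                  + spectral_magnitude_loss(noise_est, noise_ref))
```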
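Finally, the contrast between permutation invariant training and attentive training can be sketched for a two-talker mixture. The tensor shapes, the per-utterance loss, and the onset-based selection code are illustrative assumptions built around the idea stated above: the training target is fixed to whichever speaker starts first.

```python
import torch


def mae(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Per-utterance mean absolute error, shape (batch,).
    return (a - b).abs().mean(dim=-1)


def pit_loss(est: torch.Tensor, refs: torch.Tensor) -> torch.Tensor:
    # est, refs: (batch, 2, samples). PIT scores both speaker assignments
    # and keeps the cheaper one, so the output order is unconstrained.
    l_a = mae(est[:, 0], refs[:, 0]) + mae(est[:, 1], refs[:, 1])
    l_b = mae(est[:, 0], refs[:, 1]) + mae(est[:, 1], refs[:, 0])
    return torch.minimum(l_a, l_b).mean()


def attentive_training_loss(est: torch.Tensor, refs: torch.Tensor,
                            onsets: torch.Tensor) -> torch.Tensor:
    # est: (batch, samples); onsets: (batch, 2), onset time per talker.
    # The target is the talker who speaks first, an intrinsic cue the
    # model can use to select and then track a single speaker.
    first = torch.argmin(onsets, dim=1)
    target = refs[torch.arange(refs.size(0)), first]
    return mae(est, target).mean()
```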
DeLiang Wang (Advisor)
Eric Healy (Committee Member)
Eric Fosler-Lussier (Committee Member)

Recommended Citations


  • Pandey, A. (2022). Robust Speech Enhancement in the Time Domain [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1658408736909688

    APA Style (7th edition)

  • Pandey, Ashutosh. Robust Speech Enhancement in the Time Domain. 2022. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1658408736909688.

    MLA Style (8th edition)

  • Pandey, Ashutosh. "Robust Speech Enhancement in the Time Domain." Doctoral dissertation, Ohio State University, 2022. http://rave.ohiolink.edu/etdc/view?acc_num=osu1658408736909688

    Chicago Manual of Style (17th edition)