Robust Speech Enhancement in the Time Domain


2022, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Speech is the primary mode of human communication and a natural interface for human-machine interaction. However, background noise in the real world creates difficulty for both human and machine listeners. Speech enhancement aims at removing or attenuating background noise from degraded speech. In contrast to the widely used time-frequency (T-F) based methods, time-domain speech enhancement estimates clean speech samples directly from noisy speech samples. Time-domain speech enhancement with deep neural networks (DNNs) is an exciting research direction because it can jointly enhance the spectral magnitude and phase by exploiting the strong modeling capabilities of DNNs. This dissertation presents a systematic effort to develop monaural time-domain speech enhancement systems using DNNs.

We start by developing a novel framework for time-domain speech enhancement. It combines a convolutional neural network (CNN) for time-domain enhancement with a spectral magnitude based loss for supervised training. CNNs are well suited to learning representations from raw waveforms because they exploit local correlations, while the loss over spectral magnitudes helps supervised learning capture the discriminative patterns of speech and noise across frequency bands. The proposed framework significantly outperforms a strong T-F based gated residual network (GRN) model for spectral magnitude enhancement.

Many real-world applications, such as hearing aids and teleconferencing, require real-time speech enhancement. Next, we develop a real-time speech enhancement system called TCNN (temporal convolutional neural network), a novel utterance-based CNN built from causal and dilated temporal convolutions. Causal convolutions are crucial for low algorithmic latency, and dilated convolutions help aggregate long-range context. TCNN is shown to outperform T-F based baseline models, including a unidirectional long short-term memory (LSTM) model and a convolutional recurrent network (CRN) model.

Further, we advance time-domain enhancement by improving both the CNN architecture and the training loss. The architecture is improved by adding densely connected blocks and using self-attention, which aggregates context better than dilated convolutions. We propose a new training loss, the phase constrained magnitude (PCM) loss, which is measured not only for the enhanced speech but also for the removed noise. The PCM loss improves phase enhancement and, as a result, yields better SNR improvements and removes an undesired artifact introduced by the spectral magnitude loss.

Next, we systematically investigate the cross-corpus generalization of DNN based speech enhancement. We observe that DNNs suffer from a corpus fitting problem, in which a DNN trained on one corpus fails to generalize to other corpora. We propose several techniques to address it, including channel normalization, a smaller frame shift, and a more comprehensive training corpus. To further improve cross-corpus generalization, we propose a novel attentive recurrent network (ARN) for time-domain speech enhancement, whose key ingredients are recurrent layers, self-attention, a smaller frame shift, and a larger training corpus. ARN exhibits superior speech enhancement across multiple tasks.
The causal version of ARN is the first system trained in a speaker-, noise-, and corpus-independent way to exhibit substantial intelligibility improvements for both normal-hearing and hearing-impaired listeners in low SNR conditions. Finally, we propose a novel training framework called attentive training for supervised speech enhancement that removes not only background noise but also interfering speech. The main idea of attentive training is to attend to the stream of a single speaker in the mixture over the speech signals of other talkers and background noise. A DNN model is trained to sequentially attend to, and thereby extract, the speech of the first speaker and ignore the rest, where the onset time of the first speaker serves as an intrinsic mechanism for speaker selection. Attentive training outperforms the widely used permutation invariant training for speaker separation.
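To ground the spectral magnitude based loss described above, here is a minimal PyTorch sketch. The STFT settings, the L1 distance, and all names are illustrative assumptions, not the dissertation's exact configuration.

```python
import torch


def spectral_magnitude_loss(enhanced: torch.Tensor,
                            clean: torch.Tensor,
                            n_fft: int = 512,
                            hop_length: int = 256) -> torch.Tensor:
    """Mean L1 distance between STFT magnitudes of two waveform batches.

    Both inputs have shape (batch, samples). The network operates in the
    time domain, but the loss is measured on spectral magnitudes, which
    exposes the frequency structure of speech and noise to training.
    """
    window = torch.hann_window(n_fft, device=enhanced.device)
    mag_e = torch.stft(enhanced, n_fft, hop_length, window=window,
                       return_complex=True).abs()
    mag_c = torch.stft(clean, n_fft, hop_length, window=window,
                       return_complex=True).abs()
    return torch.mean(torch.abs(mag_e - mag_c))
```

Because the gradient flows through the STFT back to the waveform, the network still shapes the time-domain signal, including its phase.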
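The causal, dilated temporal convolutions at the heart of a TCNN-style model can be sketched as follows; the module name, channel handling, and kernel size are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalDilatedConv1d(nn.Module):
    """A 1-D convolution that is both causal and dilated.

    Padding only on the left guarantees that output frame t never depends
    on input frames later than t, which is what keeps algorithmic latency
    low; dilation widens the temporal context without extra parameters.
    """

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames); pad past context only, so the
        # output has the same number of frames as the input.
        return self.conv(F.pad(x, (self.left_pad, 0)))
```

Stacking such blocks with dilations 1, 2, 4, 8, and so on grows the receptive field exponentially while every output frame still depends only on present and past inputs.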
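The PCM loss can be illustrated by reusing the magnitude loss sketched above. Since the noisy mixture is the sum of speech and noise in the time domain, penalizing the magnitudes of both the enhanced speech and the implied noise estimate indirectly constrains their phases. The equal weighting and the function names here are assumptions, not the published formulation.

```python
def pcm_loss(enhanced: torch.Tensor,
             clean: torch.Tensor,
             noisy: torch.Tensor) -> torch.Tensor:
    # The noise the model implicitly removed, and the true noise component.
    noise_est = noisy - enhanced
    noise_ref = noisy - clean
    # Measure the spectral magnitude loss for the enhanced speech *and*
    # the removed noise (reusing spectral_magnitude_loss from above).
    return 0.5 * (spectral_magnitude_loss(enhanced, clean)
                  + spectral_magnitude_loss(noise_est, noise_ref))
```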
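Finally, the contrast between permutation invariant training and attentive training can be sketched for a two-talker mixture. The tensor shapes, the per-utterance loss, and the onset-based selection code are illustrative assumptions built around the idea stated above: the training target is fixed to whichever speaker starts first.

```python
import torch


def mae(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Per-utterance mean absolute error, shape (batch,).
    return (a - b).abs().mean(dim=-1)


def pit_loss(est: torch.Tensor, refs: torch.Tensor) -> torch.Tensor:
    # est, refs: (batch, 2, samples). PIT scores both speaker assignments
    # and keeps the cheaper one, so the output order is unconstrained.
    l_a = mae(est[:, 0], refs[:, 0]) + mae(est[:, 1], refs[:, 1])
    l_b = mae(est[:, 0], refs[:, 1]) + mae(est[:, 1], refs[:, 0])
    return torch.minimum(l_a, l_b).mean()


def attentive_training_loss(est: torch.Tensor, refs: torch.Tensor,
                            onsets: torch.Tensor) -> torch.Tensor:
    # est: (batch, samples); onsets: (batch, 2), onset time per talker.
    # The target is the talker who speaks first, an intrinsic cue the
    # model can use to select and then track a single speaker.
    first = torch.argmin(onsets, dim=1)
    target = refs[torch.arange(refs.size(0)), first]
    return mae(est, target).mean()
```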
DeLiang Wang (Advisor)
Eric Healy (Committee Member)
Eric Fosler-Lussier (Committee Member)

Recommended Citations


  • Pandey, A. (2022). Robust Speech Enhancement in the Time Domain [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1658408736909688

    APA Style (7th edition)

  • Pandey, Ashutosh. Robust Speech Enhancement in the Time Domain. 2022. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1658408736909688.

    MLA Style (8th edition)

  • Pandey, Ashutosh. "Robust Speech Enhancement in the Time Domain." Doctoral dissertation, Ohio State University, 2022. http://rave.ohiolink.edu/etdc/view?acc_num=osu1658408736909688

    Chicago Manual of Style (17th edition)