Files
Dissertation_YanZhao.pdf (12.51 MB)
Deep learning methods for reverberant and noisy speech enhancement
Author Info
Zhao, Yan
ORCID® Identifier
http://orcid.org/0000-0001-8595-3297
Permalink:
http://rave.ohiolink.edu/etdc/view?acc_num=osu1593462119759348
Year and Degree
2020, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Abstract
In daily listening environments, the speech reaching our ears is commonly corrupted by both room reverberation and background noise. These distortions can be detrimental to speech intelligibility and quality, and they pose a serious problem for many speech-related applications, including automatic speech and speaker recognition. The objective of this dissertation is to enhance speech signals distorted by reverberation and noise, to benefit both human communication and human-machine interaction. Unlike traditional signal processing approaches, we employ deep learning to perform reverberant-noisy speech enhancement. Our study starts with speech dereverberation without background noise. Reverberation consists of sound wave reflections from various surfaces in an enclosed space, which means the reverberant signal at any time step includes damped and delayed copies of past signals. To exploit these relationships across time steps, we utilize a self-attention mechanism as a pre-processing module to produce dynamic representations. With these enhanced representations, we propose a temporal convolutional network (TCN) based speech dereverberation algorithm. Systematic evaluations demonstrate the effectiveness of the proposed algorithm in a wide range of reverberant conditions. We then propose a deep learning based time-frequency (T-F) masking algorithm to address both reverberation and noise. Specifically, a deep neural network (DNN) is trained to estimate the ideal ratio mask (IRM), in which the anechoic-clean speech is considered the desired signal. The enhanced speech is obtained by applying the estimated mask to the reverberant-noisy speech. Listening tests show that the proposed algorithm substantially improves speech intelligibility for hearing-impaired (HI) listeners, and also benefits normal-hearing (NH) listeners.
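As a rough illustration of the T-F masking idea described above, the sketch below computes an ideal ratio mask from clean and noise magnitude spectrograms and applies it to a mixture. The function names, array shapes, and the small epsilon are assumptions made for this toy example, not details taken from the dissertation; in the dissertation's setting a DNN estimates the mask, with anechoic-clean speech as the target so that reverberation and noise are handled jointly.

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-8):
    """IRM per T-F unit: sqrt(S^2 / (S^2 + N^2)), values in [0, 1]."""
    return np.sqrt(clean_mag ** 2 / (clean_mag ** 2 + noise_mag ** 2 + eps))

def enhance(mixture_mag, mask):
    """Apply the estimated mask to the mixture magnitude spectrogram."""
    return mask * mixture_mag

# Toy 2x2 spectrograms (frequency x time); in practice these come from an STFT
# and the mask is estimated by a trained network rather than computed ideally.
clean = np.array([[1.0, 2.0], [0.5, 0.0]])
noise = np.array([[1.0, 0.0], [0.5, 1.0]])
mask = ideal_ratio_mask(clean, noise)
enhanced = enhance(clean + noise, mask)
```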
Considering the different natures of reverberation and noise, we propose to perform speech enhancement using a two-stage strategy, in which denoising and dereverberation are conducted sequentially using DNNs. Moreover, we design a new objective function to better estimate the magnitude spectrum of anechoic-clean speech. After pre-training the denoising and dereverberation stages separately, the two-stage model is jointly trained with the proposed objective function. Experiments show that two-stage processing significantly outperforms previous one-stage enhancement systems. We also investigate reverberant-noisy speech enhancement in the complex domain. Instead of predicting the complex ideal ratio mask (cIRM) explicitly, our proposed algorithm estimates a complex ratio mask implicitly and optimizes a loss function defined in terms of the complex spectrum of anechoic-clean speech. Furthermore, to integrate the contextual information among different T-F units more efficiently, we propose a new T-F attention mechanism. Together with an improved DenseUNet architecture, the proposed model substantially improves objective metrics of speech intelligibility and quality. Most existing supervised speech enhancement algorithms, whether based on spectral mapping or T-F masking, assign the same importance to all T-F units, without considering their different contributions to speech intelligibility or quality. To leverage insights from speech perception, we propose a new DNN based speech enhancement method that incorporates the widely used short-time objective intelligibility (STOI) measure as part of the loss function. Experimental results show that the proposed perceptually guided loss function further improves the STOI metric while maintaining objective speech quality.
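The perceptually guided loss described above can be sketched as follows. The real STOI measure involves one-third octave band decomposition and short-time segment correlations, so this is only a simplified stand-in: a per-band correlation term blended with a spectral MSE term. The function names and the `alpha` weighting are illustrative assumptions, not the dissertation's actual formulation.

```python
import numpy as np

def correlation_term(est, ref, eps=1e-8):
    """Negative mean correlation between estimated and reference spectral
    envelopes, computed per frequency band over time. A crude stand-in for
    STOI's band-wise short-time correlations; lower is better."""
    est_c = est - est.mean(axis=-1, keepdims=True)
    ref_c = ref - ref.mean(axis=-1, keepdims=True)
    corr = (est_c * ref_c).sum(axis=-1) / (
        np.linalg.norm(est_c, axis=-1) * np.linalg.norm(ref_c, axis=-1) + eps)
    return -corr.mean()

def combined_loss(est, ref, alpha=0.5):
    """Blend a spectral MSE term with the intelligibility-inspired term,
    mirroring the idea of adding a perceptual measure to the loss."""
    mse = np.mean((est - ref) ** 2)
    return (1.0 - alpha) * mse + alpha * correlation_term(est, ref)
```

A perfect estimate drives the correlation term to its minimum of -1 per band, while the MSE term anchors the absolute spectral match that correlation alone ignores.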
Committee
DeLiang Wang (Advisor)
Eric Fosler-Lussier (Committee Member)
Eric Healy (Committee Member)
Pages
168 p.
Subject Headings
Computer Science; Engineering
Keywords
Deep neural networks; Supervised learning; Attention; Speech enhancement; Speech denoising; Speech dereverberation; Time-frequency masking; Speech intelligibility; Speech quality; Computational auditory scene analysis
Recommended Citations
APA Style (7th edition)
Zhao, Y. (2020). Deep learning methods for reverberant and noisy speech enhancement [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1593462119759348

MLA Style (8th edition)
Zhao, Yan. Deep learning methods for reverberant and noisy speech enhancement. 2020. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1593462119759348.

Chicago Manual of Style (17th edition)
Zhao, Yan. "Deep learning methods for reverberant and noisy speech enhancement." Doctoral dissertation, Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1593462119759348
Document number:
osu1593462119759348
Download Count:
1,442
Copyright Info
© 2020, all rights reserved.
This open access ETD is published by The Ohio State University and OhioLINK.