Files
Dissertation_YanZhao.pdf (12.51 MB)
Deep learning methods for reverberant and noisy speech enhancement
Author Info
Zhao, Yan
ORCID® Identifier
http://orcid.org/0000-0001-8595-3297
Permalink:
http://rave.ohiolink.edu/etdc/view?acc_num=osu1593462119759348
Year and Degree
2020, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Abstract
In daily listening environments, the speech reaching our ears is commonly corrupted by both room reverberation and background noise. These distortions can be detrimental to speech intelligibility and quality, and they pose a serious problem for many speech-related applications, including automatic speech and speaker recognition. The objective of this dissertation is to enhance speech signals distorted by reverberation and noise, to benefit both human communication and human-machine interaction. Unlike traditional signal processing approaches, we employ deep learning to perform reverberant-noisy speech enhancement. Our study starts with speech dereverberation without background noise. Reverberation consists of sound wave reflections from various surfaces in an enclosed space, which means the reverberant signal at any time step includes damped and delayed copies of past signals. To exploit these relationships across time steps, we utilize a self-attention mechanism as a pre-processing module to produce dynamic representations. With these enhanced representations, we propose a temporal convolutional network (TCN) based speech dereverberation algorithm. Systematic evaluations demonstrate the effectiveness of the proposed algorithm in a wide range of reverberant conditions. We then propose a deep learning based time-frequency (T-F) masking algorithm to address both reverberation and noise. Specifically, a deep neural network (DNN) is trained to estimate the ideal ratio mask (IRM), in which the anechoic-clean speech is considered the desired signal. The enhanced speech is obtained by applying the estimated mask to the reverberant-noisy speech. Listening tests show that the proposed algorithm substantially improves speech intelligibility for hearing-impaired (HI) listeners, and also benefits normal-hearing (NH) listeners.
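As a rough illustration of the T-F masking idea described above, the sketch below computes an ideal ratio mask from clean and noise magnitude spectrograms and applies it to a mixture. The function names, array shapes, and the small epsilon are assumptions made for this toy example, not details taken from the dissertation; in the dissertation's setting a DNN estimates the mask, with anechoic-clean speech as the target so that reverberation and noise are handled jointly.

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-8):
    """IRM per T-F unit: sqrt(S^2 / (S^2 + N^2)), values in [0, 1]."""
    return np.sqrt(clean_mag ** 2 / (clean_mag ** 2 + noise_mag ** 2 + eps))

def enhance(mixture_mag, mask):
    """Apply the estimated mask to the mixture magnitude spectrogram."""
    return mask * mixture_mag

# Toy 2x2 spectrograms (frequency x time); in practice these come from an STFT
# and the mask is estimated by a trained network rather than computed ideally.
clean = np.array([[1.0, 2.0], [0.5, 0.0]])
noise = np.array([[1.0, 0.0], [0.5, 1.0]])
mask = ideal_ratio_mask(clean, noise)
enhanced = enhance(clean + noise, mask)
```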
Considering the different natures of reverberation and noise, we propose to perform speech enhancement using a two-stage strategy, in which denoising and dereverberation are conducted sequentially using DNNs. Moreover, we design a new objective function to better estimate the magnitude spectrum of anechoic-clean speech. After pre-training the denoising and dereverberation stages separately, the two-stage model is jointly trained with the proposed objective function. Experiments show that two-stage processing significantly outperforms previous one-stage enhancement systems. We also investigate reverberant-noisy speech enhancement in the complex domain. Instead of predicting the complex ideal ratio mask (cIRM) explicitly, our proposed algorithm estimates a complex ratio mask implicitly and optimizes a loss function defined in terms of the complex spectrum of anechoic-clean speech. Furthermore, to integrate the contextual information among different T-F units more efficiently, we propose a new T-F attention mechanism. Together with an improved DenseUNet architecture, the proposed model substantially improves objective metrics of speech intelligibility and quality. Most existing supervised speech enhancement algorithms, whether based on spectral mapping or T-F masking, assign the same importance to all T-F units, without considering their different contributions to speech intelligibility or quality. To leverage insights from speech perception, we propose a new DNN based speech enhancement method that incorporates the widely used short-time objective intelligibility (STOI) measure as part of the loss function. Experimental results show that the proposed perceptually guided loss function further improves the STOI metric while maintaining objective speech quality.
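The perceptually guided loss described above can be sketched as follows. The real STOI measure involves one-third octave band decomposition and short-time segment correlations, so this is only a simplified stand-in: a per-band correlation term blended with a spectral MSE term. The function names and the `alpha` weighting are illustrative assumptions, not the dissertation's actual formulation.

```python
import numpy as np

def correlation_term(est, ref, eps=1e-8):
    """Negative mean correlation between estimated and reference spectral
    envelopes, computed per frequency band over time. A crude stand-in for
    STOI's band-wise short-time correlations; lower is better."""
    est_c = est - est.mean(axis=-1, keepdims=True)
    ref_c = ref - ref.mean(axis=-1, keepdims=True)
    corr = (est_c * ref_c).sum(axis=-1) / (
        np.linalg.norm(est_c, axis=-1) * np.linalg.norm(ref_c, axis=-1) + eps)
    return -corr.mean()

def combined_loss(est, ref, alpha=0.5):
    """Blend a spectral MSE term with the intelligibility-inspired term,
    mirroring the idea of adding a perceptual measure to the loss."""
    mse = np.mean((est - ref) ** 2)
    return (1.0 - alpha) * mse + alpha * correlation_term(est, ref)
```

A perfect estimate drives the correlation term to its minimum of -1 per band, while the MSE term anchors the absolute spectral match that correlation alone ignores.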
Committee
DeLiang Wang (Advisor)
Eric Fosler-Lussier (Committee Member)
Eric Healy (Committee Member)
Pages
168 p.
Subject Headings
Computer Science; Engineering
Keywords
Deep neural networks; Supervised learning; Attention; Speech enhancement; Speech denoising; Speech dereverberation; Time-frequency masking; Speech intelligibility; Speech quality; Computational auditory scene analysis
Recommended Citations
APA Style (7th edition)
Zhao, Y. (2020). Deep learning methods for reverberant and noisy speech enhancement [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1593462119759348

MLA Style (8th edition)
Zhao, Yan. Deep learning methods for reverberant and noisy speech enhancement. 2020. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1593462119759348.

Chicago Manual of Style (17th edition)
Zhao, Yan. "Deep learning methods for reverberant and noisy speech enhancement." Doctoral dissertation, Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1593462119759348
Document number:
osu1593462119759348
Download Count:
1,442
Copyright Info
© 2020, all rights reserved.
This open access ETD is published by The Ohio State University and OhioLINK.