DEEP LEARNING METHODS FOR IMPROVING THE PERCEPTUAL QUALITY OF NOISY AND REVERBERANT SPEECH

Williamson, Donald S.

2016, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Speech is a vital form of human communication, and it is important for many real-world applications. Voice commands are used to interface with electronic devices, and hearing-impaired individuals use hearing aids to understand speech better. In realistic environments, background noise and reverberation are present, resulting in performance degradation. For this reason, it is crucial that speech be separated from interference. Many speech separation approaches have been proposed, but there is a considerable need to produce speech estimates that are both intelligible and of high quality, especially at low signal-to-noise ratios (SNRs).
Time-frequency (T-F) masking and model-based separation are two common ways to extract speech from a noisy observation. T-F masking involves the estimation of an oracle mask, which can be accomplished using supervised learning. Deep neural networks (DNNs) are well suited for T-F mask estimation due to their ability to learn mappings from noisy observations to a desired target. Likewise, model-based separation is suitable due to its ability to represent the spectral structure of speech. This dissertation presents work that develops speech separation systems using combinations of T-F masking, DNNs, and model-based reconstruction. The aim of each system is to improve the perceptual quality of the speech estimates.
Ideal binary mask (IBM) estimation has shown success in improving the intelligibility of separated speech, but it often results in poor quality due to estimation errors and the removal of speech. On the other hand, model-based separation approaches such as nonnegative matrix factorization (NMF) and sparse reconstruction improve the perceptual quality, but not the intelligibility, of separated speech. We start by studying the performance of speech separation that combines IBM estimation with model-based reconstruction. We demonstrate that our system can improve perceptual quality and intelligibility over performing T-F masking or model-based separation alone.
DNNs have successfully estimated a range of targets. We then present a method that uses a DNN to estimate the activations of a speech model. Initially, a DNN is used to estimate the ideal ratio mask (IRM), where the estimated IRM separates the speech from the noise with reasonable sound quality. Afterwards, a second DNN learns the mapping from ratio-masked speech to NMF model activations. The estimated activations linearly combine the elements of an NMF speech model to approximate clean speech. Experiments show that the proposed approach produces high-quality separated speech. In addition, we conduct a listening study whose results show that our output is preferred over comparison systems.
The above and most other speech separation systems operate on the magnitude response of noisy speech and use the noisy phase during signal reconstruction. This is done because the phase spectrum has long been believed to be unimportant for speech enhancement. More recent studies, however, reveal that phase is important for perceptual quality. We present an approach that concurrently enhances the magnitude and phase spectra by operating in the complex domain. We start by introducing the complex ideal ratio mask (cIRM), which has real and imaginary components. A DNN is used to jointly estimate these components of the cIRM. Evaluation results demonstrate that the proposed system substantially improves perceptual quality over recent approaches in noisy environments.
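As a rough illustration of the complex-domain idea described above, the sketch below builds a complex ideal ratio mask from paired clean and noisy signals and applies an estimated mask to resynthesize speech. This is a minimal numpy/scipy sketch under stated assumptions, not the dissertation's implementation; the STFT settings, the small stabilizing constant, and the use of scipy.signal are illustrative choices.

```python
# Minimal sketch (not the dissertation's code): defining and applying a
# complex ideal ratio mask (cIRM). STFT parameters are assumed values.
import numpy as np
from scipy.signal import stft, istft

def complex_ideal_ratio_mask(clean, noisy, fs=16000, nperseg=512):
    """cIRM defined so that S = M * Y in the complex STFT domain, i.e. M = S / Y."""
    _, _, S = stft(clean, fs=fs, nperseg=nperseg)   # complex spectrogram of clean speech
    _, _, Y = stft(noisy, fs=fs, nperseg=nperseg)   # complex spectrogram of noisy mixture
    M = S / (Y + 1e-8)                              # complex division (epsilon guards near-zero bins)
    return M.real, M.imag                           # the two components a DNN would estimate

def apply_complex_mask(noisy, M_real, M_imag, fs=16000, nperseg=512):
    """Multiply an (estimated) complex mask with the noisy STFT and resynthesize."""
    _, _, Y = stft(noisy, fs=fs, nperseg=nperseg)
    S_hat = (M_real + 1j * M_imag) * Y              # complex product enhances magnitude and phase together
    _, speech_hat = istft(S_hat, fs=fs, nperseg=nperseg)
    return speech_hat
```

Because complex division can produce very large values in low-energy bins, a bounded or compressed version of the mask is commonly used as the actual training target; any such compression constants would be additional hyperparameters.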
Along with background noise, room reverberation is commonly encountered in real environments. The performance of many speech processing applications is severely degraded when both noise and reverberation are present. We propose to simultaneously perform dereverberation and denoising with the cIRM. First, we redefine the cIRM for reverberant and noisy environments. A DNN is then trained to estimate it. The estimated complex mask removes the interference caused by noise and reverberation, resulting in improved predicted speech quality and intelligibility.
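One possible way to set up the joint estimation of the two mask components, as described above, is sketched below: a feedforward network maps a frame of noisy features to the real and imaginary parts of the mask and is trained with a mean-squared-error objective. The framework (PyTorch), layer sizes, feature dimension, and optimizer settings are assumptions for illustration, not the dissertation's configuration.

```python
# Illustrative sketch only: a DNN that jointly predicts the real and imaginary
# components of a complex ratio mask from noisy features. Architecture and
# hyperparameters are assumed, not taken from the dissertation.
import torch
import torch.nn as nn

FEAT_DIM = 257   # assumed feature dimension (e.g., one STFT frame)
MASK_DIM = 257   # assumed number of frequency bins per mask component

class ComplexMaskDNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(FEAT_DIM, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
        )
        # Two output heads share the trunk, so the real and imaginary parts
        # are estimated jointly from the same hidden representation.
        self.real_head = nn.Linear(1024, MASK_DIM)
        self.imag_head = nn.Linear(1024, MASK_DIM)

    def forward(self, x):
        h = self.trunk(x)
        return self.real_head(h), self.imag_head(h)

def train_step(model, optimizer, feats, target_real, target_imag):
    """One MSE training step on a batch of (noisy features, target mask) pairs."""
    optimizer.zero_grad()
    pred_real, pred_imag = model(feats)
    loss = nn.functional.mse_loss(pred_real, target_real) + \
           nn.functional.mse_loss(pred_imag, target_imag)
    loss.backward()
    optimizer.step()
    return loss.item()

model = ComplexMaskDNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```

For the reverberant-plus-noisy case, the same setup applies; only the training target changes, with the mask redefined against the reverberant noisy mixture as described in the paragraph above.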
DeLiang Wang (Advisor)
Eric Fosler-Lussier (Committee Member)
Mikhail Belkin (Committee Member)
157 p.

Recommended Citations


  • Williamson, D. S. (2016). DEEP LEARNING METHODS FOR IMPROVING THE PERCEPTUAL QUALITY OF NOISY AND REVERBERANT SPEECH [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1461018277

    APA Style (7th edition)

  • Williamson, Donald. DEEP LEARNING METHODS FOR IMPROVING THE PERCEPTUAL QUALITY OF NOISY AND REVERBERANT SPEECH. 2016. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1461018277.

    MLA Style (8th edition)

  • Williamson, Donald. "DEEP LEARNING METHODS FOR IMPROVING THE PERCEPTUAL QUALITY OF NOISY AND REVERBERANT SPEECH." Doctoral dissertation, Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1461018277

    Chicago Manual of Style (17th edition)