Deep CASA for Robust Pitch Tracking and Speaker Separation


2019, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.
Speech is the most important means of human communication. In real environments, speech is often corrupted by acoustic inference, including noise, reverberation and competing speakers. Such interference leads to adverse effects on audition, and degrades the performance of speech applications. Inspired by the principles of human auditory scene analysis (ASA), computational auditory scene analysis (CASA) addresses speech separation in two main steps: segmentation and grouping. With noisy speech decomposed into a matrix of time-frequency (T-F) units, segmentation organizes T-F units into segments, each of which corresponds to a contiguous T-F region and is supposed to originate from the same source. Two types of grouping are then performed. Simultaneous grouping aggregates segments overlapping in time to simultaneous streams. In sequential grouping, simultaneous streams are grouped across time into distinct sources. As a traditional speech separation approach, CASA has been successfully applied in various speech-related tasks. In this dissertation, we revisit conventional CASA methods, and perform related tasks from a deep learning perspective. As an intrinsic characteristic of speech, pitch serves as a primary cue in many CASA systems. A reliable estimate of pitch is important not only for extracting harmonic patterns at a frame level, but also for streaming voiced speech in sequential grouping. Based on the types of interference, we can divide pitch tracking in two categories: single pitch tracking in noise and multi-pitch tracking. Pitch tracking in noise is challenging as the harmonic structure of speech can be severely contaminated. To recover the missing harmonic patterns, we propose to use long short-term memory (LSTM) recurrent neural networks (RNNs) to model sequential dynamics. Two architectures are investigated. The first one is conventional LSTM that utilizes recurrent connections to model temporal dynamics. The second one is two-level time-frequency LSTM, with the first level scanning frequency bands and the second level connecting the first level through time. Systematic evaluations show that both proposed models outperform a deep neural network (DNN) based model in various noisy conditions. Multi-pitch tracking aims to extract concurrent pitch contours of different speakers. Accurate pitch estimation and correct speaker assignments need to be achieved at the same time. We use DNNs to model the probabilistic pitch states of two simultaneous speakers. Speaker-dependent (SD) training is adopted for a more accurate assignment of pitch states. A factorial hidden Markov model (FHMM) then integrates pitch probabilities and generates the most likely pitch tracks. Evaluations show that the proposed SD DNN-FHMM framework outperforms other speaker-independent (SI) and SD multi-pitch trackers on two-speaker mixtures. Speaker-independent multi-pitch tracking has been a long-standing difficulty. We extend the DNN-FHMM framework, and use an utterance-level permutation invariant training (uPIT) criterion to train the system with speaker-independent data. A speaker separation front end is further added to improve pitch estimation. The proposed SI approach substantially outperforms all other SI multi-pitch trackers, and largely closes the gap with SD methods. Besides exploring deep learning based pitch tracking as cues for CASA, we directly address talker-independent monaural speaker separation from the perspectives of CASA and deep learning, resulting in what we call a deep CASA approach. 
Simultaneous grouping is first performed for frame-level separation of the two speakers with a permutation-invariantly trained neural network. Sequential grouping then assigns the frame-level separated spectra to distinct speakers with a clustering network. Compared to a uPIT system, which conducts frame-level separation and speaker tracking in one stage, our deep CASA framework achieves better performance on both objectives. Evaluation results on the benchmark WSJ two-speaker mixture database demonstrate that deep CASA significantly outperforms other spectral-domain approaches.

In talker-independent speaker separation, generalization to an unknown number of speakers and causal processing are two important considerations for real-world deployment. We propose a multi-speaker extension of deep CASA for C concurrent speakers (C >= 2), which works well for speech mixtures with up to C speakers even without prior knowledge of the number of speakers. We also propose extensive revisions to the connections, normalization and clustering algorithm in deep CASA to obtain a causal system. Experimental results on the WSJ0-2mix and WSJ0-3mix databases show that both extensions achieve state-of-the-art performance. The development of the deep CASA approach in this dissertation represents a major step towards solving the cocktail party problem.
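
To make the two-stage structure concrete, the following is a conceptual sketch of the sequential grouping step, assuming PyTorch; the simple 2-means clustering, the tensor shapes and the function name are illustrative assumptions, standing in for the clustering network described above rather than reproducing it.

    # Conceptual sketch of sequential grouping: cluster frame-level embeddings and
    # reorder the frame-level estimates so each output stream follows one speaker.
    # Assumes PyTorch; the toy 2-means step is illustrative only.
    import torch

    def sequential_grouping(frame_estimates: torch.Tensor,
                            frame_embeddings: torch.Tensor) -> torch.Tensor:
        """frame_estimates: (time, 2, freq), frame_embeddings: (time, 2, emb_dim)."""
        time_steps, _, emb_dim = frame_embeddings.shape
        flat = frame_embeddings.reshape(-1, emb_dim)

        # Plain 2-means clustering over all frame-level embeddings.
        centers = flat[torch.randperm(flat.shape[0])[:2]].clone()
        for _ in range(10):
            labels = torch.cdist(flat, centers).argmin(dim=1)
            for k in range(2):
                members = flat[labels == k]
                if members.shape[0] > 0:
                    centers[k] = members.mean(dim=0)

        # Keep the frame's estimate order when output 0 falls in cluster 0,
        # otherwise swap the two estimates so stream k consistently tracks speaker k.
        labels = labels.reshape(time_steps, 2)
        swap = labels[:, 0] == 1
        grouped = frame_estimates.clone()
        grouped[swap] = frame_estimates[swap].flip(dims=[1])
        return grouped

Here the toy clustering plays the role of the clustering network mentioned in the abstract; the point of the sketch is the reordering of frame-level outputs into consistent speaker streams.
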
DeLiang Wang (Advisor)
Eric Fosler-Lussier (Committee Member)
Alan Ritter (Committee Member)
159 p.

Recommended Citations

  • Liu, Y. (2019). Deep CASA for Robust Pitch Tracking and Speaker Separation [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1566179636974186

    APA Style (7th edition)

  • Liu, Yuzhou. Deep CASA for Robust Pitch Tracking and Speaker Separation. 2019. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1566179636974186.

    MLA Style (8th edition)

  • Liu, Yuzhou. "Deep CASA for Robust Pitch Tracking and Speaker Separation." Doctoral dissertation, Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1566179636974186

    Chicago Manual of Style (17th edition)