Integrating computational auditory scene analysis and automatic speech recognition

Srinivasan, Soundararajan

2006, Doctor of Philosophy, Ohio State University, Biomedical Engineering.
Speech perception studies indicate that the robustness of human speech recognition is primarily due to our ability to segregate a target sound source from other interferences. This perceptual process of auditory scene analysis (ASA) comprises two types: primitive and schema-driven. This dissertation investigates several aspects of integrating computational ASA (CASA) and automatic speech recognition (ASR). While bottom-up CASA is used as a front-end for ASR to improve its robustness, ASR is used to provide top-down information that enhances primitive segregation.

Listeners are able to restore masked phonemes by utilizing lexical context. We present a schema-based model for phonemic restoration. The model employs missing-data ASR to decode masked speech and activates word templates via dynamic time warping. A systematic evaluation shows that the model restores both voiced and unvoiced phonemes with high spectral quality.

Missing-data ASR requires a binary mask from bottom-up CASA that identifies speech-dominant time-frequency regions of a noisy mixture. We propose a two-pass system that performs segregation and recognition in tandem. First, an n-best lattice, consistent with bottom-up speech separation, is generated. Second, the lattice is re-scored using a model-based hypothesis test to improve mask estimation and recognition accuracy concurrently.

By combining CASA and ASR, we present a model that simulates listeners' ability to attend to a target speaker when degraded by energetic and informational masking. Missing-data ASR is used to account for energetic masking, and the output degradation of CASA is used to model informational masking. The model successfully simulates several quantitative aspects of listener performance.

The degradation in the output of CASA-based front-ends leads to uncertain ASR inputs. We estimate feature uncertainties in the spectral domain and transform them into the cepstral domain via nonlinear regression. The estimated uncertainty substantially improves recognition accuracy.

We also investigate the effect of vocabulary size on conventional and missing-data ASR. Based on binaural cues, for conventional ASR we extract the speech signal using a Wiener filter, and for missing-data ASR we estimate a binary mask. We find that while missing-data ASR outperforms conventional ASR on a small-vocabulary task, the relative performance reverses on a larger-vocabulary task.
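The abstract mentions activating word templates via dynamic time warping (DTW). As an illustration only, here is a minimal sketch of classic DTW template matching; the function names and the Euclidean frame distance are assumptions for the example, not details taken from the dissertation:

```python
import numpy as np

def dtw_distance(template, observation):
    """Align two feature sequences (frames x dims) with dynamic time
    warping and return the cumulative alignment cost."""
    n, m = len(template), len(observation)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between the two frames being aligned.
            d = np.linalg.norm(template[i - 1] - observation[j - 1])
            # Standard DTW recursion: diagonal match, insertion, deletion.
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[n, m]

def best_template(templates, observation):
    """Return the name of the stored word template with the
    smallest DTW cost against the observed sequence."""
    return min(templates, key=lambda name: dtw_distance(templates[name],
                                                        observation))
```

Because DTW allows nonlinear stretching along the time axis, a time-warped rendition of a word can still match its template with low cost, which is what makes template activation from partially masked speech feasible.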
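Missing-data ASR, as described above, relies on a binary mask marking speech-dominant time-frequency units. A common formulation in the CASA literature is the ideal binary mask, which labels a unit as reliable when its local SNR exceeds a criterion; the sketch below uses that standard criterion for illustration (the threshold parameter and names are assumptions, not taken from the dissertation, which estimates the mask from a noisy mixture rather than from known speech and noise):

```python
import numpy as np

def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """Label each time-frequency unit 1 (speech-dominant) when the
    local SNR in dB exceeds the criterion lc_db, else 0."""
    snr_db = 10.0 * np.log10(speech_power / np.maximum(noise_power, 1e-12))
    return (snr_db > lc_db).astype(int)
```

A missing-data recognizer then treats units where the mask is 1 as reliable evidence and marginalizes (or bounds) the likelihood over units where the mask is 0.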
DeLiang Wang (Advisor)
212 p.

Recommended Citation

  • Srinivasan, S. (2006). Integrating computational auditory scene analysis and automatic speech recognition [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1158250036

    APA Style (7th edition)