The ability of human listeners to organize the time-frequency (T-F) energy of a single sound source into one perceptual stream is termed auditory scene analysis (ASA). Computational auditory scene analysis (CASA) seeks to organize sound based on ASA principles. This dissertation presents a systematic study of sequential organization in CASA. The goal of this organization is to group T-F segments that originate from the same speaker, but are separated in time, into a single stream.
This dissertation proposes a speaker-model-based framework for sequential organization, which achieves better grouping performance than feature-based methods. Specifically, a computational objective is derived for sequential grouping in the context of speaker recognition of multi-talker mixtures. This formulation leads to a grouping system that searches for the optimal grouping of separated speech segments. A hypothesis pruning method is then proposed that significantly reduces the search space and search time while achieving performance close to that of an exhaustive search. Evaluations show that the proposed system improves both grouping performance and speech recognition accuracy. The system is then extended to handle multi-talker as well as non-speech intrusions using generic models. It is further extended to deal with noisy inputs from unknown speakers, employing a speaker quantization method that extracts generic models from a large speaker space. The resulting grouping performance is only moderately lower than that obtained with known speaker models.
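The search-and-prune idea above can be illustrated with a minimal sketch. Here the "speaker models" are stand-in one-dimensional Gaussians and the segments are scalar features, which are assumptions for illustration only; the actual system scores T-F segments with trained speaker models. Exhaustive search enumerates all assignments of segments to speakers, while the pruned variant keeps only a fixed number of the best partial hypotheses at each step, in the spirit of beam search:

```python
import itertools
import math

# Hypothetical 1-D Gaussian "speaker models" (mean, variance) standing in
# for real speaker models; each scores a segment feature by log-likelihood.
def gaussian_loglik(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def score_grouping(segments, assignment, models):
    # Total log-likelihood of assigning each segment to one speaker model.
    return sum(gaussian_loglik(seg, *models[spk])
               for seg, spk in zip(segments, assignment))

def exhaustive_grouping(segments, models):
    # Search all |models|^N assignments of N segments to speakers.
    return max(itertools.product(range(len(models)), repeat=len(segments)),
               key=lambda a: score_grouping(segments, a, models))

def pruned_grouping(segments, models, beam=4):
    # Hypothesis pruning: extend partial assignments one segment at a
    # time, keeping only the `beam` highest-scoring hypotheses per step.
    hyps = [((), 0.0)]
    for seg in segments:
        hyps = sorted(
            ((assign + (s,), ll + gaussian_loglik(seg, *models[s]))
             for assign, ll in hyps for s in range(len(models))),
            key=lambda h: -h[1])[:beam]
    return hyps[0][0]
```

With well-separated models, the pruned search recovers the same grouping as the exhaustive one while evaluating far fewer hypotheses, mirroring the trade-off described above.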
In addition, this dissertation presents a systematic study of robust speaker recognition. A novel usable speech extraction method is proposed that significantly improves recognition performance. A general solution is then proposed for speaker recognition under additive-noise conditions. Novel speaker features are derived from auditory filtering and used in conjunction with an uncertainty decoder that accounts for the mismatch introduced by CASA front-end processing. Evaluations show that the proposed system achieves significant performance improvement over the use of typical speaker features and a state-of-the-art robust front-end processor for noisy speech.
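The role of the uncertainty decoder can be sketched in its simplest diagonal-Gaussian form. This is a generic illustration under stated assumptions, not the dissertation's exact formulation: the variance of the front-end enhancement error is added to the model variance, so features that the CASA front end reconstructs less reliably contribute less sharply to the speaker score:

```python
import math

def gaussian_loglik(x, mean, var):
    # Log-likelihood of x under a 1-D Gaussian model.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def uncertain_loglik(x, mean, var, uncertainty_var):
    # Uncertainty decoding (diagonal-Gaussian case): inflate the model
    # variance by the estimated variance of the front-end error, softening
    # the score for unreliable features. `uncertainty_var` is assumed to
    # come from the front end; here it is just a supplied number.
    return gaussian_loglik(x, mean, var + uncertainty_var)
```

A perfectly reliable feature (zero uncertainty) reduces to ordinary decoding, while a highly uncertain feature yields a flatter likelihood that discriminates less between speaker models.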