Carol Espy-Wilson

Funding Agency

National Science Foundation




In many real-world scenarios, speech recognition and speaker identification systems must deal with simultaneous speech from several talkers, i.e., speech mixtures representing conversations in natural environments. Users of cochlear implants also have difficulty separating speakers in multi-speaker environments because of the loss of fine temporal structure. A crucial preprocessing step for such systems is therefore the segregation of speech according to its constituent sources. This project is the first part of that process: recognizing the number of speakers and separating their pitch tracks based on the periodic portions of the speech signal (i.e., voiced regions). Since different speakers have characteristic pitch ranges as a consequence of vocal cord physiology, pitch tracks can be used to help separate the combined signal into different speech streams. Current popular multi-pitch tracking approaches are susceptible to artifacts caused by the interaction between the periodic regions of the different speech signals; as a result, the periodicity of the combined signal can differ from that of the individual components. The major new idea is the extension of an existing periodicity and pitch estimation process to higher dimensions, arriving at a multi-dimensional periodicity function that is not susceptible to the harmonic interaction artifacts. Preliminary results show that the multiple pitch tracks obtained are accurate even when one speaker is considerably more dominant than the other. The approach is easily generalized to non-speech audio and should be robust in noisy channels. The outcome of this project will be used in a future project in which the actual speech streams will be separated from each other based on the multi-pitch information.
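To make the idea of a multi-dimensional periodicity function concrete, the following is a minimal sketch and not the project's actual implementation. It assumes the one-dimensional starting point is an average magnitude difference function (AMDF), which the abstract does not specify, and extends it to two lags by cascading two comb filters. The function names `two_dim_difference` and `estimate_two_periods`, the synthetic test signal, and the integer periods of 50 and 73 samples are all hypothetical choices made for illustration.

```python
import numpy as np


def two_dim_difference(x, max_lag):
    """Mean absolute output of two cascaded comb filters for every lag pair.

    For lags (t1, t2) the filtered signal is
        y[n] = x[n] - x[n - t1] - x[n - t2] + x[n - t1 - t2],
    which vanishes when x is the sum of one component with period t1
    and another with period t2. This is an illustrative assumption,
    not a formula taken from the abstract.
    """
    x = np.asarray(x, dtype=float)
    n0 = 2 * max_lag  # first index with enough history for every lag pair
    d = np.full((max_lag + 1, max_lag + 1), np.inf)  # lag 0 is left unused
    for t1 in range(1, max_lag + 1):
        for t2 in range(1, max_lag + 1):
            y = (x[n0:]
                 - x[n0 - t1:len(x) - t1]
                 - x[n0 - t2:len(x) - t2]
                 + x[n0 - t1 - t2:len(x) - t1 - t2])
            d[t1, t2] = np.mean(np.abs(y))
    return d


def estimate_two_periods(x, max_lag):
    """Return the lag pair (sorted) at the global minimum of the 2-D function."""
    d = two_dim_difference(x, max_lag)
    t1, t2 = np.unravel_index(np.argmin(d), d.shape)
    return tuple(sorted((int(t1), int(t2))))


# Synthetic two-source mixture: three harmonics each, with the second
# source much weaker than the first (hypothetical values).
n = np.arange(800)
source_a = sum(np.cos(2 * np.pi * k * n / 50) for k in (1, 2, 3))
source_b = 0.3 * sum(np.cos(2 * np.pi * k * n / 73) for k in (1, 2, 3))
mixture = source_a + source_b

d = two_dim_difference(mixture, max_lag=80)
periods = estimate_two_periods(mixture, max_lag=80)
print(periods)
```

In this toy setting the minimum of the two-dimensional function falls at the pair of true periods even though one source is considerably weaker, because each comb filter cancels one periodic component regardless of its amplitude. Real speech is only quasi-periodic and noisy, so a practical system would need framing, normalization, and tracking over time, none of which this sketch attempts.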

Extension of the APP detector for multipitch tracking and speaker separation is a two-year, $125K grant.