This is an archive of audio samples from the groundbreaking 2006 work by Kristjansson et al. that won the PASCAL Speech Separation Challenge that year:
T. Kristjansson, J. Hershey, P. Olsen, S. Rennie, R. Gopinath
Super-Human Multi-Talker Speech Recognition: The IBM 2006 Speech Separation Challenge System, Interspeech 2006
The SSC task involves two people speaking at the same time. The goal is to identify the letter and number spoken by the talker who says “white”. Our system first extracts the two component sentences and then sends them to a speech recognition system.
Listen for "white" in the mixed examples and see if you can do it!
Signal | Transcript | |
---|---|---|
Input signal | Try it! (hint: listen for “white”) | |
Separated Target | Bin white in D 1 again | |
Separated Masker | Lay green in I 3 soon | |
Signal | Transcript | |
---|---|---|
Input signal | Try it! (hint: listen for “white”) | |
Separated Target | Set white at H 6 now | |
Separated Masker | Bin blue with R 4 please | |
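The final step of the task, picking the letter and number out of whichever separated sentence contains "white", can be sketched in a few lines. This is only an illustration operating on transcripts that have already been recognized; the six-word sentence layout is an assumption inferred from the examples on this page, and the function name `white_letter_and_number` is hypothetical, not part of the original system.

```python
def white_letter_and_number(transcripts):
    """Return (letter, digit) from the transcript whose colour word is 'white'.

    Assumes each transcript has six words laid out as
    verb, colour, preposition, letter, digit, adverb.
    Returns None if no transcript matches.
    """
    for text in transcripts:
        words = text.split()
        if len(words) == 6 and words[1].lower() == "white":
            return words[3].upper(), int(words[4])
    return None


# Transcripts taken from the first example table above.
separated = ["Bin white in D 1 again", "Lay green in I 3 soon"]
answer = white_letter_and_number(separated)
print(answer)  # ('D', 1)
```

The function only inspects the recognized text, so its answer is only as good as the separation and recognition stages that produced the transcripts.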
In many cases, word boundaries in the two sentences are closely aligned, making the task impossible. In these cases, the system would arbitrarily stitch together sentence fragments, as in the following example, where the endings of the two sentences are swapped.
Signal | Transcript | |
---|---|---|
Input signal | Try it! (hint: listen for “white”) | |
Separated Target | Lay white at N 4 please (should be: Lay white at I 5 again) | |
Separated Masker | Place green in I 5 again (should be: Place green in N 4 please) | |
The following examples show the importance of temporal constraints. Language Model dynamics force the separated signals to contain whole words. Acoustic dynamics only enforce short-time, phoneme-level smoothness.
This example shows that when there are no temporal constraints between consecutive frames, the model can switch the source frame by frame. This is known as the 'permutation problem'. Try listening to the "No dynamics" condition. When language model constraints are included in inference, the permutation problem disappears. Notice the dramatic improvement in the "Language Model dynamics" condition.
Dynamics | Mixed Signal | Separated Target | Separated Masker |
---|---|---|---|
No dynamics | |||
Phoneme dynamics | |||
Language Model dynamics |
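The permutation problem described above can be illustrated with a toy example. The sketch below is not the IBM system: it uses a two-state assignment variable (0 = keep the current source-to-output assignment, 1 = swap it), made-up per-frame log-likelihoods, and a simple switch penalty standing in for temporal dynamics. The names `framewise`, `viterbi`, and `switch_penalty` are hypothetical.

```python
def framewise(loglik):
    """Pick the best state independently for every frame (no dynamics)."""
    return [max(range(len(frame)), key=lambda k: frame[k]) for frame in loglik]


def viterbi(loglik, switch_penalty):
    """Best state sequence when every change of state costs switch_penalty."""
    n_states = len(loglik[0])
    best = list(loglik[0])   # best path score ending in each state
    back = []                # back-pointers, one list per frame after the first
    for frame in loglik[1:]:
        new_best, pointers = [], []
        for k in range(n_states):
            # Score of arriving in state k from each previous state j.
            cands = [best[j] - (switch_penalty if j != k else 0.0)
                     for j in range(n_states)]
            j_star = max(range(n_states), key=lambda j: cands[j])
            pointers.append(j_star)
            new_best.append(cands[j_star] + frame[k])
        best = new_best
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda k: best[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def count_switches(path):
    return sum(1 for a, b in zip(path, path[1:]) if a != b)


# Made-up evidence: state 0 is usually favoured, but a few noisy frames favour state 1.
evidence = [0.8, 0.6, -0.2, 0.7, -0.1, 0.9, 0.5, -0.3, 0.6, 0.4]
loglik = [(e, 0.0) for e in evidence]

no_dynamics = framewise(loglik)
with_dynamics = viterbi(loglik, switch_penalty=1.0)

print(no_dynamics, count_switches(no_dynamics))      # flips on the noisy frames
print(with_dynamics, count_switches(with_dynamics))  # stays in one state
```

With no penalty on switching, the frame-by-frame decision follows every noisy frame, which is the toy analogue of the choppy output in the "No dynamics" condition. A penalty on switching keeps the assignment consistent over time; the word-level constraints of a language model are a much stronger form of the same idea.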
In this condition, recordings of two different male talkers or two female talkers are mixed together. The task is challenging because the voice characteristics are similar. The frame-level model results have quite a bit of choppy leakage. In this case, the phone-level dynamics help reduce and smooth the leakage of the unwanted speaker's voice. However, phone-level dynamics are not nearly as effective as word-level dynamics.
Dynamics | Mixed Signal | Separated Target | Separated Masker |
---|---|---|---|
No dynamics | |||
Phoneme dynamics | |||
Language Model dynamics |
In this condition, recordings of male and female talkers are mixed together. Dynamics help to smooth the result, but the short-time signatures of the voices suffice to separate them.
Dynamics | Mixed Signal | Separated Target | Separated Masker |
---|---|---|---|
No dynamics | |||
Phoneme dynamics | |||
Language Model dynamics |
Our system achieved super-human performance in multiple conditions of the 2006 ICSLP Speech Separation Challenge. DNNs have dramatically improved separation results over the GMM/HMM-based systems; however, as of October 2023, none of the latest DNN-based systems have surpassed the Same Talker results.