Superhuman Source Separation

This is an archive of audio samples from the groundbreaking 2006 work by Kristjansson et al. that won the PASCAL Speech Separation Challenge that year:

T. Kristjansson, J. Hershey, P. Olsen, S. Rennie, R. Gopinath

Super-Human Multi-Talker Speech Recognition: The IBM 2006 Speech Separation Challenge System

Interspeech 2006

Audio Examples

The SSC task involves two people speaking at the same time. The goal is to identify the letter and number spoken by the talker who says “white”. Our system first extracts the two component sentences and then sends them to a speech recognition system.

Listen for "white" in the mixed examples and see if you can do it!
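As a rough illustration of the scoring rule described above, the sketch below picks the sentence containing "white" out of two recognized transcripts and reads off its letter and number. It assumes the six-word layout seen in the examples on this page (command, color, preposition, letter, number, adverb). The function name `target_letter_and_digit` is hypothetical and is not part of the challenge tooling.

```python
def target_letter_and_digit(transcripts, keyword="white"):
    """Return (letter, digit) from the transcript whose color word is `keyword`.

    Assumes the six-word sentence layout seen in the examples on this page:
    command, color, preposition, letter, digit, adverb.
    """
    for text in transcripts:
        words = text.split()
        if len(words) == 6 and words[1].lower() == keyword:
            return words[3].upper(), words[4]
    return None  # neither transcript has the keyword in the color slot


# Transcripts taken from the first example on this page.
separated = ["Bin white in D 1 again", "Lay green in I 3 soon"]
print(target_letter_and_digit(separated))  # ('D', '1')
```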

Same female speaker, target speaker at -6 dB:

Input signal: Try it! (hint: listen for “white”)
Separated Target: Bin white in D 1 again
Separated Masker: Lay green in I 3 soon
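The dB figure in each condition can be read as a level difference between the two talkers. A minimal sketch of how such a mixture might be constructed follows, assuming the figure is the target-to-masker power ratio and that the target is rescaled before the signals are added. The helper names `power` and `mix_at_tmr` are hypothetical, and two synthetic tones stand in for real speech.

```python
import math


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_tmr(target, masker, tmr_db):
    """Scale `target` so that target power / masker power equals `tmr_db`
    in decibels, then add the two signals sample by sample."""
    gain = math.sqrt(power(masker) / power(target) * 10 ** (tmr_db / 10))
    scaled = [gain * x for x in target]
    mixed = [t + m for t, m in zip(scaled, masker)]
    return mixed, scaled


# Two synthetic tones stand in for the two talkers (assumption: 8 kHz, 1 s).
rate = 8000
target = [math.sin(2 * math.pi * 220 * n / rate) for n in range(rate)]
masker = [0.5 * math.sin(2 * math.pi * 330 * n / rate) for n in range(rate)]

mixed, scaled = mix_at_tmr(target, masker, tmr_db=-6.0)
achieved = 10 * math.log10(power(scaled) / power(masker))
print(round(achieved, 2))
```

At -6 dB the target carries roughly a quarter of the masker's power, which is part of why the first example is hard to follow by ear.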

Same male speaker, target speaker at 0 dB:

Input signal: Try it! (hint: listen for “white”)
Separated Target: Set white at H 6 now
Separated Masker: Bin blue with R 4 please

Same female speaker, target speaker at 0 dB:

In many cases, word boundaries in the two sentences are closely aligned, making the task impossible. In these cases, the system may arbitrarily stitch together sentence fragments from the two talkers, as in the example below, where the endings of the two sentences are swapped.

Input signal: Try it! (hint: listen for “white”)
Separated Target: Lay white at N 4 please (should be: Lay white at I 5 again)
Separated Masker: Place green in I 5 again (should be: Place green in N 4 please)

Solving the Permutation Problem with Language Model Dynamics

The following examples show the importance of temporal constraints. Language Model dynamics force the separated signals to contain whole words, whereas acoustic dynamics enforce only short-time, phoneme-level smoothness.

Same Talker Condition (Target at -6 dB)

This example shows that when there are no temporal constraints between consecutive frames, the model can switch sources from frame to frame. This is known as the 'Permutation Problem'; try listening to the "No dynamics" condition to hear it. When language model constraints are included in inference, the permutation problem disappears. Notice the dramatic improvement in the "Language Model dynamics" condition.

Dynamics Mixed Signal Separated Target Separated Masker
No dynamics
Phoneme dynamics
Language Model dynamics
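The effect of temporal constraints on the permutation problem can be illustrated with a toy sketch. It is not the inference used in our system: each frame has just two possible source-to-output assignments, the per-frame scores are invented, and a single hypothetical `switch_penalty` stands in for the phoneme-level or language-model dynamics. The sketch compares independent per-frame decisions with a Viterbi search over the same scores.

```python
def framewise(scores):
    """Pick the best assignment for each frame independently (no dynamics)."""
    return [0 if s0 >= s1 else 1 for s0, s1 in scores]


def viterbi(scores, switch_penalty):
    """Best assignment path when each change of assignment between
    consecutive frames costs `switch_penalty`."""
    best = list(scores[0])  # best path score ending in each assignment
    back = []               # back[t][j]: best predecessor of assignment j
    for frame in scores[1:]:
        prev, step, best = best, [], []
        for j in (0, 1):
            cand = [prev[i] - (switch_penalty if i != j else 0.0) for i in (0, 1)]
            i_best = 0 if cand[0] >= cand[1] else 1
            step.append(i_best)
            best.append(cand[i_best] + frame[j])
        back.append(step)
    state = 0 if best[0] >= best[1] else 1
    path = [state]
    for step in reversed(back):  # trace the best path back to the first frame
        state = step[state]
        path.append(state)
    return path[::-1]


# Invented log-scores per frame for (assignment 0, assignment 1).
# Assignment 0 is the consistent one; two ambiguous frames slightly favor 1.
scores = [(-1.0, -3.0), (-1.2, -1.0), (-0.8, -2.5),
          (-1.5, -1.3), (-1.0, -2.8), (-0.9, -3.1)]

print(framewise(scores))     # [0, 1, 0, 1, 0, 0]
print(viterbi(scores, 2.0))  # [0, 0, 0, 0, 0, 0]
```

Without a penalty, the two ambiguous frames flip the assignment back and forth. With the penalty, the search prefers one consistent assignment across all frames.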

Same Gender Condition - Phone Level Dynamics Help

In this condition, recordings of two different male talkers or two different female talkers are mixed together. The task is challenging because the voice characteristics are similar. The frame-level model results have quite a bit of choppy leakage. In this case, the phone-level dynamics help reduce and smooth the leakage of the unwanted speaker's voice. However, phone-level dynamics are not nearly as effective as word-level dynamics.

Dynamics Mixed Signal Separated Target Separated Masker
No dynamics
Phoneme dynamics
Language Model dynamics

Different Gender Condition

In this condition, recordings of male and female talkers are mixed together. Dynamics help to smooth the result, but the short-time signatures of the voices suffice to separate them.

Dynamics Mixed Signal Separated Target Separated Masker
No dynamics
Phoneme dynamics
Language Model dynamics

Superhuman Results

Our system achieved superhuman performance in multiple conditions of the 2006 ICSLP Speech Separation Challenge. DNNs have dramatically improved separation results over the GMM/HMM-based systems. However, as of October 2023, none of the latest DNN-based systems has surpassed the Same Talker results.