Discrimination of speech from non-speech based on multiscale spectro-temporal modulations
Source:
IEEE Transactions on Audio, Speech and Language Processing, Volume 14, Issue 6, p.920–930 (2006)
URL:
http://cobweb.ecn.purdue.edu/~malcolm/yahoo/Mesgarani2006(MultiscaleModulationsTASLP).pdf
Abstract:
We describe a content-based audio classification
algorithm based on novel multiscale spectro-temporal modulation
features inspired by a model of auditory cortical processing. The
task explored is to discriminate speech from nonspeech consisting
of animal vocalizations, music, and environmental sounds. Although
this is a relatively easy task for humans, it is still difficult to
automate well, especially in noisy and reverberant environments.
The auditory model captures basic processes occurring from the
early cochlear stages to the central cortical areas. The model
generates a multidimensional spectro-temporal representation of
the sound, which is then analyzed by a multilinear dimensionality
reduction technique and classified by a support vector machine
(SVM). Generalization of the system to signals with high levels of
additive noise and reverberation is evaluated and compared to two
existing approaches (Scheirer and Slaney, 2002 and Kingsbury et
al., 2002). The results demonstrate the advantages of the auditory
model over the other two systems, especially at low signal-to-noise
ratios (SNRs) and high reverberation.
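
The multilinear dimensionality reduction step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact technique, so the code below assumes an HOSVD-style reduction (one truncated SVD per tensor mode), which is one common choice and not necessarily the authors' method. The function names (unfold, fit_mode_bases, project), the tensor shape, and the retained ranks are all hypothetical, and no mean removal is applied.

```python
import numpy as np

def unfold(tensor, mode):
    """Matricize a tensor: rows index `mode`, columns index all other axes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def fit_mode_bases(data, ranks):
    """Learn one orthonormal basis per feature mode.

    data  : array of shape (n_samples, d1, d2, d3); axis 0 is the sample axis.
    ranks : number of components to keep for each of the three feature modes.
    """
    bases = []
    for mode, rank in enumerate(ranks, start=1):
        # Left singular vectors of the mode unfolding, ordered by singular value.
        u, _, _ = np.linalg.svd(unfold(data, mode), full_matrices=False)
        bases.append(u[:, :rank])
    return bases

def project(data, bases):
    """Project every sample onto the retained subspaces and flatten to vectors."""
    core = data
    for mode, basis in enumerate(bases, start=1):
        # Contract axis `mode` with the basis; tensordot appends the new axis
        # at the end, so move it back to position `mode`.
        core = np.moveaxis(np.tensordot(core, basis, axes=(mode, 0)), -1, mode)
    return core.reshape(data.shape[0], -1)

# Toy data standing in for per-sound spectro-temporal tensors (assumed shape).
rng = np.random.default_rng(0)
data = rng.standard_normal((10, 6, 5, 4))
bases = fit_mode_bases(data, ranks=(3, 2, 2))
features = project(data, bases)  # one 3*2*2 = 12-dimensional vector per sound
```

The flattened vectors in features are what a classifier such as an SVM would then be trained on; the classifier itself is left out of this sketch.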