Thursday, October 21, 2010

Are segmental durations normally distributed?

What's the best way to normalize duration to account for speaker and speaking rate differences?

If segmental durations are normally distributed, z-score normalization has some nice properties.  It's easy to estimate from a relatively small set of data.  And since the z-score normalization parameters are simply MLE univariate Gaussian parameters fit to the feature (here, duration), you can adapt them to a new speaker, or to a new utterance, using maximum a posteriori (MAP) adaptation -- useful if there is reason to believe the speaking rate might have changed.
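As a concrete illustration, here is a minimal sketch of that idea, assuming a NumPy array `durations` of segment durations from a reference speaker and a smaller array `utt_durations` from a new utterance; the function names and the relevance factor `r` are my own choices, not from any particular toolkit.

```python
import numpy as np

def fit_gaussian(durations):
    """MLE univariate Gaussian fit: just the sample mean and standard deviation."""
    x = np.asarray(durations, dtype=float)
    return float(x.mean()), float(x.std())

def zscore(durations, mu, sigma):
    """z-score normalization with previously fitted parameters."""
    return (np.asarray(durations, dtype=float) - mu) / sigma

def map_adapt_mean(new_durations, prior_mu, r=16.0):
    """Relevance-MAP update of the mean toward new data.

    With n new observations and relevance factor r (an arbitrary choice here),
    the adapted mean is a convex combination of the prior mean and the new
    sample mean; small n keeps it close to the prior, large n moves it over.
    """
    x = np.asarray(new_durations, dtype=float)
    alpha = len(x) / (len(x) + r)
    return alpha * float(x.mean()) + (1.0 - alpha) * prior_mu

# Hypothetical usage (durations in seconds):
# mu, sigma = fit_gaussian(durations)
# mu_utt = map_adapt_mean(utt_durations, mu)           # adapt to the new utterance
# normalized = zscore(utt_durations, mu_utt, sigma)    # z-score with adapted mean
```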

I've used z-score normalization of word duration in the past, acknowledging that it's poorly motivated.  There's no expectation that word durations should be normally distributed -- phone counts of words are not normally distributed, so why should their durations be?  In fact, phone counts may be more log-normal than normal. Brief tangent: here is the distribution of phone counts per word.

[Figure: histogram of phone counts per word]
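To make that tangent concrete, here is a minimal sketch (mine, not part of any corpus tooling) that compares MLE normal and log-normal fits to per-word phone counts by total log-likelihood; `phone_counts` is an assumed array of counts.

```python
import numpy as np
from scipy import stats

def compare_normal_lognormal(phone_counts):
    """Return (normal, log-normal) total log-likelihoods; higher is a better fit."""
    x = np.asarray(phone_counts, dtype=float)
    mu, sigma = stats.norm.fit(x)
    ll_norm = stats.norm.logpdf(x, mu, sigma).sum()
    # Pin the location at zero so the log-normal is fit to the raw counts.
    shape, loc, scale = stats.lognorm.fit(x, floc=0)
    ll_lognorm = stats.lognorm.logpdf(x, shape, loc, scale).sum()
    return float(ll_norm), float(ll_lognorm)

# Hypothetical usage, given an array of per-word phone counts:
# ll_n, ll_ln = compare_normal_lognormal(phone_counts)
```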
So I took a speaker (f2) from the Boston University Radio News Corpus and looked at the distributions of word, syllable, vowel, and phone durations to see if any look more or less Gaussian.


The distribution of word durations is decidedly non-Gaussian.  We can see evidence of bimodality, likely coming from the abundance of monosyllabic words in the data (~59% of words in this material).  Also, comparing the histograms of phone counts and word durations, the relationship between the two is fairly clear.

Syllable durations don't look too bad.  There's a bit of skew to the right, and the model overestimates the longer durations, but this isn't terrible.  There is a strange spikiness to the distribution, but I blame this more on an interaction between the resolution of the duration information (in units of 10ms) and the histogram binning than on an underlying phenomenon.  If someone were to take a look at this data and decide to go ahead and use z-score normalization on syllable durations, it wouldn't strike me as a terrible way to go.
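For what it's worth, here is the kind of quick, informal check I have in mind, a minimal sketch assuming `syllable_durs` is an array of syllable durations in seconds (the name is mine).

```python
import numpy as np
from scipy import stats

def gaussianity_report(durations):
    """Informal look at how Gaussian a duration sample is."""
    x = np.asarray(durations, dtype=float)
    stat, pval = stats.normaltest(x)      # D'Agostino-Pearson omnibus test
    return {
        "skew": float(stats.skew(x)),                 # 0 for a Gaussian
        "excess_kurtosis": float(stats.kurtosis(x)),  # 0 for a Gaussian
        "normaltest_p": float(pval),
    }

# Hypothetical usage:
# print(gaussianity_report(syllable_durs))
```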


There has been a good amount of discussion (and some writing, cf. C. Wightman, et al., "Segmental durations in the vicinity of prosodic phrase boundaries," JASA, 1992) about the inherent duration of phones, and about how phone identity is an important conditioning variable when examining (and normalizing) phone duration.  I haven't reexamined that here, but suffice it to say, the distribution of phone durations isn't well modeled by a Gaussian, or likely by any other simple parameterized distribution.  In all likelihood phone identity is an important factor to consider here, but are all phone identities equally important, or can we just look at vowel vs. consonant or other articulatory clusterings -- frontness, openness, fricatives, etc.?  I'll have to come back to that.
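If you did want to condition on phone identity, the obvious baseline is per-phone z-score normalization in the spirit of the Wightman et al. work.  Here is a minimal sketch under that assumption; the `(phone_label, duration)` pair representation is hypothetical, not a format from the corpus.

```python
from collections import defaultdict
import numpy as np

def fit_per_phone_params(segments):
    """segments: iterable of (phone_label, duration) pairs.
    Returns {phone_label: (mean, std)}, an MLE Gaussian per phone identity."""
    by_phone = defaultdict(list)
    for phone, dur in segments:
        by_phone[phone].append(float(dur))
    return {p: (float(np.mean(d)), float(np.std(d))) for p, d in by_phone.items()}

def normalize_duration(phone, dur, params):
    """Phone-conditional z-score; falls back to the raw duration for unseen phones."""
    if phone not in params:
        return dur
    mu, sigma = params[phone]
    return (dur - mu) / sigma if sigma > 0 else 0.0

# Hypothetical usage:
# params = fit_per_phone_params(train_segments)
# z = normalize_duration("AA", 0.12, params)
```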
But... if we look closer at the distribution of vowel durations, while it isn't all that Gaussian, it looks like a pretty decent fit to a half-normal distribution.  I don't see any obvious multimodality that would suggest a major effect of another conditional variable, such as phone identity or lexical stress, though it's possible that the dip at around the 20th percentile is evidence of this.  Or it could just be noise.
To properly use a half-normal distribution, you would have to discount the probability mass at zero.  Working with vowels also has a practical advantage: vowel onsets and offsets are much easier to detect acoustically than consonant boundaries, syllable boundaries, or word boundaries.

So this might be my new normalization scheme for segmental durations: Gaussian MLE z-score normalization for syllables if I have them (from either automatic or manual transcription), and half-normal MLE z-score normalization for vowels that are acoustically approximated (or when the ASR output is too poor to be trusted).
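Here is a minimal sketch of the half-normal half of that scheme, assuming `vowel_durs` is an array of non-negative vowel durations; the scipy call in the comment is just a sanity check on the closed-form MLE, not part of an existing pipeline.

```python
import numpy as np
from scipy import stats

def fit_halfnormal_scale(durations):
    """MLE scale for a half-normal anchored at zero: sqrt(mean(x^2))."""
    x = np.asarray(durations, dtype=float)
    return float(np.sqrt(np.mean(x ** 2)))

def halfnormal_normalize(durations, scale):
    """Divide by the fitted scale: a one-sided analogue of z-scoring,
    since the half-normal has no free mean to subtract."""
    return np.asarray(durations, dtype=float) / scale

# Hypothetical usage:
# sigma = fit_halfnormal_scale(vowel_durs)
# normalized = halfnormal_normalize(vowel_durs, sigma)
# Sanity check: stats.halfnorm.fit(vowel_durs, floc=0)[1] should closely match sigma.
```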

Tuesday, October 05, 2010

Interspeech 2010 Recap

Interspeech was in Makuhari, Japan last week.  Makuhari is about 40 minutes from Tokyo, and I'd say totally worth the commute.  The conference center was large and clean, and (after the first day) had functional wireless, but Makuhari offers a lot less than Tokyo does.

Interspeech is probably the speech conference with the broadest scope and largest draw.  This makes it a great place to learn what is going on in the field.

One of the things that was most striking about the work at Interspeech 2010 was the lack of a Hot Topic.  Acoustic modeling for automatic speech recognition is a mainstay of any speech conference, and it was there in spades.  There was some nice work on prosody analysis.  Recognition of age, affect, and gender was highlighted in the INTERSPEECH 2010 Paralinguistics Challenge, but outside the special session focusing on this, there wasn't an exceptional amount of work on these questions.  Despite the lack of a major new theme this year, there was some very high quality, interesting work.

Here is some of the work that I found particularly compelling.
  • Married Couples' speech
    Shri Narayanan's group, with other collaborators from USC and UCLA, has collected a set of married couples' dialog speech recorded during couples therapy, so this is already compelling data to look at.  You've got naturally occurring emotional speech, which is a rare occurrence, and it's emotion in dialog.  They had (at least) two papers on this data at the conference, one looking at prosodic entrainment during these dialogs, and the other classifying qualities like blame, acceptance, and humor in either spouse.  Both are very compelling first looks at this data.  There are obviously some serious privacy issues with sharing this data, but hopefully it will be possible eventually.

    Automatic Classification of Married Couples’ Behavior using Audio Features Matthew Black, et al.

    Quantification of Prosodic Entrainment in Affective Spontaneous Spoken Interactions of Married Couples
    Chi-Chun Lee, et al.
  • Ferret auditory cortex data for phone recognition
    Hynek Hermansky and colleagues have done a lot of compelling work on phone recognition.  To my eye, a lot of it has been banging away at techniques other than MFCC representations for speech recognition.  Some of them work better than others, obviously, but it's great to see that kind of scientific creativity applied to a core task for speech recognition.  This time the idea was to take "spectro-temporal receptive fields" empirically observed from ferrets that have been trained to be accurate phone recognizers, and use these responses to train a phone classifier.  Yeah, that's right: they used ferret neural activity to try to recognize human speech.  Way out there.  If that weren't compelling enough, the results are good!
  • Prosodic Timing Analysis for Articulatory Re-Synthesis Using a Bank of Resonators with an Adaptive Oscillator Michael C. Brady
    A pet project of mine has been to find a nice way to process rhythm in speech for prosodic analysis.  Most people use some statistic based on the intervocalic intervals, but this is unsatisfying.  While it captures the degree of stability of the speaking rate, it doesn't tell you anything about which syllables are evenly spaced and which are anomalous.  This paper uses an adaptive oscillator to find the frequency that best describes the speech data (a simplified sketch of that idea follows this list).  One of the nicest results (one that Michael didn't necessarily highlight) was that the deaccented words in his example utterance were not "on the beat".  In the near term I'm planning on replicating this approach for analyzing phrasing, on the idea that, in addition to other acoustic resets, prosodic timing resets at phrase boundaries.  A very cool approach.
  • Compressive Sensing
    There was a special session on compressive sensing that was populated mostly by IBM speech team folks.  I hadn't heard of compressive sensing before this conference, and it's always nice to learn a new technique.  At its core, compressive sensing (as applied here) is an exemplar-based learning algorithm.  Where it gets clever is this: k-means uses a fixed number, k, of exemplars with equal weight, and SVMs use a fixed set of support vectors to make decisions, but in compressive sensing a dynamic set of exemplars is used to classify each data point.  The set of candidate exemplars (possibly the whole training data set) is weighted with some L1-ish regularization that drives most of the weights to zero, selecting a subset of all candidates for classification.  Then a weighted k-means is performed using the selected exemplars and weights (a rough sketch of this exemplar-weighting idea also follows this list).  The dynamic selection and weighting of exemplars outperforms vanilla SVMs, but the process is fairly computationally expensive.
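On the rhythm point above: here is a rough sketch of a much-simplified stand-in for the adaptive oscillator idea, not Brady's actual method.  It grid-searches for the tempo whose phase best explains a set of syllable (or vowel) onset times and then flags onsets that land far from the beat; the onset array, the frequency range, and the tolerance are all assumptions of mine.

```python
import numpy as np

def best_tempo(onsets, freqs=np.linspace(1.0, 8.0, 400)):
    """Pick the beat frequency (Hz) with the highest phase coherence over the onsets.

    Coherence is |mean(exp(2*pi*i*f*t))|: 1.0 when the onsets are perfectly
    periodic at frequency f, near 0 when they are unrelated to it.
    """
    t = np.asarray(onsets, dtype=float)
    coherence = np.abs(np.exp(2j * np.pi * np.outer(freqs, t)).mean(axis=1))
    return float(freqs[np.argmax(coherence)]), float(coherence.max())

def off_beat(onsets, freq, tol=0.25):
    """Flag onsets whose phase deviates from the estimated beat by more than
    tol cycles (0 = on the beat, 0.5 = maximally off)."""
    t = np.asarray(onsets, dtype=float)
    z = np.exp(2j * np.pi * freq * t)
    beat_phase = np.angle(z.mean()) / (2.0 * np.pi)    # circular-mean phase of the beat
    dev = ((freq * t - beat_phase + 0.5) % 1.0) - 0.5  # wrapped deviation in cycles
    return np.abs(dev) > tol

# Hypothetical usage, with syllable onsets in seconds:
# f, score = best_tempo(onsets)
# print(f, off_beat(onsets, f))
```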
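And on the compressive sensing session: here is a minimal sketch of the sparse exemplar-weighting idea as I understood it, not any particular paper's system.  It reconstructs a test point from the training exemplars under an L1 penalty (via plain iterative soft-thresholding) and then takes a weighted vote over the classes of the surviving exemplars; the lambda, step size, and iteration count are arbitrary choices of mine.

```python
import numpy as np

def sparse_weights(X_train, x, lam=0.1, n_iter=500):
    """Solve min_w 0.5*||x - D w||^2 + lam*||w||_1 with D's columns the exemplars,
    using ISTA (iterative soft-thresholding). Most weights end up exactly zero,
    so a small, dynamically chosen set of exemplars survives for this test point."""
    D = np.asarray(X_train, dtype=float).T
    x = np.asarray(x, dtype=float)
    w = np.zeros(D.shape[1])
    step = 1.0 / (np.linalg.norm(D, 2) ** 2 + 1e-12)   # 1 / Lipschitz constant
    for _ in range(n_iter):
        w = w - step * (D.T @ (D @ w - x))                         # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)   # soft threshold
    return w

def classify(X_train, y_train, x, lam=0.1):
    """Weighted vote over the labels of the selected exemplars."""
    w = sparse_weights(X_train, x, lam=lam)
    y = np.asarray(y_train)
    labels = np.unique(y)
    scores = [np.abs(w[y == c]).sum() for c in labels]
    return labels[int(np.argmax(scores))]

# Hypothetical usage:
# X_train: (n_exemplars, n_features), y_train: (n_exemplars,), x: (n_features,)
# print(classify(X_train, y_train, x))
```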
Interspeech 2011 is in Florence, and I can't wait -- now I've just got to get banging on some papers for it.