Corpus Phonetics Tutorial

Eleanor Chodroff

Intro      Penn Forced Aligner      AutoVOT      Kaldi      Other Resources

Prerequisites      Familiarization      Training Acoustic Models Conceptually      Training Acoustic Models      Forced Alignment


What is it? Kaldi is a state-of-the-art automatic speech recognition toolkit, containing almost any algorithm related to ASR. It also contains recipes for training your own acoustic models on commonly used speech corpora such as the Wall Street Journal Corpus, TIMIT, and more. These recipes can also serve as a template for training acoustic models on your own speech data.

What are acoustic models? Acoustic models are the statistical representations of a phoneme's acoustic information. A phoneme here represents a member of the set of speech sounds in a language. N.B., this use of the term 'phoneme' only loosely corresponds to the linguistic use of the term 'phoneme'.

The acoustic models are created by training the models on acoustic features from labeled data, such as the Wall Street Journal Corpus, TIMIT, or any other transcribed speech corpus. There are many different ways these can be trained, and the tutorial will try to cover some of the more standard methods. Acoustic models are necessary not only for automatic speech recognition, but also for forced alignment. If you open up the Penn Forced Aligner, you can find those acoustic models in the ‘models’ directory. They were trained on the SCOTUS corpus using the Hidden Markov Model Toolkit (HTK).

Kaldi provides tremendous flexibility and power in training your own acoustic models and forced alignment system. The following tutorial covers a general recipe for training on your own data. This part of the tutorial assumes more familiarity with the terminal; you will also be much better off if you can program basic text manipulations.

Please also refer to the Kaldi website for thorough documentation: