Creating an n-gram language model
At its most basic, a language model is a set of probabilities that particular words occur in a particular sequence. The bigram and trigram language models created below estimate the probability that a word occurs given the preceding word (n = 2) or the two preceding words (n = 3).
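To make the definition concrete, here is a minimal Python sketch that estimates these conditional probabilities from raw counts. The function name `train_ngram` and the toy sentence are made up for illustration. This is plain relative-frequency estimation, not the procedure HTK follows; language modelling toolkits generally add discounting and back-off so that unseen word sequences do not receive zero probability.

```python
from collections import Counter


def train_ngram(tokens, n):
    """Estimate P(word | previous n-1 words) by relative frequency (no smoothing)."""
    starts = range(len(tokens) - n + 1)
    # Count every n-gram and the (n-1)-word context it begins with.
    ngrams = Counter(tuple(tokens[i:i + n]) for i in starts)
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in starts)
    return {gram: count / contexts[gram[:-1]] for gram, count in ngrams.items()}


# Toy corpus, invented for this example.
tokens = "the dog saw the cat and the dog ran".split()

bigram = train_ngram(tokens, 2)   # n = 2: condition on one preceding word
trigram = train_ngram(tokens, 3)  # n = 3: condition on two preceding words

print(bigram[("the", "dog")])          # P(dog | the)
print(trigram[("the", "dog", "saw")])  # P(saw | the dog)
```

In the toy corpus "the" is followed by "dog" twice and "cat" once, so the bigram estimate for "dog" after "the" is 2/3; "the dog" is followed once each by "saw" and "ran", so the trigram estimate for "saw" is 1/2.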
The script below automates the creation of an n-gram language model by following HTK's LM Tutorial (LMTutorial). The methods in the code are based largely on the model-creation steps in that tutorial, which is why the name 'holmes' is retained; the name does not affect performance.
Requirements
Transcriptions of the training files, properly formatted (see lm.sh)
Testing file (optional)
HTK installed with proper paths
Alternatives (not explored)
The CMU Statistical Language Modeling Toolkit
The SRI Language Modeling Toolkit