===Hidden Markov models===
{{Main|Hidden Markov model}}
Modern general-purpose speech recognition systems are based on hidden Markov models. These are statistical models that output a sequence of symbols or quantities. HMMs are used in speech recognition because a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal. On a short time scale (e.g., 10 milliseconds), speech can be approximated as a [[stationary process]]. Speech can be thought of as a [[Markov model]] for many stochastic purposes. Another reason HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use.

In speech recognition, the hidden Markov model would output a sequence of ''n''-dimensional real-valued vectors (with ''n'' being a small integer, such as 10), one every 10 milliseconds. The vectors would consist of [[cepstrum|cepstral]] coefficients, which are obtained by taking a [[Fourier transform]] of a short time window of speech, decorrelating the spectrum using a [[cosine transform]], and then taking the first (most significant) coefficients. The hidden Markov model will tend to have, in each state, a statistical distribution that is a mixture of diagonal-covariance Gaussians, which gives a likelihood for each observed vector. Each word, or (for more general speech recognition systems) each [[phoneme]], will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individually trained hidden Markov models for the separate words and phonemes.

Described above are the core elements of the most common, HMM-based approach to speech recognition. Modern speech recognition systems use various combinations of a number of standard techniques to improve results over this basic approach.
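The cepstral pipeline described above (windowed Fourier transform, logarithm, cosine transform, keep the first coefficients) can be sketched as follows. This is a toy illustration, not a production feature extractor: real recognizers insert a mel filterbank before the logarithm and apply liftering, both omitted here, and the function name and frame size are illustrative assumptions.

```python
import numpy as np

def cepstral_coefficients(frame, n_coeffs=10):
    """Toy cepstral analysis of one short-time speech frame.

    Simplified sketch: a real MFCC front end would insert a mel
    filterbank between the spectrum and the logarithm.
    """
    windowed = frame * np.hamming(len(frame))     # taper the frame edges
    spectrum = np.abs(np.fft.rfft(windowed))      # short-time Fourier magnitude
    log_spec = np.log(spectrum + 1e-10)           # compress dynamic range
    # Decorrelate with a DCT-II (the "cosine transform" in the text)
    # and keep only the first (most significant) coefficients.
    k = np.arange(len(log_spec))
    return np.array([
        np.sum(log_spec * np.cos(np.pi * i * (2 * k + 1) / (2 * len(log_spec))))
        for i in range(n_coeffs)
    ])

# One 10 ms frame at 16 kHz = 160 samples (here a synthetic 440 Hz tone)
frame = np.sin(2 * np.pi * 440 * np.arange(160) / 16000)
vec = cepstral_coefficients(frame)
print(vec.shape)  # (10,)
```

Each such 10-dimensional vector is what an HMM state's Gaussian mixture would then score.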
A typical large-vocabulary system would need [[context dependency]] for the [[phoneme]]s (so that phonemes with different left and right context have different realizations as HMM states); it would use [[cepstral normalization]] to normalize for different speakers and recording conditions; for further speaker normalization, it might use vocal tract length normalization (VTLN) for male-female normalization and [[maximum likelihood linear regression]] (MLLR) for more general speaker adaptation. The features would have so-called [[delta coefficient|delta]] and [[delta-delta coefficient]]s to capture speech dynamics and, in addition, might use [[heteroscedastic linear discriminant analysis]] (HLDA); or might skip the delta and delta-delta coefficients and use [[splicing (speech recognition)|splicing]] and an [[Linear Discriminant Analysis|LDA]]-based projection, followed perhaps by [[heteroscedastic]] linear discriminant analysis or a [[global semi-tied co variance|global semi-tied covariance]] transform (also known as [[maximum likelihood linear transform]], or MLLT).

Many systems use so-called discriminative training techniques that dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are maximum [[mutual information]] (MMI), minimum classification error (MCE), and minimum phone error (MPE).

Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the [[Viterbi algorithm]] to find the best path, and here there is a choice between dynamically creating a combination hidden Markov model, which includes both the acoustic and language model information, and combining it statically beforehand (the [[finite state transducer]], or FST, approach).
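The Viterbi search mentioned above can be sketched in a few lines. This is a minimal log-space implementation over a tiny state space, with illustrative names and toy probabilities; a production decoder would add beam pruning and compose the acoustic model with the language model, as discussed in the text.

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Most likely HMM state path (minimal log-space sketch).

    log_init : (S,)   log initial-state probabilities
    log_trans: (S, S) log transition probabilities
    log_obs  : (T, S) per-frame log observation likelihoods
    """
    T, S = log_obs.shape
    score = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans            # score of each (prev, next) pair
        back[t] = np.argmax(cand, axis=0)            # best predecessor per state
        score = cand[back[t], np.arange(S)] + log_obs[t]
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):                    # trace the best path backwards
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example: observations clearly favor state 0, then state 1
log_init = np.log([0.5, 0.5])
log_trans = np.log([[0.8, 0.2], [0.2, 0.8]])
log_obs = np.log([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
print(viterbi(log_init, log_trans, log_obs))  # [0, 0, 1, 1]
```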
A possible improvement to decoding is to keep a set of good candidates instead of just the single best candidate, and to use a better scoring function ([[re scoring (ASR)|rescoring]]) to rate these candidates so that the best one can be picked according to this refined score. The set of candidates can be kept either as a list (the [[N-best list]] approach) or as a subset of the models (a [[lattice (order)|lattice]]). Rescoring is usually done by trying to minimize the [[Bayes risk]]<ref>{{Cite journal |last1=Goel |first1=Vaibhava |last2=Byrne |first2=William J. |year=2000 |title=Minimum Bayes-risk automatic speech recognition |url=http://www.clsp.jhu.edu/people/vgoel/publications/CSAL.ps |url-status=live |journal=Computer Speech & Language |volume=14 |issue=2 |pages=115–135 |doi=10.1006/csla.2000.0138 |s2cid=206561058 |archive-url=https://web.archive.org/web/20110725225846/http://www.clsp.jhu.edu/people/vgoel/publications/CSAL.ps |archive-date=25 July 2011 |access-date=28 March 2011 |doi-access=free |df=dmy-all}}</ref> (or an approximation thereof). Instead of taking the source sentence with maximal probability, we take the sentence that minimizes the expected value of a given loss function over all possible transcriptions (i.e., the sentence that minimizes the average distance to other possible sentences, weighted by their estimated probability). The loss function is usually the [[Levenshtein distance]], though it can be a different distance for specific tasks; the set of possible transcriptions is, of course, pruned to maintain tractability. Efficient algorithms have been devised to rescore [[lattice (order)|lattices]] represented as weighted [[finite state transducers]], with the [[edit distance]] itself represented as a [[finite state transducer]] satisfying certain assumptions.<ref>{{Cite journal |last=Mohri |first=M. |year=2002 |title=Edit-Distance of Weighted Automata: General Definitions and Algorithms |url=http://www.cs.nyu.edu/~mohri/pub/edit.pdf |url-status=live |journal=International Journal of Foundations of Computer Science |volume=14 |issue=6 |pages=957–982 |doi=10.1142/S0129054103002114 |archive-url=https://web.archive.org/web/20120318032640/http://www.cs.nyu.edu/~mohri/pub/edit.pdf |archive-date=18 March 2012 |access-date=28 March 2011 |df=dmy-all}}</ref>
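Minimum-Bayes-risk rescoring over an N-best list can be sketched as below. This is a toy illustration under stated assumptions: hypotheses arrive with normalized posterior probabilities, the loss is word-level Levenshtein distance, and the candidate set is a short list rather than the lattices real systems use; the function names are hypothetical.

```python
def levenshtein(a, b):
    """Word-level edit distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (wa != wb)))  # substitution
        prev = cur
    return prev[-1]

def mbr_rescore(nbest):
    """Pick the hypothesis minimizing expected Levenshtein distance
    (the Bayes risk) over (sentence, probability) pairs."""
    def risk(hyp):
        return sum(p * levenshtein(hyp.split(), s.split()) for s, p in nbest)
    return min((s for s, _ in nbest), key=risk)

nbest = [("x y z", 0.30), ("a b c", 0.25), ("a b d", 0.25), ("a b e", 0.20)]
print(mbr_rescore(nbest))  # "a b c"
```

Note how the minimum-risk choice differs from the maximum-probability one: "x y z" is the single most probable hypothesis, but the probability mass concentrated on the similar "a b *" strings makes "a b c" the sentence with the smallest average distance to the alternatives.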