===Neural networks===
{{Main|Artificial neural network}}
Neural networks emerged as an attractive acoustic modeling approach in ASR in the late 1980s. Since then, neural networks have been used in many aspects of speech recognition such as phoneme classification,<ref>{{Cite journal |last1=Waibel |first1=A. |last2=Hanazawa |first2=T. |last3=Hinton |first3=G. |last4=Shikano |first4=K. |last5=Lang |first5=K. J. |year=1989 |title=Phoneme recognition using time-delay neural networks |journal=IEEE Transactions on Acoustics, Speech, and Signal Processing |volume=37 |issue=3 |pages=328–339 |doi=10.1109/29.21701 |s2cid=9563026 |hdl-access=free |hdl=10338.dmlcz/135496}}</ref> phoneme classification through multi-objective evolutionary algorithms,<ref name="Bird Wanner Ekárt Faria 2020 p=113402">{{Cite journal |last1=Bird |first1=Jordan J. |last2=Wanner |first2=Elizabeth |last3=Ekárt |first3=Anikó |last4=Faria |first4=Diego R. |year=2020 |title=Optimisation of phonetic aware speech recognition through multi-objective evolutionary algorithms |url=https://publications.aston.ac.uk/id/eprint/41416/1/Speech_Recog_ESWA_2_.pdf |journal=Expert Systems with Applications |publisher=Elsevier BV |volume=153 |page=113402 |doi=10.1016/j.eswa.2020.113402 |issn=0957-4174 |s2cid=216472225 |access-date=9 September 2024 |archive-date=9 September 2024 |archive-url=https://web.archive.org/web/20240909053419/https://publications.aston.ac.uk/id/eprint/41416/1/Speech_Recog_ESWA_2_.pdf |url-status=live }}</ref> isolated word recognition,<ref>{{Cite journal |last1=Wu |first1=J. |last2=Chan |first2=C. |year=1993 |title=Isolated Word Recognition by Neural Network Models with Cross-Correlation Coefficients for Speech Dynamics |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=15 |issue=11 |pages=1174–1185 |doi=10.1109/34.244678}}</ref> [[audiovisual speech recognition]], audiovisual speaker recognition and speaker adaptation.
[[Artificial neural network|Neural networks]] make fewer explicit assumptions about feature statistical properties than HMMs and have several qualities making them more attractive recognition models for speech recognition. When used to estimate the probabilities of a speech feature segment, neural networks allow discriminative training in a natural and efficient manner. However, in spite of their effectiveness in classifying short-time units such as individual phonemes and isolated words,<ref>S. A. Zahorian, A. M. Zimmer, and F. Meng, (2002) "[https://www.researchgate.net/profile/Stephen_Zahorian/publication/221480228_Vowel_classification_for_computer-based_visual_feedback_for_speech_training_for_the_hearing_impaired/links/00b7d525d25f51c585000000.pdf Vowel Classification for Computer based Visual Feedback for Speech Training for the Hearing Impaired]," in ICSLP 2002</ref> early neural networks were rarely successful for continuous recognition tasks because of their limited ability to model temporal dependencies. One approach to this limitation was to use neural networks as a pre-processing, feature transformation or dimensionality reduction,<ref>{{Cite book |last1=Hu |first1=Hongbing |title=ICASSP 2010 |last2=Zahorian |first2=Stephen A. |year=2010 |chapter=Dimensionality Reduction Methods for HMM Phonetic Recognition |chapter-url=http://bingweb.binghamton.edu/~hhu1/paper/Hu2010Dimensionality.pdf |archive-url=http://archive.wikiwix.com/cache/20120706063756/http://bingweb.binghamton.edu/~hhu1/paper/Hu2010Dimensionality.pdf |archive-date=6 July 2012 |url-status=live |df=dmy-all}}</ref> step prior to HMM based recognition. 
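The discriminative, posterior-estimating role described above can be sketched in a few lines. This is a minimal illustration, not code from the cited works: the network sizes, random weights, uniform priors, and frame dimensions are all invented for the example. In hybrid NN/HMM systems, the network's phone posteriors are typically divided by the class priors to obtain "scaled likelihoods" usable as HMM emission scores.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_phones = 39, 64, 40  # e.g. MFCC+deltas -> phone set (invented sizes)

# Randomly initialized one-hidden-layer network (stand-in for a trained model).
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_phones))
b2 = np.zeros(n_phones)

def phone_posteriors(frames):
    """P(phone | frame) for each row of `frames`, via a softmax output layer."""
    h = np.tanh(frames @ W1 + b1)                 # hidden layer
    logits = h @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def scaled_likelihoods(frames, priors):
    """Hybrid NN/HMM trick: p(frame | phone) is proportional to P(phone | frame) / P(phone)."""
    return phone_posteriors(frames) / priors

frames = rng.normal(size=(100, n_features))       # 100 acoustic feature frames (synthetic)
priors = np.full(n_phones, 1.0 / n_phones)        # uniform priors, for the sketch only
scores = scaled_likelihoods(frames, priors)       # per-frame HMM emission scores
```

An HMM decoder would then combine these per-frame scores with transition probabilities over time, which is exactly the temporal modeling that early standalone networks lacked.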
However, more recently, LSTM and related recurrent neural networks (RNNs),<ref name="lstm" /><ref name="sak2015" /><ref name="fernandez2007">{{Cite book |last1=Fernandez |first1=Santiago |title=Proceedings of IJCAI |last2=Graves |first2=Alex |last3=Schmidhuber |first3=Jürgen |author-link3=Jürgen Schmidhuber |year=2007 |chapter=Sequence labelling in structured domains with hierarchical recurrent neural networks |chapter-url=http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-124.pdf |archive-url=https://web.archive.org/web/20170815003130/http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-124.pdf |archive-date=15 August 2017 |url-status=live |df=dmy-all}}</ref><ref>{{Cite arXiv |eprint=1303.5778 |class=cs.NE |first1=Alex |last1=Graves |first2=Abdel-rahman |last2=Mohamed |title=Speech recognition with deep recurrent neural networks |first3=Geoffrey |last3=Hinton |year=2013}} ICASSP 2013.</ref> time delay neural networks (TDNNs),<ref>{{Cite journal |last=Waibel |first=Alex |year=1989 |title=Modular Construction of Time-Delay Neural Networks for Speech Recognition |url=http://isl.anthropomatik.kit.edu/cmu-kit/Modular_Construction_of_Time-Delay_Neural_Networks_for_Speech_Recognition.pdf |url-status=live |journal=Neural Computation |volume=1 |issue=1 |pages=39–46 |doi=10.1162/neco.1989.1.1.39 |s2cid=236321 |archive-url=https://web.archive.org/web/20160629180846/http://isl.anthropomatik.kit.edu/cmu-kit/Modular_Construction_of_Time-Delay_Neural_Networks_for_Speech_Recognition.pdf |archive-date=29 June 2016 |df=dmy-all}}</ref> and transformers<ref name=":1" /><ref name=":3" /><ref name=":4" /> have demonstrated improved performance in this area.

====Deep feedforward and recurrent neural networks====
{{Main|Deep learning}}
Deep neural networks and denoising [[autoencoder]]s<ref>{{Cite book |last1=Maas |first1=Andrew L. |title=Proceedings of Interspeech 2012 |last2=Le |first2=Quoc V. |last3=O'Neil |first3=Tyler M.
|last4=Vinyals |first4=Oriol |last5=Nguyen |first5=Patrick |last6=Ng |first6=Andrew Y. |author-link6=Andrew Ng |year=2012 |chapter=Recurrent Neural Networks for Noise Reduction in Robust ASR}}</ref> are also under investigation. A deep feedforward neural network (DNN) is an [[artificial neural network]] with multiple hidden layers of units between the input and output layers.<ref name=HintonDengYu2012/> Similar to shallow neural networks, DNNs can model complex non-linear relationships. DNN architectures generate compositional models, where extra layers enable composition of features from lower layers, giving a huge learning capacity and thus the potential of modeling complex patterns of speech data.<ref name=BOOK2014/> A notable success of DNNs in large-vocabulary speech recognition came in 2010, when industrial researchers, in collaboration with academic researchers, adopted large DNN output layers based on context-dependent HMM states constructed by decision trees.<ref name="Roles2010">{{Cite journal |last1=Yu |first1=D. |last2=Deng |first2=L. |last3=Dahl |first3=G. |date=2010 |title=Roles of Pre-Training and Fine-Tuning in Context-Dependent DBN-HMMs for Real-World Speech Recognition |url=https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/dbn4asr-nips2010.pdf |journal=NIPS Workshop on Deep Learning and Unsupervised Feature Learning}}</ref><ref name="ref27">{{Cite journal |last1=Dahl |first1=George E. |last2=Yu |first2=Dong |last3=Deng |first3=Li |last4=Acero |first4=Alex |date=2012 |title=Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition |journal=IEEE Transactions on Audio, Speech, and Language Processing |volume=20 |issue=1 |pages=30–42 |doi=10.1109/TASL.2011.2134090 |s2cid=14862572}}</ref><ref name="ICASSP2013">Deng L., Li, J., Huang, J., Yao, K., Yu, D., Seide, F. et al.
[https://pdfs.semanticscholar.org/6bdc/cfe195bc49d218acc5be750aa49e41f408e4.pdf Recent Advances in Deep Learning for Speech Research at Microsoft] {{Webarchive|url=https://web.archive.org/web/20240909052236/https://pdfs.semanticscholar.org/6bdc/cfe195bc49d218acc5be750aa49e41f408e4.pdf |date=9 September 2024 }}. ICASSP, 2013.</ref> See comprehensive reviews of this development and of the state of the art as of October 2014 in the recent Springer book from Microsoft Research.<ref name="ReferenceA" /> See also the related background of automatic speech recognition and the impact of various machine learning paradigms, notably including [[deep learning]], in recent overview articles.<ref>{{Cite journal |last1=Deng |first1=L. |last2=Li |first2=Xiao |date=2013 |title=Machine Learning Paradigms for Speech Recognition: An Overview |url=http://cvsp.cs.ntua.gr/courses/patrec/slides_material2018/slides-2018/DengLi_MLParadigms-SpeechRecogn-AnOverview_TALSP13.pdf |journal=IEEE Transactions on Audio, Speech, and Language Processing |volume=21 |issue=5 |pages=1060–1089 |doi=10.1109/TASL.2013.2244083 |s2cid=16585863 |access-date=9 September 2024 |archive-date=9 September 2024 |archive-url=https://web.archive.org/web/20240909052239/http://cvsp.cs.ntua.gr/courses/patrec/slides_material2018/slides-2018/DengLi_MLParadigms-SpeechRecogn-AnOverview_TALSP13.pdf |url-status=live }}</ref><ref name="scholarpedia2015">{{Cite journal |last=Schmidhuber |first=Jürgen |author-link=Jürgen Schmidhuber |year=2015 |title=Deep Learning |journal=Scholarpedia |volume=10 |issue=11 |page=32832 |bibcode=2015SchpJ..1032832S |doi=10.4249/scholarpedia.32832 |doi-access=free}}</ref> One fundamental principle of [[deep learning]] is to do away with hand-crafted [[feature engineering]] and to use raw features. This principle was first explored successfully in the architecture of deep autoencoder on the "raw" spectrogram or linear filter-bank features,<ref name="interspeech2010">L. Deng, M. Seltzer, D. Yu, A. 
Acero, A. Mohamed, and G. Hinton (2010) [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.185.1908&rep=rep1&type=pdf Binary Coding of Speech Spectrograms Using a Deep Auto-encoder]. Interspeech.</ref> showing its superiority over Mel-cepstral features, which are derived from spectrograms through a few stages of fixed transformation. The true "raw" features of speech, waveforms, have more recently been shown to produce excellent larger-scale speech recognition results.<ref name="interspeech2014">{{Cite book |last1=Tüske |first1=Zoltán |title=Interspeech 2014 |last2=Golik |first2=Pavel |last3=Schlüter |first3=Ralf |last4=Ney |first4=Hermann |year=2014 |chapter=Acoustic Modeling with Deep Neural Networks Using Raw Time Signal for LVCSR |chapter-url=https://www-i6.informatik.rwth-aachen.de/publications/download/937/T%7Bu%7DskeZolt%7Ba%7DnGolikPavelSchl%7Bu%7DterRalfNeyHermann--AcousticModelingwithDeepNeuralNetworksUsingRawTimeSignalfor%7BLVCSR%7D--2014.pdf |archive-url=https://web.archive.org/web/20161221174753/https://www-i6.informatik.rwth-aachen.de/publications/download/937/T%7Bu%7DskeZolt%7Ba%7DnGolikPavelSchl%7Bu%7DterRalfNeyHermann--AcousticModelingwithDeepNeuralNetworksUsingRawTimeSignalfor%7BLVCSR%7D--2014.pdf |archive-date=21 December 2016 |url-status=live |df=dmy-all}}</ref>
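The contrast between "raw" spectrogram features and Mel-cepstral features can be sketched as follows. This is an illustrative outline only, not code from the cited papers: the frame sizes, the simplified triangular filters, and the synthetic noise "waveform" are all invented. The point it shows is that the log spectrogram is one STFT away from the waveform, while Mel-cepstral coefficients add further fixed stages (Mel filterbank, log, DCT) on top of it.

```python
import numpy as np

def log_spectrogram(wave, frame_len=400, hop=160):
    """'Raw' feature: log power spectrum of overlapping windowed frames."""
    n = 1 + (len(wave) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n)[:, None]
    frames = wave[idx] * np.hanning(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10)               # shape: (n_frames, frame_len//2 + 1)

def mel_filterbank(n_filters, n_fft_bins, sr=16000):
    """Simplified triangular filters spaced evenly on the Mel scale."""
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = inv(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft_bins - 1) * pts / (sr / 2)).astype(int)
    fb = np.zeros((n_filters, n_fft_bins))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = np.linspace(0, 1, c - l, endpoint=False)  # rising edge
        fb[i, c:r] = np.linspace(1, 0, r - c, endpoint=False)  # falling edge
    return fb

def dct2(x, n_coeffs=13):
    """DCT-II over the filter axis, yielding cepstral coefficients."""
    N = x.shape[1]
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(N) + 1) / (2 * N))
    return x @ basis.T

wave = np.random.default_rng(0).normal(size=16000)   # 1 s of noise at 16 kHz (synthetic)
spec = log_spectrogram(wave)                         # "raw" spectrogram feature
fb = mel_filterbank(26, spec.shape[1])
fbank = np.log(np.exp(spec) @ fb.T + 1e-10)          # fixed stage 1: log Mel filterbank
mfcc = dct2(fbank)                                   # fixed stage 2: Mel-cepstral coefficients
```

A deep autoencoder trained directly on `spec` (or on the linear filterbank output) skips the hand-designed compression of the later stages and lets the network learn its own transformation instead.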