===Neural networks===
{{Main|Artificial neural network}}
Neural networks emerged as an attractive acoustic modeling approach in ASR in the late 1980s. Since then, neural networks have been used in many aspects of speech recognition such as phoneme classification,<ref>{{Cite journal |last1=Waibel |first1=A. |last2=Hanazawa |first2=T. |last3=Hinton |first3=G. |last4=Shikano |first4=K. |last5=Lang |first5=K. J. |year=1989 |title=Phoneme recognition using time-delay neural networks |journal=IEEE Transactions on Acoustics, Speech, and Signal Processing |volume=37 |issue=3 |pages=328–339 |doi=10.1109/29.21701 |s2cid=9563026 |hdl-access=free |hdl=10338.dmlcz/135496}}</ref> phoneme classification through multi-objective evolutionary algorithms,<ref name="Bird Wanner Ekárt Faria 2020 p=113402">{{Cite journal |last1=Bird |first1=Jordan J. |last2=Wanner |first2=Elizabeth |last3=Ekárt |first3=Anikó |last4=Faria |first4=Diego R. |year=2020 |title=Optimisation of phonetic aware speech recognition through multi-objective evolutionary algorithms |url=https://publications.aston.ac.uk/id/eprint/41416/1/Speech_Recog_ESWA_2_.pdf |journal=Expert Systems with Applications |publisher=Elsevier BV |volume=153 |page=113402 |doi=10.1016/j.eswa.2020.113402 |issn=0957-4174 |s2cid=216472225 |access-date=9 September 2024 |archive-date=9 September 2024 |archive-url=https://web.archive.org/web/20240909053419/https://publications.aston.ac.uk/id/eprint/41416/1/Speech_Recog_ESWA_2_.pdf |url-status=live }}</ref> isolated word recognition,<ref>{{Cite journal |last1=Wu |first1=J. |last2=Chan |first2=C. |year=1993 |title=Isolated Word Recognition by Neural Network Models with Cross-Correlation Coefficients for Speech Dynamics |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=15 |issue=11 |pages=1174–1185 |doi=10.1109/34.244678}}</ref> [[audiovisual speech recognition]], audiovisual speaker recognition and speaker adaptation.
[[Artificial neural network|Neural networks]] make fewer explicit assumptions about feature statistical properties than HMMs and have several qualities making them more attractive recognition models for speech recognition. When used to estimate the probabilities of a speech feature segment, neural networks allow discriminative training in a natural and efficient manner. However, in spite of their effectiveness in classifying short-time units such as individual phonemes and isolated words,<ref>S. A. Zahorian, A. M. Zimmer, and F. Meng, (2002) "[https://www.researchgate.net/profile/Stephen_Zahorian/publication/221480228_Vowel_classification_for_computer-based_visual_feedback_for_speech_training_for_the_hearing_impaired/links/00b7d525d25f51c585000000.pdf Vowel Classification for Computer based Visual Feedback for Speech Training for the Hearing Impaired]," in ICSLP 2002</ref> early neural networks were rarely successful for continuous recognition tasks because of their limited ability to model temporal dependencies. One approach to this limitation was to use neural networks as a pre-processing, feature transformation or dimensionality reduction,<ref>{{Cite book |last1=Hu |first1=Hongbing |title=ICASSP 2010 |last2=Zahorian |first2=Stephen A. |year=2010 |chapter=Dimensionality Reduction Methods for HMM Phonetic Recognition |chapter-url=http://bingweb.binghamton.edu/~hhu1/paper/Hu2010Dimensionality.pdf |archive-url=http://archive.wikiwix.com/cache/20120706063756/http://bingweb.binghamton.edu/~hhu1/paper/Hu2010Dimensionality.pdf |archive-date=6 July 2012 |url-status=live |df=dmy-all}}</ref> step prior to HMM based recognition. 
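The discriminative, posterior-estimating role described above can be sketched in a few lines. This is a minimal illustration, not code from the cited works: the network sizes, random weights, uniform priors, and frame dimensions are all invented for the example. In hybrid NN/HMM systems, the network's phone posteriors are typically divided by the class priors to obtain "scaled likelihoods" usable as HMM emission scores.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_phones = 39, 64, 40  # e.g. MFCC+deltas -> phone set (invented sizes)

# Randomly initialized one-hidden-layer network (stand-in for a trained model).
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_phones))
b2 = np.zeros(n_phones)

def phone_posteriors(frames):
    """P(phone | frame) for each row of `frames`, via a softmax output layer."""
    h = np.tanh(frames @ W1 + b1)                 # hidden layer
    logits = h @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def scaled_likelihoods(frames, priors):
    """Hybrid NN/HMM trick: p(frame | phone) is proportional to P(phone | frame) / P(phone)."""
    return phone_posteriors(frames) / priors

frames = rng.normal(size=(100, n_features))       # 100 acoustic feature frames (synthetic)
priors = np.full(n_phones, 1.0 / n_phones)        # uniform priors, for the sketch only
scores = scaled_likelihoods(frames, priors)       # per-frame HMM emission scores
```

An HMM decoder would then combine these per-frame scores with transition probabilities over time, which is exactly the temporal modeling that early standalone networks lacked.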
However, more recently, LSTM and related recurrent neural networks (RNNs),<ref name="lstm" /><ref name="sak2015" /><ref name="fernandez2007">{{Cite book |last1=Fernandez |first1=Santiago |title=Proceedings of IJCAI |last2=Graves |first2=Alex |last3=Schmidhuber |first3=Jürgen |author-link3=Jürgen Schmidhuber |year=2007 |chapter=Sequence labelling in structured domains with hierarchical recurrent neural networks |chapter-url=http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-124.pdf |archive-url=https://web.archive.org/web/20170815003130/http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-124.pdf |archive-date=15 August 2017 |url-status=live |df=dmy-all}}</ref><ref>{{Cite arXiv |eprint=1303.5778 |class=cs.NE |first1=Alex |last1=Graves |first2=Abdel-rahman |last2=Mohamed |title=Speech recognition with deep recurrent neural networks |first3=Geoffrey |last3=Hinton |year=2013}} ICASSP 2013.</ref> time delay neural networks (TDNNs),<ref>{{Cite journal |last=Waibel |first=Alex |year=1989 |title=Modular Construction of Time-Delay Neural Networks for Speech Recognition |url=http://isl.anthropomatik.kit.edu/cmu-kit/Modular_Construction_of_Time-Delay_Neural_Networks_for_Speech_Recognition.pdf |url-status=live |journal=Neural Computation |volume=1 |issue=1 |pages=39–46 |doi=10.1162/neco.1989.1.1.39 |s2cid=236321 |archive-url=https://web.archive.org/web/20160629180846/http://isl.anthropomatik.kit.edu/cmu-kit/Modular_Construction_of_Time-Delay_Neural_Networks_for_Speech_Recognition.pdf |archive-date=29 June 2016 |df=dmy-all}}</ref> and transformers<ref name=":1" /><ref name=":3" /><ref name=":4" /> have demonstrated improved performance in this area.

====Deep feedforward and recurrent neural networks====
{{Main|Deep learning}}
Deep neural networks and denoising [[autoencoder]]s<ref>{{Cite book |last1=Maas |first1=Andrew L. |title=Proceedings of Interspeech 2012 |last2=Le |first2=Quoc V. |last3=O'Neil |first3=Tyler M.
|last4=Vinyals |first4=Oriol |last5=Nguyen |first5=Patrick |last6=Ng |first6=Andrew Y. |author-link6=Andrew Ng |year=2012 |chapter=Recurrent Neural Networks for Noise Reduction in Robust ASR}}</ref> are also under investigation. A deep feedforward neural network (DNN) is an [[artificial neural network]] with multiple hidden layers of units between the input and output layers.<ref name=HintonDengYu2012/> Similar to shallow neural networks, DNNs can model complex non-linear relationships. DNN architectures generate compositional models, where extra layers enable composition of features from lower layers, giving a huge learning capacity and thus the potential of modeling complex patterns of speech data.<ref name=BOOK2014/> A notable success of DNNs in large-vocabulary speech recognition came in 2010, when industrial researchers, in collaboration with academic researchers, adopted large DNN output layers based on context-dependent HMM states constructed by decision trees.<ref name="Roles2010">{{Cite journal |last1=Yu |first1=D. |last2=Deng |first2=L. |last3=Dahl |first3=G. |date=2010 |title=Roles of Pre-Training and Fine-Tuning in Context-Dependent DBN-HMMs for Real-World Speech Recognition |url=https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/dbn4asr-nips2010.pdf |journal=NIPS Workshop on Deep Learning and Unsupervised Feature Learning}}</ref><ref name="ref27">{{Cite journal |last1=Dahl |first1=George E. |last2=Yu |first2=Dong |last3=Deng |first3=Li |last4=Acero |first4=Alex |date=2012 |title=Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition |journal=IEEE Transactions on Audio, Speech, and Language Processing |volume=20 |issue=1 |pages=30–42 |doi=10.1109/TASL.2011.2134090 |s2cid=14862572}}</ref><ref name="ICASSP2013">Deng L., Li, J., Huang, J., Yao, K., Yu, D., Seide, F. et al.
[https://pdfs.semanticscholar.org/6bdc/cfe195bc49d218acc5be750aa49e41f408e4.pdf Recent Advances in Deep Learning for Speech Research at Microsoft] {{Webarchive|url=https://web.archive.org/web/20240909052236/https://pdfs.semanticscholar.org/6bdc/cfe195bc49d218acc5be750aa49e41f408e4.pdf |date=9 September 2024 }}. ICASSP, 2013.</ref> See comprehensive reviews of this development and of the state of the art as of October 2014 in the recent Springer book from Microsoft Research.<ref name="ReferenceA" /> See also the related background of automatic speech recognition and the impact of various machine learning paradigms, notably including [[deep learning]], in recent overview articles.<ref>{{Cite journal |last1=Deng |first1=L. |last2=Li |first2=Xiao |date=2013 |title=Machine Learning Paradigms for Speech Recognition: An Overview |url=http://cvsp.cs.ntua.gr/courses/patrec/slides_material2018/slides-2018/DengLi_MLParadigms-SpeechRecogn-AnOverview_TALSP13.pdf |journal=IEEE Transactions on Audio, Speech, and Language Processing |volume=21 |issue=5 |pages=1060–1089 |doi=10.1109/TASL.2013.2244083 |s2cid=16585863 |access-date=9 September 2024 |archive-date=9 September 2024 |archive-url=https://web.archive.org/web/20240909052239/http://cvsp.cs.ntua.gr/courses/patrec/slides_material2018/slides-2018/DengLi_MLParadigms-SpeechRecogn-AnOverview_TALSP13.pdf |url-status=live }}</ref><ref name="scholarpedia2015">{{Cite journal |last=Schmidhuber |first=Jürgen |author-link=Jürgen Schmidhuber |year=2015 |title=Deep Learning |journal=Scholarpedia |volume=10 |issue=11 |page=32832 |bibcode=2015SchpJ..1032832S |doi=10.4249/scholarpedia.32832 |doi-access=free}}</ref> One fundamental principle of [[deep learning]] is to do away with hand-crafted [[feature engineering]] and to use raw features. This principle was first explored successfully in the architecture of deep autoencoder on the "raw" spectrogram or linear filter-bank features,<ref name="interspeech2010">L. Deng, M. Seltzer, D. Yu, A. 
Acero, A. Mohamed, and G. Hinton (2010) [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.185.1908&rep=rep1&type=pdf Binary Coding of Speech Spectrograms Using a Deep Auto-encoder]. Interspeech.</ref> showing its superiority over Mel-cepstral features, which are derived from spectrograms through a few stages of fixed transformation. The true "raw" features of speech, waveforms, have more recently been shown to produce excellent larger-scale speech recognition results.<ref name="interspeech2014">{{Cite book |last1=Tüske |first1=Zoltán |title=Interspeech 2014 |last2=Golik |first2=Pavel |last3=Schlüter |first3=Ralf |last4=Ney |first4=Hermann |year=2014 |chapter=Acoustic Modeling with Deep Neural Networks Using Raw Time Signal for LVCSR |chapter-url=https://www-i6.informatik.rwth-aachen.de/publications/download/937/T%7Bu%7DskeZolt%7Ba%7DnGolikPavelSchl%7Bu%7DterRalfNeyHermann--AcousticModelingwithDeepNeuralNetworksUsingRawTimeSignalfor%7BLVCSR%7D--2014.pdf |archive-url=https://web.archive.org/web/20161221174753/https://www-i6.informatik.rwth-aachen.de/publications/download/937/T%7Bu%7DskeZolt%7Ba%7DnGolikPavelSchl%7Bu%7DterRalfNeyHermann--AcousticModelingwithDeepNeuralNetworksUsingRawTimeSignalfor%7BLVCSR%7D--2014.pdf |archive-date=21 December 2016 |url-status=live |df=dmy-all}}</ref>
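The contrast between "raw" spectrogram features and Mel-cepstral features can be sketched as follows. This is an illustrative outline only, not code from the cited papers: the frame sizes, the simplified triangular filters, and the synthetic noise "waveform" are all invented. The point it shows is that the log spectrogram is one STFT away from the waveform, while Mel-cepstral coefficients add further fixed stages (Mel filterbank, log, DCT) on top of it.

```python
import numpy as np

def log_spectrogram(wave, frame_len=400, hop=160):
    """'Raw' feature: log power spectrum of overlapping windowed frames."""
    n = 1 + (len(wave) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n)[:, None]
    frames = wave[idx] * np.hanning(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10)               # shape: (n_frames, frame_len//2 + 1)

def mel_filterbank(n_filters, n_fft_bins, sr=16000):
    """Simplified triangular filters spaced evenly on the Mel scale."""
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = inv(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft_bins - 1) * pts / (sr / 2)).astype(int)
    fb = np.zeros((n_filters, n_fft_bins))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = np.linspace(0, 1, c - l, endpoint=False)  # rising edge
        fb[i, c:r] = np.linspace(1, 0, r - c, endpoint=False)  # falling edge
    return fb

def dct2(x, n_coeffs=13):
    """DCT-II over the filter axis, yielding cepstral coefficients."""
    N = x.shape[1]
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(N) + 1) / (2 * N))
    return x @ basis.T

wave = np.random.default_rng(0).normal(size=16000)   # 1 s of noise at 16 kHz (synthetic)
spec = log_spectrogram(wave)                         # "raw" spectrogram feature
fb = mel_filterbank(26, spec.shape[1])
fbank = np.log(np.exp(spec) @ fb.T + 1e-10)          # fixed stage 1: log Mel filterbank
mfcc = dct2(fbank)                                   # fixed stage 2: Mel-cepstral coefficients
```

A deep autoencoder trained directly on `spec` (or on the linear filterbank output) skips the hand-designed compression of the later stages and lets the network learn its own transformation instead.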