=== Deep learning breakthroughs in the 1960s and 1970s ===
Fundamental research was conducted on ANNs in the 1960s and 1970s. The first working deep learning algorithm was the [[Group method of data handling]], a method to train arbitrarily deep neural networks, published by [[Alexey Ivakhnenko]] and Lapa in the [[Soviet Union]] (1965). They regarded it as a form of polynomial regression,<ref name="ivak1965">{{cite book|first1=A. G. |last1=Ivakhnenko |first2=V. G. |last2=Lapa |title=Cybernetics and Forecasting Techniques|url={{google books |plainurl=y |id=rGFgAAAAMAAJ}}|year=1967|publisher=American Elsevier Publishing Co.|isbn=978-0-444-00020-0}}</ref> or a generalization of Rosenblatt's perceptron.<ref>{{Cite journal |last=Ivakhnenko |first=A.G. |date=March 1970 |title=Heuristic self-organization in problems of engineering cybernetics |url=https://linkinghub.elsevier.com/retrieve/pii/0005109870900920 |journal=Automatica |language=en |volume=6 |issue=2 |pages=207–219 |doi=10.1016/0005-1098(70)90092-0 |archive-date=12 August 2024 |access-date=7 August 2024 |archive-url=https://web.archive.org/web/20240812123448/https://linkinghub.elsevier.com/retrieve/pii/0005109870900920 |url-status=live }}</ref> A 1971 paper described a deep network with eight layers trained by this method,<ref name="ivak1971">{{Cite journal|last=Ivakhnenko|first=Alexey|date=1971|title=Polynomial theory of complex systems|url=http://gmdh.net/articles/history/polynomial.pdf|journal=IEEE Transactions on Systems, Man, and Cybernetics|pages=364–378|doi=10.1109/TSMC.1971.4308320|volume=SMC-1|issue=4|access-date=5 November 2019|archive-date=29 August 2017|archive-url=https://web.archive.org/web/20170829230621/http://www.gmdh.net/articles/history/polynomial.pdf|url-status=live}}</ref> which fits the network one layer at a time by regression analysis. Superfluous hidden units are pruned using a separate validation set.
Since the activation functions of the nodes are Kolmogorov-Gabor polynomials, these were also the first deep networks with multiplicative units or "gates."<ref name="DLhistory">{{cite arXiv |eprint=2212.11279 |class=cs.NE |first=Jürgen |last=Schmidhuber |author-link=Jürgen Schmidhuber |title=Annotated History of Modern AI and Deep Learning |date=2022}}</ref>
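
The layer-wise procedure described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not Ivakhnenko's exact algorithm: each candidate unit is a second-order Kolmogorov-Gabor polynomial of two inputs, its coefficients are fitted by least squares on a training set, and only the units with the lowest error on a separate validation set are kept. The function names, the <code>keep</code> parameter, and the small <code>ridge</code> term (added for numerical stability) are hypothetical choices made for this illustration.

```python
from itertools import combinations


def features(xi, xj):
    """Terms of a second-order Kolmogorov-Gabor polynomial in two inputs.

    The xi * xj term is the multiplicative ("gate"-like) interaction.
    """
    return [1.0, xi, xj, xi * xj, xi * xi, xj * xj]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_unit(X, y, i, j, ridge=1e-8):
    """Least-squares fit of one polynomial unit on input columns i and j.

    Solves the normal equations; `ridge` is an assumed stabilising term.
    """
    phi = [features(row[i], row[j]) for row in X]
    k = len(phi[0])
    A = [[sum(p[a] * p[b] for p in phi) + (ridge if a == b else 0.0)
          for b in range(k)] for a in range(k)]
    rhs = [sum(p[a] * t for p, t in zip(phi, y)) for a in range(k)]
    return solve(A, rhs)


def predict_unit(coef, X, i, j):
    """Outputs of a fitted unit for every row of X."""
    return [sum(c * f for c, f in zip(coef, features(row[i], row[j])))
            for row in X]


def mse(pred, y):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)


def gmdh_layer(X_train, y_train, X_val, y_val, keep):
    """Fit one unit per input pair; keep the `keep` best by validation error."""
    candidates = []
    for i, j in combinations(range(len(X_train[0])), 2):
        coef = fit_unit(X_train, y_train, i, j)
        err = mse(predict_unit(coef, X_val, i, j), y_val)
        candidates.append((err, i, j, coef))
    candidates.sort(key=lambda c: c[0])  # prune units that validate poorly
    return candidates[:keep]
```

In this sketch, a deeper network would be grown by feeding the outputs of the surviving units (via <code>predict_unit</code>) as the inputs to another call of <code>gmdh_layer</code>, stopping when the validation error no longer improves.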

The first deep learning [[multilayer perceptron]] trained by [[stochastic gradient descent]]<ref name="robbins1951">{{Cite journal | last1 = Robbins | first1 = H. | author-link = Herbert Robbins| last2 = Monro | first2 = S. | doi = 10.1214/aoms/1177729586 | title = A Stochastic Approximation Method | journal = The Annals of Mathematical Statistics | volume = 22 | issue = 3 | pages = 400 | year = 1951 | doi-access = free }}</ref> was published in 1967 by [[Shun'ichi Amari]].<ref name="Amari1967">{{cite journal |last1=Amari |first1=Shun'ichi |author-link=Shun'ichi Amari|title=A theory of adaptive pattern classifier|journal= IEEE Transactions |date=1967 |volume=EC |issue=16 |pages=279–307}}</ref> In computer experiments conducted by Amari's student Saito, a five-layer MLP with two modifiable layers learned [[Knowledge representation|internal representations]] to classify non-linearly separable pattern classes.<ref name="DLhistory"/> Subsequent developments in hardware and hyperparameter tuning have made end-to-end stochastic gradient descent the currently dominant training technique.

In 1969, [[Kunihiko Fukushima]] introduced the [[rectifier (neural networks)|ReLU]] (rectified linear unit) activation function.<ref name="DLhistory" /><ref name="Fukushima1969">{{cite journal |last1=Fukushima |first1=K. |date=1969 |title=Visual feature extraction by a multilayered network of analog threshold elements |journal=IEEE Transactions on Systems Science and Cybernetics |volume=5 |issue=4 |pages=322–333 |doi=10.1109/TSSC.1969.300225}}</ref><ref name=sonoda17>{{cite journal | last1 = Sonoda | first1 = Sho | last2=Murata | first2=Noboru | s2cid = 12149203 | year = 2017 | title = Neural network with unbounded activation functions is universal approximator | journal = Applied and Computational Harmonic Analysis | volume = 43 | issue = 2 | pages = 233–268 | doi = 10.1016/j.acha.2015.12.005| arxiv = 1505.03654 }}</ref> The rectifier has become the most popular activation function for deep learning.<ref>{{cite arXiv |eprint=1710.05941 |class=cs.NE |first1=Prajit |last1=Ramachandran |first2=Barret |last2=Zoph |title=Searching for Activation Functions |date=16 October 2017 |last3=Le |first3=Quoc V.}}</ref>
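
For reference, the rectifier in its common modern form is <code>max(0, x)</code>. The following minimal Python sketch shows that definition together with the derivative typically used in gradient-based training; the function names are illustrative, and setting the derivative to 0 at <code>x = 0</code> is a convention assumed here rather than something stated in the cited sources.

```python
def relu(x: float) -> float:
    """Rectified linear unit: passes positive inputs, maps the rest to zero."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU; the value at x == 0 is set to 0 by convention."""
    return 1.0 if x > 0 else 0.0


# Apply the activation elementwise to a list of pre-activations.
activations = [relu(z) for z in (-2.0, -0.5, 0.0, 0.5, 2.0)]
```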

Nevertheless, research stagnated in the United States following the work of [[Marvin Minsky|Minsky]] and [[Seymour Papert|Papert]] (1969),<ref name=":132">{{cite book |last1=Minsky |first1=Marvin |url={{google books |plainurl=y |id=Ow1OAQAAIAAJ}} |title=Perceptrons: An Introduction to Computational Geometry |last2=Papert |first2=Seymour |publisher=MIT Press |year=1969 |isbn=978-0-262-63022-1}}</ref> who emphasized that basic perceptrons could not compute the exclusive-or (XOR) function. This limitation did not apply to the deep networks of Ivakhnenko (1965) and Amari (1967).
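
The exclusive-or limitation can be illustrated with a small Python sketch. It searches a coarse grid of weights for a single threshold unit and finds none that reproduces exclusive-or, then shows a two-layer network with hand-picked weights that does. The grid search is only an illustration, not a proof (the general result follows because exclusive-or is not linearly separable), and the specific weights and function names are assumptions chosen for this example.

```python
from itertools import product

# Truth table of exclusive-or on binary inputs.
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}


def step(z):
    """Threshold activation of a basic perceptron unit."""
    return 1 if z > 0 else 0


def single_unit(w1, w2, b, x1, x2):
    """One threshold unit: a linear decision boundary in the input plane."""
    return step(w1 * x1 + w2 * x2 + b)


def any_single_unit_solves_xor(grid):
    """Try every (w1, w2, b) on the grid; report whether any matches XOR."""
    for w1, w2, b in product(grid, repeat=3):
        if all(single_unit(w1, w2, b, *x) == y for x, y in XOR.items()):
            return True
    return False


def two_layer_xor(x1, x2):
    """Two hidden threshold units (OR and AND) feeding one output unit."""
    h_or = step(x1 + x2 - 0.5)       # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)      # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # OR and not AND


grid = [i / 2 for i in range(-8, 9)]  # weights from -4.0 to 4.0 in steps of 0.5
single_layer_ok = any_single_unit_solves_xor(grid)
two_layer_ok = all(two_layer_xor(*x) == y for x, y in XOR.items())
```

Here <code>single_layer_ok</code> comes out false and <code>two_layer_ok</code> comes out true, which matches the observation that the limitation concerns single-layer perceptrons rather than networks with hidden layers.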

In 1976, transfer learning was introduced in neural network learning.<ref>Bozinovski S. and Fulgosi A. (1976). "The influence of pattern similarity and transfer learning on the base perceptron training" (original in Croatian) Proceedings of Symposium Informatica 3-121-5, Bled.</ref><ref>Bozinovski S. (2020) "Reminder of the first paper on transfer learning in neural networks, 1976". Informatica 44: 291–302.</ref>

Deep learning architectures for [[convolutional neural network]]s (CNNs) with convolutional layers, downsampling layers, and weight replication began with the [[Neocognitron]] introduced by Kunihiko Fukushima in 1979, although it was not trained by backpropagation.<ref name="FUKU1979">{{cite journal |last1=Fukushima |first1=K. |year=1979 |title=Neural network model for a mechanism of pattern recognition unaffected by shift in position—Neocognitron |journal=Trans. IECE (In Japanese)|volume= J62-A |issue=10 |pages=658–665 |doi=10.1007/bf00344251 |pmid=7370364 |s2cid=206775608}}</ref><ref name="FUKU1980">{{cite journal |last1=Fukushima |first1=K. |year=1980 |title=Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position |journal=Biol. Cybern. |volume=36 |issue=4 |pages=193–202 |doi=10.1007/bf00344251 |pmid=7370364 |s2cid=206775608}}</ref><ref name="SCHIDHUB4"/>