====Task-level parallelism====
{{Main|Multithreading (computer architecture)|l1=Multithreading|Multi-core processor}}
Another strategy of achieving performance is to execute multiple [[Thread (computing)|threads]] or [[Process (computing)|processes]] in parallel. This area of research is known as [[parallel computing]].<ref>{{cite book |last1=Gottlieb |first1=Allan |url=http://dl.acm.org/citation.cfm?id=160438 |title=Highly parallel computing |last2=Almasi |first2=George S. |publisher=Benjamin/Cummings |year=1989 |isbn=978-0-8053-0177-9 |location=Redwood City, California |language=en-us |access-date=2016-04-25 |archive-url=https://web.archive.org/web/20181107043726/https://dl.acm.org/citation.cfm?id=160438 |archive-date=2018-11-07 |url-status=live}}</ref> In [[Flynn's taxonomy]], this strategy is known as [[Multiple instruction, multiple data|multiple instruction stream, multiple data stream]] (MIMD).<ref>{{Cite journal|last1=Flynn|first1=M. J. |s2cid=18573685 |author-link1=Michael J. Flynn|doi=10.1109/TC.1972.5009071|title=Some Computer Organizations and Their Effectiveness|journal=[[IEEE Transactions on Computers]]|volume=C-21|issue=9| pages=948–960| date=September 1972}}</ref>
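As an illustration only, the MIMD idea can be sketched in software as two independent tasks, each running its own code on its own data. The function names below are hypothetical and chosen for this example. In CPython the two threads may be interleaved rather than run truly in parallel, so the sketch shows the structure of the approach and not a performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(text):
    """Instruction stream 1: count the words in a string."""
    return len(text.split())


def sum_of_squares(numbers):
    """Instruction stream 2: sum the squares of a list of integers."""
    return sum(n * n for n in numbers)


def run_mimd_tasks():
    # Two independent tasks: different code, different data.
    with ThreadPoolExecutor(max_workers=2) as pool:
        words = pool.submit(word_count, "multiple instruction multiple data")
        squares = pool.submit(sum_of_squares, [1, 2, 3, 4])
        # result() blocks until the corresponding task has finished.
        return words.result(), squares.result()


if __name__ == "__main__":
    print(run_mimd_tasks())
```

Neither task depends on the other, which is what allows a multiprocessor or multithreaded CPU to execute them at the same time.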

One technology used for this purpose is [[multiprocessing]] (MP).<ref>{{cite journal |last1=Lu |first1=N.-P. |last2=Chung |first2=C.-P. |year=1998 |title=Parallelism exploitation in superscalar multiprocessing  |journal=IEE Proceedings - Computers and Digital Techniques |volume=145 |issue=4 |pages=255 |doi=10.1049/ip-cdt:19981955|doi-broken-date=7 December 2024 }}</ref> The initial type of this technology is known as [[symmetric multiprocessing]] (SMP), where a small number of CPUs share a coherent view of their memory system. In this scheme, each CPU has additional hardware to maintain a constantly up-to-date view of memory. By avoiding stale views of memory, the CPUs can cooperate on the same program and programs can migrate from one CPU to another. To increase the number of cooperating CPUs beyond a handful, schemes such as [[non-uniform memory access]] (NUMA) and [[directory-based coherence protocols]] were introduced in the 1990s. SMP systems are limited to a small number of CPUs while NUMA systems have been built with thousands of processors. Initially, multiprocessing was built using multiple discrete CPUs and boards to implement the interconnect between the processors. When the processors and their interconnect are all implemented on a single chip, the technology is known as chip-level multiprocessing (CMP) and the single chip as a [[multi-core processor]].
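The role of the coherence hardware can be illustrated with a deliberately simplified model. The sketch below assumes a write-through, write-invalidate scheme on a single shared bus; the class names <code>Bus</code> and <code>Cache</code> are hypothetical, and real protocols such as MESI track more states per cache line than this toy does.

```python
class Bus:
    """Hypothetical shared bus linking every cache to one main memory."""

    def __init__(self):
        self.memory = {}
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast_invalidate(self, address, writer):
        # Every cache except the writer drops its copy of the address.
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(address, None)


class Cache:
    """Per-CPU private cache (assumed write-through, write-invalidate)."""

    def __init__(self, bus):
        self.lines = {}
        self.bus = bus
        bus.attach(self)

    def read(self, address):
        if address not in self.lines:
            # Miss: fetch from main memory (unwritten locations read as 0).
            self.lines[address] = self.bus.memory.get(address, 0)
        return self.lines[address]

    def write(self, address, value):
        # Invalidate other copies first so no CPU keeps a stale value.
        self.bus.broadcast_invalidate(address, self)
        self.lines[address] = value
        self.bus.memory[address] = value  # write-through to main memory


def demo():
    bus = Bus()
    cpu0, cpu1 = Cache(bus), Cache(bus)
    first = cpu1.read(0x10)   # cpu1 caches the initial value
    cpu0.write(0x10, 42)      # cpu1's copy is invalidated
    second = cpu1.read(0x10)  # miss, so cpu1 re-fetches the new value
    return first, second


if __name__ == "__main__":
    print(demo())
```

Without the invalidation step, the second read by <code>cpu1</code> would return the stale cached value, which is the situation the coherence hardware exists to prevent.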

It was later recognized that finer-grain parallelism existed within a single program. A single program might have several threads (or functions) that could be executed separately or in parallel. Some of the earliest examples of this technology implemented [[input/output]] processing such as [[direct memory access]] as a separate thread from the computation thread. A more general approach to this technology was introduced in the 1970s when systems were designed to run multiple computation threads in parallel. This technology is known as [[Multithreading (computer architecture)|multithreading]] (MT). The approach is considered more cost-effective than multiprocessing, as only a small number of components within a CPU are replicated to support MT, as opposed to the entire CPU in the case of MP. In MT, the execution units and the memory system, including the caches, are shared among multiple threads. The downside of MT is that the hardware support for multithreading is more visible to software than that of MP, and thus supervisor software such as operating systems has to undergo larger changes to support MT. One type of MT that was implemented is known as [[temporal multithreading]], where one thread is executed until it is stalled waiting for data to return from external memory. In this scheme, the CPU then quickly context switches to another thread that is ready to run; the switch is often done in one CPU clock cycle, as in the [[UltraSPARC T1]]. Another type of MT is [[simultaneous multithreading]], where instructions from multiple threads are executed in parallel within one CPU clock cycle.
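The benefit of switching threads on a stall can be shown with a toy cycle count. The sketch below is a simplified model under stated assumptions, not a description of any real CPU: each operation takes one cycle to issue, a memory operation (<code>"M"</code>) then stalls its thread for an assumed <code>MEMORY_LATENCY</code> cycles, and a thread switch costs no extra cycles. All names are hypothetical.

```python
MEMORY_LATENCY = 4  # assumed stall length, in cycles, after a memory op

# Each thread is a list of ops: "C" = compute, "M" = memory access.
DEMO_THREADS = [["C", "M", "C", "C"], ["C", "M", "C", "C"]]


def run_sequential(threads):
    """Run each thread to completion; the core idles during every stall."""
    cycles = 0
    for ops in threads:
        for op in ops:
            cycles += 1  # one cycle to issue the op
            if op == "M":
                cycles += MEMORY_LATENCY  # core waits for the data
    return cycles


def run_switch_on_stall(threads):
    """Issue from one thread until it stalls, then switch to a ready one."""
    count = len(threads)
    next_op = [0] * count   # index of the next op for each thread
    ready_at = [0] * count  # first cycle at which each thread may issue
    cycle = 0
    current = 0
    while any(next_op[t] < len(threads[t]) for t in range(count)):
        runnable = [
            t for t in range(count)
            if next_op[t] < len(threads[t]) and ready_at[t] <= cycle
        ]
        if not runnable:
            cycle += 1  # every unfinished thread is stalled: idle cycle
            continue
        if current not in runnable:
            current = runnable[0]  # switch threads (assumed free)
        op = threads[current][next_op[current]]
        next_op[current] += 1
        cycle += 1
        if op == "M":
            ready_at[current] = cycle + MEMORY_LATENCY
    # Include any stall that is still outstanding after the last issue.
    return max([cycle] + ready_at)


if __name__ == "__main__":
    print("run to completion:", run_sequential(DEMO_THREADS))
    print("switch on stall:  ", run_switch_on_stall(DEMO_THREADS))
```

With the two demonstration threads, the second thread issues its instructions while the first waits on memory, so fewer cycles are spent idle than when each thread runs to completion in turn.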

For several decades from the 1970s to the early 2000s, the focus in designing high-performance general-purpose CPUs was largely on achieving high ILP through technologies such as pipelining, caches, superscalar execution, and out-of-order execution. This trend culminated in large, power-hungry CPUs such as the Intel [[Pentium 4]]. By the early 2000s, CPU designers were prevented from achieving higher performance through ILP techniques by the growing disparity between CPU operating frequencies and main memory operating frequencies, as well as by escalating CPU power dissipation owing to more esoteric ILP techniques.

CPU designers then borrowed ideas from commercial computing markets such as [[transaction processing]], where the aggregate performance of multiple programs, also known as [[throughput]] computing, was more important than the performance of a single thread or process.

This reversal of emphasis is evidenced by the proliferation of dual-core and higher-core-count processor designs and, notably, by Intel's newer designs resembling its less superscalar [[P6 (microarchitecture)|P6]] architecture. Late designs in several processor families feature chip-level multiprocessing, including the [[x86-64]] [[Opteron]] and [[Athlon 64 X2]], the [[SPARC]] [[UltraSPARC T1]], and the IBM [[POWER4]] and [[POWER5]], as well as several [[video game console]] CPUs such as the [[Xbox 360]]'s triple-core PowerPC design and the [[PlayStation 3]]'s 7-core [[Cell (microprocessor)|Cell microprocessor]].