===Instruction-level parallelism===
{{main|Instruction-level parallelism}}
[[File:Nopipeline.png|thumb|300px|A canonical processor without a [[Instruction pipelining|pipeline]]. It takes five clock cycles to complete one instruction, so the processor achieves only subscalar performance ({{nobreak|1=IPC = 0.2 < 1}}).]]

A computer program is, in essence, a stream of instructions executed by a processor. Without instruction-level parallelism, a processor can issue fewer than one [[Instructions per cycle|instruction per clock cycle]] ({{nobreak|IPC < 1}}); such processors are known as ''subscalar'' processors. The instructions of a program can be [[Out-of-order execution|re-ordered]] and combined into groups, which are then executed in parallel without changing the result of the program. This is known as instruction-level parallelism. Advances in instruction-level parallelism dominated computer architecture from the mid-1980s until the mid-1990s.<ref>Culler et al. p.&nbsp;15.</ref>

[[File:Fivestagespipeline.png|thumb|300px|A canonical five-stage [[Instruction pipelining|pipelined]] processor. In the best-case scenario, it completes one instruction per clock cycle, so the processor achieves scalar performance ({{nobreak|1=IPC = 1}}).]]

All modern processors have multi-stage [[Instruction pipelining|instruction pipelines]]. Each stage in the pipeline corresponds to a different action the processor performs on an instruction; a processor with an ''N''-stage pipeline can have up to ''N'' different instructions at different stages of completion and thus can issue up to one instruction per clock cycle ({{nobreak|1=IPC = 1}}). These processors are known as ''scalar'' processors. The canonical example of a pipelined processor is a [[RISC]] processor, with five stages: instruction fetch (IF), instruction decode (ID), execute (EX), memory access (MEM), and register write back (WB). The [[Pentium 4]] processor had a 35-stage pipeline.<ref>[[Yale Patt|Patt, Yale]] (April 2004). "[http://users.ece.utexas.edu/~patt/Videos/talk_videos/cmu_04-29-04.wmv The Microprocessor Ten Years From Now: What Are The Challenges, How Do We Meet Them?]" {{webarchive|url=https://web.archive.org/web/20080414141000/http://users.ece.utexas.edu/~patt/Videos/talk_videos/cmu_04-29-04.wmv |date=2008-04-14 }} (wmv). Distinguished Lecturer talk at [[Carnegie Mellon University]]. Retrieved on November 7, 2007.</ref>
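
The effect of pipelining on throughput can be illustrated with a simple cycle count. The following Python sketch assumes an idealized pipeline with no stalls or hazards, in which every instruction passes through every stage and each stage takes one clock cycle; the function names are illustrative and do not come from any library.

```python
def cycles_unpipelined(num_instructions: int, num_stages: int) -> int:
    """Cycles needed when each instruction must finish before the next starts."""
    return num_instructions * num_stages


def cycles_pipelined(num_instructions: int, num_stages: int) -> int:
    """Cycles needed on an ideal pipeline (assumption: no stalls or hazards).

    The first instruction needs num_stages cycles to fill the pipeline;
    every later instruction completes one cycle after its predecessor.
    """
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


def ipc(num_instructions: int, cycles: int) -> float:
    """Average instructions completed per clock cycle."""
    return num_instructions / cycles


if __name__ == "__main__":
    n, stages = 100, 5
    for label, count in (
        ("unpipelined", cycles_unpipelined(n, stages)),
        ("pipelined", cycles_pipelined(n, stages)),
    ):
        print(f"{label}: {count} cycles, IPC = {ipc(n, count):.2f}")
```

Under these assumptions, 100 instructions on a five-stage design take 500 cycles without pipelining (IPC = 0.2) and 104 cycles with it (IPC ≈ 0.96). The pipelined IPC approaches 1 as the instruction count grows but does not exceed it.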

[[File:Superscalarpipeline.svg|thumb|300px|A canonical five-stage [[Instruction pipelining|pipelined]] processor with two execution units. In the best-case scenario, it completes two instructions per clock cycle, so the processor achieves superscalar performance ({{nobreak|1=IPC = 2 > 1}}).]]

Most modern processors also have multiple [[execution unit]]s. They usually combine this feature with pipelining and thus can issue more than one instruction per clock cycle ({{nobreak|IPC > 1}}). These processors are known as ''[[superscalar]]'' processors. Superscalar processors differ from [[multi-core processor]]s in that the several execution units are not entire processors (i.e. processing units). Instructions can be grouped together only if there is no [[data dependency]] between them. [[Scoreboarding]] and the [[Tomasulo algorithm]] (which is similar to scoreboarding but makes use of [[register renaming]]) are two of the most common techniques for implementing out-of-order execution and instruction-level parallelism.
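
The grouping constraint can be sketched in a few lines of Python. This is a deliberately simplified, in-order illustration and is not an implementation of scoreboarding or the Tomasulo algorithm: it scans instructions in program order and places an instruction in the current issue group only if it has no data dependency on an instruction already in that group. The <code>Instr</code> type, the register names, and the default issue width of 2 are assumptions made for the example.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction: one destination register, some source registers."""
    dest: str
    srcs: Tuple[str, ...]


def depends(later: Instr, earlier: Instr) -> bool:
    """True if `later` has a data dependency on `earlier`."""
    raw = earlier.dest in later.srcs   # read after write
    waw = earlier.dest == later.dest   # write after write
    war = later.dest in earlier.srcs   # write after read
    return raw or waw or war


def group_in_order(program: List[Instr], width: int = 2) -> List[List[Instr]]:
    """Greedily pack instructions, in program order, into issue groups."""
    if width < 1:
        raise ValueError("width must be at least 1")
    groups: List[List[Instr]] = []
    current: List[Instr] = []
    for instr in program:
        fits = len(current) < width
        independent = not any(depends(instr, e) for e in current)
        if fits and independent:
            current.append(instr)
        else:
            # Close the current group and start a new one with this instruction.
            groups.append(current)
            current = [instr]
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    program = [
        Instr("r1", ("r2", "r3")),  # r1 = r2 + r3
        Instr("r4", ("r5", "r6")),  # r4 = r5 + r6 (independent of the first)
        Instr("r7", ("r1", "r4")),  # r7 = r1 + r4 (reads r1 and r4)
        Instr("r8", ("r7", "r2")),  # r8 = r7 * r2 (reads r7)
    ]
    for cycle, group in enumerate(group_in_order(program)):
        print(cycle, [i.dest for i in group])
```

In this example only the first two instructions can be issued together; the remaining two each depend on an earlier result, so the four instructions form three groups. Because register renaming removes write-after-write and write-after-read conflicts, a design that uses it would only need the read-after-write check.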