===Memory and communication===

Main memory in a parallel computer is either [[Shared memory (interprocess communication)|shared memory]] (shared between all processing elements in a single [[address space]]), or [[distributed memory]] (in which each processing element has its own local address space).<ref name=PH713>Patterson and Hennessy, p. 713.</ref> Distributed memory refers to the fact that the memory is logically distributed, but often implies that it is physically distributed as well. [[Distributed shared memory]] and [[memory virtualization]] combine the two approaches, where the processing element has its own local memory and access to the memory on non-local processors. Accesses to local memory are typically faster than accesses to non-local memory.

On [[supercomputers]], a distributed shared memory space can be implemented using a programming model such as [[Partitioned global address space|PGAS]]. This model allows processes on one compute node to transparently access the remote memory of another compute node. All compute nodes are also connected to an external shared memory system via a high-speed interconnect such as [[InfiniBand]]; this external shared memory system is known as a [[burst buffer]], which is typically built from arrays of [[non-volatile memory]] physically distributed across multiple I/O nodes.

[[File:Numa.svg|right|thumbnail|400px|A logical view of a [[non-uniform memory access]] (NUMA) architecture. Processors in one directory can access that directory's memory with less latency than they can access memory in another directory.]]

Computer architectures in which each element of main memory can be accessed with equal [[Memory latency|latency]] and [[Bandwidth (computing)|bandwidth]] are known as [[uniform memory access]] (UMA) systems. Typically, that can be achieved only by a [[Shared memory (interprocess communication)|shared memory]] system, in which the memory is not physically distributed. A system that does not have this property is known as a [[non-uniform memory access]] (NUMA) architecture. Distributed memory systems have non-uniform memory access.

Computer systems make use of [[CPU cache|cache]]s, which are small, fast memories located close to the processor that store temporary copies of memory values (nearby in both the physical and logical sense). In parallel computer systems, a cache may store the same value in more than one location, creating the possibility of incorrect program execution. These computers require a [[cache coherency]] system, which keeps track of cached values and strategically purges them, thus ensuring correct program execution. [[Bus sniffing|Bus snooping]] is one of the most common methods for keeping track of which values are being accessed (and thus should be purged). Designing large, high-performance cache coherence systems is a very difficult problem in computer architecture.
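As an illustration of the PGAS-style remote access described above, the following minimal C sketch uses [[Message Passing Interface|MPI]] one-sided communication (<code>MPI_Get</code> over an exposed memory window) to let one process read another node's local memory. This is an analogy rather than a PGAS language itself; dedicated PGAS languages such as [[Unified Parallel C]] hide such calls behind ordinary array syntax, and the variable names here are illustrative.

<syntaxhighlight lang="c">
/* Minimal sketch of PGAS-style remote access using MPI one-sided
 * communication. Each process exposes one integer in a globally
 * addressable window; rank 0 reads rank 1's copy directly. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = rank * 100;                    /* this node's local memory */
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int remote = -1;
    MPI_Win_fence(0, win);
    if (rank == 0 && size > 1)                 /* read rank 1's memory */
        MPI_Get(&remote, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    if (rank == 0 && size > 1)
        printf("rank 0 read %d from rank 1\n", remote);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
</syntaxhighlight>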
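To illustrate the local/remote distinction on a NUMA machine, the sketch below uses Linux's <code>libnuma</code> to place buffers on explicit NUMA nodes; a thread running on node 0 would see lower latency on the first buffer than on the second. The node numbers and buffer size are assumptions for the example.

<syntaxhighlight lang="c">
/* Sketch of NUMA-aware allocation on Linux using libnuma
 * (link with -lnuma). Node numbers and sizes are illustrative. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }
    size_t bytes = 64 * 1024 * 1024;

    /* Place one buffer on node 0 and, if present, one on node 1. */
    void *local  = numa_alloc_onnode(bytes, 0);
    void *remote = numa_max_node() >= 1
                 ? numa_alloc_onnode(bytes, 1) : NULL;

    printf("highest NUMA node: %d\n", numa_max_node());

    numa_free(local, bytes);
    if (remote) numa_free(remote, bytes);
    return 0;
}
</syntaxhighlight>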
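Cache coherence protocols themselves live in hardware, but their cost is visible to software. The sketch below shows ''false sharing'': two threads write to distinct variables that happen to occupy the same cache line, forcing the coherence system to shuttle that line between cores on every update. The 64-byte line size is an assumption about the hardware.

<syntaxhighlight lang="c">
/* False-sharing sketch (compile with -pthread). Two threads increment
 * distinct counters that share one cache line, so the coherence
 * protocol bounces the line between cores; padded_line shows the
 * usual fix of padding the counters onto separate lines. */
#include <pthread.h>
#include <stdio.h>

struct { long a; long b; } shared_line;               /* same cache line  */
struct { long a; char pad[64]; long b; } padded_line; /* separate lines   */

static void *bump(void *arg) {
    volatile long *p = arg;
    for (long i = 0; i < 100000000L; i++)
        (*p)++;                       /* each write invalidates the line
                                         in the other core's cache      */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, (void *)&shared_line.a);
    pthread_create(&t2, NULL, bump, (void *)&shared_line.b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", shared_line.a, shared_line.b);
    return 0;
}
</syntaxhighlight>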
As a result of this coherence overhead, shared memory computer architectures do not scale as well as distributed memory systems do.<ref name=PH713/>

Processor–processor and processor–memory communication can be implemented in hardware in several ways, including via shared (either multiported or [[Multiplexing|multiplexed]]) memory, a [[crossbar switch]], a shared [[Bus (computing)|bus]], or an interconnect network in any of a myriad of [[Network topology|topologies]] including [[Star network|star]], [[Ring network|ring]], [[Tree (graph theory)|tree]], [[Hypercube graph|hypercube]], fat hypercube (a hypercube with more than one processor at a node), or [[Mesh networking|n-dimensional mesh]]. Parallel computers based on interconnect networks need some kind of [[routing]] to enable the passing of messages between nodes that are not directly connected. The medium used for communication between the processors is likely to be hierarchical in large multiprocessor machines.
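As an example of routing on one such topology, the following sketch implements dimension-order (''e-cube'') routing in a hypercube: node addresses are bit strings, and each hop flips the lowest bit in which the current node and the destination differ. The function name is illustrative.

<syntaxhighlight lang="c">
/* Dimension-order ("e-cube") routing in a hypercube. Each hop moves
 * along the lowest dimension in which the current node's address
 * still differs from the destination's. */
#include <stdio.h>

void route(unsigned src, unsigned dst) {
    unsigned cur = src;
    printf("%u", cur);
    while (cur != dst) {
        unsigned diff = cur ^ dst;    /* dimensions still to cross */
        unsigned bit  = diff & -diff; /* lowest differing bit      */
        cur ^= bit;                   /* hop along that dimension  */
        printf(" -> %u", cur);
    }
    printf("\n");
}

int main(void) {
    /* In a 3-cube: 0 -> 1 -> 5 (binary 000 -> 001 -> 101). */
    route(0, 5);
    return 0;
}
</syntaxhighlight>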