=== Matrix multiplication ===
A common example is [[Matrix multiplication algorithm|matrix multiplication]]:
<syntaxhighlight lang="pascal" line="1">
for i in 0..n
  for j in 0..m
    for k in 0..p
      C[i][j] = C[i][j] + A[i][k] * B[k][j];
</syntaxhighlight>
By switching the looping order for <code>j</code> and <code>k</code>, the speedup in large matrix multiplications becomes dramatic, at least for languages that put contiguous array elements in the last dimension. This will not change the mathematical result, but it improves efficiency. In this case, "large" means, approximately, more than 100,000 elements in each matrix, or enough addressable memory such that the matrices will not fit in L1 and L2 caches.
<syntaxhighlight lang="pascal" line="1">
for i in 0..n
  for k in 0..p
    for j in 0..m
      C[i][j] = C[i][j] + A[i][k] * B[k][j];
</syntaxhighlight>
The reason for this speedup is that in the first case, the reads of <code>A[i][k]</code> are in cache (since the <code>k</code> index is the contiguous, last dimension), but <code>B[k][j]</code> is not, so there is a cache miss penalty on <code>B[k][j]</code>. <code>C[i][j]</code> is irrelevant, because it can be [[Loop-invariant_code_motion|hoisted]] out of the inner loop, since the loop variable there is <code>k</code>.
<syntaxhighlight lang="pascal" line="1">
for i in 0..n
  for j in 0..m
    temp = C[i][j]
    for k in 0..p
      temp = temp + A[i][k] * B[k][j];
    C[i][j] = temp
</syntaxhighlight>
In the second case, the reads and writes of <code>C[i][j]</code> are both in cache, the reads of <code>B[k][j]</code> are in cache, and the read of <code>A[i][k]</code> can be hoisted out of the inner loop.
<syntaxhighlight lang="pascal" line="1">
for i in 0..n
  for k in 0..p
    temp = A[i][k]
    for j in 0..m
      C[i][j] = C[i][j] + temp * B[k][j];
</syntaxhighlight>
Thus, the second example has no cache miss penalty in the inner loop while the first example has a cache penalty. On a year 2014 processor, the second case is approximately five times faster than the first case, when written in [[C (programming language)|C]] and compiled with <code>gcc -O3</code>. (A careful examination of the disassembled code shows that in the first case, [[GNU Compiler Collection|GCC]] uses [[SIMD]] instructions and in the second case it does not, but the cache penalty is much worse than the SIMD gain.){{Citation needed|date=September 2014}}

Temporal locality can also be improved in the above example by using a technique called [[Loop blocking|blocking]]. The larger matrix can be divided into evenly sized sub-matrices, so that the smaller blocks can be referenced (multiplied) several times while in memory. Note that this example works for square matrices of dimensions SIZE x SIZE, but it can easily be extended for arbitrary matrices by substituting SIZE_I, SIZE_J and SIZE_K where appropriate.
<syntaxhighlight lang="pascal" line="1">
for (ii = 0; ii < SIZE; ii += BLOCK_SIZE)
  for (kk = 0; kk < SIZE; kk += BLOCK_SIZE)
    for (jj = 0; jj < SIZE; jj += BLOCK_SIZE)
      maxi = min(ii + BLOCK_SIZE, SIZE);
      for (i = ii; i < maxi; i++)
        maxk = min(kk + BLOCK_SIZE, SIZE);
        for (k = kk; k < maxk; k++)
          maxj = min(jj + BLOCK_SIZE, SIZE);
          for (j = jj; j < maxj; j++)
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
</syntaxhighlight>
The temporal locality of the above solution is provided because a block can be used several times before moving on, so that it is moved in and out of memory less often. Spatial locality is improved because elements with consecutive memory addresses tend to be pulled up the memory hierarchy together.
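For readers who want to reproduce the loop-order comparison described above, a minimal, self-contained C sketch might look as follows. The matrix size <code>N</code>, the function names, and the timing harness are illustrative choices made here, not part of the original benchmark, and the measured ratio will depend on the processor and cache sizes.
<syntaxhighlight lang="c">
#include <stdio.h>
#include <time.h>

#define N 1024  /* illustrative size; large enough that the matrices exceed L1/L2 cache */

static double A[N][N], B[N][N], C[N][N];

/* i-j-k order: the inner loop strides down a column of B, so consecutive
   iterations touch addresses N doubles apart and B[k][j] misses in cache. */
static void multiply_ijk(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* i-k-j order: the inner loop walks rows of B and C contiguously,
   so reads of B[k][j] and reads/writes of C[i][j] stay in cache. */
static void multiply_ikj(void) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double a = A[i][k];          /* hoisted out of the inner loop */
            for (int j = 0; j < N; j++)
                C[i][j] += a * B[k][j];
        }
}

static double seconds(void (*f)(void)) {
    clock_t start = clock();
    f();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void) {
    /* fill A and B with arbitrary data */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (double)(i + j);
            B[i][j] = (double)(i - j);
        }
    printf("i-j-k: %.2f s\n", seconds(multiply_ijk));
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0;               /* reset the result between runs */
    printf("i-k-j: %.2f s\n", seconds(multiply_ikj));
    return 0;
}
</syntaxhighlight>
Compiling with <code>gcc -O3</code> and comparing the two printed timings should show the qualitative gap discussed above: the i-k-j ordering avoids the cache miss penalty in the inner loop, while the exact speedup varies by machine.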