=== Matrix multiplication ===
A common example is [[Matrix multiplication algorithm|matrix multiplication]]:
<syntaxhighlight lang="pascal" line="1">
for i in 0..n
  for j in 0..m
    for k in 0..p
      C[i][j] = C[i][j] + A[i][k] * B[k][j];
</syntaxhighlight>
By switching the looping order for <code>j</code> and <code>k</code>, the speedup in large matrix multiplications becomes dramatic, at least for languages that put contiguous array elements in the last dimension. This will not change the mathematical result, but it improves efficiency. In this case, "large" means, approximately, more than 100,000 elements in each matrix, or enough addressable memory such that the matrices will not fit in L1 and L2 caches.
<syntaxhighlight lang="pascal" line="1">
for i in 0..n
  for k in 0..p
    for j in 0..m
      C[i][j] = C[i][j] + A[i][k] * B[k][j];
</syntaxhighlight>
The reason for this speedup is that in the first case, the reads of <code>A[i][k]</code> are in cache (since the <code>k</code> index is the contiguous, last dimension), but <code>B[k][j]</code> is not, so there is a cache miss penalty on <code>B[k][j]</code>. <code>C[i][j]</code> is irrelevant, because it can be [[Loop-invariant_code_motion|hoisted]] out of the inner loop, since the loop variable there is <code>k</code>.
<syntaxhighlight lang="pascal" line="1">
for i in 0..n
  for j in 0..m
    temp = C[i][j]
    for k in 0..p
      temp = temp + A[i][k] * B[k][j];
    C[i][j] = temp
</syntaxhighlight>
In the second case, the reads and writes of <code>C[i][j]</code> are both in cache, the reads of <code>B[k][j]</code> are in cache, and the read of <code>A[i][k]</code> can be hoisted out of the inner loop.
<syntaxhighlight lang="pascal" line="1">
for i in 0..n
  for k in 0..p
    temp = A[i][k]
    for j in 0..m
      C[i][j] = C[i][j] + temp * B[k][j];
</syntaxhighlight>
Thus, the second example has no cache miss penalty in the inner loop while the first example has a cache penalty. On a year 2014 processor, the second case is approximately five times faster than the first case, when written in [[C (programming language)|C]] and compiled with <code>gcc -O3</code>. (A careful examination of the disassembled code shows that in the first case, [[GNU Compiler Collection|GCC]] uses [[SIMD]] instructions and in the second case it does not, but the cache penalty is much worse than the SIMD gain.){{Citation needed|date=September 2014}}

Temporal locality can also be improved in the above example by using a technique called [[Loop blocking|blocking]]. The larger matrix can be divided into evenly sized sub-matrices, so that the smaller blocks can be referenced (multiplied) several times while in memory. Note that this example works for square matrices of dimensions SIZE x SIZE, but it can easily be extended for arbitrary matrices by substituting SIZE_I, SIZE_J and SIZE_K where appropriate.
<syntaxhighlight lang="pascal" line="1">
for (ii = 0; ii < SIZE; ii += BLOCK_SIZE)
  for (kk = 0; kk < SIZE; kk += BLOCK_SIZE)
    for (jj = 0; jj < SIZE; jj += BLOCK_SIZE)
      maxi = min(ii + BLOCK_SIZE, SIZE);
      for (i = ii; i < maxi; i++)
        maxk = min(kk + BLOCK_SIZE, SIZE);
        for (k = kk; k < maxk; k++)
          maxj = min(jj + BLOCK_SIZE, SIZE);
          for (j = jj; j < maxj; j++)
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
</syntaxhighlight>
The temporal locality of the above solution is provided because a block can be used several times before moving on, so that it is moved in and out of memory less often. Spatial locality is improved because elements with consecutive memory addresses tend to be pulled up the memory hierarchy together.
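For readers who want to reproduce the loop-order comparison described above, a minimal, self-contained C sketch might look as follows. The matrix size <code>N</code>, the function names, and the timing harness are illustrative choices made here, not part of the original benchmark, and the measured ratio will depend on the processor and cache sizes.
<syntaxhighlight lang="c">
#include <stdio.h>
#include <time.h>

#define N 1024  /* illustrative size; large enough that the matrices exceed L1/L2 cache */

static double A[N][N], B[N][N], C[N][N];

/* i-j-k order: the inner loop strides down a column of B, so consecutive
   iterations touch addresses N doubles apart and B[k][j] misses in cache. */
static void multiply_ijk(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* i-k-j order: the inner loop walks rows of B and C contiguously,
   so reads of B[k][j] and reads/writes of C[i][j] stay in cache. */
static void multiply_ikj(void) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double a = A[i][k];          /* hoisted out of the inner loop */
            for (int j = 0; j < N; j++)
                C[i][j] += a * B[k][j];
        }
}

static double seconds(void (*f)(void)) {
    clock_t start = clock();
    f();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void) {
    /* fill A and B with arbitrary data */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (double)(i + j);
            B[i][j] = (double)(i - j);
        }
    printf("i-j-k: %.2f s\n", seconds(multiply_ijk));
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0;               /* reset the result between runs */
    printf("i-k-j: %.2f s\n", seconds(multiply_ikj));
    return 0;
}
</syntaxhighlight>
Compiling with <code>gcc -O3</code> and comparing the two printed timings should show the qualitative gap discussed above: the i-k-j ordering avoids the cache miss penalty in the inner loop, while the exact speedup varies by machine.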