Lec 12: Cache Memories
Introduction
- Cache memories are small, fast SRAM-based memories managed automatically in hardware
- Hold frequently accessed blocks of main memory
- Note that data is copied in block-sized transfer units
- CPU looks first for data in cache memory
General Cache Organization
First of all, cache memories are managed entirely by hardware. The hardware must therefore be able to look up a block in the cache quickly, which means the cache has to be organized in a strict yet simple way.
As shown in the picture, the cache memory contains \(S=2^s\) sets. Each set contains \(E = 2^e\) lines.
Each line consists of
- a valid bit
  - indicating whether the block holds valid data, since the bits are arbitrary right after power-up
- a tag
- a block that contains \(B=2^b\) bytes
Finally, the size of the cache counts only the data blocks (not the tag or valid bits), i.e. cache size \(C = S \times E \times B\).
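For a quick sanity check (numbers assumed purely for illustration): a cache with \(S = 64\) sets, \(E = 8\) lines per set, and \(B = 64\)-byte blocks has \(C = 64 \times 8 \times 64 = 32768\) bytes, i.e. 32 KB.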
Cache Operations
Cache Read
We divide the address of a word into three parts: tag, set index, and block offset.
Steps:
- Use the set index to go to the corresponding set
- Check the lines in that set for one whose
  - valid bit is 1
  - and whose tag equals the tag bits of the address
- If such a line exists (a hit), use the \(b\) block-offset bits to locate the requested data within that block (see the address-decomposition sketch below)
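The field extraction itself is just bit manipulation. Below is a minimal sketch (not from the lecture), assuming example values for \(s\) and \(b\) of 6 bits each, i.e. 64 sets and 64-byte blocks:

#include <stdint.h>
#include <stdio.h>

/* Assumed example parameters: S = 2^s sets, B = 2^b bytes per block. */
#define S_BITS 6   /* s: number of set-index bits    */
#define B_BITS 6   /* b: number of block-offset bits */

int main(void) {
    uint64_t addr = 0x7ffce2a8;  /* arbitrary example address */

    uint64_t offset = addr & ((1ULL << B_BITS) - 1);             /* low b bits          */
    uint64_t set    = (addr >> B_BITS) & ((1ULL << S_BITS) - 1); /* next s bits         */
    uint64_t tag    = addr >> (B_BITS + S_BITS);                 /* remaining high bits */

    printf("tag = %#llx, set = %llu, offset = %llu\n",
           (unsigned long long)tag, (unsigned long long)set,
           (unsigned long long)offset);
    return 0;
}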
Disadvantages of Direct-Mapped Cache Simulation
Definition of direct-mapped cache: cache where \(e = 0\) and thus \(E = 1\)
Suppose we have a cache memory where
- memory size \(M=16\) bytes (4-bit addresses)
- \(B=2\) bytes/block
- \(S=4\) sets
- \(E=1\) block/set
  - i.e. the 4-bit address splits into \(t = 1\) tag bit, \(s = 2\) set-index bits, and \(b = 1\) block-offset bit
If we read addresses in the order shown, alternating between two blocks that map to the same set, there will be a conflict miss when reading address 0 again. This is caused by the small number of lines per set (here only one); the sketch below makes it concrete.
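Here is a tiny simulation (not from the lecture) of exactly this configuration; the address sequence is an assumption, chosen so that two blocks map to the same set:

#include <stdio.h>

/* Direct-mapped cache: S = 4 sets, E = 1 line/set, B = 2 bytes/block,
 * 4-bit addresses (M = 16 bytes). Parameters match the example above;
 * the access sequence below is just one illustrative choice.          */
#define S 4
#define B 2

int main(void) {
    int valid[S] = {0};
    int tag[S]   = {0};
    int addrs[]  = {0, 1, 7, 8, 0};   /* assumed access sequence */
    int n = sizeof(addrs) / sizeof(addrs[0]);

    for (int i = 0; i < n; i++) {
        int a   = addrs[i];
        int set = (a / B) % S;        /* set index */
        int t   = (a / B) / S;        /* tag       */
        int hit = valid[set] && tag[set] == t;
        printf("addr %2d -> set %d, tag %d: %s\n", a, set, t, hit ? "hit" : "miss");
        valid[set] = 1;               /* on a miss the new block replaces the old one */
        tag[set]   = t;               /* (no-op on a hit)                             */
    }
    return 0;
}

Addresses 0 and 8 both map to set 0, so the second read of address 0 misses even though the cache is nowhere near full: a conflict miss.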
E-way Set Associative Cache
Take \(E=2\) as an example:
If there is no match:
- one line in the set is selected for eviction and replacement
- replacement policies: random, least recently used (LRU), ...
Advantages of E-way Set Associative Cache
As shown in the picture, if we read the same sequence of addresses, there will NOT be a conflict miss when reading address 0 again, since one set can now hold both blocks that map to it.
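For comparison, here is a minimal 2-way version of the earlier sketch with LRU replacement, keeping the same total capacity (\(S = 2\) sets, \(E = 2\), \(B = 2\)) and the same assumed address sequence. Both blocks that map to set 0 now fit, so re-reading address 0 hits:

#include <stdio.h>

/* 2-way set-associative cache with LRU: S = 2 sets, E = 2 lines/set,
 * B = 2 bytes/block (same 8-byte capacity as the direct-mapped sketch). */
#define S 2
#define E 2
#define B 2

int main(void) {
    int valid[S][E] = {{0}}, tag[S][E] = {{0}}, lru[S][E] = {{0}};
    int addrs[] = {0, 1, 7, 8, 0};    /* same assumed access sequence */
    int n = sizeof(addrs) / sizeof(addrs[0]);

    for (int i = 0; i < n; i++) {
        int a   = addrs[i];
        int set = (a / B) % S;
        int t   = (a / B) / S;
        int way = -1, victim = 0;

        for (int w = 0; w < E; w++) {             /* look for a matching line */
            if (valid[set][w] && tag[set][w] == t) way = w;
            if (lru[set][w] < lru[set][victim])    victim = w;   /* least recently used */
        }
        printf("addr %2d -> set %d, tag %d: %s\n", a, set, t, way >= 0 ? "hit" : "miss");

        if (way < 0) {                            /* miss: evict the LRU line */
            way = victim;
            valid[set][way] = 1;
            tag[set][way]   = t;
        }
        lru[set][way] = i + 1;                    /* update recency */
    }
    return 0;
}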
Cache Write
Writes are somewhat more complicated than reads, since multiple copies of the data may exist in the system:
- L1, L2, L3, Main Memory, Disk, ...
So, there are two problems we have to consider:
- What to do on a write-hit?
  - Write-through (i.e. write immediately to memory as well as to the cache)
  - Write-back (i.e. defer the write to memory until the line is replaced)
    - this requires a dirty bit
- What to do on a write-miss?
  - Write-allocate (i.e. first load the block into the cache, then update the line in the cache)
    - good if more writes to the same location follow
  - No-write-allocate (i.e. write straight to memory, without loading the block into the cache)
Typically,
- Write-through + No-write-allocate
- Write-back + Write-allocate
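The write-back + write-allocate combination is the one that needs the dirty bit. To make the interaction concrete, here is a minimal single-line sketch; the simulated memory, sizes, and names are assumptions for illustration, not any real hardware interface:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define B 64                                  /* bytes per block (assumed)   */
#define MEM_BLOCKS 1024                       /* size of simulated memory    */

static unsigned char memory[MEM_BLOCKS][B];   /* simulated main memory       */

struct line {
    bool     valid;
    bool     dirty;                           /* cached copy differs from memory */
    unsigned blockno;                         /* which memory block the line holds */
    unsigned char data[B];
};

/* Write one byte at (blockno, offset) through a single-line "cache". */
void cache_write(struct line *l, unsigned blockno, unsigned offset, unsigned char value) {
    if (!(l->valid && l->blockno == blockno)) {      /* write miss                     */
        if (l->valid && l->dirty)                    /* write back the evicted line    */
            memcpy(memory[l->blockno], l->data, B);
        memcpy(l->data, memory[blockno], B);         /* write-allocate: load the block */
        l->valid   = true;
        l->blockno = blockno;
    }
    l->data[offset] = value;                         /* update the cache only          */
    l->dirty = true;                                 /* memory is updated on eviction  */
}

int main(void) {
    struct line l = {0};
    cache_write(&l, 3, 5, 42);   /* miss: block 3 is loaded, then modified       */
    cache_write(&l, 3, 6, 43);   /* hit: no memory traffic at all                */
    cache_write(&l, 7, 0, 1);    /* miss: dirty block 3 is written back first    */
    printf("memory[3][5] = %d\n", memory[3][5]);     /* 42, flushed on eviction  */
    return 0;
}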
Cache Hierarchy for Intel Core i7
Take the Intel Core i7 as an example. It is not hard to see that the number of sets in L1, L2, and L3 is
- L1: \(64 = 2^6\)
- L2: \(512 = 2^9\)
- L3: \(8192 = 2^{13}\)
respectively.
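These counts follow from \(S = \frac{C}{E \times B}\). Assuming the commonly cited Core i7 parameters (64-byte blocks everywhere; L1: 32 KB, 8-way; L2: 256 KB, 8-way; L3: 8 MB, 16-way):
- L1: \(S = \frac{32 \times 1024}{8 \times 64} = 64\)
- L2: \(S = \frac{256 \times 1024}{8 \times 64} = 512\)
- L3: \(S = \frac{8 \times 1024 \times 1024}{16 \times 64} = 8192\)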
Cache Performance Metrics
Miss Rate
- Fraction of memory references not found in cache (misses / accesses) = 1 - hit rate
- Typical numbers (in percentages):
- 3-10% for L1
- Can be quite small (e.g., < 1%) for L2, depending on size, etc.
Hit Time
- Time to deliver a line in the cache to the processor
  - Includes time to determine whether the line is in the cache
- Typical numbers:
  - 4 clock cycles for L1
  - 10 clock cycles for L2
Miss Penalty
- Additional time required because of a miss
- Typically 50-200 cycles for main memory (Trend: increasing!)
Formula: \(\mathrm{Average \ Access \ Time} = \mathrm{Hit \ Time} + \mathrm{Miss \ Rate} \times \mathrm{Miss \ Penalty}\)
For instance, assuming a 1-cycle hit time and a 100-cycle miss penalty,
- 97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
- 99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles
So, a small difference in hit rate has a huge impact on the average access time (AAT).
The Memory Mountain
Test function:
long data[MAXELEMS]; /* Global array to traverse */

/* test - Iterate over first "elems" elements of
 *        array "data" with stride of "stride",
 *        using 4x4 loop unrolling.
 */
int test(int elems, int stride) {
    long i, sx2 = stride * 2, sx3 = stride * 3, sx4 = stride * 4;
    long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    long length = elems;
    long limit = length - sx4;

    /* Combine 4 elements at a time
     * for the sake of parallelization
     */
    for (i = 0; i < limit; i += sx4) {
        acc0 = acc0 + data[i];
        acc1 = acc1 + data[i + stride];
        acc2 = acc2 + data[i + sx2];
        acc3 = acc3 + data[i + sx3];
    }

    /* Finish any remaining elements */
    for (; i < length; i++) {
        acc0 = acc0 + data[i];
    }
    return ((acc0 + acc1) + (acc2 + acc3));
}
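To actually trace out the mountain, each (working-set size, stride) pair is timed and converted into a read throughput. Below is a minimal driver sketch; it uses clock() instead of the cycle counters the original lab uses, and the sizes, strides, and repetition count are assumptions:

#include <stdio.h>
#include <time.h>

int test(int elems, int stride);   /* defined above, reads the global data[] */

/* Rough read throughput (MB/s) for one (working-set size, stride) point.
 * clock() is far coarser than a cycle counter, so the traversal is
 * repeated to get a measurable duration.                                */
double run(long size_bytes, int stride) {
    int elems = (int)(size_bytes / sizeof(long));
    int reps  = 10;
    test(elems, stride);                           /* warm-up pass             */
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        test(elems, stride);
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    if (secs <= 0.0) secs = 1e-9;                  /* avoid division by zero   */
    /* only every stride-th element is actually touched */
    return reps * ((double)size_bytes / stride / (1024.0 * 1024.0)) / secs;
}

int main(void) {
    /* sizes and strides assumed; sizes must not exceed MAXELEMS longs */
    for (long size = 64L * 1024 * 1024; size >= 16 * 1024; size /= 2)
        for (int stride = 1; stride <= 12; stride++)
            printf("size=%9ld stride=%2d %10.1f MB/s\n", size, stride, run(size, stride));
    return 0;
}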
As you can see from the resulting plot:
- when stride = 1, the CPU recognizes the sequential access pattern and applies aggressive hardware prefetching, which accelerates the reads.
- once the stride reaches 8 (s8), the downward slope disappears.
  - This is because the block size is 64 bytes, i.e. 8 longs per block, so with a stride of 8 or more every access touches a new block and misses.
Optimization
Spatial Locality
Temporal Locality
See 矩阵分块操作 (matrix blocking) for details.
Overall (here \(B\) denotes the blocking/tile size, not the cache block size),
- Without blocking: \(\frac 9 8 n^3\) misses
- With blocking (provided \(3B^2 < C\)): \((\frac n B)^2 \cdot 2\frac n B \cdot \frac {B^2} 8 = \frac 1 {4B} n^3\) misses
  - \(\frac {B^2} 8\) is the number of cache misses for one tile. Note that because a tile is small, the cache can hold all of its data, so it makes no difference whether it is traversed row-major or column-major.
  - This is different from the unoptimized case. For example, without blocking, under column-major access, by the time we access arr[0][1] the line holding arr[0][0-7] has long since been evicted from the cache, so the access is a miss. With blocking, when we access arr[0][1] the cache is not yet full, so arr[0][0-7] is still resident and the access is a hit.
  - The difference, then, is that with blocking under \(3B^2 < C\), the data of one tile is never evicted from the cache during the multiplication of two tiles, so it can be reused over and over, which improves temporal locality.

Fundamentally, matrix multiplication only touches \(\mathcal O(n^2)\) data but performs \(\mathcal O(n^3)\) computation, so there must be a large amount of data reuse, and that is exactly what we can exploit.
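For reference, a minimal sketch of the blocked multiplication being analyzed (row-major \(n \times n\) matrices of doubles; the dimension, the tile size, and the function name bmm are assumptions, with the tile size chosen so that three tiles fit in the cache):

#include <stdio.h>
#include <stdlib.h>

#define N   512    /* matrix dimension n (assumed)                              */
#define BLK 32     /* tile size; chosen so 3 * BLK * BLK doubles fit in cache   */

/* c += a * b, all N x N, row-major. Tiles of a, b and c are reused while
 * they are still resident in the cache, which is the whole point.         */
void bmm(const double *a, const double *b, double *c) {
    for (int ii = 0; ii < N; ii += BLK)
        for (int jj = 0; jj < N; jj += BLK)
            for (int kk = 0; kk < N; kk += BLK)
                /* multiply one BLK x BLK tile pair into the c tile */
                for (int i = ii; i < ii + BLK; i++)
                    for (int j = jj; j < jj + BLK; j++) {
                        double sum = c[i * N + j];
                        for (int k = kk; k < kk + BLK; k++)
                            sum += a[i * N + k] * b[k * N + j];
                        c[i * N + j] = sum;
                    }
}

int main(void) {
    double *a = calloc((size_t)N * N, sizeof(double));
    double *b = calloc((size_t)N * N, sizeof(double));
    double *c = calloc((size_t)N * N, sizeof(double));
    if (!a || !b || !c) return 1;
    a[0] = 2.0; b[0] = 3.0;
    bmm(a, b, c);
    printf("c[0][0] = %.1f\n", c[0][0]);   /* 6.0 */
    free(a); free(b); free(c);
    return 0;
}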