
Cache and Memory Latency

Benchmarks : Measuring Cache and Memory Latency (access patterns, paging and TLBs)

What is Latency?

In this context, latency is the time (either in clocks or nano-seconds) taken to transfer a block of data from main memory or from the caches. We want the data as quickly as possible, thus the lower the time the better. The size of the data block we request is usually the size of a native pointer (4 bytes in 32-bit, 8 in 64-bit).

As a CPU executes instructions, both the instructions themselves and the data they operate on must be brought into its registers; until an instruction or its data is available, the CPU cannot execute it and must wait (out-of-order designs can execute other instructions whose data is available, but even they eventually stall waiting for data).

The latency is thus the time the CPU waits to obtain the data, generally expressed in CPU clocks (1/frequency) for the caches (as they usually run at CPU speed) and in nano-seconds (10^-9 s) for main memory.

Why is it important to measure it?

The latency of the main memory directly influences the efficiency of the CPU, thus its performance: reducing wait time can be more important than increasing execution speed. Unfortunately, main memory latency is huge relative to the CPU clock (today by a factor of 100 or more): a CPU waiting 100 clocks for every data item would run at 1/100 efficiency, i.e. 1% of its theoretical performance!

Modern CPUs have internal "memory caches" that mirror instructions/data from main memory but at far lower latencies; they allow the CPU to wait far less for instructions/data and are crucial to efficiency. Unfortunately, the faster the cache, the smaller it must be; thus modern CPUs contain a cache hierarchy (e.g. L1D, L2, L3) of progressively larger but slower levels - longer latencies, but still far shorter than main memory.

Memory is not only differentiated by the speed it runs at (MHz): the timings of the various access/command/transfer sequences are configurable through the memory timings (e.g. tCAS/CL, tRP, tRCD, tRAS, etc.), which are stored in the memory modules' EEPROM (SPD) and set by the BIOS. The lower the latencies supported for the various commands, the lower the overall latency when accessing memory - which we can measure.

Another factor we measure is the ratio of memory latency vs. CPU L1D latency, i.e. how much slower memory is compared to the CPU caches. Here, too, the lower the better: it means memory is not very slow compared to the caches.

Are the Cache / Memory latencies fixed?

No. Modern CPUs also contain "prefetchers" which bring data into the caches speculatively, i.e. they guess which instructions/data will be needed next and fetch them so they are ready when needed. Thus the CPU does not need to wait for the data to be brought all the way from main memory but gets it from the cache.

Prefetchers work by recognising patterns in data accesses (spatial, temporal, etc.) as code executes. Thus the latency of accessing data depends entirely on whether the prefetchers have "understood" the pattern and have fetched the right data into the caches.

Sandra allows you to test various access patterns and thus observe the latencies of the various cache levels and memory, as well as the effect of the prefetchers.

Are there any other latencies that influence the result?

Yes. Programs (apps) do not access physical memory directly ("real mode") but virtualised memory through "paging". Paging simplifies memory management by mapping (non-contiguous) physical memory into a contiguous virtual address space, as well as extending "real memory" with disk space through the page/swap file.

Memory is thus allocated and managed in fixed-size blocks ("page size"), while the operating system's (or run-time's) memory manager services application memory requests.

The page size is 4kB on both x86/x64, which is very small now that computers with 8-16GB of memory are common, resulting in huge page tables: this is why newer CPUs support "large pages" (2MB) or even "huge pages" (1GB - not supported by Windows yet).

What is the TLB?

The "Page Table" is what maps virtual to physical addresses and thus virtual pages to real memory. The TLB (translation look-aside buffer) is a CPU feature that caches the recent mappings from the page table.

If the TLB does not contain the required mapping, i.e. a "TLB miss", the page table itself must be walked, which is very much slower: the "page-walk hit". CPUs typically contain multiple TLB levels - just like cache levels - but the last level typically has only 512 entries x 4kB page = 2MB of coverage ("TLB range"). This is very small compared to the 8-16GB of memory in today's computers.

How does this relate to latency measurement?

As the TLB range is relatively small, an algorithm accessing a large memory block in a random pattern is likely to miss the TLB and thus incur the "TLB miss" penalty. The total latency to access a data item not cached in the L1D/L2 caches is thus not just the L3/memory access latency but also this additional latency.

The latency values published by the manufacturers are naturally "best case", and include only L1D/L2/L3/Memory access times and not any additional latencies incurred in practice.

Due to the small native page size and thus small TLB range, we do not believe it is realistic for algorithms to avoid the "page-walk hit" when accessing memory outside L1D/L2.

Why not use "Large pages" when working with large memory blocks?

Unfortunately there are major stumbling blocks that applications encounter when attempting to use large pages, many due to the operating system, Windows. This is why only server applications (e.g. SQL Server, Exchange, etc.) and very few desktop applications support large pages.

NB: Where applicable, all benchmarks in Sandra support large pages. You can disable this by unchecking "use large/huge pages" in the module options.

  • Large pages are supported only on Business Windows SKUs (Professional, Enterprise, Ultimate, Server)
  • By default nobody has the privilege to allocate large pages, aka "lock pages in memory". Either Group Policy or Local Security Policy must be edited to give Administrators or all users the right to allocate large pages.
  • Large pages naturally require bigger (2MB) contiguous physical memory blocks; a computer that has been running for a while might have enough free memory, but too fragmented to allocate large pages.

Thus the performance using native pages is the most relevant today, until large-page use becomes widespread - at least under Windows.

Sandra allows you to test various access patterns and thus observe the latencies of the various cache levels and memory, as well as the effect of the prefetchers:

  • Sequential Access Pattern: Memory is accessed sequentially which is an easy pattern for prefetchers - "a show-case for prefetchers"; thus the latencies will be "best case", very much reduced.

  • In-Page Random Access Pattern: Memory is accessed in a random pattern within the page (either native or large): this ensures there are no "TLB miss" latencies, just raw cache/memory latencies. Some prefetchers (e.g. "adjacent line prefetcher") still have an impact.

  • Full Random Access Pattern: Memory is accessed in a random pattern within the whole block. Large blocks may incur a "TLB miss" depending on the "TLB range".

Hardware Specifications

Here are the CPUs and memory systems we are comparing in this article:

Specs                     | Intel i7-965 (Nehalem)   | Intel i5-661 (Westmere-A) | Intel i5-2500K (Sandy Bridge) | Intel i7-3960X (Sandy Bridge E)
Speed - Turbo             | 3.2 / 3.6GHz             | 3.3 / 3.6GHz              | 3.3 / 3.7GHz                  | 3.3 / 3.9GHz
Cores (CU) / Threads (SP) | 4C / 8T                  | 2C / 4T                   | 4C / 4T                       | 6C / 12T
Caches (L1 / L2 / L3)     | 4x 32kB / 4x 256kB / 8MB | 2x 32kB / 2x 256kB / 4MB  | 4x 32kB / 4x 256kB / 8MB      | 6x 32kB / 6x 256kB / 15MB
Memory (Speed / Latency)  | 3x DDR3 1333MHz 9-9-9-25 | 2x DDR3 1333MHz 9-9-9-25  | 2x DDR3 1600MHz 11-11-11-29   | 4x DDR3 1600MHz 9-9-9-26


Cache and Memory Latencies

In this article we are comparing cache and memory latencies of various processors in Sandra's latency benchmarks with different access patterns and also observing paging/TLB issues:

Note: The latency scale is logarithmic, not linear, as latencies increase by orders of magnitude going from cache levels to main memory. A latency 10x higher may thus appear only ~2x taller, keeping the different levels of the graph similar in size.

CPU / Pattern  | L1D (clk) | L2 (clk) | L3 (clk) | Memory (ns)

Intel i7-965 (Nehalem)
Sequential     | 4 clk | 12 clk | 15 clk |  7.2 ns
In-Page Random | 4 clk | 12 clk | 22 clk | 24.1 ns
Full Random    | 4 clk | 12 clk | 40 clk | 62.5 ns

While there is no change to the L1/L2 latencies, going from in-page to full random the L3 latency doubles due to "TLB miss" while the main memory latency nearly triples. The larger the tested memory block, the more TLB misses. Sequential access shows just how much the stride/sequential prefetchers help.

Intel i5-661 (Westmere-A)
Sequential     | 4 clk | 10 clk | 13 clk |   8.5 ns
In-Page Random | 4 clk | 10 clk | 24 clk |  37.1 ns
Full Random    | 4 clk | 10 clk | 50 clk | 108.4 ns

Again no change in the L1/L2 latencies, but L3 latency doubles from in-page to full random while main memory latency almost triples. Compared to Nehalem, Westmere has similar cache latencies but ~50% higher memory latencies (37 vs. 24ns in-page, 108 vs. 62ns out-of-page) at the same memory bus speed (1333MHz), due to its off-die memory controller.

Intel i5-2500K (Sandy Bridge)
Sequential     | 4 clk | 11 clk | 13 clk |  7.7 ns
In-Page Random | 4 clk | 11 clk | 14 clk | 25.3 ns
Full Random    | 4 clk | 11 clk | 28 clk | 75.6 ns

While L1D and L2 remain unchanged, L3 latency decreases by about 33% versus Nehalem - no mean feat! This is one reason Sandy Bridge is faster than the older Nehalem/Lynnfield/Westmere CPUs. Memory latency does not improve: even though the memory runs faster (1600 vs. 1333MHz), it runs at higher timings (11-11-11-29 vs. 9-9-9-25), which shows just how much the latencies the memory supports, not just its speed, matter!

Intel i7-3960X (Sandy Bridge E)
Sequential     | 4 clk | 11 clk | 14 clk |  6.0 ns
In-Page Random | 4 clk | 11 clk | 18 clk | 22.0 ns
Full Random    | 4 clk | 11 clk | 38 clk | 65.8 ns

Sandy Bridge-E's L1D/L2 caches match its smaller brother's, but its L3 cache is about 25% slower even though it is almost twice the size (15MB vs. 8MB): it must serve twice as many cores (8 vs. 4, though 2 are disabled in the consumer version). With memory running at lower timings (back to 9-9-9-26 vs. 11-11-11-29) and twice as many memory channels (4 vs. 2), memory latency decreases by ~14%.

Final Thoughts / Conclusions

We have shown that there is no single "latency": latencies vary greatly with access pattern, and with page size once we move out to L3/main memory. The way applications access memory (access pattern) and the way they allocate and manage memory (page size/number of pages) directly influence the latencies they will experience.

Access pattern matters: prefetchers matter. In all cases, the sequential-access memory latency is ~1/4 of the in-page random latency (7/24, 8.5/37, 7.7/25, etc.), thus prefetchers improve effective latency by ~4x! This is the reason modern CPUs have complex prefetchers; they have a huge impact on CPU performance: without them the CPU cores could not reach the performance levels we see even if their execution performance were better.

Paging matters for large blocks: TLBs matter. In all cases, the "full random" latency is ~2-3x higher than the "in-page random" latency (24/63, 37/108, 25/75, etc.). This is the reason CPUs have multiple TLB levels: the "TLB miss/page-walk hit" latency is significant.

As the memory used by applications increases, TLB misses increase; thus large pages should be used - which Windows does not make easy. Perhaps CPUs should have much larger TLBs to compensate for operating system/application deficiencies until the situation improves.

Sandra allows you to change both the access pattern and page size (where possible) thus allowing you to measure the different latencies under the different test conditions. As always, there is no single "right answer", it just depends on the conditions of the test.
