language=Java, basicstyle=, commentstyle=, stringstyle=, showstringspaces=false, breaklines=true, belowcaptionskip=0.3frameround=ffff, frame=single, rulecolor=, tabsize=1, keywordstyle=, columns=fullflexible, morekeywords=public, class morekeywords=function, keywordstyle=[2], escapeinside=*@@*,

Victima: Drastically Increasing Address Translation Reach
by Leveraging Underutilized Cache Resources

Konstantinos Kanellopoulos¹ Hong Chul Nam¹ F. Nisa Bostanci¹ Rahul Bera¹
Mohammad Sadrosadati¹ Rakesh Kumar² Davide Basilio Bartolini³ Onur Mutlu¹
¹ETH Zürich ²Norwegian University of Science and Technology ³Huawei Zurich Research Center

(2023)

Abstract.

Address translation is a performance bottleneck in data-intensive workloads due to large datasets and irregular access patterns that lead to frequent high-latency page table walks (PTWs). PTWs can be reduced by using (i) large hardware TLBs or (ii) large software-managed TLBs. Unfortunately, both solutions have significant drawbacks: increased access latency, power and area (for hardware TLBs), and costly memory accesses, the need for large contiguous memory blocks, and complex OS modifications (for software-managed TLBs).

We present Victima, a new software-transparent mechanism that drastically increases the translation reach of the processor by leveraging the underutilized resources of the cache hierarchy. The key idea of Victima is to repurpose L2 cache blocks to store clusters of TLB entries, thereby providing an additional low-latency and high-capacity component that backs up the last-level TLB and thus reduces PTWs. Victima has two main components. First, a PTW cost predictor (PTW-CP) identifies costly-to-translate addresses based on the frequency and cost of the PTWs they lead to. Leveraging the PTW-CP, Victima uses the valuable cache space only for TLB entries that correspond to costly-to-translate pages, reducing the impact on cached application data. Second, a TLB-aware cache replacement policy prioritizes kee** TLB entries in the cache hierarchy by considering (i) the translation pressure (e.g., last-level TLB miss rate) and (ii) the reuse characteristics of the TLB entries.

Our evaluation results show that in native (virtualized) execution environments Victima improves average end-to-end application performance by 7.4% (28.7%) over the baseline four-level radix-tree-based page table design and by 6.2% (20.1%) over a state-of-the-art software-managed TLB, across 11 diverse data-intensive workloads. Victima delivers similar performance as a system that employs an optimistic 128K-entry L2 TLB, while avoiding the associated area and power overheads. Victima (i) is effective in both native and virtualized environments, (ii) is completely transparent to application and system software, (iii) unlike large software-managed TLBs, does not require contiguous physical allocations, (iv) is compatible with modern large page mechanisms and (iv) incurs very small area and power overheads of $0.04\%$ and $0.08\%$ , respectively, on a modern high-end CPU. The source code of Victima is freely available at https://github.com/CMU-SAFARI/Victima.

Virtual Memory, Cache, TLB, Virtualization, Microarchitecture,
Address Translation, Memory Hierarchy, Memory Systems

^†^†journalyear: 2023^†^†copyright: acmlicensed^†^†conference: 56th Annual IEEE/ACM International Symposium on Microarchitecture; October 28-November 1, 2023; Toronto, ON, Canada^†^†booktitle: 56th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ’23), October 28-November 1, 2023, Toronto, ON, Canada^†^†price: 15.00^†^†doi: 10.1145/3613424.3614276^†^†isbn: 979-8-4007-0329-4/23/10^†^†ccs: Hardware Memory and dense storage^†^†ccs: Software and its engineering Virtual memory

1. Introduction

Address translation is a significant performance bottleneck in modern data-intensive workloads (vm2, ; karakostas_characterIISWC, ; isca2010-barr-trancache, ; 5-levelpaging, ; contiguitas2023, ; radiantISMM21, ; bhattacharjeePACT2009, ; devirtualizingASPLOS2018, ; hash_dont_cache, ; virtualizationimplication, ; vm29, ). To enable fast address translation, modern processors employ a two-level translation look-aside buffer (TLB) hierarchy that caches recently used virtual-to-physical address translations. However, with the very large data footprints of modern workloads, the last-level TLB (L2 TLB) experiences high miss rate (misses per kilo instructions; MPKI), leading to high-latency page table walks (PTWs) that negatively impact application performance. Virtualized environments exacerbate the PTW latency as they impose two-level address translation (e.g., up to 24 memory accesses can occur during a PTW in a system with nested paging (amdnested, ; google-nested, )), resulting in even higher address translation overheads compared to native execution environments. Therefore, it is crucial to increase the translation reach (i.e., the maximum amount of memory that can be covered by the processor’s TLB hierarchy) to improve the effectiveness of TLBs and thus minimize PTWs. Doing so becomes increasingly important as PTW latency continues to rise with modern processors’ deeper multi-level page table (PT) designs (e.g., 5-level radix PT in the latest Intel processors (5-levelpaging, )).

Previous works have proposed various solutions to reduce the high cost of address translation and increase the translation reach of the TLBs such as employing (i) large hardware TLBs (sharedl3tlbISCA2011, ; distlltlbMICRO2018, ; gpustealing, ) or (ii) backing up the last-level TLB with a large software-managed TLB (pomtlbISCA2017, ; csaltMICRO2017, ; softwareTLBNAS2013, ; softwareTLBISCA2013, ; uhlig94, ; bruceMMU1998, ; softcontrolcachesISCA1986, ; Nagle1993DesignTF, ; Bala1994SoftwarePA, ). Unfortunately, both solutions have significant drawbacks: increased access latency, power, and area (for hardware TLBs), and costly memory accesses, the need for large contiguous memory blocks, and complex OS modifications (for software-managed TLBs).

Drawback of Large Hardware TLBs. First, a larger TLB has larger access latency (e.g., 1.4x larger latency for every 2x increase in size as reported by CACTI 7.0 (cacti, )), which may partially or entirely offset the performance gains due to fewer TLB misses. Second, a larger TLB leads to larger chip area and higher power consumption, resulting in higher costs and challenges in managing power constraints within the system. Third, a larger TLB may only benefit a specific subset of workloads, making it challenging to justify its applicability in a general-purpose system where some workloads are not sensitive to address translation performance. Section 3.1 provides a detailed quantitative analysis of (i) increasing the size of a conventional last-level TLB and (ii) expanding the TLB hierarchy with a hardware L3 TLB, considering both realistic and optimistic (i.e, ideal) TLB designs.

Drawbacks of Large Software-Managed TLBs. First, to look up a software-managed TLB (STLB), the processor fetches STLB entries from the main memory into the cache hierarchy, resulting in a translation latency comparable to that of a PTW. Hence, an STLB is more effective when PTW latency is higher than the STLB access latency (e.g., as in virtualized environments). Second, storing an STLB in memory requires allocating large contiguous memory blocks during runtime (on the order of 10’s of MB (pomtlbISCA2017, )). Third, an STLB introduces complex hardware/software interactions (e.g., evicting data from a hardware TLB to an STLB) and requires modifications in OS software. Section 3.2 provides a detailed quantitative analysis of STLBs.

Opportunity: Leveraging the Cache Hierarchy. Rather than expanding hardware TLBs or introducing large software-managed TLBs, a cost-effective method to drastically increase translation reach is to store the existing TLB entries within the existing cache hierarchy. For example, a 2MB L2 cache can fit $128\times$ the TLB entries a 2048-entry L2 TLB holds. When a TLB entry resides inside the L2 cache, only one low-latency (e.g., $\approx$ 16 cycles) L2 access is needed to find the virtual-to-physical address translation instead of performing a high-latency (e.g., $\approx 137$ cycles as shown in §3) PTW. One potential pitfall of this approach is the potential reduction of caching capacity for application data, which could ultimately harm end-to-end performance. However, as we show in §3 and as shown in multiple prior works (catch, ; ferdman2012, ; jalili2023harvesting, ; damov2021, ; graspHPCA2020, ; droplet, ; manycoreSC2018, ; Bera2022HermesAL, ; linedistillation, ; spillreceive, ; adaptiveQureshiISCA2007, ), modern data-intensive workloads, tend to (greatly) underutilize the cache hierarchy, especially the large L2/L3/L4 caches. This is because many modern working sets exceed the capacity of the cache hierarchy and many data accesses exhibit low spatial and temporal locality (damov2021, ; graspHPCA2020, ; Tesseract, ; droplet, ; NowatzkiISCA19, ; dynamicgran2012matan, ; manycoreSC2018, ). Therefore, the underutilized cache blocks can likely be repurposed to store TLB entries without replacing useful program data and harming end-to-end application performance.

Our goal in this work is to increase the translation reach of the processor’s TLB hierarchy by leveraging the underutilized resources in the cache hierarchy. We aim to design such a practical technique that: (i) is effective in both native and virtualized execution environments, (ii) does not require or rely on contiguous physical allocations, (iii) is transparent to both application and OS software and (iv) has low area, power, and energy costs.

To this end, we present Victima, a new software-transparent mechanism that drastically increases the translation reach of the TLB by leveraging the underutilized resources of the cache hierarchy. The key idea of Victima is to repurpose L2 cache blocks to store clusters of TLB entries. Doing so provides an additional low-latency and high-capacity component to back up the last-level TLB and thus reduces PTWs. Victima has two main components. First, a PTW cost predictor (PTW-CP) identifies costly-to-translate addresses based on the frequency and cost of the PTWs they lead to. Leveraging the PTW-CP, Victima uses the valuable cache space only for TLB entries that correspond to costly-to-translate pages, reducing the impact on cached application data. Second, a TLB-aware cache replacement policy prioritizes kee** TLB entries in the cache hierarchy by considering (i) the translation pressure (e.g., high last-level TLB miss rate) and (ii) the reuse of the TLB entries.

Key Mechanism. Victima gets triggered on last-level TLB misses and evictions. On a last-level TLB miss, if PTW-CP predicts that the page will be costly-to-translate in the future, Victima transforms the data cache block that contains the last-level PT entries (PTEs) (fetched during the PTW) into a cluster of TLB entries to enable direct access to the corresponding PTEs using a virtual address without walking the PT. On a last-level TLB eviction, if PTW-CP makes a positive prediction, Victima issues a PTW in the background to bring the PTEs of the evicted address into the L2 cache, and Victima transforms the fetched PTE entries into a TLB entry. This way, if the evicted TLB entry is accessed again in the future, Victima can directly access the corresponding PTE without walking the PT. Victima (i) is effective in both native and virtualized environments, (ii) is completely transparent to application and system software, (iii) unlike large software-managed TLBs, does not require contiguous physical allocations, and (iv) is compatible with modern large page mechanisms (e.g., Transparent Huge Pages in Linux (hugepage, )).

Key Results. We evaluate Victima with an extended version of the Sniper simulator (sniper, ) (which is open-source (victima_ae, )) using 11 data-intensive applications from five diverse benchmark suites (GraphBIG (Lifeng2015, ), GUPS (Plimpton2006, ), XSBench (Tramm2014, ), DLRM (dlmr, ) and GenomicsBench (genomicsbench, )). Our evaluation yields four major results that show Victima’s effectiveness. First, in native execution environments, Victima improves performance by 7.4% on average over the baseline system that uses a four-level radix-tree-based PT, yielding 3.3% and 6.2% higher performance compared to a system with an optimistic 64K-entry L2 TLB and a system with a state-of-the-art software-managed L3 TLB (pomtlbISCA2017, ), respectively. At the same time, Victima delivers similar performance as a system that employs an optimistic 128K-entry L2 TLB, while avoiding the associated area and power overheads. Second, in virtualized environments, Victima improves performance by 28.7% over the baseline nested paging mechanism (amdnested, ), and outperforms an ideal shadow paging mechanism (vm25, ) by 4.9% and a system that employs a state-of-the-art software-managed TLB (pomtlbISCA2017, ) by 20.1%. Third, Victima achieves such performance benefits by reducing L2 TLB miss latency by 22% (60%) on average in native (virtualized) execution environments compared to the baseline system (nested paging (amdnested, )). Fourth, all of Victima’s benefits come at a modest cost of 0.04% area overhead and 0.08% power overhead compared to a modern high-end CPU (raptor_lake, ).

This paper makes the following major contributions:

•

We observe a new opportunity to reuse the existing underutilized cache resources in order to store TLB entries and increase the translation reach of the processor’s TLB hierarchy at low cost and low overheads.
•

We propose Victima, a new software-transparent mechanism that drastically increases the translation reach of the processor by carefully and practically leveraging the underutilized resources of the cache hierarchy. The key idea of Victima is to repurpose L2 cache blocks to store clusters of TLB entries for costly-to-translate pages, thereby providing an additional low-latency and high-capacity component to back up the last-level TLB and reducing the number of PTWs.
•

We evaluate Victima using a diverse set of data-intensive applications and demonstrate its effectiveness in both native and virtualized environments. Victima achieves high performance benefits by effectively reducing last-level TLB miss latency compared to both realistic and optimistic baseline systems, with very modest area and power overheads compared to a modern high-end CPU.
•

We open-source Victima and all necessary traces and scripts to completely reproduce results at https://github.com/CMU-SAFARI/Victima.

2. Background

2.1. The Virtual Memory Abstraction

Virtual memory is a cornerstone of most modern computing systems that eases the programming model by providing a convenient abstraction to manage the physical memory (ieemicro2018-Bhattacharjee-tempo, ; hand1999, ; old_vm1, ; old_vm2, ; old_vm3, ; old_vm4, ; old_vm5, ; old_vm6, ; old_vm7, ; denning1970, ; ahearn1973, ; goldberg1974survey, ; bruceMMU1998, ; smith, ; wood1986, ; chen1992simulation, ; koldinger1992, ; lindstrom1995, ; jacob1998, ; avm, ; translationmanagementISCA1993, ; interactionASPLOS1991, ; multics, ). The operating system (OS), transparently to application software, maps each virtual memory address to its corresponding physical memory address. Doing so provides a number of benefits, including: (i) application-transparent memory management, (ii) sharing data between applications, (iii) process isolation, and (iv) page-level memory protection. Conventional virtual memory designs allow any virtual page to map to any free physical page. Such a flexible address map** enables two important key features of virtual memory: (i) efficient memory utilization, and (ii) sharing pages between applications. However, such a flexible address map** mechanism has a critical downside: it creates the need to store a large number of virtual-to-physical map**s, as for every process, the OS needs to store the physical location of every virtual page.

2.2. Page Table (PT)

The PT is a per-process data structure that stores the map**s between virtual and physical pages. In modern x86-64 processors, the PT is organized as a four-level radix-tree (intelx86manual, ). Even though the radix-tree-based PT optimizes for storage efficiency, it requires multiple pointer-chasing operations to discover the virtual-to-physical map**. To search for a virtual-to-physical address map**, the system needs to sequentially access each of the four levels of the page table. This process is called page table walk (PTW).

Figure 1 shows the PTW assuming (i) an x86-64 four-level radix-tree PT whose base address is stored in the CR3 register, and (ii) 4KB pages. As shown in Figure 1, a single PTW requires four sequential memory accesses - to discover the physical page number. The processor uses the first 9-bits of the virtual address as offset (Page Map Level4; PML4) to index the appropriate entry of the PT within the first level of the PT . The processor then reads the pointer stored in the first level of the PT to access the second-level of the PT . It uses the next 9-bit set (Page Directory Page table; PDP) from the virtual address to locate the appropriate entry within the second level. This process continues iteratively for each subsequent level of the PT (Page Directory; PD and Page Table; PT ). Eventually, the processor reaches the leaf level of the PT, where it finds the final entry containing the physical page number corresponding to the given virtual address . ARM processors use a similar approach, with the number of levels varying across different versions of the ISA (arm-manual-tlbmaintenance, ).

Refer to caption — Figure 1. Four-level radix-tree page table walk in x86-64 ISA.

2.3. Virtualized Environments

In virtualized environments, each memory request requires a two-level address translation: (i) from guest-virtual to guest-physical, and (ii) from guest-physical to host-physical. The dominant technique to perform address translation in virtualized environments is Nested Paging (NP) (amdnested, ; google-nested, ). In NP, the system uses two page tables: the guest page table that stores guest-virtual to guest-physical address map**s and the host page table that stores guest-physical to host-physical address map**s. To search for the map** between a guest-virtual page to a host-physical page, NP performs a two-dimensional walk, since a host page table walk is required for each level of the guest page table walk. Therefore, in a virtualized environment with a four-level radix-tree-based PT, NP-based address translation can cause up to 24 sequential memory accesses (a 6 $\times$ increase in memory accesses compared to the native execution environment).

2.4. Memory Management Unit (MMU)

When a user process generates a memory (i.e., instruction or data) request, the processor needs to translate the virtual address to its corresponding physical address. Address translation is a critical operation because it sits on the critical path of the memory access flow: no memory access is possible unless the requested virtual address is first translated into its corresponding physical address. Given that frequent PTWs lead to high address translation overheads, modern cores comprise of a specialized memory management unit (MMU) responsible for accelerating address translation. Figure 2 shows an example structure of the MMU of a modern processor (cascadelake, ), consisting of three key components: (i) a two-level hierarchy of translation lookaside buffers (TLBs), (ii) a hardware page table walker, and (iii) page walk caches (PWCs).

L1 TLBs are highly- or fully-associative caches that directly provide the physical address for recently-accessed virtual pages at very low latency (i.e., typically within 1 cycle). There are two separate L1 TLBs, one for instructions (L1 I-TLB) and one for data (L1 D-TLB). Modern TLBs make use of multiple page sizes beyond 4KB in order to (i) cover large amounts memory with a single entry and (ii) maintain compatibility with modern OSes that transparently allocate large pages (tridentMICRO2021, ; corbet2011, ; reserve, ; panwar2019hawkeye, ). For example, an Intel Cascade Lake core (cascadelake, ) employs 2 L1 D-TLBs, one for 2MB pages and one for 4KB pages. Translation requests that miss in the L1 TLBs are forwarded to a unified L2 TLB, that stores translations for both instructions and data. In case of an L2 TLB miss, the MMU triggers a PTW . PTW is performed by a dedicated hardware page table walker capable of performing multiple concurrent PTWs. In order to reduce PTW latency, page table walkers are equipped with page walk caches (PWC) , which are small dedicated caches for each level of the PT (for the first three levels in x86-64). In case of a PWC miss, the MMU issues the request(s) for the corresponding level of the PT to the conventional memory hierarchy .

To accelerate address translation in virtualized execution environments that use Nested Paging (amdnested, ), as shown in Figure 3, the MMU is additionally equipped with (i) a nested TLB that stores guest-physical-to-host-physical map**s and (ii) an additional hardware page table walker that walks the host PT (while the other one walks the guest PT). Upon an L2 TLB miss, the MMU triggers a guest PTW to retrieve the guest-physical address . On a PWC miss, the guest Page Table Walker must retrieve the guest PT entries from the cache hierarchy. However, to access the cache hierarchy that operates on host-physical addresses, the guest PTW must first translate the host-virtual address to the host-physical address using a host PTW. To avoid the host PTW, the MMU probes the nested TLB to search for the host-virtual-to-host-physical translation . Only in case of a nested TLB miss the MMU triggers the host PTW .

3. Motivation

As shown in multiple prior academic works and industrial studies (vm2, ; karakostas_characterIISWC, ; isca2010-barr-trancache, ; 5-levelpaging, ; contiguitas2023, ; radiantISMM21, ; bhattacharjeePACT2009, ; devirtualizingASPLOS2018, ; hash_dont_cache, ; virtualizationimplication, ; vm29, ), various modern data-intensive workloads experience severe performance bottlenecks due to address translation. For example, a system that (i) employs a 1.5K-entry L2 TLB and (ii) uses both 4KB and 2MB pages, experiences a high MPKI of 39, averaged across all evaluated workloads (see Fig. 5).¹¹1§8 describes our evaluation methodology in detail. At the same time, as we show in Figure 4, the average latency of a PTW is 137 cycles.²²2The x-axis of Figure 4 is cut off (at 190 cycles) since only 0.2% of the PTWs take more than 190 cycles to complete. Maximum observed PTW latency is 608 cycles. Based on our evaluation results, frequent L2 TLB misses in combination with high-latency PTWs lead to an average of 30% of total execution cycles spent on address translation.

Previous works propose various solutions to reduce the high cost of address translation and increase the translation reach of the TLBs such as employing (i) large hardware TLBs (sharedl3tlbISCA2011, ; distlltlbMICRO2018, ; gpustealing, ) or (ii) backing up the last-level TLB with a large software-managed TLB (pomtlbISCA2017, ; csaltMICRO2017, ; softwareTLBNAS2013, ; softwareTLBISCA2013, ; uhlig94, ; bruceMMU1998, ; softcontrolcachesISCA1986, ; Nagle1993DesignTF, ; Bala1994SoftwarePA, ). We examine these solutions and their shortcomings in §3.1 and §3.2.

3.1. Large Hardware TLBs

We evaluate the effectiveness of increasing the size of the TLB. Our methodology and workloads are described in detail in §8. Figure 5 demonstrates the L2 TLB MPKI as we increase the size of the L2 TLB from 1.5K up to 64K entries. We observe that increasing the number of L2 TLB entries from 1.5K to 64K (i.e., by 42 $\times$ ) results in reducing the MPKI from 39 to 24 (i.e., by 44%). To better understand the potential performance of increasing the size of the L2 TLB, Figure 6 shows the execution time speedup of L2 TLB configurations with increasing sizes but equal access latencies (i.e., 12 cycles) compared to the baseline system (1.5K-entry L2 TLB). We evaluate an optimistic setting where the access latency is set to 12 cycles regardless of the TLB size. We observe that the optimistic 64K-entry configuration (that reduces MPKI by 44%) leads to a 4.0% higher performance on average compared to the baseline 1.5K-entry L2 TLB configuration.

Unfortunately, increasing the TLB size does not come for free: it leads to larger access latency (as well as area and power), which counteracts the potential performance benefits due to fewer PTWs. For instance, according to CACTI 7.0 (cacti, ), the latency of accessing a 64K-entry large TLB is as high as 39 cycles. Figure 7 shows the execution time speedup of realistic L2 TLB configurations with increasing sizes, while the access latency is adjusted based on the size of the TLB (based on CACTI 7.0 modeling (cacti, )), compared to the baseline system (1.5K-entry L2 TLB with 12-cycle access latency). We observe that in this realistic setting, the performance benefits of increasing the L2 TLB size are significantly lower compared to the optimistic setting (Fig. 6). The realistic 64K-entry configuration (that reduces MPKI by 44%, but comes with a 39-cycle access latency) leads to only 0.8% higher average performance over the baseline configuration. We conclude that although increasing the L2 TLB size reduces PTWs, it comes with increased access latency (as well as power and area), which leads to small performance benefits realistically.

Increasing the size of the L2 TLB has a negative impact on the translation latency of requests that hit in the L2 TLB. Therefore, to keep the access latency of the L2 TLB small, we also explore a scenario where the TLB hierarchy is extended with a large hardware L3 TLB. Figure 8 shows the execution time speedup achieved by a system with a 64K-entry L3 TLB with increasing access latencies, ranging from 15 cycles up to 39 cycles (which is the latency suggested by CACTI 7.0 (cacti, )), compared to the baseline system that employs a two-level TLB hierarchy (with a 1.5K-entry 12-cycle L2 TLB). We observe that a large 64K-entry L3 TLB with a very aggressive 15-cycle access latency leads to a 2.9% performance increase compared to the baseline system. The performance gains are lower compared to employing a 64K-entry L2 TLB (4.0%). This is because, for applications that experience low L2 TLB hit rates, employing an L3 TLB results in a higher L3 TLB hit latency (L2 TLB miss latency $+$ L3 TLB hit latency) compared to using a large L2 TLB. We conclude that employing a large L3 TLB is not universally beneficial, and the performance gains heavily depend on the L2 TLB hit rates and L3 TLB access latencies.

3.2. Large Software-Managed TLBs

Previous works (pomtlbISCA2017, ; csaltMICRO2017, ; softwareTLBNAS2013, ; softwareTLBISCA2013, ; uhlig94, ; bruceMMU1998, ; softcontrolcachesISCA1986, ; Nagle1993DesignTF, ; Bala1994SoftwarePA, ) propose using large software-managed TLBs to reduce PTWs. However, software-managed TLBs suffer from four key disadvantages. First, to look up a software-managed TLB (STLB), the processor fetches STLB entries from the main memory into the cache hierarchy. At the same time, the hit rate of STLBs likely does not justify the cost of fetching STLB entries from the main memory. Hence, the total latency of accessing STLB entries and performing PTWs is comparable to the latency of performing PTWs in the baseline system. To validate our claim, Figure 9 shows the average L2 TLB miss latency in (i) the baseline system in native execution, (ii) a system with a state-of-the-art L3 STLB (pomtlbISCA2017, ) in native execution, (iii) the baseline system that employs nested paging (NP) (amdnested, ) in virtualized execution and (iv) a system with a state-of-the-art L3 STLB (pomtlbISCA2017, ) and NP (amdnested, ) in virtualized execution. We observe that the average L2 TLB miss latency in a system with an STLB is 122 cycles, which is comparable to the baseline system (128 cycles). However, the average L2 TLB miss latency in the system with NP in virtualized execution is 275 cycles, which is higher than the average L2 TLB miss latency in a system with an L3 STLB (220 cycles) in virtualized execution, making the STLB a more attractive solution in virtualized execution environments.

Second, allocating an STLB in software requires contiguous physical address space (on the order of 10’s of MB), which is difficult to find in environments where memory is heavily fragmented, such as data centers (contiguitas2023, ; translationranger2019, ; chloe2020, ) and in cases where memory capacity pressure is high (tmoASPLOS2022, ; memtrade2023sigmetrics, ; softwaredefinedASPLOS2019, ). Third, resizing an STLB throughout the execution of the program to match the program’s needs is challenging due to the large data movement cost of migrating the TLB entries betweeen different software data structures (elastic-cuckoo-asplos20, ; mehtJovanHPCA2023, ; flataAsplos2022, ; nestedecht, ). Fourth, integrating a software-managed TLB in the address translation pipeline requires OS and hardware changes to support (i) flushing and updating software STLB entries during a TLB shootdown (csaltMICRO2017, ; pomtlbISCA2017, ), (ii) handling evictions from the hardware TLB to the STLB (csaltMICRO2017, ; pomtlbISCA2017, ).

3.3. Opportunity: Storing TLB Entries Inside the Cache Hierarchy

Instead of expanding hardware TLBs or introducing large software-managed TLBs, we posit that a cost-effective method to drastically increase the translation reach of the TLB hierarchy is to store the existing TLB entries within the existing cache hierarchy. For example, a 2MB L2 cache can fit $128\times$ the TLB entries a 2048-entry L2 TLB holds. When a TLB entry resides inside the L2 cache, only one low-latency (i.e., $\approx$ 16 cycles) L2 access is needed to find the virtual-to-physical address translation instead of performing a high-latency (i.e., $\approx 137$ cycles on average) PTW.

To better understand the potential of caching TLB entries in the cache hierarchy, we conduct a study where for every L2 TLB miss, the translation request is always served from the L1 cache (TLB-hit-L1), L2 cache (TLB-hit-L2) or the LLC (TLB-hit-LLC). Figure 10 shows the reduction in address translation latency provided by TLB-hit-{L1, L2, LLC} compared to the baseline system. We observe that, even when servicing every L2 TLB miss from the LLC (which takes $\approx$ 35 cycles to access), L2 TLB miss latency is reduced by 71.9% on average across 11 workloads. We conclude that caching TLB entries inside the cache hierarchy can potentially greatly reduce the address translation latency.

3.4. Cache Underutilization

One potential pitfall of storing TLB entries inside the cache hierarchy is the potential reduction of caching capacity for application data, which could ultimately harm end-to-end performance. However, as shown in prior works (catch, ; ferdman2012, ; jalili2023harvesting, ; damov2021, ; graspHPCA2020, ; droplet, ; manycoreSC2018, ; Bera2022HermesAL, ; linedistillation, ; spillreceive, ; adaptiveQureshiISCA2007, ), many modern data-intensive workloads, tend to (greatly) underutilize the cache hierarchy, especially the large L2/L3/L4 caches. This is because modern working sets exceed the capacity of the cache hierarchy and data accesses exhibit low spatial and temporal locality (damov2021, ; graspHPCA2020, ; Tesseract, ; droplet, ; NowatzkiISCA19, ; dynamicgran2012matan, ; manycoreSC2018, ).

Figure 11 shows the reuse-level distribution of blocks in the L2 cache across our evaluated data-intensive workloads (note that y-axis starts from 75%). We observe that on average 92% of the cache blocks experience no reuse (i.e., 0 reuse) after being brought to the L2 cache (i.e., these blocks are not accessed while they reside inside the L2 cache). In contrast, only 8% of blocks experience reuse higher than 1 (i.e., they are accessed more than once while they reside inside the L2 cache). We conclude that a large fraction of the underutilized cache blocks can be repurposed to store TLB entries without replacing useful program data and harming end-to-end application performance.

3.5. Our Goal

Our goal is to increase the translation reach of the processor’s TLB hierarchy by leveraging the underutilized resources in the cache hierarchy. We aim to design such a practical technique that: (i) is effective in both native and virtualized execution environments, (ii) does not require or rely on contiguous physical allocations, (iii) is transparent to both application and OS software and (iv) has low area, power, and energy costs. To this end, our key idea is to store TLB entries in the cache hierarchy.

4. Victima: Design Overview

We present Victima, a new software-transparent mechanism that drastically increases the translation reach of the TLB by leveraging the underutilized resources of the cache hierarchy. The key idea of Victima is to repurpose L2 cache blocks to store clusters of TLB entries. Doing so provides an additional low-latency and high-capacity component to back up the last-level TLB and thus reduces PTWs. Victima has two main components. First, a PTW cost predictor (PTW-CP) identifies costly-to-translate addresses based on the frequency and cost of the PTWs they lead to. Leveraging the PTW-CP, Victima uses the valuable cache space only for TLB entries that correspond to costly-to-translate pages, reducing the impact on cached application data. Second, a TLB-aware cache replacement policy prioritizes kee** TLB entries in the cache hierarchy by taking into account (i) the translation pressure (e.g., high last-level TLB miss rate) and (ii) the reuse characteristics of the TLB entries.

Figure 12 shows the translation flow in Victima compared to the one in a conventional baseline processor (raptor_lake, ). In the baseline system (Fig. 12 top) , (i) whenever an entry is evicted from the L2 TLB , the evicted TLB entry is not cached anywhere. Hence, (i) the TLB entry is dropped and (ii) a high-latency PTW is required to fetch it when it is requested again . In contrast, Victima (Fig. 12 bottom) stores into the L2 cache (i) entries that get evicted from the L2 TLB and (ii) the TLB entries of memory accesses that cause L2 TLB misses. Victima gets triggered on last-level TLB misses and evictions . On a last-level TLB miss, if PTW-CP predicts that the page will be costly-to-translate in the future , Victima transforms the data cache block that contains the last-level PT entries (PTEs) (fetched during the PTW) into a cluster of TLB entries to enable direct access to the corresponding cluster of PTEs using a virtual address without walking the PT. Storing a cluster of TLB entries for contiguous virtual pages inside the L2 cache can be highly beneficial for applications whose memory accesses exhibit high spatial locality. On a last-level TLB eviction , if PTW-CP makes a positive prediction, Victima issues a PTW in the background to bring the PTEs of the evicted address into the L2 cache, and Victima transforms the fetched PTEs into a TLB entry. This way, if the evicted TLB entry is accessed again in the future , Victima can directly access the corresponding PTE from the L2 cache without walking the PT .

Storing evicted TLB entries in the L2 cache can be highly beneficial for applications that experience a high number of capacity misses in the TLB hierarchy. Victima’s functionality seamlessly applies to virtualized environments as well. In virtualized execution, where Victima stores into the L2 cache both (i) conventional TLB entries that store direct guest-virtual-to-host-physical map**s as well as (ii) nested TLB entries that store guest-physical-to-host-physical map**s.

5. Victima: Detailed Design

We describe in detail (i) how the L2 cache is modified to store TLB entries, (ii) how Victima inserts TLB entries into the L2 cache, (iii) how address translation flow changes in the presence of Victima, (iv) how Victima operates in virtualized environments and (v) how Victima maintains TLB entries coherent. We use as the reference design point a modern x86-64 system that employs 48-bit virtual addresses (VA) and 52-bit physical adddresses (PA) (intelx86manual, ).

5.1. Modifications to the L2 Cache

We minimally modify the L2 cache to (i) support storing TLB entries and (ii) enable a TLB-aware replacement policy that favors kee** TLB entries inside the L2 cache taking into account address translation pressure (e.g., L2 TLB MPKI) and the reuse characteristics of TLB entries.

TLB Blocks. We introduce a new cache block type to store TLB entries in the data store of the L2 cache, called the TLB block. Figure 13 shows how the same address maps to (i) a conventional L2 data cache block and (ii) an L2 cache block that contains TLB entries for 4KB or 2MB pages. Each cache entry can potentially store a data block or a TLB block. A conventional data block is (typically) accessed using the PA while a TLB block is accessed using the VA. Victima modifies the cache block metadata layout to enable storing TLB entries. First, an additional bit is needed to distinguish between a data block versus a TLB block. Second, in a conventional data block, the size of the tag of a 1MB, 16-way associative L2 cache consists of $52-log_{2}(1024)-log_{2}(64)=36$ bits. However, in a TLB block, the tag consists of only 23 bits and is computed as $48-log_{2}(4KB)-log_{2}(1024)-log_{2}(8)=23$ bits which is smaller than the tag needed for a data block.³³3Each 64-byte TLB block can store up to 8 8-byte PTEs. Victima uses the 3 least significant bits of the virtual page number to identify and access a specific PTE. We leverage the unused space in TLB blocks to (i) avoid aliasing and (ii) store page size information.

To prevent aliasing between the virtual addresses (VAs) of different processes, 11 unused bits of the tag are reserved for storing the address-space identifier (ASID) or the virtual-machine identifier (VMID) of each process. The rest of the bits are used to store page size information. Given a 48-bit VA and 52-bit PA, we can spare 11 bits for the ASID/VMID. As the VA size becomes larger, e.g., 57 bits, fewer bits can be spared for the ASID/VMID (4 bits in case of 57-bit VA and 52-bit PA). However, modern operating systems do not use more than 12 ASIDs/core (linuxasid, ) in order to avoid expensive lookups in the ASID table. Hence, when using 57-bit VAs and 52-bit PAs, even with only 4 bits left for the ASID, there is no risk of aliasing.

For a cache with 64-byte cache lines, it is possible to uniquely tag and avoid aliasing between TLB entries (without increasing the size of the cache’s hardware tag entries) only if $(PA_{length}>VA_{length}-9)$ .⁴⁴4If $PA_{length}\leq(VA_{length}-9)$ , a single VA can map to different TLB blocks. This is because the tag of the TLB block does not fit inside the hardware tag entry of the L2 cache. In cases where this condition is not met, an alternative approach is to reduce the number of TLB entries in the TLB Block (e.g., by storing 7 PTEs instead of 8 PTEs) and use the remaining bits for the tag/ASID/VMID. Previous works (e.g., (graphfire, )) propose such solutions to enable efficient sub-block tagging in data caches.

TLB-aware Cache Replacement Policy. We extend the conventional state-of-the-art SRRIP cache replacement policy (srrip, ) to prioritize storing TLB entries of an application for longer time periods if the application experiences high address translation overheads (i.e., L2 TLB MPKI greater than 5). Listing 13 shows the pseudocode of the block insertion function, replacement candidate function, and cache hit function for SRRIP in the baseline system and Victima (changes compared to baseline SRRIP are marked in blue). Upon insertion of a TLB entry inside the L2 cache (insertBlockInL2(block) Line 1), the re-reference interval (analogous to reuse distance) is set to 0 (Line 6), marking the TLB entry as a block with a small reuse distance. This way, TLB entries are unlikely to be evicted soon after their insertion. Upon selection of a replacement candidate (chooseReplacementCandidate() Line 10), if the selected replacement candidate is a TLB block (Line 23) and translation pressure is high (Line 23), SRRIP makes one more attempt to find a replacement candidate that is not a TLB block (Line 23). If no such candidate is found, the TLB block is evicted from the L2 cache and is dropped (i.e., not written anywhere else). Upon a cache hit to a TLB entry (updateOnL2CacheHit(index) Line 28), the re-reference interval is reduced by three instead of one (Line 32) to provide higher priority to the TLB entry compared to other data blocks (Line 34).

{listing}

⬇

1 function insertBlockInL2(block):

2 // If inserting a TLB block and TLB pressure is high

3 // set the reference-interval to 0 to provide high priority

4 // assuming that reuse in near future will be high

5 if (block == TLB and TLB_MPKI > 5)

6 rrip_counter[block] = 0

7 else

8 rrip_counter[block] = RRIP_MAX

10 function chooseReplacementCandidate():

11 // Replace an invalid block if possible

12 for i from 0 to m_associativity - 1:

13 if (block[i] == invalid ) return i

14 // Check the re-reference interval of each block in the set

15 for j from 0 to RRIP_MAX:

16 // Search for a block with RRIP_MAX

17 for i from 1 to #ways:

18 chosen_block = chooseBlockWithHighRRIP()

19 // If a TLB block is chosen for replacement

20 // and translation pressure is high,

21 // make one more attempt and try

22 // to evict another block if possible

23 if (chosen_block == TLB && TLB_MPKI > 5) skip

24 // Increment all RRIP counters

25 for i from 1 to #ways :

26 incrementRRIPcounters()

28 function updateOnL2CacheHit(index):

29 // Assume reuse will be high for TLB block

30 // and reduce the re-reference interval by 3 instead of 1

31 if (block[index] == TLB && TLB_MPKI > 5)

32 rrip_counter[index] -= 3;

33 else

34 rrip_counter[index]--;

TLB-Block-Aware SRRIP (srrip, ) L2 Cache Replacement Policy

5.2. Inserting TLB Blocks into the L2 Cache

Victima allocates a block of 8 TLB entries (64 bytes) that correspond to 8 contiguous virtual pages inside the L2 cache upon an L2 TLB miss or an L2 TLB eviction, if the corresponding page is deemed to be costly-to-translate in the future. To predict whether a page will be costly-to-translate, Victima employs a Page Table Walk cost predictor (PTW-CP). Figure 14 depicts Victima’s operations on an L2 TLB miss or eviction.

Inserting a TLB Block into the L2 Cache upon an L2 TLB Miss. When an L2 TLB miss occurs, the MMU consults the PTW-CP to find out if the page is predicted to be costly-to-translate in the future ( in Fig. 14). If the prediction is positive, the MMU checks if the corresponding TLB block already resides inside the L2 cache . If it does, no further action is needed. If not, the MMU first waits until the PTW is completed . When the last level of the PT is fetched, the MMU transforms the cache block that contains the PTEs to a TLB block by updating the metadata of the block . The MMU (i) replaces the existing tag with the tag of the virtual page region, (ii) sets the TLB bit to mark the cache block as TLB block, and (iii) updates the ASID and the page size information associated with the TLB block. This way, the TLB block containing the consecutive PTE entries is directly accessible using the corresponding virtual page numbers and the ASID of the application without walking the PT. Storing several (e.g., 8 in our implementation) TLB entries for consecutive virtual pages inside the same L2 cache TLB block can be highly beneficial for applications whose memory acccesses exhibit high spatial locality and frequently access neighboring pages.

Inserting a TLB Block into the L2 Cache upon an L2 TLB Eviction. When an L2 TLB eviction occurs, the MMU consults the PTW-CP to find out if the page is predicted to be costly-to-translate in the future ( in Fig. 14). If the outcome of the prediction is positive, the MMU checks if the corresponding TLB block already resides in the L2 cache . If it does, no further action is needed. If it does not, the MMU issues in the background a PTW for the corresponding TLB entry . When the last level of the page table is fetched , the MMU follows the same procedure as the L2 TLB miss-based insertion (i.e., transforms the cache block that contains the PTEs to a TLB block) . This way, if the evicted TLB entry (or any other TLB entry in the block) is accessed again in the future, Victima can directly access the corresponding PTE without walking the PT.

Page Table Walk Cost Predictor: Functionality. The PTW cost predictor (PTW-CP) is a small comparator-based circuit that estimates whether the page is among the top 30% most costly-to-translate pages. Using it, Victima predicts if a page will cause costly PTWs in the future and decides whether the MMU should store the corresponding TLB block inside the L2 cache. To make this decision, PTW-CP uses two metrics associated with a page: (i) PTW frequency and (ii) PTW cost, both of which are embedded inside the PTE of the corresponding page. Figure 15 shows the structure and the functionality of PTW-CP. PTW frequency is stored as a 3-bit counter in the unused bits of the PTE and is incremented after every PTW that fetches the corresponding PTE. PTW cost is also stored as a 4-bit counter in the unused bits of the PTE and is incremented every time the PTW leads to at least one DRAM access. Both counters are updated by the MMU after every PTW that fetches the corresponding PTE. If any of the two counters overflows, its value remains at the maximum value throughout the rest of the program’s execution.

On an L2 TLB miss or eviction , the PTW-CP waits until the corresponding PTE is fetched inside the L2 TLB. PTW-CP fetches the two counters from the TLB entry that contains the PTE, passes them through a tree of comparators, and calculates the result . If the L2 cache experiences high MPKI (i.e., data exhibits low locality, meaning that caching data is not that beneficial), the PTW-CP is bypassed and the TLB entry is inserted inside the L2 cache without consulting the PTW-CP .

Page Table Walk Cost Predictor: Feature Selection. Our development of PTW-CP’s architecture involves a systematic and empirical approach to (i) identify the most critical features for making high-accuracy predictions and (ii) create an effective predictor while minimizing hardware overhead and inference latency. Initially, we collect a set of 10 per-page features related to address translation, as shown in Table 1. From these 10 features, we methodically identify a small subset that would maximize accuracy while minimizing prediction time and storage overhead.

Table 1. Per-Page Feature Set

Feature (per PTE)	Bits	Description
Page Size	1	The size of the page (4KB or 2MB)
Page Table Walk Frequency	3	# of PTWs for the page
Page Table Walk Cost	4	# of DRAM accesses during all PTWs
PWC Hits	5	# of times the PTW led to PWC hit
L1 TLB Misses	5	# of times the page experienced L1 TLB miss
L2 TLB Misses	5	# of times the page experienced L2 TLB miss
L2 Cache Hits	5	# of times the page experienced L2 cache hits
L1 TLB Evictions	5	# of times the TLB entry got evicted from L1 TLB
L2 TLB Evictions	6	# of times the TLB entry got evicted from L2 TLB
Accesses	6	# of accesses to the page

Table 2 shows the architectural characteristics and the performance of three different multi-layer perceptron-based neural networks (NN) (haykin1994neural, ) and of our final comparator-based model. First, we evaluate three different NN architectures with different feature sets to gain insights about the most critical features (for accuracy). The first NN (NN-10) uses all 10 features, the second NN (NN-5) uses a set of 5 features (PTW cost, PTW frequency, PWC hits, L2 TLB evictions, and accesses to the page), and the third (NN-2) uses only 2 features, the PTW frequency and the PTW cost. We use four metrics to evaluate the performance of each model: accuracy, precision, recall, and F1-score. Accuracy is the fraction of correct predictions, precision is the fraction of correct positive (i.e., costly-to-translate) predictions, and recall is the fraction of correct negative predictions. F1-score is the harmonic mean of precision and recall. In the context of PTW-CP, making negative predictions when the page is actually costly-to-translate leads to performance degradation, while making positive predictions when the page is actually not costly-to-translate leads to L2 cache pollution. From Table 2, we observe that NN-10 achieves the highest performance, with an F1-score of 90.42%. By reducing the number of features to 5, NN-5 still achieves high performance reaching 89.89% F1-score while NN-2 leads to an F1-score of 80.66%. At the same time, NN-2 is 7.75x smaller than NN-10 and 90.5x smaller than NN-5 which makes it an attractive solution for PTW-CP as it achieves reasonable accuracy with small hardware overhead.

Table 2. Comparison of Different Types of PTW-CP

Model Parameters	NN-10	NN-5	NN-2	Comparator
# of Features	10	5	2	2
Number of Layers	4	4	6	N/A
Size of Hidden Layers	16	64	4	N/A
Size (B)	6024	70152	776	24
Recall	93.34%	92.44%	89.62%	89.61%
Accuracy	92.13%	91.72%	82.90%	82.90%
Precision	87.68%	87.47%	73.33%	73.34%
F1-score	90.42%	89.89%	80.66%	80.66%

To gain a better understanding of the prediction pattern of NN-2, Fig. 16 shows the predictions of the network for all possible PTW frequency and PTW cost value pairs. We observe that the network exhibits a clear prediction pattern that separates costly-to-translate pages from non-costly-to-translate pages: PTW frequency-cost value pairs that fall inside the boundaries of the bounding box (rectangle spanning from the bottom-left corner (1,1) to the top-right corner (12,7) as drawn on Fig. 16) are classified as costly-to-translate by NN-2, while PTW frequency-cost value pairs that fall outside the bounding box are classified as non-costly-to-translate. Many of the PTW frequency-cost value pairs never occur during the execution of the applications we evaluate and are not classified by NN-2. Table 2 demonstrates that a simple comparator approach that mimics the functionality the bounding box shown in Fig. 16, achieves an F1-score of 80.66% without any performance loss compared to NN-2. The comparator-based model requires only 24 bytes of storage, 251x less than NN-10, 2923x less than NN-5 and 32x less than NN-2. The comparator-based model requires only (i) four comparators to compare the two counters with the edges of the bounding box, i.e., (1,1) and (12,7) and (ii) can make a prediction in a single cycle. The comparator-based model is the PTW-CP architecture that we use in Victima.

5.3. Address Translation Flow with Victima

Figure 17 demonstrates the address translation flow in a system that employs Victima. When an L2 TLB miss occurs, the MMU in parallel (i) initiates the PTW and (ii) looks up the corresponding TLB block in the L2 cache . In contrast to regular L2 data block lookups, which are performed using the physical address, a TLB block lookup is performed using the virtual page number (VPN) and the address-space identifier (ASID) of the translation request. The size of the VPN is not known a priori, so Victima probes the L2 cache twice in parallel, once assuming a 4KB VPN and once assuming a 2MB VPN. If the tag (either the tag of the 4KB VPN or the tag of the 2MB VPN) and the ASID matches with a block that has the TLB-entry bit set, the translation request is served by the L2 cache , the PTW is aborted, and the TLB entry is inserted into the L2 TLB. If the TLB entry is not found in the L2 cache, the PT walker runs to completion and resolves the translation .

5.4. Victima in Virtualized Environments

We demonstrate how Victima improves address translation in virtualized environments. The key idea is to insert both (i) TLB entries and (ii) nested TLB entries into the L2 cache to increase the translation reach of the processor’s TLB hierarchy for both guest-virtual-to-guest-physical and guest-physical-to-host-physical address translations and avoid both (i) guest-PTWs and (ii) host-PTWs. A nested TLB block is a block of 8 nested TLB entries that correspond to 8 contiguous host-virtual pages. To distinguish between conventional TLB blocks and nested TLB blocks, Victima extends the cache block metadata with an additional bit to mark a block as a nested TLB block. Figure 18 shows how Nested TLB blocks are inserted into the L2 cache in a system that employs Victima and nested paging (amdnested, ) in virtualized execution. (conventional TLB blocks are allocated as described in §5.2).

Inserting a Nested TLB Block into the L2 Cache upon a Nested TLB Miss. When a Nested TLB miss occurs, the MMU consults the PTW-CP to find out if the host-virtual page will be costly-to-translate in the future . If the prediction is positive, the MMU checks if the corresponding nested TLB block already resides inside the L2 cache . If it does, no further action is needed. If not, the MMU first waits until the host-PTW is completed. When the last level of the host-PT is fetched , the MMU transforms the cache block that contains the host-PTEs to a nested TLB block by updating the metadata of the block . The MMU (i) replaces the existing tag with the tag of the host-virtual page region, (ii) sets the nested TLB bit to mark the cache block as a nested TLB block, and (iii) updates the ASID (or VMID) and the page size information.

Inserting a Nested TLB Block into the L2 Cache upon a Nested TLB Eviction. When a Nested TLB eviction occurs, the MMU consults the PTW-CP to find out if the host-virtual page will be costly-to-translate in the future . If the outcome of the prediction is positive, the MMU checks if the corresponding nested TLB block already resides in the L2 cache . If it does, no further action is needed. If it does not, the MMU issues in the background a host-PTW for the corresponding TLB entry . When the last level of the host-PT is fetched , the MMU transforms the cache block that contains the host-PTEs to a nested TLB block. .

Address Translation Flow. Figure 18 shows the address translation flow of a system that employs Victima and nested paging (amdnested, ) in virtualized execution. If a nested TLB miss occurs , the MMU probes the L2 cache to search for the nested TLB entry . If the nested TLB entry is found inside the L2 cache, the host-PTW gets skipped . If it is not found, the host-PTW is performs the guest-physical-to-host-physical address translation .

6. TLB Maintainance Operations

Modern ISAs provide specific instructions used by the OS to invalidate TLB entries and maintain correctness in the presence of (i) context switches and (ii) modifications of virtual-to-physical address map**s (called TLB shootdowns) that occur due to physical page migration, memory de-allocation etc. Different ISAs provide different instructions for TLB invalidations. For example, the ARM v8 architecture (arm-manual-tlbmaintenance, , D5.10.2) defines multiple special instructions to invalidate TLB entries with each instruction handling a distinct case (e.g., invalidating a single TLB entry vs invalidating all TLB entries with a specific ASID). x86-64 provides a single instruction, INVLPG, which corresponds to invalidating one single TLB entry (intelx86manual-vol2, ). In Victima, whenever a TLB invalidation is required, the corresponding TLB entries in the L2 cache need to be invalidated. In this section, following the example of the ARM specification, which is a superset of other specifications we know of, we discuss in detail how Victima supports TLB invalidations due to context-switches and TLB shootdowns.

6.1. Context Switches

TLB flushing occurs when the OS switches the hardware context and schedules another process (or thread) to the core. In this case, the OS makes a decision on whether or not the TLB entries across the TLB hierarchy should be invalidated, which depends on the ASIDs of the current and to-be-executed processes (in practice Linux uses only 12 different ASIDs per core even though the processor can support up to 4096 ASIDs). In Victima, if the OS flushes the entire TLB hierarchy, all the TLB blocks in L2 cache need to be invalidated as well. If the OS performs a partial flush based on the ASID, all the TLB blocks in L2 cache with the corresponding ASID need to be evicted. In the corner case that Victima uses fewer bits for the ASID, i.e., when L2 cache tag is not large enough to store enough ASID bits to cover the ASID of the process, all the TLB blocks inside the L2 cache get invalidated during a context switch (the L1 and L2 TLB entries can still be invalidated using the ASID). Based on our evaluation setup, for a 2MB L2 cache which is occupied by 50% by TLB blocks, the total time to complete the invalidation procedure is on the order of 100 ns. The invalidation procedure happens in parallel with the L2 TLB invalidation and is negligible compared to context switch completion times (order of $\mu$ s (kaffesHotOS2021, ; XPC, )).

(i) Invalidating all TLB entries. To invalidate all the TLB blocks inside the L2 cache, the L2 TLB first sends an invalidation command to the L2 cache controller. The cache controller probes in parallel all cache banks to invalidate all the TLB blocks of every L2 cache set. For each way, if the TLB entry bit is set, the TLB block is invalidated.

(ii) Invalidating all TLB entries with a specific ASID. To invalidate all TLB blocks with a specific ASID, the L2 TLB first sends an invalidation command to the L2 cache controller with the corresponding ASID. For every cache block, if the TLB entry bit is set and the ASID matches the ASID of the invalidation request, the TLB block is invalidated. If the size of the ASID of the invalidation command is larger (e.g., 4 bits) than the supported ASID (e.g., 3 bits), then all the TLB blocks inside L2 cache are flushed. However, we believe this is an uncommon case, because, e.g., Linux uses only 12 ASIDs/core (linuxasid, ).

6.2. TLB Shootdowns

A TLB shootdown occurs when the CPU needs to invalidate stale TLB entries on local and remote cores. It is caused by various memory management operations that modify page table entries, such as de-allocating pages (unmap()), migrating pages, page permission changes, deduplication, and memory compaction. As shown in previous works (latr, ), TLB shootdowns take order of $\mu$ s time to complete due to expensive inter-processor interrupts (IPIs). In Victima, if the system performs a TLB shootdown, the corresponding TLB blocks need to be invalidated in the L2 cache. We explain how for two different TLB shootdown-based invalidations:

(i) Invalidating a single TLB entry given VA and ASID. Invalidating a specific TLB entry by VA and ASID only requires sending an invalidation command with the VA and the ASID to the L2 cache controller. Since each TLB block contains eight contiguous TLB entries, invalidating one TLB entry of the TLB block leads to invalidating all eight corresponding TLB entries.

(ii) Invalidating all TLB entries given a range of VAs. Invalidating a range of VAs requires sending multiple invalidation commands with different VAs to the L2 cache controller. The L2 cache controller accordingly invalidates all the corresponding TLB blocks.

7. Area & Power Overhead

Victima requires three additions to an existing high-performance core design: (i) two new TLB Entry bits in every L2 cache block (one of TLB entries and one for nested TLB entries) (§5.1), (ii) the PTW cost estimator (§5.2) and (iii) the necessary logic to perform tag matching and invalidation of TLB blocks using the TLB Entry bit, the VPN, and the ASID (§6). Extending each L2 cache block with two TLB Entry bits results in a $0.4\%$ storage overhead for caches with $64$ B blocks (e.g., in total $8$ KB for a $2$ MB L2 cache). PTW-CP requires only (i) 4 comparators to compare the PTE counters with the corresponding thresholds and (ii) 4 registers to store the thresholds. To support tag matching/invalidation operations for TLB blocks, we extend the tag comparators of the L2 cache with a bitmask to distinguish between tag matching/invalidation for TLB blocks and tag matching/invalidation for conventional data blocks. Based on our evaluation with McPAT (mcpat, ), all additional logic requires $0.04\%$ area overhead and $0.08\%$ power overhead on top of the high-end Intel Raptor Lake processor (raptor_lake, ).

8. Evaluation Methodology

We evaluate Victima using an extended version of the Sniper Multicore Simulator (sniper, ). This simulator and its documentation are freely available at https://github.com/CMU-SAFARI/Victima. We extend Sniper to accurately model: (i) TLBs that support multiple page sizes, (ii) the conventional radix page table walk, (iii) page walk caches, (iv) nested TLBs and nested paging (amdnested, ) and, (vi) the functionality and timing of all the evaluated systems. Table 8 shows the simulation configuration of (i) the baseline system and (ii) all evaluated systems.

Table 3. Simulation Configuration and Simulated Systems

Baseline System
Core	4-way OoO x86-64 2.6GHz
MMU	L1 I-TLB: 128-entry, 8-way assoc, 1-cycle latency
	L1 D-TLB (4KB): 64-entry, 4-way assoc, 1-cycle latency
L1 D-TLB (2MB): 32-entry, 4-way assoc, 1-cycle latency
	L2 TLB: 1536-entry, 12-way assoc, 12-cycle latency
	3 Split Page Walk Caches: 32-entry, 4-way assoc, 2-cycle latency
L1 Cache	L1 I-Cache: 32 KB, 8-way assoc, 4-cycle access latency
L1 D-Cache: 32 KB, 8-way assoc, 4-cycle access latency
	LRU replacement policy; IP-stride prefetcher (stride, )
L2 Cache	2 MB, 16-way assoc, 16-cycle latency
	SRRIP replacement policy (srrip, ); Stream prefetcher (streamer, )
L3 Cache	2 MB/core, 16-way assoc, 35-cycle latency
Transparent Huge Pages (arcangeli2010, ; corbet2011, )	Debian 9 4.14.2. 10-node cluster Memory per node: 256GB-1TB
Evaluated Systems
POM-TLB (pomtlbISCA2017, )	64K-entry L3 software-managed TLB, 16-way assoc
	TLB-aware SRRIP replacement policy (§5.1)
Opt. L3 TLB-64K	1.5K-entry L2 TLB, 12-cycle latency
	64K-entry L3 TLB, optimistic 15-cycle latency
Opt. L2 TLB-64K	64K-entry L2 TLB, 16-way assoc, optimistic 12-cycle latency
Opt. L2 TLB-128K	128K-entry L2 TLB, 16-way assoc, optimistic 12-cycle latency
Nested Paging (amdnested, )	2D PTW; Guest PT: Four-level Radix, Host PT: Four-level Radix
	64-entry Nested TLB, 1-cycle latency
Ideal Shadow Paging (I-SP) (vm25, )	1D Shadow PTW instead of 2D PTW
	Updates to shadow page table cause no performance overheads
Victima	MMU consults PTW-CP only if L2 cache MPKI $<$ 5 (§5.2)
	TLB-aware SRRIP replacement policy (§5.1)

Workloads. Table 4 shows all the benchmarks we use to evaluate Victima and the systems we compare Victima to. We select applications with high L2 TLB MPKI ( $>5$ ), which are also used in previous works (elastic-cuckoo-asplos20, ; compendiaISMM2021, ; midgard, ; flataAsplos2022, ). We evaluate our design using seven workloads from the GraphBig (Lifeng2015, ) suite, XSBench (Tramm2014, ), the Random access workload from the GUPS suite (Plimpton2006, ), Sparse Length Sum from DLRM (dlmr, ) and kmer-count from GenomicsBench (genomicsbench, ). We extract the page size information for each workload from a real system that uses Transparent Huge Pages (corbet2011, ; arcangeli2010, ) with both 4KB and 2MB pages. Each benchmark is executed for 500M instructions.

Table 4. Evaluated Workloads

Suite	Workload	Dataset size
GraphBIG (Lifeng2015, )	Betweeness Centrality (BC), Bread-first search (BFS), Connected components (CC), Graph coloring (GC), PageRank (PR), Triangle counting (TC), Shortest-path (SP)	8 GB
XSBench (Tramm2014, )	Particle Simulation (XS)	9 GB
GUPS (Plimpton2006, )	Random-access (RND)	10 GB
DLRM (dlmr, )	Sparse-length sum (DLRM)	10.3 GB
GenomicsBench (genomicsbench, )	k-mer counting (GEN)	33 GB

Evaluated Systems in Native Execution. We evaluate six different systems in native execution environments: (i) Radix: Baseline x86-64 system that uses the conventional (1) two-level TLB hierarchy and (2) four-level radix-based page table, (ii) POM-TLB: a system equipped with a large 64K-entry software-managed L3 TLB (pomtlbISCA2017, ) and the TLB-aware SRRIP policy (§5.1) at the L2 cache, (iii) Opt. L3TLB-64K: a system equipped with a 64K-entry L3 TLB with an optimistic 15-cycle access latency, (iv) Opt. L2TLB-64K: a system equipped with a 64K-entry L2 TLB with an optimistic 12-cycle access latency, (v) Opt. L2TLB-128K: a system equipped with a 128K-entry L2 TLB with an optimistic 12-cycle access latency, and (vi) Victima: a system that employs Victima and the TLB-aware SRRIP policy (§5.1) at the L2 cache.

Evaluated Systems in Virtualized Execution. We evaluate four different systems in virtualized execution environments: (i) Nested Paging (NP): Baseline x86-64 system that uses (1) a two-level TLB hierarchy and (2) a 64-entry Nested TLB and employs Nested Paging (amdnested, ), (ii) POM-TLB: a system equipped with a large 64K-entry software-managed L3 TLB (pomtlbISCA2017, ) and the TLB-aware SRRIP policy (§5.1) at the L2 cache, (iii) I-SP: a system that employs an ideal version of shadow paging (amdnested, ; vm25, ) where (1) only a four-level radix shadow page table walk is needed to discover the virtual-to-physical translation and (2) the updates to the shadow page table are performed without incurring performance overhead, and (iv) Victima: a system that employs Victima and caches both TLB and nested TLB entries in the L2 cache which is equipped with the TLB-aware SRRIP policy at the L2 cache.

Victima: Drastically Increasing Address Translation Reach by Leveraging Underutilized Cache Resources