License: CC BY 4.0
arXiv:2403.06938v1 [cs.AR] 11 Mar 2024

TCAM-SSD: A Framework for Search-Based Computing in Solid-State Drives

Ryan Wong (University of Illinois Urbana-Champaign, USA), Nikita Kim (Carnegie Mellon University, USA), Kevin Higgs (University of Illinois Urbana-Champaign, USA), Engin Ipek (Samsung Electronics, USA), Sapan Agarwal (Sandia National Laboratories, USA), Saugata Ghose (University of Illinois Urbana-Champaign, USA), and Ben Feinberg (Sandia National Laboratories, USA)
(2024)
Abstract.

As the amount of data produced in society continues to grow at an exponential rate, modern applications are incurring significant performance and energy penalties due to high data movement between the CPU and memory/storage. While processing in main memory can alleviate these penalties, it is becoming increasingly difficult to keep large datasets entirely in main memory. This has led to a recent push for in-storage computation, where processing is performed inside the storage device.

We propose TCAM-SSD, a new framework for search-based computation inside the NAND flash memory arrays of a conventional solid-state drive (SSD), which requires lightweight modifications to only the array periphery and firmware. TCAM-SSD introduces a search manager and link table, which can logically partition the NAND flash memory’s contents into search-enabled regions and standard storage regions. Together, these light firmware changes enable TCAM-SSD to seamlessly handle block I/O operations, in addition to new search operations, thereby reducing end-to-end execution time and total data movement. We provide an NVMe-compatible interface that provides programmers with the ability to dynamically allocate data on and make use of TCAM-SSD, allowing the system to be leveraged by a wide variety of applications. We evaluate three example use cases of TCAM-SSD to demonstrate its benefits. For transactional databases, TCAM-SSD can mitigate the performance penalties for applications with large datasets, achieving a 60.9% speedup over a conventional system that retrieves data from the SSD and computes using the CPU. For database analytics, TCAM-SSD provides an average speedup of 17.7× over a conventional system for a collection of analytical queries. For graph analytics, we combine TCAM-SSD’s associative search with a sparse data structure, speeding up graph computing for larger-than-memory datasets by 14.5%.

1. Introduction

Over the past decade, the amount of data generated and consumed by modern applications has grown exponentially (Reinsel et al., 2018). As one example, the number of photos shared per minute on the Instagram social media application has increased from 3,500 in 2013 to approximately 66,000 in 2022 (Xu et al., 2016; Domo, Inc., 2022). This growth has been observed in a wide variety of application domains, including machine learning (Chowdhery et al., 2023), social media (B. Marr, 2018; Domo, Inc., 2014), and business transactions (Im et al., 2018), with the average person producing 102 MB of data per minute (Domo, Inc., 2023). As the quantity of data grows, increased pressure is placed on existing memory and storage subsystems, as frequent data movement is needed between these subsystems, the on-chip caches, and the CPU. Unfortunately, this data movement has become a major bottleneck in modern systems, as it consumes large amounts of energy and results in significant performance penalties (Oliveira et al., 2021; Nair et al., 2015; Ghose et al., 2019; Dally, 2015).

To alleviate the overheads of data movement, architects and system designers have been exploring processing-in-memory (PIM), a broad field of techniques that can perform computation close to or inside memory and storage devices. While early proposals for PIM date back to the 1970s, a recent resurgence of PIM has led to significant innovation in the last decade, with solutions proposed across a diverse range of memory technologies (e.g., SRAM (Aga et al., 2017; Eckert et al., 2018; Fujiki et al., 2019; Jeloka et al., 2016), DRAM (Gao et al., 2021; Hajinazar et al., 2021; Seshadri et al., 2017; Li et al., 2017), NAND flash memory (Gao et al., 2021; Park et al., 2022; Shim and Yu, 2022), emerging memory technologies (Angizi et al., 2018a, b, 2019; Ankit et al., 2019; Chou et al., 2019; Gaillardon et al., 2016; Gupta et al., 2018; Hamdioui et al., 2017; Imani et al., 2018; Shafiee et al., 2016)). DRAM-based processing-in-memory has formed the basis of several commercial products available today (He et al., 2020; Lee et al., 2021; Devaux, 2019; UPMEM SAS, [n. d.]; Samsung Electronics Co., Ltd., [n. d.]). However, efforts to increase DRAM capacity continue to suffer from scaling issues that have plagued DRAM manufacturers over the last 20 years (Mandelman et al., 2002; Kang et al., 2014; Mutlu, 2013), and recent reports speculate that these scaling issues will worsen in the coming decade (Im et al., 2020; Jones, 2020). These challenges will make it increasingly difficult for main memory capacity to keep up with ever-increasing dataset sizes.

NAND-flash-based solid-state drives (SSDs) provide an opportunity to perform PIM while overcoming main memory scalability issues. Compared to DRAM, NAND flash memory offers significantly higher densities that translate into orders-of-magnitude larger capacity at an approximately 40× cheaper cost-per-bit (Neumann and Freitag, 2020). However, NAND flash access latencies are significantly higher than their equivalents in DRAM. This makes it more challenging to take near-memory logic originally proposed to sit close to DRAM and implement this logic close to storage. Instead, to overcome the higher latencies, NAND-flash-based PIM can take advantage of processing-using-memory (PUM; a.k.a. in situ computing), where we perform logic directly using memory cells to unlock the potential of million-way parallelism (where, generally, a pair of cells can form a compute unit).

Recent works have explored how to implement PUM in NAND flash memory. Parabit (Gao et al., 2021) proposes modifications to the latching circuitry used in NAND flash memory to execute bitwise operations between subsequent page-level operations. Flash-Cosmos (Park et al., 2022) extends Parabit by proposing intra- and inter-block bitwise operations, a common operation for a variety of applications, including databases and web search. GP3D (Shim and Yu, 2022) accelerates graph analytics, specifically by incorporating a PageRank (Brin and Page, 1998) accelerator that leverages the NAND flash array structure to perform analog matrix–vector multiplication. However, despite significant potential for performance and energy improvements, these existing approaches (1) have notable limitations that prevent their widespread use (e.g., the need to perform the same operation across thousands of data elements, modifications within the NAND flash array, domain-specific solutions) and (2) do not examine how to integrate their low-level in-flash primitives with the larger system and application (e.g., data organization across arrays and across chips, retrieving operand results from the array).

To tackle these issues, we propose TCAM-SSD, a new framework for efficient in-SSD computing. TCAM-SSD builds upon in-memory search (IMS) (Tseng et al., 2020), a previously proposed primitive that treats a NAND flash array as a bulk ternary content-addressable memory (TCAM). Conceptually, a TCAM searches across multiple data entries in parallel, by driving the search string value (i.e., the search key) on wires connected across all of the entries. While the IMS work shows how to perform parallel TCAM lookups using a NAND flash array, it does not discuss how to integrate this primitive to perform useful computation for applications, or how the SSD should manage the data and computation alongside I/O requests. TCAM-SSD provides mechanisms that enable the application to (1) effectively retrieve data through searchable data elements, and (2) execute compute operations in the form of queries. The proposed framework can be further extended to support in-SSD associative computing, enabling a wide range of applications such as databases (Caminal et al., 2022), data mining (Agrawal and Srikant, 1994), network routing (Pei and Zukowski, 1991), and image processing (Meribout et al., 2000).

Our goal with TCAM-SSD is to build a full in-SSD computing framework that allows applications to manage data and efficiently make use of IMS. IMS implements the TCAM-based search primitive by modifying the voltages applied to each wordline (i.e., row) in the array, and by storing data along the bitlines of the NAND flash string. TCAM-SSD makes four key changes to a conventional SSD to enable applications to effectively utilize IMS. First, TCAM-SSD seamlessly enables searchable fields within a data region to be stored in the SSD using the column-oriented format required by IMS, while maintaining a complete and coherent copy of the data region in the SSD using the conventional row-oriented format. This allows TCAM-SSD to enable efficient bulk searches across the data region without requiring the programmer to explicitly manage data transposition. Second, TCAM-SSD incorporates an efficient firmware metadata structure for searchable data regions and their results. Third, TCAM-SSD exposes an NVMe 2.0 (NVM Express, Inc., 2021a) compliant command interface that applications and/or system drivers can use to perform computing tasks parallelized across many NAND flash arrays. Together, these changes enable the ability to coherently modify searchable data while maintaining the ability to perform standard SSD operations (e.g., read/write). Fourth, TCAM-SSD introduces multiple optimizations that reduce unnecessary data movement between the NAND flash arrays and the SSD firmware processor. TCAM-SSD requires no modifications to the NAND flash arrays, and requires only lightweight changes to the array peripheral circuitry and the SSD firmware.

While TCAM-SSD can provide performance and data movement improvements across many application domains, we focus on three examples to demonstrate data structures and semantics that can make use of associative search. We evaluate the benefits of TCAM-SSD over a system with a conventional SSD for these use cases. For transactional databases, TCAM-SSD’s associative search can mitigate the performance penalties of accessing the disk, with a 60.9% speedup. For database analytics, TCAM-SSD provides an average speedup of 17.7× for a collection of analytical queries. For graph analytics, TCAM-SSD combines associative search with a sparse data structure, providing speedups for larger-than-memory datasets of 14.5%.

We make the following contributions in this work:

  • We introduce the first full framework for in-SSD computing using IMS.

  • We modify the SSD firmware to perform associative search operations alongside conventional I/O requests.

  • We provide an NVMe-compatible interface for applications to perform bulk parallel associative search.

  • We explore three use cases to show data structures and application semantics for in-SSD computing.

2. Background

2.1. NAND Flash SSD Organization

Modern NAND-flash-memory-based solid-state drives (SSDs) are typically divided into a front end and a back end. The front end handles (1) interfacing to the host; (2) managing the flash translation layer (FTL), the firmware layer responsible for mapping the host’s logical block address for a piece of data to an SSD-internal physical address; and (3) dispatching I/O requests to the back end memory subsystem, which contains the NAND flash memory chips. Due to the complexity of managing these requests and the FTL, the front end typically executes on a dedicated microcontroller and includes dedicated DRAM.

The back end is divided into multiple channels, which can execute independent I/O operations in parallel. Additional parallelism is attained by connecting multiple NAND flash memory chip packages to each channel (known as way pipelining or package interleaving). Furthermore, within each chip package, different dies can support interleaved commands (die interleaving). Although dies are the smallest superstructure the front end can issue an independent command to, each die may contain multiple planes, which may operate in parallel when the die is issued specific multi-plane operations. Finally, each plane is composed of multiple blocks.

A NAND flash block consists of multiple rows of cells, each of which is formed by connecting multiple cells together via a wordline, as seen in Figure 1. The wordlines ($WL_j$) of a block are connected together via shared bitlines ($BL_k$), which connect a vertical column of NAND flash memory cells to a shared peripheral circuit that can perform I/O. Each row contains one or more pages of data, and only a single page per block can be accessed at a time. Read and write operations are performed at a page granularity, while erase operations are performed on an entire block at once. To take advantage of the internal parallelism, pages from different channels that share the same address offsets can be connected into a larger virtual structure called a superpage that spans multiple blocks, with the blocks collectively referred to as a superblock.

Figure 1. Left) NAND flash organization. Right) NAND flash blocks composed of ground select line (GSL), horizontal wordlines (WL), vertical bitlines (BL), string select line (SSL), and NAND flash cells.

A NAND flash cell consists of a transistor with a floating gate that can hold charge. The absence or presence of charge on the floating gate corresponds with the cell state (i.e., its stored data value) and sets the threshold voltage ($V_{\text{th}}$) for the cell. In a single-level cell (SLC), the cell can store two states (0 or 1) that correspond to two discrete windows of $V_{\text{th}}$ levels. To increase density and reduce the cost per bit, NAND flash manufacturers often store multiple bits in a single cell. For example, a multi-level cell (MLC) can store four states (00, 01, 10, 11) to represent two bits, while triple-level cells (TLC) and quad-level cells (QLC) can store three and four bits, respectively.

During a read, a voltage is applied to the top of the bitline and to both the ground select and string select lines, and the state of each cell in a row is read based on whether or not the transistor allows current to flow through it. To read stored data from an SLC NAND flash memory block, a read reference voltage $V_{\text{read}}$, which is between the two $V_{\text{th}}$ levels, is applied to the wordline being read. If the cell is set to high $V_{\text{th}}$, $V_{\text{read}}$ is insufficient to turn on the transistor, resulting in a logical 0 being read out. However, if the cell is set to low $V_{\text{th}}$, $V_{\text{read}}$ turns on the transistor, resulting in a logical 1 being read out. To ensure that current flows through the other cells in the same column, all other wordlines are driven to $V_{\text{pass}}$, a voltage sufficient to turn on transistors regardless of the stored state (i.e., $V_{\text{pass}} >$ high $V_{\text{th}}$). For an $n$-bit cell, we need to apply up to $n$ read reference voltages for each read, which significantly increases the read latency. To mitigate this increase, many systems use an SLC cache (i.e., a region of cells that hold only a single bit) to improve performance (Yang et al., 2016).
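
The SLC read procedure above can be modeled as a chain of series transistors along one bitline. The following is a minimal sketch, not a device model: the voltage values and the names `conducts` and `read_bit` are illustrative assumptions that do not come from the paper.

```python
# Illustrative (assumed) voltage levels in volts; real devices differ.
V_TH_LOW, V_TH_HIGH = 1.0, 3.0   # low V_th reads as 1, high V_th reads as 0
V_READ = 2.0                      # sits between the two V_th levels
V_PASS = 5.0                      # above the high V_th


def conducts(v_th: float, v_gate: float) -> bool:
    """A cell's transistor turns on when the gate voltage exceeds its V_th."""
    return v_gate > v_th


def read_bit(string_v_th: list[float], row: int) -> int:
    """Read one cell of a NAND string: V_read on the selected wordline and
    V_pass on every other wordline. Current flows (logical 1) only if every
    transistor in the string is on."""
    on = all(
        conducts(v_th, V_READ if i == row else V_PASS)
        for i, v_th in enumerate(string_v_th)
    )
    return 1 if on else 0


# One bitline with four cells storing 1, 0, 0, 1 from top to bottom.
string = [V_TH_LOW, V_TH_HIGH, V_TH_HIGH, V_TH_LOW]
bits = [read_bit(string, r) for r in range(len(string))]
```

Because the unselected cells always conduct under $V_{\text{pass}}$, the outcome of each read depends only on the selected cell, which is why one row can be read at a time.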

To perform a write, $V_{\text{pass}}$ is applied to all the wordlines except for the page to be written, which is driven to a programming voltage $V_{\text{program}}$. $V_{\text{program}}$ is much higher than $V_{\text{read}}$ and $V_{\text{pass}}$, which results in electrons being injected into the floating gate. Once electrons have been injected, it is not possible to remove them until an erase operation is performed. Therefore, prior to storing data, the cell must be erased by applying a large negative voltage ($V_{\text{erase}}$), which causes $V_{\text{th}}$ of the cell to be set low. Multi-bit cells use more complex (and slower) methods to perform programming, such as two-step programming (Park et al., 2008; Cai et al., 2014) or foggy–fine programming (Cai et al., 2017a, 2015a).

2.2. Associative Memories

In many systems, searching for specific data elements can quickly become an expensive operation even with additional metadata (e.g., indexes, hash functions). Although tree structures provide significant improvement over linearly scanning, the structure still incurs a logarithmic search time. Hashing, with constant search time, can lead to collisions that hinder performance. To improve the performance of data retrieval, systems can employ dedicated hardware to provide constant-time searching without aliasing. A content-addressable memory (CAM) (Batcher, 1974; Kohonen, 1980; Wade and Sodini, 1987) contains an array of data elements that can be indexed directly by data value (as opposed to the address-based indexing employed by caches). In general, a CAM executes a lookup (i.e., a query) in the form of a parallel search across all data elements in an array by broadcasting the desired content value (i.e., a key). The CAM lookup returns a list (often in the form of a bitvector) that indicates which of its entries, if any, contain the key. CAMs are well suited for applications in which the location and presence of the data is unknown, e.g., translation lookaside buffers (TLBs), associative caches, database operations (Sun et al., 2017).

A ternary content-addressable memory (TCAM) extends the concept of a CAM by introducing a wild card bit (i.e., a don’t care bit; represented as X). In a TCAM, a bitline whose query value is set to X will match either a 0 or a 1 for that bit. For example, a single TCAM lookup for 01X0 retrieves a list of matches corresponding to 0100, 0110, and 01X0, while a non-ternary CAM would have to perform two lookups (a search for 0100, followed by a search for 0110). Due to their additional capabilities, TCAMs have been used in packet classification (Ravikumar and Mahapatra, 2004), network intrusion detection (Graves et al., 2019), and information retrieval (Li et al., 2014).
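
The ternary matching rule can be illustrated in software. The sketch below is a sequential model of what a TCAM evaluates in parallel across all entries; the function names `ternary_match` and `tcam_lookup` are hypothetical, and the example reuses the 01X0 key from the text.

```python
def ternary_match(key: str, word: str) -> bool:
    """True if, at every position, the key bit is X, the stored bit is X,
    or the two bits are equal."""
    return len(key) == len(word) and all(
        k == "X" or w == "X" or k == w for k, w in zip(key, word)
    )


def tcam_lookup(key: str, entries: list[str]) -> list[int]:
    """Return a match bitvector with one flag per stored entry."""
    return [int(ternary_match(key, e)) for e in entries]


entries = ["0100", "0110", "01X0", "0111", "1100"]
match_vector = tcam_lookup("01X0", entries)
# A non-ternary CAM would need two exact lookups to cover the same key.
exact_a = tcam_lookup("0100", ["0100", "0110"])
exact_b = tcam_lookup("0110", ["0100", "0110"])
```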

A TCAM is typically implemented using SRAM-like structures (Pagiamtzis and Sheikholeslami, 2006; Jeloka et al., 2016). Unfortunately, this comes with significant density and power costs (Goel and Gupta, 2010; Arsovski et al., 2003). For every additional data element (e.g., word) in a TCAM, both the static power and the dynamic power per match increase (because a CAM must look up all of its contained data elements for every search). Due to the overheads associated with SRAM-based TCAM, alternative TCAM devices have been proposed using emerging non-volatile memories (NVMs) (Matsunaga et al., 2009; Derharcobian and Murphy, 2010; Eshraghian et al., 2010; Rajendran et al., 2011; Xu et al., 2010; Alibart et al., 2011; Matsunaga et al., 2011; Yin et al., 2019; Zha and Li, 2020; Narla et al., 2023). NVM-based TCAMs are significantly more area- and power-efficient (Graves et al., 2020). In this work, we focus on implementing TCAM using NAND flash memory, where we can attain a high storage density using more mature technology than emerging NVMs, with scalability and static power benefits over SRAM-based TCAM.

3. The TCAM-SSD Framework

We introduce a framework for performing associative search inside NAND-flash-memory-based solid-state drives (SSDs), which we call TCAM-SSD. By enabling in-SSD associative search, TCAM-SSD can quickly identify relevant pieces of data without having to send the entire dataset to the CPU, significantly reducing the I/O traffic required for modern applications to manage and process very large datasets. Unlike prior works, TCAM-SSD enables in-NAND-flash computation without making any modifications inside the NAND flash arrays within a NAND flash block, and exposes an interface that is compatible with popular SSD protocols (and uses existing protocol hardware). In hardware, TCAM-SSD requires only minimal modifications to the array periphery to support the in-array search. We modify the embedded firmware to logically partition the SSD into data regions, which store file blocks in a conventional manner, and search regions, where we use an efficient transposed data layout to enable rapid, highly-parallel associative search across data elements.

3.1. High-Level Overview

Figure 2 shows the front-end interface for TCAM-SSD. TCAM-SSD aims to eliminate two types of data movement required by conventional drive reads: CPU–FE (front end) and FE–BE (back end). Applications interact with TCAM-SSD through drive-level commands that we introduce as extensions to the standard NVMe protocol. An application uses one of these commands to allocate a new search region. Our modified FTL performs block-level allocation for the search region, and data elements are written to the NAND flash chips in a vertical manner (i.e., the bits of a word are written to the same bitline of a NAND flash block, with the bits distributed across different wordlines). TCAM-SSD provides additional commands that can update a data element, and can append new data to a search region. These regions, and the keys that they contain, are efficiently tracked in the firmware.

Figure 2. TCAM-SSD front end (modules introduced for TCAM-SSD are shown in orange).

To perform a ternary search (i.e., a search that either looks for a matching value for each bit or treats a bit as a don’t care) over one or more search regions, the application issues a search command over NVMe to the firmware (step 1 in Figure 2). The firmware schedules search commands for the NAND flash chips in the back end (step 2). The command uses per-wordline read reference voltages to represent each bit of the search key (or whether a bit is a don’t care), and passes these voltages to the NAND flash blocks that contain the search regions requested by the command (step 3). The NAND flash blocks, using modified peripheral circuitry to support per-wordline voltages, then issue a chip-level SRCH command that can concurrently search through thousands of data elements in the block at once. The output of each bitline indicates whether the word stored along the bitline is a match. Combined with block-level parallelism, a single SRCH command can search over tens of thousands of data elements simultaneously. The list of matches is returned to a search manager that we add to the firmware (step 4), which uses a metadata table to decode where the specific data lies (step 5). The firmware then schedules and issues read requests for only the matching data (step 6), and the matching data is returned to the host (step 7).

3.2. Implementing the Search Primitive

We implement a primitive based on IMS (Tseng et al., 2020) for ternary search (i.e., for each bit, match a 0 or 1, or ignore that bit) within a conventional NAND flash block. We do this by modifying the layout and programming pattern of the cells in a NAND flash block, and by changing the $V_{\text{read}}$ and $V_{\text{pass}}$ voltages that are applied to the block’s wordlines. Note that our approach requires zero changes to the layout or design of the cell array within the block. Even with our changes, a block can still perform conventional read, program, and erase operations, allowing it to continue functioning as storage.

Storing a Bit

To enable ternary search, we logically (but not physically) combine two NAND flash cells, which share the same bitline and sit in adjacent rows to one another, to represent a single bit of data. We can use these two cells to store a bit value 0 (Figure 3a), a bit value 1 (Figure 3b), or a bit value X (representing a bit that will match either a 0 or a 1; i.e., a bit that can be ignored). For each NAND flash cell, we use the same $V_{\text{th}}$ states as conventional SLC cells.[1]

[1] While TCAM-SSD can make use of multi-level cells, we use SLC, without loss of generality, to simplify our descriptions.

Figure 3. Associative search in a NAND flash array.

Mapping Data in a Block

Unlike conventional SSDs, which store a bit of data in a single cell and store all bits of a data element across cells that share a wordline within a NAND flash block, TCAM-SSD transposes how data is stored for blocks within a search region.[2] All of the bits of a data element are stored along the same bitline, with each bit stored across two adjacent cells. For a typical NAND flash block where the block spans 128–512 rows (Cai et al., 2017a), this allows us to store data elements as large as 63–255 bits, with the last bit used to represent whether the element is valid.[3]

[2] Blocks in data regions still use the conventional data mapping.
[3] Note that we can store and search for shorter data elements, by setting the search key bits to X for the bits that do not belong to the data elements.
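
The 63–255 bit range follows from the layout arithmetic: two cells per stored bit, with one bit position reserved for the valid flag. A small sketch of that calculation, where the function name `max_element_bits` is a hypothetical helper:

```python
def max_element_bits(wordlines_per_block: int) -> int:
    """Largest data element that fits along one bitline: two cells per
    stored bit, minus one bit position reserved for the valid flag."""
    return wordlines_per_block // 2 - 1


sizes = {rows: max_element_bits(rows) for rows in (128, 256, 512)}
```

For a 128-row block this gives 128 / 2 - 1 = 63 bits, and for a 512-row block 512 / 2 - 1 = 255 bits, matching the range quoted above.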

To reduce the complexity of programming and interacting with our transposed representation, and to avoid introducing cell-to-cell program interference, we allocate an entire block of NAND flash memory at once for those blocks that we will perform search on. Despite performing block-level allocation, TCAM-SSD uses existing page-level programming operations to write data to the block, handling transposition transparently to the programmer (see Section 3.3).

Checking a NAND Flash Block for Matches

Our transposed organization allows us to search an entire block at once, by using read reference voltages to represent the bits of the search key. Figure 3c shows an example for a $6 \times 3$ NAND flash block. Each column represents a different data element, with pairs of rows representing one bit of the element. Across Bit 0 (i.e., the first two rows) of each data element, we search for a bit value 1 by applying $V_{\text{read}}$ to the first row and $V_{\text{pass}}$ to the second row. Bit 1 shows how to ignore a bit (i.e., match either a 0 or a 1), and Bit 2 shows how to search for a bit value 0. For a given search key, TCAM-SSD chooses and applies the correct wordline voltages for each bit in the key, using a new chip command that we introduce called SRCH. SRCH contains one bit per wordline to indicate the search key (i.e., whether to apply $V_{\text{read}}$ or $V_{\text{pass}}$), and we replace the block’s voltage selection decoder with per-wordline 2:1 muxes, similar to proposals from prior work (Cai et al., 2012, 2013, 2015b, 2017b).

When SRCH is performed on a block, a bit value 1 is applied at the top of each bitline, identical to how reads are performed. If all bits of a data element match the search key, all transistors along the bitline turn on, and the 1 propagates to the bottom of the bitline. If any bit does not match, one of that bit’s transistors will be off (e.g., orange transistors in Figure 3c), and the 1 stops propagating. SRCH searches all data elements in a block simultaneously, and returns a match vector (i.e., the bitline outputs) as a single value. Note that the last cell in each bitline stores a valid bit (0 if valid, 1 if invalid; not shown in the figure), and $V_{\text{read}}$ is applied to this wordline, preventing invalid elements from matching.
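
The block-level search can be simulated with a few lines of code. This is a minimal sketch under stated assumptions: the text specifies only the wordline voltages for searching a 1 ($V_{\text{read}}$ on the first row, $V_{\text{pass}}$ on the second), so the cell-pair encoding in `STORE`, the voltages for searching 0 and X in `DRIVE`, and the modeling of the valid cell as low $V_{\text{th}}$ when valid are all assumptions chosen to be consistent with that description (the exact encoding is in Figure 3, which is not reproduced here). All names are hypothetical.

```python
LOW, HIGH = 0, 1                 # V_th state of a cell
READ, PASS = "Vread", "Vpass"    # voltage selected for a wordline

# Assumed cell-pair encoding (first row, second row) of a stored bit.
STORE = {"1": (LOW, HIGH), "0": (HIGH, LOW), "X": (LOW, LOW)}
# Wordline voltages per key bit. The "1" case follows the text; the "0"
# and "X" cases are assumed by symmetry.
DRIVE = {"1": (READ, PASS), "0": (PASS, READ), "X": (PASS, PASS)}


def program_column(element: str, valid: bool = True) -> list[int]:
    """Transposed layout: one element per bitline, two cells per bit, plus
    a final valid cell (assumed to hold low V_th when the element is valid)."""
    cells = [c for bit in element for c in STORE[bit]]
    return cells + [LOW if valid else HIGH]


def wordline_voltages(key: str) -> list[str]:
    """Per-wordline selection carried by SRCH, with V_read on the valid row."""
    return [v for bit in key for v in DRIVE[bit]] + [READ]


def srch(block: list[list[int]], key: str) -> list[int]:
    """A bitline matches only if every cell conducts: V_pass always turns a
    cell on, while V_read turns it on only when its V_th is low."""
    volts = wordline_voltages(key)
    return [
        int(all(v == PASS or c == LOW for c, v in zip(column, volts)))
        for column in block
    ]


block = [
    program_column("101"),
    program_column("100"),
    program_column("111"),
    program_column("101", valid=False),
]
match_vector = srch(block, "1X1")
exact_vector = srch(block, "100")
```

In hardware, all columns are evaluated at once by the array itself; the list comprehension over `block` only stands in for that parallelism.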

3.3. Managing Searchable Data in Firmware

To maintain the ability to perform baseline SSD functionality as well as our new associative search functionality, TCAM-SSD splits the SSD into two types of regions, data and search. Data regions behave exactly the same as data currently does in a conventional SSD, relying on conventional reads, programming, and erase. Search regions can be allocated only through a new command that we introduce (Section 3.4), where the size of the region is based on the number of data elements and the element length.

Note that we do not limit the length of a data element to the number of NAND flash cell pairs along a bitline in a block ($\frac{\text{wordlines per block}}{2}$, which we define as the native element size). In cases where we need longer data elements, a search region can be extended across multiple blocks, with each data element split across the blocks. While this may require additional search operations, the resulting match vectors from each block can be ANDed together to form a final match vector prior to decoding and accessing the data region. Alternatively, for large tables, the number of keys in a search region may exceed the number of columns in a single block (page size), in which case the match vectors from the blocks must be concatenated together.
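
The two ways of combining per-block match vectors can be sketched as follows. The function names `and_vectors` and `concat_vectors` are hypothetical, and the vectors are plain Python lists rather than the firmware's actual representation.

```python
def and_vectors(per_block: list[list[int]]) -> list[int]:
    """Elements split across blocks: an element matches only if every one
    of its slices matched, so the per-block vectors are ANDed bitwise."""
    return [int(all(bits)) for bits in zip(*per_block)]


def concat_vectors(per_block: list[list[int]]) -> list[int]:
    """More keys than columns in one block: each block holds a different
    range of elements, so the per-block vectors are concatenated."""
    return [bit for vec in per_block for bit in vec]


upper_slice = [1, 1, 0, 1]   # matches on the first half of each element
lower_slice = [1, 0, 0, 1]   # matches on the second half
wide_match = and_vectors([upper_slice, lower_slice])
tall_match = concat_vectors([upper_slice, lower_slice])
```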

Linking Search Regions to Data Regions

Each search region is connected to a data region, and the mapping between the regions is maintained in a software-controlled link table. For each data element in the search region, the data region contains a corresponding data entry. Both data elements and data entries have a fixed length, allowing the link table to store only a single base physical address per block in the data region (along with a pointer to a firmware buffer for updated values; see Section 3.4). The TCAM-SSD firmware can add an offset to the base address to look up specific data entries corresponding to matching data elements.
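
Because entries have a fixed length, decoding a match vector reduces to base-plus-offset arithmetic. The sketch below assumes that the i-th element in a search block corresponds to the i-th entry of its linked data block; the `LinkEntry` fields, the byte-granular addresses, and the example values are illustrative assumptions rather than the paper's firmware layout.

```python
from dataclasses import dataclass


@dataclass
class LinkEntry:
    data_base_addr: int   # base physical address of the linked data block
    entry_size: int       # fixed data-entry length, in bytes (assumed unit)


def matching_entry_addrs(link: LinkEntry, match_vector: list[int]) -> list[int]:
    """Translate the set bits of a match vector into data-entry addresses
    by adding index * entry_size to the block's base address."""
    return [
        link.data_base_addr + i * link.entry_size
        for i, hit in enumerate(match_vector)
        if hit
    ]


link = LinkEntry(data_base_addr=0x4000, entry_size=64)
addrs = matching_entry_addrs(link, [0, 1, 0, 1])
```

Only the addresses in `addrs` would then need to be read from the data region, which is how the search avoids transferring non-matching entries.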

The contents of the linked data entry are application dependent. For example, if an application directly wants the value of the matching data element, its corresponding data entry contains a non-transposed replica of the element value.[4] If we want to implement a key–value store (KVS), which stores tuples of keys and values, each data element in the search region would correspond to the key, and its corresponding entry in the data region would hold the value (and, if desired, a non-transposed copy of the key).

[4] By replicating the data in a conventional wordline-oriented data region, we can read an $n$-bit data element’s value out in a single cycle, instead of performing $n$ reads (one for each bit) from the transposed search region.

Support for Block-Level Allocation

TCAM-SSD requires modifications to the flash translation layer (FTL) to implement search. Most importantly, search regions use block-level allocation rather than page-level allocation, since pages within a search block must be allocated contiguously. This requirement applies only to search regions: data regions can continue to use the underlying FTL implementation. In a conventional SSD, superblocks are formed from a collection of blocks, usually from different chips, with the same block offset. Accordingly, prior work has proposed superblock FTL designs (Jung et al., 2010). TCAM-SSD is amenable to this type of allocation scheme, as it enables the system to search over an entire superblock in parallel.

3.4. TCAM-SSD Command Interface

To interface with TCAM-SSD, we propose a TCAM-specific NVMe 2.0 compliant command set specification (NVM Express, Inc., 2021b). The proposed commands are similar to the KVS command set ratified as part of NVMe 2.0 (NVM Express, Inc., 2022), and take advantage of the ability to add vendor-specific functionality to the interface. The commands described below are sufficient to implement a basic set of associative computing functionality; however, more advanced commands may be needed to support a wider range of functionality in the future (e.g., updating data elements in place).

Allocate / Deallocate / Append

Since search regions are managed separately from data regions, they must be specially allocated and managed. The Allocate command creates a search region based on the data element size, and a linked data region based on the data entry size, as discussed in Section 3.3. The command can optionally provide data for the search region by providing a pointer to the host memory that stores elements and entries. The Deallocate command frees a search region by marking all blocks for erase.

The Append command is used to add elements of the same size to an existing search region, along with their corresponding data entries to the data region. The firmware stores the new data element (along with its corresponding data entry) in a software buffer. Once there are enough elements in the software buffer to fill an entire block, one new block each is appended to the search region and the data region, and the elements and their data entries are transferred out of the buffer and written to the drive. The firmware then adds the new mapping to the link table.
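The buffering behavior of Append can be sketched as follows. This is a toy Python model under stated assumptions: the class name `AppendBuffer` is hypothetical, flash writes are represented by appending to an in-memory list, and link-table maintenance is omitted.

```python
from typing import List, Tuple

Pair = Tuple[int, bytes]  # (data element, data entry)


class AppendBuffer:
    """Hypothetical model of the firmware buffer behind Append."""

    def __init__(self, elements_per_block: int) -> None:
        if elements_per_block <= 0:
            raise ValueError("elements_per_block must be positive")
        self.elements_per_block = elements_per_block
        self.pending: List[Pair] = []
        self.flushed_blocks: List[List[Pair]] = []  # stands in for flash writes

    def append(self, element: int, entry: bytes) -> bool:
        """Buffer one pair; return True if this call filled and flushed a block."""
        self.pending.append((element, entry))
        if len(self.pending) < self.elements_per_block:
            return False
        # A full block is available: write it out (modeled as a list append)
        # and start a new, empty buffer.
        self.flushed_blocks.append(self.pending)
        self.pending = []
        return True
```

Keeping each element paired with its entry in the buffer preserves the ordering requirement between the search region and the data region.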

Importantly, allocate and append require that the order of data elements is the same as the order of data entries in the linked data region. This requirement must be managed by the host application.

Simple Search / Search / Search Continue

Once data elements and corresponding data entries are written, the host can issue search commands to the SSD. For simplicity, we describe the Simple Search command, which contains a fixed-length search key (up to 127 bits), the address of the search region, and a pointer to a host buffer where return data can be stored. Alternatively, if the native element size exceeds 127 bits, the host may issue a Search command, which uses a data pointer to communicate the search key to the SSD. Both the Search and Simple Search commands can also communicate additional operations to complete with the resultant search data, such as logical AND and OR reductions between shorter keys. The firmware issues one or more chip-level SRCH commands to the selected chips to perform the bulk parallel search, where each command invocation returns a match vector. The firmware uses each match vector, along with a base address from the link table, to calculate the addresses for data entries in the data region that correspond to matches. The firmware then issues read commands to these addresses, and writes the returned data to the host buffer.

Note that the Search command does not know how many tuples will match, and therefore may not allocate sufficient host buffer space to store all of the returned data entries. To address this, a flag is added to the completion queue entry, notifying the host that the host buffer was inadequately sized. Upon receiving this signal, the host may issue a Search Continue to the same search region address with a new host buffer to complete the data transfer issued by a prior search.
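The interaction between Search and Search Continue can be sketched from the host's point of view. The Python model below is illustrative only: `SearchRegionModel` is a hypothetical stand-in for the drive, the buffer capacity is counted in entries rather than bytes, and the boolean it returns plays the role of the completion-queue flag.

```python
from typing import List, Tuple


class SearchRegionModel:
    """Toy stand-in for a drive holding the matches of one search."""

    def __init__(self, matching_entries: List[bytes]) -> None:
        self._matches = matching_entries
        self._cursor = 0  # index of the next match not yet returned

    def search(self, buffer_capacity: int) -> Tuple[List[bytes], bool]:
        """Start a search; return (entries, buffer_too_small flag)."""
        self._cursor = 0
        return self.search_continue(buffer_capacity)

    def search_continue(self, buffer_capacity: int) -> Tuple[List[bytes], bool]:
        """Resume the previous search with a fresh host buffer."""
        chunk = self._matches[self._cursor:self._cursor + buffer_capacity]
        self._cursor += len(chunk)
        return chunk, self._cursor < len(self._matches)


def fetch_all(region: SearchRegionModel, buffer_capacity: int) -> List[bytes]:
    """Host-side loop: reissue Search Continue until the flag clears."""
    if buffer_capacity <= 0:
        raise ValueError("buffer_capacity must be positive")
    results, more = region.search(buffer_capacity)
    while more:
        chunk, more = region.search_continue(buffer_capacity)
        results.extend(chunk)
    return results
```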

Delete

When the Delete command is invoked to remove a data element and its corresponding data entry, the firmware first searches for the data element in the search region. The firmware then invalidates all matching data elements, by using normal chip-level page commands to read and update the element valid bits for each block containing a match. This invalidation is written in place, since it involves only raising $V_{\text{th}}$ of one cell per match from a 0 to a 1.

Updating an existing data element (or its associated data entry) involves first calling Delete to remove the old value, and then calling Append. While such updates are costly, we find that they are infrequent (or non-existent) for many target applications, which tend to use relatively stable datasets.

3.5. Software Support

We wrap the TCAM-SSD NVMe commands from Section 3.4 into a programmer-friendly API. This API can be used in two modes: (1) NVMe Mode, where data is returned to the host CPU for computing; and (2) Associative Update Mode, where data is not transferred over NVMe, and bulk updates are performed directly in the SSD on matching data.

NVMe Mode

Listing 1 shows an example application where we search a dataset for all salary records of employees named “Bob”. Once the data entry is brought to the host, the programmer is able to modify the matching records (e.g., give all Bobs a raise), and send the updates back to the SSD.

Listing 1: NVMe Mode
Array employees, firstName;
// create Search Region
void *sr =
  AllocSearchable(firstName, employees);
// retrieve data for any Bob from SSD
void *dataEntry =
  SearchSearchable(sr, "Bob");
// update salaries for all Bobs in host
UpdateSalary(dataEntry, newSalary);
// send modified salaries back to SSD
UpdateSearchVal(sr, "Bob", dataEntry);

Listing 2: Associative Computing Mode
// variables are pre-allocated
// matches for Bob are moved to SSD DRAM
void *bA =
  SearchSearchable(sr, "Bob", capp=true);
// sets desired operation for each match to add 1
void op1 = +1;
// perform modification to each match in SSD DRAM
UpdateSearchVal(sr, bA, op1, capp=true);
DeallocSearchable(sr, capp=true);

Associative Update Mode

Listing 2 shows the same example using associative update mode. In this mode, the search operation brings the matching entries for “Bob” into a DRAM buffer inside the SSD. Using the associative update command, the programmer can send an operation (e.g., addition) and an immediate value to the SSD, which then bulk updates all matching records. Notably, this mode does not require moving the records between SSD and host.

3.6. Hardware Optimizations

We propose four hardware optimizations to improve the efficiency of TCAM-SSD.

3.6.1. Enhancing Reliability

There are several sources of errors that can induce bit flips in data read out from NAND flash chips (Cai et al., 2017a; Shin et al., 2012; Matsui et al., 2017; Huang et al., 2014). While our in-SSD operations intentionally avoid sending data to the firmware microcontroller to minimize FE–BE data movement, this prevents the operations from using the firmware-based error correction techniques to mitigate these read errors, which can introduce false positives and false negatives. TCAM-SSD reduces the impact of these read errors in two ways.

First, TCAM-SSD operations are expected to induce significantly fewer read disturb errors (and, thus, false negatives) than read operations. A key factor in the magnitude of read disturb errors is the voltage applied to the wordline. During a read operation, the firmware applies $V_{\text{read}}$ to one row in an $n$-row NAND flash block and $V_{\text{pass}}$ to the other (i.e., $n-1$) rows. During our search operations, the firmware applies $V_{\text{pass}}$ to only $n/2$ rows, with $V_{\text{read}}$ applied to the other $n/2$ rows. [Footnote 5: A wild card search (i.e., setting one bit to don’t care, or X) replaces one $V_{\text{read}}$ with an additional $V_{\text{pass}}$.] $V_{\text{read}}$ is typically 2–3× smaller than $V_{\text{pass}}$ (Cai et al., 2017a), and prior work has shown that this lower wordline voltage leads to an exponential drop in read disturb induced errors (Cai et al., 2015a) (e.g., a 1300× drop in errors from just a 6% drop in voltage).
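The row counts in this argument can be checked with a small Python sketch. The function names are hypothetical, and the example value of 196 rows borrows the pages-per-block figure from Table 1 under the assumption of one wordline per page in SLC mode.

```python
def pass_voltage_rows_for_read(rows_per_block: int) -> int:
    """A read drives one row at V_read and every other row at V_pass."""
    if rows_per_block < 1:
        raise ValueError("a block needs at least one row")
    return rows_per_block - 1


def pass_voltage_rows_for_search(rows_per_block: int,
                                 wildcard_bits: int = 0) -> int:
    """A search drives half the rows at V_pass; each don't-care bit swaps
    one V_read row for an extra V_pass row."""
    if rows_per_block % 2:
        raise ValueError("search pairs wordlines, so the row count must be even")
    if not 0 <= wildcard_bits <= rows_per_block // 2:
        raise ValueError("wildcard_bits out of range")
    return rows_per_block // 2 + wildcard_bits


# Assumed block with 196 rows: a read stresses 195 rows with V_pass,
# while an exact-match search stresses 98.
print(pass_voltage_rows_for_read(196), pass_voltage_rows_for_search(196))
```

Roughly half as many rows see the higher pass voltage during a search, and wildcard bits narrow that gap one row at a time.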

Second, we employ the enhanced SLC-mode programming (ESP) technique proposed in Flash-Cosmos (Park et al., 2022) with the goal of eliminating retention errors (and, thus, false positives) for the search region. ESP treats NAND flash cells as SLC, and increases the $V_{\text{th}}$ margin between bit value 0 and bit value 1, which reduces the probability that the voltage of a cell programmed to bit value 0 will drop below $V_{\text{read}}$ and be incorrectly read as bit value 1. Notably, prior work has reported that the error rate increases with the number of bits per cell; specifically, Huang et al. (Huang et al., 2014) observe that the MLC error rate is 2 orders of magnitude greater than SLC. Therefore, ESP can substantially improve reliability. While this decreases the bit density of the SSD, we do this only on the search region cells, while data regions (non-TCAM-SSD cells) continue to use multiple levels per cell. [Footnote 6: Non-search region MLC/TLC cells continue to use conventional error-correcting codes (ECC).]

SSDs are usually optimized for high capacity, relying on multi-level cells to increase the bit density of the drive. TCAM-SSD’s reliance on SLC cells can therefore adversely affect total drive capacity; however, only search regions require SLC cells and the linked data region can continue to use MLC/TLC. Thus, the impact on capacity comes solely from the new search regions, which we quantify in our evaluation. Additionally, prior work (Jimenez et al., 2013) has explored using some arrays as an SLC cache, reducing the effective density penalty to only 2× within the search regions. We generalize to all SSDs and report the search region overheads in terms of blocks of total SSD.

3.6.2. Supporting Early (Conditional) Termination

Once a search has been executed on the NAND flash arrays, the data must be read back by the microcontroller. However, this data is a match vector, which must be decoded prior to executing the secondary read. Because we are searching for a particular piece of data among a large dataset, we expect the majority of the match vectors to return no matches. Therefore, storing a match vector of 0s and subsequently decoding 0s to their corresponding value would waste the limited in-SSD DRAM capacity. Accordingly, we propose a mechanism to support early (conditional) termination.

At each flash channel controller, we add a small circuit to quickly decode the match vector as it is read from the back end. In particular, if the data burst is all zeroes, we increment a counter and discard the data burst. If there is a match, then we tag the burst with the value stored in the counter, which is used to later decode the corresponding value.
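The burst filter can be sketched in software as follows. This Python model is a minimal sketch under one assumption that the text leaves open: the counter counts only discarded (all-zero) bursts, so the original position of a kept burst is recovered as its tag plus the number of bursts kept before it. The function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def filter_bursts(bursts: Iterable[bytes]) -> List[Tuple[int, bytes]]:
    """Drop all-zero bursts; tag each kept burst with the zero-burst count."""
    zero_bursts_seen = 0
    kept: List[Tuple[int, bytes]] = []
    for burst in bursts:
        if any(burst):  # at least one match bit is set
            kept.append((zero_bursts_seen, burst))
        else:
            zero_bursts_seen += 1
    return kept


def burst_positions(kept: List[Tuple[int, bytes]]) -> List[int]:
    """Recover each kept burst's original index in the match vector.

    Assumption: original index = tag + number of bursts kept earlier."""
    return [tag + already_kept for already_kept, (tag, _) in enumerate(kept)]
```

Only bursts that contain a match occupy in-SSD DRAM, while the tags retain enough information to locate the matching columns later.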

3.6.3. Write Inversion

Modern NAND flash supports inverse reads, in which the data read from the NAND flash device is the inverse value that is stored in the cell (Park et al., 2022). While initially designed for reads, this functionality can be used as part of the program–verify operation to accelerate writes (Lee et al., 2002). We can make use of write inversion to reduce the amount of command data transmitted to the NAND flash chips. If we restrict our primitive from Section 3.2 to store only bit values 0 and 1 (i.e., we no longer store X values in the SSD, but can still use X as a don’t care in a search value), the two wordlines sharing a single data value in TCAM-SSD are the inverse of each other. [Footnote 7: In our evaluated use cases, we do not find a need to store X values.] Once the program operation for a wordline has been completed using a program–verify operation, a subsequent row address can be supplied without additional write data. If the program operation is then executed, the inverse data will be written to the new wordline. Note that by using write inversion, we can reduce data movement between the firmware processor and the NAND flash arrays by approximately 2× during programming.
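The source of the roughly 2× saving can be sketched in Python. The model below is a simplification: it counts only write data (row addresses and commands are ignored), and the function names are hypothetical.

```python
from typing import List, Tuple


def wordline_pair(column_bits: List[int]) -> Tuple[List[int], List[int]]:
    """Return (first wordline, inverse wordline) for one bit position
    across all columns, assuming only 0/1 values are stored (no X)."""
    if any(bit not in (0, 1) for bit in column_bits):
        raise ValueError("stored values must be 0 or 1")
    return column_bits, [1 - bit for bit in column_bits]


def bits_sent(num_bit_positions: int, columns: int,
              write_inversion: bool) -> int:
    """Write data sent to the chip: two wordlines per bit position without
    inversion, one with it (the second is derived on chip)."""
    wordlines_sent = num_bit_positions * (1 if write_inversion else 2)
    return wordlines_sent * columns
```

With inversion, only the first wordline of each pair carries data over the channel, which is why the saving is approximately, rather than exactly, 2× once addresses and commands are included.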

3.6.4. Data Result Compaction

Associative search operations may return multiple matches. However, the corresponding matches may not be contiguous in the data region. Therefore, a search operation with N matches would return N logical blocks (i.e., pages) of data to the host, resulting in wasteful data movement. Thus, if the maximum data entry size is less than the size of a host logical block, we compact multiple data entries into a few host blocks, which are returned to the host. To enable this optimization, it is necessary to know the size of the corresponding data entries in the data region. This information is provided to TCAM-SSD as part of the Allocate command, and is stored in the link table structure.
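The benefit of compaction can be estimated with a short Python sketch. It assumes that entries are packed back to back and never straddle two host blocks, and the 4,096-byte host block size in the usage note is an illustrative assumption, not a value from the paper.

```python
import math


def host_blocks_without_compaction(num_matches: int) -> int:
    """Each match returns the whole logical block that holds it."""
    return num_matches


def host_blocks_with_compaction(num_matches: int, entry_size: int,
                                host_block_size: int) -> int:
    """Pack whole entries back to back; entries never straddle two blocks."""
    if not 0 < entry_size <= host_block_size:
        raise ValueError("entry must fit inside one host block")
    entries_per_block = host_block_size // entry_size
    return math.ceil(num_matches / entries_per_block)
```

For example, 100 matches with 64-byte entries and an assumed 4,096-byte host block need 2 host blocks with compaction, compared with 100 without it.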

3.7. Comparing In-Storage Compute Techniques

As discussed in Section 3.1, TCAM-SSD aims to eliminate unnecessary CPU–FE and FE–BE data movement. Prior works have proposed two families of in-storage compute techniques that also target these types of data movement.

Computational SSDs (e.g., (Advanced Micro Devices, Inc., 2021; Samsung Electronics Co., Ltd., 2022)) reduce CPU–FE data movement by introducing dedicated compute logic (e.g., FPGAs, embedded IP cores) in either the front end of the SSD or just outside of the SSD’s host interface. Because the introduced logic does not have direct access to the back end NAND flash chips, computational SSDs cannot eliminate FE–BE movement.

In-flash bitwise processing (IFBP) techniques target both CPU–FE and FE–BE data movement by performing processing using NAND flash memory cells. Flash-Cosmos (Park et al., 2022) is a state-of-the-art IFBP technique that performs bulk Boolean operations. By focusing on Boolean operations, Flash-Cosmos restricts its opportunities to reduce FE–BE movement compared to TCAM-SSD. For example, for a table containing 64-bit user entries and a list of accessed websites, Flash-Cosmos can determine the active user count for a given website using PUM, but cannot use PUM to answer which users are active (instead requiring data to be sent back to the CPU); TCAM-SSD can perform both using memory.

While there is some overlap between the functionality of computational SSDs, Flash-Cosmos, and TCAM-SSD, they offer some complementary benefits, and TCAM-SSD can operate in conjunction with both. For example, an application can use TCAM-SSD to execute a portion of the computation within the NAND flash arrays, reducing CPU–FE and FE–BE data movement compared to a CPU-based search. Computational SSD logic can then post-process the search results returned by TCAM-SSD, further reducing CPU–FE movement. Similarly, the Flash-Cosmos inter-block ‘OR’ operation can be leveraged by TCAM-SSD to enable search of data elements larger than the native element size.

3.8. Targeting Use Cases for TCAM-SSD

TCAM-SSD can be broadly applied to a variety of application domains that benefit from associative search. Examples include association rule mining (Agrawal and Srikant, 1994), image processing (Meribout et al., 2000), database analytics (Caminal et al., 2022), hardware reconfiguration (Zha and Li, 2018, 2020), and text processing (Imani et al., 2016). Associative memories have also found uses in packet classification (Spitznagel et al., 2003; Lakshminarayanan et al., 2005), IP routing (Zane et al., 2003), and network intrusion detection systems (Chang et al., 2008). In this work, we demonstrate how to apply and optimize TCAM-SSD for three example use cases: (1) mitigating the disk access overheads of online transaction processing in databases (Section 5.1), (2) reducing the data transferred between storage and the CPU for database analytics (Section 5.2), and (3) improving the processing efficiency of large graphs during graph analytics (Section 6).

4. Experimental Methodology

To evaluate TCAM-SSD, we develop a set of detailed analytical models to capture the low-level hardware and software modifications to the SSD, as well as all critical aspects of the system, including NVMe initialization overheads, DRAM access times, and NAND flash access time. For example, the latency of the SRCH operation (Section 3.1) includes NVMe (1 in Figure 2), translation (2), block-level search to the search region (3), match-vector retrieval and decode (4 & 5), physical access(es) to the data region (6), FE–BE data movement, and CPU–FE data movement (7). Additionally, we include support for channel- and die-level parallelism.

Table 1 lists the parameters for the evaluated 3D NAND flash SSD, based on the configuration in Flash-Cosmos (Park et al., 2022), including the write latency for ESP. The NVMe initialization overhead is set to 4 µs, based on prior work (Tavakkol et al., 2018; Lawley, 2014; Liu et al., 2004). All other parameters are matched to Flash-Cosmos for ease of comparison.

Table 1. 3D NAND flash configuration (Park et al., 2022).

| Parameter | Value |
| --- | --- |
| Parallelism | channel × die |
| Back End | channels = 8, packages/channel = 1, dies/package = 8, planes/die = 2, blocks/plane = 2,048, pages/block = 196, page size = 16 kB |
| DRAM Access Time | 15 ns to retrieve 64 B |
| Flash Access Times | read = 22.5 µs, search = 25 µs, write (SLC/MLC/TLC) = 200 µs/500 µs/700 µs |
| Max Search Size | 128 k keys × channels × dies, native element size = 97 bit |

We conservatively assume that the data in the data region for both the baseline and TCAM-SSD resides in SLC portions of the SSD. This makes the search and read latencies comparable; in contrast, if the data resided in MLC cells, the read latency would be significantly larger than that of the search operation, as TCAM-SSD uses SLC to enable search. Our model captures the effect of channel- and die-level parallelism, allowing multiple in-flight operations across different channels. The approximate latency for standard NVMe operations (e.g., read, program, erase) is computed from the latency of NVMe, translating block addresses to physical addresses, physical page access, FE–BE data movement, and CPU–FE data movement. New TCAM-specific commands incur the overheads of their respective operations. Our analytical model makes two conservative assumptions that adversely affect TCAM-SSD. First, we assume that the NAND flash access time for search is ∼10% higher than that of a read operation, which reflects the increased read latency observed in Flash-Cosmos (Park et al., 2022) (3.3% for intra-block MWS and 3.3% for four-block inter-block MWS). [Footnote 8: The IMS (Tseng et al., 2020) work notes that for a block with 96 wordlines and 128 k bitlines, a reliable search operation can be executed within the read latency.] The baseline read operations are unaffected. Second, multi-block SRCH operations block all internal parallelism needed by the SRCH operation, even when the result is a single data entry. Search regions may span multiple blocks (Section 3.3), requiring a SRCH operation to check all blocks in the region; however, retrieving the associated data entry may only require a single read operation. Conservatively, we block all occupied channels/dies required by the search operation even when a single match is required. Baseline reads are unaffected.

We evaluate two different classes of applications, databases and graphs. For databases, we use DBx1000 (Yu et al., 2014) to generate traces of TPC-C (Transaction Processing Council, 2010) behavior. To evaluate TCAM-SSD for database analytics, we examine queries evaluated in prior work (Gu et al., 2016; Woods et al., 2014). Similarly, we use dbgen (Transaction Processing Council, 2011) to populate a database with data specified by TPC-H (Transaction Processing Council, 2021). Finally, to evaluate graph workloads, we extract a vertex traversal of a single-source shortest path (SSSP) algorithm (Malicevic et al., 2017), and use a collection of synthetic and real-world graphs ranging from geographical data to social networks (see Section 6).

5. Evaluating TCAM-SSD for Databases

Many applications make use of relational databases, which store a series of records into one or more tables. Each table consists of one or more columns, where each column corresponds to a particular attribute, and each record (i.e., row) in a table contains a value per column. For example, a record for a college ID system could contain a student’s ID number, first name, and last name, and the table would hold records for every student (or, depending on the database design, have per-department tables of students). The goal of a database is to allow users to quickly retrieve information about the stored records. There are two common approaches to retrieval. Transaction processing searches particular columns in one or more tables, and returns or updates individual records whose attributes match the search query. Analytics processing scans entire tables and extracts aggregated information about one or more columns across all records.

Relational databases are designed to handle large amounts of data. Scanning through each record in the table to identify matches for a transaction can be time consuming and require a significant amount of I/O. To reduce these costs, database managers generate indexes, which use data structures (e.g., hashes, trees) designed for sub-linear search/lookup times across all of the records, to catalog the contents of one or more table columns (Lehman and Carey, 1986; Zhang et al., 2018). A table can have one primary index, which contains unique pointers for each record in the table, and multiple secondary indexes, whose pointers may correspond to multiple records (and thus require additional processing to retrieve matching records). Note that the exact choice and structure of indexes depend heavily on the workload, and on the specific design of the database.

Ideally, all of the contents of a database (i.e., all indexes and all tables) would fit in main memory, to reduce the latencies of index traversal and data lookup (Stonebraker and Weisberg, 2013; Kallman et al., 2008). However, as datasets continue to grow, it is becoming increasingly difficult to keep the tables resident in memory, and for large enough databases, main memory will not even be able to hold all of the indexes. In a conventional system, this would result in a significant latency penalty for both transactions and analytical queries, as multiple SSD read operations would be required. We propose to use TCAM-SSD to significantly reduce the latencies involved with large-footprint databases.

5.1. Online Transaction Processing Workloads

An online transaction processing (OLTP) workload provides data in response to an end user executing a series of transactions (Kemper and Neumann, 2011; Yu et al., 2014), where each transaction consists of a series of small, simple queries. OLTP workloads are often generated by user interaction, and thus require high throughput and low latency, so database designers attempt to fit the entire database in memory (Stonebraker and Weisberg, 2013; Kallman et al., 2008). Unfortunately, this is difficult to do for many larger databases, and while several optimizations attempt to maximize main memory usage in such scenarios (e.g., (Diaconu et al., 2013; Eldawy et al., 2014; DeBrabant et al., 2013)), metadata pressure and growing dataset sizes undermine these optimizations and often induce a high amount of SSD I/O operations (Leis et al., 2018; Stoica and Ailamaki, 2013; Zhang et al., 2016).

When the database is too large to fit in memory, TCAM-SSD can accelerate a transaction by performing a bulk associative search across an entire column of the database. Figure 4 shows an example of two tables (adapted from Aho et al. (Aho and Ullman, 1992)): one containing records of faculty members, and another containing records of students. Instead of using dedicated hash or tree data structures to index a column, TCAM-SSD creates a search region for each indexable column, where the region consists of the data stored for each record under that column. For example, in the Faculty table, if the database management system (DBMS) wants to index the Group column, the host reads and transposes the column’s contents, and then calls the TCAM-SSD Allocate/Append command to write the transposed values into a newly-allocated search region. [Footnote 9: Note that with computational SSD support, we could perform the transposition in the SSD instead of at the host, eliminating CPU–FE movement to generate the index.] Now, whenever a transaction wants to search for records using the Group column, the DBMS can simply call the TCAM-SSD Search command, which performs a single-cycle parallel lookup across the entire index, and copies any matching records from the data region (which in this case is the already-stored database table) to the host buffer.
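The host-side transposition step can be sketched in Python. This minimal sketch covers only the bit transposition of a column of integer values; it does not model the two-wordline encoding of Section 3.2, and the function name `transpose_column` is hypothetical.

```python
from typing import List


def transpose_column(values: List[int], bits_per_value: int) -> List[List[int]]:
    """Turn a column of integers into bit rows.

    Row i holds bit i (most significant first) of every value, so each
    value ends up laid out along one column of the result."""
    if any(not 0 <= v < (1 << bits_per_value) for v in values):
        raise ValueError("value does not fit in bits_per_value bits")
    return [[(v >> (bits_per_value - 1 - i)) & 1 for v in values]
            for i in range(bits_per_value)]
```

Each returned row corresponds to one bit position across all records, which is the orientation that lets a single search compare a key against every record at once.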

Figure 4. Search/data region mapping of database tables.

With TCAM-SSD, the DBMS can easily create multiple indexes, with each getting its own search region. As shown in Figure 4, our example Student table has indexes for both StudentID and Last. Both of these search regions’ link table entries point to the same data region, avoiding the need to replicate the database tables. TCAM-SSD can also support meta-index generation (e.g., it can store an index with the length of each last name).

Methodology

We use the TPC-C database (Transaction Processing Council, 2010) to evaluate how TCAM-SSD optimizes an OLTP system. We generate a trace of 1 M transactions for an OLTP workload using the DBx1000 DBMS (Yu et al., 2014), where the queries within the transactions use a mixture of indexes. We scale the number of records in the table by 100, resulting in 3 M entries. In our setup, the database is stored in the SSD, mimicking a setup where the database is too large to fit entirely in memory. For the baseline database running on a conventional SSD, all indexes are stored in main memory.

Our workload uses one secondary index (LastName), which the baseline system organizes as a hash index. This leads to a number of collisions, where each collision needs to retrieve multiple pages of data from the SSD to check for exact matches. TPC-C records include a warehouse ID, but because the first query in each transaction limits the search to only a single warehouse ID, we choose TCAM-SSD’s search regions for the database to correspond to a single warehouse.

Results

We observe that over the course of the workload, TCAM-SSD achieves a speedup of 60.9% over DBx1000. To understand the source of this speedup, we examine which queries run faster on TCAM-SSD, and which queries run faster on DBx1000. We determine that TCAM-SSD is faster than DBx1000 whenever a query needs to retrieve more than 3 pages from the SSD. Figure 5a shows the cumulative distribution function (CDF) of the queries in the workload, as a function of the fetched page count for each query. We observe from the figure that 73.5% of queries exceed the 3-page threshold (i.e., run faster on TCAM-SSD than on DBx1000). Moreover, the more pages a query fetches, the greater the benefit that TCAM-SSD provides. Therefore, we plot the CDF of latency in Figure 5b, and observe that TCAM-SSD improves the latency for queries that take up 95.8% of the total workload latency.

Figure 5. Cumulative distribution function (CDF) showing which queries are accelerated by TCAM-SSD: (a) CDF of queries; (b) CDF of latency.

TCAM-SSD reduces both CPU–FE and FE–BE data movement, by 92.3% and 77.0%, respectively, compared to DBx1000. This reduction, due to TCAM-SSD’s ability to execute the search directly inside the NAND flash array, contributes to its latency reduction while reducing the energy spent on disk I/O. TCAM-SSD’s overheads are small, requiring only 23 flash blocks (<0.01% of the SSD capacity) to store the search regions, and only 2.5 kB of firmware DRAM to store the link tables. We conclude that TCAM-SSD efficiently enables significant savings for our OLTP workload.
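The capacity overhead can be cross-checked against the back-end geometry in Table 1. The short Python calculation below expresses the overhead as a fraction of blocks, which assumes that all blocks are counted equally regardless of how many bits per cell they store.

```python
# Back-end geometry from Table 1.
CHANNELS = 8
PACKAGES_PER_CHANNEL = 1
DIES_PER_PACKAGE = 8
PLANES_PER_DIE = 2
BLOCKS_PER_PLANE = 2048

total_blocks = (CHANNELS * PACKAGES_PER_CHANNEL * DIES_PER_PACKAGE
                * PLANES_PER_DIE * BLOCKS_PER_PLANE)

SEARCH_REGION_BLOCKS = 23  # reported for the OLTP workload
overhead_percent = 100 * SEARCH_REGION_BLOCKS / total_blocks

print(f"{total_blocks} blocks, search regions use {overhead_percent:.4f}%")
```

The drive has 262,144 blocks in this configuration, so 23 blocks amount to roughly 0.0088% of them, consistent with the reported bound of less than 0.01%.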

5.2. Database Analytics

An online analytical processing (OLAP) workload consists of complex queries that scan entire columns of data from a table and aggregate the scanned information. Unlike an OLTP workload, an OLAP workload does not make efficient use of an index, as the index is designed to avoid a column-wide scan to look up a specific column attribute, while the OLAP workload wants to scan all attributes in the column. As a result, OLAP workloads often incur significant SSD I/O whenever a database table does not fit in memory.

TCAM-SSD can substantially mitigate the overheads of SSD I/O incurred during OLAP scan operations. Recall from Section 5.1 that TCAM-SSD implements an indexed column as a search region, and performs highly-parallelized search across the entire column to locate matching records. This is effectively a fast parallel scan operation, and we can reuse this same format and operation for OLAP scans. Using our example from Figure 4, a simple OLAP query may want to aggregate a list of all last names beginning with the letter H. TCAM-SSD can quickly scan the column and return a list of all records matching the query, generating CPU–FE data movement only for the relevant records (as opposed to having to send all records to the host for a conventional SSD).
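The prefix scan in this example can be illustrated with a small software model of a ternary match. This is only a sketch under stated assumptions: the fixed 8-byte key layout, the `encode`, `ternary_match`, and `search_region` helpers, and the sample names are all hypothetical, and the real comparison happens in parallel inside the NAND flash array rather than in a loop.

```python
from typing import List

KEY_BYTES = 8  # assumed fixed key width, chosen only for this illustration


def encode(name: str) -> bytes:
    """Pad or truncate a name to a fixed-width key (hypothetical layout)."""
    return name.upper().encode("ascii")[:KEY_BYTES].ljust(KEY_BYTES, b"\x00")


def ternary_match(stored: bytes, value: bytes, care: bytes) -> bool:
    """A bit matches if it is masked out (don't care) or the bits agree."""
    return all(((s ^ v) & c) == 0 for s, v, c in zip(stored, value, care))


def search_region(keys: List[bytes], value: bytes, care: bytes) -> List[int]:
    """Return the positions of all matching keys (a software match vector)."""
    return [i for i, k in enumerate(keys) if ternary_match(k, value, care)]


last_names = ["Hall", "Smith", "Higgs", "Jones", "Ho"]
region = [encode(n) for n in last_names]

# Compare only the first byte; every remaining bit is a wildcard.
value = encode("H")
care = b"\xff" + b"\x00" * (KEY_BYTES - 1)

matches = search_region(region, value, care)
print([last_names[i] for i in matches])
```

Only the matching positions would need to be read back and sent to the host, which is where the CPU–FE savings in the text come from.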

Note that in a conventional system, we often use different data organizations for OLTP (row-major table storage) and OLAP (column-major table storage) to reduce I/O traffic. However, with TCAM-SSD, we can use a single data organization for both while reducing I/O traffic even further. Additionally, TCAM-SSD can generate a single search region for fused keys (i.e., a concatenation of two or more columns), further reducing scan latency.

Methodology

We use TPC-H (Transaction Processing Council, 2021), a business analytics workload, and populate the database using dbgen (Transaction Processing Council, 2011). With a scale factor of 100, the resulting database has a size of 115 GB. We evaluate two analytic queries examined in prior work (Gu et al., 2016; Woods et al., 2014), which are modified versions of TPC-H queries that scan one 74 GB table from the database. Query 2 performs additional filter operations compared to Query 1, and we use the fused key optimization in TCAM-SSD to efficiently support these filters.

Results

For the database that we generate, TCAM-SSD on average speeds up the scan operation for Query 1 by 18.3×, and Query 2 by 17.1×, compared to a baseline database scan operation. TCAM-SSD’s improvements are a result of reducing both CPU–FE and FE–BE data movement (see Section 3.7). For Query 1, the baseline must send the entire table’s data to the host CPU, requiring 4.9 M read operations that generate 74 GB of CPU–FE and 74 GB of FE–BE movement. In comparison, TCAM-SSD requires only 4.6 k SRCH chip commands (which generate no CPU–FE movement and 71.5 MB of FE–BE movement), and only 240.0 k read operations for the matching data (3.7 GB of CPU–FE and 3.7 GB of FE–BE movement).[10] Query 2’s additional filter operations increase SRCH command count to 18.3 k. This increases FE–BE data movement to 286.1 MB (as more match vectors need to be transmitted to the firmware), but keeps CPU–FE data movement at only 3.7 GB. To enable these benefits, TCAM-SSD requires only 4578 NAND flash blocks (1.7% of the SSD capacity) and 0.2 MB of DRAM storage for the link table.

[10] While computational SSDs can achieve similar CPU–FE savings for analytics (Lagrange et al., 2020; Samynathan et al., 2019), they would still generate 74 GB of FE–BE movement, saturating internal bandwidth and increasing energy.
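As a quick check of the Query 1 and Query 2 figures above, the arithmetic can be scripted. This sketch uses only the reported totals. It assumes decimal units (1 GB = 1000 MB), and it assumes that Query 2 still moves the same 3.7 GB of matching data over the FE–BE link as Query 1, which the text implies but does not state outright. All variable and function names are illustrative.

```python
MB_PER_GB = 1000.0  # assumption: decimal units

# Baseline scan: the whole 74 GB table crosses both links.
baseline_cpu_fe_mb = 74 * MB_PER_GB
baseline_fe_be_mb = 74 * MB_PER_GB

# Query 1 on TCAM-SSD: match vectors (71.5 MB) plus reads of matching data.
q1_cpu_fe_mb = 3.7 * MB_PER_GB
q1_fe_be_mb = 71.5 + 3.7 * MB_PER_GB

# Query 2 on TCAM-SSD: more SRCH commands, hence more match-vector traffic.
q2_cpu_fe_mb = 3.7 * MB_PER_GB
q2_fe_be_mb = 286.1 + 3.7 * MB_PER_GB  # assumes the same 3.7 GB of reads


def reduction(before_mb: float, after_mb: float) -> float:
    """Factor by which data movement shrinks."""
    return before_mb / after_mb


# Average match-vector traffic per SRCH command, in kB.
q1_kb_per_srch = 71.5 * 1000 / 4600
q2_kb_per_srch = 286.1 * 1000 / 18300

print(f"Q1 CPU-FE reduction: {reduction(baseline_cpu_fe_mb, q1_cpu_fe_mb):.1f}x")
print(f"Q1 FE-BE reduction:  {reduction(baseline_fe_be_mb, q1_fe_be_mb):.1f}x")
print(f"Q2 FE-BE reduction:  {reduction(baseline_fe_be_mb, q2_fe_be_mb):.1f}x")
print(f"kB per SRCH: Q1 {q1_kb_per_srch:.1f}, Q2 {q2_kb_per_srch:.1f}")
```

The per-command traffic comes out nearly identical for both queries, which is consistent with each SRCH command returning a match vector of fixed size.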

Because an OLAP query typically returns many records, the location of each matching record has a significant impact on performance. For example, if a query matches 10 records, these records could all be in the same NAND flash page, or they could each reside in different pages. To explore the impacts of different database layouts, we analytically sweep two parameters for both of our queries: selectivity is the fraction of records in the database that match a query, and locality captures how likely the matching records are to share a page. For example, a locality of 0% means that we need an SSD read operation for every matching record, while a locality of 100% means that we assume all matching records are stored back-to-back and require the minimal number of SSD reads. By evaluating combinations of locality and selectivity, we aim to remain DBMS-agnostic, and demonstrate how TCAM-SSD can affect the performance of different data layouts.
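The two sweep parameters can be turned into a rough read-count estimate. The sketch below is not the model used in the evaluation: the two endpoints (0% and 100% locality) follow the definitions above, but the linear interpolation between them, the `ssd_reads` function, and the `records_per_page` parameter are assumptions made only for illustration.

```python
import math


def ssd_reads(num_records: int, selectivity: float, locality: float,
              records_per_page: int) -> int:
    """Estimate SSD reads needed to fetch all records that match a query.

    selectivity: fraction of records that match (0.0 to 1.0).
    locality:    0.0 means one read per match, 1.0 means matches are packed
                 back-to-back; values in between are linearly interpolated
                 (an assumption of this sketch).
    """
    matches = round(num_records * selectivity)
    if matches == 0:
        return 0
    worst = matches                               # every match on its own page
    best = math.ceil(matches / records_per_page)  # matches stored contiguously
    return round(worst - locality * (worst - best))


# Hypothetical table: one million records, 100 records per page,
# and a selectivity of 0.04%.
for locality in (0.0, 0.5, 1.0):
    print(locality, ssd_reads(1_000_000, 0.0004, locality, 100))
```

With these hypothetical numbers, the read count drops from one read per matching record at 0% locality to a handful of page reads at 100% locality.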

Figure 6 shows the impact of selectivity and locality on performance. We make three observations from the figure. First, TCAM-SSD’s speedups over the baseline range from 0.74× (1% selectivity, 0% locality) to 1637.0× (0.01% selectivity, 100% locality), with an average speedup of 113.5× across the sweep. Second, TCAM-SSD’s benefits increase as the selectivity decreases. Third, TCAM-SSD’s benefits are smaller for Query 2 than for Query 1. This is because Query 2 contains additional conditions for the scan. However, TCAM-SSD still preserves much of the speedup by taking advantage of its fused key optimization to reduce the impact of these additional conditions. Note that for our synthesized database, the selectivity and locality for both queries are 0.04% and 0.0%, respectively. We conclude that TCAM-SSD can substantially improve the performance of analytics, and that its benefits increase with match sparsity and with worsening locality (two common effects as the dataset size increases).

Figure 6. Speedup for analytical queries with TCAM-SSD, normalized to scan using a conventional SSD: (a) Query 1 (y-axis is log scale); (b) Query 2 (y-axis is log scale).

6. Graph Processing Using TCAM-SSD

Graph processing is employed today across a wide variety of domains, ranging from social media networks (Yang and Leskovec, 2011), to roads and geographical data (Leskovec et al., 2009), to the connectivity of the Internet (Web Data Commons, 2014). A graph processing framework (Pearce et al., 2010; Zheng et al., 2015) typically initiates analytics by preprocessing the graph from the dataset’s generic input format (e.g., an edge array (Roy et al., 2013; Gonzalez et al., 2012)) into a framework-specific format (along with applying other optimizations, e.g., (Maass et al., 2017; Zhu et al., 2015; Vora et al., 2016; Zhang et al., 2015)). For large networks, graph frameworks use metadata structures such as in-memory indexes that allow the system to quickly locate the edges of interest (which are stored in the SSD).

While many different metadata structures exist for the index (Zhu et al., 2015), we focus on adjacency lists, which form the basis of many graph analytics data structures. For each vertex, the adjacency list stores a count of the number of outgoing edges, and a pointer to the first edge belonging to the vertex in the edge list. For the sake of simplicity, we describe how the index can be created for out edges; however, the process can be repeated for in edges. Figure 7a shows an example adjacency list for eight vertices.
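A conventional adjacency-list index of this kind can be sketched as follows. The eight-vertex graph is made up for illustration (it is not the graph in Figure 7a), and the function names are hypothetical. The index stores, per vertex, the out-edge count and the offset of the vertex's first edge in a source-sorted edge list.

```python
from typing import List, Tuple

Edge = Tuple[int, int]  # (src, dst)


def build_adjacency_index(num_vertices: int,
                          edges: List[Edge]) -> Tuple[List[Tuple[int, int]], List[Edge]]:
    """Return (index, edge_list), where index[v] = (out_degree, first_edge_offset)."""
    edge_list = sorted(edges)  # groups edges by source vertex
    counts = [0] * num_vertices
    for src, _ in edge_list:
        counts[src] += 1

    index = []
    offset = 0
    for count in counts:
        index.append((count, offset))
        offset += count
    return index, edge_list


def out_edges(index: List[Tuple[int, int]], edge_list: List[Edge], v: int) -> List[Edge]:
    """Two-step retrieval: look up the index entry, then slice the edge list."""
    count, first = index[v]
    return edge_list[first:first + count]


# Hypothetical eight-vertex graph.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (4, 0), (4, 1),
         (4, 5), (4, 6), (4, 7), (6, 7)]
index, edge_list = build_adjacency_index(8, edges)
print(index[4], out_edges(index, edge_list, 4))
```

Note that the index needs one entry per vertex regardless of degree, which is why it grows with the graph.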

Figure 7. Comparison of graph index structures: (a) conventional adjacency list; (b) TCAM-SSD index.

Notably, as graph datasets grow in size, the size of the index grows, requiring additional memory to store both the index and edges. Once the index exceeds the capacity of memory, systems that process large graphs experience severe performance degradation (Jun et al., 2018), requiring the system to migrate both index and graph data between memory and SSD. TCAM-SSD can optimize SSD I/O for large graphs by eliminating the conventional adjacency list and reducing the multi-step edge retrieval process into a single Search command.

A naive TCAM-SSD graph format could simply forgo an in-memory index, and instead perform bulk parallel search directly on the edge array to find corresponding edges based on either source or destination vertices. This, however, is highly inefficient, because while TCAM-SSD’s Search operation can search millions of edge entries simultaneously (accounting for maximum parallelism), large graphs can contain billions or even trillions of edges. This would require TCAM-SSD to perform thousands of back-to-back Search operations, and could stall unrelated I/O operations by tying up all of the SSD’s channels. Instead, TCAM-SSD uses a compressed host-side in-memory structure to reduce metadata overheads.

Compressing the Index

Graphs often follow a power law distribution (Page et al., 1999; Brin and Page, 1998; Zheng et al., 2015; Chen et al., 2010), where few vertices have a large degree (i.e., count of connected edges), while many vertices have a single-digit degree. If we were to allocate a dedicated search region to each vertex, most of the regions would be highly underutilized. For the common case where the out-degree of the vertex is less than the number of bitlines in the block, a Search operation will still check all of the empty bitlines in the block, wasting both energy and area. We propose two optimizations to reduce the underutilization, resulting in the compressed index shown in Figure 7b.

First, multiple vertices with a small out-degree, and with consecutive IDs, can be compressed into a single search region. Our index stores a single entry for the search region, and stores the highest vertex ID in the Max ID column, along with a search region pointer. To access the correct search region for a vertex, the graph framework performs a binary search over the sorted Max ID field. For example, in Figure 7b, vertices 0–3 are compressed into a single search region, and vertices 5–7 are compressed into a second search region. If the framework wants to locate the region for vertex 6, it uses the entry with a Max ID of 7, since 6 falls between 4 and 7.

Second, certain vertices may contain substantially more edges than the number of bitlines per block. A TCAM-based search for such a vertex could tie up multiple levels of internal SSD parallelism, and may return matches to most of its edges, resulting in little to no reduction in read count. For such vertices, we do not store their edges in a search region. Instead, we store an edge list in the data region, and keep a direct pointer to the edge list in the index (along with an edge count). We set the edge count to 0 for search region entries, to distinguish it from an edge list entry.[11] Figure 7b shows how vertex 4 has a direct pointer to its edge list in our index.

[11] This structure is inspired by prior work on sparse data structures (Vuduc, 2003), including those of graph analytics, and may be applicable for other applications (e.g., sparse linear algebra).

TCAM-SSD constructs the compressed index during preprocessing, by performing a count sort (similar to the initial step of many graph preprocessing algorithms (Malicevic et al., 2017; Roy et al., 2015)). Figure 8 shows the reduction in memory footprint across a variety of synthetic and real-world graphs (Table 2), for two TCAM-SSD configurations: (1) TCAM (NP), a basic version of our index without the large vertex optimization; and (2) TCAM-256, which uses a data region pointer for vertices with more than 256 edges. On average, TCAM-SSD reduces in-memory overheads by 47.5%, compared to a baseline index with a 4 B pointer and 4 B of metadata (e.g., vertex weight) per entry.
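The two optimizations can be combined in a single preprocessing pass, sketched below. This is a simplified illustration rather than the actual implementation: `IndexEntry`, `build_compressed_index`, the greedy packing policy, and the small `region_capacity` and `large_threshold` values are assumptions, and the example graph is made up so that the result has the same shape as the index described for Figure 7b.

```python
from typing import List, NamedTuple, Tuple


class IndexEntry(NamedTuple):
    max_id: int      # highest vertex ID covered by this entry
    pointer: int     # search-region ID, or offset of the edge list
    edge_count: int  # 0 marks a search-region entry


def build_compressed_index(num_vertices: int, edges: List[Tuple[int, int]],
                           region_capacity: int,
                           large_threshold: int = 256) -> List[IndexEntry]:
    """Greedily pack consecutive low-degree vertices into search regions."""
    assert large_threshold <= region_capacity

    # Counting pass: out-degree of each vertex (first step of a count sort).
    degree = [0] * num_vertices
    for src, _ in edges:
        degree[src] += 1

    index: List[IndexEntry] = []
    next_region = 0      # next search-region ID to hand out
    edge_offset = 0      # running offset into the source-sorted edge list
    used = 0             # edges already placed in the open region
    open_region = False  # does the open region cover any vertex yet?

    for v in range(num_vertices):
        d = degree[v]
        if d > large_threshold:
            # High-degree vertex: close the open region, then point
            # directly at the vertex's edge list in the data region.
            if open_region:
                index.append(IndexEntry(v - 1, next_region, 0))
                next_region += 1
                used, open_region = 0, False
            index.append(IndexEntry(v, edge_offset, d))
        else:
            # Low-degree vertex: start a new region if this one would overflow.
            if open_region and used + d > region_capacity:
                index.append(IndexEntry(v - 1, next_region, 0))
                next_region += 1
                used = 0
            used += d
            open_region = True
        edge_offset += d

    if open_region:
        index.append(IndexEntry(num_vertices - 1, next_region, 0))
    return index


# Hypothetical graph: vertex 4 has six out-edges, all others have one or two.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4),
         (4, 0), (4, 1), (4, 2), (4, 3), (4, 5), (4, 6),
         (5, 6), (6, 7), (6, 5), (7, 4)]
index = build_compressed_index(8, edges, region_capacity=8, large_threshold=4)
for entry in index:
    print(entry)
```

With these toy parameters, eight per-vertex entries collapse into three: one region for vertices 0–3, a direct edge-list pointer for vertex 4, and one region for vertices 5–7.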

Figure 8. Normalized graph index overhead.

Table 2. Evaluated graphs.

Graph                                   Nodes     Edges
Patents (Leskovec et al., 2005)         3.7M      16.5M
Road-CA (Leskovec et al., 2009)         1.9M      2.7M
Road-PA (Leskovec et al., 2009)         1.1M      1.5M
Road-TX (Leskovec et al., 2009)         1.3M      1.9M
Twitter (Yang and Leskovec, 2011)       17M       1.5B
Orkut (Yang and Leskovec, 2012)         3M        117M
Youtube (Yang and Leskovec, 2012)       1.1M      3M
LiveJournal (Backstrom et al., 2006)    4.8M      69M
Kron25                                  33.5M     1B
Mag240 (Wang et al., 2020)              121.7M    1.3B

Formulating Graph Analytics as a Search Problem

To retrieve edge data for a vertex from TCAM-SSD, the application starts by searching for the corresponding in-memory index entry. For example, to retrieve data for vertex 2 in Figure 7b, TCAM-SSD performs a binary search of the index structure in the following order: Max ID = 4, Max ID = 3 (which is the correct entry). Once the correct entry is found, the system then issues either a search or read based on the entry’s edge list count. If the count is non-zero, TCAM-SSD reads the edge list. If the count is 0, TCAM-SSD searches a search region. Within a search region, the bitline holds a (src, dst) tuple. When vertex 2 is used as src, TCAM-SSD finds all matches in the search region, and accesses the corresponding data region to read out and return the out-edges.
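The lookup procedure above can be sketched in a few lines. The three-entry index mirrors the structure described for Figure 7b (Max IDs 3, 4, and 7), but the pointer values, the edge count for vertex 4, and the names `IndexEntry`, `find_entry`, and `plan_access` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class IndexEntry(NamedTuple):
    max_id: int      # highest vertex ID covered by this entry
    pointer: int     # search-region ID, or offset of the edge list
    edge_count: int  # 0 marks a search-region entry


# Pointer and count values are placeholders for illustration.
INDEX = [IndexEntry(3, 0, 0), IndexEntry(4, 6, 6), IndexEntry(7, 1, 0)]


def find_entry(index: List[IndexEntry], vertex: int) -> Tuple[IndexEntry, List[int]]:
    """Binary search for the first entry whose Max ID is >= vertex.

    Also returns the Max IDs probed, in order, to show the search path.
    """
    lo, hi = 0, len(index)
    probes = []
    while lo < hi:
        mid = (lo + hi) // 2
        probes.append(index[mid].max_id)
        if index[mid].max_id >= vertex:
            hi = mid
        else:
            lo = mid + 1
    if lo == len(index):
        raise KeyError(vertex)
    return index[lo], probes


def plan_access(index: List[IndexEntry], vertex: int) -> Tuple[str, int, int]:
    """Decide between reading an edge list and searching a search region."""
    entry, _ = find_entry(index, vertex)
    if entry.edge_count > 0:
        return ("read", entry.pointer, entry.edge_count)
    # Search the region for bitlines whose src field equals the vertex.
    return ("search", entry.pointer, vertex)


entry, probes = find_entry(INDEX, 2)
print(probes, entry)
print(plan_access(INDEX, 2))
print(plan_access(INDEX, 4))
```

For vertex 2, the probe order is Max ID 4 followed by Max ID 3, matching the walk-through in the text.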

Methodology

To evaluate TCAM-SSD, we use a vertex access traversal trace for the SSSP algorithm. We examine four different cases: (1) in-memory index (IM), where the LBA for each edge list is stored in memory; (2) out of memory (OOM), where both the edge list and index are on disk; (3) TCAM (NP); and (4) TCAM-256. For TCAM (NP) and TCAM-256, we assume that every index access is a DRAM row miss. We conservatively assume 0% locality, and explore the interplay between search and data pointers in the index.

Results

Figure 9 shows the execution time for SSSP vertex traversals on our four configurations, normalized to IM. We make four observations from the figure. First, OOM incurs a 99% overhead over IM, averaged across all networks, indicating the storage access cost of large graphs in conventional systems. Second, TCAM (NP) performs 10.2% better than OOM on average, as it avoids data movement costs between the CPU and the SSD. Third, while TCAM (NP) improves performance for most datasets, we do see performance degradation for Kron25. We determine that this is due to the overheads incurred from moving and decoding the match vector for vertices with a high degree. Fourth, because TCAM-256 uses our optimized data structure, it further improves performance for large graphs that contain high-degree vertices, with average improvements of 14.5% over OOM and 4.3% over TCAM (NP). For Kron25, TCAM-256 results in a 24.2% speedup compared to TCAM (NP), showing that our optimization overcomes the performance overheads of TCAM-SSD for high-degree vertices.

Figure 9. Execution time for SSSP.

For the synthetic Kron25 graph, the search region uses 8200 blocks (3.1% of SSD capacity) and the link table overhead is 66 MB. For the real-world Twitter graph, the search region overhead is 3.8% with a link table overhead of 50.9 MB. Although Twitter contains more edges than Kron25 (Table 2), Kron25 has more high-degree vertices, necessitating more link table entries.

7. Related Work

To our knowledge, TCAM-SSD is the first end-to-end framework for in-NAND-flash associative searching. We discuss closely-related works below.

In-Storage Computation.

Parabit (Gao et al., 2021), Flash-Cosmos (Park et al., 2022), and GP3D (Shim and Yu, 2022) perform processing using NAND flash memory. TCAM-SSD avoids shortcomings of these works by enabling general-purpose associative computing instead of bulk bitwise or domain-specific solutions. Computational SSDs (i.e., smart SSDs) exploit the internal microcontroller (R. Cheerla, 2019; Kwak et al., 2020; Do et al., 2013; Kang et al., 2013; Koo et al., 2017; Samsung Electronics Co., Ltd., 2022) or nearby FPGAs (CRZ Technology, 2019; Salamat et al., 2021; Advanced Micro Devices, Inc., 2021) to perform processing-near-memory. Other approaches (Mailthody et al., 2019; Bae et al., 2013; Cho et al., 2013; Jun et al., 2015; Kim et al., 2011; Samsung Electronics Co., Ltd., 2022) place additional hardware throughout the SSD hierarchy to enable in-storage computation. Some in-storage computing platforms target query processing (Park et al., 2021; Seshadri et al., 2014; Xu et al., 2020; Hu et al., 2022) or key–value interfaces (** et al., 2017; Xu et al., 2016; Im et al., 2020). TCAM-SSD is orthogonal to these computational SSD solutions, and can benefit from offloading certain operations to embedded compute elements.

SmartSSDs are an emerging technology that attempts to exploit the increasingly powerful microcontroller on SSDs (R. Cheerla, 2019; Kwak et al., 2020; Do et al., 2013; Kang et al., 2013; Koo et al., 2017). These systems are sometimes augmented with field-programmable gate arrays (FPGAs) to further enhance their compute capabilities (CRZ Technology, 2019; Salamat et al., 2021). Similarly, other techniques have been proposed to execute query processing on the SSD (Park et al., 2021; Seshadri et al., 2014; Xu et al., 2020). Other approaches (Mailthody et al., 2019; Bae et al., 2013; Cho et al., 2013; Jun et al., 2015; Kim et al., 2011) have explored placing additional hardware throughout the SSD hierarchy to enable in-storage computation. Another class of systems (** et al., 2017; Xu et al., 2016; Im et al., 2020) integrates key–value interfaces onto the SSD to further accelerate data retrieval. TCAM-SSD is compatible with many of the above systems; notably, TCAM-SSD can benefit from offloading portions of the computation onto the SSD microcontroller.

Key–Value Stores.

Key–value stores (KVS) (DeCandia et al., 2007; Debnath et al., 2010; Atikoglu et al., 2012) are a software-defined construct that maps a predefined key to a cluster of data. Using an input key, a KVS can leverage the associative search operations of TCAMs to retrieve corresponding clusters of data in O(1) time. Although associative memories can be used to implement a key–value store, a KVS does not require explicit use of TCAMs (** et al., 2017; Kang et al., 2019; Kaiyrakhmet et al., 2019).

Associative Computing and Content-Addressable Parallel Processors.

Associative computing (i.e., associative processing) (Potter, 2012) broadly refers to a computing paradigm in which associative operations are used to locate and compute over data in parallel. Associative computing has been explored in a variety of contexts (Foster, 1976; Potter et al., 1994; Sayre, 1976; Pagiamtzis and Sheikholeslami, 2006; Yavits et al., 2021), and often (but not always) relies on associative memory structures (Foster, 1976; Guo et al., 2013; Caminal et al., 2021, 2022). Two such examples are CAPE (Caminal et al., 2021), which accelerates associative computing in SRAM, and Castle (Caminal et al., 2022), which extends upon CAPE to accelerate databases. Other associative computing frameworks (Zha and Li, 2020; Guo et al., 2011, 2013) employ emerging NVM technologies, limiting their near-term applicability. In contrast, TCAM-SSD implements associative computing using conventional NAND flash memory, which provides high storage density in a mature technology.

There are multiple approaches to and definitions of associative computing. One example framework, ASC (Potter et al., 1994), provides a mechanism to convert conventional RAM-based algorithms into associative computing algorithms by organizing data into two-dimensional tables. To perform computation, bulk associative search operations are then used to select the corresponding entries and retrieve the matching data element for additional computation. A second example is the content-addressable parallel processor (CAPP) (Foster, 1976), which must meet three formal requirements: (1) it stores data in a vector format, (2) it can compare a key against all vector elements in parallel, and (3) it can update all matching elements in bulk with a new value. Although the TCAM-SSD framework broadly fits into the associative computing paradigm, as it relies on associative search operations to find and retrieve data, it is not a CAPP, as it does not natively support updating all matching elements in parallel. However, we describe how TCAM-SSD can be extended to support this using its associative update mode (Section 3.5).

Database Acceleration.

Recent works propose to use emerging NVM technologies for analog in situ SQL-style operations. Sun et al. propose a storage scheme for in-memory databases to map tuples onto crossbars (Sun et al., 2017; Wang et al., 2018), while others (Li et al., 2020) utilize ReRAM-based content addressable memories (ReCAMs) to provide in situ support for fundamental SQL database operations, including sort, join, and selection. Other recent work has explored the use of Optane (Hady et al., 2017; Wu et al., 2019; Shanbhag et al., 2020) and CXL (Ahn et al., 2022) to extend the memory capacity available to the DBMS, while others seek to mitigate the overheads of pointer chasing during data retrieval (Kocberber et al., 2013). FPGAs have also shown promise in accelerating database workloads (Kim et al., 2019).

Graph Workload Acceleration.

Graph analytics has been widely explored through a variety of hardware and software frameworks (Maass et al., 2017; Liu and Huang, 2017). PREGEL (Malewicz et al., 2010) introduces the “think like a vertex” framework, enabling a simple programming model for distributed graph processing. GraphChi (Kyrola et al., 2012) and X-Stream (Roy et al., 2013) demonstrate the potential for graph processing without assumptions of memory size, by effectively utilizing secondary storage. Ligra (Shun and Blelloch, 2013) proposes a high-performance shared memory machine framework that assumes the graph fits in the memory system. Recent works propose SSD optimizations for graph analytics (Zheng et al., 2015; Jun et al., 2018; Suzuki et al., 2021; Matam et al., 2019) or dedicated accelerators for graph processing (Ham et al., 2016; Dadu et al., 2021; Ahn et al., 2015).

8. Conclusion

We present TCAM-SSD, a framework for in-SSD associative search using NAND flash memory. With modest modifications to the NAND flash chips inside commodity solid-state drives (and with no modifications to the NAND flash array), we can enable highly-parallel ternary search operations. Our framework includes a hardware primitive, firmware modifications, and a user interface, along with hardware optimizations and application-specific optimizations. We show that for three use cases, TCAM-SSD can provide notable performance and data movement improvements for large dataset processing.

Acknowledgements.
This article has been authored by an employee of National Technology & Engineering Solutions of Sandia, LLC under Contract No. DE-NA0003525 with the U.S. Department of Energy (DOE). The employee owns all right, title and interest in and to the article and is solely responsible for its contents. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this article or allow others to do so, for United States Government purposes. The DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan https://www.energy.gov/downloads/doe-public-access-plan. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.

References

  • Advanced Micro Devices, Inc. (2021) Advanced Micro Devices, Inc. 2021. Samsung SmartSSD. https://www.xilinx.com/applications/data-center/computational-storage/smartssd.html.
  • Aga et al. (2017) S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, and R. Das. 2017. Compute Caches. In HPCA.
  • Agrawal and Srikant (1994) R. Agrawal and R. Srikant. 1994. Fast Algorithms for Mining Association Rules. In VLDB.
  • Ahn et al. (2015) J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi. 2015. A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing. In ISCA.
  • Ahn et al. (2022) M. Ahn, A. Chang, D. Lee, J. Gim, J. Kim, J. Jung, O. Rebholz, V. Pham, K. Malladi, and Y. S. Ki. 2022. Enabling CXL Memory Expansion for In-Memory Database Management Systems. In DaMoN.
  • Aho and Ullman (1992) A. V. Aho and J. D. Ullman. 1992. Foundations of Computer Science. Computer Science Press, Inc.
  • Alibart et al. (2011) F. Alibart, T. Sherwood, and D. B. Strukov. 2011. Hybrid CMOS/Nanodevice Circuits for High Throughput Pattern Matching Applications. In AHS.
  • Angizi et al. (2018a) S. Angizi, Z. He, and D. Fan. 2018a. PIMA-Logic: A Novel Processing-in-Memory Architecture for Highly Flexible and Energy-Efficient Logic Computation. In DAC.
  • Angizi et al. (2018b) S. Angizi, Z. He, A. S. Rakin, and D. Fan. 2018b. CMP-PIM: An Energy-Efficient Comparator-Based Processing-in-Memory Neural Network Accelerator. In DAC.
  • Angizi et al. (2019) S. Angizi, J. Sun, W. Zhang, and D. Fan. 2019. AlignS: A Processing-in-Memory Accelerator for DNA Short Read Alignment Leveraging SOT-MRAM. In DAC.
  • Ankit et al. (2019) A. Ankit, I. El Hajj, S. R. Chalamalasetti, G. Ndu, M. Foltin, R. S. Williams, P. Faraboschi, W. m. Hwu, J. P. Strachan, K. Roy, and D. S. Milojicic. 2019. PUMA: A Programmable Ultra-Efficient Memristor-Based Accelerator for Machine Learning Inference. In ASPLOS.
  • Arsovski et al. (2003) I. Arsovski, T. Chandler, and A. Sheikholeslami. 2003. A Ternary Content-Addressable Memory (TCAM) Based on 4T Static Storage and Including a Current-Race Sensing Scheme. JSSC (Jan. 2003).
  • Atikoglu et al. (2012) B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny. 2012. Workload Analysis of a Large-Scale Key-Value Store. (2012).
  • B. Marr (2018) B. Marr. 2018. How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read. https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/?sh=667f02260ba9.
  • Backstrom et al. (2006) L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. 2006. Group Formation in Large Social Networks: Membership, Growth, and Evolution. In KDD.
  • Bae et al. (2013) D.-H. Bae, J.-H. Kim, S.-W. Kim, H. Oh, and C. Park. 2013. Intelligent SSD: A Turbo for Big Data Mining. In CIKM.
  • Batcher (1974) K. E. Batcher. 1974. STARAN Parallel Processor System Hardware. In AFIPS.
  • Brin and Page (1998) S. Brin and L. Page. 1998. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems (Apr. 1998).
  • Cai et al. (2017a) Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu. 2017a. Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives. Proc. IEEE (Sep. 2017).
  • Cai et al. (2017b) Y. Cai, S. Ghose, Y. Luo, K. Mai, O. Mutlu, and E. F. Haratsch. 2017b. Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques. In HPCA.
  • Cai et al. (2012) Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai. 2012. Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis. In DATE.
  • Cai et al. (2013) Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai. 2013. Threshold Voltage Distribution in MLC NAND Flash Memory: Characterization, Analysis, and Modeling. In DATE.
  • Cai et al. (2015a) Y. Cai, Y. Luo, S. Ghose, and O. Mutlu. 2015a. Read Disturb Errors in MLC NAND Flash Memory: Characterization, Mitigation, and Recovery. In DSN.
  • Cai et al. (2015b) Y. Cai, Y. Luo, E. F. Haratsch, K. Mai, and O. Mutlu. 2015b. Data Retention in MLC NAND Flash Memory: Characterization, Optimization, and Recovery. In HPCA.
  • Cai et al. (2014) Y. Cai, G. Yalcin, O. Mutlu, E. F. Haratsch, O. Unsal, A. Cristal, and K. Mai. 2014. Neighbor-Cell Assisted Error Correction for MLC NAND Flash Memories. In SIGMETRICS.
  • Caminal et al. (2022) H. Caminal, Y. Chronis, T. Wu, J. M. Patel, and J. F. Martínez. 2022. Accelerating Database Analytic Query Workloads Using an Associative Processor. In ISCA.
  • Caminal et al. (2021) H. Caminal, K. Yang, S. Srinivasa, A. K. Ramanathan, K. Al-Hawaj, T. Wu, V. Narayanan, C. Batten, and J. F. Martínez. 2021. CAPE: A Content-Addressable Processing Engine. In HPCA.
  • Chang et al. (2008) Y.-K. Chang, M.-L. Tsai, and C.-C. Su. 2008. Improved TCAM-Based Pre-Filtering for Network Intrusion Detection Systems. In AINA.
  • Chen et al. (2010) R. Chen, X. Weng, B. He, M. Yang, B. Choi, and X. Li. 2010. On the Efficiency and Programmability of Large Graph Processing in the Cloud. Technical Report MSR-TR-2010-44. Microsoft Research.
  • Cho et al. (2013) B. Y. Cho, W. S. Jeong, D. Oh, and W. W. Ro. 2013. XSD: Accelerating MapReduce by Harnessing the GPU Inside an SSD. In DIMES.
  • Chou et al. (2019) T. Chou, W. Tang, J. Botimer, and Z. Zhang. 2019. CASCADE: Connecting RRAMs to Extend Analog Dataflow in an End-to-End In-Memory Processing Paradigm. In MICRO.
  • Chowdhery et al. (2023) A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, and N. Fiedel. 2023. PaLM: Scaling Language Modeling With Pathways. JMLR (Aug. 2023).
  • CRZ Technology (2019) CRZ Technology. 2019. Daisy OpenSSD. http://www.mangoboard.com/main/view.asp?idx=1061&pageNo=1&cate1=9&cate2=150&cate3=181
  • Dadu et al. (2021) V. Dadu, S. Liu, and T. Nowatzki. 2021. PolyGraph: Exposing the Value of Flexibility for Graph Processing Accelerators. In ISCA.
  • Dally (2015) B. Dally. 2015. Challenges for Future Computing Systems. Keynote talk at HiPEAC.
  • Debnath et al. (2010) B. Debnath, S. Sengupta, and J. Li. 2010. FlashStore: High Throughput Persistent Key-Value Store. Proc. VLDB Endow. (Sep. 2010).
  • DeBrabant et al. (2013) J. DeBrabant, A. Pavlo, S. Tu, M. Stonebraker, and S. Zdonik. 2013. Anti-Caching: A New Approach to Database Management System Architecture. Proc. VLDB Endow. (Sep. 2013).
  • DeCandia et al. (2007) G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. 2007. Dynamo: Amazon’s Highly Available Key-Value Store. OSR (Oct. 2007).
  • Derharcobian and Murphy (2010) N. Derharcobian and C. N. Murphy. 2010. Phase-Change Memory (PCM) Based Universal Content-Addressable Memory (CAM) Configured as Binary/Ternary CAM. U.S. Patent No. 7,675,765 B2.
  • Devaux (2019) F. Devaux. 2019. The True Processing in Memory Accelerator. In Hot Chips.
  • Diaconu et al. (2013) C. Diaconu, C. Freedman, E. Ismert, P.-Å. Larson, P. Mittal, R. Stonecipher, N. Verma, and M. Zwilling. 2013. Hekaton: SQL Server’s Memory-Optimized OLTP Engine. In SIGMOD.
  • Do et al. (2013) J. Do, Y.-S. Kee, J. M. Patel, C. Park, K. Park, and D. J. DeWitt. 2013. Query Processing on Smart SSDs: Opportunities and Challenges. In SIGMOD.
  • Domo, Inc. (2014) Domo, Inc. 2014. Data Never Sleeps 2.0. https://www.domo.com/learn/infographic/data-never-sleeps-2.
  • Domo, Inc. (2022) Domo, Inc. 2022. Data Never Sleeps 10.0. https://www.domo.com/learn/infographic/data-never-sleeps-10.
  • Domo, Inc. (2023) Domo, Inc. 2023. Data Never Sleeps 11.0. https://www.domo.com/learn/infographic/data-never-sleeps-11.
  • Eckert et al. (2018) C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. M. Sylvester, D. T. Blaauw, and R. Das. 2018. Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks. In ISCA.
  • Eldawy et al. (2014) A. Eldawy, J. Levandoski, and P.-Å. Larson. 2014. Trekking Through Siberia: Managing Cold Data in a Memory-Optimized Database. Proc. VLDB Endow. (Jul. 2014).
  • Eshraghian et al. (2010) K. Eshraghian, K.-R. Cho, O. Kavehei, S.-K. Kang, D. Abbott, and S.-M. S. Kang. 2010. Memristor MOS Content Addressable Memory (MCAM): Hybrid Architecture for Future High Performance Search Engines. TVLSI (Aug. 2010).
  • Foster (1976) C. C. Foster. 1976. Content Addressable Parallel Processors. John Wiley & Sons, Inc.
  • Fujiki et al. (2019) D. Fujiki, S. Mahlke, and R. Das. 2019. Duality Cache for Data Parallel Acceleration. In ISCA.
  • Gaillardon et al. (2016) P. Gaillardon, L. Amaru, A. Siemon, E. Linn, R. Waser, A. Chattopadhyay, and G. D. Micheli. 2016. The Programmable Logic-in-Memory (PLiM) Computer. In DATE.
  • Gao et al. (2021) C. Gao, X. Xin, Y. Lu, Y. Zhang, J. Yang, and J. Shu. 2021. ParaBit: Processing Parallel Bitwise Operations in NAND Flash Memory Based SSDs. In MICRO.
  • Ghose et al. (2019) S. Ghose, A. Boroumand, J. S. Kim, J. Gómez-Luna, and O. Mutlu. 2019. Processing-in-Memory: A Workload-Driven Perspective. IBM JRD (Nov.–Dec. 2019).
  • Goel and Gupta (2010) A. Goel and P. Gupta. 2010. Small Subset Queries and Bloom Filters Using Ternary Associative Memories, With Applications. In SIGMETRICS.
  • Gonzalez et al. (2012) J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. 2012. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In OSDI.
  • Graves et al. (2019) C. E. Graves, C. Li, X. Sheng, W. Ma, S. R. Chalamalasetti, D. Miller, J. S. Ignowski, B. Buchanan, L. Zheng, S.-T. Lam, X. Li, L. Kiyama, M. Foltin, M. P. Hardy, and J. P. Strachan. 2019. Memristor TCAMs Accelerate Regular Expression Matching for Network Intrusion Detection. TNANO (Aug. 2019).
  • Graves et al. (2020) C. E. Graves, C. Li, X. Sheng, D. Miller, J. Ignowski, L. Kiyama, and J. P. Strachan. 2020. In-Memory Computing With Memristor Content Addressable Memories for Pattern Matching. Adv. Mat. (Aug. 2020).
  • Gu et al. (2016) B. Gu, A. S. Yoon, D.-H. Bae, I. Jo, J. Lee, J. Yoon, J.-U. Kang, M. Kwon, C. Yoon, S. Cho, J. Jeong, and D. Chang. 2016. Biscuit: A Framework for Near-Data Processing of Big Data Workloads. In ISCA.
  • Guo et al. (2011) Q. Guo, X. Guo, Y. Bai, and E. İpek. 2011. A Resistive TCAM Accelerator for Data-Intensive Computing. In MICRO.
  • Guo et al. (2013) Q. Guo, X. Guo, R. Patel, E. Ipek, and E. G. Friedman. 2013. AC-DIMM: Associative Computing With STT-MRAM. In ISCA.
  • Gupta et al. (2018) S. Gupta, M. Imani, and T. Rosing. 2018. FELIX: Fast and Energy-Efficient Logic in Memory. In ICCAD.
  • Hady et al. (2017) F. T. Hady, A. Foong, B. Veal, and D. Williams. 2017. Platform Storage Performance With 3D XPoint Technology. Proc. IEEE (Sep. 2017).
  • Hajinazar et al. (2021) N. Hajinazar, G. F. Oliveira, S. Gregorio, J. D. Ferreira, N. M. Ghiasi, M. Patel, M. Alser, S. Ghose, J. Gómez-Luna, and O. Mutlu. 2021. SIMDRAM: A Framework for Bit-Serial SIMD Processing Using DRAM. In ASPLOS.
  • Ham et al. (2016) T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi. 2016. Graphicionado: A High-Performance and Energy-Efficient Accelerator for Graph Analytics. In MICRO.
  • Hamdioui et al. (2017) S. Hamdioui, S. Kvatinsky, G. Cauwenberghs, L. Xie, N. Wald, S. Joshi, H. M. Elsayed, H. Corporaal, and K. Bertels. 2017. Memristor for Computing: Myth or Reality? In DATE.
  • He et al. (2020) M. He, C. Song, I. Kim, C. Jeong, S. Kim, I. Park, M. Thottethodi, and T. N. Vijaykumar. 2020. Newton: A DRAM-Maker’s Accelerator-in-Memory (AiM) Architecture for Machine Learning. In ISCA.
  • Hu et al. (2022) H.-W. Hu, W.-C. Wang, Y.-H. Chang, Y.-C. Lee, B.-R. Lin, H.-M. Wang, Y.-P. Lin, Y.-M. Huang, C.-Y. Lee, T.-H. Su, C.-C. Hsieh, C.-M. Hu, Y.-T. Lai, C.-K. Chen, H.-S. Chen, H.-P. Li, T.-W. Kuo, M.-F. Chang, K.-C. Wang, C.-H. Hung, and C.-Y. Lu. 2022. ICE: An Intelligent Cognition Engine With 3D NAND-Based In-Memory Computing for Vector Similarity Search Acceleration. In MICRO.
  • Huang et al. (2014) P. Huang, P. Subedi, X. He, S. He, and K. Zhou. 2014. FlexECC: Partially Relaxing ECC of MLC SSD for Better Cache Performance. In USENIX ATC.
  • Im et al. (2020) J. Im, J. Bae, C. Chung, Arvind, and S. Lee. 2020. PinK: High-Speed In-Storage Key-Value Store With Bounded Tails. In USENIX ATC.
  • Im et al. (2018) J.-F. Im, K. Gopalakrishna, S. Subramaniam, M. Shrivastava, A. Tumbde, X. Jiang, J. Dai, S. Lee, N. Pawar, J. Li, and R. Aringunram. 2018. Pinot: Realtime OLAP for 530 Million Users. In SIGMOD.
  • Imani et al. (2018) M. Imani, S. Gupta, Y. Kim, and T. Rosing. 2018. FloatPIM: In-Memory Acceleration of Deep Neural Network Training With High Precision. In ISCA.
  • Imani et al. (2016) M. Imani, S. Patil, and T. S. Rosing. 2016. MASC: Ultra-Low Energy Multiple-Access Single-Charge TCAM for Approximate Computing. In DATE.
  • Jeloka et al. (2016) S. Jeloka, N. B. Akesh, D. Sylvester, and D. Blaauw. 2016. A 28 nm Configurable Memory (TCAM/BCAM/SRAM) Using Push-Rule 6T Bit Cell Enabling Logic-in-Memory. In JSSC.
  • Jimenez et al. (2013) X. Jimenez, D. Novo, and P. Ienne. 2013. Phoenix: Reviving MLC Blocks as SLC to Extend NAND Flash Devices Lifetime. In DATE.
  • Jin et al. (2017) Y. Jin, H.-W. Tseng, Y. Papakonstantinou, and S. Swanson. 2017. KAML: A Flexible, High-Performance Key-Value SSD. In HPCA.
  • Jones (2020) S. W. Jones. 2020. Economics in the 3D Era. Talk at LithoVision.
  • Jun et al. (2015) S.-W. Jun, M. Liu, S. Lee, J. Hicks, J. Ankcorn, M. King, S. Xu, and Arvind. 2015. BlueDBM: An Appliance for Big Data Analytics. In ISCA.
  • Jun et al. (2018) S.-W. Jun, A. Wright, S. Zhang, S. Xu, and Arvind. 2018. GraFBoost: Using Accelerated Flash Storage for External Graph Analytics. In ISCA.
  • Jung et al. (2010) D. Jung, J.-U. Kang, H. Jo, J.-S. Kim, and J. Lee. 2010. Superblock FTL: A Superblock-Based Flash Translation Layer With a Hybrid Address Translation Scheme. ACM Trans. Embed. Comput. Syst. (Apr. 2010).
  • Kaiyrakhmet et al. (2019) O. Kaiyrakhmet, S. Lee, B. Nam, S. H. Noh, and Y.-R. Choi. 2019. SLM-DB: Single-Level Key-Value Store With Persistent Memory. In FAST.
  • Kallman et al. (2008) R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik, E. P. C. Jones, S. Madden, M. Stonebraker, Y. Zhang, J. Hugg, and D. J. Abadi. 2008. H-Store: A High-Performance, Distributed Main Memory Transaction Processing System. Proc. VLDB Endow. (Aug. 2008).
  • Kang et al. (2014) U. Kang, H.-S. Yu, C. Park, H. Zheng, J. Halbert, K. Bains, S. Jang, and J. Choi. 2014. Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling. In The Memory Forum.
  • Kang et al. (2013) Y. Kang, Y.-S. Kee, E. L. Miller, and C. Park. 2013. Enabling Cost-Effective Data Processing With Smart SSD. In MSST.
  • Kang et al. (2019) Y. Kang, R. Pitchumani, P. Mishra, Y.-S. Kee, F. Londono, S. Oh, J. Lee, and D. D. G. Lee. 2019. Towards Building a High-Performance, Scale-In Key-Value Storage System. In SYSTOR.
  • Kemper and Neumann (2011) A. Kemper and T. Neumann. 2011. HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots. In ICDE.
  • Kim et al. (2019) K. Kim, R. Johnson, and I. Pandis. 2019. BionicDB: Fast and Power-Efficient OLTP on FPGA. In EDBT.
  • Kim et al. (2011) S. Kim, H. Oh, C. Park, S. Cho, and S.-W. Lee. 2011. Fast, Energy Efficient Scan Inside Flash Memory SSDs. In ADMS.
  • Kocberber et al. (2013) O. Kocberber, B. Grot, J. Picorel, B. Falsafi, K. Lim, and P. Ranganathan. 2013. Meet the Walkers: Accelerating Index Traversals for In-Memory Databases. In MICRO.
  • Kohonen (1980) T. Kohonen. 1980. Content-Addressable Memories. Springer-Verlag.
  • Koo et al. (2017) G. Koo, K. K. Matam, T. I, H. V. K. G. Narra, J. Li, H.-W. Tseng, S. Swanson, and M. Annavaram. 2017. Summarizer: Trading Communication With Computing Near Storage. In MICRO.
  • Kwak et al. (2020) J. Kwak, S. Lee, K. Park, J. Jeong, and Y. H. Song. 2020. Cosmos+ OpenSSD: Rapid Prototype for Flash Storage Systems. TOS (Jul. 2020).
  • Kyrola et al. (2012) A. Kyrola, G. Blelloch, and C. Guestrin. 2012. GraphChi: Large-Scale Graph Computation on Just a PC. In OSDI.
  • Lagrange et al. (2020) V. Lagrange, H. Li, and A. Shayesteh. 2020. Modeling Analytics for Computational Storage. In ICPE.
  • Lakshminarayanan et al. (2005) K. Lakshminarayanan, A. Rangarajan, and S. Venkatachary. 2005. Algorithms for Advanced Packet Classification With Ternary CAMs. In SIGCOMM.
  • Lawley (2014) J. Lawley. 2014. Understanding Performance of PCI Express Systems. White Paper WP350. Xilinx, Inc.
  • Lee et al. (2002) J. Lee, H.-S. Im, D.-S. Byeon, K.-H. Lee, D. H. Chae, K.-H. Lee, Y.-H. Lim, J.-D. Choi, Y.-I. Seo, J.-S. Lee, and K.-D. Suh. 2002. A 1.8V 1Gb NAND Flash Memory With 0.12µm STI Process Technology. In ISSCC.
  • Lee et al. (2021) S. Lee, S.-H. Kang, J. Lee, H. Kim, E. Lee, S. Seo, H. Yoon, S. Lee, K. Lim, H. Shin, J. Kim, O Seongil, A. Iyer, D. Wang, K. Sohn, and N. S. Kim. 2021. Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology. In ISCA.
  • Lehman and Carey (1986) T. J. Lehman and M. J. Carey. 1986. A Study of Index Structures for Main Memory Database Management Systems. In VLDB.
  • Leis et al. (2018) V. Leis, M. Haubenschild, A. Kemper, and T. Neumann. 2018. LeanStore: In-Memory Data Management Beyond Main Memory. In ICDE.
  • Leskovec et al. (2005) J. Leskovec, J. Kleinberg, and C. Faloutsos. 2005. Graphs Over Time: Densification Laws, Shrinking Diameters and Possible Explanations. In KDD.
  • Leskovec et al. (2009) J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. 2009. Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Internet Math. (Jan. 2009).
  • Li et al. (2020) H. Li, H. Jin, L. Zheng, and X. Liao. 2020. ReSQM: Accelerating Database Operations Using ReRAM-Based Content Addressable Memory. IEEE TCAD (Nov. 2020).
  • Li et al. (2014) J. Li, R. K. Montoye, M. Ishii, and L. Chang. 2014. 1 Mb 0.41 µm² 2T-2R Cell Nonvolatile TCAM With Two-Bit Encoding and Clocked Self-Referenced Sensing. JSSC (Apr. 2014).
  • Li et al. (2017) S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie. 2017. DRISA: A DRAM-Based Reconfigurable In-Situ Accelerator. In MICRO.
  • Liu and Huang (2017) H. Liu and H. H. Huang. 2017. Graphene: Fine-Grained IO Management for Graph Computing. In FAST.
  • Liu et al. (2004) J. Liu, A. Mamidala, A. Vishnu, and D. K. Panda. 2004. Performance Evaluation of InfiniBand With PCI Express. In HOTI.
  • Maass et al. (2017) S. Maass, C. Min, S. Kashyap, W. Kang, M. Kumar, and T. Kim. 2017. Mosaic: Processing a Trillion-Edge Graph on a Single Machine. In EuroSys.
  • Mailthody et al. (2019) V. S. Mailthody, Z. Qureshi, W. Liang, Z. Feng, S. G. de Gonzalo, Y. Li, H. Franke, J. Xiong, J. Huang, and W.-m. Hwu. 2019. DeepStore: In-Storage Acceleration for Intelligent Queries. In MICRO.
  • Malewicz et al. (2010) G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. 2010. Pregel: A System for Large-Scale Graph Processing. In SIGMOD.
  • Malicevic et al. (2017) J. Malicevic, B. Lepers, and W. Zwaenepoel. 2017. Everything You Always Wanted to Know About Multicore Graph Processing but Were Afraid to Ask. In USENIX ATC.
  • Mandelman et al. (2002) J. A. Mandelman, R. H. Dennard, G. B. Bronner, J. K. DeBrosse, R. Divakaruni, Y. Li, and C. J. Radens. 2002. Challenges and Future Directions for the Scaling of Dynamic Random-Access Memory (DRAM). IBM JRD (Mar. 2002).
  • Matam et al. (2019) K. K. Matam, G. Koo, H. Zha, H.-W. Tseng, and M. Annavaram. 2019. GraphSSD: Graph Semantics Aware SSD. In ISCA.
  • Matsui et al. (2017) C. Matsui, C. Sun, and K. Takeuchi. 2017. Design of Hybrid SSDs With Storage Class Memory and NAND Flash Memory. Proc. IEEE (2017).
  • Matsunaga et al. (2009) S. Matsunaga, K. Hiyama, A. Matsumoto, S. Ikeda, H. Hasegawa, K. Miura, J. Hayakawa, T. Endoh, H. Ohno, and T. Hanyu. 2009. Standby-Power-Free Compact Ternary Content-Addressable Memory Cell Chip Using Magnetic Tunnel Junction Devices. APEX (Feb. 2009).
  • Matsunaga et al. (2011) S. Matsunaga, A. Katsumata, M. Natsui, S. Fukami, T. Endoh, H. Ohno, and T. Hanyu. 2011. Fully Parallel 6T-2MTJ Nonvolatile TCAM With Single-Transistor-Based Self Match-Line Discharge Control. In VLSIC.
  • Meribout et al. (2000) M. Meribout, T. Ogura, and M. Nakanishi. 2000. On Using the CAM Concept for Parametric Curve Extraction. TIP (2000).
  • Mutlu (2013) O. Mutlu. 2013. Memory Scaling: A Systems Architecture Perspective. In IMW.
  • Nair et al. (2015) R. Nair, S. F. Antao, C. Bertolli, P. Bose, J. R. Brunheroto, T. Chen, C.-Y. Cher, C. H. A. Costa, J. Doi, C. Evangelinos, B. M. Fleischer, T. W. Fox, D. S. Gallo, L. Grinberg, J. A. Gunnels, A. C. Jacob, P. Jacob, H. M. Jacobson, T. Karkhanis, C. Kim, J. H. Moreno, J. K. O’Brien, M. Ohmacht, Y. Park, D. A. Prener, B. S. Rosenburg, K. D. Ryu, O. Sallenave, M. J. Serrano, P. D. M. Siegl, K. Sugavanam, and Z. Sura. 2015. Active Memory Cube: A Processing-in-Memory Architecture for Exascale Systems. IBM JRD (Mar.–May 2015).
  • Narla et al. (2023) S. Narla, P. Kumar, A. F. Laguna, D. Reis, X. S. Hu, M. Niemier, and A. Naeemi. 2023. Design of a Compact Spin-Orbit-Torque-Based Ternary Content Addressable Memory. TED (Feb. 2023).
  • Neumann and Freitag (2020) T. Neumann and M. J. Freitag. 2020. Umbra: A Disk-Based System With In-Memory Performance. In CIDR.
  • NVM Express, Inc. (2021a) NVM Express, Inc. 2021a. NVM Express® Base Specification, Revision 2.0a.
  • NVM Express, Inc. (2021b) NVM Express, Inc. 2021b. NVM Express® NVM Command Set Specification, Revision 1.0a.
  • NVM Express, Inc. (2022) NVM Express, Inc. 2022. NVM Express® Key Value Command Set Specification, Revision 1.0b.
  • Oliveira et al. (2021) G. F. Oliveira, J. Gómez-Luna, L. Orosa, S. Ghose, N. Vijaykumar, I. Fernandez, M. Sadrosadati, and O. Mutlu. 2021. DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks. IEEE Access (2021).
  • Page et al. (1999) L. Page, S. Brin, R. Motwani, and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66. Stanford Univ. InfoLab.
  • Pagiamtzis and Sheikholeslami (2006) K. Pagiamtzis and A. Sheikholeslami. 2006. Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey. JSSC (Feb. 2006).
  • Park et al. (2022) J. Park, R. Azizi, G. F. Oliveira, M. Sadrosadati, R. Nadig, D. Novo, J. Gómez-Luna, M. Kim, and O. Mutlu. 2022. Flash-Cosmos: In-Flash Bulk Bitwise Operations Using Inherent Computation Capability of NAND Flash Memory. In MICRO.
  • Park et al. (2021) J.-H. Park, S. Choi, G. Oh, and S.-W. Lee. 2021. SaS: SSD as SQL Database System. Proc. VLDB Endow. (May 2021).
  • Park et al. (2008) K.-T. Park, M. Kang, D. Kim, S.-W. Hwang, B. T. Choi, Y.-T. Lee, C. Kim, and K. Kim. 2008. A Zeroing Cell-to-Cell Interference Page Architecture With Temporary LSB Storing and Parallel MSB Program Scheme for MLC NAND Flash Memories. JSSC (2008).
  • Pearce et al. (2010) R. Pearce, M. Gokhale, and N. M. Amato. 2010. Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory. In SC.
  • Pei and Zukowski (1991) T.-B. Pei and C. Zukowski. 1991. VLSI Implementation of Routing Tables: Tries and CAMs. In INFOCOM.
  • Potter et al. (1994) J. Potter, J. Baker, S. Scott, A. Bansal, C. Leangsuksun, and C. Asthagiri. 1994. ASC: An Associative-Computing Paradigm. Computer (Nov. 1994).
  • Potter (2012) J. L. Potter. 2012. Associative Computing: A Programming Paradigm for Massively Parallel Computers. Springer Science & Business Media.
  • R. Cheerla (2019) R. Cheerla. 2019. Computational SSDs. https://www.snia.org/sites/default/files/SDCEMEA/2019/Presentations/Computational_SSDs_Final.pdf. Talk at SDC EMEA.
  • Rajendran et al. (2011) B. Rajendran, R. W. Cheek, L. A. Lastras, M. M. Franceschini, M. J. Breitwisch, A. Schrott, J. Li, R. Montoye, L. Chang, and C. Lam. 2011. Demonstration of CAM and TCAM Using Phase Change Devices. In IMW.
  • Ravikumar and Mahapatra (2004) V. C. Ravikumar and R. N. Mahapatra. 2004. TCAM Architecture for IP Lookup Using Prefix Properties. IEEE Micro (Mar.–Apr. 2004).
  • Reinsel et al. (2018) D. Reinsel, J. Gantz, and J. Rydning. 2018. Data Age 2025: The Digitization of the World From Edge to Core. Technical Report. IDC.
  • Roy et al. (2015) A. Roy, L. Bindschaedler, J. Malicevic, and W. Zwaenepoel. 2015. Chaos: Scale-Out Graph Processing From Secondary Storage. In SOSP.
  • Roy et al. (2013) A. Roy, I. Mihailovic, and W. Zwaenepoel. 2013. X-Stream: Edge-Centric Graph Processing Using Streaming Partitions. In SOSP.
  • Salamat et al. (2021) S. Salamat, A. Haj Aboutalebi, B. Khaleghi, J. H. Lee, Y. S. Ki, and T. Rosing. 2021. NASCENT: Near-Storage Acceleration of Database Sort on SmartSSD. In FPGA.
  • Samsung Electronics Co., Ltd. ([n. d.]) Samsung Electronics Co., Ltd. [n. d.]. HBM Processing in Memory. https://www.samsung.com/semiconductor/solutions/technology/hbm-processing-in-memory/.
  • Samsung Electronics Co., Ltd. (2022) Samsung Electronics Co., Ltd. 2022. Samsung Electronics Develops Second-Generation SmartSSD Computational Storage Drive With Upgraded Processing Functionality. https://news.samsung.com/global/samsung-electronics-develops-second-generation-smartssd-computational-storage-drive-with-upgraded-processing-functionality.
  • Samynathan et al. (2019) B. Samynathan, K. Chapman, M. Nik, B. Robatmili, S. Mirkhani, and M. Lavasani. 2019. Computational Storage for Big Data Analytics. In ADMS.
  • Sayre (1976) G. E. Sayre. 1976. STARAN: An Associative Approach to Multiprocessor Architecture. In Comp. Arch.: Wkshp. of Gesellschaft für Informatik Erlangen.
  • Seshadri et al. (2014) S. Seshadri, M. Gahagan, S. Bhaskaran, T. Bunker, A. De, Y. Jin, Y. Liu, and S. Swanson. 2014. Willow: A User-Programmable SSD. In OSDI.
  • Seshadri et al. (2017) V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry. 2017. Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology. In MICRO.
  • Shafiee et al. (2016) A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar. 2016. ISAAC: A Convolutional Neural Network Accelerator With In-Situ Analog Arithmetic in Crossbars. In ISCA.
  • Shanbhag et al. (2020) A. Shanbhag, N. Tatbul, D. Cohen, and S. Madden. 2020. Large-Scale In-Memory Analytics on Intel® Optane™ DC Persistent Memory. In DaMoN.
  • Shim and Yu (2022) W. Shim and S. Yu. 2022. GP3D: 3D NAND Based In-Memory Graph Processing Accelerator. JETCAS (Jun. 2022).
  • Shin et al. (2012) S.-H. Shin, D.-K. Shim, J.-Y. Jeong, O.-S. Kwon, S.-Y. Yoon, M.-H. Choi, T.-Y. Kim, H.-W. Park, H.-J. Yoon, Y.-S. Song, Y.-H. Choi, S.-W. Shim, Y.-L. Ahn, K.-T. Park, J.-M. Han, K.-H. Kyung, and Y.-H. Jun. 2012. A New 3-Bit Programming Algorithm Using SLC-to-TLC Migration for 8MB/S High Performance TLC NAND Flash Memory. In VLSIC.
  • Shun and Blelloch (2013) J. Shun and G. E. Blelloch. 2013. Ligra: A Lightweight Graph Processing Framework for Shared Memory. In PPoPP.
  • Spitznagel et al. (2003) E. Spitznagel, D. Taylor, and J. Turner. 2003. Packet Classification Using Extended TCAMs. In ICNP.
  • Stoica and Ailamaki (2013) R. Stoica and A. Ailamaki. 2013. Enabling Efficient OS Paging for Main-Memory OLTP Databases. In DaMoN.
  • Stonebraker and Weisberg (2013) M. Stonebraker and A. Weisberg. 2013. The VoltDB Main Memory DBMS. IEEE Data Eng. Bull. (Jun. 2013).
  • Sun et al. (2017) Y. Sun, Y. Wang, and H. Yang. 2017. Energy-Efficient SQL Query Exploiting RRAM-Based Process-in-Memory Structure. In NVMSA.
  • Suzuki et al. (2021) T. Suzuki, K. Hiwada, H. Kajihara, S. Sano, S. Nomura, and T. Shiozawa. 2021. Approaching DRAM Performance by Using Microsecond-Latency Flash Memory for Small-Sized Random Read Accesses: A New Access Method and Its Graph Applications. Proc. VLDB Endow. (Apr. 2021).
  • Tavakkol et al. (2018) A. Tavakkol, J. Gómez-Luna, M. Sadrosadati, S. Ghose, and O. Mutlu. 2018. MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices. In FAST.
  • Transaction Processing Council (2010) Transaction Processing Council. 2010. TPC-C Benchmark. http://www.tpc.org/tpcc/spec/tpcc_current.pdf
  • Transaction Processing Council (2011) Transaction Processing Council. 2011. TPC-H DBGEN. https://github.com/electrum/tpch-dbgen.
  • Transaction Processing Council (2021) Transaction Processing Council. 2021. TPC-H Benchmark. http://www.tpc.org/tpch
  • Tseng et al. (2020) P.-H. Tseng, F.-M. Lee, Y.-H. Lin, L.-Y. Chen, Y.-C. Li, H.-W. Hu, Y.-Y. Wang, C.-C. Hsieh, M.-H. Lee, H.-L. Lung, K.-Y. Hsieh, K.-C. Wang, and C.-Y. Lu. 2020. In-Memory-Searching Architecture Based on 3D-NAND Technology With Ultra-High Parallelism. In IEDM.
  • UPMEM SAS ([n. d.]) UPMEM SAS. [n. d.]. Technology. https://www.upmem.com/technology/.
  • Vora et al. (2016) K. Vora, G. Xu, and R. Gupta. 2016. Load the Edges You Need: A Generic I/O Optimization for Disk-Based Graph Processing. In USENIX ATC.
  • Vuduc (2003) R. Vuduc. 2003. Automatic Performance Tuning of Sparse Matrix Kernels. Ph. D. Dissertation. Univ. of California, Berkeley.
  • Wade and Sodini (1987) J. P. Wade and C. G. Sodini. 1987. Dynamic Cross-Coupled Bit-Line Content Addressable Memory Cell for High-Density Arrays. JSSC (Feb. 1987).
  • Wang et al. (2020) K. Wang, Z. Shen, C. Huang, C.-H. Wu, Y. Dong, and A. Kanakia. 2020. Microsoft Academic Graph: When Experts Are Not Enough. QSS (Feb. 2020).
  • Wang et al. (2018) P. Wang, S. Li, G. Sun, X. Wang, Y. Chen, H. Li, J. Cong, N. Xiao, and T. Zhang. 2018. RC-NVM: Enabling Symmetric Row and Column Memory Accesses for In-Memory Databases. In HPCA.
  • Web Data Commons (2014) Web Data Commons. 2014. Hyperlink Graphs. http://webdatacommons.org/hyperlinkgraph/.
  • Woods et al. (2014) L. Woods, Z. István, and G. Alonso. 2014. Ibex: An Intelligent Storage Engine With Support for Advanced SQL Offloading. Proc. VLDB Endow. (Jul. 2014).
  • Wu et al. (2019) K. Wu, A. Arpaci-Dusseau, R. Arpaci-Dusseau, R. Sen, and K. Park. 2019. Exploiting Intel Optane SSD for Microsoft SQL Server. In DaMoN.
  • Xu et al. (2020) S. Xu, T. Bourgeat, T. Huang, H. Kim, S. Lee, and Arvind. 2020. AQUOMAN: An Analytic-Query Offloading Machine. In MICRO.
  • Xu et al. (2016) S. Xu, S. Lee, S.-W. Jun, M. Liu, J. Hicks, and Arvind. 2016. BlueCache: A Scalable Distributed Flash-Based Key-Value Store. Proc. VLDB Endow. (Nov. 2016).
  • Xu et al. (2010) W. Xu, T. Zhang, and Y. Chen. 2010. Design of Spin-Torque Transfer Magnetoresistive RAM and CAM/TCAM With High Sensing and Search Speed. TVLSI (Jan. 2010).
  • Yang and Leskovec (2011) J. Yang and J. Leskovec. 2011. Patterns of Temporal Variation in Online Media. In WSDM.
  • Yang and Leskovec (2012) J. Yang and J. Leskovec. 2012. Defining and Evaluating Network Communities Based on Ground-Truth. In KDD.
  • Yang et al. (2016) M.-C. Yang, Y.-H. Chang, C.-W. Tsao, and C.-Y. Liu. 2016. Utilization-Aware Self-Tuning Design for TLC Flash Storage Devices. TVLSI (Oct. 2016).
  • Yavits et al. (2021) L. Yavits, R. Kaplan, and R. Ginosar. 2021. GIRAF: General Purpose In-Storage Resistive Associative Framework. TPDS (2021).
  • Yin et al. (2019) X. Yin, K. Ni, D. Reis, S. Datta, M. Niemier, and X. S. Hu. 2019. An Ultra-Dense 2FeFET TCAM Design Based on a Multi-Domain FeFET Model. TCAS-II (Dec. 2019).
  • Yu et al. (2014) X. Yu, G. Bezerra, A. Pavlo, S. Devadas, and M. Stonebraker. 2014. Staring Into the Abyss: An Evaluation of Concurrency Control With One Thousand Cores. Proc. VLDB Endow. (Nov. 2014).
  • Zane et al. (2003) F. Zane, G. Narlikar, and A. Basu. 2003. CoolCAMs: Power-Efficient TCAMs for Forwarding Engines. In INFOCOM.
  • Zha and Li (2018) Y. Zha and J. Li. 2018. Liquid Silicon: A Data-Centric Reconfigurable Architecture Enabled by RRAM Technology. In FPGA.
  • Zha and Li (2020) Y. Zha and J. Li. 2020. Hyper-AP: Enhancing Associative Processing Through a Full-Stack Optimization. In ISCA.
  • Zhang et al. (2016) H. Zhang, D. G. Andersen, A. Pavlo, M. Kaminsky, L. Ma, and R. Shen. 2016. Reducing the Storage Overhead of Main-Memory OLTP Databases With Hybrid Indexes. In SIGMOD.
  • Zhang et al. (2015) K. Zhang, R. Chen, and H. Chen. 2015. NUMA-Aware Graph-Structured Analytics. In PPoPP.
  • Zhang et al. (2018) P. Zhang, L. Xing, N. Yang, G. Tan, Q. Liu, and C. Zhang. 2018. Redis++: A High Performance In-Memory Database Based on Segmented Memory Management and Two-Level Hash Index. In BDCloud.
  • Zheng et al. (2015) D. Zheng, D. Mhembere, R. Burns, J. Vogelstein, C. E. Priebe, and A. S. Szalay. 2015. FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs. In FAST.
  • Zhu et al. (2015) X. Zhu, W. Han, and W. Chen. 2015. GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. In USENIX ATC.