## NVMe + CPU + GPU = Memory Efficient Analytics

HetCache: Synergising NVMe Storage and GPU acceleration for Memory-Efficient Analytics

Hamish Nicholson, Aunn Raza, Periklis Chrysogelos, Anastasia Ailamaki





# The Broken Pillars of Fast Analytics



### Memory is cheap

Memory is (relatively) expensive



### Cache hits are key to performance

NVMe array bandwidth competitive with memory bandwidth



NUMA is insignificant compared to persistent storage access

Increasing accelerator heterogeneity

# Storage must be workload & hardware aware





### **Heterogeneous Hierarchies have Multiple Transfer Paths**



1. DRAM to GPU (32 GB/s)
   - Eagerly transfer pages to GPU-memory
   - Byte-addressable access by GPU
2. NVMe to DRAM (86 GB/s, block)
3. NVMe to GPU-memory to GPU (32 GB/s, block)

### Data routing requires optimizing for path BW & granularity





### **NVMe BW Saturates CPU Throughput**



### **GPU needs caching to mitigate interconnect bottleneck**





# **Block Storage Wastes Interconnect BW**

SELECT T.c FROM T WHERE T.a < 50 AND T.b > 42

[Figure: data path for the query above. Labels: NVMe Storage, columns C and B, DRAM, Processed Data, T = Ø.]

### Byte addressability minimizes overfetching
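
To make the overfetching point concrete, the sketch below compares the bytes moved when a consumer needs only one value from each block. The 4 KiB block size, the 8-byte value width, and both function names are illustrative assumptions, not figures or APIs from the talk.

```python
def bytes_moved_block(offsets, value_size, block_size):
    """Bytes transferred when every touched block must be fetched whole."""
    touched = set()
    for off in offsets:
        first = off // block_size
        last = (off + value_size - 1) // block_size
        touched.update(range(first, last + 1))
    return len(touched) * block_size


def bytes_moved_byte_addressable(offsets, value_size):
    """Bytes transferred when only the requested values are read."""
    return len(offsets) * value_size


BLOCK = 4096  # assumed block size in bytes
VALUE = 8     # assumed width of one column value in bytes

# One qualifying value at the start of each of 1000 consecutive blocks.
offsets = [i * BLOCK for i in range(1000)]

block_bytes = bytes_moved_block(offsets, VALUE, BLOCK)
byte_bytes = bytes_moved_byte_addressable(offsets, VALUE)
overfetch = block_bytes / byte_bytes

print(f"block access moves {block_bytes} bytes")
print(f"byte-addressable access moves {byte_bytes} bytes")
print(f"overfetch factor: {overfetch:.0f}x")
```

Real byte-addressable access still moves data at some minimum transaction size, so the ratio here should be read as an upper bound under the stated assumptions.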








## **Transfer Path Depends on Workload and Hardware**

1. Logical scan emits page IDs
2. Route based on cache contents
3. mem-move consults HetCache on:
   - Locations of in-memory copies
   - Preferred location to cache, if any

\*Chrysogelos et al. [VLDB 2019]
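
The three steps above can be sketched as a small planning function. This is a minimal illustration, not the Proteus or HetCache implementation: `CacheDirectory`, `logical_scan`, `plan_transfer`, the location names, and the rule that a DRAM copy is preferred over a GPU copy are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Optional, Set, Tuple


@dataclass
class CacheDirectory:
    """Hypothetical stand-in for the metadata HetCache exposes to mem-move."""
    copies: Dict[int, Set[str]] = field(default_factory=dict)  # page -> memories holding a copy
    preferred: Dict[int, str] = field(default_factory=dict)    # page -> where to cache, if anywhere

    def locations(self, page_id: int) -> Set[str]:
        return self.copies.get(page_id, set())

    def preferred_location(self, page_id: int) -> Optional[str]:
        return self.preferred.get(page_id)


def logical_scan(num_pages: int) -> Iterator[int]:
    """Step 1: the scan emits page IDs, not the data itself."""
    yield from range(num_pages)


def plan_transfer(page_id: int, consumer: str,
                  cache: CacheDirectory) -> Tuple[str, str, Optional[str]]:
    """Steps 2-3: choose a source for the page and, on a miss, where to cache it.

    Returns (source, destination, cache_in).
    """
    held = cache.locations(page_id)
    if consumer in held:
        source = consumer   # already local to the consumer
    elif "dram" in held:
        source = "dram"     # assumed priority: DRAM copy before GPU copy
    elif "gpu" in held:
        source = "gpu"
    else:
        source = "nvme"     # no in-memory copy exists
    # Only pages read from storage are candidates for caching in this sketch.
    cache_in = cache.preferred_location(page_id) if source == "nvme" else None
    return source, consumer, cache_in


cache = CacheDirectory(
    copies={0: {"gpu"}, 1: {"dram"}},
    preferred={2: "dram"},
)
plans = [plan_transfer(p, "gpu", cache) for p in logical_scan(4)]
for page_id, plan in enumerate(plans):
    print(page_id, plan)
```

Keeping the scan purely logical means the choice of transfer path can be deferred until the cache state is known for each page.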






# **Experimental Setup**

- Hardware
  - 24-core AMD EPYC 7413
  - NVIDIA A40, PCIe 4.0 x16
  - 12x PCIe 4.0 x4 NVMe, 7 GB/s each
- Software
  - Proteus: Hybrid CPU-GPU analytical engine
- Benchmark
  - Star Schema Benchmark (SF 1000)
  - ~96 GB working set per query (Q1.x, Q3.x)
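
As a rough back-of-the-envelope check, the sketch below divides the ~96 GB working set by the path bandwidths quoted earlier in the deck (86 GB/s and 32 GB/s). It assumes every byte of the working set crosses the path exactly once at full bandwidth, so the results are lower bounds on transfer time, not measured query times; the dictionary keys and the function name are illustrative.

```python
WORKING_SET_GB = 96.0  # approximate per-query working set from the setup

# Path bandwidths in GB/s, as quoted for the transfer paths.
PATH_GBPS = {
    "NVMe -> DRAM": 86.0,
    "DRAM -> GPU": 32.0,
    "NVMe -> GPU memory": 32.0,
}


def min_transfer_seconds(size_gb: float, gbps: float) -> float:
    """Lower bound on the time to move size_gb over a path of gbps."""
    return size_gb / gbps


bounds = {
    path: min_transfer_seconds(WORKING_SET_GB, bw)
    for path, bw in PATH_GBPS.items()
}
for path, secs in bounds.items():
    print(f"{path}: at least {secs:.2f} s")
```

Under these assumptions the GPU-facing paths need roughly 2.7x longer than the NVMe-to-DRAM path to move the same bytes, which is why reducing what crosses the interconnect matters.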







# **Combining Transfer Paths for GPU**



[Figure panels: Q3.1 (3.4% selectivity) and Q3.4 (0.000076% selectivity)]

### Enabling sub-page accesses via DRAM staging => up to 45% faster

# **Memory Efficient CPU-GPU Execution**







# **Storage BW is approaching DRAM BW**

- (Near) in-memory performance on larger-than-DRAM datasets
  - Granularity- and processing-throughput-aware data placement
- Efficient interconnect use for NVMe-GPU transfers
  - Stage selectively accessed data in DRAM for GPUs
- Storage systems must be hardware & workload aware

