GoblinCore-64: A scalable, open architecture for data intensive high performance computing

Date

2017-05

Abstract

The current class of mainstream microprocessor architectures relies upon multi-level data caches and relatively low degrees of concurrency to execute a wide range of applications and algorithmic constructs. These architectures are well suited to applications that are generally considered cache friendly, such as those that operate on dense, linear data structures or that heavily reuse data in cache.

However, data intensive applications may access memory with irregular request patterns, or operate on data structures too large to reside entirely in an on-chip data cache. In this work, we introduce the GoblinCore-64 (GC64) architecture and instruction set. The goal of GC64 is to provide a scalable, flexible and open architecture for efficiently executing data intensive computing applications and algorithms.

The GC64 infrastructure is built upon a hierarchical set of hardware modules designed to support scalable concurrency with explicit support for latency hiding. Its memory hierarchy uses software-managed scratchpad memories for local, application-managed memory requests and Hybrid Memory Cube (HMC) devices, which are three-dimensional stacked memories, for main memory storage. The instruction set is based upon the RISC-V instruction set specification, with extensions to support scatter/gather memory requests, task concurrency and task management. The main memory subsystem is further coupled to an intelligent memory request coalescing methodology. The result is a simple, effective and scalable architecture that can be easily adapted to efficiently execute data intensive applications using commodity programming models such as OpenMP, MPI and MapReduce.
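
As a rough illustration of the general idea behind memory request coalescing, the sketch below groups the byte addresses touched by an irregular gather into aligned block requests, so that several element accesses falling in the same block can be served by one request. This is not the GC64 coalescing methodology, which the abstract does not detail; the function name `coalesce`, the 64-byte block size and the sample addresses are all assumptions chosen for illustration.

```python
def coalesce(addresses, block_bytes=64):
    """Group raw byte addresses into aligned block requests.

    Hypothetical helper: returns a dict mapping each aligned block base
    address to the list of original addresses that fall inside that block.
    The 64-byte default is an assumed request size, not a GC64 parameter.
    """
    blocks = {}
    for addr in addresses:
        base = addr - (addr % block_bytes)  # align down to the block boundary
        blocks.setdefault(base, []).append(addr)
    return blocks


# Assumed example: six element addresses from an irregular gather.
gathered = [0, 8, 4096, 16, 4104, 130]
requests = coalesce(gathered)
print(len(gathered), "element accesses ->", len(requests), "block requests")
```

In this sketch, accesses that share a block collapse into a single request, while isolated accesses still cost one request each; a hardware coalescer would additionally have to bound how long requests wait to be merged.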

To validate the GC64 infrastructure, we construct a simulation environment, based upon the standard RISC-V simulation platform Spike, that mimics a production GC64 device. We demonstrate the system architecture using a wide array of benchmarks, including the GAP Benchmark Suite, the NAS Parallel Benchmarks, the High Performance Conjugate Gradient Benchmark and the Barcelona OpenMP Task Suite.

Keywords

Data intensive computing, High performance computing, Parallel computing, System architecture, Micro architecture, Hybrid memory cube, Latency tolerant architecture, Supercomputer, Graph theory, Combinatorics, Sparse solver
