15618 Project
Summary

We implemented Dijkstra's algorithm in parallel on a GPU and a PIM system to compare their performance on graph workloads. This report discusses the implementations, experiments, and insights gained from comparing these two computing paradigms.

Background

Dijkstra's algorithm is fundamental for finding shortest paths in graphs and is applicable in various domains such as networking and geographic information systems.

The PIM architecture, characterized by its integration of processing units near memory, promises reduced data movement and latency, potentially benefiting data-intensive applications like graph processing.

PIM Architecture

In our project, we implement Dijkstra's algorithm on a special architecture named Processing-in-Memory (PIM), in which multiple processing units (PEs) are embedded within the DRAM modules of the system. The figure below shows the architecture of the UPMEM PIM system, one of the leading PIM architectures. It is composed of a host CPU that manages the PIM modules, conventional DRAM main memory, and PIM-enabled memory. Each DIMM carries multiple PIM chips, and each chip includes 8 DRAM Processing Units (DPUs). Each DPU can access a 64 MB DRAM bank and has its own stack and heap for storing data with fast memory access.

CPU Architecture

The CPU architecture aims to maximize performance for general workloads, and CPU performance is bottlenecked by memory access latency. Modern CPU architectures implement many techniques to hide this latency, such as prefetching and out-of-order execution, as well as exploiting memory-level and thread-level parallelism. However, due to the random access pattern of graph topologies, prefetching does not work effectively, and the CPU suffers from memory access latency.

GPU Architecture

The GPU is designed specifically for dense data computation, with high memory bandwidth and parallel memory access capability to reduce memory access latency. It exploits the SIMT execution model, which is well suited to highly parallel, independent workloads. The GPU is designed to maximize parallelism, memory bandwidth, and throughput for a wide range of parallel computing workloads.

Algorithm Description

Dijkstra's algorithm works by iteratively updating the shortest paths from a given source to all vertices. The algorithm has wide real-world application, for instance in network routing protocols, where it is foundational for many types of network analysis. The pseudo-code below demonstrates the naive implementation of Dijkstra's algorithm.
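A minimal C sketch of that naive version (our reconstruction, assuming an adjacency matrix where `INF` marks absent edges):

```c
#include <stdbool.h>
#include <limits.h>

#define INF INT_MAX

/* Sequential Dijkstra over an n x n adjacency matrix `graph`.
 * On return, dist[v] holds the shortest distance from src to v. */
void dijkstra(int n, const int *graph, int src, int *dist) {
    bool visited[n];
    for (int v = 0; v < n; v++) { dist[v] = INF; visited[v] = false; }
    dist[src] = 0;

    for (int iter = 0; iter < n; iter++) {
        /* Select the unvisited vertex with the smallest tentative distance. */
        int u = -1, best = INF;
        for (int v = 0; v < n; v++)
            if (!visited[v] && dist[v] < best) { best = dist[v]; u = v; }
        if (u == -1) break;                 /* remaining vertices unreachable */
        visited[u] = true;

        /* Relax the edges out of u (assumes weights avoid int overflow). */
        for (int v = 0; v < n; v++) {
            int w = graph[u * n + v];
            if (w != INF && dist[u] + w < dist[v])
                dist[v] = dist[u] + w;
        }
    }
}
```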

Dijkstra's algorithm takes a graph topology containing the path cost between every pair of vertices, represented as a 2D array, and iteratively computes the shortest path length between the starting vertex and every other vertex in the graph. The output is the shortest path length from the start vertex to all other vertices.

The key operations of the Dijkstra's algorithm include:

  1. Updating distance estimates for vertices adjacent to the current vertex.
  2. Selecting the next vertex with the shortest tentative distance.

The distance update is computationally heavy: every neighbor of the current vertex has to update its shortest path, so all vertices are accessed sequentially in a random access pattern, which is bottlenecked by the memory hierarchy on the CPU.

Also, finding the vertex with the shortest tentative distance requires traversing all vertices to compare their path lengths, so this step takes O(N) time, or O(lg N) at best when using a priority queue.

The two operations are inherently sequential with respect to each other; we therefore keep the two steps executing sequentially while implementing each step using the special characteristics of the GPU and PIM to improve performance.

Graph Characteristics and Parallelization Opportunities

Graph algorithms like Dijkstra's are challenging for parallel systems: finding the shortest-path vertex and updating the tentative distance of each adjacent vertex are inherently sequential steps, and on top of that the workload exhibits:

  1. Poor data locality due to the irregular memory access patterns because of graph topology.
  2. Workload imbalance caused by non-uniform graph structures.

Poor data locality.

The PIM architecture is specifically designed for irregular memory access: it aims to break the von Neumann bottleneck by integrating processing capabilities directly within or near the memory chips, minimizing data movement between the CPU and memory. The irregular memory accesses of a graph topology make prefetching hard and ineffective on current CPU architectures. The GPU architecture, meanwhile, has high bandwidth to main memory, so its design can hide much of the memory access latency.

Workload imbalance.

The GPU uses the SIMT execution model, which is sensitive to workload imbalance. However, when the density of the graph is high, the imbalance is mitigated, and we see the GPU outperform the CPU when the graph is fully connected or highly dense. The PIM architecture has limited threading and a simple processing-unit design, and therefore suffers less from workload imbalance.

Implementation
  1. Languages/APIs: C++ with CUDA for the GPU, and C with the UPMEM SDK API for the PIM architecture.
  2. Hardware: GPU tests were conducted on the GHC/PSC clusters; PIM tests ran on UPMEM platforms.

GPU Implementation

The GPU version of Dijkstra's algorithm is composed of two kernels; the first kernel uses parallel reduction to find the vertex with the shortest path in the current iteration.
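A sketch of that first kernel's loading phase, reconstructed from the description below (the array names `shared_dist`/`shared_node`, the `BLOCK_SIZE` constant, and the grid-stride loop are our assumptions, not necessarily the project's exact code):

```cuda
#include <climits>

#define BLOCK_SIZE 256   // assumed launch configuration: blockDim.x == BLOCK_SIZE
#define INF INT_MAX

// First half of the min-finding kernel: each thread scans a strided
// portion of dist[] and deposits its local minimum into shared memory.
__global__ void find_min_vertex(const int *dist, const bool *visited, int n,
                                int *block_min, int *block_node) {
    __shared__ int shared_dist[BLOCK_SIZE];
    __shared__ int shared_node[BLOCK_SIZE];

    int thread_id = threadIdx.x;
    int local_min = INF, local_node = -1;

    // Threads collaboratively cover all n vertices.
    for (int v = blockIdx.x * blockDim.x + thread_id; v < n;
         v += gridDim.x * blockDim.x) {
        if (!visited[v] && dist[v] < local_min) {
            local_min = dist[v];
            local_node = v;
        }
    }
    shared_dist[thread_id] = local_min;
    shared_node[thread_id] = local_node;

    // Barrier: every thread's local minimum must be visible in shared
    // memory before the reduction phase reads it.
    __syncthreads();
    // ... parallel reduction follows (next code block) ...
}
```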

Each thread stores its locally found minimum distance in the shared memory array `shared_dist` at the position corresponding to its thread ID within the block.

`__syncthreads()` is a barrier synchronization that ensures all threads in the block have completed their execution up to this point before proceeding. This is crucial because the next step involves reading values written to shared memory by all threads in the block.

The above code block is the first half of the shortest-vertex-finding kernel: to maximize performance, each thread collaboratively loads a portion of the current distance array, and the threads then find the vertex with the shortest path in parallel.
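The reduction half might then look like this continuation of the same kernel (again a hedged sketch):

```cuda
    // Tree reduction in shared memory: each step halves the number of
    // active threads until shared_dist[0] holds the block's minimum.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (thread_id < s && shared_dist[thread_id + s] < shared_dist[thread_id]) {
            shared_dist[thread_id] = shared_dist[thread_id + s];
            shared_node[thread_id] = shared_node[thread_id + s];
        }
        __syncthreads();
    }

    // The first thread publishes the block's minimum distance and vertex
    // index to global memory; with a single-block launch (or a small
    // follow-up pass over per-block results) this identifies the next
    // vertex, which is then marked as visited.
    if (thread_id == 0) {
        block_min[blockIdx.x]  = shared_dist[0];
        block_node[blockIdx.x] = shared_node[0];
    }
```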

Finding the vertex with minimum path length is done by parallel reduction shown in the above code block. The parallel reduction phase in shared memory efficiently finds the vertex with the minimum path length. In the loop, each active thread compares distances stored in shared memory: if the distance at position thread_id + s is less than the distance at thread_id, it updates the minimum value and the corresponding node index. The loop iteratively halves the number of active threads, reducing the potential minimum distances until only the smallest remains. This minimum distance and its corresponding vertex index are then written to global memory by the first thread in the block, marking the identified vertex as visited. This approach minimizes global memory accesses, leveraging the fast shared memory to improve performance significantly.

Once the vertex with the current shortest path is found, the second kernel is launched to update every vertex in parallel. Each thread updates the shortest path for one vertex, fully utilizing the computational power of the GPU.
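A minimal sketch of this second kernel, assuming one thread per vertex and the same adjacency-matrix layout as before:

```cuda
// Second kernel: relax the edges out of the newly finalized vertex u.
__global__ void update_distances(const int *graph, int *dist,
                                 const bool *visited, int n, int u) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n || visited[v]) return;

    int w = graph[u * n + v];          // edge weight u -> v (INF if absent)
    if (w != INF && dist[u] != INF && dist[u] + w < dist[v])
        dist[v] = dist[u] + w;         // each thread owns one dist[v]: no races
}
```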

Since the two kernels must run sequentially, we improved the performance of each kernel individually by mapping each vertex to a GPU thread to perform the parallel reduction and the shortest-path updates.

PIM Implementation

The PIM module and the host are highly separated in terms of memory. Therefore, the host must first load the graph into the DPU's heap; the result is stored on the DPU's stack and transferred back to the host once the DPUs finish execution.
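A hedged sketch of this host-side flow using the UPMEM host API (the binary name, the `dist` symbol, and the single-DPU setup are assumptions):

```c
#include <dpu.h>

#define DPU_BINARY "./dijkstra_dpu"   /* hypothetical DPU binary path */

void run_on_dpu(const int *graph, int *dist, unsigned n) {
    struct dpu_set_t set, dpu;
    DPU_ASSERT(dpu_alloc(1, NULL, &set));
    DPU_ASSERT(dpu_load(set, DPU_BINARY, NULL));

    /* Copy the adjacency matrix to the start of the DPU's MRAM heap
     * (transfer sizes must be multiples of 8 bytes). */
    DPU_ASSERT(dpu_copy_to(set, DPU_MRAM_HEAP_POINTER_NAME, 0,
                           graph, (size_t)n * n * sizeof(int)));

    DPU_ASSERT(dpu_launch(set, DPU_SYNCHRONOUS));

    /* Read the result array back from a WRAM symbol named "dist". */
    DPU_FOREACH(set, dpu) {
        DPU_ASSERT(dpu_copy_from(dpu, "dist", 0, dist, n * sizeof(int)));
    }
    DPU_ASSERT(dpu_free(set));
}
```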

We define a data structure to store the shortest path lengths on the stack of the DPU, and the heap address is stored as a uint32_t value, i.e., a pointer to the start of the graph.

The DPU accesses the heap directly, which reduces the latency of the random accesses to the graph topology. The UPMEM SDK provides a special API for accessing the DPU's heap.
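On the DPU side, the heap is accessed through the SDK's MRAM API; a minimal illustration (the symbol names and the fixed `N` are our assumptions):

```c
#include <mram.h>
#include <defs.h>

#define N 512                 /* hypothetical maximum vertex count */

__host int dist[N];           /* results in fast WRAM, readable by the host */
int row[N];                   /* WRAM buffer for one adjacency-matrix row */

int main() {
    /* The host copied the graph to the start of the MRAM heap. */
    __mram_ptr int *graph = (__mram_ptr int *)DPU_MRAM_HEAP_POINTER;

    /* Fetch row u of the matrix into WRAM; mram_read transfers are
     * limited to 2048 bytes and must be 8-byte aligned, so larger
     * rows would be read in chunks. */
    int u = 0;
    mram_read(&graph[u * N], row, N * sizeof(int));

    /* ... run the sequential Dijkstra loop, fetching row u from MRAM
     * each time a vertex u is finalized ... */
    return 0;
}
```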

However, the DPU imposes limitations when we try to parallelize the work within it. Specifically, the heap and the stack are too small to hold the state of the parallel version of Dijkstra's algorithm. The threads in a DPU cannot share the heap, so all intermediate data must be stored on the stack, which is merely 64 KB. That stack can only accommodate graphs of about 16 vertices, which is not particularly useful. Because of this highly separated threading model and the limited stack size, we were unable to implement multi-threading on the DPU.

Therefore, the DPU version of Dijkstra's algorithm follows the sequential execution path, relying on near-memory processing to reduce memory access latency.

Results

The statistics we use for performance analysis are the execution times on the CPU, GPU, and PIM, plus the cycle counts gathered from the CPU and PIM to analyze the benefit of near-data computing.

CPU baseline result

For the CPU baseline, we measure the computation time together with the cycle count to identify the bottleneck of the sequential version of Dijkstra's algorithm. The cycle count for the 1024-vertex graph suddenly increases by a large amount compared to the 64-vertex graph, which fits into the cache. This result clearly indicates that cache misses, caused by the random access pattern and the large graph, make the CPU wait on memory accesses and thus drastically increase the total execution cycles.

We therefore hypothesize that the GPU architecture, designed for large-dataset computing with high bandwidth to main memory, can mitigate the memory access latency. The PIM system reduces memory access latency by placing processing units close to memory, which can mitigate it as well.

GPU result

Due to the SIMT execution model, which is specifically designed for dense data computing, the GPU outperforms the CPU when the graph is dense. The CPU accesses each vertex sequentially, while the GPU updates vertices in parallel, so the benefit of parallelism outweighs the kernel launch overhead when running Dijkstra's algorithm on a densely connected graph whose vertex count is too large to fit in cache.

As Figure 2 shows, when the graph is small and sparsely connected, the GPU is slower than the CPU version of the code, since the overhead of the GPU code is too large compared to the speedup from parallelism. However, when the graph is large and fully connected, the GPU can leverage its computational power and high memory bandwidth, and the GPU version eventually reaches a 2x speedup over the CPU.

PIM result

As the table demonstrates, DPU performance does not benefit from near-data computing. Although the DPU is located near the DRAM with a short interconnect between the two, its processing unit is far less powerful than the highly optimized CPU, resulting in the DPU's under-performance. We expected the DPU to need fewer cycles thanks to its easy access to main memory; however, both the execution time and the cycle count of the DPU are large.

As the plot shows, PIM execution time increases drastically with graph size. This is due to the large number of memory accesses and arithmetic operations required by larger graphs, while the simple processing unit lacks optimizations such as out-of-order execution and memory-level parallelism found in modern CPUs. In addition, due to the limited stack space, the DPU cannot execute our parallel version of Dijkstra's algorithm, which further hurts its performance.

Sensitivity analysis

The following plot is generated using graphs of different densities, each with a total of 1024 vertices.

Graph density has little impact on PIM execution time, indicating that PIM execution time is not affected by irregular data access; instead, it is bottlenecked by arithmetic and the sheer number of memory accesses.

Analysis

Challenges

The challenges below are inherent bottlenecks that keep Dijkstra's algorithm from achieving good performance and from fully leveraging parallelism. Our GPU and PIM implementations can only mitigate the effects of workload imbalance and low data locality to achieve speedup; they do not eliminate them.

  1. Divergent Execution: Both architectures exhibited issues with path length variability affecting thread execution efficiency.
  2. Workload Imbalance: Non-uniform graph structures led to significant load imbalance, impacting overall performance.
  3. Low data locality: The random access pattern of the graph topology exhibits low data locality, so cache misses and memory access latency bottleneck performance.

Comparative Analysis

GPU vs. PIM:

While the GPU excelled at highly parallelizable tasks, its performance was bottlenecked by memory-intensive operations, a domain where PIM is expected to perform better due to its architecture. In reality, however, the DPU module is primitive and lacks the sophisticated optimizations that CPUs generally have. The DPU is also far less powerful than the GPU, being a simple accelerator by comparison. As a result, the DPU performs much worse than both the CPU and the GPU, though PIM remains an active research area with some promising experimental results. In addition, due to the small stack and heap sizes and the limited parallelism of the DPU, the parallel version of Dijkstra's algorithm could not be implemented on the DPU, further worsening its performance. The GPU, however, outperforms the CPU thanks to its large memory bandwidth to DRAM and its highly parallel execution model, which offsets the CUDA kernel launch overhead when processing a dense graph. Thus, we were eventually able to outperform the CPU baseline using the GPU.