15618 Project
Summary

We will implement a parallel version of Dijkstra's algorithm on both a PIM (processing-in-memory) system and a GPU, and compare the two architectures on this graph workload.

Challenge

Although Dijkstra's algorithm is a well-studied shortest-path graph algorithm, its parallel implementation faces several challenges:

Memory Access Characteristics: Graph search often exhibits irregular memory access patterns, since a vertex's neighbors are distributed arbitrarily in memory, yielding limited temporal and spatial locality.

Divergent Execution: Paths in a graph have varying lengths, so some threads may exit early because their paths involve fewer vertices.

Workload Imbalance: Graph search is inherently imbalanced, because the number of edges incident to each vertex varies with the graph topology.

Communication Overhead: Both the GPU and the DPU have constrained communication bandwidth when the data to transfer is large. Synchronization between threads to update the tentative shortest-path distances also requires communication.

Goals and Deliverables
We plan to implement Dijkstra's algorithm on the two architectures.
The deliverables are the CUDA implementation and the PIM implementation of the algorithm, together with an analysis comparing our parallel implementations across the two architectures. We will present speedup graphs and profiling results, and discuss the findings.
Resources

The resources we need for this project are the GHC/PSC machine clusters, with OpenMP and MPI installed and GPU access enabled, as well as access to the PIM system.

Schedule

Week of March 31:

Develop the CUDA version of Dijkstra's algorithm.

Week of April 7:

Finish developing the CUDA version of Dijkstra's algorithm, and start the PIM version.

Week of April 14:

Finish the PIM version of Dijkstra's algorithm.

Week of April 21:

Collect experimental data and compare performance across the two architectures. Finish writing the final report.

May 5:

Final Project Report Due.