This section covers the background knowledge needed to understand our results.

NetworkX

NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. It includes many standard graph algorithms implemented in a serial programming model, which we use as reference implementations.

Compressed Sparse Row (CSR) Format

The Compressed Sparse Row (CSR) format is a data structure in which two arrays store all of the graph information: the number of vertices and edges, every edge (directed or undirected), and every vertex. We use the CSR format because it stores only the existing edges, so its space complexity is very low. More importantly, our selected programming model constantly reads and writes the neighbor vertices of a source vertex, and the CSR format performs that lookup with very low time complexity.
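As a minimal sketch (the array names and the example edges are our own, not taken from any particular framework), a small directed graph in CSR form and the neighbor traversal it enables might look like this in C:

    #include <stdio.h>

    /* Example graph with 4 vertices and 5 directed edges:
     * 0->1, 0->2, 1->2, 2->3, 3->0 */
    #define NUM_VERTICES 4
    #define NUM_EDGES    5

    /* row_offsets[v] .. row_offsets[v + 1] delimits the neighbors of vertex v */
    static const int row_offsets[NUM_VERTICES + 1] = {0, 2, 3, 4, 5};
    /* column_indices holds the destination vertex of every edge */
    static const int column_indices[NUM_EDGES] = {1, 2, 2, 3, 0};

    int main(void) {
        for (int v = 0; v < NUM_VERTICES; v++) {
            printf("neighbors of %d:", v);
            for (int e = row_offsets[v]; e < row_offsets[v + 1]; e++) {
                printf(" %d", column_indices[e]);
            }
            printf("\n");
        }
        return 0;
    }

The neighbors of a vertex v occupy the contiguous range column_indices[row_offsets[v]] .. column_indices[row_offsets[v + 1] - 1], which is what makes neighbor reads and writes cheap.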

GAS Programming Model

The GAS (Gather-Apply-Scatter) programming model provides a high-level abstraction for various graph processing algorithms and is widely adopted in graph processing frameworks. As shown in the figure, “G” represents the gather phase, in which a source vertex collects data from its neighbors; the sum is simply an arithmetic accumulation used for the later update. “A” represents the apply phase, an update to the source vertex that varies depending on what we actually want from its neighbors. “S” represents the scatter phase, in which we filter out vertices that will not be needed in future iterations.
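A minimal sketch of one GAS iteration over a graph stored in CSR form is shown below; the 0.5f accumulation and the activity test are placeholders, not the operators of any specific application in our suite:

    /* One GAS iteration over a graph in CSR form (illustrative only). */
    void gas_iteration(int num_vertices,
                       const int *row_offsets,
                       const int *column_indices,
                       float *value, int *active) {
        for (int v = 0; v < num_vertices; v++) {
            /* Gather: accumulate data from the neighbors of v */
            float sum = 0.0f;
            for (int e = row_offsets[v]; e < row_offsets[v + 1]; e++) {
                sum += value[column_indices[e]];
            }
            /* Apply: update the source vertex with the gathered result */
            float new_value = 0.5f * sum;
            /* Scatter: keep the vertex active only if its value changed */
            active[v] = (new_value != value[v]);
            value[v] = new_value;
        }
    }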

Power Law

The figure below shows the natural graphs that we selected from SNAP together with their specific properties, alpha and sigma. We use a Python package called powerlaw to calculate the alpha value of each graph. A large alpha implies that the graph has a lower density (ratio of edges to vertices) and that the vast majority of vertices have low degree (number of in- and out-edges); in this case we will have a smaller bottleneck in parallel computing. We consider a graph to have a large alpha if alpha is larger than 3. Since the power-law fit essentially relies on modeling, sigma defines how reliable the alpha value is; we consider alpha to be reliable if sigma is around 1. Finally, directed graphs have an in-degree and an out-degree instead of a single degree, so we also calculate alpha separately for the in-degree and out-degree distributions. As a result, for each scale we have one graph with a large alpha and one with a small alpha for comparison purposes.
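For reference, the power-law model that the package fits assumes that the probability of observing a vertex of degree k decays as

    p(k) \propto k^{-\alpha}

so a larger alpha means that high-degree vertices become rare more quickly; this is the standard definition of the model rather than something taken from our measurements.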

Instruction Mix

Instruction mix, as its name suggests, refers to the blend of instruction types in a program. We break the program down into computation, memory, and branch instructions and calculate the percentage of each type over the whole run. Ideally, our results show a high percentage of computation instructions and a low percentage of branch instructions. A high percentage of computation instructions means the application can easily be accelerated by the FPGA’s fine-grained parallelism. A low percentage of branch instructions results in highly predictable control flow, which means the control can easily be transformed into fixed data paths in hardware, eliminating overhead and bloat.
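As a purely hypothetical illustration, a run of 1,000,000 dynamic instructions consisting of 600,000 computation, 300,000 memory, and 100,000 branch instructions would have an instruction mix of 60% computation, 30% memory, and 10% branch; such a low branch share is the kind of profile that maps well onto fixed hardware data paths.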

Memory Footprint

Memory footprint is a measurement of the total number of unique bytes a benchmark addresses. To make the measurement directly comparable, we run the different applications on the same input. With the same input size, we can count how many arrays or matrices an application uses over the whole run and how large they are. For example, if a program uses three 500 x 500 matrices of 4-byte integers throughout its execution, the total memory footprint is 3 x 500 x 500 x 4 = 3,000,000 bytes. This measurement can simply be calculated by hand once the input size is fixed.
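A minimal sketch of that calculation in C (the matrix count, dimensions, and element type are simply the ones from the example above):

    #include <stdio.h>

    int main(void) {
        /* Three 500 x 500 matrices of 4-byte integers, as in the example above */
        size_t footprint = 3u * 500u * 500u * sizeof(int);
        printf("memory footprint: %zu bytes\n", footprint);  /* 3000000 */
        return 0;
    }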

Xilinx Vivado HLS Result

Given a specific input, Vivado HLS co-simulation measures the total execution time of our application; dividing that time by the clock period gives the number of clock cycles executed. Besides execution time, Vivado HLS also generates Verilog code from our C code, which is why the flow is called high-level synthesis (HLS). By synthesizing that Verilog code, Vivado reports how many FPGA resources, such as Flip-Flops, Look-Up Tables, and DSPs, the program uses. All of these data can be obtained with Vivado HLS and Vivado itself.
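For example, with an assumed 100 MHz clock (10 ns period), a measured co-simulation time of 2 ms would correspond to 2,000,000 ns / 10 ns = 200,000 clock cycles; these numbers are purely illustrative and not taken from our experiments.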

OpenMP

OpenMP is an API that supports multi-platform shared-memory multiprocessing in C, and we use it to accelerate our HLS kernels on the CPU platform. We also use it to check whether our HLS kernels produce the same results as the conventional graph applications.
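A minimal sketch of how a vertex loop over the CSR arrays could be parallelized with OpenMP on the CPU (the loop body is a placeholder, not one of our actual kernels):

    #include <omp.h>

    /* Sum the values of every vertex's neighbors in parallel on the CPU.
     * The arrays follow the CSR layout introduced earlier. */
    void neighbor_sums(int num_vertices,
                       const int *row_offsets,
                       const int *column_indices,
                       const float *value, float *result) {
        #pragma omp parallel for
        for (int v = 0; v < num_vertices; v++) {
            float sum = 0.0f;
            for (int e = row_offsets[v]; e < row_offsets[v + 1]; e++) {
                sum += value[column_indices[e]];
            }
            result[v] = sum;   /* each thread writes a distinct vertex, so no race */
        }
    }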

Vivado HLS Pragmas

a. pragma HLS unroll

There are three situations when unrolling for loops in Vivado HLS, illustrated in the figure below. On the left is the rolled loop, in which all iterations of the for loop run sequentially. In the middle is the partially unrolled loop, in which we specify an unroll factor (here 2): Vivado HLS creates two copies of the loop-body hardware so that two iterations run concurrently. On the right is the fully unrolled loop, in which Vivado HLS makes all iterations run concurrently, so the computation needs only one pass; the number of hardware copies equals the number of iterations.
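A minimal sketch of the partially unrolled case (the function, array names, and trip count are illustrative):

    #define N 16

    /* Partially unrolled loop: two copies of the loop body are generated,
     * so iterations i and i + 1 execute concurrently. */
    void scale(const int in[N], int out[N]) {
        for (int i = 0; i < N; i++) {
            #pragma HLS unroll factor=2
            out[i] = in[i] * 3;
        }
    }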

b. pragma HLS pipeline

If we describe the operation within an iteration as read, compute, and write, pipelining a for loop allows the second iteration’s read operation and the first iteration’s compute operation to run concurrently. In the ideal situation, the second iteration’s read operation can enter the pipeline right after the first iteration’s read operation.
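A minimal sketch of a pipelined loop whose body matches the read, compute, and write description above (names are illustrative; II=1 asks for a new iteration to start every clock cycle):

    #define N 16

    /* Pipelined loop: while iteration i is in its compute/write stages,
     * iteration i + 1 already starts its read stage. */
    void add_vectors(const int a[N], const int b[N], int sum[N]) {
        for (int i = 0; i < N; i++) {
            #pragma HLS pipeline II=1
            sum[i] = a[i] + b[i];   /* read a[i], b[i]; compute; write sum[i] */
        }
    }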

c. Pipeline Rewind

As shown in the figure below, Vivado HLS offers a rewind option for the pipeline pragma. This option enables the iterations of the outer loop in a nested loop to overlap.
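A minimal sketch of the rewind option in a perfect loop nest (names are illustrative; whether rewind takes effect depends on the loop structure Vivado HLS accepts):

    #define ROWS 8
    #define COLS 16

    /* rewind keeps the pipelined inner loop filled between consecutive
     * iterations of the outer loop instead of flushing it each time. */
    void scale_rows(const int in[ROWS][COLS], int out[ROWS][COLS]) {
        for (int r = 0; r < ROWS; r++) {
            for (int c = 0; c < COLS; c++) {
                #pragma HLS pipeline II=1 rewind
                out[r][c] = in[r][c] * 2;
            }
        }
    }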