This section presents the results of our quantitative evaluations and the observations we can draw from them.

Instruction Mix

Due to NetworkX's extremely slow processing speed, we were not able to obtain the Python data for betweenness centrality using the Brandes algorithm on the medium and large graphs, so we cannot report instruction mix data for those cases. Based on the results in the figure above, we can clearly see that each algorithm exhibits different execution behavior, which fulfills our diversity objective.

Since graph algorithms are input-dependent, we first check the consistency of our results across all 6 graphs. For the WCC kernel, there are two implementations, SV and LP. SV's percentage of branch instructions varies by 3.71% across the 6 graphs, its percentage of memory instructions has a relatively large variance of 5.86%, and its percentage of computation instructions has an even larger variance of 9.57%. LP's percentage of branch instructions varies by 6.88%, its memory instructions are very consistent with only 0.7% variance, and its computation instructions vary by 6.18%. For all 6 graphs, the LP algorithm requires more branch and memory instructions while the SV algorithm requires more computation instructions. The reason the LP algorithm has more memory traffic is that it maintains two component arrays while the SV algorithm requires only one; in addition, the LP algorithm overwrites the previous array with the new one in every iteration. The difference in control and computation instructions may be caused by the inherent properties of each algorithm and cannot be answered by simply looking at the instruction mix graph.
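To make this memory-traffic argument concrete, the following is a minimal C++ sketch of one LP iteration over a CSR graph. The array names (row_ptr, col_idx, labels_prev, labels_next) are ours for illustration and are not taken from our kernel code.

```cpp
#include <vector>
#include <algorithm>

// Minimal sketch of one LP iteration over a CSR graph. LP keeps two
// component arrays and copies the new labels over the old ones every
// iteration; SV keeps a single array and updates it in place, which is
// why LP generates more memory traffic.
void lp_iteration(const std::vector<int>& row_ptr,
                  const std::vector<int>& col_idx,
                  std::vector<int>& labels_prev,
                  std::vector<int>& labels_next) {
    for (size_t v = 0; v + 1 < row_ptr.size(); ++v) {
        int best = labels_prev[v];
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e)
            best = std::min(best, labels_prev[col_idx[e]]);  // read old labels
        labels_next[v] = best;                               // write new labels
    }
    labels_prev = labels_next;  // the second array plus this full copy are
                                // the extra traffic SV avoids
}
```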

The remaining three kernels, ArticleRank (AR), betweenness centrality (BC), and graph coloring (GC), solve other problems and cannot be directly compared to the WCC kernel. But our primary objective for the instruction mix is to find out whether our HLS kernels have different execution behaviors, and figure 15 clearly shows that they do. The AR algorithm has a comparatively small variance of 2.33% on branch instructions, a small variance of 3.59% on memory instructions, and a remarkably consistent computation instruction percentage with only 1.26% variance across the 6 graphs. Within its only two graph results, the Brandes algorithm shows a 0.8% variance on branch instructions, 1% on memory instructions, and 1.7% on computation instructions. The greedy algorithm shows a large variance of 10.79% on branch instructions, a small variance of 3.93% on memory instructions, and a large variance of 6.86% on computation instructions. Since both of these results are based only on the small graphs, they should be treated with caution.

Memory Footprint

In order to understand the figure above, we first need to separate our 6 graphs into two kinds, directed graphs and undirected graphs, as shown in table 2. Since WCC only works on undirected graphs, it must convert a directed graph to an undirected one by doubling the number of edges: for every directed edge, it adds the reverse edge between the same pair of nodes. When the graph is small, this effect is comparatively small; when the graph becomes larger and has more edges, it dramatically increases the memory footprint of the SV and LP algorithms.

In figure 16, p2p-Gnutella09, web-Stanford, and amazon0302 are directed graphs, and p2p-Gnutella09 is a small graph. On p2p-Gnutella09, the Brandes algorithm consumes the most memory, the LP algorithm is second, the AR algorithm is third, the SV algorithm is fourth, and the greedy algorithm consumes the least. On the medium-size directed graphs, the LP algorithm consumes the most memory, the SV algorithm is second, the Brandes algorithm is third, the AR algorithm is fourth, and the greedy algorithm still consumes the least. Even though both of our large graphs are undirected, we expect this medium-size ranking to hold for large directed graphs as well. For an undirected graph, the edge-doubling issue disappears: the Brandes algorithm consumes the most memory as expected, the AR algorithm is second, the LP and greedy algorithms share third place, and the SV algorithm consumes the least, similar to the ranking on the small graphs. The ranking is the same for all three undirected graphs. For the WCC kernel, the LP algorithm always consumes more memory than the SV algorithm, again because the LP algorithm requires two component arrays while the SV algorithm requires only one.
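As a sketch of the edge-doubling conversion described above (illustrative names, not our actual preprocessing code), symmetrizing a directed edge list doubles the stored edges:

```cpp
#include <vector>
#include <utility>

// For every directed edge (u, v), WCC also needs the reverse edge (v, u),
// so the stored edge count -- and the edge arrays' memory footprint --
// doubles for directed inputs.
std::vector<std::pair<int, int>>
to_undirected(const std::vector<std::pair<int, int>>& directed_edges) {
    std::vector<std::pair<int, int>> edges;
    edges.reserve(2 * directed_edges.size());
    for (const auto& [u, v] : directed_edges) {
        edges.emplace_back(u, v);  // original direction
        edges.emplace_back(v, u);  // added reverse edge
    }
    return edges;
}
```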

OpenMP Speedup Result

In order to see how well each algorithm parallelizes, we calculate the speedup ratio by dividing the OpenMP 1-core execution time by the OpenMP 4-core execution time. During testing, we found that the Brandes algorithm and the greedy algorithm do not benefit from parallelism at all: their ratio is always close to 1. For conciseness, we omit them from the figure above, which clearly shows that the remaining algorithms have different speedup behavior on the CPU platform.
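The sketch below shows how such a ratio can be measured; run_kernel is a placeholder for any of our parallelized kernels, not a function from our code.

```cpp
#include <cstdio>
#include <omp.h>

// Time one run of a kernel with a fixed OpenMP thread count.
double time_with_threads(int threads, void (*run_kernel)()) {
    omp_set_num_threads(threads);
    double t0 = omp_get_wtime();
    run_kernel();
    return omp_get_wtime() - t0;
}

// Speedup = 1-core time / 4-core time; a value close to 1 means the
// kernel gains nothing from parallelism (as with Brandes and greedy).
void report_speedup(void (*run_kernel)()) {
    double t1 = time_with_threads(1, run_kernel);
    double t4 = time_with_threads(4, run_kernel);
    std::printf("speedup = %.2f\n", t1 / t4);
}
```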

For the WCC kernel, the scale of the graph does not seem to affect the SV algorithm's speedup, except possibly at large scale, where it may even have a negative impact that we cannot explain. The power-law property of the graph affects the SV algorithm's speedup regardless of scale. For the LP algorithm, scale starts to improve the speedup once the graph reaches large scale, and at that point we also start to see the impact of the power-law property. Despite running faster than the LP algorithm on all 6 graphs, the SV algorithm always achieves a lower speedup than LP. The reason is that, even though the SV algorithm has better theoretical performance in terms of time complexity and parallelism, it performs many indirect addressing operations that cause memory overhead during parallel execution. For the AR algorithm, scale improves the speedup starting from the medium-size graphs, and the power-law property of the graph affects its speedup as well.
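The sketch below illustrates the indirect addressing in SV's hooking step; it is a standard formulation of Shiloach-Vishkin hooking with our own array names, not our kernel code.

```cpp
#include <vector>

// SV hooking over an edge list: comp[comp[v]] is an address that is only
// known after reading comp[v], so accesses are data-dependent and scattered.
// Under OpenMP this pattern is memory-bound and caps the achievable speedup.
void sv_hook(const std::vector<int>& src, const std::vector<int>& dst,
             std::vector<int>& comp) {
    #pragma omp parallel for
    for (long e = 0; e < static_cast<long>(src.size()); ++e) {
        int u = src[e], v = dst[e];
        if (comp[u] < comp[v] && comp[v] == comp[comp[v]])
            comp[comp[v]] = comp[u];  // indirect, data-dependent write
    }
}
```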

HLS Execution Time Result with Different HLS Pragmas

In this part of the results, we synthesize our code to RTL through HLS and obtain execution results from the Vivado XSIM simulator. Due to DSP warning messages, the simulation of the AR and Brandes algorithms is extremely slow, so we can only show results for the other three algorithms: LP, SV, and greedy. Moreover, because of the slow simulation speed, our results are based only on the small graphs, p2p-Gnutella09 and ca-GrQc. Since both graphs show the same behavior, we present only the p2p-Gnutella09 result in the main body; the ca-GrQc result is in the Appendix. Unlike our other results, all three algorithms behave similarly under the same pragma. This is because we use the GAS programming model for all of our algorithms, which suggests that this behavior is tied to the programming style rather than to the individual algorithms.
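The loop structure shared by our kernels looks roughly like the sketch below, with the two loop labels used in the following discussion; the CSR array names and the floating-point accumulation are illustrative placeholders, not our actual kernel code.

```cpp
// Gather-apply loop nest common to our GAS-style kernels. "loop gather"
// iterates over source nodes; "loop apply" iterates over the neighbors
// of one source node through the CSR arrays.
void gas_kernel(const int* row_ptr, const int* col_idx,
                const float* prop, float* result, int num_nodes) {
loop_gather:
    for (int v = 0; v < num_nodes; ++v) {
        float acc = 0.0f;
loop_apply:
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e)
            acc += prop[col_idx[e]];  // read each neighbor's property
        result[v] = acc;              // write the new value for node v
    }
}
```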

“Nothing” in the figure above means that no pragmas are added and the program runs sequentially. “Gather unroll 256” means that we partially unroll loop gather with a factor of 256; in other words, we allow 256 source nodes to read and compute information from their neighbors at the same time. For the LP and SV algorithms, partially unrolling loop gather actually makes the result worse. The reason is a memory read/write bottleneck: with the CSR data structure, a traversal of the graph constantly reads two arrays and constantly writes results to other arrays, and the memory that stores these arrays has only two ports, so each clock cycle can service only two reads or one write. Because each traversal step needs to read both CSR arrays, even with 256 source nodes operating at the same time, effectively only one source node can read or write the memory per cycle. Combined with the extra synchronization time, partially unrolling loop gather therefore makes the execution time worse. The greedy algorithm cannot partially unroll loop gather at all because of the inherent property of the algorithm.
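In the skeleton above, this configuration corresponds to placing a partial-unroll directive on loop gather; UNROLL with a factor is a standard Vivado HLS pragma, and the body remains the illustrative sketch rather than our kernel code.

```cpp
void gas_gather_unroll(const int* row_ptr, const int* col_idx,
                       const float* prop, float* result, int num_nodes) {
loop_gather:
    for (int v = 0; v < num_nodes; ++v) {
        #pragma HLS unroll factor=256   // 256 copies of the body in hardware
        float acc = 0.0f;
loop_apply:
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e)
            acc += prop[col_idx[e]];    // all copies contend for the two
                                        // memory ports, so reads serialize
        result[v] = acc;
    }
}
```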

Partially unrolling loop apply means that we allow a single source node to find and compute information from up to 256 of its neighbors at the same time. With this approach we actually see an improvement, because we no longer need to read one of the graph arrays, so the memory bottleneck has less impact on performance. The best result comes from partially unrolling loop gather and pipelining loop apply at the same time. Pipelining loop apply means that we do not fetch 256 neighbors at once; instead, we gather the neighbors one by one, but still save time by starting some of the operations for later neighbors early. As a result, the memory access bottleneck has the least impact on this approach.
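A sketch of the winning combination follows; PIPELINE is likewise a standard Vivado HLS pragma, and as before the body is illustrative.

```cpp
void gas_gather_unroll_apply_pipeline(const int* row_ptr, const int* col_idx,
                                      const float* prop, float* result,
                                      int num_nodes) {
loop_gather:
    for (int v = 0; v < num_nodes; ++v) {
        #pragma HLS unroll factor=256
        float acc = 0.0f;
loop_apply:
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
            #pragma HLS pipeline        // fetch one neighbor per iteration,
                                        // overlapping later neighbors' work
            acc += prop[col_idx[e]];
        }
        result[v] = acc;
    }
}
```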

Synthesis Result

All the results shown in the figure above use the best optimization strategy we have found. For the same reason as in the previous section, we can only show synthesis results for the SV, LP, and greedy algorithms. Since both small graphs show similar results, we show the p2p-Gnutella09 result above. Based on it, we can clearly see that the three algorithms differ in both execution time and resource usage.

For the WCC kernel, LP and SV have similar resource usage, with SV consuming slightly more. For execution time, SV is faster than LP, which matches our CPU test results; we believe this is caused by the inherent properties of each algorithm. Comparing the WCC and GC kernels, we can clearly see that the greedy algorithm consumes far fewer resources, because loop gather cannot be partially unrolled in the greedy algorithm. The optimization strategy we arrived at for the greedy algorithm only partially unrolls loop apply with a factor of 2, while both LP and SV partially unroll loop gather with a factor of 256. This difference in optimization exaggerates the gap in resource usage between the WCC and GC kernels.
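For reference, the sketch below shows the greedy kernel's loop nest with the only unroll it admits; this is illustrative code with our own names and color bound, not our kernel code.

```cpp
#define MAX_COLORS 64  // illustrative bound on the number of colors

// Greedy coloring: loop gather cannot be unrolled because node v's color
// depends on the colors already assigned to earlier nodes; only the
// neighbor scan (loop apply) is unrolled, and only by a factor of 2,
// which is why the GC kernel instantiates far less hardware than the
// factor-256 WCC configurations.
void greedy_color(const int* row_ptr, const int* col_idx,
                  int* color, int num_nodes) {  // color[] initialized to -1
loop_gather:
    for (int v = 0; v < num_nodes; ++v) {       // inherently sequential
        bool used[MAX_COLORS] = {false};
loop_apply:
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
            #pragma HLS unroll factor=2
            int c = color[col_idx[e]];
            if (c >= 0) used[c] = true;          // mark neighbors' colors taken
        }
        int c = 0;
        while (used[c]) ++c;                     // smallest color not yet taken
        color[v] = c;
    }
}
```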