Our current goal is to reduce latency while minimizing area usage, which is challenging because the two trade off against each other. Our task is to find the balance that best meets the latency and area targets on the current FPGA (Xilinx Kintex-7 XC7K325T-2FBG676I). According to the datasheet [9], this FPGA has 407,600 FFs, 203,800 LUTs, and 445 BRAMs. To evaluate area, we calculate the percentage utilization of each resource and use the highest one to estimate the number of ASICs that can be supported on one FPGA.
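The area estimate described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the resource counts passed in at the bottom are the ones reported in Table 4, and the rule "ASICs = floor(100 / highest utilization)" is our reading of the text rather than a formula it states explicitly.

```python
import math

# Device totals for the XC7K325T, as quoted from the datasheet [9] in the text.
DEVICE_RESOURCES = {"FF": 407_600, "LUT": 203_800, "BRAM": 445}


def utilization_percent(used):
    """Return the percentage utilization of each resource type."""
    return {
        name: 100.0 * used[name] / total
        for name, total in DEVICE_RESOURCES.items()
    }


def estimate_asics_per_fpga(used):
    """Estimate how many copies of the design fit on one FPGA.

    Assumption: the most heavily used resource is the bottleneck, so the
    count is the number of whole copies that fit within 100% of it.
    """
    worst = max(utilization_percent(used).values())
    if worst <= 0:
        raise ValueError("design uses no resources; estimate is undefined")
    return math.floor(100.0 / worst)


if __name__ == "__main__":
    # Resource counts taken from Table 4.
    used = {"FF": 5201, "LUT": 8907, "BRAM": 18}
    for name, pct in utilization_percent(used).items():
        print(f"{name}: {pct:.2f}%")
    print("ASICs per FPGA:", estimate_asics_per_fpga(used))
```

Taking the maximum rather than the average reflects that the design stops fitting as soon as any single resource type runs out.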
We want to obtain the performance results and area statistics for each design. To begin, we run C simulation to check for syntax errors, mainly to confirm that variable declarations and function calls are consistent throughout the code. We can then inspect the output in solution1/csim/build/output.txt, which gives us a rough estimate of the output.
After C simulation, we run C synthesis targeting the FPGA. If there are no errors, we get a Synthesis Summary Report for the design. This gives a rough estimate of the design's performance, so we can quickly tell whether our changes have made it better or worse.
To get a more accurate performance result, we can proceed to C/RTL cosimulation. This generally takes longer to run, so we only run it when we think the design is better than before. The performance report from C/RTL cosimulation tends to show a longer latency than the synthesis report, but it is more accurate. After that, we check the output in solution1/sim/wrapc/output.txt and make sure it matches the output produced earlier in solution1/csim/build/output.txt.
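The output check can be automated with a small script. The sketch below assumes both output.txt files are plain text that can be compared line by line, ignoring leading and trailing whitespace. The function name `outputs_match` is hypothetical and is not part of the HLS tool.

```python
from pathlib import Path


def outputs_match(csim_path, cosim_path):
    """Compare two output files line by line.

    Returns (True, None) if they agree, otherwise (False, n) where n is the
    1-based number of the first line that differs or is missing.
    """
    csim_lines = Path(csim_path).read_text().splitlines()
    cosim_lines = Path(cosim_path).read_text().splitlines()

    for number, (expected, actual) in enumerate(zip(csim_lines, cosim_lines), start=1):
        if expected.strip() != actual.strip():
            return False, number

    # One file is a strict prefix of the other: report the first missing line.
    if len(csim_lines) != len(cosim_lines):
        return False, min(len(csim_lines), len(cosim_lines)) + 1

    return True, None


if __name__ == "__main__":
    # Paths as given in the text, relative to the project directory.
    csim = Path("solution1/csim/build/output.txt")
    cosim = Path("solution1/sim/wrapc/output.txt")
    if csim.exists() and cosim.exists():
        same, line = outputs_match(csim, cosim)
        print("outputs match" if same else f"first mismatch at line {line}")
    else:
        print("output files not found; run C simulation and cosimulation first")
```

Reporting the first mismatching line makes it easier to trace an incorrect cosimulation result back to the part of the algorithm that produced it.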
Below is a complete run of a design in which we did not vectorize the data. The synthesis report in Figure 5 shows a latency of 16417 cycles, and the output.txt file in Figure 6 is correct. When we proceed to C/RTL cosimulation, we get a smaller latency of 9167 cycles, as shown in Figure 7. However, the resulting output.txt is incorrect, as shown in Figure 8. This indicates a flaw in our algorithm logic: we did not initialize the integral to zeros, which resulted in duplicate reads. The performance at this stage is shown in Table 3 and Table 4.
Table 3: Performance after C/RTL cosimulation
Cycles | Time (µs) | Time slower than goal |
9167 | 30.5 | 11.9 |
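The relationship between the cycle count and the time in this row can be checked with a short calculation. The clock period and the latency goal are not stated in this section, so the sketch below back-computes them from the table row under two assumptions: time equals cycles multiplied by the clock period, and the "Time slower than goal" value is a ratio of achieved time to goal time. All function names are hypothetical.

```python
def implied_clock_period_ns(cycles, time_us):
    """Clock period implied by a (cycles, time) pair, in nanoseconds."""
    return time_us * 1000.0 / cycles


def latency_us(cycles, clock_period_ns):
    """Latency in microseconds for a cycle count at a given clock period."""
    return cycles * clock_period_ns / 1000.0


def implied_goal_us(time_us, slowdown):
    """Latency goal implied by a time and a slowdown factor (assumed to be a ratio)."""
    return time_us / slowdown


if __name__ == "__main__":
    # Values from the table row: 9167 cycles, 30.5 us, 11.9 slower than goal.
    period = implied_clock_period_ns(9167, 30.5)
    print(f"implied clock period: {period:.2f} ns")
    print(f"implied latency goal: {implied_goal_us(30.5, 11.9):.2f} us")
    # The same period applied to the synthesis estimate of 16417 cycles.
    print(f"synthesis estimate:   {latency_us(16417, period):.1f} us")
```

If the actual clock constraint of the project is known, it should replace the back-computed period here.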
Table 4: Area utilization
FFs | FFs % Util | LUTs | LUTs % Util | BRAMs | BRAMs % Util | ASICs |
5201 | 1.28 | 8907 | 4.37 | 18 | 4.04 | 22 |