The parallel method of accelerating computations involves processing multiple computations at once. Typically, this method is the slowest of the three acceleration methods.
As each set of four samples comes in, we simply add them in the adder tree depicted above. Assuming we implement ripple-carry adders (RCAs) and count a k-bit RCA as k gate delays, the minimum clock period is (8-bit RCA) + (9-bit RCA) = 17 gate delays, the same delay as a single 17-bit RCA. For the squared samples, this is (16-bit RCA) + (17-bit RCA) = 33 gate delays. This is likely sufficient: a typical gate delay is less than 200 ps, so the minimum clock period needed to process all the samples is at most 33 × 200 ps = 6.6 ns (less than 8 ns).
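The arithmetic above can be reproduced with a short sketch. It assumes the same simplified model as the text (a k-bit RCA costs k gate delays, and a gate delay is at most 200 ps); the function name `parallel_period_ps` and the constant names are ours, chosen only for illustration.

```python
GATE_DELAY_PS = 200    # upper bound on one gate delay, as quoted in the text
BUDGET_PS = 8000       # the 8 ns figure the text compares against


def parallel_period_ps(input_bits, gate_delay_ps=GATE_DELAY_PS):
    """Critical path of the unpipelined four-input adder tree, in picoseconds.

    Simplified model: a k-bit RCA costs k gate delays. The first level adds
    input_bits-wide values; the second level adds the (input_bits + 1)-wide
    partial sums, and the two delays sit back to back on the critical path.
    """
    level1_delays = input_bits
    level2_delays = input_bits + 1
    return (level1_delays + level2_delays) * gate_delay_ps


samples_ps = parallel_period_ps(8)    # 8-bit samples
squared_ps = parallel_period_ps(16)   # 16-bit squared samples

print(samples_ps / 1000, "ns for samples")          # 3.4 ns
print(squared_ps / 1000, "ns for squared samples")  # 6.6 ns
print("fits in budget:", squared_ps < BUDGET_PS)    # True
```

The squared-sample path is the one that sets the clock, since both sums have to complete within the same period.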
As each set of four samples comes in, we add them in pairs and store the pair sums in temporary registers (as shown). This allows us to speed up our clock so that it is limited only by the slowest stage of our pipeline. In this case, the minimum clock period (assuming we implement RCAs) is that of a 9-bit RCA = 9 gate delays. For the squared samples, it is that of a 17-bit RCA = 17 gate delays. By the same reasoning as for the parallelized schema, we expect a minimum clock period of 17 × 200 ps = 3.4 ns to process all the samples. This approach, however, requires more power and area because we have to add the temporary registers. If the parallelized schema is fast enough, it is better to use it instead, because we want to optimize power and area rather than clock speed.
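To make the role of the temporary registers concrete, here is a minimal behavioural sketch of the two-stage pipeline. It is a software model for illustration, not the hardware description: the function names are ours, and the timing function assumes the same simplified model as above (a k-bit RCA costs k gate delays of at most 200 ps each).

```python
GATE_DELAY_PS = 200  # upper bound on one gate delay, as quoted in the text


def pipelined_period_ps(input_bits, gate_delay_ps=GATE_DELAY_PS):
    """Clock period set by the slowest stage, the (input_bits + 1)-bit RCA."""
    return (input_bits + 1) * gate_delay_ps


def pipelined_adder_tree(sample_sets):
    """Cycle-by-cycle model of the two-stage pipelined adder tree.

    Stage 1 adds the samples in pairs and latches the two pair sums.
    Stage 2 adds the latched pair sums from the previous cycle.
    """
    reg = None      # temporary registers between the two stages
    outputs = []
    # One extra all-zero cycle flushes the last set through stage 2.
    for a, b, c, d in list(sample_sets) + [(0, 0, 0, 0)]:
        if reg is not None:
            outputs.append(reg[0] + reg[1])   # stage 2: uses last cycle's values
        reg = (a + b, c + d)                  # stage 1: latch new pair sums
    return outputs


print(pipelined_period_ps(8), pipelined_period_ps(16))             # 1800 3400
print(pipelined_adder_tree([(1, 2, 3, 4), (255, 255, 255, 255)]))  # [10, 1020]
```

Each result appears one cycle after its inputs arrive, but a new set of four samples can still be accepted every cycle, which is why the clock is limited by a single stage rather than by the whole tree.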
This schema can be parameterized. Our approach is the same as in the parallelized-pipelined schema, but in addition to pipelining our adder tree, we also pipeline our RCAs. Depending on how much faster we want our clock to go, we can place a pipeline register after every n bits of the carry chain (for some 1 ≤ n ≤ k, where k is the number of bits in the adder), which gives a minimum clock period of n gate delays = 200n ps. This holds for both the samples and the squared samples.
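The trade-off controlled by n can be tabulated with a small sketch. The clock period follows the text directly (n gate delays at 200 ps each); the segment count, ceil(k / n), is our own assumption about how a k-bit RCA splits when a register is placed after every n bits, and the function name `pipelined_rca` is hypothetical.

```python
import math

GATE_DELAY_PS = 200  # upper bound on one gate delay, as quoted in the text


def pipelined_rca(k, n, gate_delay_ps=GATE_DELAY_PS):
    """Return (clock period in ps, number of pipeline segments) for a k-bit RCA.

    period:   n gate delays, as stated in the text.
    segments: ceil(k / n), assuming one register boundary after every n bits
              (an illustrative assumption, not taken from the report).
    """
    if not 1 <= n <= k:
        raise ValueError("n must satisfy 1 <= n <= k")
    period_ps = n * gate_delay_ps
    segments = math.ceil(k / n)
    return period_ps, segments


# 17-bit RCA used for the squared samples, at a few values of n.
for n in (1, 4, 9, 17):
    period_ps, segments = pipelined_rca(17, n)
    print(f"n={n:2d}: period={period_ps} ps, segments={segments}")
```

With n = k the adder is not pipelined internally and the period matches the parallelized-pipelined case; smaller n shortens the period at the cost of more register boundaries.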
This approach requires the most power and area of the three because we have to add temporary registers not just in the adder tree, but also inside the RCAs. If either of the other two schemas is fast enough, we should use it instead because, as mentioned before, we want to optimize power and area rather than clock speed. Now that we have explained our computation schemas, we need to verify that they work as intended. To do so, we wrote verification code in MATLAB.