The datapath for our design consists of several pipelined stages, meaning each stage of the design has registers to allow for new inputs to the system before previous inputs have been completely processed. This optimization allows for a faster clock frequency and throughput at the expense of latency. An overview schematic for our design is presented below followed by sections describing each segment of the design.

#### Line Buffers:

*A simplified line buffer to demonstrate structure*

The first stage of our design is a structure called a line buffer. In a line buffer, sequential pixel data is fed in and is temporarily stored to allow for parallel output. To create such a structure, we generated 13 FIFOs with a depth of 256 (the length of one line). Each FIFO has its input tied to the output of the previous fifo to avoid needing to input the same pixel data multiple times. This structure allows us to access an entire column of a window (see algorithm page) at a time. The line buffer provides access to 1 pixel on either side of the pixel of interest for calculating derivatives in the next module. The line buffer structures we generated are large enough to hold 1 window for a current frame as well as the next frame since the next module requires access to two frames at once.

#### Derivative Module:

The derivative module takes the right, middle, left, and next pixels for an entire column of our 11×11 window in parallel. From these inputs it generates the partial derivatives with respect to time, x, and y in a single clock cycle using 3 adders for each pixel of the column.

#### Multiply Module:

The multiply module receives partial derivatives as input and calculates the following products in parallel as required by our algorithm to compute the A transpose times A matrix and A transpose times B matrix. (see algorithm page).

A11 = Ix * Ix

A12 = Ix * Iy

A22 = Iy * Iy

B1 = Ix * It

B2 = Iy * It

As before, this calculation occurs in a single clock cycle.

#### Sum Modules:

After the multiply module each of the values calculated above is summed over an entire window. To maximize data reuse and optimize for efficiency, we make use of a sliding window optimization. The first sum module computes the sum of an entire column of the window in a single clock cycle. This result is simultaneously passed to the next sum module and to a FIFO. The second sum module contains an internal register which accumulates the sum over a window. When a new column of data comes in it is added to the accumulated value in the register. The FIFO mentioned before outputs to the second sum module the column from 11 cycles ago. In this way the newest column is added to the accumulated value while the oldest column is subtracted away. At the end of each line, this entire structure needs to be reset. The diagram below visualizes this process. As you can see, the entire shaded region is reused.

*Sliding window visualization*

#### Matrix Inversion:

The next modules calculate the inverse of the A transpose times A matrix. The values from the sum modules are wired to both the Calculate Denominator module and the Shift Invert module. The calculate denominator module calculates the denominator of the determinant of the A matrix and shifts it left by 10 in a single clock cycle. Shifting left by 10 divides the value by 1024 and helps to avoid overflow. At the same time, the Shift Invert module negates each term of the matrix and shifts the results to the right by 10 to cancel the effect of the division in the denominator module. B matrix values are unused until the result module and are fed into a delay module which delays them by the 19 clock cycles it takes for the A values to make it through the denominator, shift invert, and divide modules.

#### Division:

Next, each inverted value from the previous modules needs to be divided by the denominator we calculated. To perform this division, we instantiated 3 Xilinx IP pipelined dividers, one for each distinct value of the A transposed times A matrix (A12 and A21 are identical). These dividers take 18 clock cycles and are pipelined to allow multiple divisions to occur at once. The result of the division is fed into the result module along with the delayed B matrix values.

#### Result:

The result module performs the final calculation on our two matrices to finish solving the following equation from the algorithm page.

It does this in two clock cycles in order to keep the time each pipeline stage takes balanced. In the first clock cycle, it multiplies A matrix values with B matrix values. In the second, it computes the required sums. At the end of the second cycle, our final u and v values are output from the result module and stored in small output FIFOs which act as a temporary buffer between the result module and the attached processing system.