This thesis investigated a hardware interpreter for sparse matrix LU factorization. LU factorization is one of the most commonly used methods for solving the system of linear equations that represents an electrical network. In this method, a system of equations Ax = b is solved by factoring the matrix A into lower (L) and upper (U) triangular matrices. L and U are then used to obtain the unknown vector x by forward and backward substitution. Factorization of A has O(n^3) time complexity, which motivates techniques to speed it up. We explored a hardware-based interpreter for unrolled and compressed LU factorization code. The inputs to the hardware interpreter were a stream of instructions and a list of non-zeros. The instructions were decompressed on the fly and executed on the non-zero list using a special-purpose floating-point unit. Three cases of a Harvard-type hardware architecture were investigated; each case was modeled at the behavioral and structural levels of abstraction and verified for functional correctness and performance. The well-known linear-system solver SuperLU was used for performance comparison.
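As an illustration of the factor-then-substitute method described above, the following is a minimal dense sketch in Python. It is not the unrolled, compressed code executed by the hardware interpreter, and it assumes no pivoting is needed (every pivot is non-zero); the function names `lu_factor`, `forward_substitute`, `backward_substitute`, and `solve` are hypothetical and chosen only for this example.

```python
def lu_factor(a):
    """Doolittle LU factorization without pivoting (illustrative assumption).

    Returns (l, u) with l unit lower triangular and u upper triangular,
    such that l * u equals a.
    """
    n = len(a)
    u = [row[:] for row in a]  # working copy that becomes U
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n):
        if u[k][k] == 0:
            raise ZeroDivisionError("zero pivot; this sketch does not pivot")
        for i in range(k + 1, n):
            m = u[i][k] / u[k][k]  # multiplier, stored in L
            l[i][k] = m
            u[i][k] = 0.0          # eliminated entry
            for j in range(k + 1, n):
                u[i][j] -= m * u[k][j]
    return l, u


def forward_substitute(l, b):
    """Solve l * y = b for y, where l is unit lower triangular."""
    y = []
    for i in range(len(b)):
        y.append(b[i] - sum(l[i][j] * y[j] for j in range(i)))
    return y


def backward_substitute(u, y):
    """Solve u * x = y for x, where u is upper triangular."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(u[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / u[i][i]
    return x


def solve(a, b):
    """Solve a * x = b by LU factorization and two triangular solves."""
    l, u = lu_factor(a)
    return backward_substitute(u, forward_substitute(l, b))
```

For example, `solve([[4.0, 3.0], [6.0, 3.0]], [10.0, 12.0])` factors the matrix once and then performs the two triangular solves. The triple loop in `lu_factor` is the source of the cubic cost in the dense case; a sparse implementation would visit only the non-zero entries.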
We found that all the architecture cases studied were significantly throttled by the memory interface. The case that used a 4-port interleaved memory for data storage performed the fastest. A second case, a faster-to-implement all-FPGA solution that used FPGA block RAMs as a true dual-port memory, did not perform as fast as the 4-port case. The third case, with standard single-port memory, was the slowest. We showed that, with an efficient implementation of the floating-point unit that permits higher-frequency operation, all three cases of the proposed architecture outperform software-based LU decomposition.