With the rapid advance of semiconductor technology, chip density has increased significantly. As power constraints set hard limits on further clock frequency increases, multi-core processors and chip-level multiprocessors have become prevalent in recent years as a way to take advantage of the increasing chip density. In the new generation of processors, multi-core architecture is becoming the dominant design trend: the IBM/Sony/Toshiba Cell Broadband Engine processor contains nine cores, and NVIDIA graphics processors contain more than 30 cores.
One of the biggest challenges is to use the computational power provided by multi-core systems efficiently. A second challenge to achieving high performance in a computer system is the growing disparity between processor and memory speeds. This thesis examines three problems, namely sorting, matrix multiplication, and initial value problems for ordinary differential equations, on two target architectures: the Cell Broadband Engine and the NVIDIA CUDA-enabled graphics processor. It first studies how to exploit the various levels of parallelism available to these applications. It also explores how memory hierarchies and other architectural features can be used to improve performance further.