With the limitations given by the power consumption (power wall), memory wall and the instruction level parallelism, the computing industry has turned its direction to multi-core architectures. Nowadays, the multi-core and many-core architectures are becoming the trend of the processor design. But how to exploit these architectures is the primary challenge for the research community. To take advantage of the multi-core architectures, the software design has undergone fundamental changes.
Sorting is a fundamental, important problem in computer science. It is utilized in many applications such as databases and search engines. In this thesis, we will investigate and auto-tune two parallel sorting algorithms, i.e., radix sort and sample sort on two parallel architectures, the many-core nVIDIA CUDA enabled graphics processors, and the multi-core Cell Broadband Engine. We redesign and manually tune these two parallel sorting algorithms to take advantage of multiple-level parallelism simultaneously, i.e., thread level parallelism, loop level parallelism, data level parallelism (SIMD instructions). At the same time, we try to take advantage of the high-speed shared memory. The experimental results showed that the parallel implementation of these two sorting algorithms on these two multi-core architectures achieved significant performance improvement compared to the corresponding sequential version.