Many cancers comprise several distinct subtypes. For example, the two most common subtypes of renal cell carcinoma (RCC) are clear cell RCC and papillary RCC. The gene expression profiles of such subtypes are expected to be distinctive, so that the subtypes can be identified from the expression levels of a panel of genes.
The goal of this thesis is to identify such a panel of discriminator genes using a genetic algorithm combined with the k-nearest neighbor method. The genetic algorithm uses an integer-stream coding scheme. The fitness of each chromosome is evaluated by its ability to correctly classify the known samples using the k-nearest neighbor method. To test the robustness of the algorithm, a bootstrapping analysis is performed in which one sample at a time is removed from the data set and the remaining samples are used for gene selection. Also studied are the effects of different distance metrics on the classification results, the stability of the algorithm with respect to different initial populations, and the sensitivity of the algorithm to the samples used.
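As an illustration of the fitness evaluation described above, the following minimal Python sketch scores an integer-coded chromosome (a list of gene indices) by the fraction of known samples that a k-nearest neighbor classifier labels correctly when each sample in turn is held out. It is a sketch under stated assumptions, not the implementation used in this thesis: the function names `euclidean`, `knn_predict`, and `fitness`, the choice of Euclidean distance as the default metric, the hold-one-out scoring, and the toy expression values are all hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k, distance=euclidean):
    """Majority vote among the k training samples closest to `query`."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def fitness(chromosome, samples, labels, k=3, distance=euclidean):
    """Fraction of samples classified correctly using only the genes
    whose indices appear in `chromosome` (assumed scoring scheme)."""
    # Restrict every sample to the genes selected by the chromosome.
    projected = [[sample[g] for g in chromosome] for sample in samples]
    correct = 0
    for i, (x, y) in enumerate(zip(projected, labels)):
        # Hold out sample i and classify it against all other samples.
        rest_x = projected[:i] + projected[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        if knn_predict(rest_x, rest_y, x, k, distance) == y:
            correct += 1
    return correct / len(samples)


# Toy data (invented for illustration): 6 samples x 4 genes.
# Gene 0 separates the two classes; gene 1 does not.
samples = [
    [1.0, 5.0, 0.2, 3.0],
    [1.2, 1.0, 0.1, 9.0],
    [0.9, 9.0, 0.3, 1.0],
    [5.0, 5.2, 0.2, 8.5],
    [5.1, 0.8, 0.1, 2.5],
    [4.8, 9.1, 0.3, 1.2],
]
labels = ["clear_cell"] * 3 + ["papillary"] * 3

score_informative = fitness([0], samples, labels, k=1)
score_uninformative = fitness([1], samples, labels, k=1)
```

In this toy setting the chromosome containing the discriminating gene receives a higher fitness than the one containing an uninformative gene, which is the signal a genetic algorithm would exploit during selection. The `distance` parameter is exposed so that alternative metrics can be substituted.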
The algorithm has been tested on two microarray data sets: a set of nine RCC samples and a set of 73 human acute leukemia samples. The computational results indicate that the combined genetic algorithm and k-nearest neighbor method can serve as an effective tool for classifying cancer subtypes.