Please use this identifier to cite or link to this item: http://cimat.repositorioinstitucional.mx/jspui/handle/1008/727
PROBLEMS IN STATISTICAL GENETICS: CLASSIFICATION AND TESTING FOR NETWORK CHANGES
ADOLPHUS WAGALA
Open Access
Attribution-NonCommercial
STATISTICAL INTEGRATION OF MOLECULAR DATA
This thesis addresses two problems: the classification of microarray data and the statistical integration of molecular data to test for network changes. For the classification problem, we consider both unpreprocessed and preprocessed microarray data sets. We implement an extension of the partial least squares generalized linear regression (PLSGLR) of Bastien et al. (2005), obtained by combining it with logistic regression to give the partial least squares generalized linear regression-logistic regression model (PLSGLR-log), and with linear discriminant analysis to give the partial least squares generalized linear regression-linear discriminant analysis (PLSGLRDA). These two classification methodologies are then compared with classical methodologies, namely k-nearest neighbours (KNN), linear discriminant analysis (LDA), partial least squares discriminant analysis (PLSDA), ridge partial least squares (RPLS), and the support vector machine (SVM). Furthermore, we implement a recent algorithm by Dalmau et al. (2015) known as the kernel multilogit algorithm (KMA). The results indicate that for the noisy unpreprocessed data, the KMA emerged as the clear “winner” based on its low misclassification error rates. For the preprocessed normalized data, there was no clear “winner”, since no single method performed markedly better than the rest. The KNN emerged as a clear “loser”, since it consistently had a relatively higher misclassification rate on both the unpreprocessed and the preprocessed data sets.
The statistical integration of molecular data to test for network changes considers an experiment involving two main groups, namely healthy (H) and acute rheumatic fever (ARF) subjects. For each group, each specimen is divided into two portions, one stimulated with group A streptococcus (GAS) and the other left unstimulated, so that we end up with four subgroups: Healthy GAS stimulated, Healthy unstimulated, ARF GAS stimulated and ARF unstimulated. As a result, we have dependence within the groups and independence between the groups. For all the groups, the expression of p genes is measured. We identify a prior network from the curated literature and online sources. The genes considered in the experiment are then matched with those in the prior network, so that the prior network is reduced to only the genes found in the experimental data. We then construct two networks, one for the healthy and the
07-03-2018
Degree work, doctorate
OTHER
Accepted version
acceptedVersion - Accepted version
Appears in collections: CIMAT Theses

Files:

File: TE 658.pdf | Size: 2.45 MB | Format: Adobe PDF