Microarray data routinely contain gene expression levels of thousands of genes. In the context of medical diagnostics, an important problem is to find the genes that are correlated with given phenotypes. These genes may reveal insights to biological processes and may be used to predict the phenotypes of new samples. In most cases, while the gene expression levels are available for a large number of genes, only a small fraction of these genes may be informative in classification with statistical significance. We introduce a nonparametric scoring algorithm that assigns a score to each gene based on samples with known classes. Based on these scores, we can find a small set of genes which are informative of their class, and subsequent analysis can be carried out with this set. This procedure is robust to outliers and different normalization schemes, and immediately reduces the size of the data with little loss of information. We study the properties of this algorithm and apply it to the data set from cancer patients. We quantify the information in a given set of genes by comparing its distribution of the score statistics to a set of distributions generated by permutations that preserve the correlation structure among the genes.
A nonparametric scoring algorithm for identifying informative genes from microarray data
BONETTI, MARCO
2001
Abstract
Microarray data routinely contain gene expression levels of thousands of genes. In the context of medical diagnostics, an important problem is to find the genes that are correlated with given phenotypes. These genes may reveal insights to biological processes and may be used to predict the phenotypes of new samples. In most cases, while the gene expression levels are available for a large number of genes, only a small fraction of these genes may be informative in classification with statistical significance. We introduce a nonparametric scoring algorithm that assigns a score to each gene based on samples with known classes. Based on these scores, we can find a small set of genes which are informative of their class, and subsequent analysis can be carried out with this set. This procedure is robust to outliers and different normalization schemes, and immediately reduces the size of the data with little loss of information. We study the properties of this algorithm and apply it to the data set from cancer patients. We quantify the information in a given set of genes by comparing its distribution of the score statistics to a set of distributions generated by permutations that preserve the correlation structure among the genes.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.