DATA MINING APPROACH IGES 2008 Principe What is Machine Learning? Algorithms that allow computers to learn. Unsupervised learning E.g. cluster analysis Supervised learning E.g. decision trees, logistic regression Data Mining Machine learning in large databases. Pattern Recognition An application of machine learning to

Speech recognition Face recognition Etc. Concept : classification Example of a play golf dataset humidity Lazy Classification temperature Decision tree Classification : performances measurement

Confusion Matrix : a confusion matrix is a visualization tool typically used in supervised learning (in unsupervised learning it is typically called a matching matrix). Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. One benefit of a confusion matrix is that it is easy to see if the system is confusing two classes (i.e. commonly mislabelling one as another). Actual class case Predicted Class Case control

control True positive False positive False Negative True negative Classification : performances measurement Classification Measures:

Accuracy = (TP + TN) / (TP + TN + FP + FN) What proportion of cases and controls are correctly classified? Sensitivity = TP / (TP + FN) What proportion of cases are correctly classified? Specificity = TN / (TN + FP) What proportion of controls are correctly classified? Balanced Accuracy = (Sensitivity + Specificity) / 2 TP = True Positive TN = Trus Negative Classification : performances measurement Receiver Operating Characteristic (ROC) :

Classification : validation Cross-validation : Leave One Out Cross Validation (LOOCV) Better for small datasets Unbiased estimate of prediction error n-fold CV (e.g. 5-fold or 10-fold CV) Better for larger datasets Biased estimate of prediction error Lower variance May need to repeat several times and average results Classification : filters Statistic Filters

Minor allele frequency (e.g. > 0.1) LD (e.g. r2< 0.9) Chi-square Information gain Interaction information Gain ratio Principle components ReliefF Classification : filters Knowledge Filters Prior statistical results Prior experimental results Biochemical pathway Gene Ontology Protein-protein interactions

1 Classification : wrapper Many search methods Exhaustive Hill climbing Neural networks (NN) Simulated annealing (SA) Beam Evolutionary algorithms (EA) Genetic algorithms (GA) Genetic programming (GP) Estimation of Distribution Algorithms (EDA)

Classification : wrapper Example of GENN : Grammatical Evolution to optimize Neural Network Etape 1 : Gnration alatoire dune gnration de rseaux de neurones. Etape 2 : Les donnes sont dcoupes en 10 parties. Les 9/10 sont utiliss pour lapprentissage Etape 3 : Slection des n 1 Paramtrage de la GE (taux de mutation,

nombre de gnration maximale,) 6 Erreur de classification et Erreur de prdiction NN GE NN NN

NN NN 5 NNNN NN NN Recombinaison Mutation Duplication 2

3 NN NN NN NN NN NN NN NN 4 NN

NN NN 1 3 2 meilleurs NN Etape 4 (GE part1): valuation des performances des RN sur les donnes dapprentissage. Classement des mthodes en fonction de leur taux derreur et slection des meilleurs RN. Etapes 5 (GE part2): Parmi les rseaux slectionns ltape 4, des phnomnes de

recombinaison, de duplication et de mutation sont simuls, et crer ainsi une nouvelle gnration de rseaux. Etape 6 : Evaluation du RN final sur le 1/10 des donnes et mesure de la capacit classer les atteints et non-atteints. Available data mining package http://www.ailab.si/orange/ http://www.cs.waikato.ac.nz/ml/ Ensemble Learning & Random Forest Machine Learning Methods

Decision trees Linear regression Neural networks k-nearest neighbour Nave Bayesian classifiers Support Vector Machines Ensemble Learning Methods Bagging, Boosting, Random Forests& RuleFit and many more ...

14 Ensemble Learning Methods Ensemble learning refers to a collection of methods that learn a target function by training a number of individual learners and combining their predictions T Learning phase (x, ?) Application phase T1

T2 TS h1 h2 hS h* = F(h1, h2, , hS)

(x, y*) Ensemble Learning Methods Accuracy : a more reliable mapping can be obtained by combining the output of multiple "experts Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (divide-and-conquer approach). Mixture of experts, ensemble feature selection. There is no single model that works for all pattern recognition problems! (no free lunch theorem) "To solve really hard problems, well have to use several different representations. It is time to stop arguing over which type of patternclassification technique is best. Instead we should work at a higher level of organization and discover how to build managerial systems to exploit the different virtues and evade the different limitations of each of these ways of

comparing things. " --Minsky, 1991. 16 Ensemble Learning Methods How to generate base classifiers? Generation strategy

Decision tree learning:ID3, C4.5 & CART Instance-based learning: k-nearest neighbor Bayesian classification: Nave Bayes Neural networks Regression analysis Clustering et.al. How to integrate them? Integration strategy: BAGGing = Bootstrap AGGregation (Breiman, 1996) Boosting (Schapire and Singer, 1998)

Random Forests (Breiman, 2001) Ensemble Learning Methods Bootstrap aggregating (bagging) : meta-algorithm to improve machine learning of classification and regression models in terms of stability and classification accuracy. Bagging generates sub-sample from a standard training set by sampling examples from D uniformly and with replacement (bootstrap sample). The m models are fitted using the above m bootstrap samples and combined by averaging the output (for regression) or voting (for classification). Ensemble Learning Methods Boosting is a machine learning meta-algorithm for performing supervised learning.

Most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier After a weak learner is added, the data is reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight Random Forest Tree 1 Tree 2 Tree 3 Tree i

Final Classification is based on votes from all N trees Yes No Yes No Final Decision Random Forest

Each tree is constructed using the following algorithm: Number of cases = N, Number of variables in the classifier = M. 1. Choose the number m of input variables to be used to determine the decision at a node of the tree; m should be much less than M. 2. Choose a training set for this tree by choosing N times with replacement from all N available training cases (i.e. take a bootstrap sample). Use the rest of the cases to estimate the error of the tree, by predicting their classes. 3. For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set. 4. Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier). Random Forest

Its accuracy is as good as Adaboost and sometimes better It is relatively robust to outliers and noise It is faster than bagging or boosting It gives useful internal estimates of error, strength, correlation and variable importance It is simple and can be easily parallelized Random Forest Heidema 2006 BMC Genetics Pubmed Publications 100 0 2003

2004 2005 2006 2007 Use of Entropy based Methods Information Theory Shannon Entropy Probabilit Entropie

= incertitude sur un tat = incertitude sur le systme N H pi log pi i 1 B.A. McKinney et al. Evaporative cooling feature selection for genotypic data involving interactions. Bioinformatics Information Theory Surprisal or Self-Entropy of event i : si ( pi ) log pi si ( pi 0)

si ( pi 1) 0 si pas dincertitude sur un systme : entropie = 0 Exemple pour un systme X ayant deux tats possible tel que : p1 = 10-5 s1 = 5 p2 1 s2 = 0 H(X) 0 Information Theory Least biased probability distribution is the one that maximizes the information entropy subject to prior information

1 pi N Least biased (maximum entropy) probability is uniform: ! Entropy is not disorder Sequence 1: 1111100000 p1=0.5, p0=0.5 H(sequence1) = log(2)

Sequence 2: 1100010110 p1=0.5, p0=0.5 H(sequence2) = log(2) Maximum Entropy Principle (Jaynes, Physical Review 106, 620 (1957)) Information Theory Mutual Information (correlation) Information = Removal of uncertainty Usually called Information Gain P ( a, b) I ( A; B) p(a, b) log 2 P(a) P(b)

aA bB H ( A) H ( B) H ( A, B) Uncertainty from SNP A Uncertainty removed because of correlation between A and B Uncertainty from SNP B B.A. McKinney et al. Evaporative cooling feature selection for genotypic data involving interactions. Bioinformatics Information Theory Information Gain Consider two attributes, A and B (two SNPs), and a class label C (disease status).

Information Theory Information Gain A = SNP1 B = SNP2 C = disease status If IG(ABC) > 0 Evidence for an attribute interaction If IG(ABC) < 0 The information between A and B is redundant If IG(ABC) = 0 Evidence of conditional independence or a mixture of synergy and redundancy

Information Theory Attribute Selection based on Entropy Entropy-based IG is estimated for each individual attribute (i.e. main effects) and each pairwise combination of attributes (i.e. interaction effects). Pairs of attributes are sorted and those with the highest IG, or percentage of entropy in the class removed, are selected for further consideration. Information Theory Exemple dapplication pour mesurer la qualit du squenage Entropy Kullback-Leibler

N pi D ( P, Q) pi log i 1 qi pi = probabilit observ qi = probabilit priori > > > > > > >

> > CCAGAGGCCCAACUGGUAAACGGGC CCG-AAGCUCAACGGGAUAAUGAGC CCG-AAGCCGAACGGGAAAACCGGC CC-CAAGCGC-AGGGGAGAA-GCGC CCG-ACGCCA-ACGGGAGAA-UGGC CCGUUUUCAG-UCGGGAAAAACUGA CCGUUACUCC-UCGGGAUAAAGGAG CCGUAAGAGG-ACGGGAUAAACCUC CCG-UAGGAG-GCGGGAUAU-CUCC (sous H0 est = 0.25 pour chaque nuclotide)

Relative Uncertainty on base U for one specific position Gorodkin, et. al. Comput. Appl. Biosci., Vol. 13, no. 6 pp 583-586, 1997. Tuerk et. al. PNAS 89, pp 6988-6992, 1992 Ahola et al. A statistical score for assessing the quality of multiple sequence alignments. Bioinformatics D'haeseleer. What are DNA sequence motifs? Nature Biotechnology MDR : Multifactor Dimensionality Reduction MDR MDR Step 1 & 2 Step 1: partition the data into some number of equal

parts for cross-validation Step 2: a set of N genetic and/or discrete environmental factors is selected from the list of all factors MDR MDR Step 3 The N factors and their multifactor classes or cells are represented in N-dimensional space The ratio of the number of cases to the number of controls is evaluated within each multifactor cell MDR MDR Step 4 Each multifactor cell in N-dimensional space is labeled as

high-risk if the ratio meets or exceeds some threshold T (e.g. T = 1.0) and low-risk if otherwise Those cells labeled high-risk are in one group and those lowrisk are in another group, which reduces the N-dimensional model to one dimension MDR MDR Step 5 & 6 Step 5: all possible combinations of N factors are evaluated for their ability to classify affected and unaffected individuals in the training data, and the best N-factor model is selected. Step 6: the independent test data from the cross-validation is used to estimate the prediction error of the best model selected. MDR

overview MDR MDR final Steps 1 through 6 are repeated for each possible cross validation interval The final step: determine which multifactor levels (e.g. genotypes) are high risk and which are low risk using the entire dataset. MDR : New features Interpretation Interaction Graphs Comprised of a node for each attribute with pairwise connections between them.

Each node is labeled the percentage of entropy removed (i.e. IG) by each attribute. Each connection is labeled the percentage of entropy removed for each pairwise Cartesian product of attributes. IG > 0 Evidence for an attribute interaction IG< 0 The information between SNP1 and SNP2 is redundant

IG = 0 Evidence of conditional independence or a mixture of synergy and redundancy MDR : New features Interpretation Interaction Graphs Attribute Interaction I(SNP1;SNP2;Class) Class Attribute effect

SNP1 Attribute effect SNP2 Attribute correlation MDR : New features Interpretation Dendrograms Hierarchical clustering is used to build a dendrogram that places strongly interacting attributes close together at the leaves of the tree. MDR : New features

MDR : New features