Classification Techniques: Decision Tree Learning
Bamshad Mobasher, DePaul University

Classification: 3-Step Process

1. Model construction (Learning):
   - Each record (instance, example) is assumed to belong to a predefined class, as determined by one of the attributes
       - This attribute is called the target attribute
       - The values of the target attribute are the class labels
   - The set of all instances used for learning the model is called the training set

2. Model Evaluation (Accuracy):
   - Estimate the accuracy rate of the model based on a test set
   - The known labels of the test instances are compared with the classes predicted by the model
   - The test set must be independent of the training set; otherwise over-fitting will occur

3. Model Use (Classification):
   - The model is used to classify unseen instances (i.e., to predict the class labels for new, unclassified instances)
   - Predict the value of the target attribute for new instances

Classification Methods

  - Decision Tree Induction
  - Bayesian Classification
  - K-Nearest Neighbor
  - Neural Networks
  - Support Vector Machines
  - Association-Based Classification
  - Genetic Algorithms
  - Many more ...
  - Also: Ensemble Methods

Decision Trees

A decision tree is a flow-chart-like tree structure:
  - An internal node denotes a test on an attribute (feature)
  - A branch represents an outcome of the test
  - All records in a branch have the same value for the tested attribute
  - A leaf node represents a class label or a class label distribution

Example tree:

    outlook = sunny    -> humidity: high -> N, normal -> P
    outlook = overcast -> P
    outlook = rain     -> windy: true -> N, false -> P

Instance Language for Classification

Example: is it a good day to play golf?

A set of attributes and their possible values:

    outlook      sunny, overcast, rain
    temperature  cool, mild, hot
    humidity     high, normal
    windy        true, false

A particular instance in the training set might be: <...> : play
In this case, the target class is a binary attribute, so each instance represents a positive or a negative example.

Using Decision Trees for Classification

Examples can be classified as follows:
  1. Look at the example's value for the feature tested at the current node
  2. Move along the edge labeled with this value
  3. If you reach a leaf, return the label of the leaf
  4. Otherwise, repeat from step 1

Example (a decision tree to decide whether to go on a picnic):

    outlook = sunny    -> humidity: high -> N, normal -P
    outlook = overcast -> P
    outlook = rain     -> windy: true -> N, false -> P

So a new instance <...> : ? will be classified as "no play".

Decision Trees and Decision Rules

If attributes are continuous, internal nodes may test against a threshold.

    outlook = sunny    -> humidity > 75% -> N,  humidity <= 75% -> P
    outlook = overcast -> P
    outlook = rain     -> wind > 20 -> N,  wind <= 20 -> P

Each path in the tree represents a decision rule:

    Rule1: If (outlook=sunny) AND (humidity<=0.75) Then (play=yes)
    Rule3: If (outlook=overcast) Then (play=yes)
    Rule2: If (outlook=rainy) AND (wind>20) Then (play=no)
    ...

Top-Down Decision Tree Generation

The basic approach usually consists of two

phases:
  - Tree construction
      - At the start, all the training instances are at the root
      - Partition the instances recursively based on selected attributes
  - Tree pruning (to improve accuracy)
      - Remove tree branches that may reflect noise in the training data and lead to errors when classifying test data

Basic Steps in Decision Tree Construction
  - The tree starts as a single node representing all the data
  - If the instances are all in the same class, the node becomes a leaf labeled with that class label
  - Otherwise, select the feature that best distinguishes among the instances
      - Partition the data based on the values of the selected feature (with each branch representing one partition)
  - Recursion stops when:
      - the instances in a node all belong to the same class (or too few instances remain)

Tree Construction Algorithm (ID3)

Decision Tree Learning Method (ID3):
  - Input: a set of training instances S, a set of features F
  1. If every element of S has class value yes, return yes; if every element of S has class value no, return no
  2. Otherwise, choose the best feature f from F (if there are no features remaining, then return failure)
  3. Extend the tree from f by adding a new branch for each attribute value of f
     3.1. Set F = F - {f}
  4. Distribute the training instances to the leaf nodes (so each leaf node n represents the subset Sn of S with the corresponding attribute value)
  5. Repeat steps 1-5 for each leaf node n, with Sn as the new set of training instances and F as the new set of features
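The ID3 steps above can be sketched in Python, together with the classification procedure from the earlier slide. This is a minimal illustration, not the full algorithm: the feature-selection measure is passed in as a `choose` function (the actual measure is introduced below), and the data layout (instances as feature dicts paired with a label) is an assumption made here for the sketch.

```python
from collections import Counter

def id3(instances, features, choose):
    """instances: list of (feature-dict, label); features: list of feature names;
    choose(instances, features): returns the best feature to split on."""
    labels = [label for _, label in instances]
    if len(set(labels)) == 1:                 # step 1: pure node -> leaf
        return labels[0]
    if not features:                          # no features left -> majority label
        return Counter(labels).most_common(1)[0][0]
    f = choose(instances, features)           # step 2: choose the best feature
    remaining = [g for g in features if g != f]          # step 3.1: F = F - {f}
    branches = {}
    for value in {inst[f] for inst, _ in instances}:     # step 3: one branch per value
        subset = [(i, l) for i, l in instances if i[f] == value]   # step 4
        branches[value] = id3(subset, remaining, choose)           # step 5: recurse
    return {f: branches}

def classify(tree, instance):
    """Follow edges matching the instance's values until a leaf is reached."""
    while isinstance(tree, dict):
        feature = next(iter(tree))
        tree = tree[feature][instance[feature]]
    return tree

# Toy run with a trivial chooser (just take the first feature).
data = [({"windy": True}, "no"), ({"windy": False}, "yes")]
tree = id3(data, ["windy"], choose=lambda inst, feats: feats[0])
print(classify(tree, {"windy": True}))   # no
```

The learned tree is the nested-dict form of the flow-chart structure described earlier: each internal node maps a feature name to one subtree per attribute value, and leaves are class labels.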

Main Question: how do we choose the best feature at each step?

Note: the ID3 algorithm only deals with categorical attributes, but can be extended (as in C4.5) to handle continuous attributes.

Choosing the Best Feature

Use Information Gain to find the best (most discriminating) feature.
Assume there are two classes, P and N (e.g., P = yes and N = no):
  - Let the set of instances S (the training data) contain p elements of class P and n elements of class N
  - The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined in terms of entropy, I(p,n):

        I(p,n) = -Pr(P) log2 Pr(P) - Pr(N) log2 Pr(N)

  - Note that Pr(P) = p / (p+n) and Pr(N) = n / (p+n)
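The two-class entropy I(p,n) is easy to compute directly; a small sketch (the function name is ours, and the 0 * log 0 term is taken as 0 by convention):

```python
from math import log2

def entropy2(p, n):
    """I(p, n) = -Pr(P) log2 Pr(P) - Pr(N) log2 Pr(N), with 0*log 0 taken as 0."""
    total = p + n
    result = 0.0
    for count in (p, n):
        pr = count / total
        if pr > 0:                 # skip empty classes: 0 * log2(0) contributes 0
            result -= pr * log2(pr)
    return result

print(round(entropy2(9, 5), 2))    # 0.94  (the Golf training set below)
print(entropy2(4, 0))              # 0.0   (a pure node)
print(entropy2(7, 7))              # 1.0   (maximally mixed)
```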

In other words, the entropy of a set of instances S is a function of the probability distribution of classes among the instances in S.

Entropy

(figure: entropy curve for a two-class variable, as a function of the class probability)

Entropy in Multi-Class Problems

More generally, if we have m classes c1, c2, ..., cm, with s1, s2, ..., sm as the numbers of instances from S in each class, then the entropy is:

    I(s1, s2, ..., sm) = - sum_{i=1..m} pi log2(pi)

where pi is the probability that an arbitrary instance belongs to class ci.

Information Gain

Now, assume that using attribute A, a set S of instances will be partitioned into sets S1, S2, ..., Sv, each corresponding to a distinct value of attribute A.
  - If Si contains pi cases of P and ni cases of N, then the entropy, or the expected information needed to classify objects in all subtrees Si, is:

        E(A) = sum_{i=1..v} Pr(Si) * I(pi, ni),    where Pr(Si) = |Si| / |S| = (pi + ni) / (p + n)

This is the probability that an arbitrary instance in S belongs to the partition Si.

The encoding information that would be gained by branching on A is:

    Gain(A) = I(p, n) - E(A)

At any point we want to branch using the attribute that provides the highest information gain.

Attribute Selection - Example

The Golf example: what attribute should we choose as the root?

    Day  outlook   temp  humidity  wind    play
    D1   sunny     hot   high      weak    No
    D2   sunny     hot   high      strong  No
    D3   overcast  hot   high      weak    Yes
    D4   rain      mild  high      weak    Yes
    D5   rain      cool  normal    weak    Yes
    D6   rain      cool  normal    strong  No
    D7   overcast  cool  normal    strong  Yes
    D8   sunny     mild  high      weak    No
    D9   sunny     cool  normal    weak    Yes
    D10  rain      mild  normal    weak    Yes
    D11  sunny     mild  normal    strong  Yes
    D12  overcast  mild  high      strong  Yes
    D13  overcast  hot   normal    weak    Yes
    D14  rain      mild  high      strong  No

Splitting on outlook:

    S: [9+,5-]
    Outlook?  sunny [2+,3-]   overcast [4+,0-]   rainy [3+,2-]

    I(9,5) = -(9/14) log(9/14) - (5/14) log(5/14) = 0.94
    I(4,0) = -(4/4) log(4/4) - (0/4) log(0/4) = 0
    I(2,3) = -(2/5) log(2/5) - (3/5) log(3/5) = 0.97
    I(3,2) = -(3/5) log(3/5) - (2/5) log(2/5) = 0.97

    Gain(outlook) = .94 - (4/14)*0 - (5/14)*.97 - (5/14)*.97 = .24
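This computation can be checked directly in Python; a small sketch using only the outlook column and the play labels from the table (the exact result is ~0.247, which the slide rounds to .24 after first rounding the subset entropies to .97):

```python
from math import log2

# The 14 Golf instances as (outlook, play) pairs; only the attribute
# being evaluated matters for this check.
rows = [("sunny","No"),("sunny","No"),("overcast","Yes"),("rain","Yes"),
        ("rain","Yes"),("rain","No"),("overcast","Yes"),("sunny","No"),
        ("sunny","Yes"),("rain","Yes"),("sunny","Yes"),("overcast","Yes"),
        ("overcast","Yes"),("rain","No")]

def I(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

labels = [play for _, play in rows]
gain = I(labels)                                  # I(9,5) = 0.94
for v in {"sunny", "overcast", "rain"}:
    subset = [play for outlook, play in rows if outlook == v]
    gain -= len(subset) / len(rows) * I(subset)   # subtract Pr(Si) * I(Si)
print(gain)   # ~0.247 (the slide rounds this to .24)
```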

Attribute Selection - Example (Cont.)

(same Golf training set as above)

Splitting on humidity:

    S: [9+,5-] (I = 0.940)
    humidity?  high [3+,4-] (I = 0.985)   normal [6+,1-] (I = 0.592)

    Gain(humidity) = .940 - (7/14)*.985 - (7/14)*.592 = .151

So, classifying examples by humidity provides more information gain than by wind. Similarly, we must find the information gain for temp. In this case, however, you can verify that outlook has the largest information gain, so it will be selected as the root.
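Gain(humidity) can be verified the same way; a sketch using the humidity column and play labels from the table above (the exact result is ~0.152, agreeing with the slide's .151 up to rounding of the intermediate entropies):

```python
from math import log2

# humidity values and play labels for the 14 Golf instances (D1..D14).
humidity = ["high","high","high","high","normal","normal","normal",
            "high","normal","normal","normal","high","normal","high"]
play =     ["No","No","Yes","Yes","Yes","No","Yes",
            "No","Yes","Yes","Yes","Yes","Yes","No"]

def I(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

gain = I(play)                                     # I(9,5) = 0.94
for v in ("high", "normal"):
    subset = [p for h, p in zip(humidity, play) if h == v]
    gain -= len(subset) / len(play) * I(subset)    # subtract Pr(Si) * I(Si)
print(round(gain, 3))   # 0.152
```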

Splitting on wind:

    S: [9+,5-] (I = 0.940)
    wind?  weak [6+,2-] (I = 0.811)   strong [3+,3-] (I = 1.00)

    Gain(wind) = .940 - (8/14)*.811 - (6/14)*1.0 = .048

Attribute Selection - Example (Cont.)

Partially learned decision tree:

    S: [9+,5-]  {D1, D2, ..., D14}
    Outlook
      sunny    -> ?    [2+,3-]  {D1, D2, D8, D9, D11}
      overcast -> yes  [4+,0-]  {D3, D7, D12, D13}
      rainy    -> ?    [3+,2-]  {D4, D5, D6, D10, D14}

Which attribute should be tested at the sunny branch?

    Ssunny = {D1, D2, D8, D9, D11}
    Gain(Ssunny, humidity) = .970 - (3/5)*0.0 - (2/5)*0.0 = .970
    Gain(Ssunny, temp)     = .970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = .570
    Gain(Ssunny, wind)     = .970 - (2/5)*1.0 - (3/5)*.918 = .019

Other Attribute Selection Measures

Gain ratio:
  - The Information Gain measure tends to be biased in favor of attributes with a large number of values
  - Gain Ratio normalizes the Information Gain with respect to the total entropy of all splits based on the values of an attribute
  - Used by C4.5 (the successor of ID3)
  - But it tends to prefer unbalanced splits (one partition much smaller than the others)

Gini index:
  - A measure of impurity (based on the relative frequencies of classes in a set of instances)
      - The attribute that provides the smallest Gini index (or the largest reduction in impurity due to the split) is chosen to split the node
  - Possible problems:
      - Biased towards multivalued attributes, similar to Information Gain
      - Has difficulty when the number of classes is large

Overfitting and Tree Pruning

Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Some splits or leaf nodes may be the result of decisions based on very few instances, resulting in poor accuracy for unseen instances

Two approaches to avoid overfitting:
  - Prepruning: halt tree construction early; do not split a node if this would result in the error rate going above a pre-specified threshold
      - Difficult to choose an appropriate threshold

  - Postpruning: remove branches from a fully grown tree
      - Get a sequence of progressively pruned trees
      - Use test data, different from the training data, to measure error rates
      - Select the best pruned tree

Enhancements to the Basic Decision Tree Learning Approach

Allow for continuous-valued attributes:
  - Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals

Handle missing attribute values:
  - Assign the most common value of the attribute, or
  - Assign a probability to each of the possible values

Attribute construction:
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication
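The continuous-attribute enhancement is typically implemented by trying candidate thresholds at midpoints between consecutive sorted values and keeping the split with the lowest weighted entropy (this is the approach C4.5 takes); a sketch, where the sample humidity readings and labels are made up for illustration:

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    """Try midpoints between consecutive distinct sorted values; return the
    threshold whose <= / > split has the lowest weighted entropy."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)               # (weighted entropy, threshold)
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue                          # no boundary between equal values
        t = (a + b) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        best = min(best, (score, t))
    return best[1]

# Hypothetical humidity readings with play / no-play labels.
print(best_threshold([65, 70, 80, 90], ["yes", "yes", "no", "no"]))   # 75.0
```

The chosen threshold then defines a binary discrete attribute (humidity <= 75 vs. humidity > 75), which the categorical tree-growing procedure can use directly.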