A DT asks a question and classifies based on the answer
Note: A classification can be categorical or numeric
In the 2nd case we are using mouse weight to predict mouse size
More complex DT:
It combines numeric data:
With Yes/No data:
Notice that the cut-off for Resting Heart Rate need not be the same on both sides
Also, the order of questions need not be the same on both sides
The final classifications may be repeated
You start at the top and go down until you reach a point where you can't go any further
That's how a sample is classified
We want to create a tree that uses chest pain, good blood circulation, blocked artery status to predict heart disease(y/n)
We want to decide which of chest pain, good blood circulation, blocked artery status should be root node
We start off by exploring how well Chest pain classifies heart disease and build a tree as shown below:
We build similar trees for Good Blood Circulation and blocked Artery
As shown above, we don't know the BA status for this patient. We skip this patient for now, but there are other alternatives
Because there are missing values for some features, the total number of patients in each tree is different
Because none of the leaf nodes is 100% "Yes heart disease" or 100% "No heart disease", they are all considered "impure"
To determine which separation is best we need a way to measure and compare impurity
Gini impurity (GI) is calculated for each leaf node as shown below: GI = 1 - (probability of Yes)^2 - (probability of No)^2
Similarly we calculate GI for right leaf node
The leaf nodes do not represent the same number of patients
Thus the total GI for using Chest Pain as the root node is the weighted average of the GI of the 2 leaf nodes
Similarly we calculate GI for all 3 possible root nodes
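A minimal R sketch of that calculation; the only leaf counts listed in these notes are the ones for Good Blood Circulation (37/127 and 100/33, given a few lines below), so those are used here:
gini <- function(yes, no) {
  total <- yes + no
  1 - (yes / total)^2 - (no / total)^2
}
gini_left  <- gini(yes = 37,  no = 127)   # leaf for patients with good blood circulation
gini_right <- gini(yes = 100, no = 33)    # leaf for patients without good blood circulation
# Total GI for the split = weighted average of the two leaf GIs,
# weighted by how many patients end up in each leaf
n_left  <- 37 + 127
n_right <- 100 + 33
(n_left / (n_left + n_right)) * gini_left +
  (n_right / (n_left + n_right)) * gini_right    # ~= 0.36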
Good Blood Circulation has the lowest impurity, i.e. it separates the patients with and without heart disease the best
So first node (root) = GBC
After the split we get 2 leaf nodes
Left: (37 y, 127 n)
Right: (100 y, 33 n)
Now we need to figure out how to separate (and whether we should separate further) the patients in the Left and Right nodes
Lets start with left:
These are the patients with GBC == true
Just like before we separate these patients based on CP and calculate GI as before
We do same for Blocked Artery
GI for BA = 0.29
This is less than GI for CP and also less than GI for GBC
Thus we use BA in the left part
Resulting tree:
Now we will use CP to try to separate the Root->L->L node (24/25)
These are the patients with GBC = true and BA = true
CP does a good job in separating the patients:
Now we look at node in Root->L->R (13/102)
Let's try to use CP to divide these 115 patients
Note: The vast majority (89%, i.e. 102 of 115) of the patients in this node don't have heart disease
After separating we get a higher GI than before separating
So we make this node a leaf node
We have built the entire LHS of the tree
For RHS we follow same steps:
Calculate all GI scores
If the node itself has the lowest score, then there is no point in separating further and the node becomes a leaf node
If separating the data results in an improvement, pick the separation with the lowest impurity value
Complete tree:
Imagine if our features were numeric not just Y/N:
Sort the patients by weight, then calculate the average weight for each pair of adjacent patients
Calculate the GI for each of these average weights used as a cutoff
In the above diagram, GI is calculated for weight < 167.5
So this is the cutoff that we will use when we compare weight to CP or BA (see the sketch below)
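A minimal R sketch of that cutoff search, using made-up weights and heart-disease labels:
gini <- function(yes, no) 1 - (yes / (yes + no))^2 - (no / (yes + no))^2
# Made-up patient weights (already sorted) and heart-disease labels
wt <- c(135, 155, 160, 175, 185, 210)
hd <- c("No", "Yes", "Yes", "Yes", "No", "Yes")
# Candidate cutoffs = averages of adjacent weights
cutoffs <- (head(wt, -1) + tail(wt, -1)) / 2
# Weighted GI for splitting at each candidate cutoff
impurity <- sapply(cutoffs, function(cut) {
  left  <- hd[wt < cut]
  right <- hd[wt >= cut]
  (length(left) / length(hd))  * gini(sum(left == "Yes"),  sum(left == "No")) +
  (length(right) / length(hd)) * gini(sum(right == "Yes"), sum(right == "No"))
})
cutoffs[which.min(impurity)]   # the cutoff with the lowest impurity is the one we keep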
Ranked data is similar to numeric data, except that now we calculate impurity scores for all possible ranks
So if rank is from 1 to 4 (4 being best), we calculate impurity scores as:
rank <= 1
rank <= 2
rank <= 3
We dont need <=4 as it includes everyone
When there are multiple unordered choices, like colors B, R or G, we calculate the GI for each option as well as for each possible combination:
B
G
R
B or G
B or R
G or R
We don't need to calculate for "B or R or G" as it includes everyone (see the sketch below)
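A small R sketch that enumerates those candidate splits (combn() does the combinations):
colors <- c("B", "G", "R")
# Single colors plus every pair of colors; the full set {B, R, G} is skipped
# because it would include everyone
candidates <- c(combn(colors, 1, simplify = FALSE),
                combn(colors, 2, simplify = FALSE))
candidates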
DTs are easy to build, use and interpret
But in practice, they are not that awesome
Trees have one aspect that prevents them from being the ideal tool for predictive learning, namely inaccuracy
They work great with the data used to create them but are not flexible when it comes to classifying new samples
RF combines simplicity of DTs with flexibility resulting in a vast improvement in accuracy
Say these 4 samples are entire dataset
To create a bootstrapped dataset that is the same size as the original, we randomly select samples from the original dataset
We are allowed to pick the same sample more than once
Say the 4 samples in the original dataset are S1, S2, S3 and S4
A bootstrapped dataset might then be: S2, S1, S4, S4 (S4 is picked twice); see the sketch below
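A minimal R sketch of building one bootstrapped dataset (the 4-sample dataset here is made up):
original <- data.frame(id = c("S1", "S2", "S3", "S4"))
set.seed(1)   # only so the example is reproducible
# Sample row indices with replacement: same size as the original, duplicates allowed
boot_rows <- sample(nrow(original), size = nrow(original), replace = TRUE)
bootstrapped <- original[boot_rows, , drop = FALSE]
bootstrapped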
In this example we will consider 2 vars at each step
Thus instead of considering all 4 vars (CP, GBC, BA, Wt) to figure out how to split the root node we randomly select 2 : GBC, BA
Say GBC did the best job at separating the samples
Since we used GBC, we grey it out so that we can focus on the remaining variables
Now we have to figure out how to select vars for circled node:
Just like for the root we randomly select 2 vars from (CP, BA, wt)
We select CP and wt
Thus we build the tree by:
using the bootstrapped dataset
considering a random subset of variables at each step (see the sketch below)
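A small R sketch of that variable sampling (the variable names follow the example above):
vars <- c("CP", "GBC", "BA", "Wt")
set.seed(3)
# Root node: randomly pick 2 of the 4 candidate variables
root_candidates <- sample(vars, size = 2)   # (in the notes above, this pick happened to be GBC and BA)
# Next node: GBC is greyed out, so we sample from the remaining variables
child_candidates <- sample(setdiff(vars, "GBC"), size = 2)
root_candidates
child_candidates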
This is done for a single tree
Now we make a new bootstrapped dataset and build tree considering a random subset of vars at each step
Ideally we do this 100s of times
Because of the randomness associated with creating the bootstrapped dataset, and also due to choosing random columns at each step, RF results in a wide variety of DTs
This variety is what makes an RF more effective than a single DT
First we get data of a new patient
We want to predict if Heart disease or not
We take data and run down 1st tree
Output: Yes
We keep track of this result
Similarly we run the data through the 2nd tree, the 3rd tree, and so on through the last tree
We keep track of the results and see which option received most votes
Here Yes: 5 No : 1
So conclusion : YES
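A tiny R sketch of that vote tally (the votes are the hypothetical ones above):
votes <- c("Yes", "Yes", "No", "Yes", "Yes", "Yes")   # one vote per tree
table(votes)                     # No: 1, Yes: 5
names(which.max(table(votes)))   # the forest's conclusion: "Yes"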
Bagging: Bootstrapping the data plus using the aggregate (of all the trees' votes) to make a decision is called Bagging
When we created the bootstrapped dataset we allowed duplicate entries in the bootstrapped dataset
As a result, some entries (like the one shown above) were not included in the bootstrapped dataset
Typically about 1/3 the original data does not end up in the bootstrapped dataset
These entries are called the Out-of-Bag Dataset
We know the results of OoB data
Say there is only 1 entry in OoB data = No
We use them to test the RF
We run the data through our first DT
Result : No
Similarly we run it through all the trees and keep track of the results
Then we choose the most common result: here it is No, which is correct
We repeat the process for all OoB samples for all trees
Some may be incorrectly labeled
Accuracy: the proportion of OoB samples that were correctly classified by the RF
The proportion of OoB samples that were incorrectly classified is the OoB Error
We now know how to:
Build a RF
Use a RF
Estimate accuracy of RF
We used 2 vars to make a decision at each step
Now we can compare the OoB error for an RF built using 2 vars per step to an RF built using 3 vars per step
We can test many different settings and choose the most accurate RF
Process:
Build a RF
Est accuracy of RF
Change no of vars used per step
Repeat this a number of times and choose the RF that is most accurate
Typically we start with the square root of the number of variables and then try a few settings above and below that value
Lets see how RF deals with missing data
Missing data can be of 2 types: missing data in the original dataset used to build the RF, and missing data in a new sample that we want to classify
Lets start with Missing data in the original dataset:
We want to create a RF from the data
But we dont know if the 4th patient has BA or what is their wt
We make an initial guess that may be bad and gradually refine the guess until it is (hopefully) good
Initial guess for BA = most common value = No
Since wt is numeric our initial guess is the median val = 180
This is the dataset with the initial guesses
Now we want to refine our guesses
We do this by determining which samples are similar to the one with the missing data
Build a RF
Run all of the data down all of the trees
Lets start by running all of the data down the 1st tree:
Say samples 3 and 4 ended up in the same leaf node
That means they are similar (that is how similarity is defined in an RF)
We keep track of similar samples using a Proximity Matrix
The PM has a row for each sample and a column for each sample
Since samples 3 and 4 are similar, we put a 1 in the entry for that pair
Similarly we run all of the data down the 2nd tree
Now samples 2, 3 and 4 all ended up in the same leaf node
PM now:
We add 1 for each pair of samples that ends up in the same leaf node again
We run all the data down the 3rd tree
Updated PM:
Ultimately, we run the data down all the trees and the PM fills in
Then we divide each proximity value by total number of trees (say we had 10 trees)
Updated PM:
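A minimal R sketch of how the PM gets filled in, using hypothetical leaf assignments for the 4 samples across 3 trees:
leaf_per_tree <- list(
  c(s1 = 1, s2 = 2, s3 = 3, s4 = 3),   # tree 1: samples 3 and 4 share a leaf
  c(s1 = 1, s2 = 2, s3 = 2, s4 = 2),   # tree 2: samples 2, 3 and 4 share a leaf
  c(s1 = 1, s2 = 3, s3 = 3, s4 = 3)    # tree 3: samples 2, 3 and 4 share a leaf
)
prox <- matrix(0, nrow = 4, ncol = 4)
for (leaves in leaf_per_tree) {
  # add 1 for every pair of samples that lands in the same leaf of this tree
  prox <- prox + outer(leaves, leaves, FUN = "==")
}
diag(prox) <- 0               # ignore each sample paired with itself
prox / length(leaf_per_tree)  # divide by the number of trees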
Now we can use the proximity values for sample 4 to make better guesses about the missing data
For BA we calculate the weighted frequency of Yes and No, using the proximity values as weights
Calculations:
Freq of Yes = 1/3
Freq of No = 2/3
The weighted freq of Yes = Freq of Yes * the weight for Yes
The weight for Yes = (proximity for Yes) / (sum of all proximities)
The proximity for Yes = the proximity value for sample 2 (the only sample with Yes)
We divide that by the sum of the proximities for sample 4
The weight for Yes = 0.1 / (0.1 + 0.1 + 0.8) = 0.1
The weighted freq of Yes = 1/3 * 0.1 = 0.03
Similarly,
The weighted freq of No = Freq of No * the weight for No
The weight for No = (0.1 + 0.8) / (0.1 + 0.1 + 0.8) = 0.9
The weighted freq of No = 2/3 * 0.9 = 0.6
Since No has a much higher weighted frequency, we choose No
So our new, improved revised guess based on proximities for BA is No
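The same calculation in R, using the proximity values and BA values from the example above:
prox <- c(s1 = 0.1, s2 = 0.1, s3 = 0.8)        # proximities between sample 4 and the others
ba   <- c(s1 = "No", s2 = "Yes", s3 = "No")    # their Blocked Artery values
freq_yes   <- mean(ba == "Yes")                     # 1/3
freq_no    <- mean(ba == "No")                      # 2/3
weight_yes <- sum(prox[ba == "Yes"]) / sum(prox)    # 0.1 / 1.0 = 0.1
weight_no  <- sum(prox[ba == "No"])  / sum(prox)    # 0.9 / 1.0 = 0.9
freq_yes * weight_yes   # weighted freq of Yes ~= 0.03
freq_no  * weight_no    # weighted freq of No   = 0.6, so we guess No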
Filling in missing values for wt:
For wt we use proximities to calculate a weighted avg
Weighted avg = (weight of sample 1 * proximity weight of sample 1) + (weight of sample 2 * proximity weight of sample 2) + (weight of sample 3 * proximity weight of sample 3)
Proximity weight of sample 1 = (proximity of sample 1) / (sum of proximities) = 0.1 / (0.1 + 0.1 + 0.8) = 0.1
Similarly, we calculate the proximity weights of samples 2 and 3
Weighted avg = (125 * 0.1) + (180 * 0.1) + (210 * 0.8) = 198.5
The weights used to calculate the weighted average are based on the proximities
So we fill in the missing value as 198.5
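The same weighted average in R, using the numbers above:
prox <- c(s1 = 0.1, s2 = 0.1, s3 = 0.8)   # proximities between sample 4 and the others
wts  <- c(s1 = 125, s2 = 180, s3 = 210)   # their known weights
prox_weights <- prox / sum(prox)          # 0.1, 0.1, 0.8
sum(wts * prox_weights)                   # weighted average = 198.5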
Now that we have revised our guesses a little bit, we do the whole thing over again..
We have already seen this PM
This is the PM before we divided each value by 10
If samples 3 and 4 ended up in the same leaf node for all 10 trees:
We divide each number by 10
For Samples 3 and 4 the entry will be 1
1 in PM => samples are as close as they can be
Also
1 - prox value = distance
Thus it is possible to derive a Distance Matrix from the PM
Getting a distance matrix (which is similar to a correlation matrix) means we can draw heat maps
We can also draw MDS Plots
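A minimal R sketch of both plots; the built-in iris data is used here as a stand-in, since any randomForest built with proximity = TRUE works the same way:
library(randomForest)
set.seed(42)
rf <- randomForest(Species ~ ., data = iris, proximity = TRUE)
dist_matrix <- 1 - rf$proximity           # 1 - proximity value = distance
heatmap(dist_matrix)                      # heat map of the distance matrix
# classical multidimensional scaling (MDS) on the distance matrix
mds <- cmdscale(as.dist(dist_matrix), k = 2)
plot(mds, xlab = "MDS1", ylab = "MDS2", col = iris$Species)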
Imagine that we have already built a RF and we wanted to classify a new patient
But the patient has missing data for BA
We dont know if patient has BA
So we need to make a guess about BA so that we can run the patient down all the trees in the forest
To do this, we create 2 copies of the patient's data: one that assumes the patient has heart disease and one that assumes they don't, and we fill in the missing BA value for each copy using the iterative, proximity-based method described above
Then we run the 2 samples down the trees in the forest
Then we see which of the 2 is correctly labeled by the RF the most times
The sample that is correctly labeled more often wins, and its assumed heart-disease status is our prediction for the patient
# Import libraries:
library(ggplot2)
# cowplot improves ggplot2's default settings
library(cowplot)
library(randomForest)
# install.packages('cowplot', repos='http://cran.us.r-project.org')
We are going to use heart disease dataset from UCI ML repo
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
data <- read.csv(url, header = FALSE)
head(data)
Let's label the columns:
Only 14 of the attributes are used (the list below is from the UCI documentation):
-- 1. #3 (age)
-- 2. #4 (sex)
-- 3. #9 (cp)
-- 4. #10 (trestbps)
-- 5. #12 (chol)
-- 6. #16 (fbs)
-- 7. #19 (restecg)
-- 8. #32 (thalach)
-- 9. #38 (exang)
-- 10. #40 (oldpeak)
-- 11. #41 (slope)
-- 12. #44 (ca)
-- 13. #51 (thal)
-- 14. #58 (num) (the predicted attribute)
3 age: age in years
4 sex: sex (1 = male; 0 = female)
9 cp: chest pain type -- Value 1: typical angina -- Value 2: atypical angina -- Value 3: non-anginal pain -- Value 4: asymptomatic
10 trestbps: resting blood pressure (in mm Hg on admission to the hospital)
12 chol: serum cholestoral in mg/dl
16 fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
19 restecg: resting electrocardiographic results -- Value 0: normal -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) -- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
32 thalach: maximum heart rate achieved
38 exang: exercise induced angina (1 = yes; 0 = no)
40 oldpeak = ST depression induced by exercise relative to rest
41 slope: the slope of the peak exercise ST segment -- Value 1: upsloping -- Value 2: flat -- Value 3: downsloping
44 ca: number of major vessels (0-3) colored by flourosopy
51 thal: 3 = normal; 6 = fixed defect; 7 = reversable defect
58 num: diagnosis of heart disease (angiographic disease status) -- Value 0: < 50% diameter narrowing -- Value 1: > 50% diameter narrowing (in any major vessel: attributes 59 through 68 are vessels)
colnames(data) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak",
"slope", "ca", "thal", "hd")
head(data)
str() function gives us the structure of the data
str(data)
Some of the cols are messed up
sex is supposed to be a factor where 0: female and 1: male
cp is supposed to be a factor, where levels 1-3 represent different types of pain and 4 represents no chest pain
fbs is supposed to be a factor
restecg is supposed to be a factor
exang is supposed to be a factor
slope is supposed to be a factor
ca and thal are correctly called factors but one of the levels is "?" when we need it to be NA
data[data == '?'] <- NA
To make data easier on the eye, convert 0s in sex to F and 1s to M
Then convert the col into a factor
data[data$sex == 0,]$sex <- "F"
data[data$sex == 1,]$sex <- "M"
data$sex <- as.factor(data$sex)
head(data)
We convert the other cols into factors:
data$cp = as.factor(data$cp)
data$fbs = as.factor(data$fbs)
data$restecg = as.factor(data$restecg)
data$exang = as.factor(data$exang)
data$slope = as.factor(data$slope)
Since the ca and thal columns had "?" in them, R read them in as columns of strings
We convert these columns to integers and then convert them to factors
data$ca <- as.integer(data$ca)
data$ca <- as.factor(data$ca)
data$thal <- as.integer(data$thal)
data$thal <- as.factor(data$thal)
The last thing is to convert hd (heart disease) so that 0 becomes "Healthy" and everything else becomes "Unhealthy"
data$hd <- ifelse(test = data$hd == 0, yes = "Healthy", no = "Unhealthy")
data$hd <- as.factor(data$hd)
head(data)
str(data)
Since we are going to be randomly sampling things, let's set the seed for the random number generator so that we can reproduce our results
set.seed(42)
Now we impute values for the NAs in the dataset with rfImpute()
The 1st arg is hd ~ .
This means that we want the hd col to be predicted by the data in the other cols
data specifies the dataset
iter = 6: here we specify how many RFs rfImpute() should build to estimate the missing values
In theory, 4-6 iterations are enough
Lastly, we save the result, i.e. the dataset with imputed values instead of NAs, as data.imputed
data.imputed = rfImpute(hd ~ ., data = data, iter = 6)
head(data.imputed)
After each iteration rfImpute() prints the Out-of-Bag(OOB) error rate
This should get smaller if the estimates are improving
Now that we have imputed the values, we build a RF
model <- randomForest(hd ~ ., data = data.imputed, proximity = TRUE)
The 1st arg is hd ~ .
This means that we want the hd col to be predicted by the data in the other cols
We also want randomForest() to return the PM
We will use this to cluster the samples
Lastly, we save the randomForest and associated data, like the PM, as model
Get summary of RF and how well it performed:
model
Type of random forest: classification
If we had used the RF to predict wt or ht it would say "regression"
If we had omitted the thing RF was supposed to predict entirely, it would say "unsupervised"
Number of trees: 500: how many trees are in RF
No. of variables tried at each split: 3
Classification trees have a default setting of the square root of the number of variables
Regression trees have a default setting of the number of variables divided by 3
OOB estimate of error rate: 16.5% : This means that 83.5% of the OoB samples were correctly classified by the RF
          Healthy Unhealthy class.error
Healthy       141        23   0.1402439
Unhealthy      27       112   0.1942446
This is the Confusion Matrix
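A quick arithmetic check of that 16.5% figure from the confusion matrix entries:
conf <- matrix(c(141, 27, 23, 112), nrow = 2,
               dimnames = list(c("Healthy", "Unhealthy"), c("Healthy", "Unhealthy")))
sum(conf[row(conf) != col(conf)]) / sum(conf)   # (23 + 27) / 303 ~= 0.165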
head(model$err.rate)
nrow(model$err.rate)
Each row in model$err.rate reflects the error rates at diff stages of creating the RF
The 1st row contains error rates after making 1st tree
2nd row contains error rates after making 1st 2 trees
... and so on
last row contains error rates after making all 500 trees
We want to construct a df which has the type of error in the rows rather than the cols
# Quick demo of rep(): "each" repeats every element in place, "times" repeats the whole vector
print(rep(c(2,4), each = 4))   # 2 2 2 2 4 4 4 4
print(rep(c(2,4), times = 4))  # 2 4 2 4 2 4 2 4
Creating col: Type
Type = rep(c("OOB", "Healthy", "Unhealthy"), each = nrow(model$err.rate))
Type
Creating col: Trees
Trees = rep(1:nrow(model$err.rate), times = 3)
Trees
Error = c(model$err.rate[, "OOB"],
model$err.rate[, "Healthy"],
model$err.rate[, "Unhealthy"])
length(Error)
Making the df:
oob.error.data <- data.frame(Trees = Trees, Type = Type, Error = Error)
head(oob.error.data)
nrow(oob.error.data)
Now we plot this error
ggplot(data = oob.error.data, aes(x=Trees, y=Error)) + geom_line(aes(color = Type))
The blue line shows the error rates while classifying Unhealthy patients
The green line shows the overall OOB error rate, so it sits between the other 2 (roughly their average)
The red line shows the error rates while classifying healthy patients
We see that, in general, the error rate decreases as the RF gets more trees
If we added more trees would the error rate go down further?
Lets make a RF with 1000 trees
model2 <- randomForest(hd ~ ., data = data.imputed, ntree = 1000, proximity = TRUE)
model2
OOB error rate is same as before
And confusion matrix tells us we did no better than before
Type = rep(c("OOB", "Healthy", "Unhealthy"), each = nrow(model2$err.rate))
Trees = rep(1:nrow(model2$err.rate), times = 3)
Error = c(model2$err.rate[, "OOB"],
model2$err.rate[, "Healthy"],
model2$err.rate[, "Unhealthy"])
oob.error.data <- data.frame(Trees = Trees, Type = Type, Error = Error)
ggplot(data = oob.error.data, aes(x=Trees, y=Error)) + geom_line(aes(color = Type))
The error rates stabilize right after 500 trees
So adding more trees would not help
But we would not have known this had we not added more trees
Now we check whether we are using the optimal number of variables at each internal node; this is set with the parameter mtry
# Create an empty numeric vector to hold the OOB error rates:
oob.values <- vector(mode = "numeric", length = 10)
for (i in 1:10) {
  # build an RF using "i" as the number of variables to try at each step
  temp.model <- randomForest(hd ~ ., data = data.imputed, mtry = i, ntree = 1000)
  # print(temp.model)
  # the OOB error rate we want is in the 1st column ("OOB") of the last row
  # (i.e. after all 1000 trees have been built)
  oob.value <- temp.model$err.rate[nrow(temp.model$err.rate), 1]
  # store the OOB error rate for the current model
  oob.values[i] <- oob.value
}
oob.values
The 3rd value, i.e. number of variables = 3, gives the lowest OOB error rate and is the optimal value
Coincidentally this is the default value
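A sketch of refitting the forest with that optimal setting (final.model is just a name chosen here for the refit):
final.model <- randomForest(hd ~ ., data = data.imputed,
                            ntree = 1000, mtry = which.min(oob.values),
                            proximity = TRUE)
final.model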