
I need you to implement, in the R file I already started, at least 10 supervised and unsupervised learning models (logistic regression, random forest, decision tree, Lasso, XGBoost, PCA, K-means, KNN, and others). I want you to comment every step and every graph, and include a final prediction and solution showing the results. At the end, send me the R file with all the code, and render the code to a PDF or HTML file. Thank you! It is a very important assignment I need to do. Here are some links to examples: https://www.kaggle.com/code/ysjang0926/analysis-of… , https://www.kaggle.com/code/djbacad/r-logistic-reg… , https://www.kaggle.com/code/rankirsh/predicting-at… , https://www.kaggle.com/code/esmaeil391/ibm-hr-anal… , https://rstudio-pubs-static.s3.amazonaws.com/38217… , https://www.kaggle.com/code/sidliang/predicting-ib…

Posted in R


Linked are data collected during a colleague’s research on type 2 diabetes in rural Guatemala: diabetes final workbook – 11 Nov 2015.xlsx
Data dictionary here: diabetes variables.docx
Please use the R Graph Gallery to create 2 distinct graphs using this dataset. We leave the choice of which graphs up to you. We would like them well-formatted, with proper labeling for the legend/title/axes. One must be faceted. Then explain your graphs and what they show you.
Upload your submission as a markdown or quarto document (.Rmd or .qmd) or a rendered version of those documents (HTML, PDF). Submissions must include code, graphs, and comments.
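For the faceted requirement, a minimal sketch using ggplot2's built-in mpg data as a stand-in for the diabetes workbook (the variable names here are placeholders, not columns from the assignment's dataset):

```r
# Minimal faceted-plot sketch; mpg is a placeholder for the diabetes data.
library(ggplot2)

g <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~ drv) +  # one panel per group -- swap in a variable from the workbook
  labs(title = "Highway mileage vs. engine displacement by drive type",
       x = "Engine displacement (L)",
       y = "Highway miles per gallon")
print(g)
```

The same `facet_wrap()` call works unchanged once the diabetes columns are substituted in.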

Posted in R


All the information you need is in the Zip file. Please read each file carefully.
Don’t use ChatGPT.
1- Develop an R code to use the Penman-Monteith equation (using the step-by-step calculation of the Penman-Monteith file that I sent to you) to calculate the crop water requirements (ET) for 1) rice, 2) peanuts, and 3) cauliflower crops to be grown near Beaumont, Texas. Assume the planting date for cauliflower is the 14th of September, rice is March 15th, and peanuts is May 1st.
2- Compare the crop water requirements for a drought year (2011) and a wet year (2015).
Deliverables:
1- Submit a single Word document, with formatting, containing answers to the questions enumerated or the items required.
2- Submit all the data you downloaded and used in R (the Excel files). (The most important part; you have to find the data. Mention the name of the site from which the data was collected.)
3- Submit the .R code file, and a text file of the R code.
Note: Read the files carefully: 1- step-by-step calculation of the Penman-Monteith; 2- where to get data
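As context for step 1, the daily FAO-56 form of the Penman-Monteith reference-ET equation can be sketched as a small R function. This is a hedged illustration: the function name and argument layout are assumptions (the assignment's own step-by-step file governs), and crop water requirements would still need each crop's coefficient Kc and planting-date window.

```r
# Sketch of the FAO-56 daily reference evapotranspiration ET0 (mm/day).
# All arguments are assumed inputs, not values from the assignment files:
#   delta  slope of the saturation vapour pressure curve (kPa/degC)
#   Rn, G  net radiation and soil heat flux (MJ m^-2 day^-1)
#   gamma  psychrometric constant (kPa/degC)
#   Tmean  mean daily air temperature (degC)
#   u2     wind speed at 2 m height (m/s)
#   es, ea saturation and actual vapour pressure (kPa)
penman_monteith_et0 <- function(delta, Rn, G, gamma, Tmean, u2, es, ea) {
  num <- 0.408 * delta * (Rn - G) +
    gamma * (900 / (Tmean + 273)) * u2 * (es - ea)
  den <- delta + gamma * (1 + 0.34 * u2)
  num / den
}

# Illustrative numeric inputs only (not Beaumont data)
et0 <- penman_monteith_et0(delta = 0.122, Rn = 13.28, G = 0.14, gamma = 0.0666,
                           Tmean = 16.9, u2 = 2.078, es = 1.997, ea = 1.409)
```

Crop water requirement would then be ETc = Kc * ET0, accumulated daily over each crop's growing season.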

Posted in R


Use the below code to solve the questions. I am expecting an R file and a Word document.
if (!require(mlba)) {
library(devtools)
install_github("gedeck/mlba/mlba", force=TRUE)
}
options(scipen=999)
# Logistic Regression
## The Logistic Regression Model
library(ggplot2)
library(gridExtra)
p <- seq(0.005, 0.995, 0.01)
df <- data.frame(
  p = p,
  odds = p / (1 - p),
  logit = log(p / (1 - p))
)
g1 <- ggplot(df, aes(x=p, y=odds)) + geom_line() +
  coord_cartesian(xlim=c(0, 1), ylim=c(0, 100)) +
  labs(x='Probability of success', y='Odds', title='(a)') +
  geom_hline(yintercept = 0) +
  theme_bw() +
  theme(axis.line = element_line(colour = "black"),
        axis.line.x = element_blank(), panel.border = element_blank())
g2 <- ggplot(df, aes(x=p, y=logit)) + geom_line() +
  coord_cartesian(xlim=c(0, 1), ylim=c(-4, 4)) +
  labs(x='Probability of success', y='Logit', title='(b)') +
  geom_hline(yintercept = 0) +
  theme_bw() +
  theme(axis.line = element_line(colour = "black"),
        axis.line.x = element_blank(), panel.border = element_blank())
grid.arrange(g1, g2, ncol=2)
## Example: Acceptance of Personal Loan
### Model with a Single Predictor
bank.df <- mlba::UniversalBank
g <- ggplot(bank.df, aes(x=Income, y=Personal.Loan)) +
  geom_jitter(width=0, height=0.01, alpha=0.1) +
  geom_function(fun=function(x){ return(1 / (1 + exp(6.04892 - 0.036*x))) }) +
  xlim(0, 250) +
  labs(x='Income (in $000s)') +
  theme_bw()
g
# Z Obtain coefficients for Personal.Loan ~ Income
glm.model.income <- glm(Personal.Loan ~ Income, data = bank.df, family = binomial)
glm.model.income
###############################################
### Estimating the Logistic Model from Data: Computing Parameter Estimates
#### Estimated Model
library(caret)
library(tidyverse)
# load and preprocess data
bank.df <- mlba::UniversalBank %>%
select(-c(ID, ZIP.Code)) %>% # Drop ID and zip code columns.
mutate(
Education = factor(Education, levels=c(1:3),
                   labels=c("Undergrad", "Graduate", "Advanced/Professional")),
Personal.Loan = factor(Personal.Loan, levels=c(0, 1),
                       labels=c("No", "Yes"))
)
# partition data
set.seed(2)
idx <- caret::createDataPartition(bank.df$Personal.Loan, p=0.6, list=FALSE)
train.df <- bank.df[idx, ]
holdout.df <- bank.df[-idx, ]
# build model with 5-fold cross-validation
trControl <- caret::trainControl(method="cv", number=5, allowParallel=TRUE)
# fit logistic regression with a generalized linear model
logit.reg <- caret::train(Personal.Loan ~ ., data=train.df, trControl=trControl,
                          method="glm", family="binomial")
logit.reg
summary(logit.reg$finalModel)
## Evaluating Classification Performance
### Interpreting Results in Terms of Odds (for a Profiling Goal)
# use predict() with type = "prob" to compute predicted probabilities.
logit.reg.pred <- predict(logit.reg, holdout.df[, -8], type = "prob")
str(holdout.df)
# display four different cases
interestingCases <- c(1, 12, 32, 1333)
results <- data.frame(
  actual = holdout.df$Personal.Loan[interestingCases],
  p0 = logit.reg.pred[interestingCases, 1],
  p1 = logit.reg.pred[interestingCases, 2],
  predicted = ifelse(logit.reg.pred[interestingCases, 2] > 0.5, 1, 0)
)
# Z evaluate performance ##################################################################
# predict training set
logit.reg.pred.train <- predict(logit.reg, train.df[, -8], type = "prob")
predicted.train <- factor(ifelse(logit.reg.pred.train[, 2] > 0.5, 1, 0),
                          levels = c(0, 1), labels = c("No", "Yes"))
# accuracy train
confusionMatrix(predicted.train, train.df$Personal.Loan)
# predict holdout
predicted.holdout <- factor(ifelse(logit.reg.pred[, 2] > 0.5, 1, 0),
                            levels = c(0, 1), labels = c("No", "Yes"))
# accuracy holdout
confusionMatrix(predicted.holdout, holdout.df$Personal.Loan)
#########################################################################################
library(gains)
actual <- ifelse(holdout.df$Personal.Loan == "Yes", 1, 0)
# Z comments
# when you specify the groups argument below, it typically represents the number
# of distinct or unique values in your predicted probabilities that you want to
# use for creating the cumulative gains chart.
n_distinct(logit.reg.pred[, 2])
gain <- gains(actual, logit.reg.pred[, 2], groups=length(actual)-2)
# plot gains chart
nactual <- sum(actual)
g1 <- ggplot() +
  geom_line(aes(x=gain$cume.obs, y=gain$cume.pct.of.total * nactual)) +
  geom_line(aes(x=c(0, max(gain$cume.obs)), y=c(0, nactual)), color="darkgrey") +
  labs(x="# Cases", y="Cumulative")
# plot decile-wise lift chart
gain10 <- gains(actual, logit.reg.pred[, 2], groups=10)
g2 <- ggplot(mapping=aes(x=gain10$depth, y=gain10$lift / 100)) +
  geom_col(fill="steelblue") +
  geom_text(aes(label=round(gain10$lift / 100, 1)), vjust=-0.2, size=3) +
  ylim(0, 8) +
  labs(x="Percentile", y="Lift")
grid.arrange(g1, g2, ncol=2)
### Z residual plot
# Fit a logistic regression model
model <- glm(Personal.Loan ~ ., data = bank.df, family = binomial(link = "logit"))
summary(model)
residuals <- residuals(model, type = "response")
# Create a residual plot
plot(model$fitted.values, residuals, ylab = "Residuals", xlab = "Fitted Values",
     main = "Logistic Regression Residuals vs. Fitted")
abline(h = 0, col = "red", lty = 2)  # add a horizontal line at y = 0 for reference
# Calculate predicted probabilities; same as the model's fitted values
predicted_probs <- predict(model, type = "response")
predicted_probs[1:5]
model$fitted.values[1:5]
# Define bin intervals (adjust bins and breaks as needed)
bins <- cut(predicted_probs, breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1.0), labels = FALSE)
# Calculate binned residuals
binned_residuals <- data.frame(
  PredictedProb = predicted_probs,
  ActualOutcome = ifelse(bank.df$Personal.Loan == "Yes", 1, 0),  # the actual outcome variable
  Bin = bins
)
# Example: Calculate mean residuals for each bin
library(dplyr)
binned_residuals_summary <- binned_residuals %>%
  group_by(Bin) %>%
  summarize(MeanResidual = mean(ActualOutcome - PredictedProb))
# Visualize mean residuals
barplot(binned_residuals_summary$MeanResidual, names.arg = binned_residuals_summary$Bin,
        xlab = "Bin", ylab = "Mean Residual", main = "Mean Residuals by Bin")
# Load the ggplot2 package if not already loaded
library(ggplot2)
# Assuming you have calculated binned residuals as described earlier
# Create a scatter plot of binned residuals
binned_residuals_summary %>% ggplot(aes(x = Bin, y = MeanResidual)) +
  geom_col() +
  labs(x = "Predicted Probabilities", y = "Binned Residuals") +
  ggtitle("Binned Residuals by Bin") +
  theme_minimal()
library(arm)
# confidence bands
binnedplot(predicted_probs,residuals)
# SIMULATION
library(ggplot2)
library(reshape)
library(glmnet)
rm(list=ls())
set.seed(1)
x = rnorm(1000, sd=3) # A random variable
hist(x)
summary(x)
p = 1/(1+exp(-(1+10*x)))
hist(p)
plot(x,p)
y = rbinom(1000,1,p) # bernoulli response variable
plot(x,y)
# two approaches to set outcome categories
# 1 cutoff
#y = ifelse(p>=0.5,1,0)
# bernoulli response
#set.seed(1)
#prob <- c(0.3, 0.5, 0.7, 0.4, 0.6)
#result <- rbinom(5, 1, prob)
#prob
#result
data.plot = data.frame(y, p, x)
# plot probability, category
data.plot %>% ggplot(aes(x)) +
geom_line(aes(y=p)) +
geom_point(aes(y=y,color=factor(y)))+
theme_bw()
# plot category
data.plot %>% ggplot(aes(x,y,color=y)) +
geom_point() +
theme_bw()
# plot linear regression
data.plot %>% ggplot(aes(x,y,color=y)) +
geom_point() +
geom_smooth(method='lm', se=FALSE) +
theme_bw()
# plot glm
data.plot %>% ggplot(aes(x,y,color=y)) +
geom_point() +
geom_smooth(method='glm',
            method.args = list(family = "binomial"),
se=FALSE) +
theme_bw()
# get coefficients through linear regression
logit = log(p/(1-p))
plot(x,logit)
data = data.frame(cbind(logit,x))
summary(data)
# remove infinite in all columns:
data = data %>%
filter_all(all_vars(!is.infinite(.)))
summary(data)
logit.model = lm(logit~x,data)
logit.model
# get coefficients through logistic regression via glm
data.glm = t(rbind(as.numeric(y),as.numeric(x)))
data.glm = data.frame(data.glm)
names(data.glm) = c("y", "x")
#data.glm = data.glm %>% rename(y=X1,x=X2)
data.glm$y = as.factor(data.glm$y)
summary(data.glm)
plot(data.glm$x,data.glm$y)
set.seed(1)
glm.model <- glm(y ~ x, data = data.glm, family = binomial)
glm.model

Posted in R


All the information you need is in the Word files and the Google Drive link.
Download 15-minute storm data for Conroe, TX. How do the field hyetographs you developed compare to those from the Huff Method?
Develop a program to calculate monthly precipitation indices for Beaumont, TX using PRISM data. Compare and contrast the indices for Amarillo (download Amarillo data and rework). Make sure you comment on why the differences occur from a hydrological standpoint.
Use daily precipitation, temperature, and relative humidity data from Amarillo, TX to partition snow and non-snow days in the years 2011–2015. Compare and contrast the results using data from PRISM and data from Weather Underground. Make sure you comment on why the differences occur.
How well do you think the approach worked based on what was reported in the field?

Posted in R


Deliverable: Once you have completed the assignment below, please submit a single document with your two maps and reflection using this Canvas submission portal.
Learning Objectives:
Use health data to produce maps in R
Consider the choices cartographers make when creating a map
Practice putting map elements together to make a cohesive argument through a map
Description:
Students are expected to complete three exercises over the course of the quarter. This is the first of those. In this exercise, students are introduced to data manipulation and mapping in R. R is a free software environment and programming language that is widely used by geographers and statisticians for both data analysis and mapping. This exercise relies on the Intro to R tutorial provided to you which uses census data and walks students through taking that data from a table and getting it into a format in which it can be mapped in R. Thus, that tutorial and this exercise are designed to set you up to be able to make your own maps in R from any census data you wish (something you might find useful in making maps for your atlases).
This exercise is worth 10% of your course grade and will be graded out of 10 points.
Instructions:
Answer the questions that follow. After you have made your two maps (one each in response to the two questions below), you are asked to reflect on the experience following the prompt.
Question 1: The Affordable Care Act, also known as Obamacare, was introduced by President Obama in 2010. One of the major tools it employed was the individual mandate which required all individuals to be enrolled in health insurance or to pay a penalty on their taxes. This individual mandate went into effect in 2014. When President Trump came into office, he repealed that individual mandate starting in 2019. So, in 2019 and since then, there has no longer been a tax penalty for anyone who does not have health insurance. As health geographers, we want to know what impact the repeal of the individual mandate had and how that effect varies geographically. Please make a choropleth map showing the percent change in the number of people enrolled in health insurance between 2018 (the year before the individual mandate was repealed when we would expect health insurance rates to be highest) and 2021 (the most recent data available to us) by county. It is up to you whether and how you classify that data, what colors you use, etc., but you’ll want to make sure that your map effectively communicates the information (upon which you will be graded) and you’ll want to be conscious of the choices you are making as a cartographer (which will come in handy in the reflection below).
Question 2: The Americans with Disabilities Act (ADA) was first enacted in 1990. Since then, it has been interpreted in many different ways in different places. While it is a federal statute, the ADA is primarily enacted by states. Because different states have interpreted the law differently and enacted different programs in support of the law, some states have come to be known as much friendlier to people with disabilities (i.e. providing more services, in more accessible ways) than others. As such, there is a geographic variability in where people with disabilities choose to live (and where people tend to be diagnosed with disabilities and access services). Please make a choropleth map showing the 2021 distribution of the population with disabilities by state. Again, it is up to you how you make your map, but you will be assessed on how clearly it conveys the information. As you work, make note of the choices you are making to include in your reflection (see below).
Question 3: Finally, please write a reflection (250 – 400 words) that considers:
What choices did you make as a cartographer?
Why / how did you make them?
How did the technology you were using (R / RStudio) constrain or support those choices?
What impact did the choices you made and the technology you used have on the story your maps tell?
Data:
Data has been provided for you in the form of two shapefiles: Ex1A.shp and Ex1B.shp. Below you can find a glossary of the data provided:
From Ex1A.shp:
GEOID: geographic identifier (assigned by the Census Bureau)
NAME: name of the county and states
STFIPS: state FIPS code (geographic identifier)
COUNTY: name of the county
STATE: name of the state
YEAR: year on which the estimate is based
POPULATION: estimate of total population
DIS19: total population under the age of 19 with a disability
DIS1964: total population between the ages of 19 and 64 with a disability
DIS65: total population over the age of 65 with a disability
From Ex1B.shp:
GEOID: geographic identifier (assigned by the Census Bureau)
NAME: name of the county and states
STFIPS: state FIPS code (geographic identifier)
COUNTY: name of the county
STATE: name of the state
YEAR: year on which the estimate is based
CIVILPOP: the total non-institutionalized civilian population
UNINSURED: the total non-institutionalized civilian population that is uninsured
You can find the shapefiles in a zipped file here. In that zipped file, you will also find an R script containing the info in this assignment to get you started.
Rubric
Exercise 1 Rubric
Question 1 (4 pts): How well does your map communicate the information and answer the question?
Excellent (4 pts): Map does a great job of answering the question. Information is presented clearly and effectively. Map is aesthetically pleasing and does not include superfluous map elements.
Good (3 pts): Map answers the question, but could be a little clearer. It may contain superfluous map elements or distracting map features, but it does answer the question.
Acceptable (2 pts): Map contains information relevant to the question, but doesn’t fully answer it. Map may lack clarity.
Attempted (1 pt): If you attempt to make a map, you will earn at least one point, even if you do not successfully answer the question with a map.
No Map/Answer Included (0 pts).
Question 2 (4 pts): How well does your map communicate the information and answer the question?
Excellent (4 pts): Map does a great job of answering the question. Information is presented clearly and effectively. Map is aesthetically pleasing and does not include superfluous map elements.
Good (3 pts): Map answers the question, but could be a little clearer. It may contain superfluous map elements or distracting map features, but it does answer the question.
Acceptable (2 pts): Map contains information relevant to the question, but doesn’t fully answer it. Map may lack clarity.
Attempted (1 pt): If you attempt to make a map, you will earn at least one point, even if you do not successfully answer the question with a map.
No Map/Answer Included (0 pts).
Question 3 (2 pts): Does your reflection touch on all of the points of Question 3 and present a coherent reflection on the exercise process and the limitations of the technology?
Excellent (2 pts): Reflection shows significant thought and addresses all aspects of the question. Reflection is clear and brings in terms and/or concepts from class.
Acceptable (1 pt): Reflection addresses the prompts in the question but may not be complete or could be strengthened with more attention.
No Reflection Included (0 pts).
Total Points: 10

Posted in R


Week 9 Data Visualization
Complete the following:
Download PowerBI or Tableau.
Connect to the Excel dataset provided in this exercise.
Create 4 visualizations with the data.
Write a short narrative about what the data means and why you chose the visualizations/colors that you did.

Posted in R


Predicting Prices of Used Cars (Regression Trees). The dataset mlba::ToyotaCorolla contains data on used cars (Toyota Corolla) on sale during late summer of 2004 in the Netherlands. It has 1436 records containing details on 38 variables, including Price, Age, Kilometers, HP, and other specifications. The goal is to predict the price of a used Toyota Corolla based on its specifications. (The example in Section 9.7 is a subset of this dataset.) Use the code below to read the data:
car.df <- mlba::ToyotaCorolla
Data Preprocessing. Split the data into training (60%) and holdout (40%) datasets; use set.seed(1) for this purpose.
a. Run a regression tree (RT) with outcome variable Price and predictors Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco, Automatic_Airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar. Keep the minimum number of records in a terminal node at 1, the maximum number of tree levels at 30, and cp = 0.001, to make the run least restrictive. Which appear to be the three or four most important car specifications for predicting the car's price? Note: for the rpart function, the important variables are saved in the model object under $variable.importance.
b. Compare the prediction errors of the training and holdout sets by examining their RMSE and by plotting the two boxplots. How does the predictive performance of the holdout set compare to the training set? Why does this occur? How might we achieve better holdout predictive performance at the expense of training performance? Note: you only need to answer the question; no coding is required.
c. Create a smaller tree by leaving the arguments cp, minbucket, and maxdepth at their defaults. Compared to the deeper tree, what is the predictive performance on the holdout set?
d. Let us see the effect of turning the Price variable into a categorical variable. First, create a new variable that categorizes Price into 20 bins. Now repartition the data, keeping Binned_Price instead of Price. Run a classification tree (CT) with the same set of input variables as in the RT, and with Binned_Price as the output variable. As in the less deep regression tree, leave the arguments cp, minbucket, and maxdepth at their defaults. Use the code below to create the bins and binned price. Note: train.index and holdout.index are from your data partition and you may name the indexes differently.
# categorical price and binned price
bins <- seq(min(car.df$Price), max(car.df$Price),
            (max(car.df$Price) - min(car.df$Price))/20)
Binned_Price <- .bincode(car.df$Price, bins, include.lowest = TRUE)
Binned_Price <- as.factor(Binned_Price)
train.df$Binned_Price <- Binned_Price[train.index]
holdout.df$Binned_Price <- Binned_Price[holdout.index]
e. Compare the smaller tree generated by the CT with the smaller tree generated by the RT. Are they different? (Look at the structure, the top predictors, the size of the tree, etc.) Why?
f. Predict the price, using the smaller RT and the CT, of a used Toyota Corolla with the specifications listed in the table below. TABLE: SPECIFICATIONS FOR A PARTICULAR TOYOTA COROLLA (picture uploaded below). Compare the predictions in terms of the predictors that were used, the magnitude of the difference between the two predictions, and the advantages and disadvantages of the two methods.

Posted in R


Do question 6
For this midterm, answer the questions below and answer using what you’ve learned this semester. You must upload your R code as a .R file or markdown file. Please provide an explanation describing your results in the comments or markdown text. Any graphs and tables must have appropriate labels and titles and not show direct variable names.
This is a group assignment – you will submit one file per group. Your code must run when given the data provided (we can set the working directory etc.). If not, you will lose 25 points.
1. Using the “Oxygen_Delivery” sheet, find the maximum Exhaled Tidal volume per patient per day. Is the average maximum tidal volume across patients higher in March of 2020, or May of 2020?
2. Recategorize the Race variable as African-American or Other. Which lab from the “Long_Labs” sheet has the greatest difference between averages given this grouping?
3. Was there a difference in vital signs (systolic bp, diastolic bp, map cuff/arterial, heart rate, respiratory rate, temperature) among those who died and those who survived? Please write code to capture the median of each vital for each group.
4. Is there a difference in the IQR of patients’ initial c-reactive protein, d-dimer, or white blood cell count based on the patient’s individual code status? Use code to support your answer.
5. Categorize BMI as Underweight (<=18), Normal (<=25), Overweight (<=30), and Obese (>30). Which of these groups had the highest death rate (people who died / people within group)?
6. Filter the “Long_Labs” sheet so the only values that remain are between a patient’s ICU admission and discharge (ICU_Admit_DEID and ICU_DC_DEID from “Main_Dataset”). Summarize the distribution and characteristics of the various values remaining.
7. Develop one question about any of the sheets (or multiples of them) and answer it with the data. Have it approved in email by Alex or Chad.
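The BMI binning in question 5 can be sketched with cut(), using the thresholds from the prompt; the bmi vector here is a placeholder, not data from the assignment's sheets:

```r
# Placeholder BMI values; replace with the BMI column from the dataset.
bmi <- c(17, 22, 27, 35)
# Right-closed intervals match the prompt's <=18, <=25, <=30, >30 cutoffs.
bmi_group <- cut(bmi,
                 breaks = c(-Inf, 18, 25, 30, Inf),
                 labels = c("Underweight", "Normal", "Overweight", "Obese"))
```

The resulting factor can then be grouped against the death indicator to compute each group's death rate.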

Posted in R


USE THE BELOW CODE TO SOLVE THE QUESTIONS:
if (!require(mlba)) {
library(devtools)
install_github("gedeck/mlba/mlba", force=TRUE)
}
options(scipen=999)
# Classification and Regression Trees
## Classification Trees
### Example 1: Riding Mowers
library(rpart)
library(rpart.plot)
mowers.df <- mlba::RidingMowers
# control parameters: maxdepth, minsplit
# Z add default tree
class.tree.default <- rpart(Ownership ~ ., data = mowers.df, method = "class")
rpart.plot(class.tree.default, extra=1, fallen.leaves=FALSE)
rpart.rules(class.tree.default)
########################################
class.tree <- rpart(Ownership ~ ., data = mowers.df,
                    control = rpart.control(minsplit = 0),  # Z try 0, 2, 7
                    method = "class")
rpart.plot(class.tree, extra=1, fallen.leaves=FALSE)
rpart.rules(class.tree)
setwd("C:/Users/Dell/Desktop/CU Predictive Analytics 8 2023/Lectures")
plot_common_styling <- function(g, filename) {
  g <- g + geom_point(size=2) +
    scale_color_manual(values=c("darkorange", "steelblue")) +
    scale_fill_manual(values=c("darkorange", "lightblue")) +
    labs(x="Income ($000s)", y="Lot size (000s sqft)") +
    theme_bw() +
    theme(legend.position=c(0.89, 0.91), legend.title=element_blank(),
          legend.key=element_blank(), legend.background=element_blank())
  ggsave(file=file.path("figures", "chapter_09", filename), g,
         width=5, height=3, units="in")
  return(g)
}
g <- ggplot(mowers.df, mapping=aes(x=Income, y=Lot_Size,
                                   color=Ownership, fill=Ownership))
plot_common_styling(g, "mowers_tree_0.pdf")
g <- g + geom_vline(xintercept=59.7)
plot_common_styling(g, "mowers_tree_1.pdf")
g <- g + geom_segment(x=59.9, y=21, xend=25, yend=21, color="black")
plot_common_styling(g, "mowers_tree_2.pdf")
g <- g + geom_segment(x=59.9, y=19.8, xend=120, yend=19.8, color="black")
plot_common_styling(g, "mowers_tree_3.pdf")
g <- g + geom_segment(x=84.75, y=19.8, xend=84.75, yend=10, color="black")
plot_common_styling(g, "mowers_tree_4.pdf")
g <- g + geom_segment(x=61.5, y=19.8, xend=61.5, yend=10, color="black")
plot_common_styling(g, "mowers_tree_5.pdf")
### Measures of Impurity
#### Normalization
ggplot() +
  scale_x_continuous(limits=c(0, 1)) +
  geom_hline(yintercept=0.5, linetype=2, color="grey") +
  geom_hline(yintercept=1, linetype=2, color="grey") +
  geom_function(aes(color="Entropy measure"),
                fun = function(x) {-x*log2(x) - (1-x)*log2(1-x)},
                xlim=c(0.0001, 0.9999), n=100) +
  geom_function(aes(color="Gini index"),
                fun = function(x) {1 - x^2 - (1-x)^2}) +
  labs(y="Impurity measure", x=expression(~italic(p)[1]),
       color="Impurity measure") +
  scale_color_manual(values=c("Entropy measure"="darkorange",
                              "Gini index"="steelblue"))
ggsave(file=file.path("figures", "chapter_09", "gini_entropy.pdf"),
       last_plot() + theme_bw(), width=5, height=2.5, units="in")
library(rpart)
library(rpart.plot)
mowers.df <- mlba::RidingMowers
# use rpart() to run a classification tree.
# define rpart.control() in rpart() to determine the depth of the tree.
class.tree <- rpart(Ownership ~ ., data = mowers.df,
                    control=rpart.control(maxdepth=2), method="class")
## plot tree
# use rpart.plot() to plot the tree. You can control plotting parameters such
# as color, shape, and information displayed (which and where).
rpart.plot(class.tree, extra=1, fallen.leaves=FALSE)
pdf(file.path("figures", "chapter_09", "CT-mowerTree1.pdf"), width=3, height=3)
rpart.plot(class.tree, extra=1, fallen.leaves=FALSE)
dev.off()
class.tree <- rpart(Ownership ~ ., data = mowers.df,
                    control=rpart.control(minsplit=1), method="class")
rpart.plot(class.tree, extra=1, fallen.leaves=FALSE)
pdf(file.path("..", "figures", "chapter_09", "CT-mowerTree3.pdf"),
    width=5, height=5)
rpart.plot(class.tree, extra=1, fallen.leaves=FALSE)
dev.off()
## Evaluating the Performance of a Classification Tree
### Example 2: Acceptance of Personal Loan
library(tidyverse)
library(caret)
# Load and preprocess data
bank.df <- mlba::UniversalBank %>%
# Drop ID and zip code columns.
select(-c(ID, ZIP.Code)) %>%
# convert Personal.Loan to a factor with labels Yes and No
mutate(Personal.Loan = factor(Personal.Loan, levels=c(0, 1), labels=c("No", "Yes")),
       Education = factor(Education, levels=c(1, 2, 3), labels=c("UG", "Grad", "Prof")))
# partition
set.seed(1)
idx <- createDataPartition(bank.df$Personal.Loan, p=0.6, list=FALSE)
train.df <- bank.df[idx, ]
holdout.df <- bank.df[-idx, ]
# classification tree
default.ct <- rpart(Personal.Loan ~ ., data=train.df, method="class")
# plot tree
rpart.plot(default.ct, extra=1, fallen.leaves=FALSE)
pdf(file.path("..", "figures", "chapter_09", "CT-universalTree1.pdf"),
    width=5, height=5)
rpart.plot(default.ct, extra=1, fallen.leaves=FALSE)
dev.off()
deeper.ct <- rpart(Personal.Loan ~ ., data=train.df, method="class",
                   cp=0, minsplit=0)
# Z alternative: control the deeper tree through rpart.control (compare cp=0 vs cp=1)
deeper.ct <- rpart(Personal.Loan ~ ., data=train.df, method="class",
                   control = rpart.control(cp=1, minsplit=0))
##############################################
# count number of leaves
sum(deeper.ct$frame$var == "<leaf>")
# plot tree
rpart.plot(deeper.ct, extra=1, fallen.leaves=FALSE)
pdf(file.path("..", "figures", "chapter_09", "CT-universalTree2.pdf"), width=5, height=2.5)
rpart.plot(deeper.ct, extra=1, fallen.leaves=FALSE)
dev.off()
# classify records in the holdout data.
# set argument type = "class" in predict() to generate predicted class membership.
default.ct.point.pred.train <- predict(default.ct, train.df, type = "class")
# generate confusion matrix for training data
confusionMatrix(default.ct.point.pred.train, train.df$Personal.Loan)
### repeat the code for the holdout set, and the deeper tree
default.ct.point.pred.holdout <- predict(default.ct, holdout.df, type = "class")
confusionMatrix(default.ct.point.pred.holdout, holdout.df$Personal.Loan)
deeper.ct.point.pred.train <- predict(deeper.ct, train.df, type = "class")
confusionMatrix(deeper.ct.point.pred.train, train.df$Personal.Loan)
deeper.ct.point.pred.holdout <- predict(deeper.ct, holdout.df, type = "class")
confusionMatrix(deeper.ct.point.pred.holdout, holdout.df$Personal.Loan)
## Avoiding Overfitting
### Stopping Tree Growth
#### Stopping Tree Growth: Grid Search for Parameter Tuning
set.seed(1)
trControl <- trainControl(method="cv", number=5, allowParallel=TRUE)
model1 <- train(Personal.Loan ~ ., data=train.df, method="rpart",
                trControl=trControl,
                tuneGrid=data.frame(cp=c(1, 0.1, 0.01, 0.001, 0.0001)))
model1$results
# focus grid search around cp=0.001
model2 <- train(Personal.Loan ~ ., data=train.df, method="rpart",
                trControl=trControl,
                tuneGrid=data.frame(cp=c(0.005, 0.002, 0.001, 0.0005, 0.0002)))
model2$results
### Pruning the Tree
#### Stopping Tree Growth: Conditional Inference Trees
# argument xval refers to the number of folds to use in rpart's built-in
# cross-validation procedure
# argument cp sets the smallest value for the complexity parameter.
cv.ct <- rpart(Personal.Loan ~ ., data=train.df, method="class",
               cp=0.00001, minsplit=5, xval=5)
# use printcp() to print the table.
printcp(cv.ct)
# prune by lower cp
pruned.ct <- prune(cv.ct,
                   cp=cv.ct$cptable[which.min(cv.ct$cptable[,"xerror"]), "CP"])
sum(pruned.ct$frame$var == "<leaf>")
rpart.plot(pruned.ct, extra=1, fallen.leaves=FALSE)
pdf(file.path("figures", "chapter_09", "CT-universalTree-pruned.pdf"), width=5, height=2.5)
rpart.plot(pruned.ct, extra=1, fallen.leaves=FALSE)
dev.off()
### Best-Pruned Tree
# prune by lower cp
minErrorRow <- cv.ct$cptable[which.min(cv.ct$cptable[,"xerror"]), ]
cutoff <- minErrorRow["xerror"] + minErrorRow["xstd"]
best.cp <- cv.ct$cptable[cv.ct$cptable[,"xerror"] < cutoff,][1, "CP"]
best.ct <- prune(cv.ct, cp=best.cp)
# count the leaf nodes of the best-pruned tree
sum(best.ct$frame$var == "<leaf>")
rpart.plot(best.ct, extra=1, fallen.leaves=FALSE)
pdf(file.path("figures", "chapter_09", "CT-universalTree-best.pdf"), width=4, height=2.75)
rpart.plot(best.ct, extra=1, fallen.leaves=FALSE)
dev.off()
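To round out the comparison, the best-pruned tree can be evaluated on the holdout set the same way as the default and deeper trees above. A minimal sketch, assuming `best.ct` and `holdout.df` from the preceding code are still in the workspace and caret is loaded:

```r
# evaluate the best-pruned tree on the holdout set
best.ct.point.pred.holdout <- predict(best.ct, holdout.df, type="class")
confusionMatrix(best.ct.point.pred.holdout, holdout.df$Personal.Loan)
```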
## Classification Rules from Trees
rpart.rules(best.ct)
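`rpart.rules()` (from the rpart.plot package) prints one row per leaf: the predicted class followed by the conjunction of split conditions on the path to that leaf. A self-contained illustration on the built-in `iris` data, rather than the loan data used above:

```r
library(rpart)
library(rpart.plot)

# fit a small classification tree and extract its rules
iris.ct <- rpart(Species ~ ., data=iris, method="class")
rpart.rules(iris.ct)  # one rule per leaf, e.g. conditions on Petal.Length/Petal.Width
```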
## Regression Trees
# select variables for regression
outcome <- "Price" predictors <- c("Age_08_04", "KM", "Fuel_Type", "HP", "Met_Color", "Automatic", "CC", "Doors", "Quarterly_Tax", "Weight") # reduce data set to first 1000 rows and selected variables car.df <- mlba::ToyotaCorolla[1:1000, c(outcome, predictors)] # partition data set.seed(1) # set seed for reproducing the partition idx <- createDataPartition(car.df$Price, p=0.6, list=FALSE) car.train.df <- car.df[idx, ] car.holdout.df <- car.df[-idx, ] # use method "anova" for a regression model cv.rt <- rpart(Price ~ ., data=car.train.df, method="anova", cp=0.00001, minsplit=5, xval=5) # prune by lower cp minErrorRow <- cv.rt$cptable[which.min(cv.rt$cptable[,"xerror"]), ] cutoff <- minErrorRow["xerror"] + minErrorRow["xstd"] best.cp <- cv.rt$cptable[cv.rt$cptable[,"xerror"] < cutoff,][1, "CP"] best.rt <- prune(cv.rt, cp=best.cp) # set digits to a negative number to avoid scientific notation rpart.plot(best.rt, extra=1, fallen.leaves=FALSE, digits=-4) pdf(file.path( "figures", "chapter_09", "RT-ToyotaTree.pdf"), width=7, height=4) rpart.plot(best.rt, extra=1, fallen.leaves=FALSE, digits=-4) dev.off() ## Improving Prediction: Random Forests and Boosted Trees ### Random Forests library(randomForest) ## random forest rf <- randomForest(Personal.Loan ~ ., data=train.df, ntree=500, mtry=4, nodesize=5, importance=TRUE) ## variable importance plot varImpPlot(rf, type=1) ## confusion matrix rf.pred <- predict(rf, holdout.df) confusionMatrix(rf.pred, holdout.df$Personal.Loan) pdf(file.path("..", "figures", "chapter_09", "VarImp.pdf"), width=7, height=4) varImpPlot(rf, type=1, main="") dev.off() ### Boosted Trees library(caret) library(xgboost) xgb <- train(Personal.Loan ~ ., data=train.df, method="xgbTree", verbosity=0) # compare ROC curves for classification tree, random forest, and boosted tree models library(ROCR) rocCurveData <- function(model, data) { prob <- predict(model, data, type="prob")[, "Yes"] predob <- prediction(prob, data$Personal.Loan) perf <- 
performance(predob, "tpr", "fpr") return (data.frame(tpr=perf@x.values[[1]], fpr=perf@y.values[[1]])) } performance.df <- rbind( cbind(rocCurveData(best.ct, holdout.df), model="Best-pruned tree"), cbind(rocCurveData(rf, holdout.df), model="Random forest"), cbind(rocCurveData(xgb, holdout.df), model="xgboost") ) colors <- c("Best-pruned tree"="grey", "Random forest"="blue", "xgboost"="tomato") ggplot(performance.df, aes(x=tpr, y=fpr, color=model)) + geom_line() + scale_color_manual(values=colors) + geom_segment(aes(x=0, y=0, xend=1, yend=1), color="grey", linetype="dashed") + labs(x="1 - Specificity", y="Sensitivity", color="Model") library(gridExtra) g <- last_plot() + theme_bw() g1 <- g + guides(color="none") g2 <- g + scale_x_continuous(limits=c(0, 0.2)) + scale_y_continuous(limits=c(0.8, 1.0)) g <- arrangeGrob(g1, g2, widths=c(3, 4.5), ncol=2) ggsave(file=file.path("figures", "chapter_09", "xgboost-ROC-1.pdf"), g, width=8, height=3, units="in") xgb.focused <- train(Personal.Loan ~ ., data=train.df, method="xgbTree", verbosity=0, scale_pos_weight=10) saveRDS(xgb.focused,"xgb.focused.save.RDS") performance.df <- rbind( cbind(rocCurveData(xgb, holdout.df), model="xgboost"), cbind(rocCurveData(xgb.focused, holdout.df), model="xgboost (focused)") ) colors <- c("xgboost"="tomato", "xgboost (focused)"="darkgreen") ggplot(performance.df, aes(x=tpr, y=fpr, color=model)) + geom_line() + scale_color_manual(values=colors) + geom_segment(aes(x=0, y=0, xend=1, yend=1), color="grey", linetype="dashed") + labs(x="1 - Specificity", y="Sensitivity", color="Model") library(gridExtra) g <- last_plot() + theme_bw() g1 <- g + guides(color="none") g2 <- g + scale_x_continuous(limits=c(0, 0.2)) + scale_y_continuous(limits=c(0.8, 1.0)) g <- arrangeGrob(g1, g2, widths=c(3, 4.5), ncol=2) ggsave(file=file.path( "figures", "chapter_09", "xgboost-ROC-2.pdf"), g, width=8, height=3, units="in")
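The ROC curves above compare the models visually; the same ROCR `prediction()`/`performance()` workflow also yields a single-number AUC summary. A self-contained sketch on synthetic labels and scores (the data here are simulated, not the loan data):

```r
library(ROCR)

# simulate labels and scores that are mildly informative
set.seed(1)
labels <- factor(sample(c("No", "Yes"), 200, replace=TRUE))
scores <- runif(200) + 0.5 * (labels == "Yes")

# same prediction object as used for the ROC curves, measure "auc" instead
predob <- prediction(scores, labels)
auc <- performance(predob, "auc")@y.values[[1]]
auc  # area under the ROC curve, between 0.5 (random) and 1 (perfect)
```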

Posted in R