Predicting Prices of Used Cars (Regression Trees). The dataset mlba::ToyotaCorolla contains data on used Toyota Corollas on sale during late summer 2004 in the Netherlands. It has 1436 records with details on 38 variables, including Price, Age, Kilometers, HP, and other specifications. The goal is to predict the price of a used Toyota Corolla based on its specifications. (The example in Section 9.7 is a subset of this dataset.) Use the code below to read the data:
car.df <- mlba::ToyotaCorolla
Data Preprocessing. Split the data into training (60%) and holdout (40%) sets. Note: use set.seed(1) before partitioning so the split is reproducible.
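A minimal sketch of one way to build this partition in base R (the names train.index, holdout.index, train.df, and holdout.df are assumptions carried through the rest of the exercise):
set.seed(1)  # make the random partition reproducible
train.index <- sample(nrow(car.df), round(0.6 * nrow(car.df)))
holdout.index <- setdiff(seq_len(nrow(car.df)), train.index)
train.df <- car.df[train.index, ]
holdout.df <- car.df[holdout.index, ]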
Run a regression tree (RT) with the outcome variable Price and the predictors Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco, Automatic_Airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar. Set the minimum number of records in a terminal node to 1, the maximum number of tree levels to 30, and cp = 0.001, to make the run as unrestrictive as possible.
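One way to specify this with rpart (a sketch; the model name deep.rt is an assumption, and rpart.control carries the three settings above):
library(rpart)
# least-restrictive settings: minbucket = 1, maxdepth = 30, cp = 0.001
deep.rt <- rpart(Price ~ Age_08_04 + KM + Fuel_Type + HP + Automatic + Doors +
                   Quarterly_Tax + Mfr_Guarantee + Guarantee_Period + Airco +
                   Automatic_Airco + CD_Player + Powered_Windows + Sport_Model + Tow_Bar,
                 data = train.df, method = "anova",
                 control = rpart.control(minbucket = 1, maxdepth = 30, cp = 0.001))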
Which appear to be the three or four most important car specifications for predicting the car's price? Note: for the rpart function, variable importance scores are saved in the fitted model object under $variable.importance.
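For example, assuming the fitted model is named deep.rt as above (the scores are stored in decreasing order of importance):
head(deep.rt$variable.importance, 4)  # four most important predictors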
Compare the prediction errors of the training and holdout sets by examining their RMSE and by plotting boxplots of the errors for each set. How does the predictive performance on the holdout set compare to that on the training set? Why does this occur?
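A sketch of the comparison in base R, assuming the objects defined above:
# prediction errors (actual - predicted) on each set
train.err <- train.df$Price - predict(deep.rt, train.df)
holdout.err <- holdout.df$Price - predict(deep.rt, holdout.df)
sqrt(mean(train.err^2))    # training RMSE
sqrt(mean(holdout.err^2))  # holdout RMSE
# side-by-side boxplots of the prediction errors
boxplot(list(training = train.err, holdout = holdout.err), ylab = "Prediction error")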
How might we achieve better holdout predictive performance at the expense of training performance? Note: you only need to answer the question; no coding is required.
Create a smaller tree by leaving the arguments cp, minbucket, and maxdepth at their defaults. Compared to the deeper tree, what is its predictive performance on the holdout set?
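A sketch, assuming the same formula as before (small.rt is an assumed name; omitting the control argument leaves rpart at its defaults):
small.rt <- rpart(Price ~ Age_08_04 + KM + Fuel_Type + HP + Automatic + Doors +
                    Quarterly_Tax + Mfr_Guarantee + Guarantee_Period + Airco +
                    Automatic_Airco + CD_Player + Powered_Windows + Sport_Model + Tow_Bar,
                  data = train.df, method = "anova")
sqrt(mean((holdout.df$Price - predict(small.rt, holdout.df))^2))  # holdout RMSE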
Let us see the effect of turning the price variable into a categorical variable. First, create a new variable that categorizes Price into 20 bins. Then repartition the data, keeping Binned_Price instead of Price. Run a classification tree (CT) with the same set of input variables as in the RT and with Binned_Price as the outcome variable. As in the smaller regression tree, leave the arguments cp, minbucket, and maxdepth at their defaults (a sketch of the CT call follows the binning code below).
Use the code below to create the bins and the binned price variable. Note: train.index and holdout.index come from your data partition above; you may have named the indices differently.
# categorical (binned) price: 20 equal-width bins spanning the range of Price
bins <- seq(min(car.df$Price), max(car.df$Price),
            (max(car.df$Price) - min(car.df$Price)) / 20)
# assign each record to a bin and convert the bin index to a factor
Binned_Price <- .bincode(car.df$Price, bins, include.lowest = TRUE)
Binned_Price <- as.factor(Binned_Price)
# attach the binned outcome to the existing partitions
train.df$Binned_Price <- Binned_Price[train.index]
holdout.df$Binned_Price <- Binned_Price[holdout.index]
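A sketch of the classification tree call referenced above (small.ct is an assumed name; method = "class" tells rpart to grow a classification tree):
small.ct <- rpart(Binned_Price ~ Age_08_04 + KM + Fuel_Type + HP + Automatic + Doors +
                    Quarterly_Tax + Mfr_Guarantee + Guarantee_Period + Airco +
                    Automatic_Airco + CD_Player + Powered_Windows + Sport_Model + Tow_Bar,
                  data = train.df, method = "class")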
Compare the smaller tree generated by the CT with the smaller tree generated by the RT. Are they different? (Look at the structure, the top predictors, the size of the tree, etc.) Why?
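One way to inspect the two trees side by side is the rpart.plot package (an assumption; printing the fitted objects also reveals their structure):
library(rpart.plot)
rpart.plot(small.rt, main = "Regression tree (Price)")
rpart.plot(small.ct, main = "Classification tree (Binned_Price)")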
Using the smaller RT and the CT, predict the price of a used Toyota Corolla with the specifications listed in the table below.
TABLE: Specifications for a particular Toyota Corolla (provided as an image with this assignment; not reproduced here)
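A sketch of how to score the new record with both trees. Every value in new.car below is a placeholder, not taken from the table; substitute the actual specifications. The bin-midpoint conversion assumes the bins vector created above:
# hypothetical record -- replace each value with the one from the table
new.car <- data.frame(Age_08_04 = 77, KM = 117000, Fuel_Type = "Petrol", HP = 110,
                      Automatic = 0, Doors = 5, Quarterly_Tax = 100, Mfr_Guarantee = 0,
                      Guarantee_Period = 3, Airco = 1, Automatic_Airco = 0, CD_Player = 0,
                      Powered_Windows = 0, Sport_Model = 0, Tow_Bar = 1)
predict(small.rt, new.car)                               # RT: predicted price
pred.bin <- predict(small.ct, new.car, type = "class")   # CT: predicted price bin
b <- as.integer(as.character(pred.bin))
(bins[b] + bins[b + 1]) / 2   # approximate price: midpoint of the predicted bin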
Compare the predictions in terms of the predictors that were used, the magnitude of the difference between the two predictions, and the advantages and disadvantages of the two methods.