Mid-Term Review Session
Chapter 6, Exercise 10
We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set.
Question (a)
Generate a data set with p = 20 features, n = 1000 observations and an associated quantitative response vector generated according to the model \(Y=X \beta + \epsilon\), where \(\beta\) has some elements that are exactly equal to zero.
# generate dataset of 20 features with 1000 observations
p <- 20
n <- 1000
# just use rnorm to generate everything
data <- matrix(rnorm(n*p), n, p)
# Generate Betas, then manually set several to zero
Beta <- rnorm(p)
Beta[2] <- 0
Beta[4] <- 0
Beta[6] <- 0
# the noise term epsilon must have length n (one value per observation), also generated with rnorm
eps <- rnorm(n)
# %*% is matrix multiplication: a 1000 x 20 matrix times a 20 x 1 vector gives a 1000 x 1 response
Y <- data%*%Beta + eps
Question (b)
Split your data set into a training set containing 100 observations and a test set containing 900 observations.
indices <- sample(seq(1000), 100)
train_x <- data[indices,]
# - sign in an index excludes those indices in R
test_x <- data[-indices,]
train_y <- Y[indices]
test_y <- Y[-indices]
Question (c)
Perform best subset selection on the training set, plot the training set MSE associated with the best model of each size.
# use the regsubsets() function from the leaps package to perform best subset selection
library(leaps)
# regsubsets() expects a data frame, so combine the training predictors and response
df <- data.frame(x = train_x, y = train_y)
# the nvmax argument sets the maximum number of variables allowed in a subset
regfit_full <- regsubsets(y~., data = df, nvmax = p)
val.errors <- rep(NA,p)
# train_x is an unnamed matrix, so generate the names "x.1", ..., "x.20" that data.frame() assigned
x_names <- colnames(train_x, do.NULL = FALSE, prefix = "x.")
for (i in 1:p) {
  coefi = coef(regfit_full, id = i)
  # regsubsets() reports the best subset of each size by variable name, so we select
  # the matching columns by name; %in% checks for membership in a vector in R.
  # The intercept has no column in train_x, so it is added separately.
  pred <- coefi["(Intercept)"] +
    as.matrix(train_x[, x_names %in% names(coefi)]) %*% coefi[names(coefi) %in% x_names]
  val.errors[i] <- mean((train_y - pred)^2)
}
plot(val.errors, ylab = "Training MSE", xlab = "Size of subset", type = 'b')
Question (d)
Plot the test set MSE associated with the best model of each size.
# essentially the same as in (c), but evaluating on the held-out test set
val.errors2 <- rep(NA,p)
for (i in 1:p) {
  coefi = coef(regfit_full, id = i)
  # Same as in (c): pick out the variables in the size-i subset by name,
  # add the intercept, and evaluate on the test set.
  pred <- coefi["(Intercept)"] +
    as.matrix(test_x[, x_names %in% names(coefi)]) %*% coefi[names(coefi) %in% x_names]
  val.errors2[i] <- mean((test_y - pred)^2)
}
plot(val.errors2, ylab = "Test MSE", xlab = "Size of subset", type = 'b')
Question (e)
For which model size does the test set MSE take on its minimum value? Comment on your results. If it takes on its minimum value for a model containing only an intercept or a model containing all of the features, then play around with the way that you are generating the data in (a) until you come up with a scenario in which the test MSE is minimized for an intermediate model size.
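The code behind the output below is hidden in the original; as a sketch, the model size minimizing the test MSE can be recovered with something like the following.
# index of the smallest test MSE from part (d)
which.min(val.errors2)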
## [1] 14
The test MSE is smallest at a subset size of 14.
Question (f)
How does the model at which the test set MSE is minimized compare to the true model used to generate the data? Comment on the coefficient values.
## (Intercept) x.3 x.4 x.5 x.7 x.8
## 0.4725956 0.7563194 -0.1396924 -0.1657850 -0.7360974 -0.8690224
## x.9 x.10 x.11 x.12 x.13 x.14
## 0.7427001 0.8509050 0.2526848 -1.9481681 -0.1942387 0.5531693
## x.15 x.16 x.17 x.18 x.19
## -0.4856334 -1.5262369 -0.2729310 2.1844493 -0.5892004
We set the coefficients of x2, x4, and x6 to zero. x2 and x6 are absent from the selected model, meaning it correctly identified them as zero; however, it did assign a weight to x4, so it failed to exclude one of the three zero coefficients. (The remaining true coefficients were simply random draws, so the exact zeros are the easiest ones to check.)
Question (g)
Create a plot displaying \(\sqrt{\sum_{j=1}^p (\beta_j - \hat{\beta}^r_j)^2}\) for a range of values of r, where \(\hat{\beta}^r_j\) is the jth coefficient estimate for the best model containing r coefficients. Comment on what you observe. How does this compare to the test MSE plot from (d)?
errors <- rep(NA, p)
r <- 20
for (i in 1:r) {
  coefi = coef(regfit_full, id = i)
  # regsubsets() reports coefficients by variable name, so match them to the true
  # Betas by name with %in%. For every r < 20 some variables are excluded from the
  # subset; their estimated coefficients are implicitly zero, so the squared true
  # Betas of the excluded variables must be added as well.
  tot <- sum((Beta[x_names %in% names(coefi)]
              - coefi[names(coefi) %in% x_names])^2) +
    sum(Beta[!(x_names %in% names(coefi))]^2)
  errors[i] <- sqrt(tot)
}
plot(x = 1:20, y = errors, xlab = "number of coefficients",
ylab = "error between estimated and actual coefficients")
## [1] 2
While this is not always the case, for this particular random seed the coefficient error is lowest with 2 coefficients. Other seeds can yield other values, since the simulation is random: using a seed of 42, for example, both the test MSE and the coefficient error were lowest at 16 coefficients. Essentially the only consistent observation from this plot is that the estimated coefficients are farthest from the true values when the subset has size 1.
Ultimately, the point of the exercise is to illustrate that even for noisy, purely simulated data like this, you will see improvement in training error as you increase the flexibility of the model; however, that does not mean your model is actually improving.
Chapter 7, Exercise 11
This question explores backfitting in the context of multiple linear regression. Suppose we want to perform multiple linear regression but only have software that performs simple linear regression. We can take the following approach: repeatedly hold all but one coefficient estimate fixed at its current value, and update only that coefficient estimate using a simple linear regression. This process is continued until convergence, that is, until the coefficient estimates stop changing. We are going to try this out on a toy example.
Question (a)
Generate a response Y and two predictors X1 and X2, with n=100.
# generate data according to the model Y = -2.1 + 1.3*X1 + 0.54*X2 + eps
set.seed(1)
X1 = rnorm(100)
X2 = rnorm(100)
eps = rnorm(100, sd = 0.1)
Y = -2.1 + 1.3 * X1 + 0.54 * X2 + eps
Question (b)
Initialize \(\hat{\beta}_1\) to a value of your choice.
# Create vectors to hold the 1000 iterates of beta0, beta1, and beta2.
beta0 = rep(NA, 1000)
beta1 = rep(NA, 1000)
beta2 = rep(NA, 1000)
# Initialize the first element of beta1 to 10.
beta1[1] = 10
Question (c)
Keeping \(\hat{\beta}_1\) fixed, fit the model \(Y - \hat{\beta}_1 X_1 = \beta_0 + \beta_2 X_2 + \epsilon\).
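No code is shown for this step in the original; as a sketch, a single update of \(\hat{\beta}_2\) with \(\hat{\beta}_1\) held fixed could look like the following (the loop in (e) repeats exactly this).
# sketch: regress the partial residual Y - beta1*X1 on X2 to update beta2
a <- Y - beta1[1] * X1
beta2[1] <- lm(a ~ X2)$coef[2]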
Question (d)
Keeping \(\hat{\beta}_2\) fixed, fit the model \(Y - \hat{\beta}_2 X_2 = \beta_0 + \beta_1 X_1 + \epsilon\).
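Again as a sketch only (assuming beta2[1] was just updated in (c)), the corresponding update with \(\hat{\beta}_2\) held fixed is:
# sketch: regress the partial residual Y - beta2*X2 on X1 to update beta0 and beta1
a <- Y - beta2[1] * X2
fit <- lm(a ~ X1)        # 'fit' is a temporary name used only in this sketch
beta0[1] <- fit$coef[1]
beta1[2] <- fit$coef[2]  # becomes the beta1 used in the next iteration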
Question (e)
Write a for loop to repeat (c) and (d) 1000 times. Report the estimates of \(\hat{\beta}_0\), \(\hat{\beta}_1\), and \(\hat{\beta}_2\) at each iteration of the for loop. Create a plot in which each of these values is displayed, with \(\hat{\beta}_0\), \(\hat{\beta}_1\), and \(\hat{\beta}_2\) each shown in a different color.
# for loop repeating parts (c) and (d) 1000 times
for (i in 1:1000) {
  # part (c): update beta2 with beta1 held fixed
  a = Y - beta1[i] * X1
  beta2[i] = lm(a ~ X2)$coef[2]
  # part (d): update beta0 and beta1 with beta2 held fixed
  a = Y - beta2[i] * X2
  lm.fit = lm(a ~ X1)
  if (i < 1000) {
    beta1[i + 1] = lm.fit$coef[2]
  }
  beta0[i] = lm.fit$coef[1]
}
#plot values
plot(1:1000, beta0, type = "l", xlab = "iteration", ylab = "betas", ylim = c(-2.2,
1.6), col = "green")
lines(1:1000, beta1, col = "red")
lines(1:1000, beta2, col = "blue")
legend("center", c("beta0", "beta1", "beta2"), lty = 1, col = c("green", "red",
"blue"))
The coefficients quickly attain their least squares values.
Question (f)
Compare your answer in (e) to the results of simply performing multiple linear regression to predict Y using X1 and X2. Use the abline() function to overlay those multiple linear regression coefficient estimates on the plot obtained in (e).
#using multiple linear regression to predict Y
lm.fit = lm(Y ~ X1 + X2)
#plot results from backfitting
plot(1:1000, beta0, type = "l", xlab = "iteration", ylab = "betas", ylim = c(-2.2,
1.6), col = "green")
lines(1:1000, beta1, col = "red")
lines(1:1000, beta2, col = "blue")
#plot results from multiple linear regression
abline(h = lm.fit$coef[1], lty = "dashed", lwd = 3, col = rgb(0, 0, 0, alpha = 0.4))
abline(h = lm.fit$coef[2], lty = "dashed", lwd = 3, col = rgb(0, 0, 0, alpha = 0.4))
abline(h = lm.fit$coef[3], lty = "dashed", lwd = 3, col = rgb(0, 0, 0, alpha = 0.4))
legend("center", c("beta0", "beta1", "beta2", "multiple regression"), lty = c(1,
1, 1, 2), col = c("green", "red", "blue", "black"))
The plot shows that the multiple regression coefficient estimates coincide with the coefficients obtained by backfitting.
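As an extra numerical check not included in the original, the final backfitting estimates can be placed next to the lm() coefficients, for example:
# sketch: final backfitting estimates vs. multiple regression coefficients
rbind(backfit = c(beta0[1000], beta1[1000], beta2[1000]),
      lm = coef(lm.fit))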
Question (g)
On this data set, how many backfitting iterations were required in order to obtain a “good” approximation to the multiple regression coefficient estimates?
When the relationship between Y and the X's is linear, one iteration is sufficient to attain a good approximation to the multiple regression coefficient estimates; here convergence is especially fast in part because X1 and X2 were generated independently and are therefore nearly uncorrelated.
Chapter 8, Exercise 11
This question uses the Caravan data set. Let's first take a look at the data set.
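No code for loading the data is shown; the dimensions below presumably come from something like the following sketch.
# the Caravan data set is part of the ISLR package
library(ISLR)
dim(Caravan)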
## [1] 5822 86
Description
The data contains 5822 real customer records. Each record consists of 86 variables, containing sociodemographic data (variables 1-43) and product ownership (variables 44-86). The sociodemographic data is derived from zip codes. All customers living in areas with the same zip code have the same sociodemographic attributes.
Variable 86 (Purchase) indicates whether the customer purchased a caravan insurance policy.
Question (a)
Create a training set consisting of the first 1,000 observations, and a test set consisting of the remaining observations.
# Split Caravan into training and test sets
training <- Caravan[1:1000,]
test <- Caravan[1001:nrow(Caravan),]
# Check the split
dim(training)
## [1] 1000 86
dim(test)
## [1] 4822 86
Question (b)
Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use 1,000 trees, and a shrinkage value of 0.01. Which predictors appear to be the most important?
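The output below is not produced by any code shown so far; it presumably comes from inspecting the levels of the response, for example:
levels(Caravan$Purchase)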
## [1] "No" "Yes"
# Encode the string values of Purchase to 0/1
training$Purchase <- ifelse(training$Purchase=="Yes",1,0)
# Fit a boosting model using gbm()
require(gbm)
set.seed(342)
boost.caravan <- gbm(Purchase~., data = training, n.trees = 1000, shrinkage = 0.01,
distribution = 'bernoulli')
Key takeaways:
1. String values of the response variable should be converted to numeric 0/1 before being passed to gbm(); otherwise it throws the error message: Bernoulli requires the response to be in {0,1};
2. distribution = 'bernoulli' must be specified to indicate a classification problem with a binary response;
3. as.factor() should NOT be applied to training$Purchase after converting it to a 0/1 binary variable. Otherwise gbm() runs without error messages but returns NaN values for the feature importances.
To display the top 5 most important features, we can slice the output of summary(boost.caravan) as we would a data frame.
# plotit = FALSE can mute the plot of all features' importances
summary(boost.caravan,plotit = FALSE)[1:5,]
## var rel.inf
## PPERSAUT PPERSAUT 15.155340
## MKOOPKLA MKOOPKLA 9.234995
## MOPLHOOG MOPLHOOG 8.670170
## MBERMIDD MBERMIDD 5.394037
## MGODGE MGODGE 5.030477
Question (c)
Use the boosting model to predict the response on the test data. Predict that a person will make a purchase if the estimated probability of purchase is greater than 20%. Form a confusion matrix. What fraction of the people predicted to make a purchase do in fact make one? How does this compare with the results obtained from applying KNN or logistic regression to this data set?
c-1 Boosting
# Make predictions of probabilities (newdata does not seem to require the response to be recoded to 0/1)
pred_probs <- predict(object = boost.caravan, newdata = test, n.trees = 1000, type = 'response')
# Convert to predictions of labels using a threshold of 20%
pred_labels <- ifelse(pred_probs>0.2,"Yes","No")
# Form a confusion matrix
table(ActualValues=test$Purchase, Predictions=pred_labels)
## Predictions
## ActualValues No Yes
## No 4396 137
## Yes 255 34
Key takeaways:
1. n.trees must be specified when calling predict on a gbm object;
2. To get predicted probabilities, specify the argument type = 'response'; by default, predict returns values on the log-odds scale for distribution = 'bernoulli';
3. Machine Learning 1 legacy: always put the actual values first in the table function, so that they appear as row labels in the confusion matrix, and name the rows and columns to keep the table clear.
# Fraction of the people predicted to make a purchase who in fact make one (i.e. the precision, or positive predictive value)
34/(137+34)
## [1] 0.1988304
c-2 KNN
Now we fit a KNN model and compare its result to the boosting model.
require(class)
# Scale the predictors, i.e. every column except the last one (Purchase)
scale.x <- as.data.frame(scale(Caravan[, -ncol(Caravan)]))
training.scale.x <- scale.x[1:1000,]
test.scale.x <- scale.x[1001:nrow(Caravan),]
# Use the square root of training sample size as # of nearest neighbours in KNN
(k <- round(sqrt(nrow(training))))
## [1] 32
# Fit a KNN model and make predictions
knn.caravan <- knn(train = training.scale.x, test = test.scale.x,
cl=training$Purchase, k = k, prob = TRUE)
# Extract the proportion of votes for the winning class for each prediction
knn.winning.prop <- attributes(knn.caravan)$prob
# Convert the winning-class proportions to probabilities of predicting 1 (a purchase)
knn.probs <- ifelse(knn.caravan==1,knn.winning.prop,1-knn.winning.prop)
# Convert to predictions of labels using a threshold of 20%
knn.labels <- ifelse(knn.probs>0.2,"Yes","No")
# Form a confusion matrix
table(ActualValues=test$Purchase, Predictions=knn.labels)
## Predictions
## ActualValues No Yes
## No 4465 68
## Yes 272 17
Key takeaways:
1. Before fitting a KNN model, scaling must be applied to the independent variables of the whole data set;
2. knn() returns the predicted labels directly. If the argument prob = TRUE is specified, the proportion of votes for the winning class is returned as the attribute prob; this still needs to be converted into the probability of predicting 1 (or Yes).
Calculate the precision of the KNN model.
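The value below is not computed by any code shown here; from the KNN confusion matrix above it is, for example:
# true positives / all predicted purchases
17/(17+68)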
## [1] 0.2
The KNN model performs slightly better in terms of model precision.
c-3 Logistic Regression
Finally, fit a logistic regression model.
# Convert the Purchase to a factor variable
training$Purchase <- as.factor(training$Purchase)
# Fit a logistic regression model
glm.caravan <- glm(Purchase~., data = training, family = 'binomial')
# Make predictions
glm.probs <- predict(glm.caravan,newdata = test,type = 'response')
glm.labels <- ifelse(glm.probs>0.2,"Yes","No")
# Form a confusion matrix
table(ActualValues=test$Purchase, Predictions=glm.labels)
## Predictions
## ActualValues No Yes
## No 4183 350
## Yes 231 58
Calculate the precision of the logistic regression model.
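Again, from the logistic regression confusion matrix above this is, for example:
# true positives / all predicted purchases
58/(58+350)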
## [1] 0.1421569
Logistic regression performs much worse than boosting and KNN in terms of precision.