Mid-Term Review Session

Chapter 6, Exercise 10

We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set.

Question (b)

Split your data set into a training set containing 100 observations and a test set containing 900 observations.
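
A minimal sketch of the split, assuming the data simulated in (a) sit in a 1,000-row data frame named data (all object names here are assumptions, not the original code):

```r
set.seed(1)
train.idx <- sample(1:1000, 100)   # 100 observations for training
train <- data[train.idx, ]
test  <- data[-train.idx, ]        # the remaining 900 observations
```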


Question (e)

For which model size does the test set MSE take on its minimum value? Comment on your results. If it takes on its minimum value for a model containing only an intercept or a model containing all of the features, then play around with the way that you are generating the data in (a) until you come up with a scenario in which the test MSE is minimized for an intermediate model size.

## [1] 14

The test MSE is smallest with a subset size of 14.
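
A sketch of the computation behind this result, using leaps::regsubsets() for best subset selection; the object names (train, test, y) are assumptions carried over from the split in (b):

```r
library(leaps)

regfit   <- regsubsets(y ~ ., data = train, nvmax = 20)   # best model of each size
test.mat <- model.matrix(y ~ ., data = test)

test.mse <- sapply(1:20, function(r) {
  coefs <- coef(regfit, id = r)                  # coefficients of the best size-r model
  pred  <- test.mat[, names(coefs)] %*% coefs    # test-set predictions
  mean((test$y - pred)^2)
})
which.min(test.mse)
```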


Question (f)

How does the model at which the test set MSE is minimized compare to the true model used to generate the data? Comment on the coefficient values.

## (Intercept)         x.3         x.4         x.5         x.7         x.8 
##   0.4725956   0.7563194  -0.1396924  -0.1657850  -0.7360974  -0.8690224 
##         x.9        x.10        x.11        x.12        x.13        x.14 
##   0.7427001   0.8509050   0.2526848  -1.9481681  -0.1942387   0.5531693 
##        x.15        x.16        x.17        x.18        x.19 
##  -0.4856334  -1.5262369  -0.2729310   2.1844493  -0.5892004

In (a) we set the coefficients of x2, x4, and x6 to zero (the remaining coefficients were generated at random). x2 and x6 are absent from the selected model, meaning it correctly caught that they were zero; however, it did assign a weight to x4, which means it missed one of the three zero coefficients.


Question (g)

Create a plot displaying \(\sqrt{\sum_{j=1}^p (\beta_j - \hat{\beta}_j^r)^2}\) for a range of values of r, where \(\hat{\beta}_j^r\) is the jth coefficient estimate for the best model containing r coefficients. Comment on what you observe. How does this compare to the test MSE plot from (d)?

## [1] 2


While this is not always the case, for this particular random seed the coefficient error is lowest at 2 coefficients. Other seeds can yield other values, given that there is no actual pattern to the data; using a seed of 42, for example, both the test MSE and the coefficient error were lowest at 16 coefficients. Essentially the only meaningful observation from this plot is that the coefficients are farthest from their true values when we have a subset of size 1.
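
A sketch of the computation behind this plot, assuming the true coefficients are stored in a length-20 vector beta, with predictors named x.1 through x.20 as in the output above, and reusing the regfit object from the test-MSE sketch (all names are assumptions):

```r
coef.err <- sapply(1:20, function(r) {
  est  <- coef(regfit, id = r)                       # estimates for the best size-r model
  full <- setNames(rep(0, 20), paste0("x.", 1:20))   # absent predictors count as 0
  full[names(est)[-1]] <- est[-1]                    # drop the intercept, fill in estimates
  sqrt(sum((beta - full)^2))
})
plot(1:20, coef.err, type = "b", xlab = "Subset size r", ylab = "Coefficient error")
which.min(coef.err)
```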

Ultimately, the point of the exercise is to illustrate that even for garbage data like this, you will see improvement in training error as you increase the flexibility of the model; however, that does not mean your model is actually improving.


Chapter 7, Exercise 11

This question explores backfitting in the context of multiple linear regression. If we want to perform multiple linear regression but only have software that performs simple linear regression, we can take the following approach: repeatedly hold all but one coefficient estimate fixed at its current value, and update only that coefficient estimate using a simple linear regression. This process is continued until convergence, that is, until the coefficient estimates stop changing. We are going to try this out on a toy example.


Question (c)

Keeping \(\hat{\beta}_1\) fixed, fit the model \(Y - \hat{\beta}_1 X_1 = \beta_0 + \beta_2 X_2 + \epsilon\).


Question (d)

Keeping \(\hat{\beta}_2\) fixed, fit the model \(Y - \hat{\beta}_2 X_2 = \beta_0 + \beta_1 X_1 + \epsilon\).


Question (g)

On this data set, how many backfitting iterations were required in order to obtain a “good” approximation to the multiple regression coefficient estimates?

When the relationship between Y and the X's is linear, one iteration is sufficient to attain a good approximation of the true regression coefficients.
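
A sketch of the backfitting loop described in the introduction, on hypothetical toy data (the objects y, x1, x2 and the starting value are assumptions):

```r
beta1 <- 0                        # arbitrary starting value for the beta1 estimate
for (i in 1:10) {
  a <- y - beta1 * x1
  beta2 <- lm(a ~ x2)$coef[2]     # (c): update beta2 with beta1 held fixed
  a <- y - beta2 * x2
  beta1 <- lm(a ~ x1)$coef[2]     # (d): update beta1 with beta2 held fixed
}
c(beta1 = beta1, beta2 = beta2)   # in this linear setting, near-converged after one pass
```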


Chapter 8, Exercise 11

This question uses the Caravan data set. Let’s first take a look at the data set.

## [1] 5822   86
**Description**
The data contains 5822 real customer records. Each record consists of 86 variables, containing sociodemographic data (variables 1-43) and product ownership (variables 44-86). The sociodemographic data is derived from zip codes. All customers living in areas with the same zip code have the same sociodemographic attributes. 
Variable 86 (Purchase) indicates whether the customer purchased a caravan insurance policy.


Question (a)

Create a training set consisting of the first 1,000 observations, and a test set consisting of the remaining observations.
## [1] 1000   86
## [1] 4822   86
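
A minimal sketch reproducing the dimensions above (the names training and test match the rest of this write-up):

```r
library(ISLR)

training <- Caravan[1:1000, ]               # first 1,000 observations
test     <- Caravan[1001:nrow(Caravan), ]   # remaining 4,822 observations
dim(training)
dim(test)
```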


Question (b)

Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use 1,000 trees, and a shrinkage value of 0.01. Which predictors appear to be the most important?
## [1] "No"  "Yes"

Key takeaways:
1. String values of the response variable should be converted to numeric (0/1) before being passed to gbm(); otherwise it throws the error: Bernoulli requires the response to be in {0,1};
2. distribution = 'bernoulli' must be specified to indicate a classification problem with a binary y;
3. as.factor() should NOT be applied to training$Purchase after converting it to a 0/1 binary variable; otherwise gbm() runs without error messages but returns NaN values for the feature importances.

To display the top 5 most important features, we can slice the output of summary(boost.caravan) as we would a data frame.
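
A sketch of the fit under the takeaways above (the seed and the object name boost.caravan are assumptions):

```r
library(gbm)

training$Purchase <- ifelse(training$Purchase == "Yes", 1, 0)   # takeaway 1: 0/1 response
set.seed(1)
boost.caravan <- gbm(Purchase ~ ., data = training,
                     distribution = "bernoulli",                # takeaway 2
                     n.trees = 1000, shrinkage = 0.01)
summary(boost.caravan, plotit = FALSE)[1:5, ]                   # top 5 by relative influence
```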

##               var   rel.inf
## PPERSAUT PPERSAUT 15.155340
## MKOOPKLA MKOOPKLA  9.234995
## MOPLHOOG MOPLHOOG  8.670170
## MBERMIDD MBERMIDD  5.394037
## MGODGE     MGODGE  5.030477


Question (c)

Use the boosting model to predict the response on the test data. Predict that a person will make a purchase if the estimated probability of purchase is greater than 20%. Form a confusion matrix. What fraction of the people predicted to make a purchase do in fact make one? How does this compare with the results obtained from applying KNN or logistic regression to this data set?
c-1 Boosting
##             Predictions
## ActualValues   No  Yes
##          No  4396  137
##          Yes  255   34

Key takeaways:
1. n.trees must be specified when calling predict on a gbm object;
2. To predict probabilities, specify the argument type = 'response'; by default, predict returns values on the log-odds scale for distribution = 'bernoulli';
3. A convention from Machine Learning 1: always put the actual values first in the table() call, so that they appear as the row labels of the confusion matrix, and name the rows and columns to make this clear (see the sketch below).
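
A sketch of the prediction and precision steps under these takeaways:

```r
probs <- predict(boost.caravan, newdata = test,
                 n.trees = 1000, type = "response")   # takeaways 1 and 2
preds <- ifelse(probs > 0.2, "Yes", "No")             # 20% probability cutoff
cm <- table(ActualValues = test$Purchase, Predictions = preds)   # takeaway 3
cm["Yes", "Yes"] / sum(cm[, "Yes"])   # precision: fraction of predicted buyers who buy
```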

## [1] 0.1988304


c-2 KNN

Now we fit a KNN model and compare its result to the boosting model.

## [1] 32
##             Predictions
## ActualValues   No  Yes
##          No  4465   68
##          Yes  272   17

Key takeaways:
1. Before fitting a KNN model, the independent variables of the whole data set must be scaled;
2. knn() returns the predicted labels directly. If the argument prob = TRUE is specified, the proportion of votes for the winning class is returned as the attribute prob; this still needs to be converted to the probability of predicting 1 or Yes (see the sketch below).
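
A sketch of the KNN workflow under these takeaways; the choice k = 5 is an assumption:

```r
library(class)

X <- scale(Caravan[, -86])             # takeaway 1: scale predictors on the whole data set
train.X <- X[1:1000, ]
test.X  <- X[1001:nrow(X), ]
train.Y <- Caravan$Purchase[1:1000]

set.seed(1)
knn.pred <- knn(train.X, test.X, train.Y, k = 5, prob = TRUE)
win.prob <- attr(knn.pred, "prob")     # proportion of votes for the *winning* class
prob.yes <- ifelse(knn.pred == "Yes", win.prob, 1 - win.prob)   # takeaway 2
knn.preds <- ifelse(prob.yes > 0.2, "Yes", "No")                # same 20% cutoff
table(ActualValues = Caravan$Purchase[1001:nrow(Caravan)], Predictions = knn.preds)
```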

Calculate the precision of the KNN model.

## [1] 0.2

The KNN model performs slightly better in terms of precision (0.200 vs. 0.199 for boosting).

c-3 Logistic Regression

Finally, fit a logistic regression model.
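
A sketch, reusing the 0/1-coded training set from (b) and the same 20% cutoff:

```r
logit.fit   <- glm(Purchase ~ ., data = training, family = binomial)
logit.probs <- predict(logit.fit, newdata = test, type = "response")
logit.preds <- ifelse(logit.probs > 0.2, "Yes", "No")
table(ActualValues = test$Purchase, Predictions = logit.preds)
```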

##             Predictions
## ActualValues   No  Yes
##          No  4183  350
##          Yes  231   58

Calculate the precision of the logistic regression model.

## [1] 0.1421569

Logistic regression performs much worse compared to Boosting and KNN in terms of precision.

Team 7 - Yixi Chen, Kelby Williamson, Carlos Garrido, Scott Mundy

3/2/2020