Predictive Analysis on Pet Adoption Speed
Background and Dataset Introduction
PetFinder.my has been Malaysia’s leading animal welfare platform since 2008, with a database of more than 150,000 animals. PetFinder collaborates closely with animal lovers, media, corporations, and global organizations to improve animal welfare.
Animal adoption rates are strongly correlated to the metadata associated with their online profiles. In this project we will be developing algorithms to predict the adoptability of pets - specifically, how quickly is a pet adopted? If successful, they will be adapted into AI tools that will guide shelters and rescuers around the world on improving their pet profiles’ appeal, reducing animal suffering and euthanization.
Dataset: PetFinder.my pet profiles
The dataset we use in this project is based on the pet’s listing on PetFinder. Sometimes a profile represents a group of pets. In this case, the speed of adoption is determined by the speed at which all of the pets are adopted. The data included text, tabular, and image data. But we will mainly focus on the tabular data to apply machine learning techniques.
### Functions
installIfAbsentAndLoad <- function(neededVector) {
for(thispackage in neededVector) {
if( ! require(thispackage, character.only = T) )
{ install.packages(thispackage)}
require(thispackage, character.only = T)
}
}
### Load reequired packages
needed <- c('dplyr','stringr','caret', 'Rborist', 'psych','ggplot2',
'randomForest','ranger','e1071','pdp','class')
installIfAbsentAndLoad(needed)
Let’s load the data and take a look at it:
## Type Name Age Breed1
## Min. :1.000 : 1257 Min. : 0.00 Min. : 0.0
## 1st Qu.:1.000 Baby : 66 1st Qu.: 2.00 1st Qu.:265.0
## Median :1.000 Lucky : 64 Median : 3.00 Median :266.0
## Mean :1.458 Brownie: 54 Mean : 10.45 Mean :265.3
## 3rd Qu.:2.000 No Name: 54 3rd Qu.: 12.00 3rd Qu.:307.0
## Max. :2.000 Mimi : 52 Max. :255.00 Max. :307.0
## (Other):13446
## Breed2 Gender Color1 Color2
## Min. : 0.00 Min. :1.000 Min. :1.000 Min. :0.000
## 1st Qu.: 0.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.000
## Median : 0.00 Median :2.000 Median :2.000 Median :2.000
## Mean : 74.01 Mean :1.776 Mean :2.234 Mean :3.223
## 3rd Qu.:179.00 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:6.000
## Max. :307.00 Max. :3.000 Max. :7.000 Max. :7.000
##
## Color3 MaturitySize FurLength Vaccinated
## Min. :0.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:0.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :0.000 Median :2.000 Median :1.000 Median :2.000
## Mean :1.882 Mean :1.862 Mean :1.467 Mean :1.731
## 3rd Qu.:5.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :7.000 Max. :4.000 Max. :3.000 Max. :3.000
##
## Dewormed Sterilized Health Quantity
## Min. :1.000 Min. :1.000 Min. :1.000 Min. : 1.000
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 1.000
## Median :1.000 Median :2.000 Median :1.000 Median : 1.000
## Mean :1.559 Mean :1.914 Mean :1.037 Mean : 1.576
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.: 1.000
## Max. :3.000 Max. :3.000 Max. :3.000 Max. :20.000
##
## Fee State RescuerID
## Min. : 0.00 Min. :41324 fa90fa5b1ee11c86938398b60abc32cb: 459
## 1st Qu.: 0.00 1st Qu.:41326 aa66486163b6cbc25ea62a34b11c9b91: 315
## Median : 0.00 Median :41326 c00756f2bdd8fa88fc9f07a8309f7d5d: 231
## Mean : 21.26 Mean :41346 b53c34474d9e24574bcec6a3d3306a0d: 228
## 3rd Qu.: 0.00 3rd Qu.:41401 ee2747ce26468ec44c7194e7d1d9dad9: 156
## Max. :3000.00 Max. :41415 95481e953f8aed9ec3d16fc4509537e8: 134
## (Other) :13470
## VideoAmt Description
## Min. :0.00000 For Adoption : 164
## 1st Qu.:0.00000 Dog 4 Adoption : 54
## Median :0.00000 Cat for adoption : 25
## Mean :0.05676 Friendly : 20
## 3rd Qu.:0.00000 Dog for adoption : 18
## Max. :8.00000 Please feel free to contact us : Stuart: 18
## (Other) :14694
## PetID PhotoAmt AdoptionSpeed
## 0008c5398: 1 Min. : 0.000 Min. :0.000
## 000a290e4: 1 1st Qu.: 2.000 1st Qu.:2.000
## 000fb9572: 1 Median : 3.000 Median :2.000
## 0011d7c25: 1 Mean : 3.889 Mean :2.516
## 00156db4a: 1 3rd Qu.: 5.000 3rd Qu.:4.000
## 001a1aaad: 1 Max. :30.000 Max. :4.000
## (Other) :14987
Here are 24 columns, and the detailed descriptions are as followed:
Data Fields
- PetID - Unique hash ID of pet profile
- AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict. See below section for more info.
- Type - Type of animal (1 = Dog, 2 = Cat)
- Name - Name of pet (Empty if not named)
- Age - Age of pet when listed, in months
- Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)
- Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
- Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
- Color1 - Color 1 of pet (Refer to ColorLabels dictionary)
- Color2 - Color 2 of pet (Refer to ColorLabels dictionary)
- Color3 - Color 3 of pet (Refer to ColorLabels dictionary)
- MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
- FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
- Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
- Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
- Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
- Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
- Quantity - Number of pets represented in profile
- Fee - Adoption fee (0 = Free)
- State - State location in Malaysia (Refer to StateLabels dictionary)
- RescuerID - Unique hash ID of rescuer
- VideoAmt - Total uploaded videos for this pet
- PhotoAmt - Total uploaded photos for this pet
- Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.
Ressponse Variable: AdoptionSpeed
We will need to predict this value. The value is determined by how quickly, if at all, a pet is adopted. The values are determined in the following way:
* 0 - Pet was adopted on the same day as it was listed.
* 1 - Pet was adopted between 1 and 7 days (1st week) after being listed.
* 2 - Pet was adopted between 8 and 30 days (1st month) after being listed.
* 3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.
* 4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).
EDA Conclusions Review
We reviewed the notebook Exploration of data step by step
(https://www.kaggle.com/artgor/exploration-of-data-step-by-step/data#Main-data-exploration) and summarized the conclusions as follows:
- The target of the program is to find the adoption speed of the pets. Based on EDA, some pets are adopted immediately, but many are not adopted at all.
- The data is broken into 2 types of animals - cats and dogs. The EDA shows that the percentage of not adopted cats is lower, possibly because the dataset is small and biased.
- While looking at the exploratory data analysis, we saw that less than 10% of pets did not have names, but that those pets had a higher possibility of not being adopted. The EDA also shows that some of the names given to the pets are meaningless and have only 1 or 2 characters. Pets with these meaningless names could have less success of being adopted and we took this into account in our data engineering.
- The EDA shows that young pets are adopted quickly and frequently. Most of the pets adopted are less than 4 months old.
- Most of the dogs available for adoption are not pure breeds, but are mixed breeds. The values listed under the “breed” feature were sometimes random and only a description. We took this into account in our data engineering.
- As far as gender goes, male pets are adopted faster than female pets. If the pet has no information regarding gender, it decreases their chances of being adopted.
- Another interesting aspect of the data is the fee. Some pets can be acquired for free and asking for a fee slightly descreases the chance of adoption. According to the EDA, free cats are adopted faster than free dogs.
- In most cases there are no videos for the pets, so this is not very useful. However, pets typically have photos and can have up to 30. But the EDA shows that the amount of photos does not have much influence on the adoption of the pet.
Critique of Selected Notebooks
Linear Model Notebook:
- There was no feature engineering done aside from counting the words in the description column. They referenced all X columns by name in the linear model when this is unnecessary. You can just do a data = train_set
- Feature selection would be useful, they did a simple LM instead of using any kind of subset selection, ridge regression or lasso regression to select important variables. They also do not appear to have set variables to be factors or categorical, which could cause problems.
- The notebook indicates that the standard deviation is more than 1, when the range is only 4. This does not indicate a good model. The plots displayed are poorly labeled and do not seem to add to the analysis.
Link to the notebook: https://www.kaggle.com/cjbecerr/linear-model-clustering
Random Forest Notebook:
- Threw out name and description columns and there was no attempt made at finding a way to extract something of use from them. However, they did check for missing values and factorize a number of columns.
- Uses the caret library’s train function, with the ranger library used to generate the RF. No justification is given as to why they chose to use this method over the randomforest method. Though the traincontrol parameter allowed the author to specify 10 Kfold cross validation easily.
- While we won’t comment on the use of XGBoost, at the end of the notebook the author combines the two models by simply averaging their predictions. This is not a particularly good approach because it assumes that both models are equally good, without any analysis to indicate that averaging them would yield a better prediction.
Link to the notebook: https://www.kaggle.com/cberrido/eda-rf-and-xgboost-modelling
Data Engineering
Based on the EDA conclusions, we decided to engineer the features in the following way:
1. Remove features that we would not use in our models:
* VideoAmt: most pets don’t have any videos, so we believe this field will have little predictive power.
* Description: we will not perform text analysis.
* PetID: the unique identifier for pet profiles but will be useless in predictive models.
- Create a new column “MeaningfulName” based on the “Name” field: assign
Yes
if the length of Name is larger than 2, else assignNo
.
- Create a new column “PureBreed” based on the “Breed1” and “Breed2”: assign
Yes
if Breed2 is 0, else assignNo
.
- Create a new column “rank”: since we found that there are a few rescuers that rescued a lot of animals, and the adoption speed shows variations amont them, we would like to label the top 5 rescuers as
1-5
, and label all the others asothers
.
pets %>% group_by(RescuerID) %>%
summarise(pets_count=n()) %>%
arrange(desc(pets_count)) %>%
mutate(rank=dense_rank(desc(pets_count))) %>%
mutate(rank=ifelse(rank>5,"Others",rank)) %>%
select(c(RescuerID,rank))-> Rescuer
pets <- left_join(pets,Rescuer,by="RescuerID")
- Drop redundant columns: we can drop columns that we have used to generate new useful columns.
- Convert categorical variables into factors: finally we need to convert all the categorical fields into factors for the models not to identify them as continuous variables. The only continuous variables we have are “Age”, “Quantity”, “Fee”, and “PhotoAmt”.
categorical <- setdiff(names(pets),c("Age","Quantity","Fee","PhotoAmt"))
pets %>% mutate_at(categorical,as.factor) -> pets
After completing the data engineering steps, let’s look at the summary of the pets dataset again:
## Type Age Gender Color1 Color2 Color3 MaturitySize
## 1:8132 Min. : 0.00 1:5536 1:7427 0:4471 0:10604 1: 3395
## 2:6861 1st Qu.: 2.00 2:7277 2:3750 2:3313 3: 175 2:10305
## Median : 3.00 3:2180 3: 947 3: 710 4: 198 3: 1260
## Mean : 10.45 4: 634 4: 870 5: 417 4: 33
## 3rd Qu.: 12.00 5: 884 5:1128 6: 378
## Max. :255.00 6: 684 6:1063 7: 3221
## 7: 667 7:3438
## FurLength Vaccinated Dewormed Sterilized Health Quantity
## 1:8808 1:5898 1:8397 1: 3101 1:14478 Min. : 1.000
## 2:5361 2:7227 2:4815 2:10077 2: 481 1st Qu.: 1.000
## 3: 824 3:1868 3:1781 3: 1815 3: 34 Median : 1.000
## Mean : 1.576
## 3rd Qu.: 1.000
## Max. :20.000
##
## Fee State PhotoAmt AdoptionSpeed MeaningfulName
## Min. : 0.00 41326 :8714 Min. : 0.000 0: 410 No : 1473
## 1st Qu.: 0.00 41401 :3845 1st Qu.: 2.000 1:3090 Yes:13520
## Median : 0.00 41327 : 843 Median : 3.000 2:4037
## Mean : 21.26 41336 : 507 Mean : 3.889 3:3259
## 3rd Qu.: 0.00 41330 : 420 3rd Qu.: 5.000 4:4197
## Max. :3000.00 41332 : 253 Max. :30.000
## (Other): 411
## PureBreed rank
## No : 4231 1 : 459
## Yes:10762 2 : 315
## 3 : 231
## 4 : 228
## 5 : 156
## Others:13604
##
And check if there are any missing values:
## [1] "Type"
## [1] 0
## [1] "Age"
## [1] 0
## [1] "Gender"
## [1] 0
## [1] "Color1"
## [1] 0
## [1] "Color2"
## [1] 0
## [1] "Color3"
## [1] 0
## [1] "MaturitySize"
## [1] 0
## [1] "FurLength"
## [1] 0
## [1] "Vaccinated"
## [1] 0
## [1] "Dewormed"
## [1] 0
## [1] "Sterilized"
## [1] 0
## [1] "Health"
## [1] 0
## [1] "Quantity"
## [1] 0
## [1] "Fee"
## [1] 0
## [1] "State"
## [1] 0
## [1] "PhotoAmt"
## [1] 0
## [1] "AdoptionSpeed"
## [1] 0
## [1] "MeaningfulName"
## [1] 0
## [1] "PureBreed"
## [1] 0
## [1] "rank"
## [1] 0
Predictive Modeling
Selection of Metrics
We decided to treat the problem as a classification one since the value Adoption Speed is determined by binning the number of days that it took for a pet or a group of them to be adopted after being listed into 5 categories. While the distance between the values 0-4 does not mean same interval of days, Adoption Speed is an ordinal categorical variable in that the larger the value is, the longer the time it represents.
Therefore, in addition to the accuracy
metric we usually use for a classification problem, we choose to use another metric Quadratic Weighted Kappa
as Kaggle suggested. It takes into consideration how different the actual values and their predictions are when calculating the weights. The closer it is to 1, the better the predictions are.
We found the cohen.kappa
from the psych
package can be used to compute this metric conveniently given predictions and actual values.
Linear Model
When we attempted a simple linear regression, we needed to drop a number of highly factored columns, namely the three color columns.
# Linear Model
n <- nrow(pets)
train.indices <- sample(n, .8 * n)
train_lm_x <- pets[train.indices, c(-17,-4,-5,-6)]
train_lm_y <- pets[train.indices, 17]
test_lm_x <- pets[-train.indices, c(-17,-4,-5,-6)]
test_lm_y <- pets[-train.indices, 17]
L_model <- lm(train_lm_y~.,data = train_lm_x)
LM_preds <- predict(L_model, test_lm_x)
LM_preds <- round(LM_preds, 0)
table(LM_preds, test_lm_y)
## test_lm_y
## LM_preds 0 1 2 3 4
## 2 0 1 1 2 0
## 3 34 453 448 343 278
## 4 31 198 341 273 588
## 5 1 1 0 1 4
## 6 0 0 0 0 1
## [1] 0.6892297
## [1] 0.1797893
The end result is that a simple linear regression yields an error rate of ~0.7, and a weighted kappa of between 0.15 and 0.18, usually around 0.168. Neither of these are good, and indicate that a linear regression is not a good method for predicting the data. This is likely due to the fact that the majority of variables are categorical in this dataset, but we’re treating them like they have numerical meaning here.
Lasso Regression
We also tried a lasso regression, and this performed worse than the simple linear regression. The code did run successfully as of 4/15/20, but stopped working the following day for no apparent reason. It was tested on two machines, and in both cases froze the R console and required the use of task manager to forcibly terminate R studio, meaning the issue is likely with the interaction between Rstudio and Windows/ antivirus.
Ultimately, the data was not well suited for a lasso, as glmnet requires the input to be in the form of a model matrix. Converting the data frame to a model matrix resulted in a bloated mess of 232 columns, instead of the original 21. This is due to the fact that it transformed factors into one-hot-encoded vectors, one for each level of the factor.
The model may be improved by removing the breed and three color columns that contribute the most to the expansion, however we were unable to test this due to technical issues.
From the time when the model worked, the test error rate was greater than 0.8, which means it was actually worse than randomly placing each pet in one of the five bins.
# glmnet, the function used to perform the lasso wants a model matrix.
# this does have the unfortunate side effect of blowing up factors into one-hot encoded
# columns, so it goes from a 21 column dataframe to a 232 column monstrosity.
#my.mm <- model.matrix(AdoptionSpeed ~ ., pets)
# remove column 18, which correpsonds to adoption speed, i.e. the response variable
#my.mm <- model.matrix(AdoptionSpeed ~ ., pets)[, -18]
##############################################################
# POSSIBLE IMPROVEMENTS #######################################
# try commenting out the line below and uncommenting the one below that
# remove column 18, which correpsonds to adoption speed, i.e. the response variable
#my.mm <- model.matrix(AdoptionSpeed ~ ., pets)[, -18]
# remove column 3, the breed, 5,6,7 the color, 16 the state, and 21 the rank of the
# rescuer. This will drastically reduce the bloating that occurs and may well yield a
# better model.
#my.mm <- model.matrix(AdoptionSpeed ~ ., pets)[,c(-18,-3,-5,-6,-7,-21,-16)]
################################################################
# Note that the X's have to be in model matrix format, while the Y's are in
# data frame format
#train.indices <- sample(n, .8 * n)
#train.x <- my.mm[train.indices, ]
#test.x <- my.mm[-train.indices, ]
#train.y <- pets$AdoptionSpeed[train.indices]
#test.y <- pets$AdoptionSpeed[-train.indices]
#train.data <- data.frame(train.x, Adoption_speed=train.y)
# we perform a grid search to find the best lambda. Virtually no improvement
# searching over 1000 units as compared to 100. The lambda is very small.
#grid = 10^seq(5, -3, length = 100)
# glmnet will allow multinomial regression, a regular glm does not have this option
# but it makes sense here to actually have it classify rather than run a linear regression
#mod.lasso = glmnet(train.x, train.y, alpha = 1, lambda = grid,
# family = "multinomial")
# the cross validation no longer works on my machine, it just freezes R
# this should run 10 times and allow us to identify what the best lambda found was
# for use in later models.
# We can save time by just hard coding the best lambda after running once
#cv.out.class <- cv.glmnet(train.x, train.y, alpha = 1, lambda = grid,
# family = "multinomial")
#best_lambda = cv.out.class$lambda.min
#print(best_lambda)
# predictions are in the form of probabilities so we need to pick the most likely
# option
#lasso.pred = predict(mod.lasso, s = best_lambda, newx = test.x,
# type = "response")
#pick_5 <- c()
#predictions <- c()
# the prediction is in a long list with each set of 5 being the probabilities
# of adoption speed being 0-4 for a given pet.
# this is not a list of lists, but a continuous string so we need a for loop
# to group them into proper sets of five. Afterwards we pick the most probable
# as our prediction
#for(i in 1:length(lasso.pred)){
# pick_5 <- c(pick_5, lasso.pred[i])
# if(i%%5 == 0){
# predictions <- c(predictions, which.max(pick_5)-1)
# pick_5 <- c()
# }
#}
# confusion matrix
#table(predictions, test.y)
# compute test error rate (1-test error = accuracy)
#test_error <- 1- mean(predictions == test.y)
#print(test_error)
# compute kappa object
#kappa <- cohen.kappa(x=cbind(test.y ,predictions))
##################################################################
#### Probably Broken #############################################
#test_out <- glmnet(test.x, test.y, alpha = 1, lambda = best_lambda,
# family = "multinomial")
#lasso.coef = predict(test_out, type = "coefficients", s = best_lambda)
#print(lasso.coef)
####################################################################
KNN
For a KNN model, we had to convert the Yes/No columns MeaningfulName and PureBreed to binary columns. Aside from that, we dropped the “rank” column because it has “other” as a rank, which logically is quite different from the sixth highest ranked rescuer ID. Furthermore, the three color columns and the state column were dropped, as their numerical values are too high to compare reasonably with the other categorical variables. We try 9 different values of K, ranging from 10 to 50 and pick the best one as our final model.
# KNN does not allow strings. Presumably because you can't calculate the euclidean
# distance between "Yes" and "No", however for those they can easily be turned into
# 1 and 0
# 2. Create a new column "MeaningfulName": Yes if the Name length>2, else No
pets$MeaningfulName <- ifelse(pets$MeaningfulName == "Yes","1","0")
# 3. Create a new column "PureBreed": Yes if Breed2==0, else No
pets$PureBreed <- ifelse(pets$PureBreed=="Yes","1","0")
n <- nrow(pets)
train.indices <- sample(n, .8 * n)
# remove categorical values that aren't represented properly by small integer values
# This means breed, the three color columns, the state (which had values in the 40,000's)
# and the rank column for rescuers (which has an 'other' column).
train_lm_x <- pets[train.indices, c(-17,-4,-5,-6,-20,-15)]
train_lm_y <- pets[train.indices, 17]
test_lm_x <- pets[-train.indices, c(-17,-4,-5,-6,-20,-15)]
test_lm_y <- pets[-train.indices, 17]
errors <- c()
kappas <- c()
# we try 9 different values of K and find the best one
kGrid = c(10,15,20,25,30,35,40,45,50)
for( i in 1:9){
fit <- knn(train = train_lm_x, test=test_lm_x,
cl = train_lm_y, k = kGrid[i])
errors[i] <- 1- mean(fit == test_lm_y)
kappas[i] <- cohen.kappa(x=cbind(test_lm_y ,fit))
}
best_K = kGrid[which.max(kappas)]
best_kappa = kappas[which.max(kappas)]
best_test_err = errors[which.min(errors)]
# K = 25
# kappa = 0.163
# error = 0.619
sprintf("Best value of K: %f", best_K)
## [1] "Best value of K: 35.000000"
## [1] "Weighted kappa for best K: 0.129456"
## [1] "Test error for best K: 0.644215"
Generally, the best K is 25, though we have seen seeds where it was something else. Weighted kappa tends to be around 0.15 or 0.16, with test error around 0.6.
Random Forest
For this part, we will build up a random forest model using default settings as baseline, and then use the train()
function from caret
package to tune a few parameters in search for a model with better performance.
Training-test split
First of all, we need to split the pets dataset into training vs. test set, the proportion of training set as 75%.
Train a baseline Random Forest
Then we train a Random Forest model using default settings and display a few parameters.
rf.0 <- randomForest(AdoptionSpeed~., data = pets.train)
## Show the default settings:
print(paste("Number of features selected at each split:",rf.0$mtry))
## [1] "Number of features selected at each split: 4"
## [1] "Number of trees in the forest: 500"
Now we will make predictions of Adoption Speed on the test set and evaluate the model’s performance.
pred.0 <- predict(rf.0, newdata = pets.test)
# Accuracy:
print(paste("Accuracy:",mean(as.integer(pets$AdoptionSpeed[-train_index]==pred.0))))
## [1] "Accuracy: 0.390549919914576"
## Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels)
##
## Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries
## lower estimate upper
## unweighted kappa 0.16 0.19 0.21
## weighted kappa 0.25 0.31 0.36
##
## Number of subjects = 3746
Tune parameters
Next, we will vary the number of features and minimum node sizes and use cross-validation to search for the combination that results in model with best Quadratic Weighted Kappa.
When creating a search grid and specifying series of the available parameters, Note that the available parameters to tune are determined by the method we will use in the train()
function. There are many methods options for random forest models, to tune minimum node size and features at split we decided to use Rborist
rfGrid.1 <- expand.grid(predFixed = c(3, 6, 9),
minNode = c(1,5,10))
# Create control object
ctrl.1 <- trainControl(method = "cv",
number = 3, # number of folds
selectionFunction = "best",
search = "grid", # grid search
verboseIter = FALSE, # print a training log
returnResamp = "all"
)
Now we will apply grid search to train 15 different models, use cross-validation to select the best one, and then train the best model again with the full training set.
# set.seed(1234)
# rf.1 <- train(AdoptionSpeed~.,
# data = pets.train,
# method = 'Rborist', # rf can choose from a variety of methods; tuning parameters are decided by method
# metric = "Kappa", # by default accuracy and Kappa are computed for classification
# trControl = ctrl.1,
# tuneGrid = rfGrid.1
# )
We can display the best combination of parameters:
Using the best parameters to retrain a random forest model:
set.seed(1234)
rf.best <- randomForest(AdoptionSpeed~., data = pets.train, mtry=6, nodesize=10)
pred.best <- predict(rf.best, newdata = pets.test)
# Accuracy:
mean(as.integer(pets$AdoptionSpeed[-train_index]==pred.best))
## [1] 0.3950881
## Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels)
##
## Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries
## lower estimate upper
## unweighted kappa 0.17 0.19 0.21
## weighted kappa 0.26 0.31 0.37
##
## Number of subjects = 3746
Compared to the default setting model rf.0
, this new model does not achieve a significantly better accuracy and Quadratic Weighted Kappa. This might result from that the default settings are usually fair enough, and might also be due to our choice of parameters values to choose from.
Given more time, we can expand the parameter grid by specifying more values and see if we could find a better combination.
We will analyze feature importance in the next part.
Model Interpretation and Conclusions
Linear & KNN
From our attempts at using linear regression and KNN, there is a rather obvious takeaway from this project. Namely, you probably shouldn’t use these when you have mostly categorical variables. In both cases, the predictive strength was better than random chance at least. However, the fact is that of the original 21 variables, only 4 were continuous, the other 17 were categorical. As such, you would likely be best served with a model that is specialized for datasets like this.
It would have been interesting to see whether the lasso reduced any coefficients to zero, but technical issues prevented us from following through on this idea.
Random Forest
To have a clear picture of which features were heavily relied on in building the random forest model, we decide to display feature importance.
var_importance <- data.frame(variables=setdiff(colnames(pets), c("AdoptionSpeed",'Breed1')),
importance=varImp(rf.best),row.names = NULL)
var_importance[order(var_importance$Overall,decreasing = TRUE),]
## variables Overall
## 2 Age 515.54106
## 16 PhotoAmt 467.60510
## 5 Color2 407.75793
## 4 Color1 371.03137
## 15 State 344.95819
## 6 Color3 203.59322
## 14 Fee 189.51396
## 8 FurLength 172.25524
## 13 Quantity 169.93823
## 7 MaturitySize 160.36168
## 3 Gender 155.80157
## 11 Sterilized 149.26986
## 19 rank 135.82440
## 10 Dewormed 135.11197
## 9 Vaccinated 134.68680
## 18 PureBreed 110.81332
## 1 Type 89.78642
## 17 MeaningfulName 77.43344
## 12 Health 45.49620
We can see that Age
and PhotoAmt
are the most important features, then come categorical features such as Color2
, Color1
and State
. This result makes sense and is aligned with the findings from EDA.
Reflection
In retrospect, we noticed that the data we have mainly consists of categorical variables. So the nature of the problem probably determines that the tree-based methods that do not require one-hot-encoding categorical variables will generate the best result, because both linear model and KNN would suffer from curse of high dimensionality after one-hot-encoding. Based on our attempts and the models’ performance, we do find that the Random Forest model yielded the best accuracy and Quardratic Weighted Kappa.
We also found that many notebooks focused purely on modeling and predictions rather than on model interpretability and business impacts. We believe in practice these two factors are also essential, so that’s why we would like to add the feature importance. These tools enable us to identify important variables, based on which we can generate potential suggestions for PetFinder.my if we had the chance.
Last but not least, based on the fact that the top score on kaggle still has a weighted kappa under 0.5, we would guess that the dataset itself isn’t large/good enough to have high predictive strength. It would probably need to be 10-100 times as large, and have the cats and dogs separated.