Predictive Analysis on Pet Adoption Speed

Background and Dataset Introduction
- Dataset: PetFinder.my pet profiles
EDA Conclusions Review
Critique of Selected Notebooks
- Linear Model Notebook:
- Random Forest Notebook:
Data Engineering
Predictive Modeling
Model Interpretation and Conclusions
- Linear & KNN
- Random Forest
Reflection

Background and Dataset Introduction

PetFinder.my has been Malaysia’s leading animal welfare platform since 2008, with a database of more than 150,000 animals. PetFinder collaborates closely with animal lovers, media, corporations, and global organizations to improve animal welfare.
Animal adoption rates are strongly correlated to the metadata associated with their online profiles. In this project we will be developing algorithms to predict the adoptability of pets - specifically, how quickly is a pet adopted? If successful, they will be adapted into AI tools that will guide shelters and rescuers around the world on improving their pet profiles’ appeal, reducing animal suffering and euthanization.

Dataset: PetFinder.my pet profiles

The dataset we use in this project is based on the pet’s listing on PetFinder. Sometimes a profile represents a group of pets. In this case, the speed of adoption is determined by the speed at which all of the pets are adopted. The data included text, tabular, and image data. But we will mainly focus on the tabular data to apply machine learning techniques.

### Functions
installIfAbsentAndLoad <- function(neededVector) {
  for(thispackage in neededVector) {
    if( ! require(thispackage, character.only = T) )
    { install.packages(thispackage)}
    require(thispackage, character.only = T)
  }
}

### Load reequired packages
needed <- c('dplyr','stringr','caret', 'Rborist', 'psych','ggplot2',
            'randomForest','ranger','e1071','pdp','class')  
installIfAbsentAndLoad(needed)

Let’s load the data and take a look at it:

pets <- read.csv('train.csv')
summary(pets)

##       Type            Name            Age             Breed1     
##  Min.   :1.000          : 1257   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.:1.000   Baby   :   66   1st Qu.:  2.00   1st Qu.:265.0  
##  Median :1.000   Lucky  :   64   Median :  3.00   Median :266.0  
##  Mean   :1.458   Brownie:   54   Mean   : 10.45   Mean   :265.3  
##  3rd Qu.:2.000   No Name:   54   3rd Qu.: 12.00   3rd Qu.:307.0  
##  Max.   :2.000   Mimi   :   52   Max.   :255.00   Max.   :307.0  
##                  (Other):13446                                   
##      Breed2           Gender          Color1          Color2     
##  Min.   :  0.00   Min.   :1.000   Min.   :1.000   Min.   :0.000  
##  1st Qu.:  0.00   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.000  
##  Median :  0.00   Median :2.000   Median :2.000   Median :2.000  
##  Mean   : 74.01   Mean   :1.776   Mean   :2.234   Mean   :3.223  
##  3rd Qu.:179.00   3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:6.000  
##  Max.   :307.00   Max.   :3.000   Max.   :7.000   Max.   :7.000  
##                                                                  
##      Color3       MaturitySize     FurLength       Vaccinated   
##  Min.   :0.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:0.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :0.000   Median :2.000   Median :1.000   Median :2.000  
##  Mean   :1.882   Mean   :1.862   Mean   :1.467   Mean   :1.731  
##  3rd Qu.:5.000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :7.000   Max.   :4.000   Max.   :3.000   Max.   :3.000  
##                                                                 
##     Dewormed       Sterilized        Health         Quantity     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   : 1.000  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.: 1.000  
##  Median :1.000   Median :2.000   Median :1.000   Median : 1.000  
##  Mean   :1.559   Mean   :1.914   Mean   :1.037   Mean   : 1.576  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.: 1.000  
##  Max.   :3.000   Max.   :3.000   Max.   :3.000   Max.   :20.000  
##                                                                  
##       Fee              State                                  RescuerID    
##  Min.   :   0.00   Min.   :41324   fa90fa5b1ee11c86938398b60abc32cb:  459  
##  1st Qu.:   0.00   1st Qu.:41326   aa66486163b6cbc25ea62a34b11c9b91:  315  
##  Median :   0.00   Median :41326   c00756f2bdd8fa88fc9f07a8309f7d5d:  231  
##  Mean   :  21.26   Mean   :41346   b53c34474d9e24574bcec6a3d3306a0d:  228  
##  3rd Qu.:   0.00   3rd Qu.:41401   ee2747ce26468ec44c7194e7d1d9dad9:  156  
##  Max.   :3000.00   Max.   :41415   95481e953f8aed9ec3d16fc4509537e8:  134  
##                                    (Other)                         :13470  
##     VideoAmt                                        Description   
##  Min.   :0.00000   For Adoption                           :  164  
##  1st Qu.:0.00000   Dog 4 Adoption                         :   54  
##  Median :0.00000   Cat for adoption                       :   25  
##  Mean   :0.05676   Friendly                               :   20  
##  3rd Qu.:0.00000   Dog for adoption                       :   18  
##  Max.   :8.00000   Please feel free to contact us : Stuart:   18  
##                    (Other)                                :14694  
##        PetID          PhotoAmt      AdoptionSpeed  
##  0008c5398:    1   Min.   : 0.000   Min.   :0.000  
##  000a290e4:    1   1st Qu.: 2.000   1st Qu.:2.000  
##  000fb9572:    1   Median : 3.000   Median :2.000  
##  0011d7c25:    1   Mean   : 3.889   Mean   :2.516  
##  00156db4a:    1   3rd Qu.: 5.000   3rd Qu.:4.000  
##  001a1aaad:    1   Max.   :30.000   Max.   :4.000  
##  (Other)  :14987

Here are 24 columns, and the detailed descriptions are as followed:

Data Fields

PetID - Unique hash ID of pet profile
AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict. See below section for more info.
Type - Type of animal (1 = Dog, 2 = Cat)
Name - Name of pet (Empty if not named)
Age - Age of pet when listed, in months
Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)
Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
Color1 - Color 1 of pet (Refer to ColorLabels dictionary)
Color2 - Color 2 of pet (Refer to ColorLabels dictionary)
Color3 - Color 3 of pet (Refer to ColorLabels dictionary)
MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
Quantity - Number of pets represented in profile
Fee - Adoption fee (0 = Free)
State - State location in Malaysia (Refer to StateLabels dictionary)
RescuerID - Unique hash ID of rescuer
VideoAmt - Total uploaded videos for this pet
PhotoAmt - Total uploaded photos for this pet
Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.

Ressponse Variable: AdoptionSpeed

We will need to predict this value. The value is determined by how quickly, if at all, a pet is adopted. The values are determined in the following way:
* 0 - Pet was adopted on the same day as it was listed.
* 1 - Pet was adopted between 1 and 7 days (1st week) after being listed.
* 2 - Pet was adopted between 8 and 30 days (1st month) after being listed.
* 3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.
* 4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).

EDA Conclusions Review

We reviewed the notebook Exploration of data step by step(https://www.kaggle.com/artgor/exploration-of-data-step-by-step/data#Main-data-exploration) and summarized the conclusions as follows:

The target of the program is to find the adoption speed of the pets. Based on EDA, some pets are adopted immediately, but many are not adopted at all.
The data is broken into 2 types of animals - cats and dogs. The EDA shows that the percentage of not adopted cats is lower, possibly because the dataset is small and biased.
While looking at the exploratory data analysis, we saw that less than 10% of pets did not have names, but that those pets had a higher possibility of not being adopted. The EDA also shows that some of the names given to the pets are meaningless and have only 1 or 2 characters. Pets with these meaningless names could have less success of being adopted and we took this into account in our data engineering.
The EDA shows that young pets are adopted quickly and frequently. Most of the pets adopted are less than 4 months old.
Most of the dogs available for adoption are not pure breeds, but are mixed breeds. The values listed under the “breed” feature were sometimes random and only a description. We took this into account in our data engineering.
As far as gender goes, male pets are adopted faster than female pets. If the pet has no information regarding gender, it decreases their chances of being adopted.
Another interesting aspect of the data is the fee. Some pets can be acquired for free and asking for a fee slightly descreases the chance of adoption. According to the EDA, free cats are adopted faster than free dogs.
In most cases there are no videos for the pets, so this is not very useful. However, pets typically have photos and can have up to 30. But the EDA shows that the amount of photos does not have much influence on the adoption of the pet.

Critique of Selected Notebooks

Linear Model Notebook:

There was no feature engineering done aside from counting the words in the description column. They referenced all X columns by name in the linear model when this is unnecessary. You can just do a data = train_set
Feature selection would be useful, they did a simple LM instead of using any kind of subset selection, ridge regression or lasso regression to select important variables. They also do not appear to have set variables to be factors or categorical, which could cause problems.
The notebook indicates that the standard deviation is more than 1, when the range is only 4. This does not indicate a good model. The plots displayed are poorly labeled and do not seem to add to the analysis.
Link to the notebook: https://www.kaggle.com/cjbecerr/linear-model-clustering

Random Forest Notebook:

Threw out name and description columns and there was no attempt made at finding a way to extract something of use from them. However, they did check for missing values and factorize a number of columns.
Uses the caret library’s train function, with the ranger library used to generate the RF. No justification is given as to why they chose to use this method over the randomforest method. Though the traincontrol parameter allowed the author to specify 10 Kfold cross validation easily.
While we won’t comment on the use of XGBoost, at the end of the notebook the author combines the two models by simply averaging their predictions. This is not a particularly good approach because it assumes that both models are equally good, without any analysis to indicate that averaging them would yield a better prediction.
Link to the notebook: https://www.kaggle.com/cberrido/eda-rf-and-xgboost-modelling

Data Engineering

Based on the EDA conclusions, we decided to engineer the features in the following way:
1. Remove features that we would not use in our models:
* VideoAmt: most pets don’t have any videos, so we believe this field will have little predictive power.
* Description: we will not perform text analysis.
* PetID: the unique identifier for pet profiles but will be useless in predictive models.

pets <- select(pets,-c(VideoAmt,Description,PetID))

Create a new column “MeaningfulName” based on the “Name” field: assign Yes if the length of Name is larger than 2, else assign No.

pets$MeaningfulName <- as.factor(ifelse(str_length(pets$Name)>2,"Yes","No"))

Create a new column “PureBreed” based on the “Breed1” and “Breed2”: assign Yes if Breed2 is 0, else assign No.

pets$PureBreed <- as.factor(ifelse(pets$Breed2==0,"Yes","No"))

Create a new column “rank”: since we found that there are a few rescuers that rescued a lot of animals, and the adoption speed shows variations amont them, we would like to label the top 5 rescuers as 1-5, and label all the others as others.

pets %>% group_by(RescuerID) %>% 
  summarise(pets_count=n()) %>% 
  arrange(desc(pets_count)) %>% 
  mutate(rank=dense_rank(desc(pets_count))) %>% 
  mutate(rank=ifelse(rank>5,"Others",rank)) %>% 
  select(c(RescuerID,rank))-> Rescuer

pets <- left_join(pets,Rescuer,by="RescuerID")

Drop redundant columns: we can drop columns that we have used to generate new useful columns.

pets <- select(pets,-c(Name,Breed1, Breed2, RescuerID))

Convert categorical variables into factors: finally we need to convert all the categorical fields into factors for the models not to identify them as continuous variables. The only continuous variables we have are “Age”, “Quantity”, “Fee”, and “PhotoAmt”.

categorical <- setdiff(names(pets),c("Age","Quantity","Fee","PhotoAmt"))
pets %>% mutate_at(categorical,as.factor) -> pets

After completing the data engineering steps, let’s look at the summary of the pets dataset again:

summary(pets)

##  Type          Age         Gender   Color1   Color2   Color3    MaturitySize
##  1:8132   Min.   :  0.00   1:5536   1:7427   0:4471   0:10604   1: 3395     
##  2:6861   1st Qu.:  2.00   2:7277   2:3750   2:3313   3:  175   2:10305     
##           Median :  3.00   3:2180   3: 947   3: 710   4:  198   3: 1260     
##           Mean   : 10.45            4: 634   4: 870   5:  417   4:   33     
##           3rd Qu.: 12.00            5: 884   5:1128   6:  378               
##           Max.   :255.00            6: 684   6:1063   7: 3221               
##                                     7: 667   7:3438                         
##  FurLength Vaccinated Dewormed Sterilized Health       Quantity     
##  1:8808    1:5898     1:8397   1: 3101    1:14478   Min.   : 1.000  
##  2:5361    2:7227     2:4815   2:10077    2:  481   1st Qu.: 1.000  
##  3: 824    3:1868     3:1781   3: 1815    3:   34   Median : 1.000  
##                                                     Mean   : 1.576  
##                                                     3rd Qu.: 1.000  
##                                                     Max.   :20.000  
##                                                                     
##       Fee              State         PhotoAmt      AdoptionSpeed MeaningfulName
##  Min.   :   0.00   41326  :8714   Min.   : 0.000   0: 410        No : 1473     
##  1st Qu.:   0.00   41401  :3845   1st Qu.: 2.000   1:3090        Yes:13520     
##  Median :   0.00   41327  : 843   Median : 3.000   2:4037                      
##  Mean   :  21.26   41336  : 507   Mean   : 3.889   3:3259                      
##  3rd Qu.:   0.00   41330  : 420   3rd Qu.: 5.000   4:4197                      
##  Max.   :3000.00   41332  : 253   Max.   :30.000                               
##                    (Other): 411                                                
##  PureBreed       rank      
##  No : 4231   1     :  459  
##  Yes:10762   2     :  315  
##              3     :  231  
##              4     :  228  
##              5     :  156  
##              Others:13604  
##

And check if there are any missing values:

for (col in colnames(pets)) {
  print(col)
  print(sum(is.na(pets[col])))
}

## [1] "Type"
## [1] 0
## [1] "Age"
## [1] 0
## [1] "Gender"
## [1] 0
## [1] "Color1"
## [1] 0
## [1] "Color2"
## [1] 0
## [1] "Color3"
## [1] 0
## [1] "MaturitySize"
## [1] 0
## [1] "FurLength"
## [1] 0
## [1] "Vaccinated"
## [1] 0
## [1] "Dewormed"
## [1] 0
## [1] "Sterilized"
## [1] 0
## [1] "Health"
## [1] 0
## [1] "Quantity"
## [1] 0
## [1] "Fee"
## [1] 0
## [1] "State"
## [1] 0
## [1] "PhotoAmt"
## [1] 0
## [1] "AdoptionSpeed"
## [1] 0
## [1] "MeaningfulName"
## [1] 0
## [1] "PureBreed"
## [1] 0
## [1] "rank"
## [1] 0

Predictive Modeling

Selection of Metrics

We decided to treat the problem as a classification one since the value Adoption Speed is determined by binning the number of days that it took for a pet or a group of them to be adopted after being listed into 5 categories. While the distance between the values 0-4 does not mean same interval of days, Adoption Speed is an ordinal categorical variable in that the larger the value is, the longer the time it represents.

Therefore, in addition to the accuracy metric we usually use for a classification problem, we choose to use another metric Quadratic Weighted Kappa as Kaggle suggested. It takes into consideration how different the actual values and their predictions are when calculating the weights. The closer it is to 1, the better the predictions are.

We found the cohen.kappa from the psych package can be used to compute this metric conveniently given predictions and actual values.

Linear Model

When we attempted a simple linear regression, we needed to drop a number of highly factored columns, namely the three color columns.

# Linear Model
n <- nrow(pets)
train.indices <- sample(n, .8 * n)
train_lm_x <- pets[train.indices, c(-17,-4,-5,-6)]
train_lm_y <- pets[train.indices, 17]
test_lm_x <- pets[-train.indices, c(-17,-4,-5,-6)]
test_lm_y <- pets[-train.indices, 17]

L_model <- lm(train_lm_y~.,data = train_lm_x)
LM_preds <- predict(L_model, test_lm_x)
LM_preds <- round(LM_preds, 0)

table(LM_preds, test_lm_y)

##         test_lm_y
## LM_preds   0   1   2   3   4
##        2   0   1   1   2   0
##        3  34 453 448 343 278
##        4  31 198 341 273 588
##        5   1   1   0   1   4
##        6   0   0   0   0   1

test_error <- 1- mean(LM_preds == test_lm_y)
print(test_error)

## [1] 0.6892297

kappa <- cohen.kappa(x=cbind(test_lm_y ,LM_preds))
print(kappa$weighted.kappa)

## [1] 0.1797893

The end result is that a simple linear regression yields an error rate of ~0.7, and a weighted kappa of between 0.15 and 0.18, usually around 0.168. Neither of these are good, and indicate that a linear regression is not a good method for predicting the data. This is likely due to the fact that the majority of variables are categorical in this dataset, but we’re treating them like they have numerical meaning here.

Lasso Regression

We also tried a lasso regression, and this performed worse than the simple linear regression. The code did run successfully as of 4/15/20, but stopped working the following day for no apparent reason. It was tested on two machines, and in both cases froze the R console and required the use of task manager to forcibly terminate R studio, meaning the issue is likely with the interaction between Rstudio and Windows/ antivirus.
Ultimately, the data was not well suited for a lasso, as glmnet requires the input to be in the form of a model matrix. Converting the data frame to a model matrix resulted in a bloated mess of 232 columns, instead of the original 21. This is due to the fact that it transformed factors into one-hot-encoded vectors, one for each level of the factor.
The model may be improved by removing the breed and three color columns that contribute the most to the expansion, however we were unable to test this due to technical issues.
From the time when the model worked, the test error rate was greater than 0.8, which means it was actually worse than randomly placing each pet in one of the five bins.

# glmnet, the function used to perform the lasso wants a model matrix.
# this does have the unfortunate side effect of blowing up factors into one-hot encoded
# columns, so it goes from a 21 column dataframe to a 232 column monstrosity.
#my.mm <- model.matrix(AdoptionSpeed ~ ., pets)

# remove column 18, which correpsonds to adoption speed, i.e. the response variable
#my.mm <- model.matrix(AdoptionSpeed ~ ., pets)[, -18]


##############################################################
# POSSIBLE IMPROVEMENTS #######################################

# try commenting out the line below and uncommenting the one below that

# remove column 18, which correpsonds to adoption speed, i.e. the response variable
#my.mm <- model.matrix(AdoptionSpeed ~ ., pets)[, -18]

# remove column 3, the breed, 5,6,7 the color, 16 the state, and 21 the rank of the 
# rescuer. This will drastically reduce the bloating that occurs and may well yield a 
# better model.

#my.mm <- model.matrix(AdoptionSpeed ~ ., pets)[,c(-18,-3,-5,-6,-7,-21,-16)]
################################################################

# Note that the X's have to be in model matrix format, while the Y's are in 
# data frame format
#train.indices <- sample(n, .8 * n)
#train.x <- my.mm[train.indices, ]
#test.x <- my.mm[-train.indices, ]
#train.y <- pets$AdoptionSpeed[train.indices]
#test.y <- pets$AdoptionSpeed[-train.indices]
#train.data <- data.frame(train.x, Adoption_speed=train.y)

# we perform a grid search to find the best lambda. Virtually no improvement
# searching over 1000 units as compared to 100. The lambda is very small.
#grid = 10^seq(5, -3, length = 100)

# glmnet will allow multinomial regression, a regular glm does not have this option
# but it makes sense here to actually have it classify rather than run a linear regression
#mod.lasso = glmnet(train.x, train.y, alpha = 1, lambda = grid,
             #      family = "multinomial")

# the cross validation no longer works on my machine, it just freezes R
# this should run 10 times and allow us to identify what the best lambda found was
# for use in later models.
# We can save time by just hard coding the best lambda after running once 
#cv.out.class <- cv.glmnet(train.x, train.y, alpha = 1, lambda = grid,
           #               family = "multinomial")

#best_lambda = cv.out.class$lambda.min
#print(best_lambda)

# predictions are in the form of probabilities so we need to pick the most likely
# option
#lasso.pred = predict(mod.lasso, s = best_lambda, newx = test.x,
     #                type = "response")
#pick_5 <- c()
#predictions <- c()
# the prediction is in a long list with each set of 5 being the probabilities
# of adoption speed being 0-4 for a given pet.
# this is not a list of lists, but a continuous string so we need a for loop
# to group them into proper sets of five. Afterwards we pick the most probable 
# as our prediction
#for(i in 1:length(lasso.pred)){
#  pick_5 <- c(pick_5, lasso.pred[i])
#  if(i%%5 == 0){
#    predictions <- c(predictions, which.max(pick_5)-1)
#    pick_5 <- c()
#  }
#}

# confusion matrix
#table(predictions, test.y)

# compute test error rate (1-test error = accuracy)
#test_error <- 1- mean(predictions == test.y)
#print(test_error)
# compute kappa object
#kappa <- cohen.kappa(x=cbind(test.y ,predictions))



##################################################################
#### Probably Broken #############################################
#test_out <- glmnet(test.x, test.y, alpha = 1, lambda = best_lambda,
#                   family = "multinomial")
#lasso.coef = predict(test_out, type = "coefficients", s = best_lambda)
#print(lasso.coef)
####################################################################

KNN

For a KNN model, we had to convert the Yes/No columns MeaningfulName and PureBreed to binary columns. Aside from that, we dropped the “rank” column because it has “other” as a rank, which logically is quite different from the sixth highest ranked rescuer ID. Furthermore, the three color columns and the state column were dropped, as their numerical values are too high to compare reasonably with the other categorical variables. We try 9 different values of K, ranging from 10 to 50 and pick the best one as our final model.

# KNN does not allow strings. Presumably because you can't calculate the euclidean
# distance between "Yes" and "No", however for those they can easily be turned into
# 1 and 0
# 2. Create a new column "MeaningfulName": Yes if the Name length>2, else No
pets$MeaningfulName <- ifelse(pets$MeaningfulName == "Yes","1","0")
# 3. Create a new column "PureBreed": Yes if Breed2==0, else No
pets$PureBreed <- ifelse(pets$PureBreed=="Yes","1","0")
n <- nrow(pets)
train.indices <- sample(n, .8 * n)
# remove categorical values that aren't represented properly by small integer values
# This means breed, the three color columns, the state (which had values in the 40,000's)
# and the rank column for rescuers (which has an 'other' column).
train_lm_x <- pets[train.indices, c(-17,-4,-5,-6,-20,-15)]
train_lm_y <- pets[train.indices, 17]
test_lm_x <- pets[-train.indices, c(-17,-4,-5,-6,-20,-15)]
test_lm_y <- pets[-train.indices, 17]


errors <- c()
kappas <- c()
# we try 9 different values of K and find the best one
kGrid = c(10,15,20,25,30,35,40,45,50)
for( i in 1:9){
  fit <- knn(train = train_lm_x, test=test_lm_x,
                      cl = train_lm_y, k = kGrid[i])
  errors[i] <- 1- mean(fit == test_lm_y)
  kappas[i] <- cohen.kappa(x=cbind(test_lm_y ,fit))
}

best_K = kGrid[which.max(kappas)]
best_kappa = kappas[which.max(kappas)]
best_test_err = errors[which.min(errors)]
# K = 25
# kappa = 0.163
# error = 0.619
sprintf("Best value of K: %f", best_K)

## [1] "Best value of K: 35.000000"

sprintf("Weighted kappa for best K: %f", best_kappa)

## [1] "Weighted kappa for best K: 0.129456"

sprintf("Test error for best K: %f", best_test_err)

## [1] "Test error for best K: 0.644215"

Generally, the best K is 25, though we have seen seeds where it was something else. Weighted kappa tends to be around 0.15 or 0.16, with test error around 0.6.

Random Forest

For this part, we will build up a random forest model using default settings as baseline, and then use the train() function from caret package to tune a few parameters in search for a model with better performance.

Training-test split

First of all, we need to split the pets dataset into training vs. test set, the proportion of training set as 75%.

# Train test split: Partition the data using caret's createDataPartition()
set.seed(1234)
train_index <- createDataPartition(pets$AdoptionSpeed,p=0.75,list=FALSE)
pets.train <- pets[train_index,]
pets.test <- pets[-train_index,]

Train a baseline Random Forest

Then we train a Random Forest model using default settings and display a few parameters.

rf.0 <- randomForest(AdoptionSpeed~., data = pets.train)
## Show the default settings:
print(paste("Number of features selected at each split:",rf.0$mtry))

## [1] "Number of features selected at each split: 4"

print(paste("Number of trees in the forest:",rf.0$ntree))

## [1] "Number of trees in the forest: 500"

Now we will make predictions of Adoption Speed on the test set and evaluate the model’s performance.

pred.0 <- predict(rf.0, newdata = pets.test)
# Accuracy:
print(paste("Accuracy:",mean(as.integer(pets$AdoptionSpeed[-train_index]==pred.0))))

## [1] "Accuracy: 0.390549919914576"

# Kappa
cohen.kappa(x=cbind(pets$AdoptionSpeed[-train_index],pred.0))

## Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels)
## 
## Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries 
##                  lower estimate upper
## unweighted kappa  0.16     0.19  0.21
## weighted kappa    0.25     0.31  0.36
## 
##  Number of subjects = 3746

Tune parameters

Next, we will vary the number of features and minimum node sizes and use cross-validation to search for the combination that results in model with best Quadratic Weighted Kappa.

When creating a search grid and specifying series of the available parameters, Note that the available parameters to tune are determined by the method we will use in the train() function. There are many methods options for random forest models, to tune minimum node size and features at split we decided to use Rborist

rfGrid.1 <- expand.grid(predFixed = c(3, 6, 9), 
                    minNode = c(1,5,10))

# Create control object
ctrl.1 <- trainControl(method = "cv",
                     number = 3, # number of folds
                     selectionFunction = "best",
                     search = "grid", # grid search
                     verboseIter = FALSE, # print a training log
                     returnResamp = "all" 
                     )

Now we will apply grid search to train 15 different models, use cross-validation to select the best one, and then train the best model again with the full training set.

# set.seed(1234)
# rf.1 <- train(AdoptionSpeed~.,
#               data = pets.train,
#               method = 'Rborist', # rf can choose from a variety of methods; tuning parameters are decided by method
#               metric = "Kappa", # by default accuracy and Kappa are computed for classification
#               trControl = ctrl.1,
#               tuneGrid = rfGrid.1
#               )

We can display the best combination of parameters:

# rf.1$bestTune
# predFixed: 6
# minNode: 10

Using the best parameters to retrain a random forest model:

set.seed(1234)
rf.best <- randomForest(AdoptionSpeed~., data = pets.train, mtry=6, nodesize=10)
pred.best <- predict(rf.best, newdata = pets.test)
# Accuracy:
mean(as.integer(pets$AdoptionSpeed[-train_index]==pred.best))

## [1] 0.3950881

# Kappa
cohen.kappa(x=cbind(pets$AdoptionSpeed[-train_index],pred.best))

## Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels)
## 
## Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries 
##                  lower estimate upper
## unweighted kappa  0.17     0.19  0.21
## weighted kappa    0.26     0.31  0.37
## 
##  Number of subjects = 3746

Compared to the default setting model rf.0, this new model does not achieve a significantly better accuracy and Quadratic Weighted Kappa. This might result from that the default settings are usually fair enough, and might also be due to our choice of parameters values to choose from.
Given more time, we can expand the parameter grid by specifying more values and see if we could find a better combination.
We will analyze feature importance in the next part.

Model Interpretation and Conclusions

Linear & KNN

From our attempts at using linear regression and KNN, there is a rather obvious takeaway from this project. Namely, you probably shouldn’t use these when you have mostly categorical variables. In both cases, the predictive strength was better than random chance at least. However, the fact is that of the original 21 variables, only 4 were continuous, the other 17 were categorical. As such, you would likely be best served with a model that is specialized for datasets like this.
It would have been interesting to see whether the lasso reduced any coefficients to zero, but technical issues prevented us from following through on this idea.

Random Forest

To have a clear picture of which features were heavily relied on in building the random forest model, we decide to display feature importance.

var_importance <- data.frame(variables=setdiff(colnames(pets), c("AdoptionSpeed",'Breed1')),
                             importance=varImp(rf.best),row.names = NULL)
var_importance[order(var_importance$Overall,decreasing = TRUE),]

##         variables   Overall
## 2             Age 515.54106
## 16       PhotoAmt 467.60510
## 5          Color2 407.75793
## 4          Color1 371.03137
## 15          State 344.95819
## 6          Color3 203.59322
## 14            Fee 189.51396
## 8       FurLength 172.25524
## 13       Quantity 169.93823
## 7    MaturitySize 160.36168
## 3          Gender 155.80157
## 11     Sterilized 149.26986
## 19           rank 135.82440
## 10       Dewormed 135.11197
## 9      Vaccinated 134.68680
## 18      PureBreed 110.81332
## 1            Type  89.78642
## 17 MeaningfulName  77.43344
## 12         Health  45.49620

We can see that Age and PhotoAmt are the most important features, then come categorical features such as Color2, Color1 and State. This result makes sense and is aligned with the findings from EDA.

Reflection

In retrospect, we noticed that the data we have mainly consists of categorical variables. So the nature of the problem probably determines that the tree-based methods that do not require one-hot-encoding categorical variables will generate the best result, because both linear model and KNN would suffer from curse of high dimensionality after one-hot-encoding. Based on our attempts and the models’ performance, we do find that the Random Forest model yielded the best accuracy and Quardratic Weighted Kappa.

We also found that many notebooks focused purely on modeling and predictions rather than on model interpretability and business impacts. We believe in practice these two factors are also essential, so that’s why we would like to add the feature importance. These tools enable us to identify important variables, based on which we can generate potential suggestions for PetFinder.my if we had the chance.

Last but not least, based on the fact that the top score on kaggle still has a weighted kappa under 0.5, we would guess that the dataset itself isn’t large/good enough to have high predictive strength. It would probably need to be 10-100 times as large, and have the cats and dogs separated.