Predictive modeling

modeling

Published

June 5, 2022

Overview

This post compares the Caret and TidyModels R packages. Data modeling with R runs from data preprocessing and parameter assessment through to predicting an outcome. Both sets of packages can be used to achieve the same results, with the aim of finding the best predictive performance for models fitted to a specific dataset.

The Caret package is a good starting point for understanding how to manage models and produce unbiased predictions with R. Like the TidyModels meta package, it provides a common model syntax for managing several models applied to the same set of data. TidyModels chains a set of functions together, in partnership with the TidyVerse grammar, to build a structured model base that combines different models into one overall workflow.

What follows is an attempt at comparing the two predictive modeling frameworks.

Caret package

The most important functions in this package, grouped by modeling step, are:

Preprocessing (data cleaning/wrangling)
- preProcess()
Data splitting and resampling
- createDataPartition()
- createResample()
- createTimeSlices()
Model fit and prediction
- train()
- predict()
Model comparison
- confusionMatrix()
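
As a rough illustration of the preprocessing and resampling helpers listed above, the sketch below applies them to the spam data from kernlab. The seed, the number of bootstrap resamples, and the time-slice window sizes are arbitrary values chosen for the example, and the names train_df, test_df, and predictors are hypothetical.

```r
library(caret)
data(spam, package = "kernlab")

set.seed(1)  # arbitrary seed, only to make the sketch repeatable

# Stratified 75/25 split on the outcome
idx <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
train_df <- spam[idx, ]
test_df  <- spam[-idx, ]

# Learn centering and scaling from the training rows only,
# then apply the same transformation to both sets
predictors <- setdiff(names(spam), "type")
pp <- preProcess(train_df[, predictors], method = c("center", "scale"))
train_x <- predict(pp, train_df[, predictors])
test_x  <- predict(pp, test_df[, predictors])

# Bootstrap resamples: a list of row-index vectors into the training set
boots <- createResample(y = train_df$type, times = 5)

# Rolling-origin slices for ordered data (illustrated on a dummy sequence)
slices <- createTimeSlices(y = 1:20, initialWindow = 10, horizon = 2)
str(slices$train[1:2])
str(slices$test[1:2])
```

Estimating the preprocessing parameters on the training rows and reusing them on the test rows keeps information from the test set out of the model.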

TidyModels meta package

This “meta package” bundles a set of packages for modeling, supported by other well-known packages for data manipulation and visualization such as broom, dplyr, ggplot2, purrr, infer, modeldata, and tibble. Its modeling packages are:

recipes (a preprocessor)
rsample (for resampling)
parsnip (model syntax)
tune and dials (optimization of hyperparameters)
workflows and workflowsets (combine pre-processing steps and models)
yardstick (for evaluating models)

The most important functions in this meta package, grouped by modeling step, are:

Preprocessing (data cleaning/wrangling)
- recipes::recipe()
- recipes::step_*()
Data splitting and resampling
- rsample::initial_split()
- rsample::training()
- rsample::testing()
- rsample::bootstraps()
- rsample::vfold_cv()
- tune::control_resamples()
Model fit and prediction
- parsnip::<model>() %>% set_mode() %>% set_engine(), where <model> is a model function such as logistic_reg()
- parsnip::extract_fit_engine()
- parsnip::extract_fit_parsnip()
- parsnip::fit()
- stats::predict()
- tune::fit_resamples()
Model workflow
- workflows::workflow() %>% add_model()
- workflows::add_formula()
- workflows::add_recipe()
- parsnip::fit()
- stats::predict()
- workflows::update_formula()
- workflows::add_variables() / remove_variables()
- workflowsets::workflow_set()
- workflowsets::workflow_map()
- workflowsets/tune::extract_workflow() / extract_recipe() / extract_fit_parsnip()
- tune::last_fit()
- workflowsets/tune::collect_metrics()
- workflowsets/tune::collect_predictions()
Model comparison
- yardstick::conf_mat()
- yardstick::accuracy()
- yardstick::metric_set()
- yardstick::roc_curve()
- yardstick::roc_auc()
- yardstick::sensitivity()
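
To show how several of the functions above fit together, here is a minimal sketch that bundles a recipe and a model into a workflow, resamples it, and then evaluates it once on the test set. It assumes the spam data from kernlab; the preprocessing steps, the 5-fold setting, and the object names (rec, spec, wf, folds) are illustrative choices rather than recommendations.

```r
library(tidymodels)
tidymodels_prefer()
data(spam, package = "kernlab")

set.seed(123)
split <- initial_split(spam, prop = 0.75, strata = type)
train_df <- training(split)

# Preprocessing: drop zero-variance predictors, then center and scale
rec <- recipe(type ~ ., data = train_df) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

# Model specification: logistic regression fitted with stats::glm()
spec <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm")

# A workflow keeps the recipe and the model together
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(spec)

# 5-fold cross-validation on the training set
folds <- vfold_cv(train_df, v = 5, strata = type)
metrics <- metric_set(accuracy, roc_auc)
res <- fit_resamples(wf, resamples = folds, metrics = metrics,
                     control = control_resamples(save_pred = TRUE))
cv_metrics <- collect_metrics(res)
cv_metrics

# Fit on the full training set and evaluate once on the test set
final <- last_fit(wf, split, metrics = metrics)
test_metrics <- collect_metrics(final)
test_metrics
```

Because the recipe is part of the workflow, its steps are re-estimated inside every fold, so the resampled metrics are not contaminated by the held-out rows.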

Machine learning algorithms in R

Linear discriminant analysis
Regression
Naive Bayes
Support vector machines
Classification and regression trees
Random forests
Boosting
etc.
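
In Caret, each algorithm is selected through the method string passed to train(). The sketch below is one possible way to fit two of the algorithms listed above (linear discriminant analysis via "lda" and a classification tree via "rpart") on identical folds and compare them with resamples(). The seed, the choice of 5 folds, and the labels lda and cart are assumptions made for the example.

```r
library(caret)
data(spam, package = "kernlab")

set.seed(2022)  # arbitrary seed for the fold assignment

# Fix the cross-validation folds so every model sees the same resamples
folds <- createFolds(spam$type, k = 5, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = folds)

# Named vector: label -> caret method string
methods <- c(lda = "lda", cart = "rpart")

# Fit each algorithm with the same formula, data, and resampling scheme
fits <- lapply(methods, function(m) {
  train(type ~ ., data = spam, method = m, trControl = ctrl)
})

# Collect the fold-level accuracy and kappa of both models
cmp <- resamples(fits)
summary(cmp)
```

Sharing the fold indices through trainControl(index = ...) is what makes the per-fold results directly comparable between models.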

Resource: Practical Machine Learning

Caret or TidyModels?


Caret Example with SPAM Data

library(caret); library(kernlab); data(spam)

Loading required package: ggplot2

Loading required package: lattice


Attaching package: 'kernlab'

The following object is masked from 'package:ggplot2':

    alpha

# Hold out 25% of the rows for testing, stratified on the outcome
inTrain <- createDataPartition(y = spam$type,
                               p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing <- spam[-inTrain, ]
# dim(training)

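# Seed the resampling done inside train(); the split above is drawn
# before this call, so it is not covered by the seed
set.seed(32343)
# Logistic regression on all predictors (default resampling: bootstrap)
modelFit <- train(type ~ ., data = training, method = "glm")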

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
(the same warning is printed 26 times in total)

# modelFit

predictions <- predict(modelFit,newdata=testing)
# predictions

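# Cross-tabulate predicted against true classes; the first factor level
# (nonspam) is treated as the positive class
cm <- confusionMatrix(predictions, testing$type)
cm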

Confusion Matrix and Statistics

          Reference
Prediction nonspam spam
   nonspam     666   52
   spam         31  401
                                          
               Accuracy : 0.9278          
                 95% CI : (0.9113, 0.9421)
    No Information Rate : 0.6061          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.8476          
                                          
 Mcnemar's Test P-Value : 0.02814         
                                          
            Sensitivity : 0.9555          
            Specificity : 0.8852          
         Pos Pred Value : 0.9276          
         Neg Pred Value : 0.9282          
             Prevalence : 0.6061          
         Detection Rate : 0.5791          
   Detection Prevalence : 0.6243          
      Balanced Accuracy : 0.9204          
                                          
       'Positive' Class : nonspam

plot(cm$table,main="Table")

TidyModels Example with SPAM Data

library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──

✔ broom        1.0.10     ✔ rsample      1.3.1 
✔ dials        1.4.2      ✔ tailor       0.1.0 
✔ dplyr        1.1.4      ✔ tidyr        1.3.1 
✔ infer        1.0.9      ✔ tune         2.0.1 
✔ modeldata    1.5.1      ✔ workflows    1.3.0 
✔ parsnip      1.3.3      ✔ workflowsets 1.1.1 
✔ purrr        1.2.0      ✔ yardstick    1.3.2 
✔ recipes      1.3.1

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::alpha()          masks kernlab::alpha(), ggplot2::alpha()
✖ dials::buffer()          masks kernlab::buffer()
✖ rsample::calibration()   masks caret::calibration()
✖ purrr::cross()           masks kernlab::cross()
✖ purrr::discard()         masks scales::discard()
✖ dplyr::filter()          masks stats::filter()
✖ dplyr::lag()             masks stats::lag()
✖ purrr::lift()            masks caret::lift()
✖ yardstick::precision()   masks caret::precision()
✖ yardstick::recall()      masks caret::recall()
✖ yardstick::sensitivity() masks caret::sensitivity()
✖ yardstick::specificity() masks caret::specificity()
✖ recipes::step()          masks stats::step()

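# Resolve function-name conflicts in favour of the tidymodels packages
tidymodels_prefer()
set.seed(123)
# 75/25 split, stratified on the outcome
split <- initial_split(spam, prop = 0.75, strata = type)
training <- training(split)
testing <- testing(split)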

# Fit on the training rows only, so the test set stays unseen by the model
modelFit <- logistic_reg() %>% 
  set_engine("glm") %>%
  fit(type ~ ., data = training)

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# tidy(modelFit)

# Returns a tibble with one column, .pred_class
predictions <- predict(modelFit, new_data = testing)
# predictions

testing$pred <- predictions$.pred_class
cm <- yardstick::conf_mat(data = testing, truth = type, estimate = pred)
cm

          Truth
Prediction nonspam spam
   nonspam     668   50
   spam         29  404

autoplot(cm)
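
Beyond the confusion matrix, yardstick can compute several metrics in one call. The following is a minimal, self-contained sketch that repeats the split and fit on the spam data and then scores the test set. The object names (train_df, test_df, fit_glm, scored, cls_metrics) are hypothetical, and the exact numbers depend on the seed, so no output is shown.

```r
library(tidymodels)
tidymodels_prefer()
data(spam, package = "kernlab")

set.seed(123)
split <- initial_split(spam, prop = 0.75, strata = type)
train_df <- training(split)
test_df  <- testing(split)

fit_glm <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(type ~ ., data = train_df)

# augment() appends .pred_class and the class probabilities
# (.pred_nonspam, .pred_spam) to the test data
scored <- augment(fit_glm, new_data = test_df)

# Bundle several class metrics into a single function
cls_metrics <- metric_set(accuracy, sensitivity, specificity)
class_results <- cls_metrics(scored, truth = type, estimate = .pred_class)
class_results

# Area under the ROC curve; yardstick treats the first factor level
# (nonspam) as the event, so its probability column is passed
auc <- roc_auc(scored, truth = type, .pred_nonspam)
auc
```

Like Caret's confusionMatrix(), yardstick uses the first factor level as the positive class by default, which keeps the sensitivity and specificity of the two examples comparable.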