Handling Missing Values with plssem

The pls() function offers a few basic approaches for handling missing values in the data, selected via the missing argument. Three options are currently available:

Listwise deletion (missing = "listwise")
Mean imputation (missing = "mean")
k nearest neighbors (kNN) imputation (missing = "kNN")

The last two options are single imputation approaches. The pls() function does not currently offer any multiple imputation approaches, but at the end of the vignette we show how users can perform multiple imputation themselves with the mice package.

Listwise Deletion

With missing = "listwise" (the default), any observation (i.e., a row) containing missing values for the variables used in the model is removed. For example:

model <- "Survived ~ Age + Female + Age:Female"
fit <- pls(model, data = titanic, missing = "listwise", ordered = "Survived")
#> plssem->getPLS_Data():  
#>    Removing missing data using listwise deletion...
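
To make the effect of listwise deletion concrete, the following is a minimal base R sketch of the same idea. The toy data frame and the names toy, model_vars and toy_listwise are made up for illustration; this is not the internal plssem implementation.

```r
# Hypothetical toy data (not the titanic data used above)
toy <- data.frame(
  Survived = c(1, 0, NA, 1, 0),
  Age      = c(22, NA, 35, 54, 41),
  Female   = c(0, 1, 1, 0, 1),
  Fare     = c(7.25, 71.3, NA, 51.9, NA) # not used in the model
)

# Only the variables that appear in the model decide which rows are dropped
model_vars <- c("Survived", "Age", "Female")

# Keep the rows that are complete for the model variables
keep <- complete.cases(toy[model_vars])
toy_listwise <- toy[keep, ]

nrow(toy)          # 5 rows before deletion
nrow(toy_listwise) # 3 rows after deletion (rows 2 and 3 are dropped)
```

Row 5 is kept even though Fare is missing, because Fare is not one of the model variables in this sketch.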

Mean Imputation

With missing = "mean", missing values are imputed with (univariate) expected values. For continuous variables, missing values are imputed with the mean. For ordinal variables with more than two categories, missing values are imputed with the median. For binary ordered variables, missing values are imputed with the mode.

In our example, missing values in Age are imputed with the mean of Age. Both Survived and Female are binary variables, so their missing values are imputed with the most common value.

model <- "Survived ~ Age + Female + Age:Female"
fit <- pls(model, data = titanic, missing = "mean", ordered = "Survived")
#> plssem->getPLS_Data():  
#>    Imputing missing data using mean imputation...
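
The three rules described above (mean, median, mode) can be sketched in base R as follows. The helpers stat_mode() and impute_simple() and the toy vectors are hypothetical and only illustrate the idea; they are not plssem functions, and plssem may handle details such as ties differently.

```r
# Most frequent non-missing value (ties: the value that appears first)
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Replace missing values with a single summary statistic chosen by variable type
impute_simple <- function(x, type = c("continuous", "ordinal", "binary")) {
  type <- match.arg(type)
  fill <- switch(type,
    continuous = mean(x, na.rm = TRUE),
    ordinal    = median(x, na.rm = TRUE), # may fall between categories if n is even
    binary     = stat_mode(x)
  )
  x[is.na(x)] <- fill
  x
}

# Hypothetical toy vectors
age      <- c(20, NA, 40, 30)      # continuous
pclass   <- c(1, 3, NA, 3, 2, 3)   # ordinal with three categories
survived <- c(1, 0, 0, NA, 0)      # binary

impute_simple(age, "continuous")   # NA replaced by the mean, 30
impute_simple(pclass, "ordinal")   # NA replaced by the median, 3
impute_simple(survived, "binary")  # NA replaced by the mode, 0
```

Because every missing value in a variable receives the same fill value, this kind of imputation shrinks the variance of the imputed variable, which is one reason to also consider the approaches below.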

kNN Imputation

With missing = "kNN", missing values are imputed by finding the k nearest (complete data) neighbors of an observation with missing data. The values of the k neighbors are then aggregated using either the mean, the median or the mode, depending on the data type of the variable. The number of neighbors, k, can be specified using the knn.k argument.

model <- "Survived ~ Age + Female + Age:Female"
fit <- pls(model, data = titanic, missing = "kNN",
           ordered = "Survived", knn.k = 5) # use the 5 nearest neighbors
#> plssem->getPLS_Data():  
#>    Imputing missing data using k-Nearest Neighbors (kNN), k = 5...
#> Warning: plssem->mcpls():  
#>    Base fit is inadmissible! The MC-PLS algorithm might not converge to a 
#>    proper solution!
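
The general idea behind kNN imputation can be sketched in base R as shown below. This is only an illustration under stated assumptions: the function knn_impute() is hypothetical, it uses Euclidean distances on standardized columns, and it aggregates with the mean only. The distance measure and other details used internally by plssem may differ.

```r
# Minimal kNN imputation sketch for numeric data (hypothetical helper)
knn_impute <- function(data, k = 5) {
  X <- as.matrix(data)
  Z <- scale(X) # standardize columns; scale() ignores NAs when centering/scaling
  complete <- complete.cases(X)
  donors <- which(complete) # only complete rows can act as neighbors
  stopifnot(length(donors) >= k)

  for (i in which(!complete)) {
    miss <- is.na(X[i, ])
    obs  <- !miss
    # Euclidean distance to each donor, using only the columns observed in row i
    diffs <- sweep(Z[donors, obs, drop = FALSE], 2, Z[i, obs])
    d <- sqrt(rowSums(diffs^2))
    nn <- donors[order(d)[seq_len(k)]]
    # Aggregate the neighbors' values; the mean is used here, the median or
    # mode would be the analogue for ordinal or binary variables
    X[i, miss] <- colMeans(X[nn, miss, drop = FALSE])
  }
  as.data.frame(X)
}

# Hypothetical toy data: the last row is missing y
toy <- data.frame(
  x = c(1, 2, 3, 10, 11, 12, 2),
  y = c(1, 2, 3, 10, 11, 12, NA)
)

imputed <- knn_impute(toy, k = 3)
imputed$y[7] # mean of y over the three rows closest in x (rows 2, 1 and 3): 2
```

Unlike mean imputation, the imputed value here depends on the other variables observed for the same row, so rows with different x values would generally receive different imputed y values.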

Multiple Imputation

Multiple imputation cannot be performed with the pls() function alone, but it can be done in combination with one of the multiple imputation packages available in R. Here we use the mice package, but other packages (e.g., the Amelia package) can be used as well.

library(mice)
#> 
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind

m <- 20 # Number of imputations
vars <- c("Survived", "Age", "Female") # Variables to impute/use in the analysis

imputations <- mice(titanic[vars], m = m)
#> 
#>  iter imp variable
#>   1   1  Survived  Age
#>   1   2  Survived  Age
#>   1   3  Survived  Age
#>   1   4  Survived  Age
#>   1   5  Survived  Age
#>   1   6  Survived  Age
#>   1   7  Survived  Age
#>   1   8  Survived  Age
#>   1   9  Survived  Age
#>   1   10  Survived  Age
#>   1   11  Survived  Age
#>   1   12  Survived  Age
#>   1   13  Survived  Age
#>   1   14  Survived  Age
#>   1   15  Survived  Age
#>   1   16  Survived  Age
#>   1   17  Survived  Age
#>   1   18  Survived  Age
#>   1   19  Survived  Age
#>   1   20  Survived  Age
#>   2   1  Survived  Age
#>   2   2  Survived  Age
#>   2   3  Survived  Age
#>   2   4  Survived  Age
#>   2   5  Survived  Age
#>   2   6  Survived  Age
#>   2   7  Survived  Age
#>   2   8  Survived  Age
#>   2   9  Survived  Age
#>   2   10  Survived  Age
#>   2   11  Survived  Age
#>   2   12  Survived  Age
#>   2   13  Survived  Age
#>   2   14  Survived  Age
#>   2   15  Survived  Age
#>   2   16  Survived  Age
#>   2   17  Survived  Age
#>   2   18  Survived  Age
#>   2   19  Survived  Age
#>   2   20  Survived  Age
#>   3   1  Survived  Age
#>   3   2  Survived  Age
#>   3   3  Survived  Age
#>   3   4  Survived  Age
#>   3   5  Survived  Age
#>   3   6  Survived  Age
#>   3   7  Survived  Age
#>   3   8  Survived  Age
#>   3   9  Survived  Age
#>   3   10  Survived  Age
#>   3   11  Survived  Age
#>   3   12  Survived  Age
#>   3   13  Survived  Age
#>   3   14  Survived  Age
#>   3   15  Survived  Age
#>   3   16  Survived  Age
#>   3   17  Survived  Age
#>   3   18  Survived  Age
#>   3   19  Survived  Age
#>   3   20  Survived  Age
#>   4   1  Survived  Age
#>   4   2  Survived  Age
#>   4   3  Survived  Age
#>   4   4  Survived  Age
#>   4   5  Survived  Age
#>   4   6  Survived  Age
#>   4   7  Survived  Age
#>   4   8  Survived  Age
#>   4   9  Survived  Age
#>   4   10  Survived  Age
#>   4   11  Survived  Age
#>   4   12  Survived  Age
#>   4   13  Survived  Age
#>   4   14  Survived  Age
#>   4   15  Survived  Age
#>   4   16  Survived  Age
#>   4   17  Survived  Age
#>   4   18  Survived  Age
#>   4   19  Survived  Age
#>   4   20  Survived  Age
#>   5   1  Survived  Age
#>   5   2  Survived  Age
#>   5   3  Survived  Age
#>   5   4  Survived  Age
#>   5   5  Survived  Age
#>   5   6  Survived  Age
#>   5   7  Survived  Age
#>   5   8  Survived  Age
#>   5   9  Survived  Age
#>   5   10  Survived  Age
#>   5   11  Survived  Age
#>   5   12  Survived  Age
#>   5   13  Survived  Age
#>   5   14  Survived  Age
#>   5   15  Survived  Age
#>   5   16  Survived  Age
#>   5   17  Survived  Age
#>   5   18  Survived  Age
#>   5   19  Survived  Age
#>   5   20  Survived  Age

COEF <- NULL # Matrix with estimated coefficients for each imputation
BOOT <- NULL # Matrix with all the bootstraps from all imputations

model <- "Survived ~ Age + Female + Age:Female"

for (i in seq_len(m)) {
  fit.i <- pls(model, data = complete(imputations, i), # get the ith imputation
               ordered = "Survived",
               bootstrap = TRUE,
               boot.R = 100,
               boot.parallel = "multicore", # Use parallel bootstrap
               boot.ncores = 2L)

  COEF <- rbind(COEF, coef(fit.i))
  BOOT <- rbind(BOOT, boot(fit.i))
}
#> Warning: plssem->bootstrap():  
#>    Kept 8 (out of 100) bootstrap replicate(s) with inadmissible solutions.
#> Warning: plssem->mcpls():  
#>    Base fit is inadmissible! The MC-PLS algorithm might not converge to a 
#>    proper solution!
#> Warning: plssem->bootstrap():  
#>    Kept 34 (out of 100) bootstrap replicate(s) with inadmissible solutions.
#> Warning: plssem->bootstrap():  
#>    Kept 16 (out of 100) bootstrap replicate(s) with inadmissible solutions.
#> Warning: plssem->bootstrap():  
#>    Kept 25 (out of 100) bootstrap replicate(s) with inadmissible solutions.
#> Warning: plssem->mcpls():  
#>    Base fit is inadmissible! The MC-PLS algorithm might not converge to a 
#>    proper solution!
#> Warning: plssem->bootstrap():  
#>    Kept 20 (out of 100) bootstrap replicate(s) with inadmissible solutions.
#> Warning: plssem->mcpls():  
#>    Base fit is inadmissible! The MC-PLS algorithm might not converge to a 
#>    proper solution!
#> Warning: plssem->bootstrap():  
#>    Kept 32 (out of 100) bootstrap replicate(s) with inadmissible solutions.
#> Warning: plssem->bootstrap():  
#>    Kept 11 (out of 100) bootstrap replicate(s) with inadmissible solutions.
#> Warning: plssem->mcpls():  
#>    Base fit is inadmissible! The MC-PLS algorithm might not converge to a 
#>    proper solution!
#> Warning: plssem->bootstrap():  
#>    Kept 30 (out of 100) bootstrap replicate(s) with inadmissible solutions.
#> Warning: plssem->mcpls():  
#>    Base fit is inadmissible! The MC-PLS algorithm might not converge to a 
#>    proper solution!
#> Warning: plssem->bootstrap():  
#>    Kept 26 (out of 100) bootstrap replicate(s) with inadmissible solutions.
#> Warning: plssem->mcpls():  
#>    Base fit is inadmissible! The MC-PLS algorithm might not converge to a 
#>    proper solution!
#> Warning: plssem->bootstrap():  
#>    Kept 27 (out of 100) bootstrap replicate(s) with inadmissible solutions.
#> Warning: plssem->mcpls():  
#>    Base fit is inadmissible! The MC-PLS algorithm might not converge to a 
#>    proper solution!
#> Warning: plssem->bootstrap():  
#>    Kept 23 (out of 100) bootstrap replicate(s) with inadmissible solutions.
#> Warning: plssem->bootstrap():  
#>    Kept 31 (out of 100) bootstrap replicate(s) with inadmissible solutions.
#> Warning: plssem->bootstrap():  
#>    Kept 6 (out of 100) bootstrap replicate(s) with inadmissible solutions.
#> Warning: plssem->mcpls():  
#>    Base fit is inadmissible! The MC-PLS algorithm might not converge to a 
#>    proper solution!
#> Warning: plssem->bootstrap():  
#>    Kept 24 (out of 100) bootstrap replicate(s) with inadmissible solutions.
#> Warning: plssem->mcpls():  
#>    Base fit is inadmissible! The MC-PLS algorithm might not converge to a 
#>    proper solution!
#> Warning: plssem->bootstrap():  
#>    Kept 21 (out of 100) bootstrap replicate(s) with inadmissible solutions.
#> Warning: plssem->mcpls():  
#>    Base fit is inadmissible! The MC-PLS algorithm might not converge to a 
#>    proper solution!
#> Warning: plssem->bootstrap():  
#>    Kept 36 (out of 100) bootstrap replicate(s) with inadmissible solutions.
#> Warning: plssem->bootstrap():  
#>    Kept 12 (out of 100) bootstrap replicate(s) with inadmissible solutions.
#> Warning: plssem->mcpls():  
#>    Base fit is inadmissible! The MC-PLS algorithm might not converge to a 
#>    proper solution!
#> Warning: plssem->bootstrap():  
#>    Kept 22 (out of 100) bootstrap replicate(s) with inadmissible solutions.
#> Warning: plssem->bootstrap():  
#>    Kept 8 (out of 100) bootstrap replicate(s) with inadmissible solutions.
#> Warning: plssem->mcpls():  
#>    Base fit is inadmissible! The MC-PLS algorithm might not converge to a 
#>    proper solution!
#> Warning: plssem->bootstrap():  
#>    Kept 16 (out of 100) bootstrap replicate(s) with inadmissible solutions.

round(apply(COEF, MARGIN = 2, FUN = mean), 3) # Mean estimate across imputations
#>     Survived<~Survived               Age<~Age         Female<~Female 
#>                  1.000                  1.000                  1.000 
#>           Survived~Age        Survived~Female    Survived~Age:Female 
#>                 -0.071                  0.674                  0.180 
#>     Survived~~Survived               Age~~Age            Age~~Female 
#>                  0.480                  1.000                 -0.060 
#>        Age~~Age:Female         Female~~Female     Female~~Age:Female 
#>                 -0.002                  1.000                  0.000 
#> Age:Female~~Age:Female            Survived|t1 
#>                  1.001                  0.222
round(apply(BOOT, MARGIN = 2, FUN = sd), 3)   # Standard errors (SD of the pooled bootstrap draws)
#>     Survived<~Survived               Age<~Age         Female<~Female 
#>                  0.000                  0.000                  0.000 
#>           Survived~Age        Survived~Female    Survived~Age:Female 
#>                  0.053                  0.067                  0.060 
#>     Survived~~Survived               Age~~Age            Age~~Female 
#>                  0.062                  0.040                  0.053 
#>        Age~~Age:Female         Female~~Female     Female~~Age:Female 
#>                  0.063                  0.016                  0.044 
#> Age:Female~~Age:Female            Survived|t1 
#>                  0.055                  0.092
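
The same pooling logic can be taken one step further to obtain z-values and percentile intervals from the pooled bootstrap draws. The sketch below uses simulated stand-ins (COEF_toy and BOOT_toy, with made-up parameters b1 and b2) instead of the COEF and BOOT matrices from the fits above, so the numbers carry no substantive meaning. It is one possible way to summarize the results, not a procedure prescribed by plssem.

```r
set.seed(1)

m <- 20  # number of imputations
R <- 100 # bootstrap replicates per imputation
params <- c("b1", "b2") # hypothetical parameter names

# Simulated stand-in for COEF: one row of estimates per imputation
COEF_toy <- matrix(
  rnorm(m * 2, mean = rep(c(0.5, -0.2), each = m), sd = 0.02),
  nrow = m, dimnames = list(NULL, params)
)

# Simulated stand-in for BOOT: bootstrap draws stacked across all imputations
BOOT_toy <- matrix(
  rnorm(m * R * 2, mean = rep(c(0.5, -0.2), each = m * R), sd = 0.06),
  nrow = m * R, dimnames = list(NULL, params)
)

est <- colMeans(COEF_toy)        # mean estimate across imputations
se  <- apply(BOOT_toy, 2, sd)    # SD of the pooled bootstrap draws
ci  <- apply(BOOT_toy, 2, quantile, probs = c(0.025, 0.975)) # percentile interval

pooled <- data.frame(
  est      = est,
  se       = se,
  z        = est / se,
  ci.lower = ci[1, ],
  ci.upper = ci[2, ]
)
round(pooled, 3)
```

With the real fits, COEF_toy and BOOT_toy would be replaced by COEF and BOOT. Because the bootstrap draws are stacked across imputations, their spread reflects both the sampling variability within each imputed data set and the variability between imputations.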