In this session, we will use the package VIM, which provides many functions for exploring and analyzing the structure of missing or imputed values in data using graphical methods. Knowing this structure helps to identify the mechanism that generates the missing values and to spot errors that may have occurred during imputation. The package also implements several imputation methods, including:
- \(k\)-Nearest Neighbor imputation
- Hotdeck imputation
- IRMI (iterative robust model-based imputation)
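The examples in this session use kNN() only. As a minimal sketch, the other two methods listed above can be called in the same way through hotdeck() and irmi(), here with their default settings on the sleep data. The seed and the object names sleep_hd and sleep_irmi are assumptions chosen for illustration.

```r
library(VIM)
data(sleep, package = "VIM")  # make sure we use VIM's sleep data, not datasets::sleep

set.seed(1)  # hotdeck donors and irmi noise are random; the seed is arbitrary

# Hotdeck imputation with default settings
sleep_hd <- hotdeck(sleep)

# Iterative robust model-based imputation with default settings
sleep_irmi <- irmi(sleep)

# Both results keep the original rows and add logical *_imp indicator columns
dim(sleep_hd)
dim(sleep_irmi)
```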
library(VIM)
# show the dataset with missing values
str(sleep)
R>> 'data.frame': 62 obs. of 10 variables:
R>> $ BodyWgt : num 6654 1 3.38 0.92 2547 ...
R>> $ BrainWgt: num 5712 6.6 44.5 5.7 4603 ...
R>> $ NonD : num NA 6.3 NA NA 2.1 9.1 15.8 5.2 10.9 8.3 ...
R>> $ Dream : num NA 2 NA NA 1.8 0.7 3.9 1 3.6 1.4 ...
R>> $ Sleep : num 3.3 8.3 12.5 16.5 3.9 9.8 19.7 6.2 14.5 9.7 ...
R>> $ Span : num 38.6 4.5 14 NA 69 27 19 30.4 28 50 ...
R>> $ Gest : num 645 42 60 25 624 180 35 392 63 230 ...
R>> $ Pred : int 3 3 1 5 3 4 1 4 1 1 ...
R>> $ Exp : int 5 1 1 2 5 4 1 5 2 1 ...
R>> $ Danger : int 3 3 1 3 4 4 1 4 1 1 ...
# plot missingness pattern (blue for observed and red for missing values)
aggr(sleep, numbers = TRUE, prop = c(TRUE, FALSE))

# impute via kNN
sleep_kNN <- kNN(sleep)
sleep_kNN_part <- kNN(sleep, variable = c("Sleep", "Dream", "NonD"), dist_var = colnames(sleep))
# plot missingness and imputation pattern (orange is for imputed values and red for missing values)
aggr(sleep_kNN_part, delimiter = "_imp", numbers = TRUE, prop = c(TRUE, FALSE))
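The delimiter argument tells the plotting functions which columns mark imputed cells. As a rough sketch of what these columns contain, the following counts the missing values in the original data and the imputed cells flagged by kNN(). The names imp_cols and n_imputed are introduced here for illustration only.

```r
library(VIM)
data(sleep, package = "VIM")  # make sure we use VIM's sleep data, not datasets::sleep

# Number of missing values per variable in the original data
colSums(is.na(sleep))

# kNN() appends one logical indicator column per imputed variable,
# named <variable>_imp; TRUE marks a cell that was imputed
sleep_kNN_part <- kNN(sleep, variable = c("Sleep", "Dream", "NonD"),
                      dist_var = colnames(sleep))
imp_cols <- grep("_imp$", colnames(sleep_kNN_part), value = TRUE)

# Number of imputed cells per variable
n_imputed <- colSums(sleep_kNN_part[, imp_cols])
n_imputed
```

For each imputed variable, the count of TRUE values in its indicator column should equal the number of NA values in the original column.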

We can also plot a modified boxplot with imputed missings:
sleep_vars <- c("Dream", "NonD", "Sleep", "Dream_imp", "NonD_imp", "Sleep_imp")
pbox(sleep_kNN_part[, sleep_vars], delimiter = "_imp", selection = "any", interactive = FALSE)

The marginplot is an enhancement to the normal scatterplot, where imputed values are highlighted for each variable:
str(tao)
R>> 'data.frame': 736 obs. of 8 variables:
R>> $ Year : int 1997 1997 1997 1997 1997 1997 1997 1997 1997 1997 ...
R>> $ Latitude : int 0 0 0 0 0 0 0 0 0 0 ...
R>> $ Longitude : int -110 -110 -110 -110 -110 -110 -110 -110 -110 -110 ...
R>> $ Sea.Surface.Temp: num 27.6 27.6 27.6 27.6 27.6 ...
R>> $ Air.Temp : num 27.1 27 27 26.9 26.8 ...
R>> $ Humidity : num 79.6 75.8 76.5 76.2 76.4 76.7 76.5 78.3 78.6 76.9 ...
R>> $ UWind : num -6.4 -5.3 -5.1 -4.9 -3.5 -4.4 -2 -3.7 -4.2 -3.6 ...
R>> $ VWind : num 5.4 5.3 4.5 2.5 4.1 1.6 3.5 4.5 5 3.5 ...
aggr(tao, numbers = TRUE, prop = c(TRUE, FALSE))

tao_kNN <- kNN(tao)
tao_vars <- c("Air.Temp", "Humidity", "Air.Temp_imp", "Humidity_imp")
marginplot(tao_kNN[, tao_vars], delimiter = "imp", alpha = 0.6)
The variable Air.Temp is clearly separated into two clusters. Furthermore, the plot reveals that imputed values of Humidity occur only in the first cluster, which contains observations with low Air.Temp and high Humidity, and that they are imputed with values between 84 and 91. In contrast, most of the imputed values of Air.Temp lie between 26 and 27 and occur mostly in the second cluster, which contains observations with high Air.Temp.
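What the plot suggests can also be checked numerically. The sketch below extracts only the imputed values through the logical indicator columns and summarizes them. The names hum_imputed and air_imputed are introduced here for illustration.

```r
library(VIM)
data(tao, package = "VIM")

tao_kNN <- kNN(tao)

# Keep only the imputed values: *_imp is TRUE where the original value was missing
hum_imputed <- tao_kNN$Humidity[tao_kNN$Humidity_imp]
air_imputed <- tao_kNN$Air.Temp[tao_kNN$Air.Temp_imp]

# Range of the imputed values of each variable
range(hum_imputed)
range(air_imputed)

# Air.Temp of the observations whose Humidity was imputed,
# to see which cluster they fall into
summary(tao_kNN$Air.Temp[tao_kNN$Humidity_imp])
```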
We can plot marginplots of data imputed with different imputation methods:
library(e1071)
tao_kNN <- kNN(tao, k = 5)
tao_mean <- as.data.frame(impute(tao, what = "mean"))
tao_mean <- cbind(tao_mean, tao_kNN[, c("Air.Temp_imp","Humidity_imp")])
vars <- c("Air.Temp", "Humidity", "Air.Temp_imp", "Humidity_imp")
par(mfrow = c(1, 2))
marginplot(tao_kNN[, vars], delimiter = "imp", alpha = 0.6, main = "kNN")
marginplot(tao_mean[, vars], delimiter = "imp", alpha = 0.6, main = "mean")
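The difference between the two panels can be made concrete by looking at the imputed values themselves. This is a small sketch under the same setup as above. The names was_missing, mean_values and knn_values are assumptions introduced for illustration.

```r
library(VIM)
library(e1071)
data(tao, package = "VIM")

tao_kNN  <- kNN(tao, k = 5)
tao_mean <- as.data.frame(impute(tao, what = "mean"))

# Rows in which Air.Temp was originally missing
was_missing <- is.na(tao$Air.Temp)

# Mean imputation fills every gap with the same value, the column mean
mean_values <- unique(tao_mean$Air.Temp[was_missing])

# kNN aggregates the values of the nearest donors, so imputed values can differ
knn_values <- unique(tao_kNN$Air.Temp[was_missing])

length(mean_values)
length(knn_values)
```

Because mean imputation uses a single value, all imputed points line up at one Air.Temp position in the marginplot, whereas the kNN imputations can spread across the observed range.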

We can also plot a scatterplot with imputed missings:
vars <- c("Air.Temp", "Humidity", "Air.Temp_imp", "Humidity_imp")
scattMiss(tao_kNN[, vars], delimiter = "imp", alpha = 0.6, interactive = FALSE)

We now show a marginplot matrix:
vars <- c("Air.Temp", "Humidity", "Sea.Surface.Temp",
"Air.Temp_imp", "Humidity_imp", "Sea.Surface.Temp_imp")
marginmatrix(tao_kNN[, vars], delimiter = "_imp", alpha=0.6)

We can also draw a scatterplot matrix with imputed missings:
vars <- c("Air.Temp", "Humidity", "Sea.Surface.Temp",
"Air.Temp_imp", "Humidity_imp", "Sea.Surface.Temp_imp")
scattmatrixMiss(tao_kNN[, vars], highlight = "Air.Temp", delimiter = "_imp", alpha=0.6, interactive = FALSE)

The matrix plot is a very useful multivariate plot that helps to detect multivariate dependencies and patterns:
matrixplot(sleep_kNN, delimiter = "_imp", sortby = "Span", interactive = FALSE)

The mosaic plot is a graphical representation of multi-way contingency tables and is therefore mainly intended for categorical variables. The frequencies of the different cells are visualized by area-proportional rectangles (tiles). To construct a mosaic plot, a rectangle is first split vertically at positions corresponding to the relative frequencies of the categories of the first variable. The resulting smaller rectangles are then subdivided according to the conditional probabilities of a second variable.
mosaicMiss(sleep_kNN, highlight = 3, plotvars = 8:10, delimiter = "_imp", miss.labels = FALSE)
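The two splitting steps described above correspond to ordinary contingency-table computations. As a sketch in base R, the following computes them for Pred and Exp, two of the three variables selected by plotvars = 8:10. The names pred_freq and exp_given_pred are introduced here for illustration, and the code is not meant to reproduce how mosaicMiss() works internally.

```r
library(VIM)
data(sleep, package = "VIM")  # make sure we use VIM's sleep data, not datasets::sleep

# Step 1: relative frequencies of the first variable,
# which determine where the rectangle is split vertically
pred_freq <- prop.table(table(sleep$Pred))
pred_freq

# Step 2: conditional relative frequencies of the second variable
# given the first (each row sums to 1), which determine the subdivision
exp_given_pred <- prop.table(table(sleep$Pred, sleep$Exp), margin = 1)
round(exp_given_pred, 2)
```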

The mechanism that caused the values to be missing may itself follow a spatial pattern. The map of imputed missings was developed to explore this:
vars <- c("As", "Bi", "Ca", "As_imp", "Bi_imp")
chorizon_kNN <- kNN(chorizonDL, variable = c("Ca", "As", "Bi"), dist_var = c("Hg", "K"))
coo <- chorizon_kNN[, c("XCOO", "YCOO")]
mapMiss(chorizon_kNN[, vars], coo, map = kola.background, delimiter = "_imp", alpha = 0.6, interactive = FALSE)

Like the map of imputed missings, the growing dot map was developed to analyze geographical data. Here, however, the values of the variable of interest are represented by growing dots instead of points of equal size. Imputed values are again highlighted.
growdotMiss(chorizon_kNN[, vars], coo, kola.background, delimiter = "_imp", alpha = 0.6, interactive = FALSE)
