The Problem
There are several guides on using multiple imputation in R. However, analyzing imputed models with certain options (i.e., with clustering, with weights) is a bit more challenging. Even more challenging (at least for me) is getting the results to display in a way that can be used in publications (e.g., showing regressions in a hierarchical fashion or multiple models side by side) and exported to MS Word. I have used packages such as stargazer and huxtable for getting nice (or, as it has been called, pretty) regression tables. However, I have not been able to figure out how to get the output of multiply imputed results to display properly (if someone knows of a way, please email me or leave a comment).
My preference for imputation in R is to use the mice package together with the miceadds package. I specifically wanted to:
- Account for clustering (working with nested data)
- Include weights (as is the case with nationally representative datasets)
- Display multiple models side by side (with standard errors shown below the regression coefficients)
This note does not show how to perform multilevel imputation, just single-level imputation and getting the output ‘publication ready’.
Possible solutions
I have seen another post that describes how to perform this using a mix of dplyr, purrr, and huxtable. Although comprehensive, it may be a bit more challenging to follow (though it is probably good if you want to learn about the tidyverse too).
The miceadds::lm.cluster function can deal with the clustering but does not allow for the inclusion of weights. The mitools package together with the survey package allows for regressions using weights and clustering with imputed data (though getting the p values and the R squared requires a bit more work). In either case, neither stargazer nor huxtable works automatically for formatted regression output (that I know of).
Walkthrough
I settled on using the mitools package (to combine the imputation results from models fit with the lm function). The cluster robust standard errors were computed using the sandwich package. A function then saves the results into a data frame, which, after some processing, is read into texreg to display/save the output. This is still a lot of steps.
For an example, I will use the data.ma01 dataset in the miceadds package. It’s a nested dataset with weights.
library(miceadds)
data(data.ma01)
head(data.ma01)
## idstud idschool studwgt math read migrant books hisei paredu female
## 1 10010001 1001 6.05019 594 647 0 6 NA 3 1
## 2 10010084 1001 6.05019 605 651 0 6 77 7 1
## 3 10010116 1001 5.26952 616 539 0 5 69 7 0
## 4 10010162 1001 5.26952 524 551 1 2 45 2 0
## 5 10010357 1001 6.05019 685 689 0 6 66 7 1
## 6 10010720 1001 5.26952 387 502 0 3 53 3 1
## urban
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
names(data.ma01)
## [1] "idstud" "idschool" "studwgt" "math" "read" "migrant"
## [7] "books" "hisei" "paredu" "female" "urban"
#data.ma01$female <- factor(data.ma01$female, labels = c('m', 'f'))
dat <- data.ma01[,-c(1,5)] #remove student id and math
dim(dat) #number of observations and variables
## [1] 4073 9
length(table(dat$idschool)) #79 schools
## [1] 79
md.pattern(dat, plot = F) #view missing data pattern
## idschool studwgt female urban math migrant books paredu hisei
## 3006 1 1 1 1 1 1 1 1 1 0
## 243 1 1 1 1 1 1 1 1 0 1
## 193 1 1 1 1 1 1 1 0 1 1
## 99 1 1 1 1 1 1 1 0 0 2
## 46 1 1 1 1 1 1 0 1 1 1
## 15 1 1 1 1 1 1 0 1 0 2
## 9 1 1 1 1 1 1 0 0 1 2
## 52 1 1 1 1 1 1 0 0 0 3
## 57 1 1 1 1 1 0 1 1 1 1
## 14 1 1 1 1 1 0 1 1 0 2
## 10 1 1 1 1 1 0 1 0 1 2
## 22 1 1 1 1 1 0 1 0 0 3
## 5 1 1 1 1 1 0 0 1 1 2
## 10 1 1 1 1 1 0 0 1 0 3
## 10 1 1 1 1 1 0 0 0 1 3
## 71 1 1 1 1 1 0 0 0 0 4
## 4 1 1 1 1 0 1 1 1 1 1
## 2 1 1 1 1 0 1 1 0 0 3
## 1 1 1 1 1 0 0 0 1 0 4
## 204 1 1 1 1 0 0 0 0 0 5
## 0 0 0 0 211 404 423 672 733 2443
In our dataset, 3006 of 4073 observations (74%) have complete data. Generally, we would impute ~25 datasets (m = 25), but for demonstration purposes we will impute just 5. You can perform all the necessary diagnostics after imputation. You can also specify other imputation methods (instead of predictive mean matching or pmm; e.g., use logreg for binary variables such as female, or convert them to factors before imputing; however, female had nothing missing anyway). The first step is to impute the data using mice.
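If you do want to override a default method, the method vector can be edited before imputing. A minimal sketch (migrant is binary and does have missing values; meth, dat_f, and dat2b are illustrative names, not used elsewhere in this note):

```r
# Sketch: impute the binary migrant variable with logistic regression
# instead of the default pmm (assumes dat from above)
meth <- make.method(dat)                   # default method per variable ("" = nothing missing)
dat_f <- dat
dat_f$migrant <- as.factor(dat_f$migrant)  # logreg requires a factor
meth["migrant"] <- "logreg"
dat2b <- mice(dat_f, m = 5, method = meth, seed = 123, printFlag = FALSE)
```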
dat2 <- mice(dat, m = 5, seed = 123)
##
## iter imp variable
## 1 1 math migrant books hisei paredu
## 1 2 math migrant books hisei paredu
## 1 3 math migrant books hisei paredu
## 1 4 math migrant books hisei paredu
## 1 5 math migrant books hisei paredu
## 2 1 math migrant books hisei paredu
## 2 2 math migrant books hisei paredu
## 2 3 math migrant books hisei paredu
## 2 4 math migrant books hisei paredu
## 2 5 math migrant books hisei paredu
## 3 1 math migrant books hisei paredu
## 3 2 math migrant books hisei paredu
## 3 3 math migrant books hisei paredu
## 3 4 math migrant books hisei paredu
## 3 5 math migrant books hisei paredu
## 4 1 math migrant books hisei paredu
## 4 2 math migrant books hisei paredu
## 4 3 math migrant books hisei paredu
## 4 4 math migrant books hisei paredu
## 4 5 math migrant books hisei paredu
## 5 1 math migrant books hisei paredu
## 5 2 math migrant books hisei paredu
## 5 3 math migrant books hisei paredu
## 5 4 math migrant books hisei paredu
## 5 5 math migrant books hisei paredu
summary(dat2)
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## idschool studwgt math migrant books hisei paredu female
## "" "" "pmm" "pmm" "pmm" "pmm" "pmm" ""
## urban
## ""
## PredictorMatrix:
## idschool studwgt math migrant books hisei paredu female urban
## idschool 0 1 1 1 1 1 1 1 1
## studwgt 1 0 1 1 1 1 1 1 1
## math 1 1 0 1 1 1 1 1 1
## migrant 1 1 1 0 1 1 1 1 1
## books 1 1 1 1 0 1 1 1 1
## hisei 1 1 1 1 1 0 1 1 1
class(dat2)
## [1] "mids"
After imputing the data, instead of specifying the data frame in the data option, the data are analyzed using the with function. For some other functions, an imputation list has to be created, but a mids object (multiply imputed dataset) can be analyzed directly.
model1 <- with(dat2, lm(scale(math) ~ migrant + female, weights = studwgt))
summary(model1)
## term estimate std.error statistic p.value
## 1 (Intercept) 0.10343011 0.02261421 4.573679 4.934867e-06
## 2 migrant -0.49575568 0.03918768 -12.650805 5.230175e-36
## 3 female -0.07522121 0.03055474 -2.461851 1.386339e-02
## 4 (Intercept) 0.10384614 0.02261056 4.592816 4.505222e-06
## 5 migrant -0.50249688 0.03907260 -12.860595 3.928566e-37
## 6 female -0.07600872 0.03051405 -2.490941 1.278007e-02
## 7 (Intercept) 0.10844606 0.02261679 4.794937 1.685425e-06
## 8 migrant -0.49746352 0.03919533 -12.691909 3.159126e-36
## 9 female -0.08340406 0.03052509 -2.732312 6.316313e-03
## 10 (Intercept) 0.10938769 0.02261002 4.838018 1.359944e-06
## 11 migrant -0.50631300 0.03906748 -12.959960 1.137243e-37
## 12 female -0.07846245 0.03050740 -2.571915 1.014894e-02
## 13 (Intercept) 0.10214606 0.02261175 4.517388 6.438179e-06
## 14 migrant -0.48143591 0.03944415 -12.205508 1.119162e-33
## 15 female -0.08100336 0.03058448 -2.648512 8.115914e-03
The summary shows the output for each regression run. However, the cluster robust standard errors still need to be computed (clustering at the school level using idschool), and the results have to be combined appropriately using Rubin’s rules (to account for within- and between-imputation variance).
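For reference, with m imputations Rubin’s rules combine the per-dataset estimates and their variances as follows (standard notation, not from the original post; these quantities appear as estimate, ubar, b, and t in the pool output):

```latex
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i \qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m} U_i \qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{Q}_i - \bar{Q}\right)^2 \qquad
T = \bar{U} + \left(1 + \frac{1}{m}\right)B
```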
If all we wanted was to quickly get results, we could use:
pool(model1)
## Class: mipo m = 5
## estimate ubar b t dfcom
## (Intercept) 0.10545121 0.0005113326 1.051253e-05 0.0005239477 4070
## migrant -0.49669300 0.0015361450 9.012010e-05 0.0016442891 4070
## female -0.07881996 0.0009325186 1.169079e-05 0.0009465476 4070
## df riv lambda fmi
## (Intercept) 2520.0994 0.02467089 0.02407689 0.02485048
## migrant 743.7519 0.07039968 0.06576953 0.06827164
## female 3284.7635 0.01504415 0.01482117 0.01542047
However, the results don’t account for clustering, nor do we get p values. To get these, we use the mitools package, which needs the data to be in an imputation list. The datlist object is then analyzed instead of the original mids object.
#create imputation list for analyses
datlist <- miceadds::mids2datlist(dat2)
model2 <- with(datlist, lm(scale(math) ~ migrant +
female, weights = studwgt)) #same model but using datlist
The coefficients (betas) and standard errors can be extracted per regression run. The resulting summary shows the pooled or combined results (with p values). The vcovCL function in the sandwich package is used to get the cluster robust standard errors. The cluster variable is specified using the first dataset in the datlist (it is the same in every dataset, since each one is now complete).
library(sandwich)
betas <- lapply(model2, coef)
vars <- lapply(model2, FUN = function(x){vcovCL(x, cluster = datlist[[1]]$idschool)}) #vcovCL is in the sandwich package
# conduct statistical inference
summary(pool_mi(betas, vars))
## Multiple imputation results:
## Call: pool_mi(qhat = betas, u = vars)
## results se t p (lower
## (Intercept) 0.10545121 0.05824299 1.810539 7.021331e-02 -0.00870345
## migrant -0.49669300 0.05777990 -8.596293 1.183904e-17 -0.60997550
## female -0.07881996 0.04291278 -1.836748 6.625148e-02 -0.16292896
## upper) missInfo
## (Intercept) 0.219605875 0.4 %
## migrant -0.383410488 3.3 %
## female 0.005289041 0.8 %
We could save the output and be done with it, if that were all we were interested in. However, we will often want to compare multiple models side by side. We could do that by copying and pasting results, but a function together with texreg can help with that.
Basically, after the initial multiply imputed regression is run, the extract.df function below is applied. Specify the imputed model and the clustering variable. A texreg object is then created which can be used to display the output.
extract.df <- function(tt, cl = NULL) {
require(sandwich)
require(mitools)
require(texreg)
m2 <- length(tt) #number of imputations
betas <- lapply(tt, coef)
vars <- lapply(tt, FUN = function(x){ vcovCL(x, cluster = cl)})
# conduct statistical inference and save results into a data.frame
model1 <- summary(pool_mi(betas, vars))
R2 <- mean(sapply(1:m2, function(x) summary(tt[[x]])$r.squared))
#technically, should use R to Z-- but close for now when R is low
ns <- nobs(tt[[1]])
#creates what is used by texreg
tr <- createTexreg(
coef.names = row.names(model1),
coef = model1$results,
se = model1$se,
pvalues = model1$p,
gof.names = c("R2", "Nobs"),
gof = c(R2, ns),
gof.decimal = c(T,F)
)
}
out1 <- extract.df(model2, cl = datlist[[1]]$idschool)
## Multiple imputation results:
## Call: pool_mi(qhat = betas, u = vars)
## results se t p (lower
## (Intercept) 0.10545121 0.05824299 1.810539 7.021331e-02 -0.00870345
## migrant -0.49669300 0.05777990 -8.596293 1.183904e-17 -0.60997550
## female -0.07881996 0.04291278 -1.836748 6.625148e-02 -0.16292896
## upper) missInfo
## (Intercept) 0.219605875 0.4 %
## migrant -0.383410488 3.3 %
## female 0.005289041 0.8 %
screenreg(out1, digits = 3) #my preferred option with 4 or more models
##
## =========================
## Model 1
## -------------------------
## (Intercept) 0.105
## (0.058)
## migrant -0.497 ***
## (0.058)
## female -0.079
## (0.043)
## -------------------------
## R2 0.039
## Nobs 4073
## =========================
## *** p < 0.001, ** p < 0.01, * p < 0.05
screenreg(out1, digits = 3, single.row = T) #another option with 3 or fewer models
##
## =================================
## Model 1
## ---------------------------------
## (Intercept) 0.105 (0.058)
## migrant -0.497 (0.058) ***
## female -0.079 (0.043)
## ---------------------------------
## R2 0.039
## Nobs 4073
## =================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
We can use this to compare models side by side.
model3 <- with(datlist, lm(scale(math) ~ migrant +
female + books + hisei + paredu, weights = studwgt))
out2 <- extract.df(model3, cl = datlist[[1]]$idschool)
## Multiple imputation results:
## Call: pool_mi(qhat = betas, u = vars)
## results se t p (lower
## (Intercept) -0.954487879 0.085248359 -11.196554 5.579285e-27 -1.121846890
## migrant -0.355519320 0.054265040 -6.551535 5.972009e-11 -0.461889577
## female -0.101647429 0.040435379 -2.513824 1.196103e-02 -0.180910418
## books 0.184774512 0.014535136 12.712266 4.726350e-32 0.156213037
## hisei 0.006049941 0.001310318 4.617156 1.200200e-05 0.003449145
## paredu 0.031885088 0.010072376 3.165597 2.523272e-03 0.011699614
## upper) missInfo
## (Intercept) -0.787128868 7.6 %
## migrant -0.249149063 2 %
## female -0.022384440 2.2 %
## books 0.213335986 9.6 %
## hisei 0.008650738 22 %
## paredu 0.052070563 29.5 %
screenreg(list(out1, out2), digits = 3, custom.model.names = c('base', 'w/SES'))
##
## =======================================
## base w/SES
## ---------------------------------------
## (Intercept) 0.105 -0.954 ***
## (0.058) (0.085)
## migrant -0.497 *** -0.356 ***
## (0.058) (0.054)
## female -0.079 -0.102 *
## (0.043) (0.040)
## books 0.185 ***
## (0.015)
## hisei 0.006 ***
## (0.001)
## paredu 0.032 **
## (0.010)
## ---------------------------------------
## R2 0.039 0.157
## Nobs 4073 4073
## =======================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
Specifying the model to include school fixed effects is straightforward: include factor(idschool) in the formula and then omit the school coefficients from the regression table using omit.coef = 'idschool' (unless you want to see the coefficients for 79 - 1 schools). If you want to save the output to Word, use htmlreg and specify the file option. The output is an html file which can be opened in Word.
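A sketch of both steps, reusing the extract.df function and datlist from above (model_fe, out_fe, and the file name are illustrative):

```r
# School fixed effects: add factor(idschool) to the formula
model_fe <- with(datlist, lm(scale(math) ~ migrant + female +
                               factor(idschool), weights = studwgt))
out_fe <- extract.df(model_fe, cl = datlist[[1]]$idschool)

# Drop the school dummies from the table and export html that Word can open
htmlreg(out_fe, omit.coef = 'idschool', digits = 3, file = 'models.html')
```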
Using svyglm
Another option is to use svyglm in the survey package. The svydesign object must first be created, with the multiply imputed dataset list as the data. The weights and the cluster variable are also part of the design object.
library(survey)
imp_list <- imputationList(lapply(1:dat2$m, function(n) mice::complete(dat2, action = n)))
summary(imp_list)
## Length Class Mode
## imputations 5 -none- list
## call 2 -none- call
des <- svydesign(ids = ~idschool, weights = ~studwgt, data = imp_list) #clusters are specified via ids
After the design object is specified, it is used in the analysis instead of the data.
model4 <- with(des, svyglm(scale(math) ~ migrant + female + books + hisei + paredu))
summary(model4) # a list of results
## Length Class Mode
## [1,] 33 svyglm list
## [2,] 33 svyglm list
## [3,] 33 svyglm list
## [4,] 33 svyglm list
## [5,] 33 svyglm list
The output is a list. The results then need to be pooled and combined.
### NOTE: there is an extract.svyglm function but that will only work with a single dataset, not a multiply imputed dataset
extract.svymi <- function(z){
#require(norm)
m2 <- length(z)
ns <- nobs(z[[1]]) #first dataset
betas <- lapply(z, FUN = function(x){coef(x)} )
sess <- lapply(z, FUN = function(x){vcov(x)}) #the vcov
#ses <- lapply(sess, FUN = function(x){sqrt(diag(x))}) #the se, not the complete vcov
dm <- lapply(z, FUN = function(x){model.matrix(x)})
r.1 <- sapply(1:m2, function(row) cor(dm[[row]] %*% betas[[row]], z[[row]]$y))
fisher.r2z <- function(r) {0.5 * (log(1+r) - log(1-r))}
zz <- mean(fisher.r2z(r.1))
r2 <- psych::fisherz2r(zz)^2 #in psych
R2 <- mean(r2)
test1 <- summary(pool_mi(betas, sess))
test1$term <- row.names(test1)
test2 <- test1[,c(8,1,2,4)] #4 is the pvalue
row.names(test2) <- c()
names(test2) <- c('term','estimate','std.error', 'p.value')
tr <- createTexreg(
coef.names = test2$term,
coef = test2$estimate,
se = test2$std.error,
pvalues = test2$p.value,
gof.names = c("R2", "Nobs"),
gof = c(R2, ns),
gof.decimal = c(T,F)
)
}
ret <- extract.svymi(model4)
## Multiple imputation results:
## Call: pool_mi(qhat = betas, u = sess)
## results se t p (lower
## (Intercept) -0.954487879 0.058789520 -16.235681 3.879433e-36 -1.070557223
## migrant -0.355519320 0.041246355 -8.619412 1.017630e-17 -0.436389673
## female -0.101647429 0.030684143 -3.312702 9.354611e-04 -0.161812642
## books 0.184774512 0.012503280 14.778083 2.997891e-36 0.160153403
## hisei 0.006049941 0.001250969 4.836203 6.292020e-06 0.003560504
## paredu 0.031885088 0.010432911 3.056203 3.280180e-03 0.011038579
## upper) missInfo
## (Intercept) -0.838418535 16.5 %
## migrant -0.274648966 3.5 %
## female -0.041482216 3.8 %
## books 0.209395620 13.1 %
## hisei 0.008539378 24.2 %
## paredu 0.052731598 27.4 %
screenreg(ret, digits = 3, custom.model.names = "using svyglm")
##
## =========================
## using svyglm
## -------------------------
## (Intercept) -0.954 ***
## (0.059)
## migrant -0.356 ***
## (0.041)
## female -0.102 ***
## (0.031)
## books 0.185 ***
## (0.013)
## hisei 0.006 ***
## (0.001)
## paredu 0.032 **
## (0.010)
## -------------------------
## R2 0.158
## Nobs 4073
## =========================
## *** p < 0.001, ** p < 0.01, * p < 0.05