The Problem
There are several guides on using multiple imputation in R. However, analyzing imputed models with certain options (i.e., with clustering, with weights) is a bit more challenging. Even more challenging (at least for me) is getting the results to display in a way that can be used in publications (e.g., showing regressions in a hierarchical fashion or multiple models side by side) and exported to MS Word. I have used packages such as stargazer and huxtable for getting nice (or, as it has been called, pretty) regression tables. However, I have not been able to figure out how to get the output of multiply imputed results to display properly (if someone knows of a way, please email me or leave a comment).
My preference for imputation in R is to use the mice package together with the miceadds package. I specifically wanted to:
- Account for clustering (working with nested data)
- Include weights (as is the case with nationally representative datasets)
- Display multiple models side by side (with standard errors shown below the regression coefficients)
This note does not show how to perform multilevel imputation, just single-level imputation and getting the output ‘publication ready’.
Possible solutions
I have seen another post that describes how to perform this using a mix of dplyr, purrr, and huxtable. Although comprehensive, it may be a bit more challenging to follow (though it is probably good if you want to learn about the tidyverse too).
The miceadds::lm.cluster function can deal with the clustering but does not allow for the inclusion of weights. The mitools package together with the survey package allows for regressions using weights and clustering with imputed data (though getting the p values and the R squared requires a bit more work). In either case, neither stargazer nor huxtable works automatically for formatted regression output (that I know of).
Walkthrough
I settled on using the mitools package (to combine the imputation results from models fit with the lm function). The cluster robust standard errors were computed using the sandwich package. A function then saves the results into a data frame, which, after some processing, is read into texreg to display/save the output. This is still a lot of steps.
For an example, I will use the data.ma01 dataset in the miceadds package. It’s a nested dataset with weights.
library(miceadds)
data(data.ma01)
head(data.ma01)
## idstud idschool studwgt math read migrant books hisei paredu female
## 1 10010001 1001 6.05019 594 647 0 6 NA 3 1
## 2 10010084 1001 6.05019 605 651 0 6 77 7 1
## 3 10010116 1001 5.26952 616 539 0 5 69 7 0
## 4 10010162 1001 5.26952 524 551 1 2 45 2 0
## 5 10010357 1001 6.05019 685 689 0 6 66 7 1
## 6 10010720 1001 5.26952 387 502 0 3 53 3 1
## urban
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
names(data.ma01)
## [1] "idstud" "idschool" "studwgt" "math" "read" "migrant"
## [7] "books" "hisei" "paredu" "female" "urban"
#data.ma01$female <- factor(data.ma01$female, labels = c('m', 'f'))
dat <- data.ma01[,-c(1,5)] #remove student id and math
dim(dat) #number of observations and variables
## [1] 4073 9
length(table(dat$idschool)) #79 schools
## [1] 79
md.pattern(dat, plot = F) #view missing data pattern
## idschool studwgt female urban math migrant books paredu hisei
## 3006 1 1 1 1 1 1 1 1 1 0
## 243 1 1 1 1 1 1 1 1 0 1
## 193 1 1 1 1 1 1 1 0 1 1
## 99 1 1 1 1 1 1 1 0 0 2
## 46 1 1 1 1 1 1 0 1 1 1
## 15 1 1 1 1 1 1 0 1 0 2
## 9 1 1 1 1 1 1 0 0 1 2
## 52 1 1 1 1 1 1 0 0 0 3
## 57 1 1 1 1 1 0 1 1 1 1
## 14 1 1 1 1 1 0 1 1 0 2
## 10 1 1 1 1 1 0 1 0 1 2
## 22 1 1 1 1 1 0 1 0 0 3
## 5 1 1 1 1 1 0 0 1 1 2
## 10 1 1 1 1 1 0 0 1 0 3
## 10 1 1 1 1 1 0 0 0 1 3
## 71 1 1 1 1 1 0 0 0 0 4
## 4 1 1 1 1 0 1 1 1 1 1
## 2 1 1 1 1 0 1 1 0 0 3
## 1 1 1 1 1 0 0 0 1 0 4
## 204 1 1 1 1 0 0 0 0 0 5
## 0 0 0 0 211 404 423 672 733 2443
In our dataset, 3006 of 4073 observations (74%) have complete data. Generally, we would impute ~25 datasets (m = 25), but for demonstration purposes we will impute just 5. You can perform all the necessary diagnostics after imputation. You can also specify other imputation methods (instead of predictive mean matching or pmm; e.g., use logreg for binary variables such as female, or convert them to factors before imputing; however, female had nothing missing anyway). The first step is to impute the data using mice.
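If you do want to override a default method, the method vector can be edited before imputing. A minimal sketch (migrant is binary and does have missing values; meth, dat_f, and dat2b are illustrative names, not used elsewhere in this note):

```r
# Sketch: impute the binary migrant variable with logistic regression
# instead of the default pmm (assumes dat from above)
meth <- make.method(dat)                   # default method per variable ("" = nothing missing)
dat_f <- dat
dat_f$migrant <- as.factor(dat_f$migrant)  # logreg requires a factor
meth["migrant"] <- "logreg"
dat2b <- mice(dat_f, m = 5, method = meth, seed = 123, printFlag = FALSE)
```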
dat2 <- mice(dat, m = 5, seed = 123)
##
## iter imp variable
## 1 1 math migrant books hisei paredu
## 1 2 math migrant books hisei paredu
## 1 3 math migrant books hisei paredu
## 1 4 math migrant books hisei paredu
## 1 5 math migrant books hisei paredu
## 2 1 math migrant books hisei paredu
## 2 2 math migrant books hisei paredu
## 2 3 math migrant books hisei paredu
## 2 4 math migrant books hisei paredu
## 2 5 math migrant books hisei paredu
## 3 1 math migrant books hisei paredu
## 3 2 math migrant books hisei paredu
## 3 3 math migrant books hisei paredu
## 3 4 math migrant books hisei paredu
## 3 5 math migrant books hisei paredu
## 4 1 math migrant books hisei paredu
## 4 2 math migrant books hisei paredu
## 4 3 math migrant books hisei paredu
## 4 4 math migrant books hisei paredu
## 4 5 math migrant books hisei paredu
## 5 1 math migrant books hisei paredu
## 5 2 math migrant books hisei paredu
## 5 3 math migrant books hisei paredu
## 5 4 math migrant books hisei paredu
## 5 5 math migrant books hisei paredu
summary(dat2)
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## idschool studwgt math migrant books hisei paredu female
## "" "" "pmm" "pmm" "pmm" "pmm" "pmm" ""
## urban
## ""
## PredictorMatrix:
## idschool studwgt math migrant books hisei paredu female urban
## idschool 0 1 1 1 1 1 1 1 1
## studwgt 1 0 1 1 1 1 1 1 1
## math 1 1 0 1 1 1 1 1 1
## migrant 1 1 1 0 1 1 1 1 1
## books 1 1 1 1 0 1 1 1 1
## hisei 1 1 1 1 1 0 1 1 1
class(dat2)
## [1] "mids"
After imputing the data, instead of specifying the data frame in the data option, the data are analyzed using the with function. For some other functions, an imputation list has to be created, but a mids object (multiply imputed dataset) can be analyzed directly.
model1 <- with(dat2, lm(scale(math) ~ migrant + female, weights = studwgt))
summary(model1)
## term estimate std.error statistic p.value
## 1 (Intercept) 0.10343011 0.02261421 4.573679 4.934867e-06
## 2 migrant -0.49575568 0.03918768 -12.650805 5.230175e-36
## 3 female -0.07522121 0.03055474 -2.461851 1.386339e-02
## 4 (Intercept) 0.10384614 0.02261056 4.592816 4.505222e-06
## 5 migrant -0.50249688 0.03907260 -12.860595 3.928566e-37
## 6 female -0.07600872 0.03051405 -2.490941 1.278007e-02
## 7 (Intercept) 0.10844606 0.02261679 4.794937 1.685425e-06
## 8 migrant -0.49746352 0.03919533 -12.691909 3.159126e-36
## 9 female -0.08340406 0.03052509 -2.732312 6.316313e-03
## 10 (Intercept) 0.10938769 0.02261002 4.838018 1.359944e-06
## 11 migrant -0.50631300 0.03906748 -12.959960 1.137243e-37
## 12 female -0.07846245 0.03050740 -2.571915 1.014894e-02
## 13 (Intercept) 0.10214606 0.02261175 4.517388 6.438179e-06
## 14 migrant -0.48143591 0.03944415 -12.205508 1.119162e-33
## 15 female -0.08100336 0.03058448 -2.648512 8.115914e-03
The summary shows the output for each regression run. However, the cluster robust standard errors still need to be computed (clustering at the school level using idschool), and the results have to be combined appropriately using Rubin’s rules (to account for within- and between-imputation variance).
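For reference, with m imputations Rubin’s rules combine the per-dataset estimates and their variances as follows (standard notation, not from the original post; these quantities appear as estimate, ubar, b, and t in the pool output):

```latex
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i \qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m} U_i \qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{Q}_i - \bar{Q}\right)^2 \qquad
T = \bar{U} + \left(1 + \frac{1}{m}\right)B
```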
If all we wanted was to quickly get results, we could use:
pool(model1)
## Class: mipo m = 5
## estimate ubar b t dfcom
## (Intercept) 0.10545121 0.0005113326 1.051253e-05 0.0005239477 4070
## migrant -0.49669300 0.0015361450 9.012010e-05 0.0016442891 4070
## female -0.07881996 0.0009325186 1.169079e-05 0.0009465476 4070
## df riv lambda fmi
## (Intercept) 2520.0994 0.02467089 0.02407689 0.02485048
## migrant 743.7519 0.07039968 0.06576953 0.06827164
## female 3284.7635 0.01504415 0.01482117 0.01542047
However, the results don’t account for clustering, nor do we get p values. To get these, we use the mitools package, which needs the data to be in an imputation list. The datlist object is then analyzed instead of the original mids object.
#create imputation list for analyses
datlist <- miceadds::mids2datlist(dat2)
model2 <- with(datlist, lm(scale(math) ~ migrant +
female, weights = studwgt)) #same model but using datlist
The coefficients (betas) and standard errors can be extracted per regression run. The resulting summary shows the pooled or combined results (with p values). The vcovCL function in the sandwich package is used to get the cluster robust standard errors. The cluster variable is specified using the first dataset in the datlist (it is the same in every dataset, since each one is now complete).
library(sandwich)
betas <- lapply(model2, coef)
vars <- lapply(model2, FUN = function(x){vcovCL(x, cluster = datlist[[1]]$idschool)}) #vcovCL is in the sandwich package
# conduct statistical inference
summary(pool_mi(betas, vars))
## Multiple imputation results:
## Call: pool_mi(qhat = betas, u = vars)
## results se t p (lower
## (Intercept) 0.10545121 0.05824299 1.810539 7.021331e-02 -0.00870345
## migrant -0.49669300 0.05777990 -8.596293 1.183904e-17 -0.60997550
## female -0.07881996 0.04291278 -1.836748 6.625148e-02 -0.16292896
## upper) missInfo
## (Intercept) 0.219605875 0.4 %
## migrant -0.383410488 3.3 %
## female 0.005289041 0.8 %
We could save the output and be done with it, if that were all we were interested in. However, we will often want to compare multiple models side by side. We could do that by copying and pasting results, but a function together with texreg can help with that.
Basically, after the initial multiply imputed regression is run, the extract.df function below is applied. Specify the imputed model and the clustering variable. A texreg object is then created which can be used to display the output.
extract.df <- function(tt, cl = NULL) {
require(sandwich)
require(mitools)
require(texreg)
m2 <- length(tt) #number of imputations
betas <- lapply(tt, coef)
vars <- lapply(tt, FUN = function(x){ vcovCL(x, cluster = cl)})
# conduct statistical inference and save results into a data.frame
model1 <- summary(pool_mi(betas, vars))
R2 <- mean(sapply(1:m2, function(x) summary(tt[[x]])$r.squared))
#technically, should use R to Z-- but close for now when R is low
ns <- nobs(tt[[1]])
#creates what is used by texreg
tr <- createTexreg(
coef.names = row.names(model1),
coef = model1$results,
se = model1$se,
pvalues = model1$p,
gof.names = c("R2", "Nobs"),
gof = c(R2, ns),
gof.decimal = c(T,F)
)
}
out1 <- extract.df(model2, cl = datlist[[1]]$idschool)
## Multiple imputation results:
## Call: pool_mi(qhat = betas, u = vars)
## results se t p (lower
## (Intercept) 0.10545121 0.05824299 1.810539 7.021331e-02 -0.00870345
## migrant -0.49669300 0.05777990 -8.596293 1.183904e-17 -0.60997550
## female -0.07881996 0.04291278 -1.836748 6.625148e-02 -0.16292896
## upper) missInfo
## (Intercept) 0.219605875 0.4 %
## migrant -0.383410488 3.3 %
## female 0.005289041 0.8 %
screenreg(out1, digits = 3) #my preferred option with 4 or more models
##
## =========================
## Model 1
## -------------------------
## (Intercept) 0.105
## (0.058)
## migrant -0.497 ***
## (0.058)
## female -0.079
## (0.043)
## -------------------------
## R2 0.039
## Nobs 4073
## =========================
## *** p < 0.001, ** p < 0.01, * p < 0.05
screenreg(out1, digits = 3, single.row = T) #another option with 3 or fewer models
##
## =================================
## Model 1
## ---------------------------------
## (Intercept) 0.105 (0.058)
## migrant -0.497 (0.058) ***
## female -0.079 (0.043)
## ---------------------------------
## R2 0.039
## Nobs 4073
## =================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
We can use this to compare models side by side.
model3 <- with(datlist, lm(scale(math) ~ migrant +
female + books + hisei + paredu, weights = studwgt))
out2 <- extract.df(model3, cl = datlist[[1]]$idschool)
## Multiple imputation results:
## Call: pool_mi(qhat = betas, u = vars)
## results se t p (lower
## (Intercept) -0.954487879 0.085248359 -11.196554 5.579285e-27 -1.121846890
## migrant -0.355519320 0.054265040 -6.551535 5.972009e-11 -0.461889577
## female -0.101647429 0.040435379 -2.513824 1.196103e-02 -0.180910418
## books 0.184774512 0.014535136 12.712266 4.726350e-32 0.156213037
## hisei 0.006049941 0.001310318 4.617156 1.200200e-05 0.003449145
## paredu 0.031885088 0.010072376 3.165597 2.523272e-03 0.011699614
## upper) missInfo
## (Intercept) -0.787128868 7.6 %
## migrant -0.249149063 2 %
## female -0.022384440 2.2 %
## books 0.213335986 9.6 %
## hisei 0.008650738 22 %
## paredu 0.052070563 29.5 %
screenreg(list(out1, out2), digits = 3, custom.model.names = c('base', 'w/SES'))
##
## =======================================
## base w/SES
## ---------------------------------------
## (Intercept) 0.105 -0.954 ***
## (0.058) (0.085)
## migrant -0.497 *** -0.356 ***
## (0.058) (0.054)
## female -0.079 -0.102 *
## (0.043) (0.040)
## books 0.185 ***
## (0.015)
## hisei 0.006 ***
## (0.001)
## paredu 0.032 **
## (0.010)
## ---------------------------------------
## R2 0.039 0.157
## Nobs 4073 4073
## =======================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
Specifying the model to include school fixed effects is straightforward: include factor(idschool) in the formula and then omit the school coefficients from the regression table using omit.coef = 'idschool' (unless you want to see the coefficients for 79 - 1 schools). If you want to save the output to Word, use htmlreg and specify the file option. The output is an html file which can be opened in Word.
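A sketch of both steps, reusing the extract.df function and datlist from above (model_fe, out_fe, and the file name are illustrative):

```r
# School fixed effects: add factor(idschool) to the formula
model_fe <- with(datlist, lm(scale(math) ~ migrant + female +
                               factor(idschool), weights = studwgt))
out_fe <- extract.df(model_fe, cl = datlist[[1]]$idschool)

# Drop the school dummies from the table and export html that Word can open
htmlreg(out_fe, omit.coef = 'idschool', digits = 3, file = 'models.html')
```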
Using svyglm
Another option is to use svyglm in the survey package. The svydesign object must first be created, with the multiply imputed dataset list as the data. The weights and the cluster variable are also part of the design object.
library(survey)
imp_list <- imputationList(lapply(1:dat2$m, function(n) mice::complete(dat2, action = n)))
summary(imp_list)
## Length Class Mode
## imputations 5 -none- list
## call 2 -none- call
des <- svydesign(ids = ~idschool, weights = ~studwgt, data = imp_list) #clusters are specified via ids
After the design object is specified, it is used in the analysis instead of the data.
model4 <- with(des, svyglm(scale(math) ~ migrant + female + books + hisei + paredu))
summary(model4) # a list of results
## Length Class Mode
## [1,] 33 svyglm list
## [2,] 33 svyglm list
## [3,] 33 svyglm list
## [4,] 33 svyglm list
## [5,] 33 svyglm list
The output is a list. The results then need to be pooled and combined.
### NOTE: there is an extract.svyglm function but that will only work with a single dataset, not a multiply imputed dataset
extract.svymi <- function(z){
#require(norm)
m2 <- length(z)
ns <- nobs(z[[1]]) #first dataset
betas <- lapply(z, FUN = function(x){coef(x)} )
sess <- lapply(z, FUN = function(x){vcov(x)}) #the vcov
#ses <- lapply(sess, FUN = function(x){sqrt(diag(x))}) #the se, not the complete vcov
dm <- lapply(z, FUN = function(x){model.matrix(x)})
r.1 <- sapply(1:m2, function(row) cor(dm[[row]] %*% betas[[row]], z[[row]]$y))
fisher.r2z <- function(r) {0.5 * (log(1+r) - log(1-r))}
zz <- mean(fisher.r2z(r.1))
r2 <- psych::fisherz2r(zz)^2 #in psych
R2 <- mean(r2)
test1 <- summary(pool_mi(betas, sess))
test1$term <- row.names(test1)
test2 <- test1[,c(8,1,2,4)] #4 is the pvalue
row.names(test2) <- c()
names(test2) <- c('term','estimate','std.error', 'p.value')
tr <- createTexreg(
coef.names = test2$term,
coef = test2$estimate,
se = test2$std.error,
pvalues = test2$p.value,
gof.names = c("R2", "Nobs"),
gof = c(R2, ns),
gof.decimal = c(T,F)
)
}
ret <- extract.svymi(model4)
## Multiple imputation results:
## Call: pool_mi(qhat = betas, u = sess)
## results se t p (lower
## (Intercept) -0.954487879 0.058789520 -16.235681 3.879433e-36 -1.070557223
## migrant -0.355519320 0.041246355 -8.619412 1.017630e-17 -0.436389673
## female -0.101647429 0.030684143 -3.312702 9.354611e-04 -0.161812642
## books 0.184774512 0.012503280 14.778083 2.997891e-36 0.160153403
## hisei 0.006049941 0.001250969 4.836203 6.292020e-06 0.003560504
## paredu 0.031885088 0.010432911 3.056203 3.280180e-03 0.011038579
## upper) missInfo
## (Intercept) -0.838418535 16.5 %
## migrant -0.274648966 3.5 %
## female -0.041482216 3.8 %
## books 0.209395620 13.1 %
## hisei 0.008539378 24.2 %
## paredu 0.052731598 27.4 %
screenreg(ret, digits = 3, custom.model.names = "using svyglm")
##
## =========================
## using svyglm
## -------------------------
## (Intercept) -0.954 ***
## (0.059)
## migrant -0.356 ***
## (0.041)
## female -0.102 ***
## (0.031)
## books 0.185 ***
## (0.013)
## hisei 0.006 ***
## (0.001)
## paredu 0.032 **
## (0.010)
## -------------------------
## R2 0.158
## Nobs 4073
## =========================
## *** p < 0.001, ** p < 0.01, * p < 0.05