Selecting Appropriate Covariates in a Causal Relationship with Non-Gaussian Models: An Algorithm and a Simulation Study.

Bixi Zhang; Wolfgang Wiedermann

Background: Estimating causal effects is an important aim in education research. Causal relationships indicate how well treatments (e.g., interventions) work for the target subjects. Randomized experiments are a strong strategy for drawing inferences about such relationships. However, random assignment is often limited in education research and is sometimes even discouraged, because the resulting causal effect estimates can be biased, over-generalized, or in need of further testing (Cook, 2002). With observational data, a set of covariates is often adjusted for to obtain less biased estimates of the causal effect. From a statistical modeling perspective, the causal effect is estimated from a joint distribution conditional on a set of appropriate covariates (Pearl, 2009). Failing to control for such covariates can lead to inconsistent estimators. However, not every variable is eligible as a covariate. For example, if a covariate is a collider (i.e., a common effect), adjusting for it can bias the estimator of the causal effect (Weinberg, 1993). The back-door criterion states that, for a causal effect of x on y and a set of covariates Z, Z is admissible if it contains no descendants of x and blocks every back-door path from x to y (i.e., every path between x and y that contains an arrow into x) (Pearl, 2009, page 79). Following this definition, a confounder (z1 in Figure 1) should be added to the model, whereas a collider (z2 in Figure 1) should not be included because it is a descendant of x. Entner and colleagues (2012) developed a statistical test for the consistency of the causal estimator and proposed a selection algorithm for linear non-Gaussian models, which allows one to find an appropriate set of control variables.

Purpose: The proposal extends the forward selection algorithm by Entner et al. (2012) and provides an implementation in R (R Core Team, 2021) to help researchers select appropriate and significant covariates (e.g., confounders) and drop inadmissible variables (e.g., colliders).
A simulation study was performed to evaluate the performance of the proposed approach.

Algorithm design

Statistical test for consistency: Suppose we have x and y (with a causal relationship of the form x → y) and a set of variables W. For z ∈ W, we estimate two regression models via OLS:

x = zα + r_x (1)
y = xβ + zγ + r_y (2)

where r_x is the residual of regressing x on z, and r_y is the residual of regressing y on z and x. If Gaussianity of r_x is rejected and independence of r_x and r_y is not rejected at the chosen thresholds, the estimated effect (β) is consistent.

Forward selection algorithm

1 Basic checks:
1.1 Dimension check: the algorithm works for a univariate test with one predictor of interest. Missing values are not allowed.
1.2 Fixed effects check: If the model includes fixed effects (e.g., school fixed effects), the algorithm includes the fixed effects in equations (1) and (2).
1.3 Multicollinearity check: If a variable z from W has an extremely high correlation (>0.95) with x, the algorithm drops variable z.
2 Estimate equations (1) and (2) for each variable z from W:
2.1 Drop z if it is non-significant at a certain threshold in equation (2).
2.2 Apply a Gaussianity test (the Lilliefors test) to r_x and drop z if the test is non-significant at a certain threshold.
2.3 Conduct an independence test for r_x and r_y (Hilbert-Schmidt Independence Criterion (Gretton et al., 2008) or a nonlinear correlation test) and retain the p-value for each z.
2.4 Select the z with the largest p-value, provided that p-value also exceeds a certain threshold (this is the best z from set W).
3 Run two models with a set consisting of the best z from step 2.4 and another random z from W:
3.1 Follow the same procedure as in step 2 and select the best pair of z from W.
4 Repeat the steps until all admissible z from set W are selected.
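The consistency test described above (equations (1) and (2), followed by a Gaussianity test on r_x and an independence test on r_x and r_y) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' R implementation. All function names are hypothetical, the data are simulated for the example, and several choices are assumptions: the Lilliefors p-value is obtained by Monte Carlo simulation rather than from tables, the HSIC p-value comes from a permutation test with a Gaussian kernel whose bandwidth is set to the sample standard deviation, and variables are centered so that the regressions can omit intercepts.

```python
import math
import random

def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def residuals_eq1(x, z):
    """Equation (1): residuals of regressing x on z (centered, no intercept)."""
    a = dot(z, x) / dot(z, z)
    return [xi - a * zi for xi, zi in zip(x, z)]

def fit_eq2(y, x, z):
    """Equation (2): regress y on x and z; returns (coefficient of x, residuals)."""
    sxx, szz, sxz = dot(x, x), dot(z, z), dot(x, z)
    sxy, szy = dot(x, y), dot(z, y)
    det = sxx * szz - sxz ** 2
    b = (sxy * szz - szy * sxz) / det
    g = (szy * sxx - sxy * sxz) / det
    return b, [yi - b * xi - g * zi for yi, xi, zi in zip(y, x, z)]

def ks_to_normal(v):
    """Lilliefors statistic: KS distance to a normal with estimated mean and sd."""
    n = len(v)
    m = sum(v) / n
    s = math.sqrt(sum((a - m) ** 2 for a in v) / (n - 1))
    d = 0.0
    for i, a in enumerate(sorted(v)):
        cdf = 0.5 * (1.0 + math.erf((a - m) / (s * math.sqrt(2.0))))
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

def lilliefors_p(v, rng, n_sim=200):
    """Monte Carlo p-value under the null hypothesis of Gaussianity (assumed approach)."""
    obs = ks_to_normal(v)
    hits = sum(ks_to_normal([rng.gauss(0, 1) for _ in v]) >= obs for _ in range(n_sim))
    return (hits + 1) / (n_sim + 1)

def gram(v):
    """Gaussian-kernel Gram matrix; bandwidth = sample sd (an assumed, simple choice)."""
    n = len(v)
    m = sum(v) / n
    var = sum((a - m) ** 2 for a in v) / n
    return [[math.exp(-(a - b) ** 2 / (2 * var)) for b in v] for a in v]

def double_center(k):
    n = len(k)
    row = [sum(r) / n for r in k]
    tot = sum(row) / n
    return [[k[i][j] - row[i] - row[j] + tot for j in range(n)] for i in range(n)]

def hsic_p(a, b, rng, n_perm=100):
    """Permutation p-value for the biased HSIC statistic trace(KHLH) / n^2."""
    n = len(a)
    kc, l = double_center(gram(a)), gram(b)

    def stat(idx):
        return sum(kc[i][j] * l[idx[i]][idx[j]] for i in range(n) for j in range(n)) / n ** 2

    idx = list(range(n))
    obs = stat(idx)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(idx)  # permuting one sample breaks any dependence
        hits += stat(idx) >= obs
    return (hits + 1) / (n_perm + 1)

# Simulated example: z confounds x -> y, with uniform (non-Gaussian) errors.
rng = random.Random(1)
n = 100
z = center([rng.uniform(-1, 1) for _ in range(n)])
x = center([0.5 * zi + rng.uniform(-1, 1) for zi in z])
y = center([0.5 * xi + 0.5 * zi + rng.uniform(-1, 1) for xi, zi in zip(x, z)])

r_x = residuals_eq1(x, z)
beta, r_y = fit_eq2(y, x, z)
p_gauss = lilliefors_p(r_x, rng)
p_indep = hsic_p(r_x, r_y, rng)
# Decision rule from the text: Gaussianity rejected AND independence not rejected.
consistent = p_gauss < 0.05 and p_indep >= 0.05
print(f"beta={beta:.3f} p_gauss={p_gauss:.3f} p_indep={p_indep:.3f} consistent={consistent}")
```

In a forward selection loop, p_indep would be the value retained for each candidate z in step 2.3, and the candidate with the largest retained p-value above the threshold would be kept.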
The algorithm is implemented in R via the function forward_selection <- function(x, y, W, fixed = NULL, is.factor = TRUE, alpha_Gauss = 0.05, alpha_indep = 0.05, alpha_p = 0.05, independence = "hsic"), where fixed and is.factor control the fixed effects adjustment; alpha_Gauss, alpha_indep, and alpha_p are the thresholds for the respective tests; and independence specifies the independence test used in the algorithm. Figure 2 shows an output example in R.

Simulation study

Four types of variables and three models are considered in the causal effect model x → y.

Variable types:
A confounder (z → x and z → y);
A covariate (z → y but no relation with x);
A collider (x → z and y → z);
An independent variable (no relation with x and no relation with y).

Models:
True model: the model x → y, which should select only the confounder and the covariate;
False model: the reverse model from y to x, which should select no variables;
Combined decision: considering the true and false model decisions simultaneously.

In the simulation experiment, we consider various sample sizes, distributions of r_x and r_y, and types of independence test (see Table 1). All test thresholds are set to 0.05. Intercepts were set to 0, and slopes were fixed at 0.5 for generating x, y, and the collider. For each condition, we compute the proportion of correct decisions for each type of variable and each model across 1000 replications.

Results and Conclusion: Figure 3 and Figure 4 summarize the results for the simulation conditions. The algorithm shows acceptable performance. Results suggest that the algorithm can drop potential colliders in large samples. The lower correctness rates for the confounder, covariate, and independent variable are caused by dropping non-significant variables. The study presents an algorithm for covariate selection in non-Gaussian models that helps researchers select appropriate covariates when evaluating a causal effect.
The presented algorithm is able to 1) detect admissible covariates to de-confound causal effects and 2) detect reverse causation biases. The current algorithm is, however, restricted to OLS regression models with linear covariate relations. Extensions to generalized linear models and non-linear covariate effects are left for future work.
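The four variable types of the simulation design, and the reason a collider must be dropped, can be illustrated with a small data-generating sketch. It follows the stated design (intercepts of 0, slopes of 0.5 for generating x, y, and the collider), but the error distribution (uniform on [-1, 1]), the sample size, the seed, and all function names are assumptions made for this example only; the conditions actually studied are those listed in Table 1. The code is plain Python rather than the R implementation.

```python
import random

def simulate(n, rng):
    """Generate x, y and the four variable types with slopes of 0.5 and zero intercepts."""
    noise = lambda: rng.uniform(-1.0, 1.0)  # assumed non-Gaussian error distribution
    d = {k: [] for k in ("x", "y", "confounder", "covariate", "collider", "independent")}
    for _ in range(n):
        conf, cov, ind = noise(), noise(), noise()
        x = 0.5 * conf + noise()                         # z -> x
        y = 0.5 * x + 0.5 * conf + 0.5 * cov + noise()   # x -> y, z -> y
        coll = 0.5 * x + 0.5 * y + noise()               # x -> z and y -> z
        for k, v in zip(d, (x, y, conf, cov, coll, ind)):
            d[k].append(v)
    return d

def ols(y, cols):
    """No-intercept OLS coefficients via the normal equations (Gauss-Jordan elimination)."""
    k = len(cols)
    a = [[sum(p * q for p, q in zip(cols[i], cols[j])) for j in range(k)]
         + [sum(p * q for p, q in zip(cols[i], y))] for i in range(k)]
    for i in range(k):
        piv = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[piv] = a[piv], a[i]
        a[i] = [v / a[i][i] for v in a[i]]
        for r in range(k):
            if r != i:
                f = a[r][i]
                a[r] = [vr - f * vi for vr, vi in zip(a[r], a[i])]
    return [a[i][k] for i in range(k)]

d = simulate(2000, random.Random(7))

# Effect of x on y when adjusting for the admissible variables only.
b_admissible = ols(d["y"], [d["x"], d["confounder"], d["covariate"]])[0]
# Same model with the collider added as a control.
b_collider = ols(d["y"], [d["x"], d["confounder"], d["covariate"], d["collider"]])[0]

print(f"admissible set: {b_admissible:.3f}   with collider: {b_collider:.3f}")
```

Under these assumed settings, the estimate based on the admissible set stays close to the true slope of 0.5, whereas adding the collider pulls it well away from 0.5 (about 0.2 in the population for this particular setup), which is the bias the selection algorithm is meant to avoid.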