Model-Based Clustering for Ensembles of Networks: A Theory-Driven Method with Applications to Student/Teacher Mobility.

Tessa Johnson; Tracy Sweet

Background/Context: Social network methodology is particularly relevant to the types of social structures found in education research. The current study develops a finite mixture approach for clustering ensembles of networks (NetMix). Following a structural equation modeling framework, NetMix simultaneously estimates a measurement model composed of parameters for ensembles of social networks and a structural model composed of predictors and distal outcomes of the latent categorical variable. We motivate the model with real data, present the model formulation and estimation, and evaluate the model with a simulation study.

Motivating Example: Educational Mobility Networks: One policy-relevant example of network ensembles with underlying population heterogeneity is the process of student or teacher mobility (i.e., non-promotional transfers) across schools. Network models offer a nuanced approach for modeling the mobility of persons between and among organizations. While nodes traditionally represent persons, in organizational mobility networks each node represents an organization, and ties between nodes represent the movement of persons or resources between organizations (Kerbow, 1996; Kerbow, Azcoitia, & Buell, 2003; Burdick-Will, Grigg, Nerenberg, & Connolly, 2020). This type of network is shown in Figure 1, where schools are represented by circles and curved arrows depict the direction of personnel transfer.

Figure 1. Mobility networks among high schools in two school districts.

Purpose/Objective/Research Question/Focus of Study: The current study presents the model formulation and estimation of the network mixture model (NetMix). The full presentation will evaluate model performance and demonstrate the utility of the new method using real and simulated data.

Statistical, Measurement, or Econometric Model: The NetMix method performs model-based clustering using finite mixtures on interpretable network parameters.
We choose network insularity (Sweet & Zheng, 2017, 2018) or latent position variance (Hoff, Raftery, & Handcock, 2002), depending on the structure of the individual networks. Network insularity governs the extent to which persons in one subgroup of a network form ties with persons in other subgroups, while latent position variance can be considered an indicator of how similar persons are to one another. The NetMix approach uses the chosen network parameter as the sole indicator of a latent categorical variable to group networks, based on the assumption that there are multiple underlying distributions of the parameter among the multiple networks. This approach (Wolfe, 1963, 1965; Fraley & Raftery, 2002) has a similar goal to statistical clustering but assumes that the data are drawn from a heterogeneous population, while also providing a statistical mechanism for testing hypotheses based on this assumption.

NetMix Model Formulation: The latent space model (LSM; Hoff, Raftery, & Handcock, 2002) and the mixed membership stochastic blockmodel (MMSBM; Airoldi et al., 2008) are both known as conditionally independent tie models, meaning that, conditional on a set of latent and observed variables, the ties between two persons are independent. If A is a sociomatrix with binary ties (binary ties are not necessary, but are given for simplicity), then A_ij equals 1 if there is a tie from person i to person j and 0 otherwise, and P(A_ij = 1) is the probability of that tie. For multiple, independent networks (indexed by k) and conditional on a set of covariates, X, latent variables, Z, and model parameters, φ, the probability of the ensemble factors over networks and over the ties within each network. We can now break down our model for the sociomatrix, A, into our specific formulations for the LSM and MMSBM. Although the latent variables, Z, for the LSM and MMSBM differ (see below), the overall model formulation is unchanged from above.
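The conditional independence assumption described above can be written as a product over networks and over ordered pairs of nodes. The display below is a sketch in generic notation, not the authors' own equation: the network subscript k on A, X, Z, and φ, the number of networks K, and the product over i ≠ j are assumed notation introduced for illustration.

```latex
% Sketch only: subscript k, the limit K, and the i != j product are assumed notation.
P(A_1, \ldots, A_K \mid X, Z, \phi)
  = \prod_{k=1}^{K} \; \prod_{i \neq j}
    P\left(A_{ijk} \mid X_{ijk}, Z_k, \phi_k\right)
```

Under this form, the LSM and MMSBM differ only in how the inner term is built from the latent variables.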
The LSM formulation for A models the probability of a tie between nodes i and j in network k as a function of d(Z_ik, Z_jk), where d is a distance function, generally Euclidean distance or an inner product, though Liu, Jin, & Zhang (2018) utilize Mahalanobis distances to incorporate latent nodal covariates. Z_ik is then the latent space coordinate for node i in network k.

The MMSBM takes a slightly different formulation. The MMSBM assumes that there is an underlying group or "block" structure among the nodes in the model (think team memberships, curriculum expertise, etc.), yet the MMSBM formulation allows nodes to take on different block memberships when sending versus receiving ties. In the hierarchical model specification (Sweet, 2019), θ gives the membership probability for belonging to each block, S and R give block membership vectors for nodes when sending versus receiving ties, ξ adjusts the ? parameter to center the Dirichlet distribution, and ? gives the distribution its shape (that is, the relative insularity of the network).

Mixture models in a structural equation modeling framework are specified to have two components: the measurement component and the structural component (Muthén, 2001). We present our current model specification using only a single indicator of the latent categorical variable; that is, our measurement model contains only the single parameter of interest for clustering our ensembles of networks, which is sufficient for model identification (Masyn, 2013). To incorporate our ensemble of social networks within a finite mixture model, we simply formulate our structural model such that the distribution of parameter φ is a mixture of distributions determined by class membership, q, where π_q gives the marginal probability that a network from the population belongs to class q, and class membership is exhaustive and mutually exclusive, i.e., Σ_q π_q = 1 (Muthén, 2001).
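To make the mixture idea concrete, the following minimal sketch computes a mixture density and the resulting class membership probabilities for a handful of network-level parameter values. It is an illustration under stated assumptions, not the NetMix implementation: the normal component densities, the function names, and all numeric values are hypothetical, and the parameter φ is treated as if it were directly observed for each network.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Normal density, evaluated elementwise (assumed component family)."""
    z = (x - mean) / sd
    return np.exp(-0.5 * z**2) / (sd * np.sqrt(2.0 * np.pi))

def mixture_density(phi, pi, means, sds):
    """Marginal density of phi: sum over classes q of pi_q * f_q(phi)."""
    phi = np.atleast_1d(np.asarray(phi, dtype=float))
    comps = pi * normal_pdf(phi[:, None], means, sds)  # (networks, classes)
    return comps.sum(axis=1)

def class_posteriors(phi, pi, means, sds):
    """Probability that each network belongs to each class, given phi."""
    phi = np.atleast_1d(np.asarray(phi, dtype=float))
    comps = pi * normal_pdf(phi[:, None], means, sds)
    return comps / comps.sum(axis=1, keepdims=True)

# Hypothetical two-class setup; none of these values come from the study.
pi = np.array([0.6, 0.4])        # class proportions, must sum to 1
means = np.array([-1.0, 2.0])    # class-specific means of phi
sds = np.array([0.5, 0.5])       # class-specific standard deviations
phi_hat = np.array([-1.2, 0.5, 2.3])  # one parameter value per network

dens = mixture_density(phi_hat, pi, means, sds)
post = class_posteriors(phi_hat, pi, means, sds)
print(post.round(3))
```

Each row of `post` sums to 1, mirroring the requirement that class membership be exhaustive and mutually exclusive.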
Incorporating predictors of the latent categorical variable and distal outcomes is then a matter of parameterizing the class proportion, π, commonly achieved with a multinomial logistic formulation that can take on a set of covariates, T.

Model Estimation: Bayesian estimation is used, and parameters are updated using a combination of Gibbs and Metropolis steps. Label switching of the categories of the latent categorical variable is addressed by imposing the constraint that categories be ordered by the values of the network parameter, though several other approaches to addressing the issue have been proposed (Diebolt & Robert, 1994; Jasra, Holmes, & Stephens, 2005). Finally, while several approaches have been proposed for class enumeration (Masyn, 2013), we apply a semi-confirmatory approach that assumes a fixed number of latent classes for our demonstration models.
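The ordering constraint used to address label switching can be illustrated with a small Gibbs sampler for a two-class mixture. This is a simplified sketch and not the authors' sampler: it treats the network parameter as observed, assumes normal classes with a known common standard deviation, uses hypothetical priors, and imposes the ordering by relabeling classes after each sweep, which is only one of several ways to apply such a constraint. The function name and the simulated data are also assumptions.

```python
import numpy as np

def gibbs_two_class(phi, n_iter=2000, sd=0.5, prior_sd=10.0, seed=0):
    """Gibbs sampler for a two-class normal mixture on a scalar parameter.

    Assumptions (not from the source): phi is observed, the within-class sd
    is known, class means have N(0, prior_sd^2) priors, and class
    proportions have a Dirichlet(1, 1) prior.
    """
    rng = np.random.default_rng(seed)
    phi = np.asarray(phi, dtype=float)
    n_classes = 2
    means = np.sort(rng.choice(phi, size=n_classes, replace=False))
    pi = np.full(n_classes, 1.0 / n_classes)
    draws = np.empty((n_iter, n_classes))
    for t in range(n_iter):
        # 1. Sample class labels given current means and proportions.
        logp = np.log(pi) - 0.5 * ((phi[:, None] - means) / sd) ** 2
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        labels = (rng.random(phi.size) < p[:, 1]).astype(int)
        # 2. Sample class proportions from their Dirichlet full conditional.
        counts = np.bincount(labels, minlength=n_classes)
        pi = rng.dirichlet(1.0 + counts)
        # 3. Sample each class mean from its normal full conditional.
        for q in range(n_classes):
            prec = counts[q] / sd**2 + 1.0 / prior_sd**2
            cond_mean = (phi[labels == q].sum() / sd**2) / prec
            means[q] = rng.normal(cond_mean, np.sqrt(1.0 / prec))
        # 4. Ordering constraint: relabel so class means are increasing.
        order = np.argsort(means)
        means, pi = means[order], pi[order]
        draws[t] = means
    return draws

# Simulated parameter values for 50 hypothetical networks in two classes.
rng = np.random.default_rng(1)
phi = np.concatenate([rng.normal(-1.0, 0.5, 30), rng.normal(2.0, 0.5, 20)])
draws = gibbs_two_class(phi)
post_means = draws[500:].mean(axis=0)  # discard the first 500 sweeps as burn-in
print(post_means)
```

Because classes are relabeled so that the first class always has the smaller mean, the saved draws can be averaged by column without the class labels swapping between sweeps.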