Multilevel modeling: A primer

A remarkable range of phenomena of concern to researchers in many fields are hierarchical (or nested) in nature. Patients are nested within doctors, employees within teams, family members within families, and students within classrooms (each of which has a different teacher), which are in turn nested within schools. Repeated measures produce the same structure: in daily diary studies, or when ambulatory physiological measures are taken multiple times a day over several days, individual measurements (i.e., observations) are clustered within days, which are in turn grouped by study participant. Whether the data are self-reports of thoughts and emotions or physiological/behavioral indicators captured using ambulatory devices, multiple observations are recorded within each day, over a period of days, across a sample of individuals.

The hierarchical nature of so much data has important implications for the analysis of such datasets. Consider an example in which we want to determine how well Grade 10 students are performing in math at a particular school. There are 3 Grade 10 classes at this school, each of which has a different teacher. A standard math test is administered to all students at the school and the scores are tabulated.

Of course there will be variance in scores across students: they are different people who studied to varying degrees, had varying levels of pre-existing math proficiency, and so on. But we would also expect students within a class to be more alike than students in different classes. In other words, the individual math scores across students are not independent. All the students in a given class share a teacher, who varies in teaching ability, likability, and so on. As a result, students in the same class will share variance.

Scanning down the Math Score column of Table 1, the variability in scores between students is readily apparent: student 1 scored 86, student 2 scored 76, student 3 scored 70, and so on. This variability comes not just from differences between students (study time, math proficiency, etc.) but also from differences between teachers, due to differences in teacher characteristics. This second source of variability is shared by all students in a given class, because they all share a given teacher. The observations (the math scores) are therefore not independent, and this lack of independence violates a key assumption of ordinary regression: that all observations are independent of each other.
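The shared within-class variance described above can be quantified by the intraclass correlation (ICC), the proportion of total variance that comes from the class level. A minimal simulation sketch (the class counts and variance values here are assumed for illustration, not taken from Table 1):

```python
import numpy as np

# Simulate students nested in classes: each class shifts all of its
# students' scores by a shared "teacher" effect.
rng = np.random.default_rng(42)
n_classes, n_students = 30, 25
tau2 = 4.0     # between-class (teacher) variance
sigma2 = 16.0  # within-class (student) variance

class_effect = rng.normal(0.0, np.sqrt(tau2), n_classes)
scores = (75.0
          + np.repeat(class_effect, n_students)
          + rng.normal(0.0, np.sqrt(sigma2), n_classes * n_students))

# Intraclass correlation: the share of total variance that is shared by
# students in the same class. ICC > 0 means observations are not independent.
by_class = scores.reshape(n_classes, n_students)
within = by_class.var(axis=1, ddof=1).mean()
between = by_class.mean(axis=1).var(ddof=1) - within / n_students
icc = between / (between + within)
print(round(icc, 2))  # true value in this simulation is 4 / (4 + 16) = 0.20
```

With these assumed variances, about a fifth of the total variance in scores is attributable to which class a student is in.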

Ordinary least squares regression, like most other widely used statistical tools, fails to take account of this non-independence of observations. If ordinary regression is nonetheless pursued, the hierarchical data must be "flattened", and the between-classroom variance is dumped together with the within-classroom variance into a single error term. The big problem with this is that standard errors tend to be underestimated, which inflates Type I error rates (effects look significant when they are not).

Multilevel modeling (MLM) is the statistical tool of choice for such situations, in which the hierarchical nature of the data means there are multiple sources of variance. MLM enables each source of variance to be modeled separately, each with its own variance term.

In the parlance of MLM, individual observations are at level 1. For example, every student’s math score would be at level 1. As students are grouped by classroom, the classroom (the grouping variable) would be considered level 2. In the case of diary studies and other real-time in situ designs involving repeated measures, each observation (such as a diary entry or glucose reading) would be at level 1, and these would be grouped by study participant, which would be considered level 2.

Multilevel modeling can be scaled to accommodate more than 2 levels. For example, students (level 1) can be grouped into classrooms (level 2), which are further grouped by school (level 3). In a repeated measures design, observations (level 1) can be clustered into days (level 2), which are in turn grouped by individual (level 3). Although there is no theoretical limit to how many levels can be modeled, model complexity grows quickly and becomes unwieldy, minimum required sample sizes increase considerably, and with insufficient data, models may fail to converge on parameter estimates. It is therefore advisable to use no more levels than necessary. Assuming convergence can be obtained, it is possible to statistically test how much a model improves when a particular level is included. This will be discussed later.

In regression an equation is constructed such that an outcome variable is a function of a linear combination of predictor variables each weighted by its own coefficient. Each coefficient quantifies how much a unit of change in the predictor is related to change in the outcome variable. In the following simple regression equation, the aim is to predict a student’s final exam score from his or her midterm score.

Final exam score = 1.5 × Midterm + 65

In this case, the coefficient for the predictor, midterm, is 1.5. This means that for every one unit increase in midterm score, the predicted final exam score will increase by 1.5.
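The example equation can be expressed as a small function (illustrative only; the coefficient and intercept are the ones given in the text):

```python
def predict_final(midterm: float) -> float:
    """Predicted final exam score from the example equation."""
    return 1.5 * midterm + 65

# A one-unit increase in the midterm score raises the prediction by 1.5.
print(predict_final(71) - predict_final(70))  # 1.5
```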

In ordinary (single-level) regression, coefficients are "fixed". That is, they can have only one value, and this value is applied to all level 1 units regardless of the level 2 unit under which they are nested. So the estimated weight for the midterm variable, 1.5, would remain the same (i.e., fixed) for all students regardless of the classroom to which they belonged.

Now, it might be tempting to say that you could take into account classroom by merely adding classroom into the regression equation like so:

Final exam score = b1 × Midterm + b2 × Classroom + a + e

The problem with this solution is that the variation due to differences across the level 2 variable, classroom, is dumped into the overall error term e. There is no separation of the variance attributed to the different levels. Ordinary least squares (OLS) regression cannot take into account error associated with sampling of observations at multiple levels of analysis. As a result, it is now considered inappropriate to use repeated measures ANOVA or single level multiple regression in such cases.

MLM holds several advantages over alternatives, including:

  1. The assumption of independent errors is not required. For example, students within a classroom are likely to be more alike than students in different classrooms; patients of a given doctor will experience more similar outcomes than patients of different doctors; and with repeated measures, measurements made closer together in time will likely be more highly correlated than measurements made further apart in time.
  2. MLM offers a great deal of analytical flexibility, permitting the examination of questions not otherwise possible. For one thing, predictors can be included at every level of the analysis. For example, you can predict student achievement from variables at different levels, such as student motivation and sex (student level), teacher enthusiasm (class level), and school type (school level). Differences in average classroom achievement (the intercept) and in the relationship between student achievement and student motivation (the slope) could both be modeled as a function of school type. It is also possible to model within-level and cross-level interactions.

Analyzing data that are naturally hierarchical as if they are all on the same level results in both statistical and interpretive errors.

Ecological fallacy: drawing conclusions about individuals from group-level data. For example, say that student achievement data are aggregated to the classroom level to see whether classes that differ in teacher enthusiasm (a classroom-level variable) have different mean test performance scores. Group data would then be used to draw conclusions about individuals.

Atomistic fallacy: drawing conclusions about groups from individual-level data. MLM avoids both fallacies: it allows prediction of individual scores adjusted for group differences, and prediction of group scores adjusted for individual differences within groups.

If individual scores are analyzed without taking the nested structure of the data into account, the Type I error rate will be inflated, because the analysis claims more degrees of freedom than the non-independent observations actually provide.

MLM deals with these issues by allowing intercepts (means) and slopes (IV-DV relationships) to vary between higher level units. For example, the relationship between student achievement and student motivation can vary across classes.

This variability is modeled by treating group intercepts and slopes as DVs in the next level of analysis. For example, we can attempt to predict differences in means and slopes within classrooms from differences in teacher enthusiasm across classrooms.
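Treating slopes as outcomes at the next level corresponds to a random-slopes model with a cross-level interaction. A sketch (all variable names and values here are assumed for illustration, using the hypothetical motivation/enthusiasm example from earlier):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate classes whose motivation->achievement slope varies, and is
# partly predicted by a class-level variable (teacher enthusiasm).
rng = np.random.default_rng(1)
n_classes, n_students = 40, 25
cls = np.repeat(np.arange(n_classes), n_students)
enthusiasm = np.repeat(rng.normal(0, 1, n_classes), n_students)
slope_j = np.repeat(rng.normal(0.8, 0.6, n_classes), n_students)
motivation = rng.normal(0, 1, n_classes * n_students)
achievement = (50 + 2 * enthusiasm
               + (slope_j + 0.5 * enthusiasm) * motivation
               + rng.normal(0, 2, n_classes * n_students))
df = pd.DataFrame(dict(achievement=achievement, motivation=motivation,
                       enthusiasm=enthusiasm, cls=cls))

# Random intercepts and random motivation slopes by class; the
# motivation:enthusiasm term asks whether enthusiasm moderates the slope.
model = smf.mixedlm("achievement ~ motivation * enthusiasm", df,
                    groups=df["cls"], re_formula="~motivation")
result = model.fit()
print(result.params["motivation:enthusiasm"])
```

The cross-level interaction coefficient estimates how much the within-class slope changes per unit of the class-level predictor, which is exactly the "slopes as outcomes" idea described above.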

In single-level regression, the individual participants are considered a random sample from some population; in multilevel modeling, the groups are also considered a random sample from some population.

An example: an MLM analysis of students nested in classes

The following example uses a dataset from Hox (2010).

Suppose we have a dataset consisting of student popularity for a total of 2000 students from 100 classes. The table below shows the first 20 cases from this dataset. The outcome variable is popularity, which is a self-report rating on a scale from 0 (very unpopular) to 10 (very popular). Sex (0=boy, 1=girl) is the student level (level 1) predictor and teacher experience is a class level (level 2) predictor.

In MLM analyses, the first step is always to test a baseline or "null" model containing only a random intercept and no predictors. The aim is to test whether the outcome variable (popularity) differs, on average, across level 2 units (classes).

Although different frameworks and notations exist for MLM, it is helpful to use separate equations for each level, following the notation promoted by Raudenbush and Bryk (2002). Here are the 2 equations for a null model, that is, a model with no predictors, where level-1 observations (indexed by i) are nested within level-2 groups (indexed by j):

Level 1: Yij = β0j + rij

Level 2: β0j = γ00 + μ0j

In the first equation, Yij represents the outcome variable (in this case, popularity), β0j represents the mean popularity of class j's students, and rij represents the difference between class j's mean popularity and the popularity of student i. Notice that the inclusion of the j subscript makes it clear that we wish to model the intercept not as a fixed coefficient but as a random one, in which separate estimates will be derived for each group j.

Given that the main advantage of MLM is the ability to model the variance at each level separately, if we want the intercept to be free to vary across groups, then the intercept for each group j requires its own equation in which the variance at that level can be incorporated into the model. That is the level 2 equation. In this equation, γ00 represents the grand mean (the average popularity computed across all classes), and μ0j represents the difference between class j's average and the grand mean. The inclusion of the μ0j term signals that we wish to model the intercept as random: the equation indicates that the intercept has both a fixed component (γ00) and a random component (μ0j). Indeed, using separate equations for each level makes it easy to see which coefficients are modeled as fixed and which as random.

We can write out the full equation by substituting the level 2 equation into the level 1 equation, which gives us:

Yij = γ00 + μ0j + rij

This equation makes it easy to see that the popularity of student i in class j is a function of 3 components: the average popularity of students across all classes, plus how much class j deviates from this grand mean, plus how much student i's popularity deviates from class j's average. So the variance terms μ0j and rij adjust the prediction according to the two sources of variance: (1) how much a class deviates from the mean of all classes and (2) how much a student deviates from the mean within a class.
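The null model can be fit with standard mixed-model software. A sketch using simulated data shaped like the popularity example (100 classes, values assumed rather than taken from Hox's actual file):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate 100 classes of 20 students: grand mean 5.0, class deviations
# (the μ0j) with SD 0.8, student deviations (the rij) with SD 1.2.
rng = np.random.default_rng(2)
n_classes, n_students = 100, 20
mu0 = np.repeat(rng.normal(0, 0.8, n_classes), n_students)
popularity = 5.0 + mu0 + rng.normal(0, 1.2, n_classes * n_students)
df = pd.DataFrame({"popularity": popularity,
                   "cls": np.repeat(np.arange(n_classes), n_students)})

# Null model: random intercept only, no predictors.
null_model = smf.mixedlm("popularity ~ 1", df, groups=df["cls"]).fit()
gamma00 = null_model.params["Intercept"]  # estimate of the grand mean γ00
tau2 = null_model.cov_re.iloc[0, 0]       # variance of the class deviations μ0j
sigma2 = null_model.scale                 # variance of the student deviations rij
icc = tau2 / (tau2 + sigma2)              # share of variance at the class level
print(round(gamma00, 1), round(icc, 2))
```

The fitted model recovers the three components of the combined equation: the grand mean as the fixed intercept, and the two variance terms as the random-intercept variance and the residual variance.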
