Download the raw R markdown code here: https://jwiley.github.io/MonashHonoursStatistics/LMM_Comparison.rmd. These are the R packages we will use.
options(digits = 5)
library(data.table)
library(JWileymisc)
library(extraoperators)
library(lme4)
library(lmerTest)
library(multilevelTools)
library(visreg)
library(ggplot2)
library(ggpubr)
library(haven)
## load data collection exercise data
## merged is a merged long dataset of baseline and daily data
dm <- as.data.table(read_sav("Merged.sav"))
For many statistical models, including LMMs, it can be informative to compare different models. Model comparisons can be used in many different ways, for example to test whether adding predictors improves fit, to check how sensitive results are to analytic decisions, or to choose among competing models; these examples are not meant to be exhaustive.
We will look at examples of the different uses of model comparisons in this topic. In fact, we already saw one example in the topic on Moderation where we compared the results, by eye, from two models that differed only in whether some extreme values were included or excluded. This is a type of model comparison, and what we learned in that example was that the results of the model were not sensitive to the two extreme values.
Just as there are many different kinds of models we can fit, even with LMMs (e.g., with or without random slopes, etc.), so too are there many different kinds of, and purposes for, model comparisons.
To begin with, let’s imagine we are just trying to compare two models: Model A and Model B. We can broadly classify the type of comparison based on whether Model A and B are nested or non-nested models. We will talk about what these mean next.
Models are considered nested when one model is a restricted or constrained version of another model. For example, suppose that Model A predicts mood from stress and age whereas Model B predicts mood from age only. Written as a formula, these could be:
\[ ModelA: mood_{ij} = b_{0j} + b1 * age_j + b2 * stress_{ij} + \varepsilon_{ij} \]
and
\[ ModelB: mood_{ij} = b_{0j} + b1 * age_j + 0 * stress_{ij} + \varepsilon_{ij} \]
In Model B, I purposely used \(0 * stress_{ij}\) to highlight that when a predictor is left out of a model, it is the same as fixing (sometimes also called constraining) the coefficient (\(b2\) in this case) to be exactly 0. In this case, we would say that Model B is nested within Model A. In other words, Model A contains every predictor and parameter in Model B plus more.
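As a quick, purely illustrative sketch of this point (using the same data and variables fit later in this topic), Model B gives exactly the same fit whether stress is omitted entirely or included with its coefficient forced to be exactly 0 via a zero offset:
## illustrative only: "Model B" written two ways; omitting stress entirely is
## the same as including it with its coefficient fixed to exactly 0
b.omit <- lmer(mood ~ age + (1 | ID), data = dm, REML = FALSE)
b.zero <- lmer(mood ~ age + offset(0 * stress) + (1 | ID), data = dm, REML = FALSE)
logLik(b.omit)  ## identical log likelihoods
logLik(b.zero)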
This idea is similar to the idea of nested data used in LMMs, but the difference is that we are not talking about observations or data, rather we are talking about the parameters of a model.
To summarize, briefly, we say that Model B is nested within Model A if:
* every parameter in Model B also appears in Model A (Model A adds one or more parameters), and
* both models are fit to the same data.
Note that fitting both models to the same dataset in R does not guarantee the same data are used, because if the additional predictors included in Model A have some missing data, by default R would drop those cases, resulting in a different subset of cases being used for Model A than for Model B.
If two models are nested, then we have the most options in terms of comparing the two models. For example, we can evaluate whether Model A is a statistically significantly better fit than Model B using a Likelihood Ratio Test (LRT; https://en.wikipedia.org/wiki/Likelihood-ratio_test).
We can compare the fit of each model and use the difference in fit to derive effect sizes. We also can attribute any differences between the two models to the parameter(s) that have been constrained to 0 in Model B relative to Model A.
A simple definition of the LRT statistic, \(\lambda\), is based on two times the difference in the log likelihoods.
\[ \lambda = -2 * (LL_B - LL_A) \]
You may wonder why the likelihood ratio test can be conducted by taking the difference in the log likelihoods. It is because the log of a ratio is the same as the difference in the logs of the numerator and denominator.
\[ log_{e}\left(\frac{6}{2}\right) = log_{e}(6) - log_{e}(2) \]
which we can confirm is true in R:
## log of ratio
log(6/2)
## [1] 1.0986
## difference in logs
log(6) - log(2)
## [1] 1.0986
If the null hypothesis of no difference is true in the population, then \(\lambda\) will follow a chi-squared distribution with degrees of freedom equal to the number of parameters constrained to 0 in Model B relative to Model A, which is the difference in the degrees of freedom used for each model, that is:
\[ \lambda \sim \chi^2(DF_A - DF_B) \]
Thus we often use a chi-square distribution in the LRT to look up the p-value. Finally, note that because LRTs are based on the log likelihoods (LL) from a model, we need true log likelihoods for the LRT to be valid. Therefore, we cannot use restricted maximum likelihood (REML); we need to use maximum likelihood (ML) estimation.
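As a practical aside, here is a small sketch using a hypothetical model object m_reml (not one fit in this topic): a model that was already fit with the default REML can be refit with maximum likelihood before computing a LRT.
## hypothetical example: refit an REML model with ML before any LRT
## m_reml stands in for any lmer() model fit with the default REML = TRUE
m_ml <- update(m_reml, REML = FALSE)
## lme4 also provides refitML() for the same purpose
m_ml <- refitML(m_reml)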
To see the idea of nested models and the LRT in action, let's examine a concrete example in R. Here are two LMMs corresponding to the Model A and Model B formulae we wrote previously. We can see in the R code that the models are nested; the only difference is stress. We set REML = FALSE to get maximum likelihood estimates so that we get true log likelihoods. We also need to confirm that the two models are based on the same number of observations, which we can extract in R using the nobs() function. This lets us confirm that the two models are fitted to the same data. For example, if stress had some missing data on cases that were not missing age or mood, Model A would be based on fewer observations than Model B.
modela <- lmer(mood ~ age + stress + (1 | ID), data = dm, REML = FALSE)
modelb <- lmer(mood ~ age + (1 | ID), data = dm, REML = FALSE)
nobs(modela)
## [1] 183
nobs(modelb)
## [1] 183
In this case, we can see that the number of observations is identical. Now we can find the log likelihoods of both models by using the logLik() function.
logLik(modela)
## 'log Lik.' -281.71 (df=5)
logLik(modelb)
## 'log Lik.' -290.86 (df=4)
Now we have the log likelihoods (LL) of each model and the degrees of freedom from both models. From this, we can calculate \(\lambda\) and then look up the p-value for a chi-square distribution with \(\lambda\) and 1 degree of freedom (1 is the difference in degrees of freedom between Model A and B). To get the p-value from a chi-square, we use the pchisq() function.
## lambda
-2 * (-290.86 - -281.71)
## [1] 18.3
## p-value from a chi-square
pchisq(18.3, df = 1, lower.tail = FALSE)
## [1] 1.8871e-05
In practice, we do not need to do these steps by hand; we can get a test of two nested models in R using the anova() function (which in this case is not really analyzing variance). We use the anova() function with test = "LRT" to get a likelihood ratio test.
anova(modela, modelb, test = "LRT")
## Data: dm
## Models:
## modelb: mood ~ age + (1 | ID)
## modela: mood ~ age + stress + (1 | ID)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## modelb 4 590 603 -291 582
## modela 5 573 589 -282 563 18.3 1 1.9e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If you compare the chi-square value, degrees of freedom, and p-value, you'll see that they basically match what we calculated by hand. The small differences are due to rounding (R will use more decimals of precision whereas we only used two decimals).
In this simple case, we are only testing a single parameter, the fixed regression coefficient for stress, because that is the only parameter that differs between Model A and Model B. Thus in this case, it would be easier to rely on the t-test we get from a summary of the model.
summary(modela)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
## method [lmerModLmerTest]
## Formula: mood ~ age + stress + (1 | ID)
## Data: dm
##
## AIC BIC logLik deviance df.resid
## 573.4 589.5 -281.7 563.4 178
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.404 -0.471 0.187 0.591 2.303
##
## Random effects:
## Groups Name Variance Std.Dev.
## ID (Intercept) 0.424 0.651
## Residual 0.996 0.998
## Number of obs: 183, groups: ID, 50
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 6.5087 0.9861 65.3214 6.60 8.6e-09 ***
## age -0.0214 0.0406 56.3137 -0.53 0.6
## stress -0.2842 0.0646 179.6811 -4.40 1.9e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) age
## age -0.961
## stress -0.333 0.090
In this instance, the LRT and the t-test yield equivalent p-values, 1.9e-05; however, this does not have to be the case. The LRT is based on slightly different theory than the t-test: the t-test uses approximate degrees of freedom for the t-distribution based on the Satterthwaite method, whereas the LRT does not directly incorporate the sample size in the same way. The two methods are asymptotically equivalent (at very large sample sizes) but can differ, particularly in smaller samples.
In practice, we wouldn’t usually use a LRT to evaluate whether a single model parameter is statistically significant. The benefit of (nested) model comparisons is that it allows us to compare two models. Those models can be quite different.
Here are another two models, this time they differ by two predictors, energy and female. The LRT now tests whether the addition of energy and female results in significantly better fit for Model A than Model B.
modela <- lmer(mood ~ age + stress + energy + female + (1 | ID),
data = dm, REML = FALSE)
modelb <- lmer(mood ~ age + stress + (1 | ID), data = dm,
REML = FALSE)
nobs(modela)
## [1] 183
nobs(modelb)
## [1] 183
anova(modela, modelb, test = "LRT")
## Data: dm
## Models:
## modelb: mood ~ age + stress + (1 | ID)
## modela: mood ~ age + stress + energy + female + (1 | ID)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## modelb 5 573 589 -282 563
## modela 7 488 511 -237 474 89 2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this case, we can see from the significant p-value that Model A is a significantly better fit to the data than is Model B. Note that LRTs are only appropriate when the models are nested. We cannot use LRTs for non-nested models.
In nested models, the more complex model is often called the “Full” model and the simpler model the “Reduced” or “Restricted” version of the Full model.
Models are considered non nested when one model is not strictly a constrained version of a more complex model. For example, suppose that Model A predicts mood from stress and age whereas Model B predicts mood from age and female. Written as a formula, these could be:
\[ ModelA: mood_{ij} = b_{0j} + b1 * age_j + b2 * stress_{ij} + \varepsilon_{ij} \]
and
\[ ModelB: mood_{ij} = b_{0j} + b1 * age_j + b2 * female_j + 0 * stress_{ij} + \varepsilon_{ij} \]
Although Model B includes age, which also is present in Model A, and it restricts stress to 0, it adds another predictor, female, which is not in Model A. Thus the two models are not nested, even though we could fit both models on the same number of observations with the same outcome. Although we can still ask R to conduct a LRT, that LRT is not valid. It is shown here to highlight that you, as the analyst, are responsible for determining whether your two models are nested and therefore whether a LRT is an appropriate way to evaluate whether the models differ significantly from each other.
modela <- lmer(mood ~ age + stress + (1 | ID),
data = dm, REML = FALSE)
modelb <- lmer(mood ~ age + female + (1 | ID), data = dm,
REML = FALSE)
nobs(modela)
## [1] 183
nobs(modelb)
## [1] 183
## this is NOT appropriate LRT
anova(modela, modelb, test = "LRT")
## Data: dm
## Models:
## modela: mood ~ age + stress + (1 | ID)
## modelb: mood ~ age + female + (1 | ID)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## modela 5 573 589 -282 563
## modelb 5 589 605 -289 579 0 0 1
Although we cannot conduct a LRT on non nested models, it is still useful to compare the fit of non-nested models. For example, if one is a much better fit than another model, that may suggest one set of predictors is superior or can be used to evaluate competing hypotheses (e.g., Theory 1 says that stress and family history are the most important predictors of mood and Theory 2 says that age and sleep are the best predictors of mood — these two theories are competing, not nested versions of each other).
We cannot use LRTs to compare non-nested models, but we can use other measures, including performance measures such as variance explained or model accuracy, and information criteria. We will talk about information criteria next.
Two common information criteria are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC).
Both the AIC and BIC are calculated based primarily on the log likelihood (LL) of a model. One way of thinking about these information criteria is to view models as approximations of reality. Suppose you have two models that both generate approximations (predictions) of reality. The model whose predictions are closer to the observed data will have a higher LL. The LL can be used as a relative measure of model fit; it is not an absolute measure of fit. That is, we do not interpret a specific LL value as indicating "good" fit, only which of a set of (all potentially bad) models is the best.
However, there is a limitation with using the LL alone. The LL will always stay the same or increase as additional predictors / parameters are added to the model. Thus, if we used the LL alone, out of a set of competing models we would virtually always pick the more complex ones. To address this, we need to incorporate some penalty for the complexity of a model, so that for a more complex model to be chosen as "best" over a simpler one, it has to improve the LL enough.
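As a small demonstration of this (a sketch using the variables from this topic), adding age to the stress model cannot lower the maximized log likelihood, even though age turns out to be a weak predictor here; both models must use the same observations for this comparison to hold.
## the log likelihood stays the same or increases (up to numerical precision)
## when a predictor is added to a nested model fit to the same data
logLik(lmer(mood ~ stress + (1 | ID), data = dm, REML = FALSE))
logLik(lmer(mood ~ stress + age + (1 | ID), data = dm, REML = FALSE))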
The AIC and BIC are very similar except that they use different penalties for model complexity (technically they are derived from rather different theoretical foundations, but for practical purposes they are similar other than the complexity penalties).
A common way of defining model complexity is based on the number of estimated model parameters, \(k\). For example, consider the following simple linear regression for \(n\) different observations:
\[ \begin{align} \eta_{i} &= b_0 + b_1 * x_i\\ y_i &\sim \mathcal{N}(\eta_i, \sigma_{\varepsilon_i})\\ \end{align} \]
The parameters are:
\[ b_0, b_1, \sigma_{\varepsilon_i} \]
so \(k = 3\).
The equations for AIC and BIC are quite easy to follow and looking at them helps understand where they are similar and different.
\[ \begin{align} AIC &= 2 * k - 2 * LL \\ BIC &= log_{e}(n) * k - 2 * LL \\ \end{align} \]
\(n\) is the number of observations included in the model and \(LL\) is the log likelihood, where higher values indicate a better fitting model. These equations highlight that the only difference between the AIC and BIC is whether the number of parameters in the model, \(k\), is multiplied by \(2\) (AIC) or by \(log_{e}(n)\) (BIC). Thus, the AIC and BIC will be identical if:
\[ \begin{align} log_{e}(n) &= 2 \\ n &= e^2 \approx 7.39 \\ \end{align} \]
If \(n < e^2 \approx 7.39\) then the BIC will have a weaker penalty based on the number of parameters, \(k\), than the AIC. If \(n > e^2 \approx 7.39\) then the BIC will have a stronger penalty based on the number of parameters, \(k\), than the AIC. Functionally, a stronger penalty on the number of parameters means that a particular information criterion will tend to favor more parsimonious (less complex) models. Thus, for all but the tiniest of sample sizes (\(\approx 7.39\)) the BIC will have a stronger penalty on model complexity and so will favor relatively more parsimonious models than will AIC.
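To make these formulas concrete, here is a small sketch computing the AIC and BIC by hand from the log likelihood of modela (fit earlier); the results can be checked against the AIC() and BIC() functions used below.
## AIC and BIC by hand from the log likelihood
ll <- as.numeric(logLik(modela))  ## log likelihood
k <- attr(logLik(modela), "df")   ## number of estimated parameters
n <- nobs(modela)                 ## number of observations
2 * k - 2 * ll                    ## AIC; compare with AIC(modela)
log(n) * k - 2 * ll               ## BIC; compare with BIC(modela)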
For both AIC and BIC, the relatively better model of those compared is the model with the lower value (i.e., lower = better for both AIC and BIC).
There is no "right" or "wrong" information criterion to use. People use the AIC, the BIC, or both. If the AIC and BIC suggest the same model is "best", there is no ambiguity. Sometimes the AIC and BIC disagree about which model is best; in these cases one must pick which information criterion to go with. Both criteria require that the number of observations, \(n\), be larger than the number of parameters, \(k\), to operate well.
A benefit of the AIC and BIC is that both can be used to compare non-nested models as well as nested models. It is relatively common to use the AIC and/or BIC to "choose" amongst a set of possible models.
In R we can usually calculate the AIC and BIC using the AIC() and BIC() functions. Here we will calculate the AIC and BIC for our two models.
AIC(modela, modelb)
## df AIC
## modela 5 573.42
## modelb 5 588.56
BIC(modela, modelb)
## df BIC
## modela 5 589.46
## modelb 5 604.61
In this case, both the AIC and BIC are lower for Model A than for Model B, indicating that Model A is the better of those two models. Again, the relative interpretation is important. We cannot conclude that Model A is a "good" model, only that it is better than Model B.
Integrating what we have learned about LRT for nested models and AIC/BIC for nested or non nested models, we can compare across several models to select the best model.
First, we fit a series of models. In all cases, mood is the outcome. We start with an intercept only model (m0), an unconditional model, as a baseline reference. This is helpful as all the fit indices are relative. Then we fit polynomials of stress with degree 1 (linear; m1), degree 2 (quadratic; m2), and degree 3 (cubic; m3). If you are not familiar with polynomials, please see the extra section at the end of this lecture now. We fit the different degree polynomials of stress using the poly() function in R. Finally, we fit a competing model: maybe a linear effect of energy is a better predictor of mood than stress.
Next, we check that all the observations are identical across models. If the observations were not the same across models, we would need to create a new dataset that excluded any cases with missing data on any variable used in any of the models. This is a critical step when needed, because LRTs, the AIC, and the BIC are only valid comparisons if based on the same data.
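For illustration only (not needed here, as the observation counts below confirm), a complete-case dataset restricted to the variables used across all of the models could be created along these lines; the hypothetical object name dmcc is not used elsewhere in this topic.
## hypothetical sketch: keep only cases complete on every variable used in
## any of the models being compared, so all models use identical data
dmcc <- na.omit(dm[, .(ID, mood, stress, energy)])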
dm[, stress := as.numeric(stress)]
dm[, mood := as.numeric(mood)]
m0 <- lmer(mood ~ 1 + (1 | ID), data = dm, REML = FALSE)
m1 <- lmer(mood ~ poly(stress, 1) + (1 | ID), data = dm, REML = FALSE)
m2 <- lmer(mood ~ poly(stress, 2) + (1 | ID), data = dm, REML = FALSE)
m3 <- lmer(mood ~ poly(stress, 3) + (1 | ID), data = dm, REML = FALSE)
malt <- lmer(mood ~ energy + (1 | ID), data = dm, REML = FALSE)
## check all the observations are the same
nobs(m0)
## [1] 183
nobs(m1)
## [1] 183
nobs(m2)
## [1] 183
nobs(m3)
## [1] 183
nobs(malt)
## [1] 183
Now we can see which model is the best fit. For the nested models (m0 - m3) we can use LRTs. Technically, m0 also is nested in malt, so we can use a LRT for that too. We cannot use a LRT for, say, m1 and malt, as those models are not nested. We can use the AIC and BIC for all models, though. Note that with multiple models, the anova() function does sequential LRTs (e.g., m0 vs m1; m1 vs m2; etc.), not all models compared to one reference model. If you want two specific models compared, specify just those two.
#### LRTs for nested models ####
## two specific comparisons
anova(m0, m3, test = "LRT")
## Data: dm
## Models:
## m0: mood ~ 1 + (1 | ID)
## m3: mood ~ poly(stress, 3) + (1 | ID)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m0 3 588 597 -291 582
## m3 6 574 593 -281 562 20.1 3 0.00016 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## sequential comparisons
anova(m0, m1, m2, m3, test = "LRT")
## Data: dm
## Models:
## m0: mood ~ 1 + (1 | ID)
## m1: mood ~ poly(stress, 1) + (1 | ID)
## m2: mood ~ poly(stress, 2) + (1 | ID)
## m3: mood ~ poly(stress, 3) + (1 | ID)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m0 3 588 597 -291 582
## m1 4 572 585 -282 564 18.05 1 2.1e-05 ***
## m2 5 572 588 -281 562 2.05 1 0.15
## m3 6 574 593 -281 562 0.00 1 0.95
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m0, malt, test = "LRT")
## Data: dm
## Models:
## m0: mood ~ 1 + (1 | ID)
## malt: mood ~ energy + (1 | ID)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m0 3 588 597 -291 582
## malt 4 492 505 -242 484 97.5 1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## AIC and BIC for nested and non nested models
AIC(m0, m1, m2, m3, malt)
## df AIC
## m0 3 587.75
## m1 4 571.69
## m2 5 571.64
## m3 6 573.63
## malt 4 492.21
BIC(m0, m1, m2, m3, malt)
## df BIC
## m0 3 597.37
## m1 4 584.53
## m2 5 587.69
## m3 6 592.89
## malt 4 505.05
From the LRT we can see that m3 is a significantly better fit than m0. Looking at the sequential tests, however, m2 is no better than m1 and m3 is no better than m2, which might suggest m1 is the best model. If we look at the AIC values for the models with stress, the AIC selects m2 as the best model (by .05 compared to m1, very close). The BIC favors more parsimonious models and selects m1 over m2. However, both the AIC and BIC indicate that the alternate model (malt) is the best of all the models evaluated, suggesting that linear energy is a better predictor of mood than are linear, quadratic, or cubic polynomials of stress.
If we were picking a stress model only, we would pick between the linear (m1; LRT, BIC) and quadratic (m2; AIC) model, depending which method we used. If we were willing to pick energy over stress, we would clearly go with energy.
In addition to testing fixed effects, these same methods can be used to test random effects. Here we will fit a new model, m4, that includes a random slope of stress and the correlation between the random intercept and the random stress slope. Then we compare the linear stress model with a random slope (m4), the linear stress model without a random slope (m1), and the intercept only model (m0).
m4 <- lmer(mood ~ stress + (1 + stress | ID), data = dm, REML = FALSE)
## check all the observations are the same
nobs(m0)
## [1] 183
nobs(m1)
## [1] 183
nobs(m4)
## [1] 183
anova(m0, m1, test = "LRT")
## Data: dm
## Models:
## m0: mood ~ 1 + (1 | ID)
## m1: mood ~ poly(stress, 1) + (1 | ID)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m0 3 588 597 -291 582
## m1 4 572 585 -282 564 18.1 1 2.1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m1, m4, test = "LRT")
## Data: dm
## Models:
## m1: mood ~ poly(stress, 1) + (1 | ID)
## m4: mood ~ stress + (1 + stress | ID)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m1 4 572 585 -282 564
## m4 6 572 592 -280 560 3.42 2 0.18
anova(m0, m4, test = "LRT")
## Data: dm
## Models:
## m0: mood ~ 1 + (1 | ID)
## m4: mood ~ stress + (1 + stress | ID)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m0 3 588 597 -291 582
## m4 6 572 592 -280 560 21.5 3 8.4e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(m0, m1, m4)
## df AIC
## m0 3 587.75
## m1 4 571.69
## m4 6 572.27
BIC(m0, m1, m4)
## df BIC
## m0 3 597.37
## m1 4 584.53
## m4 6 591.53
From these tests we can conclude several things. First, adding a linear trend of stress as a fixed effect is significantly better than the intercept only model. Second, adding a random slope of stress and the correlation of the random slope and intercept is not significantly better than the model with a fixed linear effect of stress. Third, the model with stress as both a fixed and random slope is significantly better than the intercept only model (m4 vs m0 is kind of an omnibus test of the effect of adding stress as both a fixed and random slope to the model). However, this multi degree of freedom test masks the fact that the effect is really driven by the fixed slope of stress, not the random slope.
Finally, both AIC and BIC favor m1 over the other models, BIC by several points, suggesting that m1 is the best balance of complexity and model fit.
The main point of this final exercise is to show how you can use LRTs and the AIC and BIC to evaluate whether random effects "help" a model fit.
As with other statistical models, relying solely on statistical significance is limiting as it ignores the magnitude of an effect.
Simple \(R^2\) measures of variance accounted for are problematic for LMMs; however, several approaches have recently been put forward.
To understand these measures, let’s go back to a relatively simple LMM with a fixed effect predictor and a random intercept only.
\[ y_{ij} = b_{0j} + b_1 * x_j + \varepsilon_{ij} \]
Recall that we separate the random intercept as:
\[ b_{0j} = \gamma_{00} + u_{0j} \]
substituting we get:
\[ y_{ij} = \gamma_{00} + u_{0j} + b_1 * x_j + \varepsilon_{ij} \]
and re-arranging slightly, we can break the model down into essentially three “parts”:
\[ y_{ij} = \overbrace{\gamma_{00} + b_1 * x_j}^\text{fixed} + \underbrace{u_{0j}}_\text{random} + \overbrace{\varepsilon_{ij}}^\text{residual} \]
For each of these three parts, we can calculate the variance associated with them. First the variance of the fixed effects portion of the model:
\[ \sigma^{2}_{fixed} = Var(\gamma_{00} + b_1 * x_j) \]
Next the variance of the random effects part of the model, here the random intercept:
\[ u_{0j} \sim \mathcal{N}(0, \sigma^{2}_{u_{0}}) \]
Finally, the variance of the residuals (the unexplained part):
\[ \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^{2}_{\varepsilon}) \]
Nakagawa and Schielzeth (2013; https://doi.org/10.1111/j.2041-210x.2012.00261.x) defined two variance explained measures based on these components, as follows:
\[ R^{2}_{LMM(m)} = \frac{\sigma^{2}_{fixed}}{\sigma^{2}_{fixed} + \sigma^{2}_{u_{0}} + \sigma^{2}_{\varepsilon}} \]
and
\[ R^{2}_{LMM(c)} = \frac{\sigma^{2}_{fixed} + \sigma^{2}_{u_{0}}}{\sigma^{2}_{fixed} + \sigma^{2}_{u_{0}} + \sigma^{2}_{\varepsilon}} \]
As it is difficult to write \(R^{2}_{LMM(m)}\), I refer to this as the Marginal R2. It captures the proportion of the total variance attributable to the fixed effects portion of the model.
Likewise, \(R^{2}_{LMM(c)}\) is referred to as the Conditional R2. It captures the proportion of the total variance attributable to both the fixed and random effects portion of the model (i.e., it only excludes the residual variance).
The Conditional R2 essentially measures the total proportion of variability explainable by the model. The Marginal R2 measures the total proportion of variability explainable only by the fixed effects, the 'average' or marginal effect in the population. Both are useful in different circumstances. Because the Conditional R2 incorporates random effects, it is not a measure of how much variance we could predict in new data. For example, if we had built a LMM and then applied the model to data from a new patient who came to a clinic, without their data we would not know what their random effects should be, so we could only apply the average effects, the fixed effects. Thus, the Marginal R2 gives us a sense of how much variance we might explain using only the averages from our model, that is, how much we might explain for someone new. The Conditional R2 uses both the fixed and random effects and tells us how much variance, essentially in our own sample, our model is able to explain.
One caveat: it is technically possible for the Marginal R2 to decrease when more predictors are added to the model. This cannot happen with R2 in linear regression, and it generally does not happen in LMMs, but it can; this is simply a limitation of how LMMs are estimated.
Also note that if you have an intercept only model, that is a model with only a fixed and a random intercept, there will be no variability associated with the fixed effects (since there is only an intercept and it does not vary) and in this special case, the Conditional R2 is identical to the ICC.
Although we looked at the formula for Marginal R2 and Conditional R2 from a LMM with only a random intercept and no other random effects, Johnson (2014; https://dx.doi.org/10.1111/2041-210X.12225) has shown they can be extended to random slope models. Although additional random components are included, the interpretation of the Marginal R2 as being the proportion of total variance attributable to fixed effects only and Conditional R2 being the proportion of total variance attributable to the model, the fixed and random effects together, remains the same.
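To make the formulas concrete before using the convenience function, here is a rough sketch of the calculation by hand for the random intercept model m1 fit earlier; it mirrors the Nakagawa and Schielzeth definitions above, and the results can be compared with the R2() output below.
## variance of the fixed-effects predictions
var.fixed <- var(as.vector(model.matrix(m1) %*% fixef(m1)))
## random intercept variance and residual variance
var.ranint <- as.numeric(VarCorr(m1)$ID[1, 1])
var.resid <- sigma(m1)^2
## Marginal R2: fixed effects variance over total variance
var.fixed / (var.fixed + var.ranint + var.resid)
## Conditional R2: fixed + random effects variance over total variance
(var.fixed + var.ranint) / (var.fixed + var.ranint + var.resid)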
In R, we can calculate these measures using the R2() function. To start with, let's look at an intercept only model and compare the R2 results to the ICC.
m0b <- lmer(mood ~ 1 + (1 | ID), data = dm)
summary(m0b)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: mood ~ 1 + (1 | ID)
## Data: dm
##
## REML criterion at convergence: 584
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.010 -0.537 0.177 0.604 2.256
##
## Random effects:
## Groups Name Variance Std.Dev.
## ID (Intercept) 0.525 0.724
## Residual 1.085 1.041
## Number of obs: 183, groups: ID, 50
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 4.943 0.132 50.762 37.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R2(m0b)
## MarginalR2 ConditionalR2
## 0.00000 0.32607
iccMixed("mood", "ID", data = dm)
## Var Sigma ICC
## 1: ID 0.52477 0.32607
## 2: Residual 1.08462 0.67393
Here we can see that the Marginal R2 is 0, which makes sense as we have no fixed effects predictors, only the fixed effect intercept, which does not vary (because it is fixed). The Conditional R2 is equal to the ICC because it captures the proportion of total variance attributable to the fixed and random effects, but the fixed effects variance is 0, so it is just the variance of the random intercept over the total variance, which is the equation for the ICC.
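As a quick check on this point (a small sketch), the ICC can be recovered directly from the variance components of m0b:
## ICC by hand from the variance components: random intercept variance
## divided by the total (random intercept + residual) variance
vc <- as.data.frame(VarCorr(m0b))
vc$vcov[vc$grp == "ID"] / sum(vc$vcov)  ## matches the ICC / Conditional R2 above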
Now let’s look at a more complex example. Here we use the model with a linear stress trend predicting mood as a fixed effect with a random intercept.
summary(m1)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
## method [lmerModLmerTest]
## Formula: mood ~ poly(stress, 1) + (1 | ID)
## Data: dm
##
## AIC BIC logLik deviance df.resid
## 571.7 584.5 -281.8 563.7 179
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.394 -0.523 0.190 0.597 2.315
##
## Random effects:
## Groups Name Variance Std.Dev.
## ID (Intercept) 0.424 0.651
## Residual 0.998 0.999
## Number of obs: 183, groups: ID, 50
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 4.936 0.122 50.195 40.49 < 2e-16 ***
## poly(stress, 1) -5.475 1.254 178.649 -4.37 2.1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## ply(strs,1) 0.011
R2(m1)
## MarginalR2 ConditionalR2
## 0.10330 0.37064
Here we can see that the fixed effect of stress explains about 10% of the variance in mood (Marginal R2). The model overall explains about 37% of the variance in mood (Conditional R2).
We can also calculate effect sizes that are equivalent to Cohen’s \(f^2\) effect size. Recall that:
\[ f^2 = \frac{R^2}{1 - R^2} \]
So we can define
\[ f^2_{LMM(m)} = \frac{R^2_{LMM(m)}}{1 - R^2_{LMM(m)}} \]
and
\[ f^2_{LMM(c)} = \frac{R^2_{LMM(c)}}{1 - R^2_{LMM(c)}} \]
which are the Marginal and Conditional F2 effect sizes. We can get all of these and other model performance measures, like the AIC and BIC, using the modelPerformance() function.
modelPerformance(m1)
## $Performance
## Model Estimator N_Obs N_Groups AIC BIC LL LLDF Sigma
## 1: merMod ML 183 ID (50) 571.69 584.53 -281.85 4 0.99893
## MarginalR2 ConditionalR2 MarginalF2 ConditionalF2
## 1: 0.1033 0.37064 0.11521 0.58893
##
## attr(,"class")
## [1] "modelPerformance.merMod" "modelPerformance"
Often we are interested in the effect size for a particular model parameter or a particular variable in our model. For example, we may want to know the unique effect size for stress predicting mood. The overall model effect sizes are not the right answer to this question. We need to calculate how much the model performance changes when just the parameter(s) of interest change. That is, we need to compare two nested models. For this, it is best to use ML, not REML. Here we will fit two models, with m.a nested within m.ab. This way, we can ensure that the only difference between the two models is the addition of a fixed effect slope of stress.
m.ab <- lmer(mood ~ stress + (1 | ID), data = dm, REML = FALSE)
m.a <- lmer(mood ~ 1 + (1 | ID), data = dm, REML = FALSE)
nobs(m.ab)
## [1] 183
nobs(m.a)
## [1] 183
Now we can define a specific effect size:
\[ f^2 = \frac{R^2_{AB} - R^2_{A}}{1 - R^2_{AB}} \]
For simplicity, I am just writing the formula once, but we can calculate a Marginal F2 and Conditional F2 by using Marginal R2 or Conditional R2. The key here is in the top of the fraction: we calculate the unique variance explained by calculating the difference in variance explained between our two models. If the two models differ only by one parameter/predictor as in this case, then the effect size is for one specific predictor. However, we can use this to calculate an effect size for multiple parameters. Later, we will see some cases where we would care about this.
Let’s calculate this now for stress.
R2(m.ab)
## MarginalR2 ConditionalR2
## 0.10330 0.37064
R2(m.a)
## MarginalR2 ConditionalR2
## 0.00000 0.31954
## Marginal F2 for stress
(.10330 - 0) / (1 - .10330)
## [1] 0.1152
## Conditional F2 for stress
(.37064 - .31954) / (1 - .37064)
## [1] 0.081194
Let’s extend this. Suppose we added stress as both a fixed and random slope and wanted to get the overall effect size for stress.
m.abc <- lmer(mood ~ stress + (1 + stress | ID), data = dm, REML = FALSE)
R2(m.abc)
## MarginalR2 ConditionalR2
## 0.09810 0.40187
R2(m.a)
## MarginalR2 ConditionalR2
## 0.00000 0.31954
## Marginal F2 for stress OVERALL (fixed + random slope vs intercept only)
(0.09810 - 0) / (1 - 0.09810)
## [1] 0.10877
## Conditional F2 for stress OVERALL (fixed + random slope vs intercept only)
(0.40187 - .31954) / (1 - 0.40187)
## [1] 0.13765
We could also ask, how much does the effect size change when we added stress as a random slope on top of the model with stress as a fixed effect slope only?
R2(m.abc)
## MarginalR2 ConditionalR2
## 0.09810 0.40187
R2(m.ab)
## MarginalR2 ConditionalR2
## 0.10330 0.37064
## Marginal F2 for random effect of stress
(0.09810 - 0.10330) / (1 - 0.09810)
## [1] -0.0057656
## Conditional F2 for random effect of stress
(0.40187 - 0.37064) / (1 - 0.40187)
## [1] 0.052213
In this case, we get the (seemingly odd) negative value for the Marginal F2 and a small effect size for the Conditional F2.
According to Cohen’s (1988) guidelines, which should be treated as rough rules of thumb only:
* \(f^2 \geq 0.02\) is small
* \(f^2 \geq 0.15\) is medium
* \(f^2 \geq 0.35\) is large
Whether those cutoffs translate well to LMMs and the Marginal and Conditional F2 (which certainly was not the context Cohen imagined them being applied to) is unclear.
We could also use LRTs to test the overall (fixed + random) or random only effect of stress as these are nested models.
## overall effect (fixed + random)
anova(m.a, m.abc, test = "LRT")
## Data: dm
## Models:
## m.a: mood ~ 1 + (1 | ID)
## m.abc: mood ~ stress + (1 + stress | ID)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m.a 3 588 597 -291 582
## m.abc 6 572 592 -280 560 21.5 3 8.4e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## random effect only
anova(m.ab, m.abc, test = "LRT")
## Data: dm
## Models:
## m.ab: mood ~ stress + (1 | ID)
## m.abc: mood ~ stress + (1 + stress | ID)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m.ab 4 572 585 -282 564
## m.abc 6 572 592 -280 560 3.42 2 0.18
In the case of a variable like stress, which has not been decomposed into a purely between-person and a purely within-person variable, it could explain both variability from the random intercept and the residual variance, even as a fixed effect. Further, because the Conditional F2 includes both the fixed and random variance from the model, a fixed effect can influence the Conditional F2. The random effect alone, however, would not be expected to be associated with any improvement in the Marginal F2, so we would expect a Marginal F2 for the random effect alone of about 0, with the random effect only showing up in the Conditional F2. We can see that overall, adding a fixed and random effect of stress is statistically significant (p < .001), and we get small to moderate (by Cohen's guidelines, dubiously applied to LMMs) effect sizes on both the Marginal F2 (.109) and the Conditional F2 (.138). We can also see that the addition of a random stress slope beyond the fixed stress slope is not statistically significant (p = .18) and is associated with only a relatively small Conditional F2 (.052). Note that I am not commenting on the Marginal F2 for the random effect only, as we would not expect it to be anything other than 0 anyway, so it is not interesting to report or interpret.
Another point worth noting is that to get all of this information, we actually had to fit three different models and make various comparisons among them. That is quite a bit of effort. If we had a model with multiple predictors, fitting additional models to test each predictor would be a lot of effort. The modelTest() function does essentially all of this for us. Let's run it and then use APAStyler() to get some nicely formatted output.
APAStyler(modelTest(m.abc))
## Term Est Type
## 1: (Intercept) 6.02*** [ 5.47, 6.56] Fixed Effects
## 2: stress -0.27*** [-0.42, -0.13] Fixed Effects
## 3: cor_stress.(Intercept)|ID -0.77 Random Effects
## 4: sd_(Intercept)|ID 0.79 Random Effects
## 5: sd_stress|ID 0.24 Random Effects
## 6: sigma 0.97 Random Effects
## 7: Model DF 6 Overall Model
## 8: N (Groups) ID (50) Overall Model
## 9: N (Observations) 183 Overall Model
## 10: logLik -280.13 Overall Model
## 11: AIC 572.27 Overall Model
## 12: BIC 591.53 Overall Model
## 13: Marginal R2 0.10 Overall Model
## 14: Marginal F2 0.11 Overall Model
## 15: Conditional R2 0.40 Overall Model
## 16: Conditional F2 0.67 Overall Model
## 17: stress (Fixed + Random) 0.11/0.14, p < .001 Effect Sizes
## 18: stress (Random) -0.01/0.05, p = .181 Effect Sizes
Previously, we saw this output and only focused on the fixed and random effects. Now let's look at some of the other output. We have the model AIC and BIC, which can be helpful for comparing different models. We have the overall model Marginal and Conditional R2 and F2. We also have effect sizes and p-values in the rows labeled "Effect Sizes". Specifically:
* For "stress (Fixed + Random)", the Est column lists the Marginal/Conditional F2, here Marginal = 0.11 and Conditional = 0.14. The p-value comes from a LRT comparing the two models. This p-value, p < .001, is the overall test of whether adding both the fixed and random slope of stress improves the LL of the model.
* For "stress (Random)", the Est column again lists the Marginal/Conditional F2; for a random effect only, the Marginal F2 is not interesting but the Conditional F2 is useful. The p-value again comes from a LRT testing whether the addition of stress as a random slope significantly improves the LL. Although caution is always warranted for p-values on variances of random effects, it can be a helpful guide, and the LRT is about the best possible p-value that can be computed (short of a bootstrapped one).
Thus, using modelTest() we can get by default a lot of the comparisons and effect sizes that we may be interested in, and we can get effect size estimates for specific predictors. A test of stress as a fixed effect only is not included because it does not make sense to test the addition of a fixed effect onto a model that has a random effect for the same variable but no fixed effect.
APAStyler() can take a list of models as well, making even more comparisons possible. For example, we might wonder about the impact of a covariate, say age.
m.abcov <- update(m.abc, . ~ . + age)
APAStyler(list(
Unadjusted = modelTest(m.abc),
Adjusted = modelTest(m.abcov)))
## Term Unadjusted Adjusted
## 1: (Intercept) 6.02*** [ 5.47, 6.56] 6.29*** [ 4.50, 8.07]
## 2: stress -0.27*** [-0.42, -0.13] -0.28*** [-0.42, -0.13]
## 3: age -0.01 [-0.08, 0.06]
## 4: cor_stress.(Intercept)|ID -0.77 -0.77
## 5: sd_(Intercept)|ID 0.79 0.82
## 6: sd_stress|ID 0.24 0.24
## 7: sigma 0.97 0.97
## 8: Model DF 6 7
## 9: N (Groups) ID (50) ID (50)
## 10: N (Observations) 183 183
## 11: logLik -280.13 -280.09
## 12: AIC 572.27 574.18
## 13: BIC 591.53 596.65
## 14: Marginal R2 0.10 0.10
## 15: Marginal F2 0.11 0.11
## 16: Conditional R2 0.40 0.40
## 17: Conditional F2 0.67 0.67
## 18: stress (Fixed + Random) 0.11/0.14, p < .001 0.11/0.14, p < .001
## 19: stress (Random) -0.01/0.05, p = .181 -0.01/0.05, p = .198
## 20: age (Fixed) 0.00/0.00, p = .765
## Type
## 1: Fixed Effects
## 2: Fixed Effects
## 3: Fixed Effects
## 4: Random Effects
## 5: Random Effects
## 6: Random Effects
## 7: Random Effects
## 8: Overall Model
## 9: Overall Model
## 10: Overall Model
## 11: Overall Model
## 12: Overall Model
## 13: Overall Model
## 14: Overall Model
## 15: Overall Model
## 16: Overall Model
## 17: Overall Model
## 18: Effect Sizes
## 19: Effect Sizes
## 20: Effect Sizes
Now we get all the estimates for stress as before, but these are also repeated for the model with age. We can see that age has basically a 0 effect size, either marginal or conditional. We can also see that both the AIC and BIC favor the unadjusted model and that the fixed effect coefficient and the standard deviation for the random stress slope are basically unchanged with or without age in the model.
So far for effect sizes, we have focused primarily on a variable that differs both between and within person (stress). The same strategy and effect sizes can be used for variables that vary only between people or only within person, but the results we expect are a little different.
Also, although modelTest() uses the Wald method by default for confidence intervals, just as for the confint() function, the method can be specified to pick what you want, such as profile likelihood.
Also note that in this case I am not comparing models so I will use the default REML estimation rather than pure ML.
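The decomposition used in the next code chunk relies on the meanDeviations() helper to split stress into a person mean (between) and deviations from that mean (within); as a sketch of the same idea, the two variables could also be computed by hand with data.table (the column names Bstress2 and Wstress2 are hypothetical, chosen to avoid overwriting the ones used below).
## by-hand between/within decomposition of stress:
## person means (between) and deviations from each person's own mean (within)
dm[, Bstress2 := mean(stress, na.rm = TRUE), by = ID]
dm[, Wstress2 := stress - Bstress2]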
dm[, c("Bstress", "Wstress") := meanDeviations(stress), by = ID]
m.between <- lmer(mood ~ Bstress + (1 | ID), data = dm)
APAStyler(modelTest(m.between, method = "profile"),
pcontrol = list(digits = 3, stars = FALSE, includeP = TRUE,
includeSign = TRUE, dropLeadingZero = TRUE))
## Computing profile confidence intervals ...
## Parameters and CIs are based on REML,
## but modelTests requires ML not REML fit for comparisons,
## and these are used in effect sizes. Refitting.
## Term Est Type
## 1: (Intercept) 6.02p < .001 [ 5.11, 6.91] Fixed Effects
## 2: Bstress -0.28p = .018 [-0.51, -0.05] Fixed Effects
## 3: sd_(Intercept)|ID 0.66 [0.43, 0.88] Random Effects
## 4: sigma 1.04 [0.93, 1.18] Random Effects
## 5: Model DF 4 Overall Model
## 6: N (Groups) ID (50) Overall Model
## 7: N (Observations) 183 Overall Model
## 8: logLik -288.00 Overall Model
## 9: AIC 584.01 Overall Model
## 10: BIC 596.85 Overall Model
## 11: Marginal R2 0.06 Overall Model
## 12: Marginal F2 0.06 Overall Model
## 13: Conditional R2 0.31 Overall Model
## 14: Conditional F2 0.46 Overall Model
## 15: Bstress (Fixed) 0.06/-0.01, p = .017 Effect Sizes
In the output we can see that the parameters and CIs are based on REML. However, the effect sizes are based on ML, because those rely on model comparisons. First of all, we can see in the effect sizes for Bstress that the p-value differs very slightly from the p-value for the fixed effect. This may be due to minor differences between REML and ML estimation, or to one test being a t-test with approximate degrees of freedom whereas the other is a LRT.
The Marginal F2 for Bstress indicates that it has a small effect size. The Conditional F2 is basically zero and slightly negative. This is not surprising, as the addition of Bstress is not likely to improve the fit of the conditional model over the random intercept model. Individual differences in mean levels of mood are already captured by the random intercept, so for the conditional model it will be very difficult for Bstress to actually improve the overall variance explained. Instead, Bstress is able to take what was previously variance due to the random intercept and explain it, so it becomes variance attributable to the fixed effects portion of the model. In terms of the equations:
\[ R^{2}_{LMM(m)} = \frac{\sigma^{2}_{fixed}}{\sigma^{2}_{fixed} + \sigma^{2}_{u_{0}} + \sigma^{2}_{\varepsilon}} \]
and
\[ R^{2}_{LMM(c)} = \frac{\sigma^{2}_{fixed} + \sigma^{2}_{u_{0}}}{\sigma^{2}_{fixed} + \sigma^{2}_{u_{0}} + \sigma^{2}_{\varepsilon}} \]
any gains in \(\sigma^{2}_{fixed}\) will basically be matched by a decrease in \(\sigma^{2}_{u_{0}}\); thus, while the marginal variance explained increases, the conditional variance explained is more or less constant, so there is no change (0 effect size).
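A quick way to see this shift (a small sketch, refitting with ML so the variance components are comparable) is to compare the random intercept with and without Bstress in the model:
## the random intercept standard deviation (and hence variance) shrinks once the
## between-person predictor Bstress is added, offsetting the gain in
## fixed-effects variance
VarCorr(lmer(mood ~ 1 + (1 | ID), data = dm, REML = FALSE))
VarCorr(lmer(mood ~ Bstress + (1 | ID), data = dm, REML = FALSE))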
In summary, we can say that there is a significant, negative association between average stress and average mood (b [95% CI] = -0.28 [-0.51, -0.05], p = .018), with a small effect size for the fixed effects, \(f^2\) = .06. The pattern of results is that higher average stress was associated with lower average mood.
Because we do not expect a meaningful impact on the Conditional F2, I did not even mention it in the interpretation for a between only fixed effect predictor.
Here we have an example of a variable that varies only within, not between, individuals. This is interpreted basically the same way as a variable that varies both between and within (like overall stress).
m.within <- lmer(mood ~ Wstress + (1 | ID), data = dm)
summary(lmer(mood ~ 1 + (1 | ID), data = dm))
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: mood ~ 1 + (1 | ID)
## Data: dm
##
## REML criterion at convergence: 584
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.010 -0.537 0.177 0.604 2.256
##
## Random effects:
## Groups Name Variance Std.Dev.
## ID (Intercept) 0.525 0.724
## Residual 1.085 1.041
## Number of obs: 183, groups: ID, 50
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 4.943 0.132 50.762 37.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
APAStyler(modelTest(m.within, method = "profile"),
pcontrol = list(digits = 3, stars = FALSE, includeP = TRUE,
includeSign = TRUE, dropLeadingZero = TRUE))
## Computing profile confidence intervals ...
## Parameters and CIs are based on REML,
## but modelTests requires ML not REML fit for comparisons,
## and these are used in effect sizes. Refitting.
## Term Est Type
## 1: (Intercept) 4.94p < .001 [ 4.68, 5.20] Fixed Effects
## 2: Wstress -0.28p < .001 [-0.44, -0.13] Fixed Effects
## 3: sd_(Intercept)|ID 0.74 [0.53, 0.97] Random Effects
## 4: sigma 1.00 [0.89, 1.13] Random Effects
## 5: Model DF 4 Overall Model
## 6: N (Groups) ID (50) Overall Model
## 7: N (Observations) 183 Overall Model
## 8: logLik -284.68 Overall Model
## 9: AIC 577.36 Overall Model
## 10: BIC 590.20 Overall Model
## 11: Marginal R2 0.04 Overall Model
## 12: Marginal F2 0.05 Overall Model
## 13: Conditional R2 0.38 Overall Model
## 14: Conditional F2 0.60 Overall Model
## 15: Wstress (Fixed) 0.05/0.09, p < .001 Effect Sizes
In the output we can see both the Marginal and Conditional F2 are non zero. Wstress varies within individuals so even though it is being compared against a random intercept model, it can explain variance in the residuals that the random intercept alone cannot explain. In terms of the equations:
\[ R^{2}_{LMM(m)} = \frac{\sigma^{2}_{fixed}}{\sigma^{2}_{fixed} + \sigma^{2}_{u_{0}} + \sigma^{2}_{\varepsilon}} \]
and
\[ R^{2}_{LMM(c)} = \frac{\sigma^{2}_{fixed} + \sigma^{2}_{u_{0}}}{\sigma^{2}_{fixed} + \sigma^{2}_{u_{0}} + \sigma^{2}_{\varepsilon}} \]
any gains in \(\sigma^{2}_{fixed}\) will come from a decrease in \(\sigma^{2}_{\varepsilon}\). At the same time, \(\sigma^{2}_{u_{0}}\), the variance of the random intercept, will basically not change, as Wstress does not vary between people at all. If the residual variance decreases, both the Marginal and Conditional F2 will increase. Thus, we reasonably expect Wstress to increase both the Marginal and Conditional F2. The Conditional F2 is larger in part because of the formula for Cohen's \(f^2\), which includes division by the unexplained variance, which is much smaller for the conditional than for the marginal model (we can see that the Marginal R2 for the whole model is 4% whereas the Conditional R2 is 38%).
Note that for variables that vary both between and within person, they may shift some variance from the random effect variance to the fixed effect variance and from the residual variance to the fixed effect variance. Of course if a variable is included as a random slope as well, then it would also shift variance from the residuals to the random effects, if the random slope does differ between people quite a bit.
In summary, we can say that there is a significant, negative association between within-person stress and same day mood (b [95% CI] = -0.28 [-0.44, -0.13], p < .001), with a small effect size for the fixed effects, \(f^2\) = .05, and a small effect size for the overall conditional model, \(f^2\) = .09. The pattern of results is that higher within-person stress was associated with lower same day mood.
Note that it is a coincidence that the fixed effect slopes for between and within stress happened to be the same to two decimal places.
Key points to take away conceptually are:
Function | What it does |
---|---|
lmer() | estimate a LMM |
R2() | calculates Marginal and Conditional R2 for LMMs |
AIC() | calculates AIC for LMMs |
BIC() | calculates BIC for LMMs |
anova() | can be used to compare two nested LMMs and calculate a LRT |
modelTest() | along with APAStyler(), get a nicely formatted summary of a model's results, including many automatic effect sizes |
poly() | fit a polynomial of specified degree for a predictor in a LMM |
If you need a refresher on polynomials, read on; if you know the basics, that's good enough and you don't need this section. Wikipedia has more information: https://en.wikipedia.org/wiki/Polynomial
A polynomial function is a mathematical function. For our purposes we focus on only a single variable, \(x\). A polynomial in \(x\) involves just a few possible operations: adding or subtracting terms, multiplying \(x\) by constant coefficients, and raising \(x\) to non-negative integer powers.
The degree of a polynomial is based on the largest exponent in the equation. The number of terms does not matter for the degree. For example:
\[ f(x) = 3 + x \]
has degree = 1, a linear polynomial, because \(x^1 = x\).
\[ f(x) = 3 + x^2 \]
has degree = 2, a quadratic polynomial.
\[ f(x) = 3 - 2 * x + 5 * x^2 \]
has degree = 2, also a quadratic polynomial.
\[ f(x) = 3 - 2 * x + 5 * x^2 - x^3 \]
has degree = 3, a cubic polynomial.
We have special names for polynomials of different degrees: degree 0 is a constant, degree 1 is linear, degree 2 is quadratic, degree 3 is cubic, and degree 4 is quartic.
There are more (quintic, etc.) but in practice these basically never come up in applied statistical methods in research.
Here are a few polynomial equations and graphs which may be more intuitive.
ggarrange(
ggplot(data.frame(x = 0), aes(x)) +
stat_function(
fun = function(x) 1,
size = 2) +
xlim(-5, 5) +
theme_pubr() +
ylab("f(x)") +
ggtitle("f(x) = 1 -- Degree 0 Polynomial Example"),
ggplot(data.frame(x = 0), aes(x)) +
stat_function(
fun = function(x) 4,
size = 2) +
xlim(-5, 5) +
theme_pubr() +
ylab("f(x)") +
ggtitle("f(x) = 4 -- Degree 0 Polynomial Example"),
ncol = 2)
ggarrange(
ggplot(data.frame(x = 0), aes(x)) +
stat_function(
fun = function(x) 1 + 3 * x,
size = 2) +
xlim(-5, 5) +
theme_pubr() +
ylab("f(x)") +
ggtitle("f(x) = 1 + 3 * x -- Degree 1 Polynomial Example"),
ggplot(data.frame(x = 0), aes(x)) +
stat_function(
fun = function(x) 1 - 3 * x,
size = 2) +
xlim(-5, 5) +
theme_pubr() +
ylab("f(x)") +
ggtitle("f(x) = 1 - 3 * x -- Degree 1 Polynomial Example"),
ncol = 2)
ggarrange(
ggplot(data.frame(x = 0), aes(x)) +
stat_function(
fun = function(x) 1 + 3 * x + 1 * x^2,
size = 2) +
xlim(-5, 5) +
theme_pubr() +
ylab("f(x)") +
ggtitle("f(x) = 1 + 3 * x + 1 * x^2 -- Degree 2 Polynomial Example"),
ggplot(data.frame(x = 0), aes(x)) +
stat_function(
fun = function(x) 1 + 3 * x - 1 * x^2,
size = 2) +
xlim(-5, 5) +
theme_pubr() +
ylab("f(x)") +
ggtitle("f(x) = 1 + 3 * x - 1 * x^2 -- Degree 2 Polynomial Example"),
ncol = 2)
ggarrange(
ggplot(data.frame(x = 0), aes(x)) +
stat_function(
fun = function(x) 1 + 3 * x + 1 * x^2 + x^3,
size = 2) +
xlim(-5, 5) +
theme_pubr() +
ylab("f(x)") +
ggtitle("f(x) = 1 + 3 * x + 1 * x^2 + x^3 -- Degree 3 Polynomial Example"),
ggplot(data.frame(x = 0), aes(x)) +
stat_function(
fun = function(x) 1 + 3 * x + 1 * x^2 - x^3,
size = 2) +
xlim(-5, 5) +
theme_pubr() +
ylab("f(x)") +
ggtitle("f(x) = 1 + 3 * x + 1 * x^2 - x^3 -- Degree 3 Polynomial Example"),
ncol = 2)
Polynomials are a great, simple way to capture non-linear associations between a variable and an outcome. They are commonly used in regression by including squared or cubed, etc., versions of a predictor, or by using the poly() function in R.
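As a small illustration tying this back to the models above (a sketch using the same data), the quadratic stress model fit earlier with poly() could equivalently be written with explicit squared terms; the overall fit is identical, only the parameterization of the coefficients differs, because poly() uses orthogonal polynomials by default.
## two equivalent ways to include a quadratic effect of stress;
## the log likelihoods (and fitted values) are the same
quad.orth <- lmer(mood ~ poly(stress, 2) + (1 | ID), data = dm, REML = FALSE)
quad.raw <- lmer(mood ~ stress + I(stress^2) + (1 | ID), data = dm, REML = FALSE)
logLik(quad.orth)
logLik(quad.raw)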