GRADE your meta-analysis

An introduction to meta-analysis, following the GRADE assessment domains for evaluation
Author

Lucija Batinović

Published

September 27, 2025

Why do a Meta-Analysis?

  • Precision: Many studies are too small to be convincing alone. Pooling many studies increases statistical power to detect effects.
  • Answer broader questions: Combining diverse studies tests consistency of the effect across populations and interventions.
  • Resolve conflicting evidence: When individual studies report effects in opposite directions, pooling them and examining heterogeneity helps clarify the overall direction and size of the effect.

GRADE Framework

To make conclusions about a phenomenon after conducting a systematic review, we need to provide estimates of certainty in the evidence. To do this, we follow the GRADE (Grading of Recommendations Assessment, Development and Evaluation) guidelines.

GRADE is

“a well-developed formal process to rate the quality of scientific evidence in systematic reviews and to develop recommendations in guidelines that are as evidence-based as possible”

Use GRADE to evaluate evidence based on:

  1. Imprecision
  • Are the confidence intervals wide or do they overlap the null effect?
  2. Inconsistency
  • Examine heterogeneity (e.g., τ²)
  • Are the results consistent across populations and settings?
  3. Publication Bias
  • Create funnel plots or fit a p/z-curve
  • Is there an indication of selective reporting or of publishing barely significant results?
  4. Risk of Bias
  • Assess the quality of each individual study.
  • Do high-risk studies bias the effect estimate?
  5. Indirectness
  • Are the populations, interventions, and comparators in these studies directly applicable to your research question?

This document addresses imprecision, inconsistency, publication bias, and risk of bias, and the accompanying code demonstrates how to conduct the synthesis in an open and reproducible way. Study-level risk of bias assessment and indirectness are not covered in depth.

1. Imprecision

Choosing and Estimating Effect Size

Effect size: “A quantitative reflection of the magnitude of some phenomenon that is used for the purpose of addressing a question of interest” (Kelley & Preacher, 2012)

There are many effect size types, and they usually depend on the type of data collected, along with the chosen analysis. For example, if we’re interested in height difference between men and women, we can calculate a mean difference (effect size) of their height in cm (continuous data).

Types of data

  • Dichotomous: binary responses with only two outcomes (yes/no, e.g., has a person stopped smoking after an intervention)
  • Continuous: data measured on a continuous scale with any possible value within a defined range (e.g., weight after a weight loss intervention)
  • Ordinal: categorical data measured on a scale with predefined sets of possible values, set in a particular order (e.g., answers on a Likert scale)
  • Count/rate: data representing the number of times an event occurs (count), or how many times it occurs within a certain period (rate) (e.g., number of correctly read words per trial)
  • Time-to-event: time it takes until an event occurs

Effect size types

  • Dichotomous: Odds ratio (OR), Risk ratio (RR), Risk difference (RD)
  • Continuous: Mean difference (MD), Standardized mean difference (SMD)
  • Other: Correlation coefficients, non-overlap measures

Example 1: Dichotomous Data

Let's compare the number of hard-of-hearing participants diagnosed with depression between the group that received hearing aids and the group that did not.

Group             Depression   No Depression   N
Hearing aids      12           88              100
No hearing aids   34           66              100

To get an effect size that represents the strength of the intervention's effect, we can calculate an odds ratio, i.e., the ratio of the odds of depression in the hearing aids group to the odds of depression in the no hearing aids group. The code chunk below shows one of many ways to calculate an odds ratio in R. We then print a table with the results.

R code to compute odds ratio

Code
options(scipen = 999)
table <- matrix(c(12, 88, 34, 66), nrow = 2, byrow = TRUE)
colnames(table) <- c("Depressed", "Not_Depressed")
rownames(table) <- c("Hearing_aids", "No_Hearing_aids")
kableExtra::kable(table)
Depressed Not_Depressed
Hearing_aids 12 88
No_Hearing_aids 34 66
Code
# install.packages("epitools")
library(epitools)
result <- oddsratio(table)
kableExtra::kable(result$measure) #prints a nicely formatted table
estimate lower upper
Hearing_aids 1.0000000 NA NA
No_Hearing_aids 0.2683851 0.1240792 0.5468509

We see in the table that the odds of being diagnosed with depression are 73% lower when the individual received hearing aids. The 95% confidence interval ranges from 88% lower to 45% lower, so the odds are significantly lower for the hearing aids group. Note that an odds ratio of 1 signifies no difference between the groups.
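For intuition, the odds ratio can also be computed by hand from the cell counts. The sketch below computes the simple sample odds ratio; the value differs slightly from the epitools output above, which uses a different (median-unbiased) estimator by default.

Code
# Hand-computed sample odds ratio from the 2x2 counts (for intuition only)
odds_aids    <- 12 / 88   # odds of depression with hearing aids
odds_no_aids <- 34 / 66   # odds of depression without hearing aids
odds_aids / odds_no_aids  # ~0.26, i.e., odds of depression roughly 74% lower with hearing aids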

Example 2: Continuous Data

Here is another example of an effect size, this time for continuous data. We can compare the mean outcome scores (i.e., reading comprehension scores) between children with intellectual disability (ID) who received a working memory (WM) training intervention and children with ID who did not receive any intervention.

Group             Mean   SD    N
WM training       5.2    1.1   50
No intervention   4.5    1.3   45

To get an effect size that represents the strength of the intervention's effectiveness, we can calculate the mean difference (MD) between the groups. However, simply subtracting one mean from the other only makes sense when the studies we want to meta-analyse used the exact same measurement scales or tools; only then are the measurement units fully comparable and the mean difference directly interpretable. In reality, this is unlikely: studies use different scales and adapt their measures to the design. To make the estimates standardized and comparable across studies, the standardized mean difference (SMD) is calculated instead, by dividing the mean difference by the pooled standard deviation (or one of its variants). There are different SMDs; the most common is Cohen's d, but Hedges' g provides a less biased estimate, particularly in studies with small sample sizes.

The code chunk below shows one of many ways to calculate SMD (Hedges’ g) in R.

R code to compute MD and SMD

Code
# Example data
group1 <- c(mean = 5.2, sd = 1.1, n = 50) #WM group
group2 <- c(mean = 4.5, sd = 1.3, n = 45) #control group

# install.packages("esc")
library(esc)
ef <- esc_mean_sd(grp1m = group1["mean"], grp1sd = group1["sd"], grp1n = group1["n"],
            grp2m = group2["mean"], grp2sd = group2["sd"], grp2n = group2["n"], es.type = "g")
ef # display the computed effect size (Hedges' g)

We see in the output that the effect of WM training on reading comprehension scores is g = 0.58. The standardized mean difference (Hedges' g) adjusts for the different standard deviations and gives an effect size on a common scale, expressed in standard deviation units. In this case, it means that the group that received the intervention scored 0.58 standard deviations higher than the control group.
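For transparency, the sketch below reproduces the calculation by hand using the usual Hedges' g formulas (pooled standard deviation plus a small-sample correction). esc_mean_sd applies equivalent formulas internally, so small rounding differences are possible.

Code
# Hand-computed standardized mean difference for the example above
m1 <- 5.2; sd1 <- 1.1; n1 <- 50   # WM training group
m2 <- 4.5; sd2 <- 1.3; n2 <- 45   # control group

sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
d <- (m1 - m2) / sd_pooled          # Cohen's d, ~0.58
J <- 1 - 3 / (4 * (n1 + n2) - 9)    # small-sample correction factor
g <- J * d                          # Hedges' g, ~0.58
g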

There are many more ways to calculate a common effect size for all studies, and this book shows how to convert different types of test statistics and data into common effect sizes, along with a detailed explanation of each effect size.
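As an illustration of such conversions, the esc package can also derive an SMD from reported test statistics. The sketch below assumes a hypothetical study that reports only an independent-samples t-value and the group sizes.

Code
library(esc)

# Hypothetical study reporting t = 2.30 with n1 = 50 and n2 = 45
esc_t(t = 2.30, grp1n = 50, grp2n = 45, es.type = "g")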

Meta-Analysis Procedure

Now that we know how to calculate effect sizes for all included studies, we can run the meta-analysis. Conducting a meta-analysis involves the following steps:

  1. Compute the effect size for each study.
  2. Combine the effect sizes using weights. Weighting is usually needed because studies differ in statistical power and quality. The most common approach is inverse-variance weighting: each estimate is weighted by 1/v (or 1/(v + τ²) for random-effects models) and a weighted mean is computed.

The weighted mean effect size $\hat{\mu}$ is calculated as:

$$\hat{\mu} = \frac{\sum_{i=1}^{k} w_i \, ES_i}{\sum_{i=1}^{k} w_i}$$

where:
- $ES_i$ = effect size of study $i$
- $w_i = 1/v_i$ = weight of study $i$, with $v_i$ being its sampling variance (or $w_i = 1/(v_i + \tau^2)$ for a random-effects meta-analysis)
- $k$ = number of studies

Because the inverse variance is roughly proportional to the sample size, this gives more weight to studies with larger samples, pulling the estimated population effect closer to the effects found in larger studies. There are other ways to assign weights as well, e.g., by giving more weight to studies with a low risk of bias.
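As a minimal sketch of the formula above, with made-up effect sizes and sampling variances:

Code
# Inverse-variance weighted mean (equal-effect weights); made-up values
es <- c(0.25, 0.40, 0.10)        # effect sizes
v  <- c(0.010, 0.023, 0.040)     # sampling variances
w  <- 1 / v                      # use 1 / (v + tau^2) for random-effects weights
sum(w * es) / sum(w)             # weighted mean effect size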

Example Dataset

We will use an example dataset to go through all domains and evaluate the strength of the evidence at the end. This mock dataset contains results from 24 studies comparing the effectiveness of a neurofeedback procedure (NFP) to control conditions (placebo or waitlist) for reducing ADHD symptoms in young people. The dataset contains the study label needed to differentiate effects from different studies, the effect size, the standard error, the p-value (needed for the z-curve analysis in the publication bias section), and information about the population, intervention, and comparison. We also have the risk of bias rating for each study and some notes.

To conduct the meta-analysis, we need the columns with the study label, effect size and standard errors.

Code
library(tibble)
dataset <- tribble(
  ~Study, ~Effect_Size, ~SE, ~P_value, ~Population, ~Intervention, ~Control, ~Risk_of_Bias, ~Notes, ~Sample_Size,
  "Smith et al.", 0.25, 0.10, 0.012, "Children", "NFP", "Placebo", "Low", "—", 80,
  "Lee et al.", 0.40, 0.15, 0.008, "Children", "NFP", "Waitlist", "High", "Small sample", 45,
  "Wang et al.", 0.10, 0.20, 0.003, "Adolescents", "NFP", "Placebo", "Moderate", "Different setting", 60,
  "Patel et al.", 0.35, 0.12, 0.005, "Children", "NFP", "Waitlist", "Low", "—", 95,
  "Garcia et al.", 0.05, 0.18, 0.780, "Adolescents", "NFP", "Placebo", "High", "High attrition", 50,
  "Kim et al.", 0.30, 0.11, 0.015, "Adolescents", "NFP", "Placebo", "Moderate", "—", 85,
  "Nguyen et al.", 0.45, 0.14, 0.002, "Children", "NFP", "Waitlist", "Low", "—", 110,
  "Jones et al.", 0.20, 0.17, 0.240, "Children", "NFP", "Placebo", "High", "Different dosage", 70,
  "Olsen et al.", 0.15, 0.19, 0.410, "Children", "NFP", "Placebo", "Moderate", "—", 65,
  "Choi et al.", 0.50, 0.13, 0.001, "Children", "NFP", "Placebo", "Low", "—", 120,
  "Nilsson et al.", 0.52, 0.14, 0.001, "Adolescents", "NFP", "Placebo", "Low", "—", 100,
  "Olson et al.", 0.15, 0.19, 0.410, "Adolescents", "NFP", "Placebo", "Moderate", "—", 75,
  "Cho et al.", 0.50, 0.13, 0.01, "Children", "NFP", "Placebo", "Low", "—", 130,
  "Nilssone et al.", 0.52, 0.14, 0.005, "Adolescents", "NFP", "Placebo", "Moderate", "—", 90,
  "Martinez et al.", 0.33, 0.16, 0.020, "Children", "NFP", "Waitlist", "Moderate", "Urban setting", 82,
  "Brown et al.", 0.27, 0.12, 0.030, "Adolescents", "NFP", "Placebo", "Low", "—", 105,
  "Kumar et al.", 0.18, 0.17, 0.220, "Children", "NFP", "Placebo", "High", "Short follow-up", 48,
  "Andersson et al.", 0.48, 0.11, 0.004, "Adolescents", "NFP", "Waitlist", "Low", "Large sample", 150,
  "Lopez et al.", 0.21, 0.15, 0.080, "Children", "NFP", "Placebo", "Moderate", "—", 88,
  "Stevens et al.", 0.42, 0.13, 0.006, "Children", "NFP", "Waitlist", "Low", "—", 140,
  "Huang et al.", 0.36, 0.14, 0.010, "Adolescents", "NFP", "Placebo", "Moderate", "—", 115,
  "Baker et al.", 0.09, 0.18, 0.520, "Children", "NFP", "Placebo", "High", "Underpowered", 40,
  "Sato et al.", 0.55, 0.12, 0.0008, "Adolescents", "NFP", "Waitlist", "Low", "—", 160,
  "Dubois et al.", 0.29, 0.15, 0.040, "Children", "NFP", "Placebo", "Moderate", "—", 92
)

kableExtra::kable(dataset)
Study Effect_Size SE P_value Population Intervention Control Risk_of_Bias Notes Sample_Size
Smith et al. 0.25 0.10 0.0120 Children NFP Placebo Low 80
Lee et al. 0.40 0.15 0.0080 Children NFP Waitlist High Small sample 45
Wang et al. 0.10 0.20 0.0030 Adolescents NFP Placebo Moderate Different setting 60
Patel et al. 0.35 0.12 0.0050 Children NFP Waitlist Low 95
Garcia et al. 0.05 0.18 0.7800 Adolescents NFP Placebo High High attrition 50
Kim et al. 0.30 0.11 0.0150 Adolescents NFP Placebo Moderate 85
Nguyen et al. 0.45 0.14 0.0020 Children NFP Waitlist Low 110
Jones et al. 0.20 0.17 0.2400 Children NFP Placebo High Different dosage 70
Olsen et al. 0.15 0.19 0.4100 Children NFP Placebo Moderate 65
Choi et al. 0.50 0.13 0.0010 Children NFP Placebo Low 120
Nilsson et al. 0.52 0.14 0.0010 Adolescents NFP Placebo Low 100
Olson et al. 0.15 0.19 0.4100 Adolescents NFP Placebo Moderate 75
Cho et al. 0.50 0.13 0.0100 Children NFP Placebo Low 130
Nilssone et al. 0.52 0.14 0.0050 Adolescents NFP Placebo Moderate 90
Martinez et al. 0.33 0.16 0.0200 Children NFP Waitlist Moderate Urban setting 82
Brown et al. 0.27 0.12 0.0300 Adolescents NFP Placebo Low 105
Kumar et al. 0.18 0.17 0.2200 Children NFP Placebo High Short follow-up 48
Andersson et al. 0.48 0.11 0.0040 Adolescents NFP Waitlist Low Large sample 150
Lopez et al. 0.21 0.15 0.0800 Children NFP Placebo Moderate 88
Stevens et al. 0.42 0.13 0.0060 Children NFP Waitlist Low 140
Huang et al. 0.36 0.14 0.0100 Adolescents NFP Placebo Moderate 115
Baker et al. 0.09 0.18 0.5200 Children NFP Placebo High Underpowered 40
Sato et al. 0.55 0.12 0.0008 Adolescents NFP Waitlist Low 160
Dubois et al. 0.29 0.15 0.0400 Children NFP Placebo Moderate 92

To meta-analyse the effects, we use a form of regression model. We treat each effect size as a data point with known sampling variance and use these data points to fit a regression model (i.e., to estimate the population effect), with additional nesting within studies. This clustering makes meta-analysis a multilevel regression, where the first level is the study effect size and the second level is the (average) population effect estimated from the studies. There can be more levels, e.g., if you have access to individual participant data or if there are multiple measures per study.
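The example below has one effect size per study, so a standard random-effects model is enough. If studies contributed several effect sizes, one way to handle the nesting is a multilevel model with metafor's rma.mv(); the sketch below uses hypothetical data and is not part of the worked example.

Code
library(metafor)

# Hypothetical long-format data: each study contributes two effect sizes
ml_dat <- data.frame(
  study = rep(c("A", "B", "C"), each = 2),
  es_id = 1:6,
  yi    = c(0.30, 0.25, 0.45, 0.40, 0.10, 0.15),  # effect sizes
  vi    = c(0.02, 0.02, 0.03, 0.03, 0.04, 0.04)   # sampling variances
)

# Three-level model: effect sizes (level 1) nested within studies (level 2)
m_ml <- rma.mv(yi, vi, random = ~ 1 | study/es_id, data = ml_dat)
summary(m_ml)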

In these models, we also want to estimate the overall heterogeneity (variance) of the population effect (i.e., find out how much the effects vary among studies due to differences in design or in the populations sampled). There are multiple ways to estimate this variance; the default in the metafor package in R is the restricted maximum likelihood (REML) estimator, as it has been found to be among the least biased (Langan et al., 2018).

For the models, we can assume either:

  • Equal-effect: Assumes one true effect shared by all studies. In the literature this is sometimes referred to as “fixed effects”, which is a misnomer (cf. Viechtbauer). Equal-effect models provide a precise estimate of that single true population effect; or
  • Random-effects: Allows the true effect to vary across studies, which are treated as random samples from a distribution of true effects. Random-effects meta-analysis models provide us with an average true effect.

Take a look at this figure created by Matthew B. Jane:

Comparison of fixed and random effect models

R code for meta-analysis

Listing 1: meta-analysis
Code
# install.packages("metafor") if needed
library(tidyverse)
library(metafor)

# your data
studies <- dataset %>% transmute(
  study = Study,
  yi = Effect_Size,  # effect sizes
  sei = SE   # standard errors
)

# run a random-effects meta-analysis
m <- rma(yi = yi, # add all effect sizes
         sei = sei, # add their standard errors
         data = studies, method = "REML")  # or method = "DL" for DerSimonian-Laird

pi <- predict(m, digits=3)
# print results
summary(m)

Random-Effects Model (k = 24; tau^2 estimator: REML)

  logLik  deviance       AIC       BIC      AICc   
 11.4775  -22.9550  -18.9550  -16.6840  -18.3550   

tau^2 (estimated amount of total heterogeneity): 0.0000 (SE = 0.0054)
tau (square root of estimated tau^2 value):      0.0024
I^2 (total heterogeneity / total variability):   0.03%
H^2 (total variability / sampling variability):  1.00

Test for Heterogeneity:
Q(df = 23) = 23.8126, p-val = 0.4142

Model Results:

estimate      se     zval    pval   ci.lb   ci.ub      
  0.3473  0.0284  12.2224  <.0001  0.2916  0.4030  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the summary we see the model estimate, the variance (τ²), and other information about the meta-analytic model. The average population effect of NFP compared to control conditions is g = 0.35, 95% CI [0.29, 0.40], with a variance of τ² ≈ 0.000006 (essentially zero). The effect is of moderate size. The low variance and narrow confidence interval can be interpreted as a more precise estimate of the population effect, according to the GRADE guidelines for imprecision.

Although less common, it is good to report prediction intervals alongside confidence intervals. A 95% prediction interval estimates the range within which the true effect of a new, similar study is expected to fall in 95% of cases. When there is heterogeneity between studies, the prediction interval is wider than the confidence interval and is therefore often more informative.

In this meta-analysis, the 95% prediction interval is PI [0.29, 0.40], which is practically the same as the confidence interval because there is essentially no heterogeneity in this dataset. This means that we can expect the true effect of a new, similar study to fall between g = 0.29 and g = 0.40.
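The interval itself comes from the predict() call in Listing 1 (stored in pi). As a rough sketch, its bounds can be reconstructed from the pooled estimate, its standard error, and τ², ignoring any small-sample adjustments metafor may apply.

Code
pi  # prediction interval object from predict(m, digits = 3) in Listing 1

# Approximate 95% prediction interval bounds, reconstructed by hand
est <- as.numeric(m$b)
c(lower = est - 1.96 * sqrt(m$se^2 + m$tau2),
  upper = est + 1.96 * sqrt(m$se^2 + m$tau2))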

Always report uncertainty (e.g., confidence intervals or standard errors) along with the estimates.

We can visualize the dataset with a forest plot, which shows the effect sizes of the individual studies as point estimates and their confidence intervals as error bars. Together with study labels, this gives a better overview of the effect sizes, their imprecision (the confidence interval bars), and the overall effect estimate from the meta-analysis. metafor produces a default forest plot with forest(m); the listing below shows a simple ‘ggplot’ version that lets you modify the appearance in various ways.

Listing 2: forest plot
Code
# recreate the plot in ggplot for nicer appearance
ma_d <- data.frame(
  yi = m[["b"]],       # effect size taken from the ma model
  sei = m[["se"]],      # standard error taken from the ma model
  study = "RE Model",     # naming the model in the study label
  lower_ci = m[["ci.lb"]], #95% lower Confidence Interval
  upper_ci = m[["ci.ub"]] #95% upper CI
)

studies_gg <- bind_rows(studies, ma_d) 

studies_gg <- studies_gg %>% 
  mutate(lower_ci = yi - 1.96 * sei, #calculate the lower CI
  upper_ci = yi + 1.96 * sei #calculate the upper CI
  )


### ggplot forestplot
studies_gg$desired_order <- seq(dim(studies_gg)[1],1)
gg_forest <- ggplot(studies_gg, aes(y = reorder(study, desired_order))) +
    geom_point(aes(x=yi), size = 3, shape=15, color = "darkred") +  
    geom_errorbarh(aes(xmin = lower_ci, xmax = upper_ci), height = 0.2, color = "grey") +  
    labs(
      x = "Effect Size (Hedges' g)",
      y = "Study",
      caption = "Error bars represent 95% confidence intervals"
    ) + theme_minimal() + coord_cartesian(xlim = c(min(studies_gg$lower_ci), max(studies_gg$upper_ci)), ylim = c(0,dim(studies_gg)[1] + 1))

gg_forest # display the plot

2. Inconsistency

Along with imprecision, according to the GRADE framework we have to evaluate how inconsistent the effects are between studies, to assess the robustness and generalizability of the effect. To do this, we estimate the variance (or, better yet, the standard deviation) of the estimated population effect, along with other information about the heterogeneity.

Heterogeneity

Here are the explanations for the most commonly used estimates of heterogeneity, ordered by their informativeness:

  • τ² (Tau²): Between-study variance. It tells us how large the heterogeneity of true effects is.
  • I²: Percentage of total variation due to heterogeneity rather than sampling error. This tells us whether the variation comes mostly from random sampling error or from systematic differences between effects (e.g., due to actual differences in populations, measures, or exposure characteristics).
  • Q (χ²) test: Test of heterogeneity. This test only tells us whether there is statistically significant heterogeneity, but nothing about its size.

Interpretation of I²:

I² (%) Interpretation
0–40 Low
30–60 Moderate
50–90 Substantial
75–100 Considerable

According to this, in the analysis conducted in the Imprecision section we found low heterogeneity: the variance is τ² = 0.0000059. From the variance we can also obtain the standard deviation of the distribution of effects, τ = 0.0024, which is easier to interpret than the variance. I² is also low (0.03%), i.e., the heterogeneity due to differences between populations or study designs is low, which suggests the effect is robust across populations.
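The τ² and I² point estimates can be supplemented with confidence intervals; in metafor, calling confint() on the fitted model from Listing 1 provides these.

Code
# Confidence intervals for tau^2, tau, I^2, and H^2 of the fitted model
confint(m)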

Along with heterogeneity, to evaluate the robustness of the effect we should conduct a sensitivity analysis to identify any studies that drive the effect estimate up or down. A common sensitivity analysis is the “leave-one-out” analysis: the meta-analysis is refit k times, each time excluding a different study, until every study has been excluded once. This allows us to identify which studies (if any) strongly influence the average effect size estimate and whether they skew the outcomes of the meta-analysis.

R code to assess heterogeneity and sensitivity

Code
summary(m)

Random-Effects Model (k = 24; tau^2 estimator: REML)

  logLik  deviance       AIC       BIC      AICc   
 11.4775  -22.9550  -18.9550  -16.6840  -18.3550   

tau^2 (estimated amount of total heterogeneity): 0.0000 (SE = 0.0054)
tau (square root of estimated tau^2 value):      0.0024
I^2 (total heterogeneity / total variability):   0.03%
H^2 (total variability / sampling variability):  1.00

Test for Heterogeneity:
Q(df = 23) = 23.8126, p-val = 0.4142

Model Results:

estimate      se     zval    pval   ci.lb   ci.ub      
  0.3473  0.0284  12.2224  <.0001  0.2916  0.4030  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Leave-one-out analysis

Listing 3: leave-one-out code
Code
leom <- leave1out(m)

leom %>% as.data.frame() %>% kableExtra::kable()
estimate se zval pval ci.lb ci.ub Q Qp tau2 I2 H2
-1 0.3558412 0.0296365 12.00687 0 0.2977548 0.4139276 22.78266 0.4140879 0.0000065 0.0321631 1.000322
-2 0.3447477 0.0294817 11.69364 0 0.2869647 0.4025307 23.68465 0.3639740 0.0006461 3.2260830 1.033336
-3 0.3523993 0.0287012 12.27820 0 0.2961460 0.4086527 22.25215 0.4449275 0.0000000 0.0002588 1.000003
-4 0.3461777 0.0301300 11.48947 0 0.2871240 0.4052315 23.81211 0.3571548 0.0010669 5.1134690 1.053890
-5 0.3549020 0.0287708 12.33551 0 0.2985123 0.4112916 21.01483 0.5198219 0.0000000 0.0000000 1.000000
-6 0.3498035 0.0301353 11.60775 0 0.2907394 0.4088677 23.61448 0.3677572 0.0008809 4.2156025 1.044011
-7 0.3424609 0.0294076 11.64533 0 0.2848231 0.4000987 23.25148 0.3876480 0.0004629 2.3200190 1.023751
-8 0.3515375 0.0288166 12.19915 0 0.2950581 0.4080169 23.04024 0.3994605 0.0000014 0.0072662 1.000073
-9 0.3518165 0.0287357 12.24318 0 0.2954955 0.4081374 22.70959 0.4182799 0.0000030 0.0156048 1.000156
-10 0.3396474 0.0291145 11.66592 0 0.2825840 0.3967107 22.36384 0.4383595 0.0000007 0.0037260 1.000037
-11 0.3398844 0.0290182 11.71279 0 0.2830097 0.3967590 22.22571 0.4464876 0.0000051 0.0262572 1.000263
-12 0.3518165 0.0287357 12.24318 0 0.2954955 0.4081374 22.70959 0.4182799 0.0000030 0.0156048 1.000156
-13 0.3396474 0.0291145 11.66592 0 0.2825840 0.3967107 22.36384 0.4383595 0.0000007 0.0037260 1.000037
-14 0.3398844 0.0290182 11.71279 0 0.2830097 0.3967590 22.22571 0.4464876 0.0000051 0.0262572 1.000263
-15 0.3473795 0.0293400 11.83978 0 0.2898741 0.4048849 23.80056 0.3577697 0.0005548 2.7955653 1.028760
-16 0.3515050 0.0295824 11.88225 0 0.2935246 0.4094854 23.37298 0.3809313 0.0004037 1.9985899 1.020393
-17 0.3521114 0.0288174 12.21872 0 0.2956304 0.4085924 22.81626 0.4121665 0.0000023 0.0121999 1.000122
-18 0.3378222 0.0294080 11.48742 0 0.2801836 0.3954609 22.25347 0.4448497 0.0000001 0.0007122 1.000007
-19 0.3524147 0.0289343 12.17983 0 0.2957046 0.4091249 22.94356 0.4049227 0.0000005 0.0026419 1.000026
-20 0.3430154 0.0297151 11.54348 0 0.2847750 0.4012559 23.48428 0.3748305 0.0007124 3.5024078 1.036295
-21 0.3460484 0.0296795 11.65950 0 0.2878775 0.4042192 23.80407 0.3575829 0.0007880 3.8858814 1.040430
-22 0.3538772 0.0287741 12.29845 0 0.2974810 0.4102735 21.71703 0.4768834 0.0000039 0.0205373 1.000205
-23 0.3352705 0.0292415 11.46559 0 0.2779583 0.3925827 20.79012 0.5337393 0.0000000 0.0000000 1.000000
-24 0.3490165 0.0293471 11.89270 0 0.2914972 0.4065358 23.66126 0.3652327 0.0004858 2.4451706 1.025065

The leave-one-out analysis output provides a table with the effect size estimate of the model with each study excluded in turn. Comparing them visually, we can see that the estimate stays relatively similar each time, with a slightly larger estimate when the first study is excluded, but not meaningfully different. Therefore, we can conclude that the estimate is robust and that no single study biases it.

3. Risk of bias

When we have results of the risk of bias assessment for each study, we can conduct subgroup or moderator analyses that will allow us to evaluate whether high-risk studies influence the effect estimate.

The code below shows how to conduct a subgroup analysis based on the risk of bias assessment in R, adapted from code provided by Viechtbauer.

Listing 4: RoB analysis code
Code
library(metafor)
res1 <- rma(Effect_Size, SE, data=dataset, subset=Risk_of_Bias=="High")
res2 <- rma(Effect_Size, SE, data=dataset, subset=Risk_of_Bias=="Moderate")
res3 <- rma(Effect_Size, SE, data=dataset, subset=Risk_of_Bias=="Low")

dat.comp <- data.frame(Risk_of_Bias  = c("High", "Moderate", "Low"), 
                       estimate = c(coef(res1), coef(res2), coef(res3)), 
                       stderror = c(se(res1), se(res2), se(res3)),
                       tau2     = c(res1$tau2, res2$tau2, res3$tau2))
dfround(dat.comp, 3)
  Risk_of_Bias estimate stderror tau2
1         High    0.192    0.184    0
2     Moderate    0.282    0.131    0
3          Low    0.423    0.111    0
Listing 5: RoB analysis code
Code
rma(estimate, sei=stderror, mods = ~ Risk_of_Bias, method="FE", data=dat.comp, digits=3)

Fixed-Effects with Moderators Model (k = 3)

I^2 (residual heterogeneity / unaccounted variability): 0.00%
H^2 (unaccounted variability / sampling variability):   1.00
R^2 (amount of heterogeneity accounted for):            NA%

Test for Residual Heterogeneity:
QE(df = 0) = 0.000, p-val = 1.000

Test of Moderators (coefficients 2:3):
QM(df = 2) = 1.399, p-val = 0.497

Model Results:

                      estimate     se   zval   pval   ci.lb  ci.ub    
intrcpt                  0.192  0.184  1.045  0.296  -0.168  0.553    
Risk_of_BiasLow          0.231  0.215  1.075  0.282  -0.190  0.652    
Risk_of_BiasModerate     0.090  0.226  0.397  0.691  -0.353  0.532    

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the low-risk studies produce the largest effect size estimate, but none of the categories significantly pull the estimate in one direction. In this case, you can decide to keep all studies in your meta-analysis, but you can also decide to exclude the high-risk studies for a more reliable estimate of the effect, as sketched below.
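If you opt for the latter, a minimal sketch of such a sensitivity analysis is to refit the random-effects model on the studies that are not high risk (here passing the standard errors via sei, as in Listing 1):

Code
# Sensitivity analysis: refit the meta-analysis without high risk of bias studies
m_sens <- rma(yi = Effect_Size, sei = SE, data = dataset,
              method = "REML", subset = Risk_of_Bias != "High")
summary(m_sens)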

4. Publication (Reporting) Bias

Now that we have information about imprecision, risk of bias, and inconsistency, we can look at the reporting bias to finalize the assessment of certainty in evidence according to GRADE.

Publication bias is a bias in effect size estimates that arises from publication practices. Conventionally, positive (i.e., statistically significant) findings are considered more important and interesting, and are therefore more likely to be published. This in turn encourages (questionable research) practices that inflate effect estimates and artificially produce low p-values (often right at the threshold of significance). These practices skew the perception of the effectiveness of interventions: they may appear effective according to the published research, while the (average) true population effect is weak or not meaningful at all.

One way to combat this is to include unpublished, gray literature in the systematic review to ensure file-drawer, non-significant findings are also represented in the data. However, there are tools that allow us to statistically evaluate the published findings to identify potential publication bias.

There are several ways to assess it, some more popular than others:

  • Funnel plots: commonly presented in meta-analyses. Funnel plots visualize each study as a point in an inverted funnel, where the x-axis represents the effect size and the y-axis represents precision, with the standard error often used as the measure of precision. The funnel has a vertical line representing the average effect size of the dataset (although the reference line is often also set to the line of no effect, e.g., 0), and triangle lines on both sides representing the 95% confidence interval.

    Study points lying right on the lines of the triangle are exactly at the significance threshold (on the 95% confidence interval line). A funnel plot with study points scattered symmetrically on both sides of the average-effect line, and scattered within the triangle, usually indicates that there is no publication bias and that all effect sizes sampled from the distribution were equally likely to be published. If the studies cluster around the significance-threshold line, that indicates potential publication bias. Note that funnel plots carry their own biases and are not reliable with fewer than 10 studies, so they should be interpreted alongside other indicators if possible, or with caution.

    Additionally, placement on the axes tells us about the precision of the effect and its size. Study points further to the right on the x-axis and higher on the y-axis have larger effects with smaller standard errors, which makes them more precise. Funnel plot interpretation is contingent on sample size and heterogeneity of effects, so not all asymmetries indicate publication bias; some reflect other issues such as heterogeneity or other artefacts. Sterne et al. (2011) provide recommendations for interpreting funnel plots.

  • Egger’s test: Egger’s test is another commonly conducted test for publication bias, although it should be interpreted together with the funnel plot. It assesses the asymmetry of the funnel plot to infer whether there is publication bias driven by statistically significant results; a short code sketch follows this list.

  • Newer, less commonly used methods include the p-curve (Simonsohn et al., 2014), which assesses publication bias by plotting the distribution of significant p-values to see whether it looks like one produced by a true effect or accumulates around the significance threshold (i.e., indicating selective reporting), and the z-curve (Schimmack & Bartoš, 2023), which estimates publication bias by assessing the false positive risk of the studies conducted on the selected topic. A detailed breakdown of the z-curve is available here.
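As a sketch of how an Egger-type asymmetry test can be run on the model from Listing 1, metafor’s regtest() performs a regression test for funnel plot asymmetry (using the standard error as predictor, its default); this test is not part of the original output below.

Code
# Regression test for funnel plot asymmetry (Egger-type test) on the fitted model
regtest(m, model = "rma", predictor = "sei")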

Example: Funnel Plot in R

Listing 6: funnel plot
Code
funnel(m, refline = 0)

Code
# Code adapted from D.Quintana github gist: https://gist.github.com/dsquintana/b4a80de22ce14c4ab73c443f334ee83a
funnel(m, refline=0, level=c(90, 95, 99), 
       shade=c("white", "red", "orange")) 

We see that the study points lie only on the right side of the no-effect line, which indicates slight asymmetry. However, the studies are spread on both sides of the significance thresholds, with more imprecise studies (i.e., larger SE) in the non-significant area and larger studies being significant. There does not seem to be clustering around the significance line, so while there is a slightly asymmetrical pattern, it does not necessarily indicate reporting bias, especially given the low reported heterogeneity between studies: the significant studies do not differ in other characteristics apart from having a smaller SE (due to a larger sample size).

Example: z-curve in R

Listing 7: z-curve
Code
# Install and load zcurve
if (!requireNamespace("zcurve", quietly = TRUE)) {
  install.packages("zcurve")
}
library(zcurve)

# Set a seed for reproducibility
set.seed(123)

# Convert two-sided p-values to z-scores
zvals_d <- dataset %>% 
  mutate(zvals = qnorm(1 - P_value / 2))  # for two-sided tests
# Fit from z-scores: 
fit <- zcurve(zvals_d$zvals)

# Fit z-curve directly from p-values
#fit <- zcurve(p = dataset$P_value)

# Print summary
summary(fit)
Call:
zcurve(z = zvals_d$zvals)

model: EM via EM

    Estimate  l.CI  u.CI
ERR    0.516 0.486 0.546
EDR    0.516 0.466 0.566

Model converged in 44 + 38 iterations
Fitted using 17 z-values. 24 supplied, 17 significant (ODR = 0.71, 95% CI [0.49, 0.87]).
Q = -10.17, 95% CI[-13.60, -6.89]
Listing 8: z-curve
Code
# Plot the z-curve
plot(fit)

The red line on the plot represents a z-score of 1.96, i.e., values located at the 95% confidence interval threshold, or the p = 0.05 significance threshold. The histogram bars represent p-values transformed into z-scores. If the values accumulate around the red line, it means that most published results sit just at the significance threshold. In our plot, the values are scattered below and above the threshold, which means there are publications with high p-values as well as very low p-values. This is a good indicator that there is no selective reporting and no publication bias in these data. However, z-curves require a larger number of studies to estimate accurately, so you should interpret results from analyses with few studies carefully. Along with the plot, we get a model output with the expected replication rate (ERR) and expected discovery rate (EDR), which provide information about the expected replicability of the field.

Taking the two publication bias analyses together, we can state that there is no indication of publication bias, with moderate confidence due to the relatively low number of included studies.

Example: Certainty of Evidence statement (GRADE)

When it comes to imprecision, as evident from the forest plot and the model output (see Listings 1 and 2), the meta-analytic effect we found is imprecise. Both the study effects and the average true effect have wide confidence and prediction intervals. Furthermore, the interpretation of the effect would change considerably depending on whether the true effect is near the upper or the lower bound of the confidence interval. Regarding inconsistency, we found low heterogeneity between the studies (see the τ² and τ estimates in the model summary), so we do not downgrade for inconsistency. Additionally, the leave-one-out sensitivity analysis (Listing 3) shows that the effect is generally robust and that no single study drives it in one direction. According to the risk of bias domain of GRADE, our dataset contains a considerable number of studies with moderate and high risk of bias (see the Risk_of_Bias column in the example dataset). This downgrades our certainty of evidence, even though the subgroup analysis (Listings 4 and 5) did not indicate bias in the estimates produced by these studies. Finally, the assessment of reporting bias with a funnel plot (Listing 6) and a z-curve analysis (Listings 7 and 8) did not reveal reporting bias issues.

Taking all domains into consideration, although we found consistency among the studies and no publication bias, due to low precision and moderate risk of bias we conclude that there is moderate certainty in the presented evidence.

Loaded packages

 [1] ".GlobalEnv"        "package:zcurve"    "package:metafor"  
 [4] "package:numDeriv"  "package:metadat"   "package:Matrix"   
 [7] "package:lubridate" "package:forcats"   "package:stringr"  
[10] "package:dplyr"     "package:purrr"     "package:readr"    
[13] "package:tidyr"     "package:ggplot2"   "package:tidyverse"
[16] "package:tibble"    "package:esc"       "package:epitools" 
[19] "package:stats"     "package:graphics"  "package:grDevices"
[22] "package:utils"     "package:datasets"  "package:methods"  
[25] "Autoloads"         "package:base"