5  Discussion

This meta-analysis aimed to evaluate the performance of machine learning models applied to remote sensing for SDG monitoring. Specifically, the study set out to estimate the average performance, determine the level of heterogeneity within and across studies, assess whether specific study features influence model performance, and, lastly, compare the sample-weighted and unweighted summary effect estimates. While previous meta-analyses on machine learning models for remote sensing have predominantly relied on unweighted approaches (e.g., Hall et al., 2023; Khatami et al., 2016), this study found that incorporating a weighted approach did not significantly alter the results. Both the weighted and unweighted estimates showed similar average performance, suggesting that weighting by sample size may not substantially influence the outcomes in this context.

The results from this meta-analysis show that the overall accuracy of machine learning models applied to remote sensing is consistently high: the estimated average overall accuracy was \(\hat{\mu}_{\text{unweighted}} = 0.90\) and \(\hat{\mu}_{\text{weighted}} = 0.89\). The results also demonstrate considerable variability in the predictive performance of machine learning models applied to remote sensing data for SDGs. Some of this variability could be attributed to the proportion of the majority class and to the inclusion of ancillary data. Neither the type of model (neural networks versus tree-based models) nor the SDG studied showed significant differences in overall accuracy. Unsurprisingly, the proportion of the majority class significantly affected the overall accuracy of machine learning models, while the use of ancillary data in primary studies had a small but significant positive effect on overall accuracy. No other significant effects were found among the study features examined.
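For reference, these estimates can be read against a schematic three-level random-effects model of the form below. The notation is illustrative, with \(i\) indexing effect sizes within study \(j\) (consistent with the \(k_j\) and \(m_{ij}\) used in the limitations); the exact specification is the one described in the methods, not this sketch:

\[
y_{ij} = \mu + u_{j} + u_{ij} + e_{ij}, \qquad u_{j} \sim N(0, \tau^2), \quad u_{ij} \sim N(0, \omega^2), \quad e_{ij} \sim N(0, v_{ij}),
\]

where \(\tau^2\) captures between-study heterogeneity, \(\omega^2\) within-study heterogeneity, and \(v_{ij}\) is the known sampling variance of each (transformed) effect size. The unweighted summary averages the effect sizes with equal weights, whereas the weighted summary uses weights that depend on \(v_{ij}\), and therefore on the primary sample sizes; the similarity of \(\hat{\mu}_{\text{unweighted}}\) and \(\hat{\mu}_{\text{weighted}}\) indicates that this choice made little difference here.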

The findings of this study regarding the use of ancillary data align with Khatami et al. (2016) and Hanadé Houmma et al. (2022), who found that the use of ancillary data improved model performance. Previous research has found some effect of the choice of machine learning model. For example, Khatami et al. (2016) noted that while support vector machines and neural networks performed well, differences between other model types were not significant. Notably, no study was found that explicitly corrected for class imbalance (the proportion of the majority class) when assessing differences in performance between groups. While Khatami et al. (2016) employed pairwise comparisons, which ensure that models are compared within the same data context, this study goes further in directly highlighting the influence of class proportion on overall accuracy.

Limitations

  1. Number of reviewers: From the 200 studies randomly sampled, three reviewers assessed whether full-text screening should be conducted. Only 57 papers were agreed upon by all three reviewers, while individually each reviewer would have included between 77 and 81 studies. This highlights the subjectivity of the selection process and the importance of having multiple reviewers. The full-text screening was conducted by only one person, which means that subjective decisions and potential mistakes could go undetected in the final dataset. This issue is exacerbated by the inconsistent reporting of methods in this field. For example, one feature that could not be included in the analysis was whether the reported results were derived from the training or the test set, because this was often unclear in the selected studies.

  2. Sample size: This analysis included a total of 20 studies. While several simulation studies suggest that a three-level meta-analysis can yield accurate results with as few as 20 to 40 studies (Hedges et al., 2010), this analysis sits at the lower bound, and the included studies exhibit considerable variability, making statistical power a concern. Polanin (2014) suggests that a minimum of 40 studies is generally needed to ensure robust results. Furthermore, a relatively high proportion of the studies (6 out of 20) reported only one result (\(k_j = 1\)), limiting the ability to assess within-study variability. The small sample size inherently increases the potential for bias and may affect the reliability of the findings (Polanin, 2014).

  3. Choice of effect size: While overall accuracy is widely used, it does not capture the complexity of model performance, especially in studies with imbalanced classes. To illustrate the problem, if 99% of the data belongs to class A, a model that always predicts class A, without any regard to the predictors, will achieve an overall accuracy of 99%, despite essentially doing nothing and failing to capture meaningful patterns (a numerical sketch of this failure mode is given after this list). For more specific details on the issues related to the use of overall accuracy, see Foody (2020) and Stehman & Foody (2019). Alternative metrics include Matthews’ correlation coefficient, F1 score, Somers’ D, and average precision. Unfortunately, these metrics are rarely reported in the studies analyzed here. Moreover, some of these alternatives are also sensitive to class imbalance and must be corrected to ensure comparability across studies (Burger & Meertens, 2020).

  4. Publication bias: This study only examined published results, which introduces publication bias: a well-documented effect whereby studies with positive results are more likely to be published, while negative or neutral findings remain unpublished (Borenstein et al., 2009; Bozada et al., 2021; Hansen et al., 2022; Harrer et al., 2022). This bias can lead to an overestimation of effects and may partly explain the high average overall accuracy of around 90% found in this study.

  5. Study features included: The analysis would have benefited from the inclusion of more study features. It is also important to note that most of the study features included in this research were between-study covariates and did not differ within studies, which explains why only the between-study heterogeneity was reduced. Furthermore, due to the small sample size, it was necessary to aggregate the study features into broad categories, which limited the granularity of the analysis.

  6. Apples and oranges problem: The \(I^2\) result of effectively 100% may indicate that the included studies are too different to compare statistically. This is often referred to as the “apples and oranges problem” (Harrer et al., 2022, Chapter 1). The extent to which primary studies can differ while still being meaningfully combined in a meta-analysis is debated. However, when Robert Rosenthal, a pioneer in meta-analysis, was asked whether combining studies with significant differences is valid, his response was: “combining apples and oranges makes sense if your goal is to produce a fruit salad” (Borenstein et al., 2009, Chapter 40, p. 357). In this case, despite the diverse research aims of the included studies, the objective is to draw general conclusions about machine learning applications in remote sensing for SDG monitoring. This approach can be viewed as a “fruit salad” with potential for broad applicability across different SDG contexts. However, this again raises the issue of sample size, as a large sample is required to ensure sufficient statistical power to draw confident conclusions.

  7. Cochran’s Q and large sample sizes: Another limitation is the reliance on Cochran’s Q for testing heterogeneity. While widely used, the power of the Q-statistic depends on the number of included effect sizes (\(k\)) and on the precision of the individual studies, i.e., their sample sizes (\(m_{ij}\)). With very large sample sizes, the Q-statistic becomes highly sensitive to even minor differences between studies: it is “overpowered”, which results in the detection of statistically significant heterogeneity even when the actual differences between studies are small (a small simulation sketch after this list illustrates the point). Little research has been done on the effect of very large primary sample sizes, since meta-analyses typically compile studies whose unit of analysis is human patients; primary sample sizes in the millions are not a common issue in that literature.

  8. Transformation of the effect size: In general, model selection at the transformed level presents limitations, as the relevance of features is assessed on the transformed scale, which may not directly translate to the original effect size after back-transformation. This complicates the interpretation of results, since conclusions drawn on the transformed scale may not carry the same meaning when applied to the original scale. This effect is seen with the ancillary-data covariate. Additionally, the use of the FT transformation is contested in the literature because of several important limitations (Doi & Xu, 2021; Lin & Xu, 2020; Röver & Friede, 2022; Schwarzer et al., 2019). First, the FT transformation is notably unintuitive, specifically the calculation of its variance, which relies on the structure of the arcsine function’s derivative (the transformation is written out after this list). Second, back-transforming the pooled effect size using certain methods, such as the harmonic mean of the primary sample sizes, can lead to misleading results (Doi & Xu, 2021; Lin & Xu, 2020; Röver & Friede, 2022; Schwarzer et al., 2019; Wang, 2023). In this analysis, the pooled variance, rather than the harmonic mean, was used for back-transformation, which seems to address the main concern debated in the literature. Nevertheless, the choice of back-transformation method significantly influences the outcome, and justifying a specific method is especially challenging in a multilevel data structure (Röver & Friede, 2022). Lastly, in a random-effects model the true (transformed) proportions are assumed to follow a normal distribution between studies; the FT transformation potentially violates this assumption because it maps proportions onto a bounded interval (Röver & Friede, 2022).
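To make the majority-class issue in limitation 3 concrete, the sketch below scores a hypothetical classifier that always predicts the majority class of a 99:1 dataset (the data are simulated for illustration and are not drawn from any of the included studies). Overall accuracy comes out at roughly 0.99 while Matthews’ correlation coefficient is 0, which is exactly the failure mode described above and also why the proportion of the majority class acts as such a strong moderator of overall accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef

rng = np.random.default_rng(42)

# Hypothetical ground truth: 99% majority class (0), 1% minority class (1)
y_true = rng.choice([0, 1], size=10_000, p=[0.99, 0.01])

# A "model" that ignores the predictors entirely and always outputs the majority class
y_pred = np.zeros_like(y_true)

print(f"Overall accuracy : {accuracy_score(y_true, y_pred):.3f}")    # ~0.99
print(f"Matthews corr.   : {matthews_corrcoef(y_true, y_pred):.3f}") # 0.000, i.e. no skill
```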
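The sensitivity of Cochran’s Q discussed in limitation 7 can be illustrated with a small simulation. The sketch below is not the analysis code used in this thesis: for simplicity it works with raw proportions and binomial sampling variances rather than the FT scale, and the sample sizes and accuracies are invented. The qualitative point carries over, because the weights \(1/v_{ij}\) grow with \(m_{ij}\), so with primary sample sizes in the millions even differences of a fraction of a percentage point produce an enormous, highly significant Q.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

k = 20                                    # number of effect sizes
n = np.full(k, 2_000_000)                 # very large primary sample sizes
p = 0.90 + rng.uniform(-0.005, 0.005, k)  # true accuracies differ by at most half a percentage point

# Observed proportions and their binomial sampling variances
x = rng.binomial(n, p)
p_hat = x / n
v = p_hat * (1 - p_hat) / n

# Cochran's Q: weighted sum of squared deviations from the inverse-variance-weighted mean
w = 1 / v
p_bar = np.sum(w * p_hat) / np.sum(w)
Q = np.sum(w * (p_hat - p_bar) ** 2)
p_value = stats.chi2.sf(Q, df=k - 1)

print(f"Q = {Q:.0f} on {k - 1} df, p = {p_value:.2e}")  # Q in the thousands, p effectively zero
```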
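Finally, for limitation 8, one common specification of the Freeman-Tukey (FT) double-arcsine transformation is written out below; the exact variant and back-transformation used in this thesis are those described in the methods, and the formula is shown here only to make the unintuitive variance term visible. With \(x_{ij}\) correctly classified cases out of \(m_{ij}\),

\[
y_{ij} = \tfrac{1}{2}\left(\arcsin\sqrt{\tfrac{x_{ij}}{m_{ij}+1}} + \arcsin\sqrt{\tfrac{x_{ij}+1}{m_{ij}+1}}\right), \qquad v_{ij} \approx \frac{1}{4 m_{ij} + 2}.
\]

Because the sampling variance depends only on \(m_{ij}\) and not on the proportion itself, a sample size has to be chosen when back-transforming the pooled estimate, and it is precisely this choice (harmonic mean, pooled variance, or some other value) that drives the disagreement in the literature cited above.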

Implications for Future Research

The limitations identified in this meta-analysis suggest several directions for future research that can enhance the robustness and generalisability of findings related to machine learning applications in remote sensing for SDG monitoring.

  1. Sample size and model complexity: One of the primary limitations of this meta-analysis was the small sample size. Future research should aim to expand the pool of included studies, which would also allow interaction effects between the collected study features to be included in the analysis. The structure of the random effects could also be explored by applying more sophisticated variance-covariance structures. This approach, sometimes referred to as dose-response meta-analysis (Viechtbauer, 2024, p. 269), would provide insights into how specific study characteristics influence effect sizes over time or across varying conditions.

  2. Broader inclusion of performance metrics: This meta-analysis primarily focused on overall accuracy, a commonly used but potentially misleading performance metric, particularly for imbalanced datasets. Future studies should expand the range of performance metrics, incorporating class-specific precision and recall, the F1 score, Matthews’ correlation coefficient, or Somers’ D to provide a more comprehensive evaluation of model performance (Burger & Meertens, 2020). More than one effect size can be modeled using network meta-analysis models (Harrer et al., 2022, Chapter 12). The inclusion of more performance metrics would offer a more nuanced understanding of how models perform under different conditions.

  3. Exploring additional study features and moderators: The present study focused on a limited set of study features. Future research should investigate a broader range of potential moderators, such as model complexity, data preprocessing techniques, and environmental or socio-economic factors specific to SDG challenges. By including a more extensive set of features, researchers can better understand the drivers of performance variability and refine model selection for specific applications.

  4. Effect of large sample size in primary studies: Simulation studies could provide insights into the sensitivity of Cochran’s Q in the context of large sample sizes. Developing less sensitive methods for assessing heterogeneity would improve the reliability of meta-analytic findings, especially when studies involve substantial sample sizes, which can exaggerate minor differences between studies.

  5. Data extraction: Over the time frame of this research, ChatGPT showed significant improvements in data extraction capabilities. Initially, in January 2024, it struggled to extract meaningful features; by May 2024, it was capable of accurately filling in all study features directly from the provided papers (in PDF format). Although this improvement was not formally assessed in this study, the difference was striking. Some research has already examined the potential accuracy of large language models (LLMs) in data extraction for meta-analyses, with promising results (Mahuli et al., 2023). However, for this thesis, ChatGPT was not used for formal data extraction; instead, traditional manual extraction methods were employed to ensure accuracy. Further investigation into the accuracy of LLMs for meta-analysis is required, as LLMs could expedite the data extraction process and thereby help address the limited number of included studies. A further recommendation to improve data extraction would be for journals to require results and key study features to be submitted separately, in addition to the manuscript, so that the journals themselves can report trends in outcomes.

6  Conclusion

This meta-analysis provides insights into the variability of machine learning models used for remote sensing in SDG monitoring. First, the average performance of machine learning models was found to be high but strongly influenced by class imbalance. This finding reinforces the limitations of overall accuracy as a metric for assessing model performance and highlights the need for a shift towards more balanced and nuanced performance metrics in future SDG monitoring studies. Second, the three-level random-effects model showed a substantial degree of heterogeneity across outcomes. Third, the role of specific study features was notable: although no significant differences were observed between model types (e.g., neural networks or tree-based models), the proportion of the majority class and the inclusion of ancillary data were important factors. Fourth, the comparison of sample-weighted and unweighted models revealed no substantial difference in the summary effect size, though the weighted model uncovered significant heterogeneity. Lastly, more research is needed to assess the robustness and applicability of meta-analytic methods in this field. In particular, the use of Cochran’s Q-statistic is questionable in the context of this analysis, as the very large sample sizes may make the Q-statistic overly sensitive. This can result in the detection of statistically significant heterogeneity even when the heterogeneity is not practically meaningful.