Can AI produce reliable and consistent data analysis?

Kira Kappe; Adis Dzebo

doi:10.51414/sei2025.040

SEI brief

part of AI and SEI

SEI researchers trialled the AI Reader in a pilot study of national policy evaluations, reflecting on its accuracy, consistency and potential for large-scale analysis.

Kira Kappe, Adis Dzebo / Published on 21 August 2025

Citation

Kappe, K., & Dzebo, A. (2025). Can AI produce reliable and consistent data analysis? SEI brief. Stockholm Environment Institute. http://doi.org/10.51414/sei2025.040

With recent advances in artificial intelligence (AI) tools for research, tasks that would have previously overwhelmed even the most dedicated research teams, such as the systematic analysis of thousands of policy documents, are increasingly within reach. These tools now provide advanced document analysis capabilities even to researchers lacking technical expertise or advanced coding skills.

However, one question still lingers: how can we ensure that AI-generated data meets the quality standards demanded by academic scrutiny?

Here, we explore the potential of AI tools for systematic policy analysis, while also examining the challenges and pitfalls that may prevent AI from fully delivering on its promise.

Evidence-based policy evaluation

In this pilot project, conducted by SEI researchers, we assess the advantages, risks and limitations of using AI tools in academic research, with a focus on policy evaluation. With the help of SEI’s AI Reader, we are conducting a large-scale review of outcome and impact evaluations of policy implementation, as well as independent audit reports.

Our objective is to uncover the drivers of successful policy implementation in different countries and to extract insights into what enables effective outcomes across diverse national or thematic contexts. Focusing on climate policy, this work aims to address the persistent challenge of linking policies to successful outcomes and to identify patterns of effective implementation within specific national, socio-economic and governance contexts, thereby supporting more informed policymaking.

SEI AI Reader

The SEI AI Reader (beta) is a document analysis tool developed in 2024 by SEI researchers (Babis et al., 2024). It uses large language models (LLMs), such as those behind ChatGPT, to assist with literature reviews and extract policy-relevant data across various national and thematic contexts. It features a user-friendly interface with customizable input options:

A main query panel for targeted prompts.
A structured table where users can (i) define key independent variables, (ii) specify what data should be extracted for each variable and (iii) provide examples to illustrate what should (and should not) be extracted.
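
As an illustration only, one row of such a structured table could be represented as a small data structure. The class and field names below (VariableSpec, instruction, include_examples, exclude_examples) and the example content are hypothetical and are not taken from the AI Reader itself; the sketch only shows how the three inputs per variable fit together.

```python
from dataclasses import dataclass, field


@dataclass
class VariableSpec:
    """One row of a hypothetical variable table (all field names are assumptions)."""
    name: str                                                   # (i) the independent variable
    instruction: str                                            # (ii) what to extract for it
    include_examples: list[str] = field(default_factory=list)  # (iii) what should be extracted
    exclude_examples: list[str] = field(default_factory=list)  # (iii) what should not be extracted


def render_row(spec: VariableSpec) -> str:
    """Flatten one row into plain text that could accompany a main query."""
    lines = [f"Variable: {spec.name}", f"Extract: {spec.instruction}"]
    lines += [f"Do extract: {example}" for example in spec.include_examples]
    lines += [f"Do not extract: {example}" for example in spec.exclude_examples]
    return "\n".join(lines)


# Invented example row, for illustration only.
coordination = VariableSpec(
    name="Coordination",
    instruction="Evidence of how agencies coordinated during implementation.",
    include_examples=["A joint ministerial task force met quarterly."],
    exclude_examples=["General statements that coordination is important."],
)
row_text = render_row(coordination)
```

Keeping the positive and negative examples as separate fields mirrors the "should (and should not)" distinction in the table.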

The tool’s simple interface is designed to ensure accessibility even for non-technical users.

Creating an analytical framework

To guide our analysis, we conducted a human-led literature review of established policy evaluation frameworks, such as those by the OECD and the European Environment Agency, to identify suitable variables for extracting the key elements that determine policy success. Our analytical framework is structured around two broad themes, with a focus on the correlation and causal links between them:

Criteria for establishing successful policy implementation, including effectiveness, efficiency, outcomes and impacts, attribution and spillover effects.
Principles of effective policy implementation processes, such as agenda-setting, policy formulation, content, implementation and stakeholder engagement.

Recognizing that policy processes vary across governance systems, our analytical approach is structured around three “universal” characteristics of effective policy processes: coordination, coherence and integration. Our analytical framework ultimately comprises 12 independent variables and a checklist of 46 questions.

To conduct the analysis, we utilized a newly compiled global database of impact and outcome evaluations, along with extensive repositories of independent audits of national policy implementation. Both databases include metadata and direct access to thousands of documents.

Operationalization and pilot review

With our analytical framework and policy document dataset ready, we initiated a pilot study to test the tool’s capabilities. We began by analysing four documents – both manually and using the AI Reader. We conducted approximately 10 iterative runs on each document using prompts of varying specificity and detail.

The purpose of this iterative process was to calibrate the tool by refining the input query, question formulation and context specification to match the accuracy of human analysis.

Prompt design and tool calibration

We quickly discovered that our initial analytical framework, designed with broad exploratory questions, was not effective in extracting the relevant information. The tool often returned generic, repetitive answers with no evaluative insights or omitted responses altogether.

By iteratively redesigning and reformulating each question, we arrived at prompts that produced more sophisticated and analytically complex answers. This process is reflected in Figure 1, which shows how we eventually obtained answers that corresponded to our analytical framework.

Figure 1: Formulating the right question. A diagram titled “Agenda-setting process” shows the evolution of question formulation for policy evaluation. It features four columns moving left to right, each with a question (Q) and an answer (A) pair, demonstrating increasing analytical depth. Underneath each A, a label describes the level of analytical sophistication, from basic descriptive answers to complex causal analysis. The progression shows how refining questions leads to more meaningful, evaluative insights aligned with an analytical framework.

Graphic: Mia Shu / SEI.

Similarly, iterative refinement of queries and questions revealed that matches and mismatches between query and question significantly influence both the quality and quantity of responses from the AI Reader. As shown in Figure 2, different combinations applied in individual runs (A–D) yielded distinct outcomes. The findings suggest that a broad but structured query combined with a specific, analytical question provides the most effective balance. This combination enables the AI Reader to cast a wide net for all relevant content while applying a narrow filter that recognizes multiple types of evidence as relevant.

Figure 2: Choosing the right query–question combination. A visual comparison of four AI document analysis runs (Run A to Run D) shows how different query and question designs affect results. Run A uses a broad, unstructured query and a broad question, leading to many irrelevant results (about 36). Run B introduces structure to the query but keeps the broad question, reducing some irrelevant content (about 26 results). Run C uses both a narrower query and a narrower question, which lowers both the quantity and quality of results (about 24). Run D combines the structured query from Run B with the narrow question from Run C, producing high-quality results with fewer irrelevant quotes (about 28). The diagram demonstrates how refining both queries and questions improves the effectiveness of AI-assisted content extraction.

Graphic: Mia Shu / SEI.

A third parameter supporting the gradual improvements made through iteration was the “context variable”, which allowed us to define terms, provide contextual guidelines to a specific independent variable and incorporate “do’s and don’ts” from previous run results to correct inconsistencies.

Other refinements added at later stages of the prompting included instructing the AI to provide original answers to each question, along with separate justifications. These justifications offered additional insights into why specific information was extracted and contextualized the answer in light of the question posed. This structure is expected to support later analysis, helping to more systematically assess and rank the relevance of extracted answers.

Accuracy and reliability

We assessed the tool’s accuracy by comparing its responses to our manual assessment, using a simple quantitative scoring system. On average, the AI achieved approximately 85% accuracy across all four documents compared to our own analysis.
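
The scoring scheme is not spelled out here, so the following is only a sketch of one simple possibility: each AI answer is scored 1 (matches the manual assessment), 0.5 (partial match) or 0 (mismatch), and accuracy is the mean score. The three-point scale, the function name and the example scores are assumptions, not the pilot's actual method or data.

```python
from statistics import mean

# Assumed three-point scale: 1 = match, 0.5 = partial match, 0 = mismatch.
ALLOWED_SCORES = {0.0, 0.5, 1.0}


def accuracy(scores):
    """Mean agreement score between AI answers and the manual assessment."""
    if not scores:
        raise ValueError("at least one scored question is required")
    if any(score not in ALLOWED_SCORES for score in scores):
        raise ValueError("scores must be 0, 0.5 or 1")
    return mean(scores)


# Invented scores for one document, for illustration only.
document_scores = [1, 1, 0.5, 0]
document_accuracy = accuracy(document_scores)  # 2.5 / 4 = 0.625
```

Averaging the per-document values would then give an overall figure comparable to the one reported, provided each document is weighted equally.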

A perfect score was not possible due to the subjective nature of the topic – even our team did not always agree on the correct answer. Sometimes, the AI Reader provided insights missed by human analysis; in other cases, it failed to extract relevant information.

Importantly, human and AI analysis agreed closely on which questions lacked available data, suggesting the AI could resist the urge to “please” by fabricating answers. Notably, we did not observe hallucinated facts or incorrect answers, likely due to safeguards in our prompt design (e.g. requiring page references, direct quotes and explicit instructions not to hallucinate).
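
One way such safeguards could be audited after a run is to check that each quoted passage really appears on the page it cites. This is a minimal sketch, assuming the page text is already available as a dictionary keyed by page number; the function names, the data layout and the example text are hypothetical and not part of the AI Reader.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not cause false misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_is_supported(quote: str, page: int, pages: dict[int, str]) -> bool:
    """Return True if the quote occurs on the cited page after normalization."""
    page_text = pages.get(page)
    cleaned_quote = normalize(quote)
    if page_text is None or not cleaned_quote:
        return False  # unknown page or empty quote cannot be verified
    return cleaned_quote in normalize(page_text)


# Invented page text, for illustration only.
pages = {3: "The programme reduced emissions\nby 12% between 2015 and 2020."}

supported = quote_is_supported("reduced emissions by 12%", 3, pages)
unsupported = quote_is_supported("reduced emissions by 20%", 3, pages)
```

An exact-match check like this catches invented or altered quotes, but it would not catch a genuine quote that is misinterpreted, so it complements rather than replaces manual review.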

Consistency

We also cross-analysed the answers from multiple runs to assess the consistency of the tool’s outputs and determine how often answers were the same or similar, comparable or significantly different. The findings are cautiously promising: consistency across the four documents ranged from 69% to 90%.
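
As a rough sketch of how such a comparison could be tallied, suppose each question has been rated by hand as "same" (same or similar), "comparable" or "different" across runs. The rating labels, function name and example ratings below are assumptions for illustration; how the reported percentages were actually derived is not specified here.

```python
from collections import Counter

# Assumed labels for the three levels of agreement described in the text.
RATINGS = ("same", "comparable", "different")


def consistency_shares(ratings):
    """Share of questions falling into each agreement category."""
    if not ratings:
        raise ValueError("at least one rated question is required")
    unknown = set(ratings) - set(RATINGS)
    if unknown:
        raise ValueError(f"unknown ratings: {sorted(unknown)}")
    counts = Counter(ratings)
    return {rating: counts[rating] / len(ratings) for rating in RATINGS}


# Invented ratings for one document, for illustration only.
example_ratings = ["same"] * 7 + ["comparable"] * 2 + ["different"]
shares = consistency_shares(example_ratings)
```

Whether "comparable" answers count towards the consistency figure is a judgment call that would need to be fixed before comparing documents.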

One recurring issue was that while three or four runs often produced similar results, one run would occasionally differ. The reason for this discrepancy remains unclear.
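
When several runs are available for the same question, a deviating run could be flagged by comparing each run with the majority answer. The sketch below assumes the answers have already been reduced to comparable codes (for example by manual coding); the function name, run identifiers and codes are hypothetical.

```python
from collections import Counter


def flag_outlier_runs(coded_answers: dict[str, str]) -> list[str]:
    """Return the ids of runs whose coded answer differs from a strict-majority answer."""
    if not coded_answers:
        return []
    counts = Counter(coded_answers.values())
    top_answer, top_count = counts.most_common(1)[0]
    if top_count <= len(coded_answers) / 2:
        return []  # no strict majority, so no run can be singled out
    return sorted(run for run, answer in coded_answers.items() if answer != top_answer)


# Invented coded answers to one question across four runs, for illustration only.
runs = {"run1": "yes", "run2": "yes", "run3": "yes", "run4": "no"}
outliers = flag_outlier_runs(runs)
```

A flagged run only indicates disagreement with the other runs; it does not show which answer is correct, so flagged cases would still need a manual check.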

As a potential solution for the final analysis of the full dataset, we suggest running each document twice to help identify and compensate for potential outliers or inaccuracies. However, some inconsistency is likely unavoidable when working with AI tools, due to the inherent “black box” nature of LLMs such as ChatGPT.

Additional considerations when conducting multiple runs include the environmental impacts of repeated processing and the added workload of comparing and consolidating the results.

Implications and potential for scalability

Returning to our central question – “how can we ensure that AI-generated data meets the quality standards demanded by academic scrutiny?” – our pilot review suggests that, despite some limitations, the consistency, accuracy and reliability of AI-generated data are sufficiently high.

This makes advanced tools like the SEI AI Reader a promising solution for overcoming the methodological challenges and time constraints involved in systematically processing and synthesizing the vast and growing body of climate policy evaluations, particularly grey literature. These tools can help derive actionable insights from past policymaking experiences.

Our next step is to expand the analysis to a larger set of documents to validate the current calibration and assess the scalability of our approach before proceeding to the main analysis.

Even at this early stage, our findings suggest that AI tools hold considerable potential for analysing large volumes of policy-related documents, supporting the identification of common patterns in successful policy implementation across varied national and thematic contexts.

This research contributes to closing persistent empirical gaps by:

enhancing understanding of how coherent policymaking contributes to effective implementation outcomes, and
identifying the specific, localized conditions that explain “what works, where, how and why?” (Browne et al., 2023; Dzebo et al., 2025).

References

Babis, W., Muñoz Cabré, M., Martelo Llerena, C., Salzano, C., Torres-Morales, E., & Arsadita, F. (Forthcoming, 2025). SEI AI Reader [Dataset]. Stockholm Environment Institute.

Browne, K., Dzebo, A., Iacobuta, G., Faus Onbargi, A., Shawoo, Z., Dombrowsky, I., Fridahl, M., Gottenhuber, S., & Persson, Å. (2023). How does policy coherence shape effectiveness and inequality? Implications for sustainable development and the 2030 Agenda. Sustainable Development, 31(5), 3161–3174. https://doi.org/10.1002/sd.2598

Dzebo, A., Shawoo, Z., & Browne, K. (2025). Does policy coherence make national implementation of global sustainability goals more successful? Annual Review of Environment and Resources, EG50. https://doi.org/10.1146/annurev-environ-111523-102337

Nilsson, M., Hackmann, H., Sokona, Y., Guilanpour, K., Oni, T., Dzebo, A., & Onoda, S. (2024). Seeking synergy solutions: policies that support both climate and SDG action. Expert Group on Climate and SDG Synergy. UN Department of Economic and Social Affairs. https://sdgs.un.org/sites/default/files/2024-06/Thematic%20Report%20on%20Climate%20and%20SDGs%20Action-060824.pdf

Kira Kappe

Research Associate

SEI Headquarters

Adis Dzebo

Senior Research Fellow

SEI Headquarters

Topics and subtopics: Governance : Public policy, Innovation / Climate : Climate policy
Tags: artificial intelligence, machine learning, methodology, climate policy, policy engagement
Related centres: SEI Headquarters
