SEI researchers trialled the AI Reader in a pilot study of national policy evaluations, reflecting on its accuracy, consistency and potential for large-scale analysis.
With recent advances in artificial intelligence (AI) tools for research, tasks that would have previously overwhelmed even the most dedicated research teams, such as the systematic analysis of thousands of policy documents, are increasingly within reach. These tools now provide advanced document analysis capabilities even to researchers lacking technical expertise or advanced coding skills.
However, one question still lingers: how can we ensure that AI-generated data meets the quality standards demanded by academic scrutiny?
Here, we explore the potential of AI tools for systematic policy analysis, while also examining the challenges and pitfalls that may prevent AI from fully delivering on its promise.
In a pilot project conducted by SEI researchers, we aim to assess the advantages, risks and limitations of using AI tools in academic research. Our focus is on policy evaluation analysis. With the help of SEI’s AI Reader, we are currently conducting a large-scale review of outcome and impact evaluations of policy implementation, as well as independent audit reports.
Our objective is to uncover the drivers of successful policy implementation in different countries and to extract insights into what enables effective outcomes across diverse national or thematic contexts. Focusing on climate policy, this work aims to address the persistent challenge of linking policies to successful outcomes and to identify patterns of effective implementation within specific national, socio-economic and governance contexts, thereby supporting more informed policymaking.
The SEI AI Reader (beta) is a document analysis tool developed in 2024 by SEI researchers (Babis et al., 2024). It uses large language models (LLMs), such as those behind ChatGPT, to assist with literature reviews and extract policy-relevant data across various national and thematic contexts. It features a user-friendly interface with customizable input options.
The tool’s simple interface is designed to ensure accessibility even for non-technical users.
To guide our analysis, we conducted a human-led literature review of established policy evaluation frameworks, such as those by the OECD and the European Environment Agency, to identify suitable variables for extracting the key elements that determine policy success. Our analytical framework is structured around two broad themes, focusing on their correlation and causality.
Recognizing that policy processes vary across governance systems, our analytical approach is structured around three “universal” characteristics of effective policy processes: coordination, coherence and integration. Our analytical framework ultimately comprises 12 independent variables and a checklist of 46 questions.
To conduct the analysis, we utilized a newly compiled global database of impact and outcome evaluations, along with extensive repositories of independent audits of national policy implementation. Both databases include metadata and direct access to thousands of documents.
With our analytical framework and policy document dataset ready, we initiated a pilot study to test the tool’s capabilities. We began by analysing four documents – both manually and using the AI Reader. We conducted approximately 10 iterative runs on each document using prompts of varying specificity and detail.
The purpose of this iterative process was to calibrate the tool by refining the input query, question formulation and context specification to match the accuracy of human analysis.
We quickly discovered that our initial analytical framework, designed with broad exploratory questions, was not effective in extracting the relevant information. The tool often returned generic, repetitive answers with no evaluative insights or omitted responses altogether.
Through iteration, in which each question was redesigned and reformulated, we arrived at answers that were more sophisticated and analytically complex. Figure 1 reflects this process, showing how we eventually obtained answers that corresponded to our analytical framework.
Figure 1: Formulating the right question.
Graphic: Mia Shu / SEI.
Similarly, iterative refinement of queries and questions revealed how query-question (mis-)matches significantly influence both the quality and quantity of responses from the AI Reader. As shown in Figure 2, different combinations applied in individual runs (A–D) yielded distinct outcomes. The findings suggest that a broad but structured query combined with a specific, analytical question provides the most effective balance. This combination enables the AI Reader to cast a wide net for all relevant content while applying a narrow filter that recognizes multiple types of evidence as relevant.
Figure 2: Choosing the right query-question combination.
Graphic: Mia Shu / SEI.
A third parameter supporting the gradual improvements made through iteration was the “context variable”, which allowed us to define terms, provide contextual guidelines to a specific independent variable and incorporate “do’s and don’ts” from previous run results to correct inconsistencies.
Other refinements added at later stages of the prompting included instructing the AI to provide original answers to each question, along with separate justifications. These justifications offered additional insights into why specific information was extracted and contextualized the answer in light of the question posed. This structure is expected to support later analysis, helping to more systematically assess and rank the relevance of extracted answers.
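The three inputs discussed above (query, question and context) and the answer-plus-justification structure can be illustrated with a minimal sketch. The class name, field names and instruction wording below are hypothetical and chosen for illustration only; they do not describe how the AI Reader itself is built or how it assembles prompts internally.

```python
from dataclasses import dataclass


@dataclass
class ReaderPrompt:
    """Hypothetical container for the three inputs discussed in the text."""
    query: str     # broad but structured: which content to retrieve
    question: str  # specific and analytical: what to extract from that content
    context: str   # definitions, guidelines, do's and don'ts from earlier runs

    def render(self) -> str:
        """Assemble one prompt that asks for an answer and a separate justification."""
        return "\n\n".join([
            f"QUERY: {self.query}",
            f"QUESTION: {self.question}",
            f"CONTEXT: {self.context}",
            "INSTRUCTIONS: Answer using only the document. "
            "Give a page reference and a direct quote for each claim. "
            "If the document contains no relevant information, reply 'No data'. "
            "Return two separate fields, ANSWER and JUSTIFICATION.",
        ])


# Illustrative values only, not taken from the pilot's checklist.
prompt = ReaderPrompt(
    query="coordination between ministries during policy implementation",
    question="Which coordination mechanisms does the evaluation credit with improving outcomes?",
    context="Coordination means formal or informal arrangements between public bodies.",
)
print(prompt.render())
```

Keeping the three inputs as separate fields makes it straightforward to vary one of them between runs while holding the others fixed, which is the kind of comparison shown in Figure 2.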
We assessed the tool’s accuracy by comparing its responses to our manual assessment, using a simple quantitative scoring system. On average, the AI achieved approximately 85% accuracy across all four documents compared to our own analysis.
A perfect score was not possible due to the subjective nature of the topic – even our team did not always agree on the correct answer. Sometimes, the AI Reader provided insights missed by human analysis; in other cases, it failed to extract relevant information.
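To make the idea of a simple quantitative scoring system concrete, here is a minimal sketch. The three-level rubric (match, partial, mismatch), the weights and the unweighted averaging across documents are assumptions made for illustration; the exact rubric used in the pilot is not specified here, and the ratings in the example are invented rather than taken from the study.

```python
from statistics import mean

# Assumed rubric: 1.0 = AI answer matches the manual answer,
# 0.5 = partial match, 0.0 = mismatch.
SCORES = {"match": 1.0, "partial": 0.5, "mismatch": 0.0}


def document_accuracy(ratings: list[str]) -> float:
    """Share of the maximum possible score achieved for one document."""
    return mean(SCORES[r] for r in ratings)


def overall_accuracy(ratings_by_document: dict[str, list[str]]) -> float:
    """Unweighted average of the per-document accuracies."""
    return mean(document_accuracy(r) for r in ratings_by_document.values())


# Illustrative ratings only, not the pilot's data.
example = {
    "doc_a": ["match", "match", "partial", "mismatch"],
    "doc_b": ["match", "match", "match", "partial"],
}
print(f"{overall_accuracy(example):.3f}")
```

Because team members did not always agree on the correct answer, a rubric like this would in practice be applied against a reconciled manual assessment, and the partial category absorbs some of that subjectivity.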
Importantly, there was significant correlation between human and AI analysis on which questions lacked available data, suggesting the AI could resist the urge to “please” by fabricating answers. Notably, we did not observe hallucinated facts or incorrect answers, likely due to safeguards in our prompt design (e.g. requiring page references and direct quotes, and giving explicit instructions not to hallucinate).
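Requiring page references and direct quotes also makes answers checkable after the fact. The sketch below shows one way such a check could work, assuming the document text is available page by page as plain strings. The function names and the sample pages are hypothetical, and this is not a description of a feature of the AI Reader.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks in extracted text do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_is_on_page(quote: str, page: int, pages: dict[int, str]) -> bool:
    """Return True if the quoted passage appears on the cited page."""
    page_text = pages.get(page, "")
    return bool(quote.strip()) and normalize(quote) in normalize(page_text)


# Invented page texts for illustration.
pages = {
    12: "The inter-ministerial committee met\nquarterly and resolved budget conflicts.",
    13: "No evaluation of outcomes was carried out.",
}
print(quote_is_on_page("committee met quarterly", 12, pages))  # True
print(quote_is_on_page("committee met quarterly", 13, pages))  # False
```

A check of this kind only confirms that a quote exists where the answer says it does. Whether the quote actually supports the answer still needs human judgement.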
We also cross-analysed the answers from multiple runs to assess the consistency of the tool’s outputs and determine how often answers were the same or similar, comparable or significantly different. The findings are cautiously promising: consistency across the four documents ranged from 69% to 90%.
One recurring issue was that while three or four runs often produced similar results, one run would occasionally differ. The reason for this discrepancy remains unclear.
As a potential solution for the final analysis of the full dataset, we suggest running each document twice to help identify and compensate for potential outliers or inaccuracies. However, some inconsistency is likely unavoidable when working with AI tools, due to the inherent “black box” nature of LLMs such as ChatGPT.
Additional considerations when conducting multiple runs include the environmental impacts of repeated processing and the added workload of comparing and consolidating the results.
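The cross-run comparison described above could be partly automated to reduce the consolidation workload. The following is a minimal sketch, assuming that character-level string similarity is an acceptable first-pass proxy for "same or similar" answers and that 0.8 is a reasonable cut-off. Both are assumptions for illustration, not the method used in the pilot, and the function names and sample answers are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a crude stand-in for human judgement."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def consistency(answers: list[str], threshold: float = 0.8) -> float:
    """Share of run pairs whose answers are at least `threshold` similar.

    Expects answers from at least two runs.
    """
    pairs = list(combinations(answers, 2))
    return sum(similarity(a, b) >= threshold for a, b in pairs) / len(pairs)


def most_divergent_run(answers: list[str]) -> int:
    """Index of the run whose answer is least similar, on average, to the others."""
    def mean_similarity(i: int) -> float:
        others = [a for j, a in enumerate(answers) if j != i]
        return sum(similarity(answers[i], o) for o in others) / len(others)

    return min(range(len(answers)), key=mean_similarity)


# Invented answers to one question from four runs on the same document.
answers = ["No data", "No data", "No data", "The committee met quarterly"]
print(consistency(answers))         # 0.5
print(most_divergent_run(answers))  # 3
```

String similarity will miss answers that are worded differently but mean the same thing, so a flagged run is a prompt for manual review rather than proof of an error.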
Returning to our central question – “how can we ensure that AI-generated data meets the quality standards demanded by academic scrutiny?” – our pilot review suggests that, despite some limitations, the consistency, accuracy and reliability of AI-generated data are sufficiently high.
This makes advanced tools like the SEI AI Reader a promising solution for overcoming the methodological challenges and time constraints involved in systematically processing and synthesizing the vast and growing body of climate policy evaluations, particularly grey literature. These tools can help derive actionable insights from past policymaking experiences.
Our next step is to expand the analysis to a larger set of documents to validate the current calibration and assess the scalability of our approach before proceeding to the main analysis.
Even at this early stage, our findings suggest that AI tools hold considerable potential for analysing large volumes of policy-related documents, supporting the identification of common patterns in successful policy implementation across varied national and thematic contexts.
This research contributes to closing persistent empirical gaps by:
Babis, W., Muñoz Cabré, M., Martelo Llerena, C., Salzano, C., Torres-Morales, E., & Arsadita, F. (Forthcoming, 2025). SEI AI Reader [Dataset]. Stockholm Environment Institute.
Browne, K., Dzebo, A., Iacobuta, G., Faus Onbargi, A., Shawoo, Z., Dombrowsky, I., Fridahl, M., Gottenhuber, S., & Persson, Å. (2023). How does policy coherence shape effectiveness and inequality? Implications for sustainable development and the 2030 Agenda. Sustainable Development, 31(5):3161-3174. https://doi.org/10.1002/sd.2598
Dzebo, A., Shawoo, Z., & Browne, K. (2025). Does policy coherence make national implementation of global sustainability goals more successful. Annual Review of Environment and Resources, EG50. https://doi.org/10.1146/annurev-environ-111523-102337
Nilsson, M., Hackmann, H., Sokona, Y., Guilanpour, K., Oni, T., Dzebo, A., & Onoda, S. (2024). Seeking synergy solutions: policies that support both climate and SDG action. Expert Group on Climate and SDG Synergy. UN Department of Economic and Social Affairs. https://sdgs.un.org/sites/default/files/2024-06/Thematic%20Report%20on%20Climate%20and%20SDGs%20Action-060824.pdf


