3. Standards of Evidence: Mapping the experience in OECD countries

Abstract

While many experts in governments recognise the importance of using evidence in the policy-making process, figuring out which evidence is robust and translating it to decision makers is challenging. This chapter reviews the existing standards used to strengthen the quality of evidence supporting the design of public interventions, policies and programmes, and promoting its uptake in policy-making. The chapter covers seven issues: evidence synthesis; theory of change and logic underpinning the programme; design and development of policies and programmes; efficacy; effectiveness; cost (effectiveness); and implementation and scale up of intervention. Each subsection offers details of why each standard is important and a summary of existing approaches used in OECD countries.

Introduction

The approaches to evidence standards found across OECD countries are focused on a variety of forms of evidence and cover seven standards of evidence. This chapter is structured as follows: i) the four main functions of the standards of evidence; ii) distribution of standards of evidence across the OECD countries; iii) forms of evidence that are addressed by the standards; iv) and, an introduction to the seven standards of evidence reviewed in this report.

What are standards of evidence for?

The standards of evidence reviewed in this report vary in terms of the ‘unit of analysis’ they focus on. Some standards focus on the entirety of the existing evidence base (evidence synthesis). This includes standards focused on assessing the quality of existing evidence syntheses and standards focused on the generation of new evidence syntheses (see Figure 3.1).

Figure 3.1. Focus of the standards of evidence

Other standards are focused on generation of new evaluation evidence. This includes standards for assessing the evidence for an individual intervention as well as standards for supporting the development of an intervention. The approaches also vary in whether they address one key function or whether they address multiple functions. More than half of the approaches have one key function, versus the rest with multiple functions.

Assessing the strength of evidence of an intervention

Assessing the strength of evidence of an intervention refers to standards that examine the quality design and the robustness of findings of single studies in order to determine the strength of evidence for individual interventions. This assessment also involves an analysis of the findings and impacts in the study/studies. For instance, What Works for Kids (Australia) rates the evidence according to the evaluation(s) that has been conducted on each programme (Nest What works for kids, 2012[1]). Another approach that focuses on individual intervention is Social Programmes That Work (USA) , that seeks to identify those social programmes shown in rigorous studies to produce sizable, sustained benefits to participants (Coalition for Evidence-Based Policy, 2010[2]).

Assessing bodies of evidence

Assessing bodies of evidence refers to standards for appraising the totality of evidence included in a review of an evidence base. This includes: (a) the nature of the totality of evidence; (b) the extent and distribution of that evidence; and (c) the methods for undertaking a review (Gough and White, 2018[3]). Health Evidence provides a Quality Assessment Tool to evaluate systematic reviews or bodies of evidence (Health Evidence, 2018[4]). Another example is the Practice Scoring Instrument from Crime Solutions (USA). They present guidelines to identify a body of evidence (e.g. what qualifies as an eligible meta-analysis?) and evaluate it: eligibility criteria; comprehensive literature search; methodological quality; publication bias (Crime Solutions, 2013[5]).

Reviewing the evidence base for an intervention

Reviewing the evidence base for an intervention refers to standards for creating evidence synthesis. This can include methods to identify, select, appraise, and synthesize high quality research. For instance, Clearinghouse for Labor Evaluation and Research (CLEAR), reviews the evidence base for interventions, providing information related to the selection process, key features of all the relevant research identified for a given topic area, and reference documents of the review process (Clearinghouse for Labor Evaluation and Research, 2017[6]). An additional example is The Community Guide approach, which has a guide for the execution of systematic reviews (Zaza et al., 2000[7]).

Supporting the development of an intervention

Supporting the development of an intervention refers to standards that focus on creating guidelines for the use of entities to better understand how an intervention fits into an implementing site’s existing work and context (National Implementation Research Network, 2018[8]). The European Monitoring Centre for Drugs and Drug Addiction (EMCDDA) has developed the European drug prevention quality standards, which outline the steps to be taken when planning, conducting, or evaluating programmes. These standards inform the development of interventions and serve as a reference framework for professional development (2011[9]).

What forms of evidence are covered by standards of evidence?

The mapped approaches cover a variety of different types of evidence, including quantitative research, impact evaluation, systematic review, and qualitative research (See Figure 3.2). Approaches can focus specifically on one type of evidence or may focus on more than one type. A small number only cover one kind of evidence, versus the majority with multiple use of these types of evidence. Education counts describes the four types (Alton-Lee, 2004[10]).

Most of the approaches assess evidence impact evaluations. This can include Randomized Control Trials (RCTs) and Quasi-Experimental Designs (QEDs).
Thirty-four approaches concern quantitative methods. This includes approaches that assess correlational analyses, single case studies and pre-post studies without control groups.
Twelve approaches evaluate qualitative methods. This includes approaches that include interviews, focus groups, panels of experts, or ethnographies. Qualitative research is often concerned with the implementation process
Ten approaches assess systematic review or meta-analysis.

What are the seven issues covered in the standards of evidence?

The seven standards of evidence are: Evidence synthesis; Theory of Change and Logic underpinning the Programme; Design and Development of Policies and Programmes; Efficacy; Effectiveness; Cost (effectiveness); and, Implementation and scale up of intervention.

Evidence Synthesis

Evidence synthesis informs policy makers of what is known from research, making it fundamental for informing policy decisions and for promoting the uptake and use of evidence from evaluations and other evidence (Oliver et al., 2018[11]; Shemilt et al., 2010[12]). Evidence syntheses come in a variety of forms and of varying quality (as with primary studies), so standards to enable readers to appraise the quality of evidence synthesis are critical.

Evidence synthesis is an important tool for good knowledge management. Given the breadth of literature, including impact evaluations and RCTs, being published each year, knowledge management is essential as it becomes more difficult for policy makers and practitioners to keep abreast of the literature. Furthermore, policies should ideally be based on assessing the full body of evidence, not single studies, which may not provide a full picture of the effectiveness of a policy or programme.

Why is it important?

Evidence syntheses provide a vital tool for policy makers and practitioners to find what works, how it works – and what might do harm. Evidence syntheses are also critical in informing what is not known from previous research. As with primary studies, readers can (and should) appraise the quality and relevance of evidence synthesis (Gough, Thomas and Oliver, 2019[13]).

Since the early 2000s, across many sectors and countries, there has been an increase in the number of impact evaluations, including Randomization Control Trials (RCTs), being published each year (White, 2019[14]). For example, in education around ten RCTs were published each year in the early 2000s, growing to over 100 a year by 2012. As the number of studies increases, it becomes more difficult for policy-makers and practitioners to keep abreast of the literature. Furthermore, evidence synthesis allows for the amalgamation of findings and easier navigation of bodies of literature (Gough, Thomas and Oliver, 2019[13]), not single studies—which may not provide a full picture of the effectiveness of a policy or programme.

Evidence synthesis can come in a variety of forms, depending on the research questions and resources available, such as:

Map of maps: provide reports from other evidence and gap maps in that policy space and by doing so act as a navigation tool (Gough, Thomas and Oliver, 2019[13]);
Mega-maps: show other maps and reviews, but not primary studies;
Evidence and gap maps: are even broader in scope but report a far more limited range of information about the reviews and primary studies they include (Saran and White, 2018[15]);
Review of reviews: may be broader in scope but may be more restricted in the depth of analysis. This method only includes existing reviews, preferably systematic, rather than primary studies (Saran and White, 2018[15]);
Systematic reviews: are narrow in scope but provide in-depth analysis (Saran and White, 2018[15]). This is the most robust method for reviewing, synthesising and mapping existing evidence on a particular policy topic. It is more resource-intensive, as it can take up to 8 to 12 months minimum and requires a researcher team (The UK Civil Service, 2014[16]). Systematic reviews have a number of stages, including: defining the review question; conceptual framework; inclusion criteria; search strategy; screening; coding of information from each study; quality and relevance appraisal; and, synthesis of study findings to answer the review question (Gough, Thomas and Oliver, 2019[13]).
Meta-analysis: It refers to the use of statistical methods to summarise the results from individual programme evaluations on a given topic. A meta-analysis produces a weight-of-the-evidence summary to achieve a specific outcome or the relationship between one outcome and another; and therefore, to draw an overall conclusion about the average effectiveness of a programme (Washington State Institute for Public Policy Benefit, 2017[17]).
Rapid Evidence Assessment (REA): It is a quick overview of existing research on a (constrained) topic and a synthesis of the evidence provided by these studies to answer a specific policy issue or research question. REAs tend to be rigorous and explicit in method and thus systematic but make concessions on the depth of the process by limiting particular aspects of the systematic review process such as the screening stage (e.g. only electronically available texts) or considering using less developed search strings (The UK Civil Service, 2014[16]).
Quick Scoping Review: It consists of a quick overview of the available research— from accessible, electronic and key resources, going up to two bibliographical references on a specific topic —to determine the range of existing studies on the topic. This non-systematic method can take from 1 week to 2 months strings (The UK Civil Service, 2014[16]).

Evidence synthesis also has a critical role to play in evidence-informed recommendations and guidance. In a number of policy areas, notably in health, formal processes have been developed for interpreting research evidence in order to develop and make recommendations (Ferri and Griffiths, 2015[18]) (see Box 3.1). At the European level, the EMCDDA (2020[19]) with its experience in monitoring and disseminating best practice promotes and supports guideline adaptation. An inventory of national guidelines and standards in treatment, prevention and harm reduction functions as a tool for ensuring that there are processes for translating the evidence base into appropriate recommendations and guidelines. At a global level the WHO produces guidelines that are underpinned by evidence synthesis (Oxman, Lavis and Fretheim, 2007[20]).

Box 3.1. Evidence Syntheses as a tool to support decision making

Systematic reviews that provide accurate estimates of the effects of a policy or intervention can have a significant influence on decisions about whether to implement or disinvest in an intervention.
Several global organisations produce evidence informed guidelines based on systematic reviews. The WHO produces guidelines that are underpinned by systematic reviews of interventions and health technologies that aim to follow a transparent and evidence-informed process.
The Grading of Recommendations Assessment, Development and Evaluation (GRADE) Working Group has taken a leading role in developing guidance and methods for using research evidence to inform decision making. GRADE offers an explicit and transparent system for rating certainty in the evidence that underpins conclusions in a systematic review, a Health Technology Assessment or a guideline.
Many of the approaches mapped by this report either use or are based-on GRADE. GRADE is used by over 100 organisations worldwide, including the WHO, the Cochrane Collaboration and the Campbell Collaboration. The Campbell Collaboration aims to promote positive social and economic change through the production and use of systematic reviews and other evidence synthesis for evidence-informed policy and practice.

Source: Adapted from Montgomery, Movsisyan, Grant, Macdonald and Rehfuess (2019) Saran and White (2018[15]).

Many OECD countries have a strong focus on producing systematic reviews and evidence-informed recommendations and guidance, following the long established practice of the Cochrane centres in the health area. The Danish government funds ‘Cochrane Denmark’, which supports synthesizing and dissemination of the best available evidence for health professionals, researchers and decision-makers (The Cochrane Collaboration, 2021[21]). The Norwegian Institute of Public Health, a government agency under the Ministry of Health and Care Services, has a strong focus on producing evidence synthesis to support decision making. Recent reviews include a live map of COVID-19 evidence (Norwegian Institute of Public Health, 2020[22]) and a review of weight reduction strategies among adults with obesity (Norwegian Institute of Public Health, 2021[23]).The Swedish Agency for Health Technology Assessment and Assessment of Social Services (SBU) is an independent national agency, tasked by the government with assessing health care and social service interventions, covering medical, economic, ethical and social aspects. SBU conducts health technology assessments and systematic reviews of published research to support key decisions in health, medical care and social services. These approaches are now being extended to other areas, beyond health, such as social policy, The SBU is currently pioneering a new international initiative in this area.

Summary of the mapping of existing approaches

Of the approaches included in the mapping, around half concern evidence synthesis. These could be divided into two broad categories, standards for assessing the quantity and quality of existing evidence syntheses and standards for executing evidence synthesis.

Standards for assessing the quantity and quality of existing reviews.

Around half of the approaches concerned with evidence synthesis were primarily concerned with providing standards for the quantity and quality of existing syntheses.

Some approaches were primarily concerned with completing a review of existing reviews and translating this into conclusions about the strength of the evidence base for a policy or programme. These approaches include the Education Endowment Foundation’s Teaching and Learning Toolkit (UK), What Works for Health (USA) and the European Monitoring Centre for Drugs and Drug Addiction. For example, What Works for Health (2010[24]) has a rating of ‘Scientifically Supported’ which is awarded to interventions that have one or more systematic review(s). The Education Endowment Foundation has developed a ‘padlock’ rating system to rank the practices within the Teaching and Learning Toolkit (see Box 3.2).

Box 3.2. The Education Endowment Foundation’s Padlock Rating System

This rating is designed to determine the strength of the causal inference for impact on learning outcomes in schools, the quantity and consistency of the findings (both the overall pooled effect and the pattern of effects relating to moderator variables) and the ecological validity of the studies (where studies took place in schools with interventions managed by teachers rather than researchers). The security ratings are allocated as following:

1. Very limited: One padlock: Single studies with quantitative evidence of impact, including effect size data reported or calculable (e.g. from randomised controlled trials, well-matched experimental designs, regression discontinuity designs, natural experiments with appropriate analysis); and/or observational studies with correlational estimates of effect related to the intervention or approach; but no publically available meta-analyses.
2. Limited: Two padlocks: At least one publically available meta-analysis
3. Moderate: Three padlocks: Two or more publically available meta-analyses, which meet the following criteria: they have explicit inclusion and search criteria, risk of bias is discussed, and tests for heterogeneity are reported. They include some exploration of methodological features such as research design effects or sample size.
4. Extensive: Four padlocks: Three or more meta-analyses, which meet the following criteria: they have explicit inclusion and search criteria, risk of bias is discussed, and tests for heterogeneity are reported. They include some exploration of the influence of methodological features such as research design effects or sample size on effect size. The majority of included studies should be from school or other usual settings.
5. Very Extensive: Five padlocks: Three or more meta-analyses, which meet the following criteria: they have explicit inclusion and search criteria, risk of bias is discussed, and tests for heterogeneity are reported. They include some exploration of the influence of methodological features such as research design effects or sample size on effect size. The majority of included studies should be from school or other usual settings.

Source: Adapted from Education Endowment Foundation (2018[25]).

A small number of approaches go further in providing tools that can be used to rate the quality of existing evidence syntheses in order to reach conclusions about the strength of evidence of the body of evidence underpinning a policy or programme. These approaches include ROBIS, Crime Solutions and the EMMIE framework used by the What Works Centre for Crime Reduction and the What Works Centre for Children’s Social Care. Some of these tools originate in the academic literature but have not yet been used by international clearinghouses. ROBIS is one of the most comprehensive and is described in detail in Box 3.3.

Box 3.3. ROBIS - tool to assess risk of bias in systematic reviews

The ROBIS tool aims to assess the risk of bias in systematic reviews (rather than in primary studies), and the relevance of a review to the research question. The development of this tool was based on a four stage approach: define the scope; review the evidence base; hold a face to face meeting; and refine the tool through piloting. The evaluation (or tool) is completed in three (3) phases:

1. Assessing relevance (optional): 1) Assessors report the question that they are trying to answer (target question) in terms of the PICO (participants, interventions, comparisons, and outcomes) or equivalent. 2) Assessors complete the relevant PICO for the systematic review. 3) Assessors are asked whether the two questions match. If a review is being assessed in isolation and there is not target question, then this phase can be omitted.
2. Identifying concerns with the review process: this phase involves the assessment of four domains covering the key review process:
1. a. Study eligibility criteria: aims to assess whether primary study eligibility criteria were pre-specified (rather than on existing knowledge of the studies themselves), clear, and appropriate for the review question.
2. b. Identification and selection of studies: aims to assess whether any primary studies that would have met the inclusion criteria were not included in the review (i.e. review of databases, range of terms in the searching, screening titles and abstracts, and assessing full text- studies for inclusion).
3. c. Data collection and study appraisal: aims to assess whether bias may have been introduced through the data collection or to establish the risk of the bias assessment process (e.g. transcribing data or failing to collect relevant information). Therefore, the risk of bias assessment should involve a minimum of two reviewers working independently.
4. d. Synthesis and findings: aims to assess whether, given the decision has been made to combine data from the primary studies, the reviewers have used appropriate methods to do so (i.e. whether between-study variation (heterogeneity) variation is taken into account, publication bias, or interpreting standards errors as standards deviations (in meta-analysis)).
3. Judging risk of bias: This phase considers whether the systematic review, as a whole, is at risk of bias. This assessment uses the same process as in phase 2, but the judgment regarding the bias is replaced with an overall judgment of risk of bias.

Source: Adapted from Whiting et al (2016[26]).

Several clearinghouses and What Works centres have also developed their own frameworks to assess existing evidence syntheses. The EMMIE framework score focuses on five dimensions which should be covered in any systematic reviews intended to inform crime prevention (Johnson, Tilley and Bowers, 2015[27]). These are the Effect of intervention, the identification of the causal Mechanism(s) through which interventions are intended to work, the factors that Moderate their impact, the articulation of practical Implementation issues, and the Economic costs of intervention. In the US, the Crime Solutions clearing house has also developed a detailed scoring system that is applied to existing systematic reviews (See more in Box 3.4).

Box 3.4. Crime Solutions – Practice Scoring Instrument

The National Institute of Justice’s CrimeSolutions.gov from USA comprises two components — a web-based clearinghouse of programmes and practices, and a process for identifying and rating those programmes and practices. From its second component, Crime solution provides a scoring instrument to assess the quality, strength, and extent of a meta-analysis. The quality rating section analyses each meta-analysis, defined as eligible for consideration, under the following criteria:

Eligibility Criteria - rates the degree to which the meta-analysis provides a clear, detailed statement of the inclusion and exclusion criteria used to determine whether primary studies were eligible for inclusion in the final meta-analysis. Inclusion and exclusion criteria must be outlined.
Comprehensive Literature Search - rates the degree to which the meta-analysis conducted an exhaustive, comprehensive review of the literature in an attempt to identify all eligible studies.
Grey Literature Coverage - assesses the extent to which a meta-analysis includes results from unpublished or “grey” literature sources.
Coder Reliability - assesses how the authors of the meta-analysis handled reliability of data extraction from the primary research reports (ideally, two or more coders).
Methodological Quality - assesses the extent to which the authors of the meta-analysis were aware of and attentive to the methodological quality of the studies included in the meta-analysis.
Outlier Analysis - assesses whether the meta-analysis checks for effect size outliers in the data.
Handling Dependent Effect Sizes - rates a meta-analysis based on its appropriate analysis of effect sizes (e.g. appropriate statistical procedures were used to handle dependent effect size estimates).
Effect Size Reporting - assesses whether the meta-analysis reported an aggregate mean effect size that synthesized (averaged) effect sizes across one or more sets of multiple studies, and whether the meta-analysis provided estimates of precision around the point estimate(s).
Weighting of Results - assesses whether the meta-analysis uses appropriate weighting schemes when estimating mean effect sizes and in other analyses in order to give greater weight to the effect sizes estimated with more precision (e.g., based on larger samples).
Analysis Model - rates a meta-analysis based on whether the authors recognized and addressed the issue of random effects versus fixed effect analysis models (Preference is given to random effects models because they provide results that are generalizable)
Heterogeneity Attentiveness - rates a meta-analysis on whether the authors were aware of and attentive to heterogeneity (i.e. variability) in the effect size estimates from the studies in the meta-analysis.
Publication Bias - refers to whether the meta-analysis descriptively or statistically assessed the possibility of publication bias in the results (e.g. funnel plot graphs or contour-enhanced funnel plot graphs)

Source: Adapted from Crime solutions: Practices Scoring Instrument (2013[5]).

Standards for executing evidence synthesis

Seventeen of the approaches focus on standards for executing evidence synthesis. Figure 3.3 provides an example of the general stage for conducting reviews. First, standards for setting up and scoping a review (Stage 1), followed by standards for searching research (Stage 2), and standards rating the quality of evidence and strength of recommendations (Stage 4). Details of the standards in stage 1, 2 and 4 will be presented below. For example, The Equator Network contains a comprehensive searchable database of reporting guidelines and also links to other resources relevant to research reporting, this includes guidelines for systematic reviews from deciding the scope and the title of the review, to drawing conclusions (The EQUATOR Network, 2020[28]).

Figure 3.3. Methodology for conducting a review

Standards for setting up and scoping a review

Several approaches recommend the development of a protocol to define the conceptual framework for the review, the main review question, the inclusion and exclusion criteria, the review methods, and its documentation. These approaches include the Campbell Collaboration (2019[30]), What Works for Wellbeing (UK) (2017[31]), and EIF (UK) (2018[32]).

Other approaches also stipulate the use of the PICOS (Participants, interventions, comparisons, outcomes, and study design) framework to determine the inclusion and exclusion criteria of the review. For example, Campbell Reviews stipulate that the inclusion criteria should be stated specifically enough, with key terms clearly defined, to be applied with consistent results by anyone screening studies.

Standards for searching research

Some approaches stipulate that a search process should be transparent, comprehensive, and replicable. For instance, Education counts (NZ) (2004[10]) emphasises the importance of being transparent in approach, including the use of language when making claims as a fundamental tool to support both rigour and effective communication through each synthesis report.

Several approaches go further and provide guidelines to develop the search protocol, the search strategy, and how to document the search process. For instance, What Works for Wellbeing (UK) (2017[31]) specifies what to include in the search protocol (e.g. electronic sources to be searched, and restrictions), and refers to the need to balance sensitivity (ability to identify relevant information) and precision (ability to exclude irrelevant documents).

Other approaches also stipulate considering grey literature in the review to avoid publication bias. For instance, Evidence Based Teen Pregnancy Programmes (USA) (2016[33]) identifies new studies through public calls for new and unpublished research. Early Childhood Foundation (2018[32]), on the other hand, provides methodologies in practice to measure the publication bias in their systematic reviews.

Standards for rating the quality of evidence and strength of recommendations

Most of the approaches agree that quality assessment is a critical stage of the evidence review process. Some approaches provide checklists to evaluate the evidence, including the issues of efficacy and effectiveness addressed in the following sections (Efficacy of an Intervention and Effectiveness of Interventions), including evidence for ESSA (USA) (2019[34]), the Strengthening Families Evidence Review (USA) (2018[35]), and the Community Guide (USA) (2018[36]). Another example is the European Food Safety Agency (2010[37]) which has produced guidance on the application of systematic review methodology to food safety assessments to support decision-making, which includes a number of key conclusions concerning the importance of methodological quality assessment:

In a systematic review, each study should undergo a standardised assessment, checking whether it meets or not a predefined list of methodological characteristics, to assess the degree to which it is susceptible to bias.
There are many stages of the review at which the validity of the individual studies is considered.
Common types of bias that can occur in many different study designs are often classified as selection, performance, detection, attrition and reporting biases.
Assessment of methodological quality involves using tools (e.g. Checklists) to identify those aspects of study design, execution, or analysis, which induce a possible risk of bias.
It is important to distinguish between the quality of a study and the quality of reporting the study, although both may be correlated.

Other approaches not only stipulate a rating for the quality of evidence, but also provide information about the overall impact regarding the multiple outcomes in an intervention or policy. For instance, GRADE has developed a method for creating a clear separation between quality of evidence and strength of recommendations and presents a rating for each of these categories (See below Box 3.5).

Box 3.5. Grading of Recommendations Assessment, Development and Evaluation (GRADE)

The GRADE system provides an explicit, comprehensive criterion for downgrading and upgrading the quality of evidence ratings, and supporting the clarity of the process to translate evidence into recommendations for clinicians, patients, and policy makers.

The GRADE system classifies the quality of evidence at one of the four levels—high, moderate, low, and very low, according to study limitations, directness of evidence, consistency of the results, precision, and reporting bias.

How does the GRADE system consider strength of recommendations?

Strong - refers to when the desirable effects of an intervention clearly outweigh the undesirable effects, or clearly do not

Weak - refers to when the trade-offs are less certain—either because of low quality evidence or because evidence suggests that desirable and undesirable effects are closely balanced—weak recommendations become mandatory.

Source: Adapted from Guyatt et al. (2008[38]).

Key questions

Standards for assessing the quantity and quality of existing reviews.

Does a review include a systematic search for unpublished reports, to avoid publication bias?
Is a protocol developed (project plan) for the review in advance?
Are study inclusion and coding decisions carried out by at least two reviewers who work independently and compare results?
Is study quality assessed?
Does a review undergo peer review and editorial review?
Does a review use rigorous methods to synthesize evidence, including, where appropriate, statistical meta-analysis of quantitative evidence and theory-based analysis of qualitative evidence?

Standards for setting up and scoping a review

Does a review use the PICO (Participants, interventions, comparisons, outcomes, and study design) framework to determine the inclusion and exclusion criteria?
Standards for searching research
Does the review process provide guidelines to develop the search protocol, the search strategy, and how to document the search process?
Does the review process include grey literature?

Standards for rating the quality of evidence and strength of recommendations

Does the review process distinguish between evidence quality and strength of recommendation?
Does the evidence review process provide an independent checklist to evaluate the evidence regarding the issues of efficacy and effectiveness?
Does the evidence review process provide an independent system to evaluate the strength of recommendation according to the effects of an intervention or programme and the quality of evidence?

Theory of Change and Logic Model underpinning the programme

Whereas the previous section of this chapter focused on the synthesis of existing evidence, the remaining sections focus on standards for various aspects of primary evidence generation using a monitoring and evaluation framework. This section focuses on the theory of change and logic model underpinning a programme.

A theory of change can be defined as a set of interrelated assumptions explaining how and why an intervention is likely to produce outcomes in the target population (European Monitoring Centre for Drugs and Drug Addiction, 2011[9]). A logic model sets out the conceptual connections between concepts in the theory of change to show what intervention, at what intensity, delivered to whom and at what intervals would likely produce specified short term, intermediate and long term outcomes (Axford et al., 2005[39]; Epstein and Klerman, 2012[40]). In some cases, a single theory of change might be difficult to identify due to multiple and complex interactions, it might be difficult to identify a unique course of action and the underlying policy goals could be multiple and conflicting; this should not impede the activation of evidence processes.

Why is it important?

. Engaging in the process of developing a theory of change leads to better policy planning and implementation, because the policy or programme activities are linked to a detailed and plausible understanding of how change happens; while a logic model is a critical tool to allow detailed coherent and realistic policy planning.

Although a theory of change and a logic model are often expected to be developed during the planning stage, putting them in practice can be tied to time settings, political context, etc. However, they can also be useful in the monitoring and evaluation stage. For instance, to identify key indicators for monitoring or gaps in available data (Better Evaluation, 2012[41]). A full list of the benefits of developing both a theory of change and logic model is reproduced in Box 3.6.

Box 3.6. The benefits of developing an intervention theory of change and logic model

The evaluability of the programme—both for implementation and for outcomes—is facilitated, by signposting appropriate metrics.
The original intentions of the programme developers are clearly set out, and are explicit and open to critique.
The underlying logic of the assumptions made in the theory, for example, that undertaking a certain activity will lead to a particular outcome, can be scrutinised.
The realism of the assumptions made by the programme developers can be checked against wider evidence of ‘what works’, to assess the likelihood of the programme being successful.
Commissioners can check the programme meets their needs; and providers and practitioners delivering the programme can check their own assumptions and the alignment of their expectations against the original intentions of the programme developers.
The key parameters or boundaries (e.g., who is the programme for, and under what specific circumstances) can be set out, reducing the likelihood that the programme is used inappropriately or ineffectively.
Core components (of content, or of implementation, or both) that are believed to be essential to the programme’s effectiveness can be identified.
Activity traps can be identified and avoided.
The most important features of the implementation model of the programme can be captured, enabling delivery that adheres to the original model and helps to prevent programme drift during maturation and scaling.

Source: Adapted from Ghate (2018[42]).

Although this report focusses on standards of evidence as they apply to discrete interventions, many of the concepts relevant to theory of change are also relevant to the discussions about policy evaluation, and around results-oriented policies, such as the importance of clearly distinguishing between concerns of input, output, outcome/result and impact (European Commission, 2011[43]; Gaffey, 2013[44]) For example, the EU Cohesion Policy (European Commission, 2018[45]) sets out several important changes in the understanding and organisation of monitoring and evaluation, notably the emphasis on a clearer articulation of policy objectives (see Box 3.7).

Box 3.7. Guidance Document on Monitoring and Evaluation – European Commission

The European commission under the Directorate General for Regional and Urban Policy presented the Guidance Document on Monitoring and Evaluation to facilitate the implementation of programmes and to encourage good practices. Particularly, the first section: Intervention logic of a programme as starting point. Results and result indicators highlights key concepts and terms of programming, monitoring and evaluation to form the basis of practical applications in the policy cycle. (European Commission, 2018[45]).

The following graph (see Figure 3.4 illustrates their framework. As a starting point, a problem is addressed (need); following by the intended result that motivates policy action (dimension of well-being and progress for people). Once an objective of a policy has been chosen, they propose appropriate measures (result indicators: responsive to policy, normative, robust, and timely collection of data) to facilitate a later judgement of the policy. The specific activity of programmes leads to outputs (direct products of programmes), which are intended to contribute to results. Finally, the approach suggests identifying other factors that can drive the intended result towards or away from the desired result, and by which policymakers should be aware.

Figure 3.4. Monitoring and Evaluation framework

Finally, the guideline suggests that a representation of a programme should reflect that an intervention can lead to several results and that several outputs can lead to those changes. Similarly, it recommends differentiating the result(s) by affected groups and time horizons. The guideline also offers example for the use of result indicators and a list of common indicators used for Regional and Urban Policies.

Sources: Adapted from the European Commission (2018[45]), Guidance document on Monitoring and Evaluation.

Summary of the mapping of existing approaches

Of the approaches included in the mapping, around half include some coverage of either intervention theory of change or logic model.

Standards concerning theory of change

All of the approaches stipulated that the intervention should be underpinned by a theory of change, but the approaches vary in terms of how rigorous the theory of change must be to meet the required standard. For example, the Level 1 standard from Project Oracle requires a theory of change and an evaluation plan (2018[46]). Similarly, the Level 1 standards from Nesta (2013[47]) stipulates that the intervention should specify what it does and why it matters in a logical, coherent and convincing way. The Nesta standards identify that this standard represents a low threshold, appropriate for early stage innovations, which may still be at the idea stage.

Around half of the approaches go further in stipulating that the theory of change needs to be explicitly based on scientific theory and/or evidence. These approaches include the Canadian Best Practices Portal, the EU-Compass for Action on Mental Health and Well-being, Blueprints and SUPERU. Blueprints, for example stipulates that for an intervention to meet the ‘Promising Programs’ category it must clearly identify the outcome the programme is designed to change and the specific risk and/or protective factors targeted to produce this change in outcome (Blueprints for Health Youth Development, 2015[48]). The EU-Compass for Action on Mental Health and Well-being (2017[49]) stipulates that for an intervention to be considered ‘evidence and theory based’ it must be built on a well-founded programme theory which is evidence based, with the effective elements in the intervention stated and justified.

A number of standards go further in providing detailed criteria against which an intervention’s theory of change could be rated. These criteria also facilitate comparisons between different interventions according to the quality of their theory of change. These approaches include the Early Intervention Foundation (2018[32]), the Green List Prevention (2011[50]), and the EMCDDA’s European drug prevention quality standards (2011[9]). For example the Green List Prevention has a number of criteria for Conceptual Quality described in Box 3.8.

Box 3.8. The Green List Prevention – Conceptual Quality Criteria

There is a theoretically well-defined model of the programme’s effectiveness, the assumed underlying mechanisms have been defined clearly (based on scientifically recognized theoretical models).
The methods and instruments applied are theoretically well grounded.
There is a strong logical relationship between analyses of the problem – malleable factors – goals – target groups – used method.
The programme is targeted at research-based risk- and protective factors
The target group(s) are described comprehensively and precisely.
Instructions for implementation and manuals are clearly deduced from the model.
Goals are defined explicitly and are measurable
Unless the programme was developed in Germany, the original context and the adaptations made are described.

Source: Adapted from Green List Prevention (2011[50])

A small number of approaches turn these criteria into a numerical scale against which a theory of change is assessed. There are differences in the approaches taken according to whether the approach looks at the evidence underpinning a discrete intervention, such as the Office of Juvenile Justice and Delinquency Prevention Model Programs Guide or whether they assess systematic review evidence underpinning a practice such as the What Works Centre for Crime Prevention (2017[51]). Further details of these two contrasting approaches are in Box 3.9.

Box 3.9. Two approaches to grading the quality of a theory of change

Office of Juvenile Justice and Delinquency Prevention Model Programs Guide

The Model Programs Guide includes a category ‘theoretical base’ which measures the degree to which the programme is based on a well-articulated, conceptually sound programme theory–it should explain why the programme should effect change. Acceptable programme theory may be articulated or implicit. The programme should provide an explanation of why and how it is expected to achieve its intended results. It uses the following numerical scale:

3 - A well-articulated programme theory is clearly defined and sound; previous empirical work related to the theory is described; and there is an explanation about how the theory relates to the specific programme components and how this should result in change for the participants.
2 - A programme that defines and describes an empirically supported theory, but does not necessarily connect the theory with the specific components of the programme or provide a theory of change.
1 - A programme that provides very little information on programme theory–perhaps referring to a theory but not describing it, referencing prior empirical support, describing the theory of change, or tying it into programme components.
0 - A programme that does not mention a theory; or a programme based on a theory known to be unsound.

Crime Reduction Toolkit

The Crime Reduction Toolkit is based on assessments of systematic reviews using the EMMIE framework. One of the dimensions of the framework is ‘Mechanism/Theory of change dimension focuses on the following questions:

How is the intervention presumed to work and impact on the outcome?
What is needed to make it work?
What are the anticipated impacts on crime or on behaviour?

To distinguish its quality, the dimension (MechanismQSCORE) is scored using the following scale:

0 (No information) - No reference to theory - simple black box
1 (Limited quality) - General statement of assumed theory
2 (Moderate quality) - Detailed description of theory - drawn from prior work
3 (Strong quality) - Full description of the theory of change and testable predictions generated from it.
4 (Very strong quality) - Full description of the theory of change and robust analysis of whether this is operating as expected.

Source: Adapted from Green list prevention (2011[50]) and Crime Reduction Toolkit (2017[51]).

Standards concerning the logic model underpinning an intervention

Only some of the approaches stipulate that the theory of change should be accompanied by a logic model. The majority stipulate that a logic model is necessary but do not provide detailed guidance on what it should contain. For example, SUPERU (2017[52]) has a category for pilot initiatives which have ‘a plausible and evidence-based logic model or theory of change that describes what the intervention is, what change it hopes to achieve and for whom, and how the intervention is supposed to work (how its activities will cause change)’.

A small number of standards go further in providing detailed criteria against which a logic model could be assessed. These include the Society of Prevention Research Standards (2015[53]) which stipulate that they must be described at a level that would allow others to implement/replicate it, including the content of the intervention, the characteristics and training of the providers, characteristics and methods for engagement of participants, and the organisational system that delivered the intervention. The EMCDDA’s European drug prevention quality standards also provide very detailed criteria concerning logic models and the description of the intervention described in Box 3.10.

Box 3.10. EMCDDA’s European drug prevention quality standards – intervention description

Defining aims, goals and objectives

1. It is specified what exactly is being ‘prevented’.
2. The programme’s aims, goals, and objectives are specified.
3. Aims, goals, and objectives are dependent on each other and form a logical progression.
4. Aims, goals, and objectives are:
1. a. informed by the needs assessment;
2. b. useful for the target population;
3. c. clear, understandable, and easy to identify;
4. d. formulated in terms of expected change in participants (‘outcomes’);
5. e. in accordance with professional and good ethical principles.
5. Goals and objectives are specific and realistic.
6. Specific and operational objectives are distinguished.

Defining the setting

1. The setting(s) for the activities is (are) described.
The setting:
1. a. is most likely to produce the desired change; i.e. it is relevant to the target population.
2. b. matches the programme aims, goals, and objectives;
3. c. matches the available resources.
2. Necessary collaborations in this setting are identified.
The setting:
1. a. matches identified risk and protective factors;
2. b. matches likely activities with the target population;
3. c. makes participants feel comfortable.

Source: An abbreviated list of the European drug prevention quality standards (2011[9])

The Office of Juvenile Justice and Delinquency Prevention Model Programs Guide was unique amongst the approaches in turning the detailed criteria into a numerical scale against which the programme logic model could be assessed as described in Box 3.11.

Box 3.11. Office of Juvenile Justice and Delinquency Prevention Model Programs Guide

Programme Description criteria

The Programme Description rates the degree to which the programme details are described. A full and thorough description should serve as a guide for the implementation of the programme. It should include the following information:

the logic of the programme,
the details of all key components,
the frequency and duration of the programme activities,
the targeted population,
the targeted behaviour(s) (i.e., the intent of the programme),
the setting.

The rating should reflect the degree to which the provided materials afford an adequate programme description and/or direct the reader to references containing such a description and is assessed using the following numerical scale.

a. 3= All programme details are specified (5-6 items are described).
b. 2= Most programme details are specified (3-4 items are described).
c. 1= Some programme details are specified (1-2 items are described).
d. 0= No programme details are specified.

Source: Adapted from Crime Solutions (2013[54]) Program Scoring Instrument Version 2.0

Key questions

Is there a theory of change that explains how and why the intervention should work?
Does the logic model clearly specify the assumptions underlying the intervention?
Does the logic model specify the target population, intervention's resources and inputs?
Does the logic model clearly describe the intervention's activities and outputs, as well as identify the short and long-term outcomes?

Design and Development of Programmes

Standards concerning the design and development of policies and programmes focus on evidence that tests the feasibility of delivering a policy in practice. At the design and development stage, analysts are often doing important work in testing theories of change and logic models, carrying out process evaluations and pre/post studies.

Why is it important?

Most of the approaches at this stage do not attempt to assess the casual impact of an intervention. Instead, standards concerning design and development aim to identify promising interventions that may be suitable or merit further investigation, at a later stage, for efficacy testing. Efficacy studies (discussed in the next chapter) are complex, time-consuming and expensive to carry out, especially where the collection of new data is required. Therefore, feasibility and pilot studies are an important way of providing information with which to make programme refinements and to inform the design of efficacy studies.

Summary of the mapping of existing approaches

Thirty approaches recognise a phase of design and development of policies and programmes. Most of the approaches categorize these interventions using descriptions such as “emerging”, “delivery and monitoring”, “exploration and development”, or “probable effectiveness”. The approaches can be divided into two broad categories, those establishing the feasibility of an intervention and those focused on piloting the outcomes of the intervention.

Standards for establishing the feasibility of an intervention

A feasibility study typically evaluates whether a range of activities in an intervention, or key components of an intervention’s logic model – including its resources, activities and population reach– are practical and achievable (See Box 3.12). This allows researchers to investigate whether an intervention can work by systematically testing the intervention’s progress towards its intended outputs as it is being implemented (Early Intervention Foundation, 2019[55]).

Feasibility studies can use a variety of quantitative methods (to determine whether the intervention is reaching its delivery and recruitment targets), and qualitative research (to understand the views of the intervention’s recipients and whether these views are consistent with the intervention).

Box 3.12. Feasibility: Can an intervention work?

The Early Intervention Foundation – EIF (UK) presents 10 steps for evaluation success. One of the first steps is to conduct a feasibility study in order to verify if an intervention can really work, and to know under what conditions and with what resources. Key elements include:

Specifying the intervention’s core activities and identify the factors that support or interfere with their successful delivery.
The use of qualitative research methods to understand which factors contribute to the success of the intervention from the perspective of those delivering it.
The use of qualitative methods to understand how those receiving the intervention perceive the intervention’s benefits and whether these perceptions are consistent with the intervention’s original theory of change.
Building an understanding of how best to recruit and retain participants.
Developing systems for monitoring participant reach and core delivery targets.
Applying methods for verifying user satisfaction.
Tracking and documenting intervention costs.

Source: Adapted from Early Intervention Foundation (2019[55]).

Some of the approaches that recognize qualitative methods at this stage are SUPERU (NZ), What Works for Health (USA), and the Agency for Healthcare Research and Quality (USA). For instance, SUPERU (2017[52]) includes personal experiences from individuals participating in the intervention, such as: interviews, case studies, and ethnographic research. What Works for Health (USA) recognizes studies that describe the intervention, and studies that ask respondents or experts about the intervention (e.g. descriptive, anecdotal, expert opinion). Finally, the Agency for Healthcare Research and Quality (USA) (2012[56]), includes non-comparative case studies or anecdotal reports in its “suggestive” category.

Standards for piloting the outcomes of an intervention

A pilot study is a preliminary and often small-scale investigation conducted to assess the feasibility of the methods to be used in a larger and more rigorous evaluation study. These studies may also focus on which measures are most appropriate for testing the target outcomes (Early Intervention Foundation, 2019[55])

Research and Evaluation design

A variety of different approaches are used to provide preliminary support for programme outcomes. These include administrative data, pre-post-test design and correlational analysis. Most of the approaches agree on implementing pre-post-test at this stage such as Project Oracle (UK) (2018[46]), the European Platform for Investing in Children (EU), and What Works for Health (USA). For instance, the European Platform for Investing in Children (2017[57]) recognizes evaluations using at the minimum pre/post design with appropriate statistical adjustments, and What Works for Health (2010[58]) considers studies comparing outcomes before and after an intervention, and with a statistical analysis. The Green List Prevention (Germany), includes benchmark or non-references-studies (Groeger-Roth and Hasenpusch, 2011[50]) which is described in detail in Box 3.13.

Box 3.13. Standards concerning Design and Development: Green List Prevention

The Green list prevention was established by the Crime Prevention Council of Lower Saxony (Germany) to provide an inclusive rating criterion for the Communities That Care Programme – Databank.

The Rating of the Prevention Programmes is divided in three levels:

Level 1: Effectiveness theoretically well-grounded
Level 2: Probable Effectiveness (promising)
Level 3: Proven Effectiveness (effective)

Within levels, there is a Rating of the Evaluation Studies from “No statement on effectiveness possible” to “five stars”. Within Threshold Level 2 to Threshold Level 3, the criteria focuses on conceptual, implementation and evaluation quality in the following order:

No Star

Participant-satisfaction assessment
Pre-post assessment without control-group
Goal-attainment study
Quality-assurance-study

One star

Benchmark / Norm-reference-study
Theory of Change – study

Two stars

RCT or QED with or without follow-up (not in routine context)
Pre-post assessment with control-group(s) in routine context

Finally, an Assessment of the Conclusiveness of Evaluation Results is given regarding the previews rating. If the evaluation design obtained:

No star = no conclusiveness
One to two stars = preliminary

Source: Adapted from Groeger-Roth and Hasenpusch (2011[50]).

Sample

Design and development standards consider several recommendations regarding the study sample, including its representativeness, the sample design, the sample size, and processes for dealing with study drop out.

Representativeness of the sample. Most of the approaches demand study samples that accurately represent the target population and will be relevant to the research question.

Sampling approach. Most approaches specify that the sampling approach should be well-defined and mention its restrictions. For instance, the Housing Associations' Charitable Trust (UK) (2016[59]) mentions that the sampling design should include the setting and location where the data are planned to be collected; and a comprehensive description of the eligibility criteria used to select the study participants and the recruitment methods.

Sample size. Some of the approaches specify a minimum sample size threshold required for the research, but the threshold can be set at different sizes. For instance, the European Platform for Investing in Children (EU) (2017[57]) requires a sample size of at least 20 in each study group. Another example is Project Oracle (2018[46]), which considers that a reasonable sample is at least 30 individuals.

Study drop out. Most of the approaches recognise the relevance of study drop out but most of them do not specify any rate beyond which the strength of evidence is compromised. For example, the Clearinghouse for Labor Evaluation and Research (USA) asks if the researchers took steps to reduce study drop out to resolve these issues (2014[60]). Only a few approaches highlight acceptable rates of drop out. In the EIF (UK) (2018[32]) recommends that overall study attrition should not be higher than 40% (i.e. with at least 60% of the sample retained).

Measurement

Most design and development standards stipulate that an evaluation must use valid and reliable measurements. There are some differences between the specifications across standards about how to specify the validity, reliability, and the independence of measurement. See further information in Box 3.14.

Box 3.14. Definitions: Reliability and validity

Reliability refers to the stability of measurement over a variety of conditions—such as across two measure time point—in which the same results should be obtained. Measurement tools must have established test-retest reliability to provide an accurate estimate of pre-and post-intervention change. Issues with reliability occur when measurement items are ambiguous or are subject to fleeting changes in study participants’ moods.

Validity is concerned with the extent to which measurement accurately corresponds to the real world. This means that a measurement tool measures what it claims to measure. Establishing measurement validity often involves psychometrically testing to verify the internal, structural validity of measures as well the consistency with other measures.

Source: Adapted from Scholtes, Terwee and Poolman (2010[61]), Drost (2011[62]), and Early Intervention Foundation (2019[55]).

Reliability and validity. Most of the approaches stipulate that measurements should be valid and reliable measures of an outcome. For instance, SUPERU (2017[52]) specifies that the evaluation should use valid and reliable methods and measurement tools that are appropriate for participants and relevant to what the intervention is trying to achieve (See Box 3.15). Project Oracle (2018[46]) also stipulates that valid and reliable measurement tools have been used that are appropriate for the participants in the research.

Other standards provide further specification on the technical requirements that the measurements should meet, such as the European Drug Prevention Standards (2011[9]), which requires that measures should demonstrate internal consistency, test-retest, inter-rater reliability; and construct validity.

Box 3.15. Standards concerning Design and Development: Social Policy Evaluation and Research Unit- Superu

Superu was a government agency from New Zealand focused on what works to improve the lives of families, children and whanau 1 Part of its goals was to provide independent standards of evidence for evaluation guidance, funding support, and to promote the development and uptake of evidence-based interventions.

In that direction, Superu presented a Strength of evidence scale that consists in an ascending ranking for quality of evidence, and brings outlooks about the type of evidence that can and should be generated about an intervention as it matures and grows.

The scale comprises five levels:

Pilot initiative
Early stage, good in theory
Progressing, some evidence
Good evidence, sufficient for most interventions
Extra evidence for large or high risk interventions

By Level 2, an intervention should be in operation for around 1 to 3 years. It has met all level 1 criteria, and been evaluated at least once, but it may not yet be possible to directly attribute outcomes to it.

The intervention is evaluated by the following criteria:

Effectiveness:

How efficient was the intervention (the delivery of outputs in relation to inputs)?
The evaluation used a convincing method to measure change: pre- and post-analysis, or a recognised qualitative method. Not necessary with control group.
The evaluation used valid and reliable methods to measure what the intervention is trying to achieve.
Was data was used properly?

Intervention consistency and documentation

There is clarity and documentation about what the intervention comprises.
Manuals and staff training processes.
Resources (money and people) required to deliver the intervention.

← 1. An extended family or community of related families who live together in the same area. (Taken from Oxford Dictionary)

Sources: (SUPERU, 2017[52]).

Independence from the intervention. Some approaches request independency of a measurement from participants and data collecters. For instance, Clearinghouse for Labor Evaluation and Research (USA) (2014[63]) indicates that data collection must reflect methods that produce unbiased results such as independency and objectivity of the outcomes from the research team. The European Drug Prevention Standards (EU) (2011[9]) also agrees that measures must produce results independently of who uses the instrument.

Approach to analysis

Design and development standards highlight the importance of well executed and described analysis, which covers: the data collection, hypothesis testing, and methods of address missing data or other sources of bias.

Data collected. Most approaches stipulate that a complete report should be able to explain and justify why and how the analysis was conducted. For instance, Housing Associations' Charitable Trust (2016[59]) requires the study protocol, recording of any deviations, and a structured report of findings.

Hypothesis testing. Most of the approaches stipulate a clear description of the analysis methods selected to test the research question. For example, Clearinghouse for Labor Evaluation and Research (USA) (2014[63]) demands analysis methods that are very well-described, relevant to the research question, sufficiently rigorous, and correctly executed.

Missing data. Many of the approaches consider issues of missing data, with most requesting that the analysis specifies how these issues were managed, and how this could affect the interpretation of the findings. For instance, Project Oracle (2018[46]) asks if the research provides all the details concerning the data analysis, or any weaknesses of the design, and their impact on the results. Another example is European Drug Prevention Standards (EU) (2011[9]), which requests reporting and appropriate handling of missing data.

Findings and conclusions

Design and development standards request coherence between the programme’s theory of change, the data analysis, and findings. Some of the approaches go further and specify that findings should be statistically significant on at least one of the outcomes; and not have harmful effect. Other approaches define the findings in this stage as unclear/undetermined effects.

Statistical significance. Among the approaches dealing with quantitative research, there are variations between the information required about statistical significance. Some approaches only recommend that the results were tested for statistical significance, such as Project Oracle (UK) (2018[46]). Other approaches stipulate that the findings must be significant. For instance, What Works for Health (USA) (2010[58]) scores pre/post studies with statistically significant favourable findings higher, and Evidence for ESSA (USA) (Every Student Succeeds Act - ESSA, 2019[34]), which requires findings of a statistically significant effect for correlational studies. Some other approaches require a specific level of significance, such as European Platform for Investing in Children (EU) (2017[57]), which asks for positive results at 10 % of significance.

No Harmful effects. Many of the approaches expect that the intervention does not constitute a risk of harm. For example, The Centers for Disease Control and Prevention suggest that studies should indicate any negative effect. (Puddy and Wilkins, 2011[64]).

Unclear/undetermined effects. Most of the approaches accept unclear or undetermined effects given the type of evidence, and rigorous on the study design. For example, SUPERU (NZ) (2017[52]) mentions that at this stage an evaluation (pre/post study) indicates some effect, but it may not yet be possible to directly attribute outcomes to it. Another example is the Housing Associations' Charitable Trust (UK) (2016[59]), where the lack of a good design limits any conclusion of causality.

Subgroup analysis1. Only a few approaches discuss subgroup analysis to verify for whom the effects are claimed. For example EIF (UK) (2018[32]) stipulates that subgroup analysis is used to verify for whom the intervention is effective and under what conditions. The Clearinghouse for Labor Evaluation and Research (USA) (2014[60]) discusses if the sample analysis allows generalizing the results to a wider population, or if it is presented the limitations of this inference.

Key questions

Standards concerning the feasibility of the intervention

Has a feasibility study been conducted to test whether the intervention can achieve its intended outputs?
Has the feasibility study led to an understanding of which factors contribute to the success of the intervention from the perspective of those who are implementing it?
Has the feasibility study led to an understanding of how those receiving the intervention perceive the intervention’s benefits and whether they are satisfied with the intervention?

Standards concerning pilot studies

Has a pilot study been carried out to investigate an intervention’s potential for improving its intended outcomes?
Does the study have a minimum of pre and post measurement to establish a correlation between programme participation and outcomes?

Sample

Is the study sample representative of the intervention’s target population?
Is the sample design clearly described?
Does the study give information about dropouts and dose?
Does the study present information regarding attrition?

Measurements

Are the measurements used reliable and valid? Does the study report a specific method to do so?
Are the measurement independent on any measures or information given as part of the intervention?
Has the evaluation used valid and reliable methods and measurement tools that are appropriate for participants and relevant to what the intervention is trying to achieve?

Approach to analysis

Does the study stipulate the analysis methods used and are these appropriate for the research question?
Does the study have procedures for managing issues of missing data?
Finding and conclusions
Does the study find statistically significant positive results and are there no harmful effects on at least one of the target outcomes of the study?

Efficacy of an Intervention

Once an intervention has been identified as ‘promising’ in preliminary research, many standards of evidence emphasise the need for rigorous efficacy testing. Efficacy studies typically privilege internal validity, which pertains to inferences about whether the observed correlation between the intervention and outcomes reflect and underlie causal relationship (Society for Prevention Research Standards of Evidence, 2015[53]). In order to maintain high internal validity, efficacy trials often test an intervention under ‘ideal’ circumstances. This can include a high degree of support from the intervention developer and strict eligibility criteria thus limiting the study to a single population of interest.

Why is it important?

A critical goal of standards of evidence is to facilitate the communication of which policies and programmes are efficacious. A statement of efficacy should be of the form that Intervention X is efficacious for producing Y outcomes for Z population at time T in setting S (Society for Prevention Research Standards of Evidence, 2015[53]). In order to maintain high internal validity, efficacy trials often test an intervention under ‘ideal’ circumstances, and tell us little about the impact of an intervention in ‘real world conditions”, because the evaluation is often overseen by the developer of the policy or programme, with a carefully selected example. This can include a high degree of support from the intervention developer and strict eligibility criteria thus limiting the study to a single population of interest.

Therefore, standards of evidence often stipulate that a policy or programme demonstrates effectiveness, in studies where no more support is given than would be typical in ‘real world’ situations. This requires flexibility in evaluation design to address cultural, ethical and practice challenges. Systematic reviews, observational studies and participatory evaluations which gather attitudinal and experiential considerations from the main beneficiaries can still be considered useful evidence and guide improvements in the design or implementation of the intervention.

Summary of the mapping of existing approaches

Determining the efficacy of an intervention is a complex process, involving considerations on the evaluation design, sample, measurements, methods of analysis, and findings. There is wide variety of specification standards that an evaluation must meet for an intervention to be deemed efficacious.

Evaluation design

All the standards consider Randomized Control Trials (RCTs) as an appropriate study design to generate a counterfactual as the basis for making efficacy claims. However, there is wide variation across standards regarding Quasi-Experimental Design (QEDs). Some approaches consider that QEDs can be used to generate comparable samples as RCTs, whereas other standards only recognise that QEDs are better than pre/post studies.

Sixteen of the approaches privilege the use of RCTs over QEDs. For example, Nest What Works for Kids (Australia) (2012[1]) ranks programmes or policies with well-implemented RCTs in the highest levels. Evidence for Every Student Succeeds Act (USA) (Every Student Succeeds Act - ESSA, 2019[34]) defines a programme or policy as Strong evidence when it has at least one well-designed and well-implemented RCT.

Thirty of the approaches consider both RCTs and QEDs as robust evaluation designs to support causal inference. For instance, the European Platform for Investing in Children (EU) (2017[57]) defines both evaluation designs as methodologies that can be used to construct convincing comparison groups to identify policy impacts. Other approaches that treat suitably designed RCTs and QEDs as equivalent are SUPERU (NZ) (2017[52]) and the Green List Prevention (Germany) (Groeger-Roth and Hasenpusch, 2011[50]).

Among the approaches that accept QEDs, some of them distinguish between how rigorous different type of designs are, such as: difference in difference (DD); propensity score matching (PSM); and Regression Discontinuity Designs (RDD). A few approaches also provide a score according to the rigour and limitations of QEDs. For example, What Works Centre for Local Economic Growth (UK) (2016[65]) presents a guide scoring evidence using the Maryland Scientific Methods Scale to evaluate the different type of designs, from PSM, Panel Methods, DD, RDD to Instrumental variables (IV), see Box 3.16.

Many approaches recognise that, whilst RCTs might in theory present the ‘gold standard’ in reducing threats to internally validity, in practice randomisation might not be practicable for a range of policy challenges, including ethical concerns. In the health policy area, the famous “Rand experiment”, which allowed for computing the price elasticity of the demand for health, could probably not be replicated today. (Newhouse J.P., 1993[66]) . For example, in OECD’s work on Regional Development Policy (OECD, 2017[67]), it is recognised that randomisation is not always possible, and quasi-experimental designs can be used as an alternative method to identifying causal effects. In addition, the development of econometric methods with the use of Difference in Differences with instrumental variables in econometrics, has helped to diffuse the use of alternative quasi experimental approaches to producing reliable estimates.

Box 3.16. Standards concerning Efficacy: What Works Centre for Local Economic Growth

The What Works Centre for Local Economic Growth (WWG) is an independent organisation from the UK, mainly focused on producing systematic reviews of the evidence on a broad range of policies in the area of local economic growth.

WWG assessment is based on the Maryland Scientific Methods Scale (SMS), which ranks policy evaluations from 1 (least robust: studies based on simple cross sectional correlations) to 5 (most robust: Randomised Control Trials.). The ranking aims to present the extent to which the methods deal with the selection biases inherent to policy evaluations (robustness), and the quality of its implementation to achieve efficacy, as following:

Level 2: Use of adequate control variables and either (a) a cross-sectional comparison of treated groups with untreated groups, or (b) a before-and-after comparison of treated group, without an untreated comparison group.
Level 3: Comparison of outcomes in treated group after an intervention, with outcomes in the treated group before the intervention, and a comparison group used to provide a counterfactual (e.g. difference in difference). Techniques such as regression and propensity score matching may be used to adjust for difference between treated and untreated groups, but important unobserved differences are likely to remain.
Level 4: Quasi-randomness in treatment is exploited, so that it can be credibly held that treatment and control groups differ only in their exposure to the random allocation of treatment. This often entails the use of an instrument or discontinuity in treatment, the suitability of which should be adequately demonstrated and defended.
Level 5: Reserved for research designs with Randomised Control Trials (RCTs) providing the definitive example. Extensive evidence provided on comparability of treatment and control groups, showing no significant differences in terms of levels or trends. Additionally. Attention paid to problems of selective attrition, and there should be limited or, ideally, no occurrence of ‘contamination’ of the control group with the treatment.

Source: (What Works Centre for Local Economic Growth, 2016[65]).

Intention to treat (ITT). Although the importance of ITT in the academic field is well-established (Hollis and Campbell, 1999[68]), there is variation within the approaches concerning their treatment of ITT. Some of them clearly request in their criteria that analysis must be based on ITT. For instance, What Works Centre for Children’s Social Care (UK) (2018[69]) establishes that acceptable quality study must have an intent-to-treat design. Social Programmes that Work (USA) (2019[70]), which stipulate an ITT approach for the intervention group, and Child Trends (USA) (2018[71]) affirms that only results based on an intent-to-treat analysis can be reported.

Sample

Most standards present clear conditions regarding the nature of the sample required in providing an appropriate basis for the analysis. The standards specify issues concerning a baseline equivalence, attrition, and risks of contamination.

Baseline equivalence. Some of the standards focus on baseline characteristics of the treatment and comparison-groups before running a programme or policy. For instance, Nest What Works for Kids (AU) (2012[1]) requests clear analysis of baseline characteristics. And Social programmes that Work (USA) stipulates that the intervention and control groups must be highly similar in key characteristics prior to the intervention (Coalition for Evidence-Based Policy, 2010[2]).

Other standards treat baseline equivalence differently according to whether the study design is an RCT or QED. Evidence Based Teen Pregnancy Programs stipulates that an RCT must control for statistically significant baseline differences and QEDs must establish baseline equivalence of research groups and control for baseline outcome measures (Mathematica Policy Research, 2016[33]).

Attrition. Some of the standards recognise an attrition threshold. For instance, European Platform for Investing in Children (EU) (2017[57]), which states that attrition must be less than 25% or that it has been accounted for using an acceptable procedure. Another example is the Clearinghouse for Military Family Readiness (USA) (2012[72]), which stipulates an attrition at immediate post-test, of less than 10%.

Other standards also stipulate conditions for overall and differential attrition. For example, Darlington Service Design Lab requests no evidence of significant differential attrition (Graham Allen, 2011[73]). Other approaches go further in stipulating specific attrition thresholds. For instance, What Works for Clearinghouse (USA) (2020[74]) defines that for studies with a relatively low overall attrition rate of 10%, a rate of differential attrition up to approximately 6% is acceptable. For studies with a higher overall attrition rate of 30%, a lower rate of differential attrition, at approximately 4% is acceptable.

Risk of contamination. Only few standards highlighted the issues around risk of contamination. For example, What Works Centre for Local Economic Growth (UK) (2016[65]) stipulates no occurrence of contamination of the control group for the treatment. Crime Solutions (2013[54]) assesses the degree to which internal validity is threated, within other aspects, by contamination.

Measurements

Some efficacy standards stipulate that evaluations must use valid and reliable measurements. In general, these standards tend to be broadly equivalent to those already discussed at the design and development phase. For example, Blueprints for Healthy Youth Development (2015[48]) demands use of valid and reliable measures, and California Evidence-Based Clearinghouse for Child Welfare (2019[75]) provides a measurement tools rating scale based on the level of psychometrics (e.g., sensitivity and specificity, reliability and validity) in peer review studies using QEDs or RCTs.

A few approaches go further and recommend the independency of the measurement with the participants of an intervention. For instance, EIF (UK) (2018[32]) requests that measurements are blind to group assignment if possible. The European Monitoring Centre for Drugs and Drug Addiction (EU) (2011[9]) specifies that an instrument is objective if it produces results independently of who uses the instrument to take measurements. The Dartington Service Design Lab (UK), stipulates that outcome measures must not depend on the unique content of the intervention, and they are not rated solely by the person or people delivering the intervention (Graham Allen, 2011[73]).

Approach to analysis

A few of the approaches provide details on the appropriate analysis required in order to establish the efficacy of a policy. These standards focus on establishing baseline conditions, and the analysis of the effects at the correct level of assignment.

Baseline conditions. Most of the approaches require that evaluations use statistical models to control for baseline differences between treatment and control group. For instance, Strengthening Families Evidence Review (USA) (2019[76]) requests statistical adjustment when treatment and comparison groups are not equivalent. Another example is HomeVEE (USA) (2018[77]), which requests that the analysis should control for differences in baseline characteristics and baseline measures.

Level of analysis. Only a few of the standards demand that the analysis needs to be appropriate according to whether the assignment is at the individual or group (or cluster) level. For instance, Evidence Every Student Succeeds Act (USA) (Every Student Succeeds Act - ESSA, 2019[34]) stipulates that clustered designs must use Hierarchical Linear Modelling (HLM), or other methods accounting for clustering. A second example is the Society of Prevention Research (Society for Prevention Research Standards of Evidence, 2015[53]), which specifies that the analysis must assess the treatment effect at the level at which randomization took place.

Findings and impacts

Most of the standards focus on impact effects and their statistical significance whereas other standards that also request effect size measures.

Statistical significance. Across the standards, the majority demand information regarding whether effects are statistically significant. For instance, Clearinghouse for Military Family Readiness (2012[72]) in the Promising Programme Category requests specific conditions for significant Effects—Two-tailed tests of significance are preferable to one-tailed tests.

Impact effects. Most of the standards claim that an intervention is efficacious when the findings of an intervention are positive and significant, and there is no evidence of harmful effects. Other standards request reporting of mixed effects or null effects. The standards may present these criteria as part of a one ranking; or independently with a ranking solely focused on impact.

Forty-one of the standards consider positive impact effects to claim efficacy. For instance, Be you (AU) (2020[78])— the new integrated national initiative of the Australian government to promote mental health from early years through evidence-based, flexible online professional learning, complemented by a range of tools and resources to turn learning into action (Early Childhood Australia, 2020[79])—requests that a programme have at least one research or evaluation study which demonstrates a positive impact on mental health outcomes for children or young people.
Nineteen of the standards also request reporting on whether there are harmful effects. For example, EU-Compass for Action on Mental Health and Well-being (EU) (European Commission - Directorate-General for Health and Food Safety, 2017[49]) requires that the evaluation outcomes demonstrate beneficial impact, and that possible negative effects be identified and stated.
Eighteen of the standards consider mixed effects or null effects. For example, EIF (UK) (2018[32]) has a Not effect level (NE). This level is reserved for programmes where there is evidence from a high-quality evaluation of the programme that did not provide significant benefits for children.

Some of the standards present an independent ranking or score to assess impact. For instance, Darlington Service Design Lab (UK) evaluates impact according to interventions with positive effect size, and no harmful effects or negative side–effects of intervention (Graham Allen, 2011[73]). Another example is Evidence Based Teen Pregnancy Programs (USA), which classifies the programme evidence as positive, mixed, indeterminate or negative (Mathematica Policy Research, 2016[33]).

Magnitude of the findings. Some standards recognise the importance of reporting effect size. For example, Blueprints (2015[48]) stipulates that effect sizes should be reported, along with the significance levels of those differences, or that it should be possible to calculate the effect size from the data reported (means and standard deviations).

Other standards establish an effect size threshold. For instance, the European Platform for Investing in Children (EU) (2017[57]) stipulates an effect size of at least 0.1 of a standard deviation. A second example is Best Evidence Encyclopaedia (USA), which assesses an intervention by the sample size and effect size (Johns Hopkins University School of Education’s Center for Data-Driven Reform in Education - CDDRE[80]), in the following order:

Moderate evidence level requests specifically studies with weighted mean effect size of at least +0.20;
Limited evidence level a study can meet the criteria except that the weighted mean effect size is +0.10 to +0.19; or the weighted mean effect size is at least +0.20, but the study is insufficient in number or sample size.

Key questions

Sample

Are the treatment and comparison group assignment at the appropriate level (e.g. individual, family, school, community)?
Is there baseline equivalence between the treatment and comparison-group participants on demographic variables before running the intervention?
Does the study provide information about risks of contamination between intervention and control groups?
Does the study provide information about overall and differential attrition?

Approach to analysis

Does the evaluation use statistical models to control for baseline differences between treatment and control groups?
Does the study use an appropriate analysis according to level of assignment (e.g. individual or group)?
Finding and Impacts
Does the study present statistically significant positive effects on at least one or more of the primary outcomes of the study and no harmful effects?
Does the study report the magnitude of the effects (e.g. effect size)?

Effectiveness of Interventions

Efficacy trials often tell us little about the impact of an intervention in ‘real world’ conditions, because the evaluation is often overseen by the developer of the policy or programme, with a carefully selected sample. What are the benefits or damages, independently from the policy goals? Therefore, standards of evidence often stipulate that a policy or programme demonstrates effectiveness, in studies where no more support is given than would be typical in ‘real world’ situations.

Why is it important?

Demonstrating effectiveness of a policy or programme in ‘real world’ situations requires flexibility in evaluation design to address cultural, ethical and practice challenges. During policy implementation, evidence is useful to understand for whom it works and for whom it does not work. Therefore it is important to learn how to maximize benefits and minimize damages, also within a no policy change scenario.

Summary of the mapping of existing approaches

For the majority of standards, in order for an intervention to claim effectiveness, the evaluations should meet all of the conditions of efficacy studies discussed in the previous section as well as the following criteria: generalizability of the findings, long term impacts, positive average effect across studies and no reliable iatrogenic effect observed on important outcomes.

Evaluation design

Generalizability. In order to translate the findings of efficacy evaluations into a wider range of population and settings, the standards concerning effectiveness stipulate that the generalizability of intervention effects should be tested across the following dimensions: a replication; and population subgroup analysis.

Replication. Most of the standards typically request two or more RCTs or QEDs conducted in different locations. Some of the standards consider that before an intervention is judged as effective and ready for scaling up data collection and analysis should be carried out by an independent evaluator who does not have any involvement with the developer of the intervention. For instance, the Clearinghouse for Military Family Readiness (USA) (2012[72]) requests at least one replication involving an external implementation team at a different site. The National Dropout Prevention Center (USA) (2019[81]) gives the highest score to programmes that were evaluated using an experimental or strong quasi-experimental design conducted by an external evaluation team. Further details on the approach adopted by Blueprints are described in Box 3.17.

Other standards do not necessarily request or establish any “independency” condition for a replication. Most of them specify the number of evaluations of the intervention and attention to the transferability of the findings to different context. For instance, CEBC (USA) (The California Evidence-Based Clearinghouse for Child Welfare, 2019[75]) requests at least two RCTs in different settings. Another example is Education counts (NZ), which considers the degree of applicability to New Zealand contexts; and specificity or generalisability of findings (Alton-Lee, 2004[10]).

Population subgroup analysis. Other standards explore generalizability through an analysis of the population subgroups (e.g. race, gender, social class). For instance, Society for Prevention Research Standards of Evidence (2015[53]) requests a statistical analysis of subgroup effects for each important subgroup to which intervention effects are generalized. SUPERU (NZ) (2017[52]) asks for evidence of the impact of the intervention on different subgroups in the target population.

Sample

The standards concerning effectiveness assume that the studies have the same robust approach about the sampling, as already was specified in the efficacy section.

Measurements

Most standards stipulate that effectiveness evaluations must meet the same requirements for measurement as in efficacy standards. Some of them request additionally the independency of the measurements from the participants and from the person delivering the intervention. For instance, EIF (UK) (2018[32]) request that at least one evaluation use a form of measurement that is independent of the study participants and independent of those who deliver the programme.

Box 3.17. Standards concerning Effectiveness: Blueprints

Blueprints is a project from USA funded by the Annie E. Casey Foundation, and hosted by the Center for the Study and Prevention of Violence (CSPV) to make available a registry of evidence-based positive youth development programmes on the health and well-being of children and teens.

The programmes are ranked under the following standards of effectiveness:

Promising programs:

Intervention specificity: Clarity on the objective outcome, risk and/or factors expected from the study, target population, and components role to produce a ‘change’.
Evaluation quality: The programme has been evaluated by (a) at least one randomized controlled trial (RCT) or (b) two quasi‐experimental (QED) evaluations.
Intervention impact: There is a significant positive change in planned outcomes that can be attributed to the programme and there is no evidence of harmful effects.
Dissemination readiness: organizational capability, manuals, training, technical assistance and other support required for implementation with fidelity in communities and public service systems.

Programmes meet the minimum standard of effectiveness and are recommended for local community and systems adoption.

Model programmes - meet these additional standards:

Evaluation Quality: A minimum of (a) two high quality randomized control trials or (b) one high quality randomized control trial plus one high quality quasi-experimental evaluation.
Positive intervention impact is sustained for a minimum of 12 months after the programme intervention ends.

Model Plus programmes – provide one additional standard:

Independent Replication: In at least one high quality study, demonstrating desired outcomes, authorship, data collection, and proper analysis.

Model and Model Plus programmes meet a higher standard, provide greater confidence in the programme’s capacity to change behaviour and developmental outcomes, and are recommended for large-scale implementation.

Source: Adapted from Blueprints for health youth development (2018[82])

Approach of analysis

Most of the standards stipulate that at effectiveness evaluations must meet all the methodological requirements previously discussed in the efficacy standards including appropriate statistical analysis (e.g. Intent-to-treat) and baseline equivalence adjustments.

Finding and impacts

Most of the standards concerning effectiveness agree on requiring positive average effects and no evidence of negative effects or risk of harm. Other standards also consider the sustainability of the effect at the long term.

Positive average effect and no reliable iatrogenic effect observed. Most of the standards agree on requesting positive average effect across studies and reporting no reliable iatrogenic effects observed on important outcomes. For instance, Society for Prevention Research Standards of Evidence (2015[53]) specifies that effectiveness can be claimed only for intervention populations, times, settings, and outcome constructs for which the average effect across all effectiveness studies is positive and for which no reliable iatrogenic effect on an important outcome has been observed. Another example is Dartington Service Design Lab (UK), which requests evidence of a positive effect and an absence of iatrogenic effects from the majority of the studies (Graham Allen, 2011[73]).
Other standards adjust the programme’s rating according to the average effects of the studies. These standards recognize a category for each of the possible results found in multiple studies. For example, CEBC (USA) (The California Evidence-Based Clearinghouse for Child Welfare, 2019[75]) presents three categories for the overall weight of evidence from several studies:
- Well supported category: At least two RCTs have found the benefit of the practice;
- Evidence Fails to Demonstrate Effect: Two or more RCTs have found the practice has not resulted in improved outcomes;
- Concerning Practice: The overall weight of evidence suggests the intervention has negative effect.
Long term effects. Most of these approaches agree on requesting sustained effects for at least 12 months. A few of them also accept effects for at six least months. For example, the Nest What Works for Kids (AU) (2012[1]) consider different periods of time: Supported level, the effect should be maintained at a 6-month follow-up; and for Well supported an effect must be maintained for at least one study at one-year follow-up. Another example is HomVEE (USA) (2018[83]), which evaluates the evidence across diverse outcome domains, such as duration of Impacts (information on the length of follow-up) and Sustained Impacts (impacts were measured at least one year after programme enrolment).

A postscript on systems-based approaches to evaluation

One important caveat to the standards reviewed in the efficacy and effectiveness chapters is that they primarily originate from traditional approaches to impact evaluation. These can be contrasted with systems-based approaches to evaluation (see Table 3.1). System based approaches to evaluation start from challenges faced when dealing with the open‐ended nature of problems and issues including innovation and the goal complexity of the connected processes (Askim, Hjelmar and Pedersen, 2018[84]; Tõnurist, 2019[85]).

OECD has been also moving towards a systems approach to public sector innovation and has developed a model to look at innovation activities from an individual, organisational and systemic lens, which can then also feed into approaches to evaluation (Tõnurist, 2019[85]). The tensions between the traditional approaches impact evaluation and the system based approaches has been further discussed by Tõnurist (2019[85]) and it is acknowledged that integrating these insights would be a useful next step for approaches to standards of evidence.

Table 3.1. Two Basic Approaches to Evaluate Public Sector Innovation
	Impact-focused approach	Systems approach
Purpose	Solid knowledge about effects and causality	Generate new practices; nurture innovation and learning
Participants	Controlled selection, few in number, randomisation, strategic sampling, control group	Self-selection, many participants, no control group
Contents	Narrow scope and accommodation	Wide scope; emergent, liberal accommodation
Evaluation	Comprehensive, external party, effect-oriented	Inspiring depictions, context-sensitive narratives, process-oriented
Source: Adapted from Tõnurist (2019[85]) and Askim, Hjelmar and Pedersen, (2018[84])

Key questions

Evaluation design

Has the programme been evaluated in more than one location?
Has at least one evaluation been carried out independently from the developer of the intervention?

Measurements

Are the measurements tools used independently from the participants and the people involved in delivering the intervention? (e.g. administrative data, independent observer)

Findings and impacts

Has the programme’s impacts been analysed by population sub-groups?
Are the effects of the intervention sustained over the medium or long term?

Cost (effectiveness) of Interventions

Measuring effective interventions requires not only evidence of their impacts, but evidence of their cost and value for money. Cost data provide information relevant to the financial planning and sustainable scale-up (Levin and Chisholm, 2016[86]); while a variety of methodological tools look to assess the benefits and costs associated with an intervention.

Why is it important?

Positive impacts at a very high price may not be in the interests of governments and citizens. Using economic evidence is important to demonstrate value for money for public programmes in a context of continued fiscal constraints. Increased understanding of interventions that achieve impact at a too high price would enable decision-makers to make more efficient decisions.

Summary of the mapping of existing approaches

A variety of different methodologies is taken by the existing standards. Some of them focus on reporting the existence of cost or economic evaluations of an intervention. Other standards request in their criteria the presence of cost information and related analysis for a policy or programme; whilst a final set of standards provide detailed guidance on carrying out and interpreting economic evaluations.

The Box 3.18 provides a first general description of the different types of economic evaluations used to understand their complexity and usefulness when a particular organization or government is taking an investment decision.

Box 3.18. Type of economic evaluations

Cost analysis (CA): This evaluation can provide a complete account of the economic costs of all the resources used to carry out an intervention.
Cost-effectiveness analysis (CEA): CEA requires that an indicator of effectiveness be compared with a cost from a single project. CEA can only offer guidance on which of several alternative policies (or projects) should be selected, given that one has to be selected. (OECD, 2018[87]).
Cost-utility analysis (CUA): when alternative interventions produce different levels of effect in terms of both quantity and quality of life (or different effects), the effects may be expressed in utilities. The best known utility measure is the quality-adjusted life year, or QALY.
Cost-benefit analysis (CBA): provides a model of rationality (or human preference). It forces the decision-maker to look at beneficiaries and losers–in both, spatial and temporal dimensions– in monetary values (OECD, 2018[87]).
Return on investment (ROI): captures the percentage of return for every dollar invested. It is calculated by dividing intervention benefits net of intervention costs by intervention costs.
Multi-criteria analysis (MCA): looks at manifold and diverse dimensions of policies and investment projects through the same analytical framework. MCA offers a broader interpretation of CEA since it openly features the existence of multiple objectives (OECD, 2018[87]).

Source: (The Cochrane Collaboration, 2011[88]; National Academies of Sciences, Engineering, and Medicine, 2016[89]).

In the same line, the National Academies of Sciences, Engineering, and Medicine provides a decision tree to determinate if an intervention is ready for an economic evaluation (See Figure 3.5). According to this tool, the first step corresponds to reviewing the available information about an intervention to determine if it is enough to answer the research question (e.g. are the counterfactuals well defined? are resources required to implement the intervention known?). At this stage, a researcher or policy maker should be able to conduct a CA. If there is evidence of intervention impacts, the researcher should consider conducting a CEA or CBA, which also relies on whether the interventions’ impacts can be monetarized. If they can be, the researcher should conduct a CBA; otherwise, a CEA would be the best option. Other economic evaluations can be in consideration such as QALY, or DALY. This will depend on the perspective of the evaluation.

Figure 3.5. Decision tree for the different types of economic evaluation

Cost information

A first step towards being able to carry out economic evaluation is to collect information on the costs of an intervention; most of the approaches (31 of 50) report information in term of the economic resources in materials and training in an intervention. For instance, Clearinghouse for Military Family Readiness (USA) (2012[72]) presents in the evidence summary information related to the cost of training per participants, and further available information related to the programme implementation.

Other approaches report in the evidence reviews if an intervention or programme has developed an economic evaluation. For example, What Works for Kids (Australia) (2012[1]) asks specifically if a cost benefit study has been undertaken and published. This information is also identified in the evidence portal of What Work Centre for Children’s Social Care (UK) (2018[69]). A further example is Social Program that works (USA), which provides summaries of the available programme’s benefits and costs in their evidence reviews.

Cost rating or assessment

Some of the approaches establish in their criteria a cost rating or assessment condition, separately from the quality design, to assess if an intervention provides cost information, or the settings to present this type of information. For instance, EMCDDA (EU) (2011[9]) suggests planning financial requirements in terms of cost estimations for the programme, and a detailed and comprehensive breakdown of costs. Other approaches provide tools to determine the resources required in future interventions or additional activities, such as EEF (See Box 3.19) and the Toolkit produced by the What Works Centre for Crime Reduction (UK) (2017[90]), which distinguish direct and indirect costs, and allow users to make a comparison of costs prior to the intervention being implemented and after the intervention, or in a different context.

Box 3.19. Cost rating bands for EEF Toolkit

The Education Endowment Foundation (UK) implements a cost rating reflecting the approximate additional cost per pupil of implementing such interventions in schools. This might include the cost of new resources required, additional training courses, and activities. The cost rating bands are organized as following:

£ - Very low: up to about £2,000 per year per class of 25 pupils, or less than £80 per pupil per year.
££ - Low: £2,001 £5,000 per year per class of 25 pupils, or up to about £200 per pupil per year.
£££ - Moderate: £5,001 to £18,000 per year per class of 25 pupils, or up to about £720 per pupil per year.
££££ - High: £18,001 to £30,000 per year per class of 25 pupils, or up to £1,200 per pupil.
£££££ - Very High: over £30,000 per year per class of 25 pupils, or over £1,200 per pupil.

Source: Education Endowment Foundation (2018[25]).

Whereas few of the approaches requesting a Cost Benefit Analysis or Cost Effectiveness Analysis in their criteria. For instance, EPIC (EU) (2017[57]) assesses if the programme has been found to be cost-effective/cost-beneficial (i.e. the practice can deliver positive impact at a reasonable cost).

Standards concerning economic evaluations

To determine how best to invest public or private resources in social policies, decision makers require the use of economic evaluations to answer relevant questions, such as What does it cost to implement this intervention in a particular context and what are its expected returns? To what extent can these returns be measured in monetary or nonmonetary terms? Who will receive the returns and when? Is this investment a justifiable use of scarce resources relative to other investments? (National Academies of Sciences, Engineering, and Medicine, 2016[89]).

In this section additional standards that are mainly focused on establishing guidelines to evaluate economic evaluations will be introduced, particularly, with regard to criteria relating to cost analysis (CA), cost effectiveness analysis (CEA), and cost benefit analysis (CBA).

In the following sub-sections the standards for framing the evaluation will be outlined: identifying the impacts, determining the cost, valuing benefits and cost, and presenting the results.

Standards for framing an economic evaluation

Most of the approaches stipulate the importance of clearly stating the objectives of an economic evaluation regarding the information and resources available for a given intervention in order to establish which evaluation method to use. For instance, The National Academies of Sciences, Engineering, and Medicine (2016[89]) suggest that in order to determinate whether an intervention is ready for economic evaluation, this will depend on the question(s) of interest, the intervention specificity, and a well specified counterfactual condition.

Other approaches specify that the eligibility criteria, delivery setting time, and location must be well described, which relates to some of the issues of theory of change and logic model addressed. For instance, New South Wales Government (2017[91]) refers to the need for a programme logic to identify the issues that a programme is seeking to address; its intended activities and processes; their outputs; and the intended programme outcomes. Another example is the work on Cost benefit analysis and the environment by the OECD (2018[87]) which specifies that it is necessary to mention all the direct and indirect participants involved in the policy, geographical boundary, and extension to wider limits.

Standards for identifying the programme/policy impacts

Most of the approaches stipulate that the outcomes used in CEA and CBA should come from robust designs to determinate unbiased impacts from an intervention (e.g. RCTs or QEDs). This builds on the issues of efficacy and effectiveness addressed in Section 0 and 0 (Research design, measurements, sample, potential impacts, and external validity). For instance, The National Academies of Sciences, Engineering, and Medicine (2016[89]) stipulate that for CEA or CBA not only is information on the resources used to implement the intervention required, there is also a need for credible evidence of impact. Another example is the Vera institute (USA) (2014[92]), which refers to quantifying the investment’s impacts using evaluations that establish the causal link between an investment and its impacts.

Other approaches also highlight the use of meta-analysis or systematic reviews when multiple impact studies exist regarding a programme or similar intervention. For example, the Washington State Institute for Public Policy Benefit (USA) develops a meta-analytic approach to identify, screen, and code research studies in its cost-benefit analysis. The WSIPP also adjusts effect size regarding the methodical quality of the study and the longitudinal linkage (2017[17]).

Standards for estimating programme costs

Developing accurate estimates of the cost of an intervention is one of the main concerns in economic evaluations; and represents an opportunity to improve subsequent programme planning and implementation (Crowley et al., 2018[93]). According to this, some of the approaches agree on planning cost data collection, ideally, in the early stages of the intervention through standardized methodologies such as: a macro top-down approach; or a bottom-up approach (See Box 3.20). Some of them also provide information regarding tools to facilitate the process of data collection: for instance, CostOut, produced by the Center for Benefit-Cost Studies of Education (Vera Institute, 2014[92]; Crowley et al., 2018[93]), which was designed to simplify the estimation of costs and cost-effectiveness of educational or other social programmes. Other approaches provide information on current practices, such as the analytical report presented by EMCDDA (2017[94]), which compiled initiatives for estimating drug treatment costs across eleven countries—including US, Australia, Portugal, Italy, and Czech Republic and the European Union.

Box 3.20. Methods for costing an intervention

Macro top-down approach (costs to society): uses total public spending (or individual site budget or expenditure) data to provide gross average estimates of intervention costs.
Bottom-up approach (or Micro costing or Ingredients Methods): relies on identifying all resources required to implement an intervention and then valuing those resources in monetary units to estimate intervention costs. This method requires the evaluator to:
- Describe the theory of change or logic model guiding the programme;
- Identify specific resources used to implement the programme;
- Value the cost of those resources;
- Estimate the programme’s total, average, and marginal costs.

Source: (Crowley et al., 2018[93]; National Academies of Sciences, Engineering, and Medicine, 2016[89]; Vera Institute, 2014[92]).

Whatever the methodology chosen to measure the cost of an intervention, the majority of approaches agree on covering as much information cost categories as possible (See Box 3.21) to ensure unbiased analysis (Vera Institute, 2014[92]; NSW Goverment, 2017[91]). For instance, the National Academy of Sciences (2016[89]) not only considers personnel, space, materials, and supplies (in the micro costing method); but also found useful to register direct cost, indirect cost (e.g. volunteer time), fixed cost (do not vary with the number of participants served), and variable costs, particularly when the evaluator is interested in an intervention’s marginal and steady-state (average) costs. Additional to these costs, Crowley et al (2018[93]) also suggests that resources needed to support programme adoption, implementation, sustainability, and monitoring should be included in cost estimates.

Box 3.21. Type of costs

Direct Costs: value of all the goods, services, and other resources that are consumed in the provision of an intervention. (e.g. Total programme expenditures for the most recently completed fiscal year)
Fixed costs
- Capital cost: for setting up the service
- Labour costs: salaries and fringe benefit amounts for all programme staff
- Overhead cost: for running the services
Variable costs: costs that change in relation to variations in an activity.
Indirect Costs: travel, training, trainer fees, supplies, materials, utilities, office rent, office services or volunteering.
Intangible costs: Monetary value of pain, suffering, distress etc. associated with treatment (measured through Willingness To Pay).

Sources: (Lomas et al., 2018[95]) (NPC Research & Portland State University’s Center for the Improvement of Child and Family Services, 2019[96]).

Other approaches, such as the WSIPP (USA) (2017[17]), use several strategies in meta-analysis to construct programme cost estimates. Some of their principles are the following:

If the programme evaluations they have meta-analysed contain information on the number of “physical resource units” used by the programme, then they summarize those units, and produce an estimate of the average cost.
The per-participant programme costs represent the cost of the average person who enters the programme, rather than the cost of a participant who completes the programme.
In addition to a per participant cost estimate, they also note the year in which the dollars are denominated.
Programmes that involve multiple years of per-participant spending can be present valued with NPV equation, where the discount factor depends on the years.

Standards for valuing the costs and benefits of programmes

After having identified the resources used in an intervention and their outcomes, the approaches typically refer to how cost and benefits should be valued. This will depend on the type of economic evaluation, its purpose, and time horizon.

For CA, some of the approaches consider the market price of a resource as a good approximation for its opportunity cost. For example, CADTH (Canada) recommends that the fees and prices, listed in schedules and formularies of Canadian ministries of health, be considered as unit-cost measures when calculating the perspective of the public payer (2017[97]). Other approaches suggest that shadow prices can be used, as another method for valuing the resource, when a market price does not exist (National Academies of Sciences, Engineering, and Medicine, 2016[89]; Crowley et al., 2018[93]). Shadow prices are used to capture the appropriate economic value in terms of willingness to pay: what consumers are willing to forget to obtain a given benefit or avoid a given cost (Karoly, 2012[98]).

For CBA, most of the approaches agree on the three summary statistics or decision rules in CBA model (See Box 3.22) Net Present Value (NPV); the Benefit-Cost Ratio (BCR); and the Internal Rate of Return (IRR). Particularly, NPV requests that factors such as the discount rate, inflation and time horizon be taken into account (National Academies of Sciences, Engineering, and Medicine, 2016[89]; OECD, 2018[87]; OMB, 1992[99]; CADTH, 2017[97]; Crowley et al., 2018[93]; NSW Goverment, 2017[91]).

Box 3.22. Summary statistics on CBA

The OECD has worked extensively on the use of Cost Benefit Analysis (CBA) as a core tool of public policy. Their last publications on the topic reflect a timely update on recent developments in the theory and practice of CBA, and provide detailed information on the following definitions:

Net Present Value (NPV): refers to the present value of net benefits so that: NPV = PV(B) – PV(C) with present values calculated at the social discount rate. Where PV(B) denotes the (gross) present value of benefits, and PV(C) the gross present value of costs.
- Discount rate: Under the assumption that the social or shadow price of a unit of consumption in the future is lower than the price of a unit of consumption today. The discount rate simply measures the rate of change of the shadow price.
Benefit-Cost Ratio (BCR): The general rules become: i) accept a project if: PV(B)/PV(C) > 1; ii) in the face of rationing: rank by the ratio PV(B)/PV(C); or, iii) in choosing between mutually exclusive projects: select the project with the highest benefit cost ratio.
Internal Rate of Return (IRR): The net present value rule requires the use of some predetermined social discount rate rule to discount future benefits and costs. An alternative rule is to calculate the discount rate which would give the project a NPV of zero and then to compare this “solution rate” with the pre-determined social discount rate.

Source: OECD (2018[100]; 2006[101]).

Other approaches such as the Office for Management and Budget (USA) (2018[102]) provides further Guidelines and Discount Rates for Benefit-Cost Analysis of Federal Programmes. They stipulate and update information on treatment of inflation, Real Discount Rates (a forecast of real interest rates from which the inflation premium has been removed), and Nominal Discount Rates (a forecast of nominal or market interest rates for calendar year). Additionally, the OECD (2018[100]) provides information about health valuations for valuing risks to life (VSL), and the value of a (statistical) life-year (VSLY) (See Box 3.23).

Box 3.23. Health Valuation

Many public policies are mainly designed to protect life and health. However, these policies come with substantial costs that are not immediately detected. Understanding how both government and individuals must make trade-offs between income and risk, has led evaluators to measure the value that individuals place on changes in risk levels for threats to life and health (Bosworth, Professor and Kibria, 2017[103]), through some of the following measurements:

Valuing risks to life

The procedure for measuring the value of a statistical life captures how much an individual is willing to pay (WTP) to secure a risk reduction arising from a policy or project, or the willingness-to-accept (WTA) a compensation for tolerating higher than “normal” risks; which is equal to the marginal rate of substitution between income (or wealth) and mortality risk.

Value of a (statistical) life-year

The VSLY approach replaces the assumption (implicit in the way that VSL is typically applied) that age does not matter with an alternative assumption that age not only matters but it matters in a particular way.

Source: (OECD, 2018[100]; Bosworth, Professor and Kibria, 2017[103]).

For CEA, some of approaches stipulate the need for a comprehensive measure of the intervention’s economic costs from a societal perspective, and the examination of one or more no-monetized outcomes(s). However, the use of different units to measure outcomes limits the aggregation of them. This problem has been mitigated by the development of measures, such as quality-adjusted life years (QALYs) (also known as cost-utility analysis) or disability-adjusted life years (DALYs) (National Academies of Sciences, Engineering, and Medicine, 2016[89]) (Goverment of Netherlands, 2016[104]). Other approaches, such as NICE (UK) (2013[105]), provide parameters to consider an intervention cost effective (less than £20,000 per QALY gained or between £20,000 and £30,000 per QALY, if certain conditions are satisfied).

One of the problems that arise from valuing cost and benefits is double counting. This issue refers to outcomes (benefits or cost) that are inputs for other outcomes or can be linked within each other. A number of approaches mention the need or precautions to avoid double counting (OECD, 2018[100]; Goverment of Netherlands, 2016[104]; National Academies of Sciences, Engineering, and Medicine, 2016[89]). For instance, Crowley et al (2018[93]) suggests that one approach to manage double counting is to employ a series of “trumping rules” that isolate developmental pathways to ensure no double counting occurs. Another example is the Treasury from New Zealand, which provides practical examples concerning double counting (See Box 3.24).

Another problem that can arise when measuring cost and benefits are externalities. This issue refers to goods that, once produced, can be consumed simultaneously by any number of people and from which people can’t be excluded. These can have negative or positive effects (2015[106]). OMB (USA) presents several examples where externalities are addressed, such as in the principle of Willingness-To-Pay, where market prices provide an invaluable starting point for measuring costs, but prices sometimes do not adequately reflect the true value of a good to society; hence, the use of shadow prices2 can avoid market distortions, such as externalities or taxes (1992[99]). Externalities also lead to measuring indirect cost and intangible costs as another way to avoid a measurement bias in the evaluation. The Treasury from New Zealand offers another example regarding externalities (See Box 3.24).

Box 3.24. Examples for avoiding double counting and externalities: The Treasury of New Zealand

The treasury created the Guide to Social Cost Benefit Analysis with the aim of encouraging that all public decisions be accompanied by at least a robust and comprehensive CBA, through main steps into the generic policy development process. For instance, they present some of the principles and examples to identify the costs and benefits (Step 3) of a programme or policy, such as avoidance of double counting and externalities.

Avoidance of double counting

A depreciation charge is intended to reflect the ‘consumption’ of capital, or the reduction in the value of the capital investment over a specified period, but it may double count the cost of an investment if the construction costs were already included in the CBA. Accounting practice treats construction cost as capital expenditure and recognises depreciation as an operating cost.

Externalities

A ‘carbon’ tax would be an example, though whether it could be set at the right level would depend on whether it was possible to determine the resource cost that carbon emissions give rise to (in terms of adverse climate consequences). The effect of the tax is for people to take the cost of the externality into account in their actions, i.e. to ‘internalise’ the externality. Externalities are quantified when doing CBAs in order to ensure that they are internalised in the spending decision or regulatory decision.

Source: Adapted from The treasury of New Zealand (2015[106]), Guide to Social Cost Benefit Analysis - July 2015.

Standards for handling estimate uncertainty

Because cost and benefits estimations are made prior to the implementation of the programme, the outcomes from an economic evaluation could take many different possible values; and create a level of uncertainty. Standards are needed for estimating, resolving, and reporting that uncertainty. A risk, on the other hand, refers to whether the information allows for the estimation of the full range of possibilities of an event in terms of their probabilities. (OECD, 2018[100]).

The majority of approaches focus on uncertainty, and request testing the economic projections using a variety of approaches to sensitivity analysis (See Box 3.25). For instance, The Pew Charitable Trusts (USA) (2013[107]) stipulates the need to conduct and report sensitivity analysis, and provide a range of possible outcomes to ensure methodological rigor and transparency from an economic evaluation. Another example is the Regulatory Impact Analysis (RIA) Guidelines by the Government of Ireland (2009[108]), which suggest that any assumptions made in RIAs (and the MCAs and CBAs performed in this context) should be calculated for a variety or range of future values through a sensitivity analysis.

Most of the approaches suggest using the Monte Carlos Analysis to handle uncertainty. Other approaches propose different practices such as Partial sensitivity analysis and Break-even analysis (See Box 3.25). Additionally to this, some approaches stipulate the need for these methods not only to test the robustness of the findings, but also to present estimates within a confidence interval and their standards errors (Crowley et al., 2018[93]) (National Academies of Sciences, Engineering, and Medicine, 2016[89]).

Box 3.25. Mechanism to test uncertainty

Sensitivity analysis shows how responsive—or sensitive—a study’s results are to changes in assumptions. These are some of the types of sensitivity analysis (Vera Institute, 2014[92]):

Partial sensitivity analysis: In this analysis, you select one variable and change its value while holding the values of other variables constant, to determine how much the CBA results change in response.
Break-even analysis: If you are unable to estimate a policy’s most likely effects or cannot find comparable studies to help determine its best-case and worst-case scenarios, you can use a break-even analysis. This helps identify how large a policy’s impact must be for its benefits to equal its costs, that is, to break even. By definition, breaking even results in a net present value of $0.
Monte Carlo analysis: uses estimates of the probability distributions of costs and benefits, and other parameters used in CBA, to undertake a probabilistic analysis of the likely NPV to emerge from a particular project. The steps to Monte Carlo analysis are as follows (OECD, 2018[100]):
1. 1. Estimate the probability distributions for the parameters of interest. Where parameters are likely to be correlated, the joint probability distributions are estimated;
2. 2. Take a random draw of the parameters of interest of sample size;
3. 3. Estimate the NPV n times using the parameters drawn;
4. 4. Calculate the mean NPV across the n estimates and store the value;
5. 5. Repeat m times until one can plot the probability distribution of mean NPV conditional on the uncertain parameters with sample size n, with m repetitions;
6. 6. Evaluate the likelihood of a positive or negative NPV.

Source: (Vera Institute, 2014[92]; OECD, 2018[100])

Standards for reporting the results of economic evaluations

The procedure of reporting economic evaluation findings depends not only on the type of evaluation conducted, but also on the needs to promote transparency and comparability across studies. The standards are requested to provide best practices for reporting a clear record of how the evaluation was conducted and to support the verification of the findings by an independent researcher.

The majority of approaches provide guidelines on how and what to report in an economic evaluation. For instance, the National Academies of Sciences (2016[89]) offer a Checklist of Best Practices for Reporting Economic Evidence according to the methodology implemented (CA, CEA or CBA). Another example is the Government of Netherlands (2016[104]), which provides guidance related to reporting input values, costs, and uncertainty analysis, within others.

Some of the approaches suggest more specific reporting requirements such as maintaining a common table of inputs and assumptions. For instance, Crowley et al (2018[93]) recommends, in their Standards for Reporting Findings from Economic Evaluations, implementing a two-tiered reporting system that includes a consumer-focused summary accompanied by a technical description (e.g. included as an appendix) that details the modelling and assumptions made to estimate costs and benefits. Another example is the Vera institute (USA) (2014[92]), which provides guidance for CBA on how to tabulate results, document the analysis, and interpret the findings (See Figure 3.6).

Only a small number of approaches are concerned with the delivery (time) and accessibility of the evaluation findings. For instance, the Government of Ireland (2009[108]) discusses factors such as: Where should Regulatory Impact Analysis (RIAs) be published? And the Evidence-Based Policymaking Collaborative (USA) (2016[109]), which highlights the importance that CBA results are delivered in accessible, concise, and compelling ways, and completed in time to inform decision-makers’ choices. They consider that adopting rigorous, replicable CBA methodologies and making data readily available to conduct analyses can help improve timeliness.

Key questions

Framing an economic evaluation

Does the economic evaluation specify an empirical question, with an underlying model or theory?
Does the evaluation adopt a societal perspective? If not, what are the limitations of the evidence used?
Does the evaluation take into account options for agile design and delivery methods that could have implications for the results of the evaluation?

Identifying the programme/policy impacts

Does the economic evaluation describe the methods used to identify the programme/policy impacts (Quality design, measurements, sample, approach of analysis, findings, and external validity)?
Does the economic evaluation well-describe the impacts, including magnitude, significant and non-significant findings, significance levels, standard errors or standard deviations, methods of estimation?
Does the economic evaluation describe any limitations to generalizability, internal and external validity of impacts, as well as confidence intervals?
If the economic evaluation uses a systematic review to identify the programme impacts, does it stipulate the methodological criteria to select the studies?

Estimating programme costs

Does the economic evaluation cover different cost categories (e.g., labour, equipment, training, materials and supplies, office space, indirect cost) according to the stages of an intervention or programme (e.g. adoption, implementation, or monitoring)
Does the economic evaluation use micro costing procedures or ingredients methods to improve the quality of intervention cost estimates?
Does the costing concern both direct and indirect costs?

Implementation and Scale up of Interventions

Knowledge of ‘what works’ – of which policies and programmes are effective, is necessary but not sufficient for obtaining outcomes for citizens. Increasingly, there is recognition that ‘implementation matters’- that the quality and level of implementation of an intervention of a policy is associated with outcomes for citizens (Durlak, 1998[110]; Durlak and DuPre, 2008[111]).

Why is it important?

It is important to understand the features of policies and programmes, of the organisation or entity implementing them, along with the myriad of other factors that are related to adoption, implementation and sustainability of a policy or programme. This enables practical guidance to enable successful implementation and scale-up efforts. Increased attention to implementation has also been drawn by the work of economists such as Pr. Duflo, and the JPAL networks working on development issues. It has been estimated that interventions implemented correctly can achieve effects two or three times greater than interventions where problems with implementation have been experienced (Durlak and DuPre, 2008[111]).

Summary of the mapping of existing approaches

Of the approaches included in the mapping, the majority include some coverage issues concerning implementation and scale up of interventions.

Standards concerning the delivery and implementation of an intervention

Most of the approaches that cover implementation and scale up are focused on simply providing factual details about the delivery and implementation requirements of an intervention. These approaches include the Australian What Works for Kids, the Canadian Best Practices Portal, Spain’s ‘Prevención basada en la evidencia’ and the Evidence Based Teen Pregnancy Programmes in the USA. Spain’s approach provides information related to the delivery of an intervention, its materials and setting. The Canadian Best Practices Portal is another approach that provides key information about what is required to implement an intervention (Box 3.26).

Box 3.26. Canadian Best Practices Portal – Adaptability of an intervention

The Canadian Best Practices Portal is a compilation of multiple sources of information on interventions, best practices and resources for chronic disease prevention and health promotion recommended by experts. The Portal links to resources and solutions to plan programmes for promoting health and preventing diseases for populations and communities. The portal includes three types of interventions: Best Practices, Promising Practices and Aboriginal ‘Ways Tried and True’.

From the overview of the interventions evaluated, the portal provides information related to the adaptability of the programme, in terms of:

Implementation history;
Expertise required for implementation within the context of the intervention;
Available support for implementation;
And, resources and/or products associated with the interventions.

Source: National Collaborating Centre for Methods and Tools (2010[112])

Other approaches provide more granularity about the implementation requirements of an intervention. The Evidence Based Teen Pregnancy Programs in the USA has a standalone section on implementation, which comprises of eight fields including implementation requirements and guidance and allowable adaptations. What Works for Kids also has a standalone section on implementation which includes the following fields:

Training
Can training be accessed in Australia?
Who delivers the programme?
Minimum practitioner qualifications
Are there any licensing or accreditation requirements?
Is there a manual that describes how to implement the programme?
What are the required materials for the trainer?
Are specific assessments required prior to implementation?
Are particular tools required for implementation?
Overall implementation / resourcing issues
Is the programme scalable?
Comments on the scalability of the intervention
Setup costs
Ongoing costs

Experiences of the implementation of an intervention

Some of the approaches that cover issues around implementation and scale up are focused on providing and categorising experiences of implementing an intervention. These experiences are typically the findings from process evaluations and qualitative studies. These approaches include The Community guide, the EMCDDA Best Practice Portal, the EU-Compass for Action on Mental Health and Well-Being and HomeVee.

The Community Guide is a resource that helps practitioners and policy makers to improve health and safety in their communities. As part of a ten-step process it includes details about the applicability and barriers to implementation for the recommended interventions. The EU-Compass for Action on Mental Health and Well-being also includes information on experiences of implementation and is described in Box 3.27. The EMCDDA Best Practice Portal has recently published a new database of programmes for implementation. This includes details of programmes that have been implemented in more than one European Country, along with details of experiences of implementation (EMCDDA, 2020[113]).

HomeVee is another approach that provides a summary of ‘Implementation Experiences’ based on the studies included in a review, focusing on:

Characteristics of Model Participants,
Location and Setting,
Staffing and Supervision,
Model Components,
Model Adaptations or Enhancements,
Dosage (Home visits), and
Lessons Learned.

Box 3.27. EU-Compass for Action on Mental Health and Well-being

The EU-Compass for Action on Mental Health and Wellbeing is a web-based mechanism used to collect exchange and analyse information on policy and stakeholder activities in mental health. The EU-Compass collects examples of good practices from EU countries and stakeholders on an ongoing basis. These examples are evaluated using quality criteria agreed with EU countries and the Commission. Each year, a brochure with examples of good practices is published regarding one of the EU-Compass areas. The brochure contains a brief summary of each practice which addressed priority areas, the lessons learned, and level of implementation. Particularly, EU-Compass provides information related to the lessons learned from an implemented programme in three main sub-sections:

What worked well & facilitators to implementation;
What did not work & barriers to implementation;
Recommendations for future adopters of this practice.

Source: EU-Compass for Action on Mental Health and Well-being (EU) (European Commission - Directorate-General for Health and Food Safety, 2017[49]).

Standards concerning the dissemination readiness and system readiness

A small number of approaches go further in providing detailed criteria against which dissemination readiness and/or system readiness could be assessed. These are features of the intervention or of the organisation or community adopting the intervention that have been shown to be related to adoption, implementation, or sustainability of the intervention (Society for Prevention Research Standards of Evidence, 2015[53]). The purpose of such approaches is to support the implementation and scale-up efforts of evidence-based interventions.

These approaches include the EU-Compass for Action on Mental Health and Well-being, the Green List Prevention, NESTA, Housing Associations' Charitable Trust, Blueprints, and SUPERU. The Green List Prevention includes six criteria to rate the ‘implementation quality’ of an intervention including whether ‘support / technical assistance during implementation is available’ and whether ‘instruments for quality control during the implementation are available’. Blueprints includes five criteria on ‘dissemination readiness’ including that ‘there are explicit processes for ensuring the intervention gets to the right persons’. SUPERU also developed comparable criteria, described in Box 3.28.

Box 3.28. SUPERU’s intervention consistency and documentation criteria

The strength of evidence scale of Superu (NZ) consisted of five levels. The levels corresponded not just for ascending rankings for strength of evidence, but also to expectations about the type of evidence that can and should be generated about an intervention as it matures and grows; specifically, evidence about which elements of the intervention are necessary to implement with fidelity, and which can be adapted. Level 4 is defined as appropriate for mature, largescale interventions with a strong evidence base. Interventions that reach this level must contain the following information:

There is evidence that the intervention is consistently delivered as planned and reaches its target groups.
Intervention consistency and documentation
There is clarity and documentation about what the intervention comprises.
There is regular review of procedures, manuals and staff training processes.
Information is available on the resources (money and people) required to deliver the intervention.
Technical support is available to help implement the intervention in new settings.

Source: (SUPERU, 2016[114])

A limited number of approaches go further in either explicitly scoring the implementation readiness of an intervention. The Evidence Based Teen Pregnancy Programs conducts a detailed assessment of an intervention’s ‘Implementation Readiness’ conducted based on materials and documents about the intervention and its implementation. Based on this assessment, an implementation readiness score is awarded by three component scores: (1) curriculum and materials, (2) training and staff support, and (3) fidelity monitoring tools and resources. The component scores are added together to give a total score, which ranges from 0 to 8, with higher scores indicating the interventions most ready to implement.

Key questions

Requirements

Are materials available that specify the activities to be carried out and optimal methods of implementing the intervention?
Is there guidance about who can implement the intervention and what education and training they must have to be able to implement the intervention successfully?
Are training and technical assistance available for implementing the core components of the intervention?

Intervention Readiness

Are tools available to ensure that the intervention is being implemented as intended?
Is there a system in place for documenting adaptations to core components that occur during implementation?
Is there a system to support regular monitoring of implementation?
Is there a system in place to support planning and monitoring of service user recruitment?

System readiness

Are there tools to enable the organization or community adopting the intervention to assess factors that are likely to impede or facilitate adoption and successful implementation of the intervention?
Are there tools to enable the organization or community adopting the intervention measure their own capacity to implement in a high-quality fashion?

Real world experiences of implementation

Are there existing resources available, which describe the findings of studies, examine the barriers, and facilitators of implementing the intervention?
Are there any recommendations for adopters of the intervention that could be incorporated into the implementation of the intervention?
Are the findings and conclusions of the assessment shared with and validated by a range of key stakeholders?

References

[56] Agency for Healthcare Research and Quality (2012), What Is the Evidence Rating?, https://innovations.ahrq.gov/help/evidence-rating (accessed on 19 February 2019).

[10] Alton-Lee, A. (2004), Guidelines for Generating a Best Evidence Synthesis Iteration, Ministry of Education New Zealand, http://www.minedu.govt.nz (accessed on 14 February 2019).

[84] Askim, J., U. Hjelmar and L. Pedersen (2018), “Turning Innovation into Evidence-based Policies: Lessons Learned from Free Commune Experiments”, Scandinavian Political Studies, Vol. 41/4, pp. 288-308, https://doi.org/10.1111/1467-9477.12130.

[39] Axford, N. et al. (2005), “Evaluating Children’s Services: Recent Conceptual and Methodological Developments”, British Journal of Social Work, Vol. 35/1, pp. 73-88, https://doi.org/10.1093/bjsw/bch163.

[78] Be You (2020), The Be You Programs Directory, https://beyou.edu.au/resources/tools-and-guides/about-programs-directory (accessed on 25 March 2020).

[41] Better Evaluation (2012), Describe the theory of change, https://www.betterevaluation.org/en/node/5280 (accessed on 20 October 2019).

[48] Blueprints for Health Youth Development (2015), Evidence-Based Programs - Standards of Evidence, https://www.blueprintsprograms.org/resources/Blueprints_Standards_full.pdf (accessed on 15 February 2019).

[82] Blueprints for health youth development (2018), Blueprints Database Standards, https://www.blueprintsprograms.org/resources/Blueprints_Standards_full.pdf (accessed on 15 February 2019).

[103] Bosworth, R., A. Professor and A. Kibria (2017), THE VALUE OF A STATISTICAL LIFE: ECONOMICS AND POLITICS Primary Investigators, https://strata.org/pdf/2017/vsl-full-report.pdf (accessed on 7 June 2019).

[97] CADTH (2017), Guidelines for the Economic Evaluation of Health Technologies: Canada 4th Edition, https://www.cadth.ca/sites/default/files/pdf/guidelines_for_the_economic_evaluation_of_health_technologies_canada_4th_ed.pdf (accessed on 17 April 2019).

[71] Child Trends (2018), , https://www.childtrends.org/what-works/eligibility-criteria.

[6] Clearinghouse for Labor Evaluation and Research (2017), About CLEAR, https://clear.dol.gov/about (accessed on 8 March 2019).

[63] Clearinghouse for Labor Evaluation and Research (2014), Guidelines for reviewing implementation studies, https://clear.dol.gov/sites/default/files/CLEAR_Operational%20Implementation%20Study%20Guidelines.pdf (accessed on 19 February 2019).

[60] Clearinghouse for Labor Evaluation and Research (2014), Guidelines for reviewing quantitative descriptive studies, https://clear.dol.gov/sites/default/files/CLEAROperationalDescriptiveStudyGuidelines.pdf (accessed on 19 February 2019).

[72] Clearinghouse for Military Family Readiness (2012), Continuum of Evidence, https://militaryfamilies.psu.edu/wp-content/uploads/2017/08/continuum.pdf (accessed on 30 January 2019).

[2] Coalition for Evidence-Based Policy (2010), Checklist For Reviewing a Randomized Controlled Trial of a Social Program or Project, To Assess Whether It Produced Valid Evidence, http://coalition4evidence.org/wp-content/uploads/2010/02/Checklist-For-Reviewing-a-RCT-Jan10.pdf (accessed on 19 February 2019).

[51] College of Policing: What Work Network (2017), Crime Reduction Toolkit, https://whatworks.college.police.uk/toolkit/Pages/Toolkit.aspx (accessed on 30 April 2019).

[5] Crime Solutions (2013), Practices Scoring Instrument, https://www.crimesolutions.gov/pdfs/PracticeScoringInstrument.pdf (accessed on 18 February 2019).

[54] Crime Solutions (2013), Program Scoring Instrument Version 2.0, https://www.crimesolutions.gov/pdfs/program-rating-instrument-v2.0.pdf (accessed on 18 February 2019).

[93] Crowley, D. et al. (2018), “Standards of Evidence for Conducting and Reporting Economic Evaluations in Prevention Science”, Prevention Science, Vol. 19/3, pp. 366-390, https://doi.org/10.1007/s11121-017-0858-1.

[62] Drost, E. (2011), Validity and Reliability in Social Science Research, https://www3.nd.edu/~ggoertz/sgameth/Drost2011.pdf (accessed on 13 February 2019).

[110] Durlak, J. (1998), “Why program implementation is important”, Journal of Prevention & Intervention in the community, Vol. 17/2, pp. 5-18.

[111] Durlak, J. and E. DuPre (2008), “Implementation matters: A review of research on the influence of implementation on program outcomes and the factors affecting implementation”, American journal of community psychology, Vol. 41/3-4, pp. 327-350.

[79] Early Childhood Australia (2020), KidsMatter has become Be You, http://www.earlychildhoodaustralia.org.au/our-work/beyou/ (accessed on 22 April 2020).

[55] Early Intervention Foundation (2019), 10 steps for evaluation success, Early Intervention Foundation, https://doi.org/12345.

[32] Early Intervention Foundation (2018), EIF Guidebook, https://guidebook.eif.org.uk/eif-evidence-standards (accessed on 14 February 2019).

[25] Education Endowment Foundation (2018), Technical appendix and process manual, https://educationendowmentfoundation.org.uk/public/files/Toolkit/Toolkit_Manual_2018.pdf (accessed on 1 February 2019).

[37] EFSA Guidance for those carrying out systematic reviews European Food Safety Authority (2010), “Application of systematic review methodology to food and feed safety assessments to support decision making”, EFSA Journal, Vol. 8/6, p. 1637, https://doi.org/10.2903/j.efsa.2010.1637.

[113] EMCDDA (2020), Xchange prevention registry, http://www.emcdda.europa.eu/best-practice/xchange (accessed on 21 April 2020).

[94] EMCDDA (2017), “Drug treatment expenditure: a methodological overview”, http://www.emcdda.europa.eu/publications/insights/drug-treatment-expenditure-measurement_en.

[40] Epstein, D. and J. Klerman (2012), “When is a Program Ready for Rigorous Impact Evaluation? The Role of a Falsifiable Logic Model”, Evaluation Review, Vol. 36/5, pp. 375-401, https://doi.org/10.1177/0193841X12474275.

[45] European Commission (2018), Guidance Document on Monitoring and Evaluation.

[43] European Commission (2011), “Towards a new system of monitoring and evaluation in EU cohesion policy”, https://ec.europa.eu/regional_policy/sources/docgener/evaluation/doc/performance/outcome_indicators_en.pdf.

[49] European Commission - Directorate-General for Health and Food Safety (2017), Criteria to select best practices in health promotion and chronic disease prevention and management in Europe, https://ec.europa.eu/health/sites/health/files/mental_health/docs/compass_bestpracticescriteria_en.pdf (accessed on 25 January 2019).

[19] European Monitoring Centre for Drugs and Drug Addiction (2020), Best practice portal, http://www.emcdda.europa.eu/best-practice_en (accessed on 27 April 2020).

[9] European Monitoring Centre for Drugs and Drug Addiction (2011), “European drug prevention quality standards”, https://doi.org/10.2810/48879.

[57] European Platform for Investing in Children (2017), Review Criteria and Process, https://ec.europa.eu/social/main.jsp?catId=1246&intPageId=4286&langId=en (accessed on 14 February 2019).

[34] Every Student Succeeds Act - ESSA (2019), Evidence for ESSA: Standards and Procedures, https://content.evidenceforessa.org/sites/default/files/On%20clean%20Word%20doc.pdf (accessed on 18 February 2019).

[109] Evidence-Based Policymaking Collaborative (2016), Evidence Toolkit: Cost Benefit Analysis.

[18] Ferri, M. and P. Griffiths (2015), “Good Practice and Quality Standards”, in Textbook of Addiction Treatment: International Perspectives, Springer Milan, https://doi.org/10.1007/978-88-470-5322-9_64.

[44] Gaffey, V. (2013), “A fresh look at the intervention logic of Structural Funds”, European Commision, https://doi.org/10.1177/1356389013485196.

[42] Ghate, D. (2018), “Developing theories of change for social programmes: co-producing evidence-supported quality improvement”, Palgrave Communications, Vol. 4/1, p. 90, https://doi.org/10.1057/s41599-018-0139-z.

[13] Gough, D., J. Thomas and S. Oliver (2019), Clarifying differences between reviews within evidence ecosystems, BioMed Central Ltd., https://doi.org/10.1186/s13643-019-1089-2.

[3] Gough, D. and H. White (2018), Evidence standards and evidence claims in web based research portals, Centre for Homelessness Impact, https://uploads-ssl.webflow.com/59f07e67422cdf0001904c14/5bfffe39daf9c956d0815519_CFHI_EVIDENCE_STANDARDS_REPORT_V14_WEB.pdf (accessed on 8 March 2019).

[104] Goverment of Netherlands (2016), Guideline for economic evaluations in healthcare.

[108] Government of Ireland (2009), How to conduct a Regulatory Impact Analysis.

[73] Graham Allen (2011), Early Intervention: The Next Steps An Independent Report to Her Majesty’s Government, http://www.childtrauma.org (accessed on 14 February 2019).

[50] Groeger-Roth, F. and B. Hasenpusch (2011), Green List Prevention: Inclusion-and Rating-Criteria for the CTC Programme-Databank Crime Prevention Council of Lower Saxony, Crime Prevention Council of Lower Saxony, https://www.gruene-liste-praevention.de/communities-that-care/Media/GreenListPrevention_Rating-Criteria.pdf (accessed on 14 February 2019).

[38] Guyatt, G. et al. (2008), “GRADE: an emerging consensus on rating quality of evidence and strength of recommendations”, BMJ, Vol. 336/7650, pp. 924-926, https://doi.org/10.1136/bmj.39489.470347.ad.

[4] Health Evidence (2018), Quality Assessment Tool, https://www.healthevidence.org/documents/our-appraisal-tools/quality-assessment-tool-dictionary-en.pdf (accessed on 8 March 2019).

[68] Hollis, S. and F. Campbell (1999), “What is meant by intention to treat analysis? Survey of published randomised controlled trials”, British Medical Journal, Vol. 42, p. 4, https://doi.org/10.1136/bmj.319.7211.670.

[83] Home Visiting Evidence of Effectiveness (2018), Assessing Evidence of Effectiveness, https://homvee.acf.hhs.gov/Review-Process/4/Assessing-Evidence-of-Effectiveness/19/7 (accessed on 19 February 2019).

[77] Home Visiting Evidence of Effectiveness (2018), Producing Study Ratings, https://homvee.acf.hhs.gov/Review-Process/4/Producing-Study-Ratings/19/5 (accessed on 19 February 2019).

[59] Housing Associations’ Charitable Trust (2016), Standard for Producing Evidence - Effectiveness of Interventions – Part 1: Specification, https://www.hact.org.uk/sites/default/files/StEv2-1-2016%20Effectiveness-Specification.pdf (accessed on 25 January 2019).

[80] Johns Hopkins University School of Education’s Center for Data-Driven Reform in Education - CDDRE (n.d.), Best Evidence Encylopedia, http://www.bestevidence.org/aboutbee.htm (accessed on 18 February 2019).

[27] Johnson, S., N. Tilley and K. Bowers (2015), “Introducing EMMIE: an evidence rating scale to encourage mixed-method crime prevention synthesis reviews”, Journal of Experimental Criminology, Vol. 11/3, pp. 459-473, https://doi.org/10.1007/s11292-015-9238-7.

[98] Karoly, L. (2012), “Toward Standardization of Benefit-Cost Analysis of Early Childhood Interventions”, Journal of Benefit-Cost Analysis, Vol. 3/1, https://doi.org/10.1515/2152-2812.1085.

[115] Karoly, L. (2010), “Toward Standardization of Benefit-Cost Analyses of Early Childhood Interventions”, SSRN Electronic Journal, https://doi.org/10.2139/ssrn.1753326.

[86] Levin, C. and D. Chisholm (2016), “Cost-Effectiveness and Affordability of Interventions, Policies, and Platforms for the Prevention and Treatment of Mental, Neurological, and Substance Use Disorders”, Mental, Neurological, and Substance Use Disorders: Disease Control Priorities, Vol. 4, https://doi.org/10.1596/978-1-4648-0426-7_ch12.

[95] Lomas, J. et al. (2018), “Which Costs Matter? Costs Included in Economic Evaluation and their Impact on Decision Uncertainty for Stable Coronary Artery Disease”, PharmacoEconomics - Open, Vol. 2/4, pp. 403-413, https://doi.org/10.1007/s41669-018-0068-1.

[33] Mathematica Policy Research (2016), Identifying Programs That Impact Teen Pregnancy, Sexually Transmitted Infections, and Associated Sexual Risk Behaviors, https://tppevidencereview.aspe.hhs.gov/pdfs/TPPER_Review%20Protocol_v5.pdf (accessed on 26 January 2019).

[89] National Academies of Sciences, Engineering, and Medicine (2016), Advancing the Power of Economic Evidence to Inform Investments in, The National Academies Press, https://doi.org/10.17226/23481.

[112] National Collaborating Centre for Methods and Tools (2010), Effective interventions: The Canadian Best Practices Portal, McMaster University, Hamilton, http://www.nccmt.ca/resources/search/69 (accessed on 15 February 2019).

[81] National Dropout Prevention Center - NDPC (2019), Rating system, http://dropoutprevention.org/mpdb/web/rating-system (accessed on 19 February 2019).

[8] National Implementation Research Network (2018), The Hexagon: An Exploration Tool. Hexagon Discussion & Analysis Tool Instructions, https://implementation.fpg.unc.edu/sites/implementation.fpg.unc.edu/files/resources/NIRN_HexagonTool_11.2.18.pdf (accessed on 19 February 2019).

[105] National Institute for Health and Care Excellence (2013), How NICE measures value for money in relation to public health interventions, https://www.nice.org.uk/Media/Default/guidance/LGB10-Briefing-20150126.pdf (accessed on 1 May 2019).

[1] Nest What works for kids (2012), Rapid Evidence Assessment, http://whatworksforkids.org.au/rapid-evidence-assessment (accessed on 19 February 2019).

[47] NESTA (2013), Standards of evidence: an approach that balances the need for evidence with innovation, https://media.nesta.org.uk/documents/standards_of_evidence.pdf (accessed on 26 February 2019).

[106] New Zealand Treasury (2015), Guide to Social Cost Benefit Analysis - July 2015, New Zealand Treasury, Wellington, http://www.treasury.govt.nz/publications/guidance/planning/costbenefitanalysis/guide/ (accessed on 2 May 2018).

[66] Newhouse J.P., H. (1993), Free for All, lessons from the Rand Health Insurance Experiment,, https://doi.org/10.7249/CB199.

[23] Norwegian Institute of Public Health (2021), Elementer i livsstilstiltak for vektreduksjon blant voksne personer med overvekt eller fedme, https://www.fhi.no/globalassets/dokumenterfiler/rapporter/2021/elementer-i-livsstilstiltak-for-vektreduksjon-blant-voksne-personer-med-overvekt-eller-fedme-rapport-2021-v2.pdf.

[22] Norwegian Institute of Public Health (2020), A systematic and living evidence map on COVID-19, https://www.fhi.no/contentassets/e64790be5d3b4c4abe1f1be25fc862ce/covid-19-evidence-map-protocol-20200403.pdf.

[96] NPC Research & Portland State University’s Center for the Improvement of Child and Family Services (2019), Conduct a Cost Analysis of Your Home Visiting Program, http://www.homevisitcosts.com/organizing-your-data.php (accessed on 22 May 2019).

[91] NSW Goverment (2017), Guide to Cost-Benefit Analysis, https://www.treasury.nsw.gov.au/sites/default/files/2017-03/TPP17-03%20NSW%20Government%20Guide%20to%20Cost-Benefit%20Analysis%20-%20pdf_0.pdf.

[100] OECD (2018), Cost-Benefit Analysis and the Environment: Further Developments and Policy Use, OECD Publishing, Paris, https://dx.doi.org/10.1787/9789264085169-en.

[87] OECD (2018), “Preface”, in Cost-Benefit Analysis and the Environment: Further Developments and Policy Use, OECD Publishing, Paris, https://dx.doi.org/10.1787/9789264085169-1-en.

[67] OECD (2017), “Making policy evaluation work: The case of regional development policy”, OECD Science, Technology and Industry Policy Papers, https://dx.doi.org/10.1787/c9bb055f-en.

[101] OECD (2006), Cost-Benefit Analysis and the Environment: Recent Developments.

[102] Office of Manangment and Budget (2018), 2018 Discount Rates for OMB Circular No. A-94.

[11] Oliver, S. et al. (2018), “Approaches to evidence synthesis in international development: a research agenda”, Journal of Development Effectiveness, Vol. 10/3, pp. 305-326, https://doi.org/10.1080/19439342.2018.1478875.

[99] OMB (1992), Guidelines and Discount Rates for Benefit-Cost Analysis of Federal Programs, https://www.whitehouse.gov/sites/whitehouse.gov/files/omb/circulars/A94/a094.pdf (accessed on 17 September 2018).

[20] Oxman, A., J. Lavis and A. Fretheim (2007), “Use of evidence in WHO recommendations”, Lancet, Vol. 369/9576, pp. 1883-1889, https://doi.org/10.1016/S0140-6736(07)60675-8.

[46] Project Oracle Children and Youth Evidence Hub (2018), Validation Guidebook: An overview of Project Oracle’s validation process, https://project-oracle.com/uploads/files/Validation_Guidebook.pdf (accessed on 14 February 2019).

[64] Puddy, R. and N. Wilkins (2011), Understanding Evidence Part 1: Best Available Research Evidence. A Guide to the Continuum of Evidence of Effectiveness, Centers for Disease Control and Prevention, Atlanta, https://www.cdc.gov/violenceprevention/pdf/understanding_evidence-a.pdf (accessed on 18 February 2019).

[90] Reduction, W. (2017), Crime Reduction Toolkit, The College of policing, https://whatworks.college.police.uk/toolkit/Pages/Toolkit.aspx (accessed on 30 April 2019).

[76] Review Srengthening Families Evidence (2019), Review Process, https://familyreview.acf.hhs.gov/ReviewProcess.aspx?id=3 (accessed on 19 February 2019).

[15] Saran, A. and H. White (2018), “Evidence and gap maps: a comparison of different approaches”, https://doi.org/10.4073/cmdp.2018.2.

[61] Scholtes, V., C. Terwee and R. Poolman (2010), “What makes a measurement instrument valid and reliable?”, Injury, Vol. 42, pp. 236-240, https://doi.org/10.1016/j.injury.2010.11.042.

[12] Shemilt, I. et al. (2010), “Evidence synthesis, economics and public policy”, Research Synthesis Methods, Vol. 1/2, pp. 126-135, https://doi.org/10.1002/jrsm.14.

[70] Social Programs That Work (2019), Evidence Based Programs, https://evidencebasedprograms.org/ (accessed on 19 February 2019).

[53] Society for Prevention Research Standards of Evidence (2015), “Standards of evidence for efficacy, effectiveness, and scale-up research in prevention science: Next generation”, Society for Prevention Research Standards of Evidence, Vol. 16/7, pp. 893-926.

[35] Strengthening Families Evidence Review - SFER (2018), Review Process, https://familyreview.acf.hhs.gov/ReviewProcess.aspx?id=3 (accessed on 19 February 2019).

[52] SUPERU (2017), An evidence rating scale for New Zealand. Understanding the effectiveness of interventions in the social sector.

[114] SUPERU (2016), Standards of evidence for understanding what works: International experiences and prospects for Aotearoa New Zealand, SUPERU, Wellington, http://www.superu.govt.nz/sites/default/files/Standards%20of%20evidence.pdf (accessed on 18 April 2018).

[75] The California Evidence-Based Clearinghouse for Child Welfare (2019), Scientific Rating Scale, http://www.cebc4cw.org/ratings/scientific-rating-scale/ (accessed on 25 January 2019).

[30] The Campbell Collaboration (2019), Campbell systematic reviews: Policies and Guidelines, https://doi.org/10.4073/cpg.2016.1.

[21] The Cochrane Collaboration (2021), Cochrane Denmark, https://www.cochrane.dk/nordic-cochrane-centre-copenhagen.

[88] The Cochrane Collaboration (2011), Cochrane Handbook for Systematic Reviews of Interventions, https://handbook-5-1.cochrane.org/ (accessed on 18 April 2019).

[36] The Community Guide (2018), The Community Guide Methodology, https://www.thecommunityguide.org/about/our-methodology (accessed on 19 February 2019).

[28] The EQUATOR Network (2020), Enhancing the quality and transparency of health research, https://www.equator-network.org/ (accessed on 21 April 2020).

[107] The Pew Charitable Trusts (2013), States’ Use of Cost-Benefit Analysis.

[16] The UK Civil Service (2014), What is a Rapid Evidence Assessment?, https://webarchive.nationalarchives.gov.uk/20140402163359/http://www.civilservice.gov.uk/networks/gsr/resources-and-guidance/rapid-evidence-assessment/what-is (accessed on 22 April 2020).

[85] Tõnurist, P. (2019), Evaluating Public Sector Innovation Support or hindrance to innovation?, OECD, Paris.

[92] Vera Institute (2014), Cost-Benefit Analysis and Justice Policy Toolkit.

[17] Washington State Institute for Public Policy Benefit (2017), Benefit-Cost Technical Documentation.

[69] What Works Centre for Children’s Social Care (2018), Evidence standards, https://wwc-evidence.herokuapp.com/pages/our-ratings-explained (accessed on 19 February 2019).

[65] What Works Centre for Local Economic Growth (2016), Guide to scoring evidence using the Maryland Scientific Methods Scale, https://whatworksgrowth.org/public/files/Methodology/16-06-28_Scoring_Guide.pdf (accessed on 24 January 2019).

[29] What Works Centre for Local Economic Growth (2015), Evidence Review Apprenticeships, https://whatworksgrowth.org/public/files/Policy_Reviews/15-09-04_Apprenticeships_Review.pdf (accessed on 13 May 2019).

[31] What Works Centre for Wellbeing (2017), A guide to our evidence review methods, https://whatworkswellbeing.org/product/a-guide-to-our-evidence-review-methods/ (accessed on 8 March 2019).

[74] What Works Clearinghouse (2020), Standards Handbook (Version 4.1), https://ies.ed.gov/ncee/wwc/Docs/referenceresources/WWC-Standards-Handbook-v4-1-508.pdf (accessed on 5 February 2019).

[24] What Works For Health (2010), Evidence Rating: Guidelines, http://whatworksforhealth.wisc.edu/evidence.php (accessed on 6 June 2019).

[58] What Works For Health (2010), Methods, http://whatworksforhealth.wisc.edu/evidence.php (accessed on 19 February 2019).

[14] White, H. (2019), “The twenty-first century experimenting society: the four waves of the evidence revolution”, Palgrave Communications, Vol. 5/1, p. 47, https://doi.org/10.1057/s41599-019-0253-6.

[26] Whiting, P. et al. (2016), “ROBIS: A new tool to assess risk of bias in systematic reviews was developed”, Journal of Clinical Epidemiology, Vol. 69, pp. 225-234, https://doi.org/10.1016/j.jclinepi.2015.06.005.

[7] Zaza, S. et al. (2000), Data Collection Instrument and Procedure for Systematic Reviews in the Guide to Community Preventive Services, http://www.thecommunityguide.org (accessed on 25 January 2019).

Notes

← 1. This will be broadly discuss in the effectiveness section.

← 2. A shadow price is an estimate of an economic value when market-based values are unavailable (e.g., no market for buying and selling emotional regulation) (Karoly, 2010[115]). The quality and consensus on shadow prices can vary by substantive area. Sometimes an estimate is only appropriate for projections in certain circumstances and should not be generalized (Crowley et al., 2018[93]).

╳

Metadata, Legal and Rights

This document, as well as any data and map included herein, are without prejudice to the status of or sovereignty over any territory, to the delimitation of international frontiers and boundaries and to the name of any territory, city or area. Extracts from publications may be subject to additional disclaimers, which are set out in the complete version of the publication, available at the link provided.

https://doi.org/10.1787/3f6f736b-en

The use of this work, whether digital or print, is governed by the Terms and Conditions to be found at http://www.oecd.org/termsandconditions.