This post discusses some of my recent reading on evaluation, and outlines evaluation issues and options that need to be considered as part of my doctoral project.
In an ideal world we could learn which types of interventions produce the best outcomes, whether organisational interventions (e.g. to identify the most effective decision-support and decision-making techniques), interventions addressing a particular policy issue, or interventions in larger systems aimed at particular desired social outcomes. Progressively we would learn “what works” and, thus, our ever-more successful interventions would lead to ever-more assured success (e.g. for companies and corporate strategy) and, more broadly, to social betterment over time.
This is the general rosy picture that is painted by some evaluation experts.
However, there are reasons to question whether this is feasible. For example, a paper examining the evaluation of the effectiveness of group decision-support systems discussed our limited ability to conduct the carefully controlled social experiments that such claims require. The authors assert that anyone claiming that a particular group decision-making process/technique can be readily evaluated on the basis of observed outcomes needs to rule out:
- A) The possibility that alternative group interventions at work in the same environment could produce equally satisfactory outcomes; and
- B) The possibility that alternative decisions could do as well or better than the actual choice made by the group.
A key issue is whether “real world” use of such techniques allows for robust comparisons or baselines, e.g. tests of alternative interventions (or intervention vs no intervention in “control” or comparison groups) and of alternative decisions under parallel conditions, in order to clarify what works best. Real-world contexts limit our capacity to conduct such analysis.
In other words, it would be great to be able to live the same situation over and over again (like the film Groundhog Day), and thereby suss out what works best through trial and error. Given we can’t do this, our alternatives include:
- Making or setting up comparisons of interventions/actions in broadly comparable situations, thereby trying to learn “what works” (although there are many validity threats to consider); and
- Being reflective practitioners who try to learn from experience, and to generalise those lessons to the future situations we find ourselves in.
To complicate matters further, the authors raise additional issues: 1) generalisability: “Even if such a program of research were to provide empirical support (on the basis of outcomes) for the relative superiority of a particular decision process in a specific organizational setting, the preferred use of the same decision-making process could not be generalized reasonably to other circumstances” (p.246); and 2) context-sensitivity and chance: good decisions can lead to poor outcomes, and poor decisions can lead to good outcomes, depending on the confluence of subsequent events, so “ineffective decision processes sometimes [can] result in good outcomes” (p.245). Arrrgggghhh!!
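The chance point is worth dwelling on. Here is a tiny simulation sketch (my own illustration, not from the paper, with invented numbers) showing that when chance contributes heavily to outcomes, a genuinely worse decision process can still produce the better outcome in a large minority of individual cases:

```python
# Illustrative only: process A is better in expectation than process B,
# but outcomes are noisy, so a single observed case is weak evidence.
import random

random.seed(1)

def outcome(decision_quality):
    # Outcome = underlying decision quality + a large dose of chance
    return decision_quality + random.gauss(0, 2.0)

trials = 10_000
wins_for_worse_process = 0
for _ in range(trials):
    a = outcome(decision_quality=1.0)  # the better process
    b = outcome(decision_quality=0.5)  # the worse process
    if b > a:
        wins_for_worse_process += 1

print(f"Worse process 'won' in {wins_for_worse_process / trials:.0%} of cases")
```

With noise of this size the worse process comes out ahead in roughly 40% of cases, which is exactly why judging a decision process from one or a few observed outcomes is so hazardous.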
Considerations for the reflective practitioner
I’ve read most of Adam Kahane’s work, and he often discusses his many, many years of experience. He argues that he has therefore had plenty of opportunity to trial various approaches, has made some errors along the way, and has thus also had many opportunities to learn.
On the one hand, this is a very common-sense and reasonable thing to say and do. We all try to learn from our experiences and, over time, develop greater competency.
On the other hand, it’s possible to draw the wrong lessons from situations (e.g. a good outcome may have occurred for reasons other than those we credited) and to generalise lessons unreasonably to other circumstances. We may fail to ask counterfactual questions, such as: what would have occurred if we hadn’t intervened, or had conducted a different intervention? Given we can’t go back in time, such counterfactuals will always be imperfect approximations, but they are important.
This is a very challenging problem.
If we cannot conduct social experiments in a truly scientific manner (comparing outcome data in order to work out which interventions are the most effective), what can we do? What are our options?
OPTION 1: evaluate effectiveness by focusing on the process (don’t examine outcomes)
Some scholars argue that it is too difficult to robustly link the “goodness” of outcomes to particular processes (e.g. decision-support methods), especially if the intention is to identify generalisable insights. Thus, they argue, assessments of effectiveness should focus on the process itself, not the subsequent outcomes. For example, evaluative research on scenario planning could examine whether or not a scenario intervention effectively countered ‘groupthink’ (if this was an issue ex-ante), rather than trying to link the intervention with organisational performance (i.e. the subsequent outcomes).
A process evaluation might focus on how a process influenced decision-making confidence levels or, from a social justice perspective, on whether marginalised voices/people were empowered.
Process evaluation is sometimes termed implementation evaluation, focusing on the appropriateness and quality of the project implementation – such as evaluating how well the project components connect with the goals and intended outcomes, and/or what aspects of the implementation process and project are facilitating success or acting as stumbling blocks.
This is probably not a popular option. Most practitioners wish to make (and do make) broader claims about the efficacy of their methods and processes, especially in terms of the outcomes.
OPTION 2: evaluate immediate learning effects only
Some papers in the literature focus on immediate learning effects, such as seeking to measure cognitive or relational learning. For short-term processes (e.g. a two-day workshop) this is relatively straightforward. For long-term processes it is more complicated, as there is a need to work out whether the learning outcome was caused by the intervention or by some other event(s) or factor(s) that occurred during the same period.
Some scenario scholars have sought, via pre-post designs, to measure the impact of scenario interventions on such things as participant mental models (i.e. examining them before and after) and the quality of strategic conversation in an organisation (e.g. see Thomas Chermack’s papers).
This option appears to be well-aligned with the strong learning orientation of many scenario interventions and related practices (e.g. those seeking to enable organisational learning).
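As a concrete (and deliberately simplified) illustration of the measurement side of a pre-post design, the sketch below compares hypothetical participant scores before and after a workshop; the data, scale and sample size are invented for illustration:

```python
# A minimal pre-post comparison sketch. The scores are hypothetical
# ratings (e.g. of some aspect of participant mental models) gathered
# before and after a scenario workshop.
from scipy import stats

pre = [3.1, 2.8, 3.5, 2.9, 3.3, 3.0, 2.7, 3.4]
post = [3.6, 3.2, 3.9, 3.1, 3.8, 3.5, 3.0, 3.7]

# Paired t-test: did matched participants' scores shift between the
# two time points?
t_stat, p_value = stats.ttest_rel(post, pre)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# Caveat: a significant shift only shows change over the period, not
# that the intervention caused it -- the attribution problem discussed
# above remains, especially for longer-term processes.
```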
OPTION 3: evaluate the ‘content’ (the outputs) only, e.g. the scenarios
A “left-field” option is to primarily evaluate the quality of the scenarios themselves (although not in terms of forecast accuracy). Given the related debates, this is not an unproblematic option either.
OPTION 4: adopt a “pragmatic” approach that is use-focussed
Pragmatists see the value of an evaluation as based on how it is used and the consequences of that use. For example, an evaluation might be shaped around being a relevant input into decision-making.
Pragmatism can be understood as the view that what matters is not the truth content of a theory, but its utility. What matters is that science (or evaluative research in this case) is useful.
A pragmatic paradigm of evaluation is discussed in Program Evaluation Theory and Practice (Mertens & Wilson, 2012). The aim is to produce evaluation results that “work” (e.g. demonstrating that results “work” with respect to the problem being studied) rather than making claims to have discovered truths. Evaluators “test the workability (effectiveness) of a line of action (intervention) by collecting results (data collection) that provide a warrant for assertions (conclusions) about the line of action” (p.90).
Pragmatists are also less concerned with neutrality and objectivity – for example, self-evaluation methods may be considered if they suit the specific evaluation situation and evaluation questions (in contrast with others who see issues regarding the potential for bias and the lack of controls to minimise such biases).
Mertens & Wilson (2012) identify John Owen as an Australian evaluator who has contributed to the pragmatic evaluation paradigm. Owen points to four purposes:
- “Clarification” evaluation answering questions about desired outcomes and the match between the program design and the desired outcomes;
- “Monitoring” evaluation comparing implementation of programs at different sites and at different times as a way of providing evidence for improving program effectiveness;
- “Impact” evaluation looking at the achievement of outcomes for the purposes of program funders and stakeholders; and
- “Proactive” evaluation seeking to inform planning decisions needed for new programs or for substantial revisions of existing programs.
It might still be useful to collect data on the impacts of an intervention, and perhaps to also compare these with the impacts that were desired, even though we cannot conduct a true experiment. The core issue seems to be the generalisability of the findings. For example, the impact assessment might suggest the program is achieving the desired outcomes and should be continued, but does this tell you whether or not it would work elsewhere? Such evaluation could actually contribute to negative consequences and wasted resources if it leads to poor decisions about the use of the intervention elsewhere.
What Owen terms “monitoring” evaluation sounds similar to ‘realist’ evaluation (of Pawson & Tilley and others). The key issue is what we can learn by comparing different sites and times – will this lead to appropriate generalisations that inform future practice, or to the wrong conclusions?
OPTION 5: adopt a ‘realist’ evaluation approach
Another idea is to ground evaluation in realist philosophy. Realists tend to be sceptical of traditional experimental methods when evaluating ‘social’ interventions (as opposed to, say, testing a new drug through standard randomised controlled trials), and argue that it is unrealistic to expect the predictable results of such controlled experiments in the “real world” settings of social or organisational interventions.
Realism is a post-positivist perspective, which Kazi (2001) argues is inclusive of others (empirical practice that focuses on measurable outcomes, interpretivist approaches, and pragmatic approaches) and distinct in seeking to develop theoretical understandings so that the inquirer “can explain the causal mechanisms, and the conditions under which certain outcomes will or will not be realised”.
According to the realist worldview, program/intervention outcomes cannot be explained in isolation: “they can only be explained in the sense of a mechanism that is introduced to effect change in a constellation of other mechanisms and structures, embedded in the context of pre-existing historical, economic, cultural, social and other conditions” (Kazi, 2001). The effectiveness of a program/intervention “is apprehended with an explanation of why the outcomes developed as they did, and how the programme was able to react to other underlying mechanisms, and in what contexts”.
The ‘realist effectiveness cycle’ (Kazi, 2001) involves a dialectical relationship between the articulation of intervention models/theory and the realities of practice, in a never-ending cycle that tests and refines intervention models and recognises the complexities of practice.
Without going too deeply into theory, a realist approach appears to present the following opportunities: 1) considering measurable outcomes whilst also using additional methods to consider the wider questions noted above (an inclusive approach); 2) prompting the development of more sophisticated propositions about how and why interventions generate particular outcomes; and 3) providing a conceptual framework for designing research that can help develop an understanding of what conditions are required for certain outcomes to be realised – thus addressing generalisation questions. It also presents challenges in applying the emerging realist paradigm for evaluation research to new areas – such as scenario interventions and related discursive/participatory interventions.
Some exponents of ‘realist’ evaluation further argue that evaluation should be viewed as a process of theory-building and theory-testing. Such theory is based, in part, on a ‘generative’ theory of causality (see Pawson & Tilley’s work). A ‘program’ is viewed as a theory and model of intervention, and a core aim of evaluation is to test the extent to which these models are/aren’t analogous with reality.
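To make this slightly more concrete, here is a minimal sketch of how realist context-mechanism-outcome (CMO) configurations, the building blocks of Pawson & Tilley’s approach, might be recorded as candidate programme theories to test against cases; the data structure and example entries are my own illustration, not a standard realist toolkit:

```python
# Recording candidate programme theories as CMO configurations.
from dataclasses import dataclass

@dataclass
class CMOConfiguration:
    context: str    # conditions in which the intervention operates
    mechanism: str  # the generative process the intervention may trigger
    outcome: str    # the result pattern expected if the theory holds

# Illustrative theories for a hypothetical scenario intervention
theories = [
    CMOConfiguration(
        context="senior team with entrenched consensus",
        mechanism="scenarios legitimise dissenting views",
        outcome="wider range of strategic options considered",
    ),
    CMOConfiguration(
        context="team already open to challenge",
        mechanism="scenarios legitimise dissenting views",
        outcome="little change in the range of options",
    ),
]

# Evaluation then asks, case by case, which configurations the evidence
# supports, refining the programme theory rather than issuing a single
# verdict that the method "works".
for t in theories:
    print(f"In context [{t.context}], if [{t.mechanism}], expect [{t.outcome}]")
```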
Currently, most published evaluations of scenario interventions and related practices adopt a process focus and, linked with this, investigate short-term learning effects. This appears partly to be a reflection of the issues faced when seeking to evaluate on the basis of outcomes. Adoption of newer evaluation research paradigms (e.g. realist, pragmatic) may enable new approaches that both recognise the complexities of practice and enable consideration of program/intervention outcomes.