Designing and implementing evaluations of large-scale, multi-country interventions

In this blog, we explore issues affecting how evaluations of large, multi-country interventions are designed and implemented. We reflect on an important meeting, attended by some LSHTM Centre for Evaluation staff, that discussed evaluations of novel and ambitious initiatives to achieve broad aims in global health.

Few people question that evaluation is important, but evaluations may have several goals and audiences. This diversity of perspectives on evaluation has increased as novel, large-budget initiatives to improve health in low- and middle-income settings, driven by the intense focus on achieving the MDGs, create new stakeholders in global health. Evaluators need to be clear about what their audience wants, and whether they can provide the answers in the context of burgeoning spending on wide-reaching programmes.

Recently, some LSHTM Centre for Evaluation staff attended a meeting, held by the Institute of Medicine, that discussed evaluations of novel and ambitious initiatives to achieve broad aims in global health. The examples discussed were big-budget initiatives that aimed to alter the landscape of healthcare and prevention in target countries. These included the President’s Emergency Plan for AIDS Relief (PEPFAR), the Global Fund to Fight AIDS, Tuberculosis and Malaria (GFATM), the U.S. President’s Malaria Initiative, and the Affordable Medicines Facility-malaria (AMFm) programme, which is led by LSHTM’s Kara Hanson with support from LSHTM researchers.

We welcomed the innovation and ambition in these initiatives, but had some misgivings. There are important questions to be answered about the impact and efficiency of different approaches, but we were concerned that not enough emphasis was always placed on maximising learning in the context of these projects. The importance of evaluating these initiatives is clear, but at times political considerations did not seem optimally balanced against scientific aims. For example, the evaluation of PEPFAR will not publish any country-specific findings, a decision that appears to rest on political motives or sensitivities rather than on scientific reasoning. Sometimes the balance between the urgency of the intervention and its amenability to rigorous evaluation did not feel quite right. For example, some initiatives, conspicuous in their spending of many millions of dollars on health, did not appear to have aims or objectives for evaluation clearly stated in advance, which poses enormous challenges to evaluation.

There was an interesting debate about what it means to be ‘independent’ as evaluators. Deborah Rugg, the United Nations Evaluation Group Chair, talked about “institutional, functional, and operational independence(s)”. Participants from LSHTM looked forward to seeing guidance on operationalising these ideas fleshed out in more detail. Evaluations such as these pose complex questions. How independent can evaluators be if the funding sources for evaluation are linked to the implementing agency? What degree of independence is optimal, given that better evaluation is more likely where evaluators fully understand the intervention? Can evaluations be made more independent through methodological conventions such as protocol registration and review boards? A discussion panel, which included Robert Black (Johns Hopkins), suggested a yardstick by which evaluation quality and independence might be judged: evaluations should be of sufficient quality to contribute to changing people’s minds from what they believed in advance. We were left wondering: where can independence be found in a research market which includes such huge initiatives as PEPFAR? And are people and institutions independent, or do we need systems and methods, or both?

LSHTM evaluators also reflected that we still have to learn to talk across disciplines. We were struck by how jargon from different disciplines and backgrounds sometimes got in the way of key conversations. One example related to the word “theory”. There was wide agreement that “theory” is useful, but participants from different disciplines often appeared to be talking at cross purposes. On one side, theory is used to structure data analysis plans to avoid data dredging; on the other, evaluations are themselves used for ‘theory building’, where evaluations give rise to data to parameterise models. “Triangulation” was another contested topic. Again there was agreement that different kinds of data could be used to widen and deepen understanding of results. Yet there was also discussion about “triangulation between disciplines” and the successes of pairing disciplines to look at data to find concordance, without any real clarity about the benefits this would bring. Finally, presentations about the case studies often focused on “context” and how it should be taken into account. In the PEPFAR evaluation, which sought to answer questions including “did PEPFAR work”, context was discussed during the design stage; however, the final analysis was done across countries, so it was never clear how context was addressed. At least one attendee from LSHTM thought that the discussion of context was interesting but wondered where it left us.

Overall, the meeting represented an important step forward in an area of growing importance. Presenters on each of the initiatives were honest about the challenges of bringing rigour into a political world, in particular in constructing counterfactuals. We think further discussion is warranted, focused more on the specific methods and combinations of methods that should be adopted, and on the governance of such evaluations to ensure they are rigorous, independent and accountable. Evaluation on this scale is going to continue, and a growing emphasis on maximising efficiency and value for money is a likely trend. LSHTM researchers have a lot to offer.