Monday, December 8, 2008

Focusing Evaluations on Achievement Gaps

The standard design for experimental program evaluations in educational settings may not do justice to the questions that matter most to district decision makers. In many of the sites where we have worked, the most important question concerned a gap between two populations within the district. For example, one district’s improvement plan specifically targeted the gap in science achievement between black students and white students. In another, the concern was with the performance of new, and often uncertified, teachers compared with experienced teachers. NCLB, with its requirement for disaggregating the performance of specific subgroups, has reinforced this perspective. A new science curriculum with a modest positive impact on performance across the district could be rejected if it also widened the gap between the two populations of concern.

When a new program favors one kind of student or teacher over another, we call this an interaction: an interaction between the experimental “treatment” and some pre-existing “trait” of the population involved. In experimental design, such characteristics of the people or the setting are called moderators because they are seen as moderating the impact of the new program. Moderators are often treated as secondary or even exploratory outcomes in experimental program evaluations, which are designed primarily to find out whether the new program makes an overall difference for the study population as a whole. Who does and does not get the program can be manipulated experimentally; by contrast, a moderator is a pre-existing characteristic that (usually) cannot be manipulated. While the experiment focuses on a specific program (treatment), any number of moderators can be examined after the fact.
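To make the idea concrete, here is a minimal sketch (not part of the original post) of how a treatment-by-moderator interaction is typically estimated with a regression model. The data, variable names (treat, minority, pretest, posttest), and effect sizes are all invented for illustration.

```python
# Illustrative sketch: estimating a treatment-by-moderator interaction
# on simulated student data. All names and numbers are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "treat": rng.integers(0, 2, n),      # 1 = received the new program
    "minority": rng.integers(0, 2, n),   # pre-existing trait (the moderator)
    "pretest": rng.normal(50, 10, n),    # covariate measured before the program
})
# Simulate an outcome in which the program helps minority students more:
# a +2 point overall effect plus a +4 point interaction effect.
df["posttest"] = (
    df["pretest"]
    + 2 * df["treat"]
    + 4 * df["treat"] * df["minority"]
    + rng.normal(0, 10, n)
)

# The treat:minority coefficient estimates how much the program's impact
# differs between the two subgroups -- the "gap" question.
model = smf.ols("posttest ~ treat * minority + pretest", data=df).fit()
print(model.summary().tables[1])
```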

Many of our experiments in school systems are aimed at answering a question of local interest. In these cases, we often find that the most important question concerns an interaction rather than the average impact of the experimental intervention itself. The potential moderator of interest, such as minority status, under-achievement, or certification status, can be specified in advance, based on the identified gap in performance the new program was intended to address in the first place. When the interaction is the primary outcome of interest, its status goes beyond even the emphasis that many experts place on interactions as a means of getting a fuller picture of the effectiveness of an intervention (Cook, 2002; Shadish, Cook, & Campbell, 2002). But because investigations of interactions are usually exploratory and not the primary question (except perhaps for the specific setting in which the experiment took place), it is difficult to look across studies of the same intervention and come to any generalization about the moderating effects of particular variables. Research reviews that synthesize multiple studies of the same intervention, such as those found on the What Works Clearinghouse and the Best Evidence Encyclopedia, are not concerned with interactions, even if an individual study finds one to be quite substantial. This is unfortunate because, in many studies that find no overall impact for a program, we may discover that it is differentially effective for an important subgroup. It would therefore be useful, for example, to examine whether the moderating effect of a certain variable varies more than would be expected by chance across experimental settings. This would indicate whether the moderating effect is robust or whether it depends on local circumstances.
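One standard way to ask whether a moderating effect varies more than chance across settings is a heterogeneity test on the interaction estimates themselves. The sketch below (not from the post) uses Cochran’s Q on five hypothetical studies; the estimates and standard errors are invented purely for illustration.

```python
# Sketch with hypothetical numbers: does a moderator effect vary across
# experiments more than chance would predict? (Cochran's Q test)
import numpy as np
from scipy import stats

# Interaction (moderator) estimates and standard errors from five
# hypothetical studies of the same intervention, in effect-size units.
est = np.array([0.25, 0.10, 0.32, 0.05, 0.18])
se = np.array([0.12, 0.10, 0.15, 0.09, 0.11])

w = 1 / se**2                          # inverse-variance weights
pooled = np.sum(w * est) / np.sum(w)   # fixed-effect pooled interaction
Q = np.sum(w * (est - pooled) ** 2)    # Cochran's Q heterogeneity statistic
dof = len(est) - 1
p = stats.chi2.sf(Q, dof)

print(f"pooled interaction = {pooled:.3f}, Q = {Q:.2f}, df = {dof}, p = {p:.3f}")
# A small p-value would suggest the moderating effect depends on local
# circumstances rather than being robust across settings.
```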

This situation points to the importance of conducting local program evaluations that can focus on the achievement gap of greatest concern. Fortunately, recent theoretical work by Howard Bloom of MDRC (Bloom, 2005) indicates that statistical power for detecting differences among subgroups of students in the impact of an intervention (that is, the interaction) can be greater than the power for detecting a net program impact of the same size. This means that a local experiment primarily interested in an interaction can be smaller, and less expensive, than a traditional experiment looking for an overall average effect. The need for information about gaps, together with the potentially greater efficiency of studying them, supports a strategy of conducting relatively small experiments to answer questions of local interest to a school district (Newman, 2008). Small, less expensive experimental program evaluations focused on moderating effects can provide more valuable information to decision makers than large-scale experiments intended for broad generalization, which cannot provide useful evidence for every interaction of interest to schools.
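The following is a rough simulation sketch of the kind of comparison this argument suggests, not a reproduction of Bloom’s derivation. It assumes a cluster-randomized (school-level) design with a student-level moderator; the design parameters (20 schools, 60 students per school, an intraclass correlation of 0.15, and a 0.20 standard-deviation effect) are invented for illustration. The intuition is that the subgroup contrast is estimated within schools and so is not penalized by between-school variance the way the overall impact is.

```python
# Rough power-simulation sketch (assumed design, invented parameters):
# compare power to detect a main effect vs. an interaction of the same size
# when treatment is assigned at the school level and the moderator is a
# student-level trait.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def simulate_once(rng, n_schools=20, n_per_school=60, effect=0.20,
                  icc=0.15, as_interaction=False):
    school = np.repeat(np.arange(n_schools), n_per_school)
    # Half the schools are randomly assigned to the treatment.
    treat = np.repeat(rng.permutation([0, 1] * (n_schools // 2)), n_per_school)
    moderator = rng.integers(0, 2, n_schools * n_per_school)   # student-level trait
    u = rng.normal(0, np.sqrt(icc), n_schools)[school]         # school random effect
    e = rng.normal(0, np.sqrt(1 - icc), n_schools * n_per_school)
    if as_interaction:
        y = effect * treat * moderator + u + e   # effect only via the interaction
    else:
        y = effect * treat + u + e               # same-sized overall effect
    data = pd.DataFrame(dict(y=y, treat=treat, moderator=moderator, school=school))
    m = smf.mixedlm("y ~ treat * moderator", data, groups=data["school"]).fit()
    term = "treat:moderator" if as_interaction else "treat"
    return m.pvalues[term] < 0.05

rng = np.random.default_rng(1)
reps = 100  # kept small so the sketch runs quickly
power_main = np.mean([simulate_once(rng) for _ in range(reps)])
power_inter = np.mean([simulate_once(rng, as_interaction=True) for _ in range(reps)])
print(f"power, 0.20 SD main effect:  {power_main:.2f}")
print(f"power, 0.20 SD interaction:  {power_inter:.2f}")
```

Under these invented parameters the simulated power for the interaction comes out higher than for the same-sized main effect, which is the direction of the result attributed to Bloom (2005); the exact numbers depend entirely on the assumed design.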

Empirical Education is now engaged in research to empirically verify Bloom’s observation about statistical power; we expect to be reporting the results next spring. —DN

Bloom, H. S. (2005). Randomizing groups to evaluate place-based programs. In H. S. Bloom (Ed.), Learning More From Social Experiments. New York, NY: Russell Sage Foundation.

Cook, T. D. (2002). Randomized experiments in educational policy research: A critical examination of the reasons the education evaluation community has offered for not doing them. Educational Evaluation and Policy Analysis, 24, 175-199.

Newman, D. (2008). Toward School Districts Conducting Their Own Rigorous Program Evaluations: Final Report on the “Low Cost Experiments to Support Local School District Decisions” Project. Empirical Education Research Reports. Palo Alto, CA: Empirical Education Inc.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.