It is commonly recognized that formative assessment can have a powerful effect on student learning. Formative assessment transforms the learning space by highlighting teachers’ responsibility to effectively teach skills and understandings. Whether the student learns when taught becomes the most critical feedback to the teacher. When successful learning can be measured, and successful learning is the measure of effective teaching, teaching can become more focused. In a certain sense, and as noted by Keller in 1968, we might say that in formative assessment the learner is never wrong, since the learner’s response is feedback to the teacher.
After decades of research in education and educational measurement, we now understand that learning is, in fact, the most predictable outcome of highly effective instruction. Yet, when learning is not occurring, education has been too quick to ask “What is wrong with the learner?” when what educators should first ask is “What is wrong with the instruction?” Formative assessment can help us ask the right question and make adjustments in order to improve learning. Using learning as the barometer of instruction predated curriculum-based measurement, with fields like precision teaching and, later, applied behavior analysis.
Applied behavior analysis experiments in education introduced the notion of collecting baseline data using brief timed measures of student performance, implementing instructional change and examining change in student performance in order to identify and implement instructional strategies that improved learning for students. Curriculum-based measurement (CBM) grew out of applied behavior analysis and largely represented an effort to standardize the collection of the data points that could be used to sensitively and accurately reflect learning gains. Where behavior analysis in education was concerned with the independent variables, that is, the conditions that could be experimentally demonstrated to cause improvements in learning, CBM was concerned with the dependent variables, the standardization of the outcome measures. With CBM, measures of student learning became less of a behavioral observation and more of a score with all the attendant psychometric benefits.
Beginning in the 1990s, metrics like oral reading fluency or words read correctly per minute became standard material for studying and evaluating components and even full methods of instruction for students in reading. The idea of CBM, as set forth in an especially influential article written by Lynn Fuchs and Stan Deno in 1994, was that CBM functioned as a general outcome measurement system whereby several individual skills could be taught during intervention sessions. Consequently, CBM used as a general outcome measure could reflect linear gains over some period of time to determine whether students were learning as expected, as noted by Fuchs and Deno in 1991.
Work by Noell and team in 1998 typifies this approach, using words read correctly per minute to find the right instructional tactic for a student. Following a period of baseline data collection, they examined the effect of reward followed by skill-building instruction to find the most effective instructional tactic for a student. The first phase is baseline, the second phase is contingent reward, and the final phase is reward with modeling and practice. These data indicate that this student benefits the most from rewards plus modeling plus practice.
The capacity to model learning and to compare that process to some expected rate of learning made measurement of learning generalizable while also enabling data-based decision-making in schools. CBM and data-based decision-making subsequently became the basis for response to intervention and multi-tiered systems of support. The evolution of reading assessment and intervention in schools is an exciting story that involves real improvements in literacy for children.
However, in mathematics, general outcome measures have been elusive. There is not an analogous metric like “words read correctly per minute” that transcends a sufficiently broad period of academic development in mathematics. Instead, researchers have attempted to construct measures in math using a curricular sampling approach. What “curricular sampling” means is that developers and researchers have attempted to identify 3-5 skills that are reflective of the most important mathematical learning that is expected to occur during a school year and to provide problem types reflecting those skills on a mixed-skill probe. The rationale for curricular sampling is that as instruction progresses (and learning occurs), gains will be observed as children master skills included in the measure.
Research has informed the construction of multiple-skill measures to maximize their technical characteristics, for example, limiting the total number of skill types measured and arranging the problem types in a stratified rather than random way, as proposed by Methe in 2015. Nevertheless, and according to Foegen and others in 2007, the technical characteristics of these measures have always been weaker than those reported for reading CBM.
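To make the contrast between stratified and random arrangement concrete, here is a minimal sketch. It assumes that "stratified" means every block of problems contains each problem type exactly once; that reading, the skill labels, and the probe length are illustrative assumptions, not the procedure used in the cited research.

```python
import random

# Hypothetical skill labels for illustration; not taken from any published probe.
SKILLS = [
    "add fractions",
    "subtract fractions",
    "multiply fractions",
    "divide fractions",
]


def build_probe(skills, n_items, stratified=True, seed=0):
    """Return a list of skill labels, one per problem slot on the probe."""
    rng = random.Random(seed)
    if not stratified:
        # Random arrangement: each slot draws a skill independently, so some
        # skills may end up over- or under-represented on a given probe.
        return [rng.choice(skills) for _ in range(n_items)]
    probe = []
    while len(probe) < n_items:
        # Stratified arrangement (as assumed here): each block holds every
        # skill exactly once, shuffled within the block.
        block = list(skills)
        rng.shuffle(block)
        probe.extend(block)
    return probe[:n_items]


stratified_probe = build_probe(SKILLS, n_items=20, stratified=True)
random_probe = build_probe(SKILLS, n_items=20, stratified=False)
```

With 4 skills and 20 slots, the stratified probe always carries 5 problems of each type, whereas the random probe carries whatever the draw happens to produce. That difference is one plausible reason alternate forms built by stratification would be more comparable to one another.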
Two problems have become apparent with regard to multiple-skill measures intended to function as general outcome measures in math, and both are problems of sensitivity. CBMs are used during screening to identify students who may be at risk for poor learning outcomes and in need of intervention. With multiple-skill math measures, only a small number of problem types (no more than 4-5) can be used on a single scale, so the measures are constructed choosing 4-5 operational skills that are important for children to master during the school year. Therefore, children are assessed on problem types that they have not yet been taught how to solve, especially at the beginning of the monitoring period. Thus, at the beginning of the monitoring period, score ranges will be severely constrained. These constrained scores destroy the capacity to use those scores to make screening decisions, which is especially important during the first half of the monitoring period when supplemental intervention could most usefully be added to prevent or repair detected gaps.
In the following example, all but two children score within the frustrational range of performance. The score distribution is constrained with many low scores. In classrooms like this, it is technically impossible to determine who is truly at risk because so many children appear to be at risk (or stated in another way, the measure is not sensitive enough to detect risk here).
Similarly, in terms of progress monitoring, because the measure is intended to model growth during the year, the gains between each assessment occasion will be so minimal that characterizing the success of instructional changes relative to “typical” growth becomes technically very difficult. In concrete terms, a student in intervention might successfully improve their ability to multiply single-digit whole numbers and identify common factors, but a mixed-skill measure that includes mostly fraction operations is unlikely to detect these gains and therefore will not be very useful as a basis for adjusting the intervention or interpreting its effects. This problem has been noted in research studies with subsequent recommendations to collect several weeks of data before a decision can be made about whether the specific intervention is working for the student. Given that the school year generally lasts about 36 weeks, it is impractical to require 10 weeks or more of data collection, as that would only allow the intervention to be adjusted 1-3 times during an entire school year and would cause students to experience perhaps 10 weeks of an ineffective intervention before the teacher could safely conclude the intervention was, in fact, ineffective for the student. In the example below, the blue line shows the amount of growth that would be required for the student to reach mastery; the amount of growth required for the student to perform at least above the at-risk range is shown in orange. The first intervention did not produce sufficient growth for the student to avoid academic risk. In other words, the first intervention was not effective for the student. The second intervention was effective; yet, by the time an effective intervention was installed, 10 weeks of instruction had already passed, and 6 more weeks were required to determine that the second intervention was working for the student.
Given the limitations of sensitivity with math CBM, researchers began to wonder whether more specifically defined skills that were being taught in a known sequence could be measured directly and with greater precision, permitting more fine-tuned decision-making about progress even if it changed the way progress was measured. In other words, if the right skills could be measured at the right moments of instruction, these data might be meaningful to formative assessment decisions. Such measures would have to change more frequently, resulting in a series of shorter-term linear trends. Summative metrics like rate of skill mastery could function as key decision metrics. We think of mastery measurement in math as a series of “Goldilocks” measures (i.e., the right measure at the right time) closely connected to grade-level learning in mathematics. In other words, the key is that a well-constructed technically equivalent measure of a single skill is used when that skill is being taught to model learning from acquisition to mastery.
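The comparison between observed growth and required growth can be sketched in a few lines. This is an illustration under stated assumptions: scores are collected once per week, the trend is an ordinary least-squares slope, and the function names and example scores are hypothetical rather than drawn from any cited study.

```python
def ols_slope(scores):
    """Least-squares slope of scores over equally spaced weeks (0, 1, 2, ...)."""
    n = len(scores)
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def required_slope(start_score, goal_score, weeks_remaining):
    """Weekly gain needed to move from start_score to goal_score in time."""
    return (goal_score - start_score) / weeks_remaining


def adjustments_per_year(school_weeks, weeks_per_decision):
    """How many full decision windows fit in the school year."""
    return school_weeks // weeks_per_decision


# Hypothetical weekly scores for one student during a first intervention.
weekly_scores = [10, 11, 10, 12, 11, 12, 13, 12, 13, 14]
observed = ols_slope(weekly_scores)
needed = required_slope(start_score=10, goal_score=40, weeks_remaining=36)
on_track = observed >= needed
windows = adjustments_per_year(school_weeks=36, weeks_per_decision=10)
```

With a 36-week year and a 10-week decision window, only three full windows fit, which is the practical ceiling behind the "1-3 times" figure above. A shallow observed slope relative to the required slope is what signals that an intervention should be changed.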
In the example below, a typical flow is shown for skills and instructional tactic. In this sample case, a student begins intervention for fluency-building in multiplication facts 0-12. After reaching mastery, the intervention shifts to establish a new skill, which is multiplying with fractions. After reaching the instructional range, the intervention shifts to building fluency in multiplying with fractions. After reaching mastery in multiplying with fractions, the intervention shifts yet again to establish division of fractions. These instructional adjustments can be made much more quickly than would be possible with less sensitive measures (e.g., multiple-skill measures as shown in the last figure). Rapid intervention adjustment optimizes intervention intensity for the student. Multiple-skill measures should also reflect gains, but with less steep improvement.
A series of recent studies published by VanDerHeyden and others, as well as Solomon and team (the latter in submission), have demonstrated that such an approach can meet conventional standards of technical adequacy, including reliability and classification accuracy, while also providing a closer connection to the process of instruction and permitting more sensitive feedback to the teacher about the instructional effects, thus allowing the teacher to make more rapid and fine-tuned adjustments to the teaching technique to improve learning. For example, Burns and colleagues used mastery measurement in mathematics to predict skill retention and more rapid learning of more complex associated skills, providing the first systematic replication of the decision criteria set forth by Deno and Mirkin in 1977. This study suggested a method and a set of decision rules to determine skill mastery during instruction. In other words, this study provided a framework teachers could use in order to assess students in the class using a 2-minute classwide measure and know whether students required additional acquisition instruction, fluency-building opportunities, or advancement to more challenging content. Subsequently, such measures have been used to specify optimal dosages of instruction (as noted by Codding in 2016 and Duhon in 2020), and rates of improvement have been empirically examined in large-scale reviews to characterize the dimensions of intervention conditions that can affect rates of improvement on mastery measures (e.g., dosage of intervention; skill targeting), according to a recent study published by Solomon and others. Mastery measures also yield datasets that can be used to determine classwide and individual student risk (i.e., screening) and to evaluate programs of instruction more generally.
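The flow described above can be sketched as a small decision rule. The skill sequence mirrors the sample case; the cut scores, the `CutScores` type, and the `next_step` function are hypothetical placeholders for illustration, not published decision criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CutScores:
    instructional: int  # hypothetical minimum score for the instructional range
    mastery: int        # hypothetical minimum score for mastery


# Skill sequence taken from the sample case described in the text.
SKILL_SEQUENCE = [
    "multiplication facts 0-12",
    "multiplying with fractions",
    "division of fractions",
]


def next_step(skill_index, score, cuts):
    """Return (skill_index, tactic) for the next intervention session."""
    if score >= cuts.mastery:
        if skill_index + 1 < len(SKILL_SEQUENCE):
            # Mastery reached: move on and establish the next skill.
            return skill_index + 1, "establish new skill"
        return skill_index, "sequence complete"
    if score >= cuts.instructional:
        # Instructional range: the skill is established, so build fluency.
        return skill_index, "build fluency"
    # Below the instructional range: keep establishing the current skill.
    return skill_index, "establish skill"


cuts = CutScores(instructional=20, mastery=40)  # illustrative values only
```

For instance, `next_step(0, 45, cuts)` moves the student from multiplication facts to establishing multiplication with fractions, and `next_step(1, 25, cuts)` keeps the student on that skill but shifts the tactic to fluency building. Because the rule is re-evaluated after every single-skill probe, the tactic can change as soon as the score crosses a cut.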
The most exciting news from emerging measurement research in mathematics is the possibility of using technically strong (reliable, generalizable) measurements in highly efficient ways to drive instructional changes in the classroom the next day. In contrast with norm-referenced rules that simply tell decision-makers how students perform relative to other students, mastery measurement data can be used to tell us how likely a student is to thrive given specific instructional tactics. Knowing that a student performs below the 20th percentile tells us nothing about which instructional tactics are likely to benefit that student and whether that student is likely to respond to intervention when given optimal instruction. But knowing that a student performs at a level that predicts they will remember what has been taught, will experience more robust and more efficient learning of more complex associated content, and can adapt the skill or use the skill under different task demands is highly useful to instruction. From the teacher’s perspective, knowing that the student performs worse than other students means nothing if the assessment does not help the teacher solve the problem. Mastery measurement in math is necessary for teachers to know whether instruction is working for students and what instruction their students need the most.
Burns, M. K., VanDerHeyden, A. M., & Jiban, C. (2006). Assessing the instructional level for mathematics: A comparison of methods. School Psychology Review, 35, 401-418.
Codding, R., VanDerHeyden, A. M., Martin, R. J., & Perrault, L. (2016). Manipulating treatment dose: Evaluating the frequency of a small group intervention targeting whole number operations. Learning Disabilities Research & Practice, 31, 208-220.
Deno, S. L., & Mirkin, P. K. (1977). Data-based program modification: A manual. Reston, VA: Council for Exceptional Children.
Duhon, G. J., Poncy, B. C., Krawiec, C. F., Davis, R. E., Ellis-Hervey, N., & Skinner, C. H. (2020). Toward a more comprehensive evaluation of interventions: A dose-response curve analysis of an explicit timing intervention. School Psychology Review. https://doi.org/10.1080/2372966X.2020.1789435
Foegen, A., Jiban, C., & Deno, S. (2007). Progress monitoring in mathematics: A review of the literature. The Journal of Special Education, 41, 121-139.
Fuchs, L. S., & Deno, S. L. (1991). Paradigmatic distinctions between instructionally relevant measurement models. Exceptional Children, 57, 488-500.
Keller, F. S. (1968). “Good-bye, teacher…” Journal of Applied Behavior Analysis, 1(1), 79-89. https://doi.org/10.1901/jaba.1968.1-79
Methe, S. A., Briesch, A. M., & Hulac, D. (2015). Evaluating procedures for reducing measurement error in math curriculum-based measurement probes. Assessment for Effective Intervention, 40, 1-15. http://dx.doi.org/10.1177/1534508414553295
Noell, G. H., Gansle, K. A., Witt, J. C., Whitmarsh, E. L., Freeland, J. T., LaFleur, L. H., et al. (1998). Effects of contingent reward and instruction on oral reading performance at differing levels of passage difficulty. Journal of Applied Behavior Analysis, 31, 659-663.
Solomon, B. G., Payne, L. L., Campana, K. V., Marr, E. A., Battista, C., Silva, A., & Dawes, J. M. (2020). Precision of single-skill math CBM time-series data: The effect of probe stratification and set size. Journal of Psychoeducational Assessment, 38, 724-739.
VanDerHeyden, A. M., & Broussard, C. (2020). Construction and examination of math subskill mastery measures. Assessment for Effective Intervention. Advance online publication. https://doi.org/10.1177/1534508419883947
VanDerHeyden, A. M., Broussard, C., & Burns, M. K. (2020). Classification agreement for gated screening in mathematics: Subskill mastery measurement and classwide intervention. Assessment for Effective Intervention. Advance online publication. https://doi.org/10.1177/1534508419882484
VanDerHeyden, A. M., Codding, R., & Martin, R. (2017). Relative value of common screening measures in mathematics. School Psychology Review, 46, 65-87. https://doi.org/10.17105/SPR46-1.65-87