A common test for all pupils gives students, parents and schools a clearer picture of each student's level of knowledge and of the state of education in the country. Whilst there is evidence that such tests are an incentive to achieve learning goals, criticism remains about their effect on non-examined subjects. What do scientific studies on examinations actually conclude?
In Portugal, at the end of each school year, students sit national exams. The advantages and disadvantages of this external evaluation method have been among the most debated themes in the Portuguese education system, and the same debate is present in many other countries.
The benefits of having external tests that are the same for all pupils include access to information about the evolution of schools and the education system; a greater incentive to achieve learning goals, especially in schools with more difficult pupils; and help in consolidating content.
On the downside, these tests have been criticised as reductive: they address only part of pupils' knowledge, encourage teaching focused on test-taking techniques, harm subjects that are not externally assessed, and cause greater anxiety among pupils and teachers.
Many of these arguments were reinforced after the United States adopted the well-known No Child Left Behind policy in 2002, one of the main pillars of which was the widespread use of centralised testing with implications for the careers of students, schools, and teachers.
In this discussion, it is worth mentioning two studies in the economics of education which, using different types of data, tried to assess the impact of different evaluation methods, in particular the change in student achievement after the introduction of external, standardised tests.
One of the most relevant studies in this area focused on a policy introduced in Chicago from the 1996/1997 school year, when external tests were adopted for students at the end of grades 3, 6 and 8. Students who did not reach a certain minimum level in these tests had to attend a six-week summer school, at the end of which they retook the tests; those results determined whether or not they moved on to the next grade. The tests also affected schools: those in which fewer than 15% of pupils reached a certain level entered a review process that could lead to their restructuring.
Using data on almost 800,000 students between 1993 and 2000, the study analyses the impact of introducing these tests by measuring how they changed the trend in student achievement in Chicago over time, and how that trend diverged from other parts of the country where tests of this type had not been introduced. The impact of the policy on student achievement was positive and significant, comparable in magnitude to that of the well-known STAR programme, which reduced class sizes from 22 to 15 students. These impacts were smaller for younger pupils in grades 3 and 6 and larger for pupils in grade 8.
The study also tried to trace the origins of the improvement in students' results by analysing the individual test items in detail. In mathematics, the improvement stemmed above all from better results in computation; in reading, it was driven by greater student effort in completing the test, reflected in a larger number of items answered and a larger number of correct answers in the final sections of the test. The study also found that the introduction of these tests led to strategic reactions from teachers, such as more students being placed in special education and more preventive retention of students (both of which reduce exam participation among students with greater difficulties), and even less time being allocated to subjects not covered by the exams.
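The comparison described above (the change in Chicago set against the change elsewhere) follows the general logic of a difference-in-differences calculation. The sketch below is a deliberately simplified illustration of that logic, not the study's actual model, which works with trends over several years. The function name and all the numbers are hypothetical and are not taken from the study.

```python
def difference_in_differences(treated_before, treated_after,
                              control_before, control_after):
    """Change in the treated group minus change in the comparison group."""
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# Hypothetical mean scores in standard-deviation units (NOT the study's data).
effect = difference_in_differences(
    treated_before=0.00, treated_after=0.30,   # district that introduced the tests
    control_before=0.00, control_after=0.10,   # comparison districts without them
)
print(f"Estimated effect: {effect:.2f} standard deviations")
```

Subtracting the comparison group's change removes whatever would have happened anyway, such as a general upward trend in scores, so that only the part of the change associated with the policy remains.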
What if, rather than considering a single education system, we compare different realities? This can be done by measuring how different types of assessment affect results in internationally comparable tests such as PISA. To this end, a very recent study used PISA data from 2000 to 2015, comprising more than two million observations from 35 OECD countries and 24 non-member countries. In their analysis, the authors consider four types of evaluation at 15 years of age, the age at which the PISA test is taken:
- standardised and centralised tests, the results of which are public and can be used as a comparison between schools and pupils - similar, in the Portuguese case, to national exams;
- centralised tests which are used only to monitor the results of students, teachers and schools, and whose results are not public - in the Portuguese case they would be close to the format of the assessment tests;
- internal school tests, corresponding to the tests normally set by teachers to measure the knowledge of their students;
- tests used for internal monitoring of teachers' work - a type of assessment which is generally absent from the Portuguese reality.
Since this information covers a relatively long period of time (15 years), the authors examine how variations in evaluation policy within each country and over time affected the results of the international Reading, Mathematics and Science tests organised by the OECD. The estimates show that the strongest impacts come from systems in which, in the year under review, there was some sort of assessment mechanism that allowed for comparison between schools and pupils, leading to an increase in PISA test scores of between 23 and 28 points. These impacts proved even more pronounced in countries with lower PISA results in the first year of testing, that is, countries that started from a lower baseline than the other participants.
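To put the 23 to 28 points in perspective, the gain can be expressed in standard-deviation units, given that PISA scores have an average of 500 points and a standard deviation of 100. The short sketch below does only this arithmetic; the constant and function names are illustrative choices, not part of the study.

```python
PISA_SD = 100.0  # standard deviation of the PISA scale, as stated in the notes


def points_to_sd(points, sd=PISA_SD):
    """Express a score difference as a fraction of a standard deviation."""
    return points / sd


for gain in (23, 28):
    print(f"{gain} PISA points = {points_to_sd(gain):.2f} standard deviations")
```

Read this way, the estimated gain is roughly a quarter of a standard deviation.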
There is thus evidence that conducting and maintaining external testing has beneficial impacts on student outcomes, and is an important source of information for families, schools and teachers.
The debate around this topic opens the door to several areas of research, notably the complementarity between external and internal evaluation, and more and better empirical studies of how these tests change teachers' teaching methods.
Despite their relevance, tests are one instrument of educational policy that should be framed and complemented by others, so that results improve across all subject areas and for all types of students.
 https://science.sciencemag.org/content/319/5865/966.full; https://journals.sagepub.com/doi/10.1111/j.1745-6916.2006.00012.x
 https://www.gse.harvard.edu/news/ed/18/01/testing-charade; https://www.epi.org/publication/books_grading_education/
 These impacts correspond to a positive and statistically significant effect of between 20 and 30 percent of a standard deviation.
 PISA scores have an average of 500 points and a standard deviation of 100.
Bergbauer, A. B., Hanushek, E. A., & Woessmann, L., «Testing», NBER Working Papers 24836, National Bureau of Economic Research, 2018.
Jacob, B. A., «Accountability, incentives and behavior: The impact of high-stakes testing in the Chicago Public Schools», Journal of Public Economics, 89 (5–6), 2005, pp. 761–796.
Karpicke, J. D., & Roediger, H. L., «The critical importance of retrieval for learning», Science, 319 (5865), 2008, pp. 966–968.
Koretz, D., The Testing Charade. Pretending to Make Schools Better, University of Chicago Press, 2017.
Roediger, H. L., & Karpicke, J. D., «The Power of Testing Memory: Basic Research and Implications for Educational Practice», Perspectives on Psychological Science, 1 (3), 2006, pp. 181–210.
Ruffin, V. D., «Grading Education: Getting Accountability Right», Journal of Educational Administration, Vol. 47, n.º 5, 2009, pp. 678–680.