On Measuring Global Reading Progress

Reflections and comments on ACER’s Next steps: measuring reading progress

Strong governments and strong global institutions are important for defining, monitoring and addressing inequality in education. These three policy activities are linked by policy narratives that need to be strong, coherent and consistent to garner global legitimacy. The efforts of ACER’s Centre for Global Education and Monitoring are to be commended on this front.

Work on the United Nations’ Sustainable Development Goals (SDGs) goes back to 1972, and the 2030 agenda focusing on Poverty, Food, Health, Education, Gender and Water is laudable, and one to which we could all agree. However, agreement on implementation requires the legitimation of a stronger narrative, and there are elements of the ACER approach that I would like to explore in this blog.

There are many things to like about ACER’s approach; the use of Item Response Theory (IRT) to develop a commonly agreed scale is one of them. IRT is a proven methodology for system and national evaluation, even though the methodology becomes suspect at the school, class and student levels. The other welcome element is the use of pairwise comparisons in the development of content. The recent increase in the use of teacher-based pairwise comparison is welcome because it re-engages the teaching profession with scale formation, an engagement that has atrophied over recent decades through the use of IRT scaling methodologies. However, in the ACER proposal teacher engagement seems limited to pairwise comparisons in preliminary item selection, and does not seem to extend to international agreement on content.
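As an aside, for readers unfamiliar with how pairwise comparisons become a scale: one common approach to comparative judgement is the Bradley-Terry model, fitted below with a simple iterative (minorisation-maximisation) scheme. The paper does not specify ACER’s exact method, and the items and judgement counts here are entirely illustrative.

```python
# A minimal Bradley-Terry fit for pairwise-comparison data.
# wins[i][j] = number of times item i was judged 'harder' than item j.
# All data and item indices are invented for illustration.

def bradley_terry(wins, iters=200):
    n = len(wins)
    strengths = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins for item i
            # Hunter's MM update: wins divided by expected comparison weight
            denom = sum((wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                        for j in range(n) if j != i)
            new.append(w_i / denom if denom else strengths[i])
        total = sum(new)
        strengths = [s * n / total for s in new]  # normalise each round
    return strengths

# Three items: item 2 judged hardest most often, item 0 easiest.
wins = [[0, 2, 1],
        [8, 0, 3],
        [9, 7, 0]]
scale = bradley_terry(wins)
assert scale[2] > scale[1] > scale[0]  # recovered difficulty ordering
```

The fitted strengths give each item a position on a single scale, which is the sense in which teacher judgements can feed directly into scale formation.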

Where the proposal is likely to encounter legitimation issues is in the hypothesis that educational skills are universal across the target countries and can be described on a common scale. Technically, of course, this can be done; I have rarely seen test data that does not scale, and where some items do not scale properly they can be removed for ‘mysterious item reasons’. However, there is bound to be concern about the legitimacy of claims to the universality of scales developed in this manner.

As I have argued elsewhere [on a unifying principle], the notion of being able to universally ‘identify where a student is’ is problematic. There are many ways of describing this issue. One way is to say that it is too Kantian, ignoring the work of Hegel in showing that knowledge is historically and socially located, and the work of Marxists showing that formulations of knowledge can reinforce disadvantage. Another way is to describe the approach as too metaphysical, presupposing a universal Cartesian space in which students can be located. Realism is yet another word that comes to mind: an approach that assumes that what IRT measures actually exists in reality. Again, as I have argued elsewhere [constellation and continuum], the continuum metaphor is only one way to describe learning progress. So the observation that ‘progression occurs in a somewhat lumpy way’ is more than likely a reflection of the IRT model or metaphor, and not a phenomenon of the underlying reality of learning. This is not to discredit the validity of the IRT model or of results derived from it for the purpose of international evaluation; it simply questions the universality of any claims made.

An alternative to presupposing universal realism across nations and cultures on matters such as reading and mathematics is to develop a procedure for SDG countries to agree on what is common to all in these content areas, create a common scale around that agreed content, and then report explicitly to that effect. That is, report that the scales represent what has been agreed to be common, not what is considered universal and enduring. The claims to universality, together with the described content methodology, could otherwise be characterised as cultural appropriation followed by cultural imperialism, an approach likely to meet with resistance from teachers and others at some point. People are social and cultural beings who use language to express themselves socially and culturally. Reading progress is of course important to these expressions and for prosperity, but the expressions are also specific to each cultural context and not a universal function of language.

French President Charles de Gaulle’s famous 1962 question, “How can you govern a country which has two hundred and forty-six varieties of cheese?”, provides a good example of where language equivalence does not mean cultural equivalence. Cheese (Australian English), kaas (Dutch) and fromage (French) are linguistic equivalents, but Australians have Cheddar and Tasty, the Dutch have Edam and Gouda, and the French have a much broader variety. Claims to social and cultural equivalence based on the simple language equivalence of ‘cheese’ are therefore likely to meet resistance. Reporting with claims to universality based on assessments that are only linguistically equivalent could therefore be perceived through a hegemonic narrative rather than the emancipatory one sought by the UN.

It is difficult to know the status of the paper on which I am commenting (research or marketing). It describes a comprehensive and worthwhile exercise, but one that will require comprehensive consultation and discourse among target countries to develop legitimate measures acceptable to all.

The Demise of Teacher Professional Judgement

Follow up to Constellation or Continuum – metaphors for assessment

There are many ways in which teacher professional judgement can shape schooling. Teachers can participate in the development of study designs, curriculum and syllabus, and they can also participate in exam setting, exam marking and standard setting. In this way teachers perform sophisticated social roles in mediating between systems and the lifeworld of students, as well as in setting and maintaining educational norms and expectations on behalf of the community. This kind of participation, where teachers both contribute to the creation of norms and learn how to teach them, is present in all systems to some extent, and highlights the important roles of moral agent and moral leader that teachers can have. However, there are currently two developments working against teachers taking on system roles as moral agents: 1) the instrumental reasoning of mathematical models, and 2) the post-conventional/post-traditional nature of technology-based education, which makes teacher participation problematic.

Instrumental Reasoning

Where once curriculum and assessment were reflections of social expectation (including the expectations of industry), this normative function has to some extent been superseded by uni-dimensional models of curriculum and assessment, mainly Item Response Theory models (e.g. see De Ayala, 2009; Embretson & Reise, 2000; Masters, 1982; Rasch, 1980) and their associated continuum metaphor. In education systems where Item Response Theory models become prevalent, learning progressions are determined less by social expectation and more by instrumentally defined scale progression, so that curriculum comes to comprise ‘content that scales’ rather than content that meets social expectations. Once curriculum comprises ‘content that scales’, teachers’ participation in standard setting is no longer required: instead of socially defined educational standards, standards can be set by way of cut-points, cut-scores and bands instrumentally and arbitrarily defined through the application of Item Response Theory-based algorithms.

My thesis will argue that this phenomenon can lead to various outcomes, including 1) the alienation of teachers’ work, 2) curriculum and assessment that do not address social expectations, 3) students who are alienated from society and not fully socialised, and 4) a general loss of social capital across the system. It can also be seen as very efficient and cost-saving, as it does not require expensive teacher engagement.

Post-conventional or post-traditional nature of education

The need to develop new educational norms and expectations at a time of rapid development in digital technology presents another issue for teacher engagement. Beavis (2010, p. 26) articulates this well when she states that factors such as cultural heritage and identity are at play not only for the student and teacher but also for the subject itself. The moral reasoning required of teachers is therefore far greater at a time when the system capacity of teachers has been greatly diminished through cutbacks and the like. This leaves a vacated landscape that the private sector (e.g. Ultranet; see Bajkowski, 2013) or other consortia (e.g. 21st Century Skills; see Griffin, McGaw, & Care, 2012) can seek to fill.


Not all contemporary assessments are grounded in mathematical models. The Victorian Certificate of Education (VCE), for example, is a curriculum and assessment regime that is firmly socially grounded. The VCE study designs reflect the social, cultural and economic activity of Victoria, and Victorian teachers are actively involved in their design and implementation, including exam setting and marking. The VCE also uses routine statistical techniques (standardisation and normalisation) to create a single score, and then an ATAR, that students can use as currency in the future job and education markets in Victoria and beyond. These features make the VCE a highly regarded qualification, but its significant social buy-in will make it difficult to adapt to technology-based approaches, although this can be overcome with good management, good planning and sufficient resources for stakeholder engagement.

There is also some hope in the constellation metaphor and in the use of Bayesian techniques to develop more comprehensive curriculum and assessment (e.g. Almond, Mislevy, Steinberg, Yan, & Williamson, 2015). However, establishing good Bayesian belief networks also requires extensive participation by experienced teachers, so the danger of the constellation metaphor is that instead of relying on teachers’ input, these networks will be based on trawling through learning-analytics data. Should this occur, my thesis is that it too would lead to alienating circumstances for teachers and students.
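To make the Bayesian idea concrete, here is a toy belief update for a single binary skill node given one observed item response. Real networks of the kind Almond et al. describe chain many such nodes together, and every probability below is invented for illustration; the teacher input at issue is precisely where numbers like these come from.

```python
# Toy Bayesian update: belief that a student holds a skill,
# revised after one correct answer. All probabilities are illustrative.

p_skill = 0.5                # prior belief the student has the skill
p_correct_given_skill = 0.9  # item is usually answered correctly with the skill
p_correct_given_no = 0.2     # chance of a correct answer by guessing

# Student answers correctly: apply Bayes' rule.
numerator = p_correct_given_skill * p_skill
denominator = numerator + p_correct_given_no * (1 - p_skill)
posterior = numerator / denominator

assert posterior > p_skill   # belief in the skill rises after a correct answer
```

The conditional probabilities (0.9, 0.2) are exactly the kind of quantities that can either be elicited from experienced teachers or, as warned above, mined from learning-analytics data.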

My thesis will develop the view that sophisticated and socially cohesive education systems have a sufficient base of morally competent teachers involved in the setting of curriculum and assessment, where the judgement of these teachers is informed and supported by sophisticated data systems (constellation and continuum). Of course, this could potentially bifurcate the other way, with teachers and students becoming increasingly alienated by technocratic systems.

Almond, R. G., Mislevy, R. J., Steinberg, L., Yan, D., & Williamson, D. (2015). Bayesian Networks in Educational Assessment. New York: Springer.

De Ayala, R. J. (2009). The Theory and Practice of Item Response Theory. New York: Guilford Press.

Bajkowski, B. J. (2013, March). Vic Auditor fails Ultranet. News Review.

Beavis, C. A. (2010). English in the Digital Age: Making English Digital. English in Australia, 45(2), 21–30. Retrieved from http://www98.griffith.edu.au/dspace/handle/10072/37149

Embretson, S. E., & Reise, S. P. (2000). Item Response Theory for Psychologists. L. Erlbaum Associates.

Griffin, P., McGaw, B., & Care, E. (Eds.). (2012). Assessment and Teaching of 21st Century Skills. Dordrecht: Springer. doi:10.1007/978-94-007-2324-5

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174. doi:10.1007/BF02296272

Rasch, G. (1980). Probabilistic Models for Some Intelligence and Attainment Tests. Chicago: MESA Press.

Constellation or Continuum – metaphors for assessment

In this post I want to lay the groundwork for a major shift in assessment methodology that education will experience in the coming decades. I will do so by discussing educational objectives, heuristic metaphors, and mathematical models.

To be clear, what we are talking about here are mathematical models and how they implement metaphors, or ways of verbally reasoning, about educational objectives. These models inform how we think about and organise content, including assessment, at the system level. While this post will remain agnostic on the science of how the brain works, these models nevertheless inform how we approach students and organise schooling.

After discussing two metaphors, this post will discuss potential issues in their use with a view to informing teacher participation in a broader debate. While it may be unreasonable to expect teachers to understand the mathematics, it is reasonable to expect teachers to engage at the metaphorical and verbal levels.

Constellation or Continuum

The constellation and continuum metaphors have long and evolved histories in the academic and published literature. I will discuss these metaphors in terms of their main exponents and uses.

The continuum metaphor, or ruler metaphor, is the one most Australians would be familiar with or have experienced. It is the metaphor used by both NAPLAN and PISA as part of system evaluation, and it is therefore also used in many derivative studies and by those who wish to align themselves with these methodologies. Australia has many world-leading exponents of the continuum metaphor, with Geoff Masters the best known through his development of the Partial Credit Model (Masters, 1982), an extension of the earlier Rasch Model (Rasch, 1980). The mathematical models associated with this metaphor are generally called Rasch Models or Item Response Theory (e.g. see De Ayala, 2009; Embretson & Reise, 2000) and are often described in terms of improvements to Classical Test Theory.
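For readers unfamiliar with the mathematics, the core of the Rasch model is a single logistic function of the gap between a person’s ability and an item’s difficulty, both expressed on the same continuum. A minimal sketch (the ability and difficulty values are illustrative only):

```python
import math

def rasch_p(theta, b):
    """Rasch model: probability that a person of ability theta
    answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# An able student on an easy item: success is very likely.
assert rasch_p(2.0, -1.0) > 0.9
# When ability exactly matches difficulty, the probability is 0.5 --
# this is the sense in which persons and items share one ruler.
assert abs(rasch_p(0.0, 0.0) - 0.5) < 1e-12
```

That single shared scale for persons and items is what makes the ruler metaphor so natural, and also what commits the model to uni-dimensionality.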

The constellation metaphor is not so well known in large-scale assessment. A well-known exponent is Robert Mislevy who, while remaining pluralistic, opened up the field through his work with others on Evidence Centred Design (ECD) (Almond, Mislevy, Steinberg, Yan, & Williamson, 2015; Mislevy, Steinberg, Almond, Haertel, & Penuel, 2003). The metaphor can also be associated with diagnostic or cognitive assessment (e.g. Leighton & Gierl, 2007, 2011; Rupp & Templin, 2008). The mathematical models associated with it include Bayesian Networks, Neural Networks and elaborations of Item Response Theory. The constellation metaphor is not as widely used because its models are more difficult to implement, although they are often used in post-hoc analysis of learning data.

A simple example

The profound differences between the two metaphors can be illustrated through a simple example. Below is a diagram showing a simple test of eight questions, which tests the four operations first with smaller numbers and then with larger numbers. Student A can do all operations, but not with larger numbers. Student B can only do addition and subtraction.


The key issue here is that each student has quite a different state of proficiency, yet the raw score cannot distinguish between the two response patterns, so the raw-score-based mathematical models used by the continuum metaphor cannot readily detect this type of difference. A deviant response pattern may be picked up in a misfit or bias analysis, but unless there is some additional treatment these two students will be reported identically.
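The point can be checked in a few lines. The item ordering below (the four operations with small numbers, then the same four with large numbers) is my assumption about the diagram:

```python
# Two hypothetical students on an 8-item test.
# Items 0-3: add, subtract, multiply, divide with small numbers.
# Items 4-7: the same four operations with large numbers.
student_a = [1, 1, 1, 1, 0, 0, 0, 0]  # all operations, small numbers only
student_b = [1, 1, 0, 0, 1, 1, 0, 0]  # addition and subtraction only, any size

raw_a, raw_b = sum(student_a), sum(student_b)
assert raw_a == raw_b == 4        # identical raw scores...
assert student_a != student_b     # ...despite different proficiency profiles
```

Any model for which the raw score is the sufficient statistic must assign these two students the same scale location, which is exactly the limitation the constellation metaphor avoids.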

The two ways of reporting these response patterns under each metaphor are illustrated below.


It is clear that differences between the two students are lost under the continuum metaphor, but are captured under the constellation metaphor.

My hypothesis is that Australia is captured by the continuum metaphor through the good fortune of having the leading Item Response Theorists in the world (Masters, Adams, Andrich, Wu, Wilson, etc.). It is this circumstance that has led to a neglect of the constellation metaphor, and of concern for what individual Australian students are able to do; a neglect that has contributed to a decline in overall student performance and to the paradoxical situation where Australia is well placed to measure its own decline. This is a hypothesis only, one that cannot be empirically proven but which can be reasoned about.

Furthermore, I also contend that the continuum metaphor, with its focus on measurement, comparability and comparisons, is sometimes mistaken for neoliberal forces. It is not really a conspiracy, just a by-product of some smart people working very effectively in the endeavour of their interest.


The constellation and continuum metaphors have corresponding metaphors for how we talk about teaching. Related to the constellation metaphor are ‘who a student is’, ‘collection of knowledge’, ‘learning as growth’ and ‘depth and relation’. Related to the continuum metaphor are ‘where a student is’, ‘uni-dimensionality’, ‘teacher as conduit’ and ‘learning as filling an empty vessel’.

A particularly effective use of the continuum metaphor is as a system evaluation tool, which is why it is used in PISA, NAPLAN and TIMSS. As a system evaluation metaphor it is also very effective at detecting system biases, and it therefore served both the accountability and civil rights movements in the United States during the last century (see Gordon, 2013), which in part has led to the dominance of the metaphor today.

What is clear from the example above is that the continuum metaphor, and by extension NAPLAN, is a poor diagnostic device, able to provide little information about the student or about what to teach next, other than a vague location of where a student may be in relation to other students.

While the constellation metaphor is better at providing diagnostic information to teachers, assessments of this sort are also much more difficult to manage and implement, and have therefore not been implemented at scale. Instead, the constellation metaphor is increasingly used for post-hoc analysis and fishing exercises on causal relations in education, for example in learning analytics (e.g. Behrens & DiCerbo, 2014). For those who consider education a purposeful activity, this type of post-hoc meaning making may be of concern.

I trust this may help some; writing it has certainly helped clarify some of my thoughts.


Where both the constellation and continuum metaphors are driven by mathematical models, the determination of matters such as bands and cut-scores is largely arbitrary, settled by a choice of parameter. This contrasts with traditional standard-setting procedures based on the professional judgements of groups of teachers (e.g. see Cizek, 2012) or on holistic judgements in higher education (e.g. see Sadler, 2009). The metaphors can of course be used to support teacher judgement, and some methods in Cizek’s book recommend this.

Almond, R. G., Mislevy, R. J., Steinberg, L., Yan, D., & Williamson, D. (2015). Bayesian Networks in Educational Assessment. New York: Springer.

De Ayala, R. J. (2009). The Theory and Practice of Item Response Theory. New York: Guilford Press.

Behrens, J. T., & DiCerbo, K. E. (2014). Harnessing the currents of the digital ocean. In J. A. Larusson & B. White (Eds.), Learning Analytics: From Research to Practice (pp. 39–60). New York: Springer.

Cizek, G. J. (Ed.). (2012). Setting Performance Standards: Foundations, Methods, and Innovations. New York: Routledge.

Embretson, S. E., & Reise, S. P. (2000). Item Response Theory for Psychologists. L. Erlbaum Associates.

Gordon, E. W. (Ed.). (2013). To Assess, to Teach, to Learn: A Vision for the Future of Assessment : Technical Report. Retrieved from http://www.gordoncommission.org/rsc/pdfs/gordon_commission_technical_report.pdf

Leighton, J. P., & Gierl, M. J. (2007). Cognitive Diagnostic Assessment for Education: Theory and Applications. New York: Cambridge University Press.

Leighton, J. P., & Gierl, M. J. (2011). The Learning Sciences in Educational Assessment: The Role of Cognitive Models. Cambridge University Press.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174. doi:10.1007/BF02296272

Mislevy, R. J., Steinberg, L. S., Almond, R. G., Haertel, G. D., & Penuel, W. R. (2003). Leverage points for improving educational assessment (PADI technical report 2). Menlo Park: SRI International.

Rasch, G. (1980). Probabilistic Models for Some Intelligence and Attainment Tests. Chicago: MESA Press.

Rupp, A. A., & Templin, J. L. (2008). Unique Characteristics of Diagnostic Classification Models: A Comprehensive Review of the Current State-of-the-Art. Measurement: Interdisciplinary Research & Perspective, 6(4), 219–262. doi:10.1080/15366360802490866

Sadler, D. R. (2009). Indeterminacy in the use of preset criteria for assessment and grading. Assessment & Evaluation in Higher Education.