Carnegie Commons Blog

Examining Multi-Rater Teacher Observation Systems

Since 2009, more than 30 states and the District of Columbia have significantly altered policies governing teacher evaluation systems, with most introducing new, more complex procedures that seek to improve upon the cursory “drive-by” evaluations of the past. Increasingly, states and districts evaluate teachers using their students’ achievement, survey data from students and parents, and more intensive classroom observation systems and protocols.

But as these more complex evaluation systems roll out, there is mounting evidence that principals are overburdened by the new responsibilities they introduce. Many report being swamped by observation responsibilities and reporting duties. In Chicago, for example, two-thirds of principals reported that the observation requirements of the REACH evaluation system took up too much of their already overscheduled time during the first year of implementation.[1] Elsewhere, principals have confessed that they feel overwhelmed by expectations that they give meaningful, targeted feedback to teachers with vastly different strengths, weaknesses, and teaching assignments.

In response to these challenges, a growing number of districts have adopted multi-rater evaluation systems, in which multiple observers watch, assess, and respond to teachers’ practice. Raters in such systems might co-observe the same lesson or rate different lessons independently. And, in contrast with models from the past, raters need not all be principals; they may also be master teachers, peer evaluators, independently contracted observers, or central office staff.

Taken at face value, multi-rater systems seem like a straightforward way to distribute responsibility and inject some much-needed content or grade-level expertise into the scoring and feedback processes, while also potentially improving the technical quality of observation data, including its validity and reliability. According to reports from the MET project, “if a school district is going to pay the cost (both in money and time) to observe two lessons for each teacher it gets greater reliability when each lesson is observed by a different person.”[2]

Despite these potential benefits, however, multi-rater systems are relatively rare. And among those that do exist, there is considerable variation in design, each with its own set of challenges. Some systems, for example, require that multiple raters observe all teachers, while others require multiple raters only for a subset of teachers most in need of support (e.g., novice teachers). Another key design decision concerns the raters themselves: should additional raters come from within—existing administrators, central office staff, etc.—or should they be hired from outside the district to ensure independence? And how should they share responsibilities?

But while the specific features of multi-rater systems vary considerably from one district to another, the challenges districts face in implementing these more complex systems are more consistent. Indeed, in interviews with more than a dozen districts that require or encourage multiple raters, we heard the same challenges over and over again. Chief among them were rater reliability and the difficulty of ensuring consistent feedback across multiple observers. Ensuring that raters are properly trained and calibrated, both in their scoring and in the feedback they provide, is crucial to the success of these systems, but it can also be time-consuming and costly.

Likewise, recruiting, training, and compensating additional raters can be expensive for districts that employ cadres of non-administrative raters (e.g., Master Educators at DC Public Schools). And even a district that avoids significant extra costs (and some have) faces logistical challenges in scheduling observations, collecting reports from multiple raters, and coordinating the feedback those raters provide.

Still, every district reported that the benefits of its multi-rater system have been well worth the additional challenges faced during implementation. And many said they would not, and in some cases could not, go back to their previous single-rater models.

For more information about the districts using multiple raters, the systems they’ve adopted, and the challenges they’ve faced along the way, please read our latest Issue Brief, Adding Eyes: The Rise, Rewards, and Risks of Multi-Rater Teacher Observation Systems.


[1] Susan E. Sporte, W. David Stevens, Kaleen Healey, Jennie Jiang, and Holly Hart, “Teacher Evaluation in Practice: Implementing Chicago’s REACH Students,” September 2013, The University of Chicago Consortium on Chicago School Research, p. 25.

[2] Andrew D. Ho and Thomas J. Kane, “The Reliability of Classroom Observations by School Personnel,” 2013. Seattle, WA: The Bill & Melinda Gates Foundation.