Original Article
Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were proposed

https://doi.org/10.1016/j.jclinepi.2010.03.002

Abstract

Objective

Results of reliability and agreement studies are intended to provide information about the amount of error inherent in any diagnosis, score, or measurement. The level of reliability and agreement among users of scales, instruments, or classifications is largely unknown. Therefore, there is a need for rigorously conducted interrater and intrarater reliability and agreement studies. Information about sample selection, study design, and statistical analysis is often incomplete. Because of inadequate reporting, interpretation and synthesis of study results are often difficult. Widely accepted criteria, standards, or guidelines for reporting reliability and agreement in the health care and medical field are lacking. The objective was to develop guidelines for reporting reliability and agreement studies.

Study Design and Setting

Eight experts in reliability and agreement investigation developed guidelines for reporting.

Results

We propose 15 issues that should be addressed when reliability and agreement are reported. The issues correspond to the headings usually used in publications.

Conclusion

The proposed guidelines are intended to improve the quality of reporting.


Background

What is new?

Key findings

  1. Reporting of interrater/intrarater reliability and agreement is often incomplete and inadequate.

  2. Widely accepted criteria, standards, or guidelines for reliability and agreement reporting in the health care and medical fields are lacking.

  3. We propose 15 issues that should be addressed when reliability and agreement are reported.

What this adds to what is known
  1. There is a need for rigorous interrater and intrarater reliability and agreement studies to be performed in the future.

  2. Systematic reviews and meta-analyses of study results are often hampered by incomplete and inadequate reporting.

Project

In the absence of standards for reporting reliability and agreement studies in the medical field, we developed the idea that formal guidelines might be useful for researchers, authors, reviewers, and journal editors. The lead author initially contacted 13 experts in reliability and agreement investigation and asked whether they saw a need for such guidelines and whether they wished to take part in this project. The experts were informally identified based on their substantial contributions to research on reliability and agreement.

Guidelines

The Guidelines for Reporting Reliability and Agreement Studies (GRRAS) are shown in Table 1. They contain issues that should be addressed when reliability and agreement are investigated. The underlying rationale, arguments, or empirical data supporting each item are given below. The proposed issues correspond to the headings and order usually used in publications. The items aim to cover a broad range of clinical test scores, classifications, or diagnoses. However, some items are only partly applicable to particular study designs or rating procedures.
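As a brief orientation to the kind of coefficient such reports typically contain (an illustration on our part, not a GRRAS requirement), chance-corrected agreement between two raters who classify subjects into nominal categories is often summarized with Cohen's kappa,

    kappa = (Po - Pe) / (1 - Pe),

where Po is the observed proportion of agreement and Pe is the proportion of agreement expected by chance; for continuous scores, intraclass correlation coefficients are commonly reported instead.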

Discussion

The level of reliability and agreement among users of scales, instruments, or classifications in many different areas is largely unknown [15], [16], [18], [100], [110], [111], [112]. Therefore, there is a clear need for rigorous interrater and intrarater reliability and agreement studies to be performed in the future. Studies are also needed to investigate reliability in clinical practice [16], [25], [36], [43]. We hope that the guidelines will help to improve the quality of reporting.

To our knowledge, no comparable guidelines for reporting reliability and agreement studies in the health care and medical fields have been proposed before.

Limitations

We chose a pragmatic approach in developing the guidelines. Eight experts participated, and they were blinded to each other in the first round only. It is commonly assumed that Delphi methods are more reliable because the group interaction is indirect and more people can be involved [115]. Furthermore, no single expert with a strong opinion and ego can override the opinions of the other experts. However, consensus achieved by Delphi methods also depends heavily on the participating experts.

Conclusions

Interrater and intrarater reliability and agreement examinations are needed to estimate the amount of error in the rating or scoring of tests and classification procedures. We have proposed a set of general guidelines for reporting reliability and agreement studies. The guidelines are broadly useful and applicable to the vast majority of diagnostic issues. We believe that this first draft can be improved upon and updated in the future, and we welcome any comments or suggestions from readers and users of these guidelines.
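For readers who want a concrete sense of the quantities such studies report, the following minimal sketch (our illustration with hypothetical ratings, not part of GRRAS) computes percent agreement and Cohen's kappa for two raters classifying the same ten subjects:

    # Illustrative sketch only: percent agreement and Cohen's kappa for two raters
    # assigning the same subjects to nominal categories. Ratings are hypothetical.
    from collections import Counter

    def percent_agreement(rater_a, rater_b):
        # Proportion of subjects on which both raters assign the same category.
        return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

    def cohens_kappa(rater_a, rater_b):
        # Chance-corrected agreement: (Po - Pe) / (1 - Pe).
        n = len(rater_a)
        p_o = percent_agreement(rater_a, rater_b)
        freq_a, freq_b = Counter(rater_a), Counter(rater_b)
        # Pe: probability that two independent raters choose the same category by chance.
        p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(rater_a) | set(rater_b))
        return (p_o - p_e) / (1 - p_e)

    rater_a = ["present", "absent", "present", "present", "absent",
               "absent", "present", "absent", "present", "present"]
    rater_b = ["present", "absent", "absent", "present", "absent",
               "absent", "present", "present", "present", "present"]
    print(f"Percent agreement: {percent_agreement(rater_a, rater_b):.2f}")  # 0.80
    print(f"Cohen's kappa:     {cohens_kappa(rater_a, rater_b):.2f}")       # 0.58

In this hypothetical example the raters agree on 8 of 10 subjects (percent agreement 0.80), while kappa is lower (about 0.58) because part of the observed agreement is expected by chance.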

References (118)

  • J.C. Hwang et al. Representation of ophthalmology concepts by electronic systems: intercoder agreement among physicians using controlled terminologies. Ophthalmology (2006)
  • T.M. Hall et al. Intertester reliability and diagnostic validity of the cervical flexion-rotation test. J Manipulative Physiol Ther (2008)
  • M. Zegers et al. The inter-rater agreement of retrospective assessments of adverse events does not improve with two reviewers per patient record. J Clin Epidemiol (2010)
  • D. Gould et al. Examining the validity of pressure ulcer risk assessment scales: a replication study. Int J Nurs Stud (2004)
  • J.D. Scinto et al. The case for comprehensive quality indicator reliability assessment. J Clin Epidemiol (2001)
  • F.A. McAlister et al. Why we need large, simple studies of clinical examination: the problem and a proposed solution. Lancet (1999)
  • D.O. Perkins et al. Penny-wise and pound-foolish: the impact of measurement error on sample size requirements in clinical trials. Biol Psychiatry (2000)
  • K.A. Kobak et al. A comparison of face-to-face and remote assessment of inter-rater reliability on the Hamilton Depression Rating Scale via videoconferencing. Psychiatry Res (2008)
  • W. Vach. The dependence of Cohen's kappa on the prevalence does not matter. J Clin Epidemiol (2005)
  • A. Donner et al. The statistical analysis of kappa statistics in multiple samples. J Clin Epidemiol (1996)
  • G. Dunn. Statistical evaluation of measurement errors: design and analysis of reliability studies (2004)
  • B.H. Mulsant et al. Interrater reliability in clinical trials of depressive disorders. Am J Psychiatry (2002)
  • M. Szklo et al. Epidemiology beyond the basics (2007)
  • D.F. Polit et al. Nursing research: generating and assessing evidence for nursing practice (2008)
  • M.M. Shoukri. Measures of interobserver agreement (2004)
  • D.L. Streiner et al. Health measurement scales: a practical guide to their development and use (2008)
  • K.L. Gwet. Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol (2008)
  • S.D.M. Bot et al. Clinimetric evaluation of shoulder disability questionnaires: a systematic review. Ann Rheum Dis (2004)
  • J. Kottner et al. Interpreting interrater reliability coefficients of the Braden scale: a discussion paper. Int J Nurs Stud (2008)
  • J. Kottner et al. A systematic review of interrater reliability of pressure ulcer classification systems. J Clin Nurs (2009)
  • C.R. Nwosu et al. Is real-time ultrasonic bladder volume estimation reliable and valid? A systematic overview. Scand J Urol Nephrol (1998)
  • N. Ratanawongsa et al. The reported validity and reliability of methods for evaluating continuing medical education: a systematic review. Acad Med (2008)
  • M.J. Stochkendahl et al. Manual examination of the spine: a systematic critical literature review of reproducibility. J Manipulative Physiol Ther (2006)
  • Y. Sun et al. Reliability and validity of clinical outcome measurements of osteoarthritis of the hip and knee: a review of the literature. Clin Rheumatol (1997)
  • G.H. Swingler. Observer variation in chest radiography of acute lower respiratory infections in children: a systematic review. BMC Med Imaging (2001)
  • L. Audigé et al. How reliable are reliability studies of fracture classifications? Acta Orthop Scand (2004)
  • L. Hestbaek et al. Are chiropractic tests for the lumbo-pelvic spine reliable and valid? A systematic review. J Manipulative Physiol Ther (2000)
  • E. Innes et al. Reliability of work-related assessments. Work (1999)
  • Standards for educational and psychological testing (1999)
  • E.M. Glaser. Using behavioral science strategies for defining the state-of-the-art. J Appl Behav Sci (1980)
  • A. Fink et al. Consensus methods: characteristics and guidelines for use. Am J Public Health (1984)
  • F. Buntinx et al. Inter-observer variation in the assessment of skin ulceration. J Wound Care (1996)
  • A. Vikström et al. Mapping the categories of the Swedish primary health care version of ICD-10 to SNOMED CT concepts: rule development and intercoder reliability in a mapping trial. BMC Med Inform Decis Making (2007)
  • R. Nanda et al. An assessment of the inter examiner reliability of clinical tests for subacromial impingement and rotator cuff integrity. Eur J Orthop Surg Traumatol (2008)
  • J. Peat et al. Medical statistics: a guide to data analysis and critical appraisal (2005)
  • D. Cicchetti et al. Rating scales, scales of measurement, issues of reliability. J Nerv Ment Dis (2006)
  • H.C. Kraemer. Ramifications of a population model for κ as a coefficient of reliability. Psychometrika (1979)
  • H.K. Suen. Agreement, reliability, accuracy, and validity: toward a clarification. Behav Assess (1988)
  • T. Slongo et al. (with the International Association for Pediatric Traumatology). Development and validation of the AO Pediatric Comprehensive Classification of Long Bone Fractures by the pediatric expert group of the AO Foundation in collaboration with AO Clinical Investigation and Documentation and the International Association for Pediatric Traumatology. J Pediatr Orthop (2006)
  • J.E. Barone et al. Should an Allen test be performed before radial artery cannulation? J Trauma (2006)