Commentary
Evaluation of a clinical test. I: Assessment of reliability

https://doi.org/10.1016/S0306-5456(00)00150-9

Introduction

Testing and screening are critical parts of the clinical process, since inappropriate testing strategies put patients at risk and entail a serious waste of resources [1,2]. Based on our recent experience of evaluating the diagnostic literature [3–7], we have come to believe that there is much misunderstanding about the evaluation of clinical tests. Some tests, introduced into practice without proper evaluation, are so inefficient as to be almost useless. In our view, the absence of clear methodological guidelines on the evaluation of clinical tests is a major impediment. Just as robust research methods for assessing the effectiveness of treatments have been actively pursued over the last decade, so attention needs to be focused on how research on diagnostic tests, and their impact on clinical practice, might be improved. Our commentary is prompted by the concern that there is a huge disparity between the number of clinical tests in use and the availability of robust research evidence to guide decisions about their most appropriate clinical application.

We must first ask why inefficiency in clinical testing leads to mismanagement of patients. The answer is simple. If a diagnosis is missed, early therapy cannot be undertaken, thereby prolonging morbidity. Conversely, if a diagnosis is made in the absence of disease, unnecessary therapy may be undertaken, with the attendant risk of adverse effects. But how does inefficiency in clinical testing arise in the first place? We need to recognise that the results of our tests are the outcomes of clinical measurements, and it is the errors in these measurements that lead to inefficiency in clinical testing.

Errors in clinical measurements [8–10] are of two sorts. First, a measurement may be inconsistent: the same attribute recorded by another observer (or recorded a second time by the same observer) leads to a different reading. The term reliability refers to this type of measurement error. Second, the measurement obtained may not be accurate when compared with the ‘true’ state of the attribute, as estimated by a suitable reference standard. This type of measurement error is referred to as validity. The goal of research is to determine whether a clinical test measures what is intended (validity), but it should first be established that the test measures something in a consistent fashion (reliability).

Based on these two types of errors in clinical measurement, our commentary is divided into two parts. In the first part, the focus is on appropriate strategies for conducting and analysing studies of the reliability of a clinical test. In the second part, strategies for conducting and analysing studies of the validity of a clinical test will be described.

Section snippets

Design of a study of reliability

Reliability studies are generally reported in the literature as observer variability studies. Such a study is designed to compare measurements obtained by two or more observers (inter-rater reliability) or by one observer on two or more occasions (intra-rater reliability). Intra-rater reliability is a prerequisite for inter-rater reliability [8]. We will restrict our description to inter-rater reliability. The objective of the study is to measure independently the same clinical attribute

Data analysis of a study of reliability

Table 1 shows the different types of measurement encountered in clinical practice, with some examples: nominal (dichotomous); ordinal (ranked); and dimensional (continuous). The important point is that in studies of the reliability of a clinical test, the measurements recorded by the two observers should be expressed on the same type of scale, and with the same number of categories if the data are ordinal. It is important to remember that the purpose of a study of reliability is to determine

Nominal scale

When dealing with dichotomous data (for example, the presence or absence of hypertension), many researchers will report the percentage agreement as the index of reliability. From the hypothetical example in Table 2, the percentage agreement between the two midwives recording whether pregnant women are hypertensive or normotensive is 91.3%, a statistic that looks impressive because of its closeness to 100% (the value depicting perfect agreement). However, this statistic does not take into
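To make the chance-correction concrete, here is a minimal Python sketch computing percentage agreement and Cohen's kappa for a 2×2 table of two raters. The cell counts below are hypothetical, chosen only so that the overall agreement reproduces the 91.3% of the example; they are not the actual counts of Table 2.

```python
# Cohen's kappa for two raters on a dichotomous (nominal) scale.
# Hypothetical 2x2 counts (rows = midwife 1, columns = midwife 2),
# chosen to give 91.3% observed agreement; NOT the actual Table 2 data.
a, b = 2, 2     # both say hypertensive / midwife 1 only
c, d = 2, 40    # midwife 2 only / both say normotensive
n = a + b + c + d

po = (a + d) / n                                      # observed agreement
pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # chance-expected agreement
kappa = (po - pe) / (1 - pe)

print(f"observed agreement  = {po:.1%}")    # 91.3%
print(f"chance-expected     = {pe:.1%}")    # 84.1%
print(f"kappa               = {kappa:.2f}") # 0.45
```

With these counts the crude 91.3% agreement corresponds to a kappa of only about 0.45 once the 84% agreement expected by chance alone is discounted, illustrating the "high agreement but low kappa" phenomenon discussed by Feinstein et al. [reference list].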

Ordinal scale

Here again percentage agreement is commonly reported in the literature, but simple percentage agreement is best avoided since it does not take into account any chance-expected agreement. If the two midwives in Table 2 were asked to classify pregnant women into four ordered categories of blood pressure (normal blood pressure, mild hypertension, moderate hypertension, severe hypertension), it is obvious that there are various possible levels of disagreement. The discrepancy between normotensive
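A standard way to credit partial agreement on ordered categories is the weighted kappa, which penalises near-misses less than gross disagreements. A linearly weighted variant can be sketched in a few lines of Python; the 4×4 table below is hypothetical (rows = midwife 1, columns = midwife 2, categories ordered normal → severe), not data from this commentary.

```python
def weighted_kappa(table):
    """Linearly weighted kappa for two raters over k ordered categories.

    table[i][j] counts subjects placed in category i by rater 1 and
    category j by rater 2. The disagreement weight grows with the
    distance |i - j| between categories, so adjacent-category
    disagreements are penalised less than extreme ones.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    obs = exp = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)                    # linear weight
            obs += w * table[i][j] / n                  # observed disagreement
            exp += w * row_tot[i] * col_tot[j] / n**2   # chance-expected
    return 1 - obs / exp

# Hypothetical counts: normal, mild, moderate, severe (rows/cols in order)
table = [[30, 4, 1, 0],
         [ 3, 8, 2, 0],
         [ 0, 2, 5, 1],
         [ 0, 0, 1, 3]]
print(round(weighted_kappa(table), 2))
```

Perfect agreement (all counts on the diagonal) yields a weighted kappa of exactly 1.0; quadratic weights, which square the distance term, are another common choice.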

Dimensional scale

Pearson's correlation coefficient between the measurements obtained by the two observers has been a popular index of the reliability of clinical tests on a continuous scale [5,14]. However, Pearson's correlation coefficient measures the association between two sets of measurements, not their agreement [8,19]. Fig. 1 represents two sets of measurements obtained by two observers, A and B. Line 1 shows perfect association, the correlation coefficient being 1.0, and also perfect
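The distinction between association and agreement can be shown numerically: two observers whose readings differ by a constant offset are perfectly correlated yet never agree. The blood pressure figures below are invented for illustration; the mean difference (bias) computed at the end is the quantity a Bland–Altman style analysis would examine.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical systolic blood pressures (mmHg) from two observers:
# observer B consistently reads 10 mmHg higher than observer A.
obs_a = [100, 105, 112, 120, 128, 135]
obs_b = [v + 10 for v in obs_a]

r = pearson_r(obs_a, obs_b)
bias = sum(b - a for a, b in zip(obs_a, obs_b)) / len(obs_a)
print(f"Pearson r       = {r:.3f}")      # perfect association
print(f"mean difference = {bias} mmHg")  # yet systematic disagreement
```

Despite r = 1, the two observers never record the same value. Indices of agreement for continuous data, such as Bland and Altman's limits of agreement or the intraclass correlation coefficient, penalise this systematic bias, whereas Pearson's r does not.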


References (22)

  • K.S. Khan et al. Misleading authors’ inferences in obstetric diagnostic test literature. Am J Obstet Gynecol (1999)
  • W.G. Thompson et al. A reappraisal of the kappa coefficient. J Clin Epidemiol (1988)
  • A.R. Feinstein et al. High agreement but low kappa: I. The problem of two paradoxes. J Clin Epidemiol (1990)
  • L.M. Koran. The reliability of clinical methods, data and judgements [two parts]. N Engl J Med (1975)
  • Clinical disagreement: I. How often it occurs and why. Can Med Assoc J (1980)
  • C.R. Nwosu et al. Is real-time ultrasonic bladder volume estimation reliable and valid? A systematic overview. Scand J Urol Nephrol (1998)
  • K.S. Khan et al. Evaluating the measurement variability of clinical investigations: the case of ultrasonic estimation of urinary bladder volume. Br J Obstet Gynaecol (1997)
  • P.F.W. Chien et al. The diagnostic accuracy of cervico-vaginal fetal fibronectin in predicting preterm delivery: an overview. Br J Obstet Gynaecol (1997)
  • P.F.W. Chien et al. How useful is uterine artery Doppler flow velocimetry in the prediction of pre-eclampsia, intrauterine growth retardation and perinatal death? An overview. Br J Obstet Gynaecol (2000)
  • D.L. Streiner et al. Health Measurement Scales: A Practical Guide to Their Development and Use (1995)
  • G. Dunn et al. Clinical Biostatistics. An Introduction to Evidence-Based Medicine (1995)