Principles and practical application of the receiver-operating characteristic analysis for diagnostic tests

doi:10.1016/S0167-5877(00)00115-X

Preventive Veterinary Medicine

Volume 45, Issues 1–2, 30 May 2000, Pages 23-41

Abstract

We review the principles and practical application of receiver-operating characteristic (ROC) analysis for diagnostic tests. ROC analysis can be used for diagnostic tests with outcomes measured on ordinal, interval or ratio scales. The dependence of the diagnostic sensitivity and specificity on the selected cut-off value must be considered for a full test evaluation and for test comparison. All possible combinations of sensitivity and specificity that can be achieved by changing the test’s cut-off value can be summarised using a single parameter: the area under the ROC curve. The ROC technique can also be used to optimise cut-off values with regard to a given prevalence in the target population and cost ratio of false-positive and false-negative results. However, plots of optimisation parameters against the selected cut-off value provide a more direct method for cut-off selection. Candidates for such optimisation parameters are linear combinations of sensitivity and specificity (with weights selected to reflect the decision-making situation), odds ratio, chance-corrected measures of association (e.g. kappa) and likelihood ratios. We discuss some recent developments in ROC analysis, including meta-analysis of diagnostic tests, correlated ROC curves (paired-sample design) and chance- and prevalence-corrected ROC curves.

Introduction

The crude results of most serodiagnostic tests are measured on ordinal (e.g. grading scheme or sample titration) or continuous (e.g. quantitative readings of single-dilution tests) scales. For all diagnostic tests (except those producing dichotomous outcomes) a value on the original scale is selected as a decision threshold (cut-off value) to define positive and negative test outcomes. Comparison of the dichotomised test results against the true status of individuals (as determined by a reference or “gold standard” test) allows estimation of the diagnostic sensitivity (Se, probability of a positive test outcome in a diseased individual) and specificity (Sp, probability of a negative test outcome in a non-diseased individual) (see Greiner and Gardner, 2000). It is well recognised that Se and Sp are inversely related depending on the choice of cut-off value. When increasing values of a measurement are associated with disease, higher (lower) cut-off values are generally associated with lower (higher) Se and a higher (lower) Sp. This relationship has two important implications. First, we would like to select a cut-off value such that the desired operating characteristics (Se, Sp) are achieved. Second, we realise that Se and Sp at a single cut-off value do not describe the test’s performance at other potential cut-off values. The latter also implies that the effect of the selected cut-off value should be taken into account when comparing diagnostic tests. These problems are addressed by the receiver-operating characteristic (ROC) analysis and its derivatives.
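The inverse relation between Se and Sp can be illustrated with a minimal Python sketch. The readings below are invented for illustration and are not the Trypanosoma ELISA data of this review; the function name `se_sp_at_cutoff` and the convention that a result at or above the cut-off counts as positive are assumptions of the sketch.

```python
from typing import Sequence, Tuple


def se_sp_at_cutoff(
    diseased: Sequence[float],
    non_diseased: Sequence[float],
    cutoff: float,
) -> Tuple[float, float]:
    """Return (Se, Sp) when results >= cutoff are called positive.

    Assumes that increasing values are associated with disease and that
    the true status of every animal is known from a reference test.
    """
    true_pos = sum(1 for x in diseased if x >= cutoff)
    true_neg = sum(1 for x in non_diseased if x < cutoff)
    return true_pos / len(diseased), true_neg / len(non_diseased)


# Hypothetical test readings (arbitrary units), NOT data from the study.
diseased = [0.9, 1.4, 2.1, 2.8, 3.5]
non_diseased = [0.2, 0.5, 0.8, 1.1, 1.6]

for cutoff in (0.5, 1.0, 1.5, 2.0):
    se, sp = se_sp_at_cutoff(diseased, non_diseased, cutoff)
    print(f"cut-off {cutoff:.1f}: Se = {se:.2f}, Sp = {sp:.2f}")
```

With these made-up readings, raising the cut-off from 0.5 to 2.0 lowers Se from 1.0 to 0.6 while Sp rises from 0.2 to 1.0, which is the trade-off described above.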

The ROC methodology was developed in the early 1950s for the analysis of signal detection in technical sciences and was first used in medicine in the late 1960s for the assessment of imaging devices (reviewed by Zweig and Campbell, 1993). ROC analysis has been increasingly used for the evaluation of clinical laboratory tests (Metz, 1978; Henderson, 1993; Schulzer, 1994; Smith, 1995). However, Henderson and Bhayana (1995) reported a lack of consistency with respect to the presentation of ROC analyses. The use of ROC analysis is still limited in the medical and veterinary literature. A systematic review of evaluation (validation) studies of serodiagnostic tests published in 12 biomedical journals in 1995 revealed that ROC analysis had been used in only 3 of 65 medical studies and 1 of 33 veterinary studies (Greiner and Wind, unpublished).

We review practically relevant features of ROC curves and related approaches with emphasis on cut-off selection and test comparison. Data obtained by enzyme-linked immunosorbent assays (ELISAs) for the detection of Trypanosoma antibodies will be used as an example. The presentation will refer to continuous ELISA data because this test format is often used for seroepidemiologic applications. The principles, however, apply also to continuous and ordinal diagnostic tests in general. Finally, we describe some extensions of classical ROC-analysis methodology. In the following examples, increasing values of a test result are associated with increasing likelihood of disease.

Section snippets

Example data

We use a random subset of data from a validation study of antibody ELISAs for the detection of Trypanosoma antibodies in bovine serum. In this study, a negative control group was sampled from non-exposed (Germany) and from exposed (parasitologically non-infected cattle from a tsetse-infested area in Uganda) cattle populations. The positive control group was sampled from the exposed (parasitologically confirmed) population (Greiner et al., 1997). Test antigen derived from blood-stream form

Basic principles of ROC curves

The underlying assumption of ROC analysis is that a diagnostic variable (e.g. ELISA values) is used to discriminate between two mutually exclusive states of tested animals. During the following discussion, we consider the true disease status (denoted D+ and D− for diseased and non-diseased animals, respectively) but note that various other conditions such as infected/non-infected and protected/non-protected established using an appropriate reference method could also be the aim of diagnostic

Recent developments

Confidence bands for ROC curves are needed for inferences from a visual comparison of curves for two or more tests. Methods based on the Greenhouse–Mantel test (Schäfer, 1994), Kolmogorov–Smirnov test and bootstrapping (Campbell, 1994) have been suggested for construction of confidence bands. Confidence intervals for the AUC for diagnostic systems that involve multiple tests were developed by Reiser and Faraggi (1997).
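As a rough illustration of the resampling idea, the following Python sketch computes a percentile bootstrap interval for the AUC. It is not the confidence-band construction of Schäfer (1994) or Campbell (1994), nor the interval of Reiser and Faraggi (1997); it is a simpler stand-in under stated assumptions. The AUC is estimated as the proportion of (diseased, non-diseased) pairs in which the diseased animal has the higher reading, with ties counted as one half. The data, the function names `auc` and `bootstrap_auc_ci`, and the choice to resample each group separately are assumptions of the sketch.

```python
import random
from typing import Sequence, Tuple


def auc(diseased: Sequence[float], non_diseased: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney type) AUC estimate; ties count one half."""
    wins = 0.0
    for d in diseased:
        for n in non_diseased:
            if d > n:
                wins += 1.0
            elif d == n:
                wins += 0.5
    return wins / (len(diseased) * len(non_diseased))


def bootstrap_auc_ci(
    diseased: Sequence[float],
    non_diseased: Sequence[float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the AUC.

    Each group is resampled with replacement at its original size
    (an assumption that mimics a design with fixed group sizes).
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        d = rng.choices(diseased, k=len(diseased))
        n = rng.choices(non_diseased, k=len(non_diseased))
        stats.append(auc(d, n))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical test readings (arbitrary units), NOT data from the study.
diseased = [0.9, 1.4, 2.1, 2.8, 3.5]
non_diseased = [0.2, 0.5, 0.8, 1.1, 1.6]

point = auc(diseased, non_diseased)
lo, hi = bootstrap_auc_ci(diseased, non_diseased)
print(f"AUC = {point:.2f}, bootstrap 95% interval = ({lo:.2f}, {hi:.2f})")
```

With only five animals per group the interval is wide and should not be over-interpreted; the sketch is meant to show the mechanics of resampling, not to recommend a sample size.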

Another topic of current methodological research is the analysis of

Software for ROC analysis

Software for ROC analysis is available in various formats including commercial, shareware or stand-alone products, statistical-program packages with built-in or user-defined ROC modules, and spreadsheet calculation macros. Some available programmes are listed in Table 3. However, the list is not comprehensive and we have not compared the relative advantages of the listed programmes. Some features (based on our experience and information provided by the producers) are listed as a guide. A

Conclusions

ROC analysis visualises the cut-off-dependency of ordinal or continuous diagnostic tests and provides an estimate of the accuracy that is independent of specific cut-off values and prevalence. ROC curves allow a comparison between different diagnostic tests. In addition, the curve provides information which will enable the diagnostician to optimise use of a test through targeted selection of cut-off values for particular diagnostic strategies.
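Targeted cut-off selection can be sketched in Python as follows. The cost term used here, prevalence × (1 − Se) × cost of a false negative plus (1 − prevalence) × (1 − Sp) × cost of a false positive, is one common linear combination of Se and Sp and is an assumption of the sketch rather than a formula quoted from this review. The data and the names `expected_cost` and `best_cutoff` are likewise hypothetical.

```python
from typing import Sequence, Tuple


def expected_cost(
    se: float, sp: float, prevalence: float, cost_fn: float, cost_fp: float
) -> float:
    """Expected misclassification cost per tested animal (assumed form)."""
    return prevalence * (1 - se) * cost_fn + (1 - prevalence) * (1 - sp) * cost_fp


def best_cutoff(
    diseased: Sequence[float],
    non_diseased: Sequence[float],
    prevalence: float,
    cost_fn: float = 1.0,
    cost_fp: float = 1.0,
) -> Tuple[float, float]:
    """Return (cut-off, cost) minimising the expected cost.

    Every observed value is tried as a cut-off; results >= cut-off are
    called positive. The first minimum found is kept if costs tie.
    """
    best = None
    for c in sorted(set(diseased) | set(non_diseased)):
        se = sum(1 for x in diseased if x >= c) / len(diseased)
        sp = sum(1 for x in non_diseased if x < c) / len(non_diseased)
        cost = expected_cost(se, sp, prevalence, cost_fn, cost_fp)
        if best is None or cost < best[1]:
            best = (c, cost)
    return best


# Hypothetical test readings (arbitrary units), NOT data from the study.
diseased = [0.9, 1.4, 2.1, 2.8, 3.5]
non_diseased = [0.2, 0.5, 0.8, 1.1, 1.6]

# Low prevalence, equal costs: false positives dominate the cost.
print(best_cutoff(diseased, non_diseased, prevalence=0.1))
# Same prevalence, false negatives assumed 20 times as costly.
print(best_cutoff(diseased, non_diseased, prevalence=0.1, cost_fn=20.0))
```

With these invented readings, the first scenario selects a cut-off of 2.1 (Sp = 1.0) and the second a cut-off of 0.9 (Se = 1.0), showing how prevalence and the cost ratio pull the selected cut-off in opposite directions.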

References (54)

D Bamber. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J. Math. Psychol. (1975)
J.A Barajas-Rojas et al. Notes about determining the cut-off value in enzyme-linked immunosorbent assay (ELISA). Prev. Vet. Med. (1993)
J Detilleux et al. Methods for estimating areas under receiver-operating characteristic curves: illustration with somatic-cell scores in subclinical intramammary infections. Prev. Vet. Med. (1999)
I.A Gardner et al. Conditional dependence between tests affects the diagnosis and surveillance of animal diseases. Prev. Vet. Med. (2000)
M Greiner. Two-graph receiver operating characteristic (TG-ROC): update version supports optimisation of cut-off values that minimise overall misclassification costs. J. Immunol. Methods (1996)
M Greiner et al. Epidemiologic issues in the validation of veterinary diagnostic tests. Prev. Vet. Med. (2000)
M Greiner et al. A modified ROC analysis for the selection of cut-off values and the definition of intermediate results of serodiagnostic tests. J. Immunol. Methods (1995)
M Greiner et al. Evaluation and comparison of antibody ELISAs for serodiagnosis of bovine trypanosomosis. Vet. Parasitol. (1997)
T.E Hanson et al. Log-linear and logistic modeling of dependence among diagnostic tests. Prev. Vet. Med. (2000)
R Holle et al. Is there a gain from chance-corrected measures of diagnostic validity? J. Clin. Epidemiol. (1997)
L Irwig et al. Meta-analytic methods for diagnostic test accuracy. J. Clin. Epidemiol. (1995)
C.E Metz. Basic principles of ROC analysis. Semin. Nucl. Med. (1978)
D.L Simel et al. Likelihood ratios for continuous test results – making the clinician’s job easier or harder? J. Clin. Epidemiol. (1993)
E.J Sondik. Clinical evaluation of test strategies. A decision analysis of parameter estimation. Clin. Lab. Med. (1982)
S Vida. A computer program for non-parametric receiver operating characteristic analysis. Comput. Methods Programs Biomed. (1993)
A Albert. On the use of likelihood ratios in clinical chemistry. Clin. Chem. (1982)
Anderson, J.A., 1982. Logistic regression. In: Krishnaiah, P.R., Kanal, L.N. (Eds.), Handbook of Statistics....
C.A Beam. Analysis of clustered data in receiver operating characteristic studies. Stat. Meth. Med. Res. (1998)
C.A Beam et al. A statistical method for the comparison of a discrete diagnostic test with several continuous diagnostic tests. Biometrics (1991)
B.M Bennett. On comparisons of sensitivity, specificity, and predictive value of a number of diagnostic procedures. Biometrics (1972)
G Campbell. Advances in statistical methodology for the evaluation of diagnostic and laboratory tests. Stat. Med. (1994)
B.C.K Choi. Slopes of a receiver operating characteristic curve and likelihood ratios for a diagnostic test. Am. J. Epidemiol. (1998)
D.D Dorfman et al. Maximum likelihood estimation of parameters of signal detection theory – a direct solution. Psychometrika (1968)
O Gefeller et al. How to correct for chance agreement in the estimation of sensitivity and specificity of diagnostic tests. Methods Inf. Med. (1994)
J.A Hanley. The robustness of the “binormal” assumptions used in fitting ROC curves. Med. Decis. Mak. (1988)
J.A Hanley et al. The meaning and use of the area under a receiver operating characteristic curve. Radiology (1982)
J.A Hanley et al. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology (1983)
