Neil J. Perkins, Enrique F. Schisterman, The Inconsistency of “Optimal” Cutpoints Obtained using Two Criteria based on the Receiver Operating Characteristic Curve, American Journal of Epidemiology, Volume 163, Issue 7, 1 April 2006, Pages 670–675, https://doi.org/10.1093/aje/kwj063
Abstract
The use of biomarkers is of ever-increasing importance in clinical diagnosis of disease. In practice, a cutpoint is required for dichotomizing naturally continuous biomarker levels to distinguish persons at risk of disease from those who are not. Two methods commonly used for establishing the “optimal” cutpoint are the point on the receiver operating characteristic curve closest to (0,1) and the Youden index, J. Both have sound intuitive interpretations—the point closest to perfect differentiation and the point farthest from none, respectively—and are generalizable to weighted sensitivity and specificity. Under the same weighting of sensitivity and specificity, these two methods identify the same cutpoint as “optimal” in certain situations but different cutpoints in others. In this paper, the authors examine situations in which the two criteria agree or disagree and show that J is the only “optimal” cutpoint for given weighting with respect to overall misclassification rates. A data-driven example is used to clarify and demonstrate the magnitude of the differences. The authors also demonstrate a slight alteration in the (0,1) criterion that retains its intuitive meaning while resulting in consistent agreement with J. In conclusion, the authors urge that great care be taken when establishing a biomarker cutpoint for clinical use.
The proper diagnosis of disease and treatment administration is a task that requires a variety of tools. Through advancements in biology and laboratory methods, a multitude of biomarkers are available as clinical tools for such diagnosis. These biomarkers are usually measured on a continuous scale with overlapping levels for diseased and nondiseased persons. Cutpoints dichotomize biomarker levels, providing benchmarks that label people as diseased or not diseased on the basis of “positive” or “negative” test results. Biomarker levels of persons with known disease status are used to evaluate potential cutpoint choices and, hopefully, identify a cutpoint that is “optimal” under some criterion.
A receiver operating characteristic (ROC) curve is a plot of sensitivity against 1 minus specificity across all possible cutpoints. The ROC curve has become a useful tool for comparing the effectiveness of different biomarkers (1–3). This comparison takes place through summary measures such as the area under the curve (AUC) and the partial AUC, with higher values indicating greater diagnostic ability (1, 2, 4). A biomarker with an AUC of 1 differentiates perfectly between diseased persons (sensitivity = 1) and healthy persons (specificity = 1). An AUC of 0.5 means that, overall, there is a 50-50 chance that the biomarker will correctly identify diseased or healthy persons as such.
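As an illustrative sketch (not part of the original article), the empirical ROC curve and the Mann-Whitney form of the AUC can be computed directly from biomarker levels in two groups; the simulated normal samples below are hypothetical.

```python
import numpy as np

def empirical_roc(diseased, nondiseased):
    """Empirical (1 - specificity, sensitivity) pairs over candidate cutpoints."""
    cuts = np.sort(np.concatenate([diseased, nondiseased]))
    # A test is "positive" when the biomarker level exceeds the cutpoint.
    sens = np.array([(diseased > c).mean() for c in cuts])      # q(c)
    spec = np.array([(nondiseased <= c).mean() for c in cuts])  # p(c)
    return 1.0 - spec, sens

def auc(diseased, nondiseased):
    """Mann-Whitney form of the AUC: P(diseased level > nondiseased level)."""
    d = np.asarray(diseased)[:, None]
    h = np.asarray(nondiseased)[None, :]
    return (d > h).mean() + 0.5 * (d == h).mean()

rng = np.random.default_rng(0)
d = rng.normal(1.0, 1.0, 500)  # hypothetical diseased biomarker levels
h = rng.normal(0.0, 1.0, 500)  # hypothetical nondiseased levels
fpr, tpr = empirical_roc(d, h)
print(round(auc(d, h), 3))  # near 0.76 for these simulated groups
```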
Though useful for biomarker evaluation, these measures do not inherently lead to benchmark “optimal” cutpoints with which clinicians and other health-care professionals can differentiate between diseased and nondiseased persons. Several methods for identifying “optimal” cutpoints using sensitivity, specificity, and the ROC curve have been proposed and applied (4–8). Confidence intervals and corrections for measurement error are some of the supporting statistical developments accompanying cutpoint estimation (9). Applications of these techniques have been demonstrated in several fields, including nuclear cardiology, epidemiology, and genetics (7, 10, 11).
In the “Criteria” section of this article, we describe two criteria for locating this cutpoint that have similar intuitive justifications. In describing the mathematical mechanisms behind these criteria, we demonstrate that one of the criteria retains the intended meaning, while the other inherently depends on quantities that may differ from an investigator's intentions. In the “Example” section, we use data from a nested case-control study carried out in the Calcium for Pre-Eclampsia Prevention cohort (12) to demonstrate how these two criteria identify different cutpoints for the classification of 120 preeclampsia cases and 120 controls based on levels of placenta growth factor, a biomarker of angiogenesis. Next, we discuss the appropriateness of the term “optimal” as it applies to each criterion. This is handled first with equally weighted sensitivity and specificity. Consideration of differing disease prevalences and costs due to misclassification is also presented as a practical generalization (5, 13). We end with a brief discussion.
CRITERIA
The closest-to-(0,1) criterion
This criterion can be viewed as searching for the shortest radius originating at the point (0,1) and terminating on the ROC curve. Reference arcs centered at (0,1) can be used to compare radial distances visually: the arc corresponding to the selected cutpoint c* is tangent to the ROC curve and is therefore the minimum, lying interior to every other such concentric arc. Figure 1 demonstrates this point, at which the dotted arc is completely interior to, and thus closer to (0,1) than, the arc formed by the distance to an alternate point on the curve.
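A minimal numerical sketch of this criterion, under normal biomarker models that are our illustrative assumption rather than the article's data:

```python
from math import erf, sqrt
import numpy as np

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def closest_to_01(cuts, q, p):
    """Cutpoint minimizing the distance from the ROC point (1 - p, q) to (0,1)."""
    dist = np.sqrt((1.0 - q) ** 2 + (1.0 - p) ** 2)
    return cuts[np.argmin(dist)]

# Hypothetical models: diseased ~ N(1, 1), nondiseased ~ N(0, 1).
cuts = np.linspace(-4.0, 6.0, 10001)
q = 1.0 - np.vectorize(Phi)(cuts - 1.0)  # sensitivity q(c)
p = np.vectorize(Phi)(cuts)              # specificity p(c)
c_star = closest_to_01(cuts, q, p)
print(round(c_star, 3))  # by symmetry, the midpoint 0.5 of the two means
```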
The Youden index
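In the notation used throughout, with q(c) and p(c) the sensitivity and specificity at cutpoint c, the Youden index has its standard definition, consistent with the description of equation 2 in the "Optimality" section below:

```latex
J = \max_{c}\,\bigl\{\,q(c) + p(c) - 1\,\bigr\},
\qquad
c_J = \operatorname*{arg\,max}_{c}\,\bigl\{\,q(c) + p(c) - 1\,\bigr\}.
```

Geometrically, J is the maximum vertical distance between the ROC curve and the chance diagonal, which is the sense in which c_J is "the point farthest from none."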
Agreement/disagreement
The above criteria share the same intuition: each seeks to maximize the rate at which people are classified correctly and, equivalently, to minimize the rate at which they are classified incorrectly. The natural question is whether they identify the same "optimal" cutpoint.
Suppose the biomarker of interest follows completely known continuous distributions in both the diseased and nondiseased populations, leading to a true ROC curve. Our only distributional restriction is that the resulting ROC curve be differentiable everywhere. This holds when diseased and nondiseased persons are assumed to follow any of a number of common continuous densities (e.g., normal, lognormal, gamma). Through differentiation, Appendix 1 shows that the two criteria agree, c* = cJ = c, only when q(c*) = p(c*) and q(cJ) = p(cJ). When either criterion identifies a point on the curve such that q(c*) ≠ p(c*) or q(cJ) ≠ p(cJ), the criteria disagree on which cutpoint is "optimal"; that is, c* ≠ cJ.
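The agreement and disagreement described above can be checked numerically; the normal models below are illustrative assumptions, not the article's data. With equal variances the two criteria select (up to grid resolution) the same cutpoint, while unequal variances pull them apart:

```python
from math import erf, sqrt
import numpy as np

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def both_criteria(mu_d, sd_d, mu_h, sd_h, cuts):
    """Return (c*, c_J) under normal diseased/nondiseased biomarker models."""
    q = 1.0 - np.vectorize(Phi)((cuts - mu_d) / sd_d)  # sensitivity q(c)
    p = np.vectorize(Phi)((cuts - mu_h) / sd_h)        # specificity p(c)
    c_star = cuts[np.argmin((1.0 - q) ** 2 + (1.0 - p) ** 2)]  # closest to (0,1)
    c_j = cuts[np.argmax(q + p - 1.0)]                         # Youden index J
    return c_star, c_j

cuts = np.linspace(-5.0, 8.0, 13001)
agree = both_criteria(1.0, 1.0, 0.0, 1.0, cuts)     # equal variances
disagree = both_criteria(1.0, 2.0, 0.0, 1.0, cuts)  # unequal variances
print(agree)     # both near 0.5
print(disagree)  # clearly different cutpoints
```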
An investigator with complete knowledge of a biomarker's data distribution could be faced with two different cutpoints labeled “optimal” under two criteria that are intuitively the same. Our motivation here is simply to show that they are different and address the appropriateness of the label “optimal.”
EXAMPLE
Preeclampsia affects approximately 5 percent of pregnancies, resulting in substantial maternal and neonatal morbidity and mortality (16). Although the cause remains unclear, the syndrome may be initiated by placental factors that enter the maternal circulation and cause endothelial dysfunction, resulting in hypertension and proteinuria (12). Identifying women suffering from preeclampsia is a very important step in the management of the disease. Placenta growth factor is a promising biomarker for such classification, with an AUC of 0.60 (95 percent confidence interval: 0.53, 0.67); however, at what level would a woman be classified as at risk for the disease? Levine et al. (12) conducted a nested case-control study of 120 women with preeclampsia and 120 normal women randomly chosen from the Calcium for Pre-Eclampsia Prevention cohort study. Placenta growth factor levels were measured from serum specimens obtained before labor. Figure 2 shows the ROC curve generated from the log-transformed placenta growth factor levels. After calculation of the distance to (0,1) and the distance to the diagonal for each point, the cutpoints c* = 4.64 and cJ = 4.12, respectively, are identified. Thus, criteria with seemingly identical intuitive intents produce close results but disagree on the “optimal” cutpoint. Again, here it is sufficient to demonstrate that disagreement exists. We will revisit this example after the question of “optimality” has been addressed.
“Optimality”
When attempting to classify people on the basis of biomarker levels, it is always one's intent to do so “optimally.” However, the event of interest may intrinsically involve constraints which must, for ethical or fiscal reasons, be considered. These constraints commonly account for the prevalence of the event in both populations and the costs of misclassification, both monetary and physiologic. Thus, mathematical techniques of optimality must now operate within these constraints, but the idea of an “optimal” cutpoint should remain; one still wishes to choose a point that classifies the most people correctly and the fewest incorrectly.
First let us assume the simplest scenario, absent constraints or weighting. By definition, the cJ found by equation 2 maximizes the overall rate of correct classification, q(cJ) + p(cJ). As a result, the overall rate of misclassification, (1 − q(cJ)) + (1 − p(cJ)), is minimized. Thus, we can say that cJ is "optimal" with respect to the total correct and incorrect classification rates and that any cutpoint that deviates from it is not.
Example revisited
To demonstrate this unnecessary misclassification and its possible magnitude, we revisit the example in which placenta growth factor levels are used to differentiate preeclamptic women from those without the disease. Sensitivity and specificity at the cutpoints previously identified are q(c*) = 0.592, p(c*) = 0.558 and q(cJ) = 0.817, p(cJ) = 0.358, respectively. The overall correct classification rate (q + p) is 1.150 for c* and 1.175 for cJ out of a possible 2, a difference of 0.025. Without the justification for the third term in equation 3 and without weighting, this difference can be thought of as one person out of 100 being unnecessarily misclassified. Relative cost and disease prevalence are often difficult to assess, as discussed by Greiner et al. (18) and the references cited therein. Thus, we will not attempt adjustment in this example.
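A quick arithmetic check of the rates reported above (values taken directly from the text):

```python
# Sensitivity and specificity reported in the text at each cutpoint.
q_star, p_star = 0.592, 0.558  # at c* = 4.64, closest-to-(0,1) criterion
q_j, p_j = 0.817, 0.358        # at c_J = 4.12, Youden index criterion

total_star = q_star + p_star   # overall correct classification rate: 1.150
total_j = q_j + p_j            # overall correct classification rate: 1.175
print(round(total_j - total_star, 3))  # 0.025, ~1 extra misclassification per 100
```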
DISCUSSION
In this paper, we demonstrated the intuitive similarity of two criteria used to choose an “optimal” cutpoint. We then showed that the criteria agree in some instances and disagree in others. Placenta growth factor levels used to classify women as preeclamptic or not preeclamptic were used to demonstrate this point and quantify the extent of disagreement.
We addressed both criteria in the context of what an investigator might view as “optimal,” with and without attention to misclassification cost and prevalence. Mathematically, J reflects the intention of maximizing overall correct classification rates and thus minimizing misclassification rates, while choosing the point closest to (0,1) involves a quadratic term for which the clinical meaning is unknown. It is for this reason that we advocate for the use of J to find the “optimal” cutpoint.
Since the (0,1) criterion is visually intuitive, we have provided an amended (0,1) criterion in Appendix 2 that is likewise geometrically satisfying while consistently identifying the same “optimal” cutpoint as J. This criterion relies on a ratio of radii originating at (0,1).
Additional motivation for using J is an ever-increasing body of supporting literature (9, 15, 19). Topics such as confidence intervals and correcting the estimate for measurement error have been considered, whereas the (0,1) criterion lacks such support.
Most importantly, cutpoints chosen through less-than-"optimal" criteria, or criteria that are "optimal" only in some arbitrary sense, can lead to unnecessary misclassifications, resulting in needlessly missed opportunities for disease diagnosis and intervention. We showed above that J is "optimal" when equal weight is given to sensitivity and specificity (r = 1) and that a generalized J is "optimal" when cost and prevalence lead to weighted sensitivity and specificity (r ≠ 1). Thus, when the point closest to (0,1) differs from the point yielding J, using the closest-to-(0,1) criterion to establish an "optimal" cutpoint unnecessarily increases the rate of misclassification.
APPENDIX 1
For continuous receiver operating characteristic (ROC) curves, we make no distributional assumptions beyond requiring that the probability density functions fD and fD̄ for biomarker levels of diseased and nondiseased persons, respectively, form an ROC curve that is differentiable everywhere. This is the case when fD and fD̄ are assumed to be any common continuous parametric distributions (e.g., normal, gamma, lognormal).
Equations A1.2 and A1.4 show us that the (0,1) and J methods agree, c* = cJ = c, only when q(c*) = p(c*) and thus (1 − p(c*))/(1 − q(c*)) = 1. When q(c*) ≠ p(c*), the criteria disagree on what point is optimal (c* ≠ cJ).
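Under the convention that a test is positive when the biomarker level exceeds the cutpoint c, so that q(c) = ∫c∞ fD(x) dx and p(c) = ∫−∞c fD̄(x) dx, the first-order conditions behind this conclusion (our reconstruction of what equations A1.2 and A1.4 express) are:

```latex
% Closest-to-(0,1): minimize D(c)^2 = (1 - q(c))^2 + (1 - p(c))^2.
% Since q'(c) = -f_D(c) and p'(c) = f_{\bar D}(c),
\frac{d}{dc}\Bigl[(1 - q(c))^2 + (1 - p(c))^2\Bigr]
  = 2\bigl(1 - q(c)\bigr) f_D(c) - 2\bigl(1 - p(c)\bigr) f_{\bar D}(c) = 0
\;\Longrightarrow\;
\frac{f_D(c^*)}{f_{\bar D}(c^*)} = \frac{1 - p(c^*)}{1 - q(c^*)}.

% Youden index: maximize J(c) = q(c) + p(c) - 1.
\frac{d}{dc}\bigl[q(c) + p(c) - 1\bigr] = -f_D(c) + f_{\bar D}(c) = 0
\;\Longrightarrow\;
f_D(c_J) = f_{\bar D}(c_J).
```

The two conditions coincide exactly when (1 − p(c*))/(1 − q(c*)) = 1, that is, when q(c*) = p(c*), as stated above.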
APPENDIX 2
This research was supported by the National Institutes of Health Intramural Research Program, National Institute of Child Health and Human Development.
The authors thank Dr. Richard Levine for allowing them to use the data from the Calcium for Pre-Eclampsia Prevention Study.
Conflict of interest: none declared.
References
Zhou XH, Obuchowski NA, McClish DK. Statistical methods in diagnostic medicine. New York, NY: John Wiley and Sons, Inc,
Faraggi D. Adjusting ROC curves and related indices for covariates.
Schisterman EF, Faraggi D, Reiser B. Adjusting the generalized ROC curve for covariates.
Pepe M. The statistical evaluation of medical tests for classification and prediction. New York, NY: Oxford University Press,
Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine.
Coffin M, Sukhatme S. Receiver operating characteristic studies and measurement errors.
Sharir T, Berman DS, Waechter PB, et al. Quantitative analysis of regional motion and thickening by gated myocardial perfusion SPECT: normal heterogeneity and criteria for abnormality.
Perkins NJ, Schisterman EF. The Youden index and the optimal cut-point corrected for measurement error.
Schisterman EF, Faraggi D, Brown R, et al. TBARS and cardiovascular disease in a population-based sample.
Chen R, Rabinovitch PS, Crispin DA, et al. DNA fingerprinting abnormalities can distinguish ulcerative colitis patients with dysplasia and cancer from those who are dysplasia/cancer-free.
Levine RJ, Maynard SE, Qian C, et al. Circulating angiogenic factors and the risk of preeclampsia.
Barkan N. Statistical inference on r * specificity + sensitivity. (Doctoral dissertation). Haifa, Israel: University of Haifa,
Schisterman EF, Perkins NJ, Liu A, et al. Optimal cutpoint and its corresponding Youden index to discriminate individuals using pooled blood samples.
Chmura Kraemer H. Evaluating medical tests: objective and quantitative guidelines. Newbury Park, CA: Sage Publications,
Geisser S. Comparing two tests used for diagnostic or screening processes.
Greiner M, Pfeiffer D, Smith RM. Principles and practical application of the receiver-operating characteristic analysis for diagnostic tests.