A Common Mistake in Assessing the Diagnostic Value of a Test: Failure to Account for Statistical and Methodologic Issues
========================================================================================================================

* Siamak Sabour

**TO THE EDITOR:** I read with interest the paper by Anand et al. in the December 2016 issue of *The Journal of Nuclear Medicine* (1). The authors' purpose was to assess the impact of variability in scanning speed and in vendor-specific γ-camera settings on the reproducibility and accuracy of the automated bone scan index (BSI) (1). They measured reproducibility as the absolute difference between repeated BSI values and accuracy as the absolute difference between observed BSI values and phantom BSI values, and they compared the generated data using descriptive statistics.

Reproducibility (reliability) and accuracy (validity) are two completely different methodologic concepts and should be assessed with the appropriate tests. It is crucial to be aware that, regarding reliability, one should use the intraclass correlation coefficient for quantitative variables and the weighted κ-test for qualitative variables. Regarding validity, by contrast, one should use the interclass correlation coefficient (Pearson *r*) for quantitative variables, whereas the most appropriate measures for qualitative variables include sensitivity, specificity, positive and negative predictive values, positive and negative likelihood ratios, diagnostic accuracy, and the odds ratio. Moreover, in analyzing reliability one should apply an individual-based approach using the single-measure intraclass correlation coefficient for agreement, because a global-average approach (absolute difference) can be misleading: a test may show high validity yet have no reliability at all (2–8).

Anand et al.
enrolled 25 patients in each of 3 groups and observed significantly lower reproducibility for group 2 (mean ± SD, 0.35 ± 0.59) than for group 1 (0.10 ± 0.13; *P* < 0.0001) or group 3 (0.09 ± 0.10; *P* < 0.0001). However, no significant difference in reproducibility was observed between group 3 and group 1 (*P* = 0.388) (1). Statistical significance and clinical importance are two completely different issues, and in clinical research, especially reliability analysis, we should not place the emphasis on the significance level (*P* value) (2–8). The authors concluded that the accuracy and reproducibility of automated BSI were dependent on scanning speed but not on vendor-specific γ-camera settings. Such a conclusion should be supported by the statistical and methodologic considerations outlined above; otherwise, misdiagnosis and patient mismanagement may occur in clinical practice.

## Footnotes

* Published online Jan. 12, 2017.
* © 2017 by the Society of Nuclear Medicine and Molecular Imaging.

## REFERENCES

1. Anand A, Morris MJ, Kaboteh R, et al. A preanalytic validation study of automated bone scan index: effect on accuracy and reproducibility due to the procedural variabilities in bone scan image acquisition. J Nucl Med. 2016;57:1865–1871.
2. Szklo M, Nieto FJ. Epidemiology Beyond the Basics. 2nd ed. Manhattan, NY: Jones and Bartlett; 2007.
3. Sabour S. Myocardial blood flow quantification by Rb-82 cardiac PET/CT: methodological issues on reproducibility study. J Nucl Cardiol. September 6, 2016 [Epub ahead of print].
4. Sabour S.
Reproducibility of semi-automatic coronary plaque quantification in coronary CT angiography with sub-mSv radiation dose: common mistakes. J Cardiovasc Comput Tomogr. 2016;10:e21–e22.
5. Sabour S. Reliability of a new modified tear breakup time method: methodological and statistical issues. Graefes Arch Clin Exp Ophthalmol. 2016;254:595–596.
6. Sabour S, Farzaneh F, Peymani P. Evaluation of the sensitivity and reliability of primary rainbow trout hepatocyte vitellogenin expression as a screening assay for estrogen mimics: methodological issues. Aquat Toxicol. 2015;164:175–176.
7. Sabour S. Re: does the experience level of the radiologist, assessment in consensus, or the addition of the abduction and external rotation view improve the diagnostic reproducibility and accuracy of MRA of the shoulder? [comment]. Clin Radiol. 2015;70:333–334.
8. Sabour S. The reliability of routine clinical post-processing software in assessing potential diffusion-weighted MRI “biomarkers” in brain metastases, common mistake [comment]. Magn Reson Imaging. 2014;32:1162.
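As a minimal numerical illustration of the letter's central point (that a small mean absolute difference between repeated measurements can coexist with poor single-measure agreement), the sketch below computes ICC(2,1), the two-way random-effects, absolute-agreement, single-measure intraclass correlation. All data are synthetic and hypothetical, chosen only to mimic BSI-like values with a constant between-session offset; they are not the data of Anand et al.

```python
import numpy as np

def icc_agreement_single(scores):
    """ICC(2,1): two-way random-effects, absolute-agreement, single-measure
    intraclass correlation for an (n_subjects, k_sessions) array."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-subject means
    col_means = scores.mean(axis=0)   # per-session means

    # Two-way ANOVA mean squares
    ms_r = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between subjects
    ms_c = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between sessions
    sse = np.sum((scores - row_means[:, None] - col_means[None, :] + grand) ** 2)
    ms_e = sse / ((n - 1) * (k - 1))                        # residual

    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Hypothetical repeated readings: the second session tracks the first closely
# but carries a constant +0.3 offset (a systematic, not random, disagreement).
rng = np.random.default_rng(0)
true_bsi = rng.uniform(0.0, 0.5, size=25)
rep = np.column_stack([true_bsi + rng.normal(0, 0.02, 25),
                       true_bsi + 0.3 + rng.normal(0, 0.02, 25)])

print("mean absolute difference:", np.mean(np.abs(rep[:, 0] - rep[:, 1])))
print("single-measure agreement ICC:", icc_agreement_single(rep))
```

With these illustrative numbers the readings look tightly coupled pair by pair, yet the agreement ICC is far below conventional reliability thresholds because the systematic offset is penalized; a summary based only on absolute differences would not reveal this.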