REPLY: We appreciate the comments (1) by Zhang and Zhang in response to our manuscript (2). We thank them for their compliments about our work overall. However, we would also like to clarify several points of confusion for their sake and for readers generally.
First, our feature selection involved more than simply “manually excluding several highly correlated features.” We also performed feature reduction by individually assessing each feature for significant association with recurrence, using bootstrap resampling. We then performed forward stepwise feature selection for model building, carefully checking the nonredundancy of the added feature at each step, using model stability testing based on bootstrap analysis of the out-of-bag C-indices. Consequently, our stepwise selections terminated after no more than 5 rounds, instead of spuriously adding correlated features. Additionally, to further test that the selected features were not overfitting to noise, we evaluated our model on the test set and calculated C-index CIs to ensure that the training and test C-indices overlapped. This combination of procedures more than satisfies the authors’ demand for “more sophisticated and rigorous dimensionality reduction methods…to ensure the reproducibility and independence of the identified radiomic features” (1). We also note that the rigorous tests for nonredundancy that we implemented rebut the authors’ comments regarding subgroup analysis by stage, because the nonredundancy between radiomics and stage is already built into the model development.
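For readers less familiar with this kind of check, the comparison of training and test C-indices can be illustrated with a minimal sketch. It is not the code used in our study: it assumes a simple percentile bootstrap, ignores tied event times, and uses hypothetical function names (`concordance_index`, `bootstrap_ci`, `intervals_overlap`) and made-up toy data.

```python
import random


def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data (ties in time are ignored)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0  # earlier event has the higher risk score
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def bootstrap_ci(times, events, risks, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI of the C-index (assumed scheme, for illustration)."""
    rng = random.Random(seed)
    n = len(times)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        try:
            stats.append(concordance_index(
                [times[k] for k in idx],
                [events[k] for k in idx],
                [risks[k] for k in idx],
            ))
        except ValueError:
            continue  # skip resamples that contain no comparable pairs
    if not stats:
        raise ValueError("no usable bootstrap resamples")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


def intervals_overlap(a, b):
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


# Toy data (hypothetical): follow-up time, event indicator, model risk score.
train = ([5, 8, 12, 20, 24, 30], [1, 1, 0, 1, 0, 0], [0.9, 0.7, 0.6, 0.5, 0.2, 0.1])
test = ([4, 10, 15, 22, 28], [1, 0, 1, 1, 0], [0.8, 0.3, 0.6, 0.4, 0.2])

train_ci = bootstrap_ci(*train)
test_ci = bootstrap_ci(*test)
print("train CI:", train_ci, "test CI:", test_ci)
print("overlap:", intervals_overlap(train_ci, test_ci))
```

Overlapping intervals are consistent with, though do not by themselves prove, the absence of overfitting; with so few toy cases the intervals are wide and serve only to show the mechanics.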
Second, regarding the statement that “Figure 6 showed the same hazard ratios in the models…suggesting that whole-body biomarkers failed to provide additional information for risk stratification,” we presume the authors are referring to the comparison between Figures 6B and 6C. However, as we mention, the risk score cutoff in Figure 6C was explicitly chosen solely to ensure the same numbers of high-risk and low-risk cases as in Figures 6A and 6B for comparison. It was not chosen to optimize the hazard ratio. The cutoff in Figure 6D, by contrast, was chosen to optimize the hazard ratio, and there one can see that the optimal stratification using the whole-body radiomics-based risk score outperforms the other stratifications.
We agree with some of the authors’ points. Although we discussed potential physiologic correlates of our features, the biologic meaning of the features would undoubtedly be made more precise by “correlation with computational pathology features, radiology–pathology coregistration, or analysis of biologic pathways or genomic correlations.” This was beyond our scope of work but is a worthy line of inquiry for future validation studies. We also agree that other measures of model performance could have been reported, including decision curve analysis, calibration plots, or net reclassification improvement. Indeed, there is a wide variety of reasonable measures, and the choice of which to report always involves an element of arbitrariness. The metrics we reported, including the C-indices, the 2-y receiver-operating-characteristic curve, and risk stratifications with hazard ratios, were chosen on the basis of their widespread usage in the biostatistics literature. Nevertheless, we concede that future validation studies would benefit from a more comprehensive set of assessments.
Footnotes
Published online Jan. 13, 2022.
- © 2022 by the Society of Nuclear Medicine and Molecular Imaging.
REFERENCES
- 1.
- 2.
- Revision received December 20, 2021.
- Accepted for publication January 6, 2022.