Abstract
Knowledge of the intrinsic variability of radiomic features is essential to the proper interpretation of changes in these features over time. The primary aim of this study was to assess the test–retest repeatability of radiomic features extracted from 18F-FDG PET images of cervical tumors. The impact of different image preprocessing methods was also explored.
Methods: Patients with cervical cancer underwent baseline and repeat 18F-FDG PET/CT imaging within 7 d. PET images were reconstructed using 2 methods: ordered-subset expectation maximization (PETOSEM) or ordered-subset expectation maximization with point-spread function (PETPSF). Tumors were segmented to produce whole-tumor volumes of interest (VOIWT) and 40% isocontours (VOI40). Voxels were either left at the default size or resampled to 3-mm isotropic voxels. SUV was discretized to a fixed number of bins (32, 64, or 128). Radiomic features were extracted from both VOIs, and repeatability was then assessed using the Lin concordance correlation coefficient (CCC).
Results: Eleven patients were enrolled and completed the test–retest PET/CT imaging protocol. Shape, neighborhood gray-level difference matrix, and gray-level cooccurrence matrix features were repeatable, with a mean CCC value of 0.81. Radiomic features extracted from PETOSEM images showed significantly better repeatability than features extracted from PETPSF images (P < 0.001). Radiomic features extracted from VOI40 were more repeatable than features extracted from VOIWT (P < 0.001). For most features (78.4%), a change in bin number or voxel size resulted in less than a 10% change in feature value. All gray-level emphasis and gray-level run emphasis features showed poor repeatability (CCC values < 0.52) when extracted from VOIWT but were highly repeatable (mean CCC values > 0.96) when extracted from VOI40.
Conclusion: Shape, gray-level cooccurrence matrix, and neighborhood gray-level difference matrix radiomic features were consistently repeatable, whereas gray-level run length matrix and gray-level zone length matrix features were highly variable. Radiomic features extracted from VOI40 were more repeatable than features extracted from VOIWT. Changes in voxel size or SUV discretization parameters typically resulted in relatively small differences in feature value, though several features were highly sensitive to these changes.
The role of PET/CT with 18F-FDG in oncologic diagnosis, staging, and treatment monitoring is well established (1,2). With the increasing use of functional imaging modalities such as PET/CT, interest in the quantification of image data has grown. Despite the existence of established quantitative frameworks for response assessment such as RECIST and PERCIST, most image interpretation in clinical practice still relies primarily on subjective visual assessment (3,4).
Tumor metabolism is commonly quantified with SUV metrics, including SUVmax, SUVmean, and SUVpeak (5). Radiomic features are complex quantitative imaging biomarkers purported to provide additional information beyond intensity-based SUV metrics (6). Extraction of radiomic features typically incorporates image preprocessing, segmentation, and feature calculation. Each of these steps has been shown to affect the outcome of radiomic analyses (7).
18F-FDG PET is now part of the standard of care for the evaluation of cervical cancer in the United States (8). In cervical cancer patients, radiomic features extracted from 18F-FDG PET images have been shown to predict both survival and disease recurrence (9,10). As PET is also useful for monitoring response to cervical cancer treatment, there is interest in using radiomic methods to further this purpose (11–13). However, to use these novel features effectively, their intrinsic variability must be formally quantified. Unfortunately, studies focusing on the repeatability of radiomic features are limited and have shown conflicting results (14,15).
The primary aim of this study was to assess the repeatability of radiomic features extracted from PET images of cervical cancer patients. The effects of reconstruction, segmentation, voxel resampling, and SUV discretization methods on repeatability were also explored. These images were collected as part of a repeatability study of several commonly used PET/MRI and PET/CT quantitative imaging metrics (16).
MATERIALS AND METHODS
Subjects
This prospective study was approved by the Washington University Institutional Review Board, and all volunteers provided written informed consent (ClinicalTrials.gov identifier NCT02717572). From June 2016 through May 2017, 17 patients with histologically proven malignancy were enrolled. Two patients who failed to complete the second imaging session were excluded, and another was excluded because of a lack of tumor 18F-FDG uptake. For the current study focusing on cervical cancer, 3 patients without cervical cancer were also excluded.
Each patient underwent double baseline 18F-FDG PET/CT imaging separated by at least 24 h and at most 7 d. Unless otherwise noted, the imaging procedures used in this study conformed with the European Association of Nuclear Medicine tumor imaging guidelines (2). The same PET/CT scanner and approximate 18F-FDG dose were used for both imaging sessions. Patients were instructed to fast and to avoid liquids other than water for at least 6 h before the planned 18F-FDG administration time. Blood glucose levels were measured immediately before the 18F-FDG injection.
Imaging
All images were acquired using a Biograph 40 scanner (Siemens AG). Static PET data were collected for 15 min over a single station (tumor centered within a 21.6-cm field of view), starting 60–70 min after intravenous injection of 370 MBq of 18F-FDG. A CT was performed immediately before PET imaging, using a tube potential of 120 kV, maximum tube current of 80 mAs (CARE Dose 4D [Siemens] tube current modulation), pitch of 0.8, and rotation time of 0.5 s. PET images were reconstructed using ordered-subset expectation maximization (PETOSEM) or ordered-subset expectation maximization with point-spread function (PETPSF). Image reconstruction parameters, as well as additional acquisition and postacquisition details, can be found in Supplemental Table 1 (supplemental materials are available at http://jnm.snmjournals.org).
Image Analysis
MIM, version 6.9.3 (MIM Software), was used for image segmentation (Fig. 1). For each PET/CT session, the lesion volume of interest (VOI) was manually delineated by 1 expert nuclear medicine reader to generate a whole-tumor contour (VOIWT). On the basis of previous work involving segmentation of cervical tumors, a 40% isocontour (VOI40) was also generated, containing all voxels with an SUV of at least 40% of the SUVmax of the VOIWT (17). Before radiomic feature calculation, PET image intensities were normalized to decay-corrected injected activity per kilogram of body weight (SUV [g/mL]). The effects of SUV discretization and spatial resampling were explored by discretizing to a fixed number of bins (32, 64, or 128), over the full SUV range in each image, and either no spatial resampling (4.07 × 4.07 × 5.00 mm) or resampling to 3-mm isotropic voxels. Radiomic features were then extracted using LIFEx, version 5.28, which complies with the Imaging Biomarker Standardization Initiative recommendations (18). Intensity, shape, and textural features (Table 1) were calculated from each VOI. Detailed descriptions, formulas, and computation parameters for each feature can be found on the LIFEx website (https://www.lifexsoft.org).
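The VOI40 rule and the fixed-bin-number discretization described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MIM or LIFEx implementation: the function names, the flat list of SUVs, the image SUV range, and the exact bin-edge convention are all hypothetical.

```python
import math

def voi40(suv_values, fraction=0.40):
    """Keep voxels whose SUV is at least `fraction` of the VOI's SUVmax."""
    threshold = fraction * max(suv_values)
    return [v for v in suv_values if v >= threshold]

def discretize_fixed_bin_number(suv_values, n_bins, lo, hi):
    """Map SUVs in [lo, hi] onto integer bins 1..n_bins of equal width.

    Assumed convention: bin = floor(n_bins * (v - lo) / (hi - lo)) + 1,
    with values at the upper edge clipped into the top bin.
    """
    if hi <= lo:
        raise ValueError("hi must exceed lo")
    bins = []
    for v in suv_values:
        b = math.floor(n_bins * (v - lo) / (hi - lo)) + 1
        bins.append(min(max(b, 1), n_bins))
    return bins

# Hypothetical SUVs (g/mL) inside a manually drawn whole-tumor VOI.
voi_wt = [1.2, 2.0, 3.9, 4.2, 6.5, 8.1, 10.0]

# 40% isocontour: threshold is 0.40 * SUVmax of the whole-tumor VOI.
voi_40 = voi40(voi_wt)

# Discretize over an assumed full-image SUV range of 0 to 10 g/mL.
binned = discretize_fixed_bin_number(voi_40, n_bins=32, lo=0.0, hi=10.0)
```

Repeating the last call with `n_bins=64` or `n_bins=128` gives the other two discretization settings explored in the study.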
FIGURE 1. Representative images. Tumors were manually delineated to generate VOIWT (baseline image [A] and repeat image [B]). VOI40 was created by removing all voxels with SUVs ≤ 40% of SUVmax of VOIWT (baseline image [C] and repeat image [D]).
TABLE 1. Radiomic Feature Groups
Statistical Methods
Repeatability was assessed using the Lin concordance correlation coefficient (CCC), which provides a quantification of agreement between 2 repeated measures (19). In this study, a CCC value of less than 0.80 was considered unrepeatable, a value of at least 0.80 was considered repeatable, and a value of at least 0.95 was considered highly repeatable. Including the repeatability calculations for all reconstruction, segmentation, voxel resampling, and SUV discretization methods, we generated 24 CCC values for each radiomic feature.
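For readers who want to reproduce the agreement metric, the following is a minimal sketch of the Lin CCC, the repeatability categories defined above, and the count of 24 method combinations. It assumes the standard definition, CCC = 2·s_xy / (s_x² + s_y² + (mean_x − mean_y)²), with variances and covariance computed using a 1/n divisor; the function names and the example values are hypothetical and are not taken from the study data.

```python
from itertools import product

def lin_ccc(x, y):
    """Lin concordance correlation coefficient for paired measurements."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and have at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined when both series are the same constant")
    return 2 * cov_xy / denominator

def classify(ccc):
    """Apply the cutoffs used in this study (0.80 and 0.95)."""
    if ccc >= 0.95:
        return "highly repeatable"
    if ccc >= 0.80:
        return "repeatable"
    return "unrepeatable"

# One CCC per feature for every combination of methods: 2 x 2 x 2 x 3 = 24.
combinations = list(product(
    ["PET_OSEM", "PET_PSF"],          # reconstruction
    ["VOI_WT", "VOI_40"],             # segmentation
    ["native", "3mm_isotropic"],      # spatial resampling
    [32, 64, 128],                    # number of SUV bins
))

# Hypothetical test-retest values of one feature in five patients.
test_values = [10.1, 12.4, 8.7, 15.0, 11.2]
retest_values = [10.4, 12.0, 9.1, 14.6, 11.5]
label = classify(lin_ccc(test_values, retest_values))
```

Unlike the Pearson correlation, the CCC penalizes a systematic shift between sessions through the (mean_x − mean_y)² term, so a feature that is perfectly correlated but consistently offset on retest scores below 1.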
Paired t tests or Wilcoxon signed-rank tests were used to assess groupwise repeatability differences. Group normality was evaluated using D’Agostino–Pearson tests. Data were analyzed using R, version 3.6.2 (http://cran.r-project.org/), and Excel, version 2016 (Microsoft Corp.). A P value of less than 0.05 was considered significant, unless otherwise indicated. A Bonferroni adjustment for multiple comparisons was applied when necessary to control for type I errors.
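The Bonferroni adjustment mentioned above amounts to dividing the significance level by the number of comparisons in a family. The sketch below illustrates only that step; the normality and paired tests themselves are not reproduced. The grouping of comparisons into a family and the P values shown are hypothetical, because the text does not specify them.

```python
def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted P values and significance flags.

    Each raw P value is compared against alpha / m, where m is the number
    of comparisons in the family; equivalently, P * m is compared to alpha.
    """
    m = len(p_values)
    if m == 0:
        raise ValueError("at least one P value is required")
    adjusted = [min(1.0, p * m) for p in p_values]
    significant = [p < alpha / m for p in p_values]
    return adjusted, significant

# Hypothetical raw P values from four groupwise comparisons.
raw_p = [0.001, 0.012, 0.030, 0.200]
adjusted_p, is_significant = bonferroni(raw_p)
```

With four comparisons the per-test threshold becomes 0.05 / 4 = 0.0125, so a raw P value of 0.030 would no longer be considered significant.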
RESULTS
Seventeen patients were enrolled, and test–retest images from 11 female patients (Table 2) were eligible for radiomic analysis. The median time between imaging sessions was 2 d, with a range of 1–7 d. Blood glucose levels (mean ± SD) before 18F-FDG administration were 92.2 ± 8.1 and 93.2 ± 20.9 mg/dL during visits 1 and 2, respectively. The mean administered 18F-FDG dose was 367.7 ± 20.2 MBq during visit 1 and 371.0 ± 16.8 MBq during visit 2. The mean visit 1 18F-FDG uptake time was 60.4 ± 1.4 min, and the mean visit 2 18F-FDG uptake time was 61.4 ± 3.1 min.
TABLE 2. Patient Characteristics
Radiomic Feature Repeatability
When assessed as groups, standard-intensity, shape, neighborhood gray-level difference matrix (NGLDM), and gray-level cooccurrence matrix (GLCM) features showed consistent repeatability, with mean CCC values greater than 0.80 (Fig. 2). Nonstandard-intensity, gray-level zone length matrix (GLZLM), and gray-level run length matrix (GLRLM) features were mostly unrepeatable (56.7% of CCC values were less than 0.80) when extracted from PETPSF images (Fig. 3). Nonstandard-intensity, GLZLM, and GLRLM features were also mostly unrepeatable (55.8% of CCC values were less than 0.80) when extracted from VOIWT segmentations.
FIGURE 2. Repeatable radiomic feature groups. Box plots show repeatability ranges for all features within specified feature group. ISR = isotropic voxels; NSR = no spatial resampling.
FIGURE 3. Unrepeatable radiomic feature groups. ISR = isotropic voxels; NSR = no spatial resampling.
Highly repeatable radiomic features (mean CCC values above 0.95) included metabolic tumor volume, compacity, entropy (GLCM-calculated), gray-level nonuniformity (GLRLM-calculated), run length nonuniformity, coarseness, long-zone high gray-level emphasis, gray-level nonuniformity (GLZLM-calculated), and zone length nonuniformity (Supplemental Tables 2 and 3). Nearly all (98.6%) GLCM CCC values were greater than 0.80, and many (61.8%) were greater than 0.95. Overall, the combination of PETOSEM with isotropic resampling, SUV discretized to 64 bins, and tumors segmented using VOI40 resulted in the most stable features (76.7% were repeatable and 39.5% highly repeatable). The fewest repeatable radiomic features resulted from PETPSF images with no spatial resampling, SUV discretized to 128 bins, and tumors segmented manually (39.5% were repeatable and 20.9% highly repeatable).
PET Reconstruction Method
The mean CCC values for radiomic features extracted from PETOSEM and PETPSF images are provided in Table 3. Shape and textural (i.e., GLCM, GLRLM, NGLDM, and GLZLM) features had significantly higher CCC values when extracted from PETOSEM images than when extracted from PETPSF images (0.86 and 0.79, respectively; P < 0.001; Fig. 4A). GLRLM and GLZLM features were impacted by reconstruction method more than were other features (Fig. 5A). Standard-intensity–based features extracted from both PET reconstructions had approximately the same mean CCC value (0.87; P = 0.494; Fig. 2).
TABLE 3. Comparison of Radiomic Feature Repeatability Between PET Image Reconstruction Methods
FIGURE 4. Bland–Altman plots showing effect on radiomic feature repeatability of changes in reconstruction (PETPSF CCC subtracted from PETOSEM CCC [A]), segmentation (VOIWT CCC subtracted from VOI40 CCC [B]), spatial resampling (no-spatial-resampling CCC subtracted from resampling to 3-mm isotropic voxel CCC [C]), and SUV discretization (32-bin CCC subtracted from 64-bin CCC, 32-bin CCC subtracted from 128-bin CCC, and 64-bin CCC subtracted from 128-bin CCC [D]). Plots were generated by calculating mean of, and difference between, corresponding radiomic feature CCC values of each method.
FIGURE 5. Bland–Altman plots showing how changes in reconstruction (A), segmentation (B), spatial resampling (C), and SUV discretization (D) affected repeatability of radiomic feature groups.
Segmentation Method
Radiomic features were less repeatable when extracted from VOIWT segmentations than when extracted from VOI40 segmentations (Tables 4 and 5). Using a CCC cutoff of 0.80, 56.2% of VOIWT features and 65.5% of VOI40 features were repeatable when extracted from PETPSF images. When extracted from PETOSEM images, 68.6% of VOIWT features and 89.9% of VOI40 features were repeatable. Shape, GLCM, GLRLM, and NGLDM feature groups extracted from PETPSF images had significantly lower CCC values (all P < 0.015) when extracted from VOIWT than when extracted from VOI40. Nonstandard-intensity, GLCM, GLRLM, NGLDM, and GLZLM features extracted from PETOSEM images had significantly lower CCC values (all P < 0.023) when extracted from VOIWT segmentations than when extracted from VOI40 segmentations.
TABLE 4. Groupwise Comparison of Repeatability of Radiomic Features Extracted from PETOSEM Images Using 2 Different Segmentation Methods
TABLE 5. Groupwise Comparison of Repeatability of Radiomic Features Extracted from PETPSF Images Using 2 Different Segmentation Methods
Spatial Resampling
The repeatability of most radiomic features was robust against spatial resampling changes, with 78.5% of features showing less than a 5% relative difference in CCC value between resampling methods (Figs. 4C and 5C). Features extracted from PETPSF images were more sensitive to spatial resampling changes than were features extracted from PETOSEM images (Tables 6 and 7). When extracted from VOIWT, GLRLM features showed greater repeatability after isotropic resampling (P < 0.045). The opposite was true when extracted from VOI40, with GLRLM features showing greater repeatability before spatial resampling (P < 0.005). Few radiomic feature groups (21%) were found to have mean CCC values that differed by more than 0.03 before and after spatial resampling.
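The percentage of features whose CCC changes by less than 5% between two methods can be computed as sketched below. The text does not state how the relative difference was normalized, so dividing by the mean of the two CCC values is an assumption, as are the function names and the example CCC pairs.

```python
def relative_difference_percent(ccc_a, ccc_b):
    """Relative difference between two CCC values, in percent.

    Assumption: the difference is normalized by the mean of the pair.
    """
    mean = (ccc_a + ccc_b) / 2
    if mean == 0:
        raise ValueError("relative difference is undefined for a zero mean")
    return abs(ccc_a - ccc_b) / abs(mean) * 100

def fraction_within(ccc_pairs, tolerance_percent=5.0):
    """Fraction of pairs whose relative difference is below the tolerance."""
    within = sum(
        1 for a, b in ccc_pairs
        if relative_difference_percent(a, b) < tolerance_percent
    )
    return within / len(ccc_pairs)

# Hypothetical (no resampling, 3-mm isotropic) CCC pairs for four features.
ccc_pairs = [(0.90, 0.92), (0.85, 0.84), (0.60, 0.75), (0.97, 0.96)]
share_robust = fraction_within(ccc_pairs)
```

In this toy example only the third feature, whose CCC moves from 0.60 to 0.75, exceeds the 5% tolerance.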
TABLE 6. Groupwise Comparison of Repeatability of Radiomic Features Extracted from PETOSEM Images Using 2 Different Spatial Resampling Methods
TABLE 7. Groupwise Comparison of Repeatability of Radiomic Features Extracted from PETPSF Images Using 2 Different Spatial Resampling Methods
SUV Discretization
The repeatability of most radiomic features showed little sensitivity to changes in SUV discretization (Figs. 4D and 5D). Most (60.3%) textural feature CCC values varied by less than 5% among bin number groups. GLCM feature repeatability was largely insensitive to SUV discretization changes, with 80.2% of CCC values varying by less than 5% within SUV bin groups. GLZLM features were considerably impacted by changes in the number of SUV bins, with 69.3% of CCC values varying by 5% or more (and by as much as 228.6%) across SUV bin groups.
Lesion Volume Analysis
The absolute relative difference between test–retest radiomic values was correlated with mean test–retest metabolic tumor volume to assess the influence of lesion volume on feature repeatability. After controlling for multiple comparisons, we found no significant correlations between metabolic tumor volume and absolute relative difference in feature value (Supplemental Tables 4 and 5).
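A minimal sketch of this lesion volume analysis follows. Several details are assumptions rather than the authors' method: the absolute relative difference is normalized by the test–retest mean, a Pearson coefficient is used because the type of correlation is not stated, and all values are invented for illustration.

```python
import math

def absolute_relative_difference(test, retest):
    """|test - retest| divided by the mean of the two values (assumed form)."""
    mean = (test + retest) / 2
    if mean == 0:
        raise ValueError("undefined when the test-retest mean is zero")
    return abs(test - retest) / abs(mean)

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and have at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical feature values and metabolic tumor volumes (mL) in 4 patients.
feature_test = [10.0, 20.0, 15.0, 8.0]
feature_retest = [12.0, 19.0, 15.5, 10.0]
mtv_test = [12.0, 55.0, 30.0, 9.0]
mtv_retest = [14.0, 53.0, 31.0, 10.0]

ard = [absolute_relative_difference(a, b)
       for a, b in zip(feature_test, feature_retest)]
mean_mtv = [(a + b) / 2 for a, b in zip(mtv_test, mtv_retest)]
r = pearson_r(mean_mtv, ard)
```

One such coefficient would be computed per feature, which is why a correction for multiple comparisons is needed before interpreting any individual correlation.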
DISCUSSION
The increasing examination of alternative imaging biomarkers such as radiomic features requires an understanding of their test–retest repeatability. The calculation of these features involves several steps, including image reconstruction, segmentation, and preprocessing, all of which have been shown to impact feature stability (15). In the current study, patients with cervical cancer underwent double baseline 18F-FDG PET/CT studies. PET images were reconstructed using 2 different methods, and tumors were delineated manually and with a semiautomated technique. Radiomic features were then extracted following various image preprocessing methods.
Features calculated using run length and zone length matrices were found to have low repeatability, whereas shape, GLCM, and NGLDM features were consistently repeatable. The repeatability of most GLRLM and GLZLM features was quite sensitive to changes in any step of the radiomic extraction process. Tixier et al., using images of esophageal tumors, similarly concluded that GLCM features were repeatable and that GLZLM features were unrepeatable (20). In another study, GLZLM features also showed poor repeatability when extracted from PET images of lung cancer using PETOSEM reconstruction, 50% isocontour segmentation, and isotropic spatial resampling (21). In our study, GLZLM features were mostly unrepeatable, but with PETOSEM reconstruction and VOI40 segmentation, 89.4% of GLZLM CCC values were designated as repeatable.
Two feature groups, GLRLM and GLZLM, were not homogeneous in terms of repeatability within each group. Gray-level emphasis and gray-level run emphasis features within the GLRLM group showed poor repeatability. However, the remaining GLRLM features (nonuniformity, run percentage, and run-emphasis calculations) were highly repeatable. Likewise, most GLZLM features were unrepeatable and highly sensitive to changes in image reconstruction, segmentation, and preprocessing methods. However, GLZLM nonuniformity and run percentage features were consistently repeatable.
GLCM features were consistently repeatable in this study—a finding that has also been described in other work (20–22). Entropy, in particular, has consistently been found to be reproducible and repeatable, as well as a significant predictor of patient response (7,23,24). In our study, entropy was consistently repeatable, though somewhat sensitive to SUV discretization. When SUV was discretized to 32 bins, mean entropy CCC was 0.90, but when SUV was discretized to 64 or 128 bins, mean entropy CCC increased to 0.97.
The PET reconstruction methods used here had a substantial impact on repeatability. This impact was intensified by segmentation method, as shown in Figure 4. Features that were otherwise mostly unrepeatable showed mostly high repeatability when PETOSEM and VOI40 were used. Yan et al. found that 5%–56% of textural features showed a large variation between values (>20%) when reconstruction settings were varied (25). Reconstruction method was also found by Gallivanone et al. to have a strong impact on the stability of radiomic features (26). In that study, test–retest PET images of an anthropomorphic phantom were acquired and reconstructed using PETOSEM, PETPSF, and PETPSF with time of flight. The authors concluded that fewer than 20% of radiomic features were robust against changes in reconstruction method.
Numerous segmentation methods have been combined with radiomic analyses. One study segmented phantom ROIs using an adaptive Bayesian method or a 60% isocontour and found that fewer than 20% of radiomic features were stable between these segmentations (26). Using VOI40 significantly improved the repeatability of most radiomic features in our study, especially when applied to images reconstructed without point-spread function. This improvement was particularly dramatic in some features otherwise found to be unrepeatable. The median CCC values of GLRLM and GLZLM features rose above the repeatability cutoff only when using VOI40, and all features within both groups were repeatable when also extracted from PETOSEM images. Automated or semiautomated segmentation approaches help to prevent the selection of normal tissue near tumor edges but often underestimate volume and may not capture regions of necrosis. Although this study showed a significant improvement in repeatability with the implementation of an SUV threshold, the utility of textural analyses may suffer when using such methods. Though the threshold applied in this study was based on its previous use in this patient population, other segmentation methods (including fixed or adaptive approaches) may further improve radiomic repeatability and should be evaluated.
Interpolation to isotropic voxel sizes has routinely been used in radiomic analyses (27,28). Since many features are sensitive to changes in voxel size, isotropic resampling affects the radiomic feature value and, therefore, reproducibility (29,30). However, the effect of voxel size resampling on radiomic feature repeatability has not been previously studied, and a standard approach to spatial resampling in the context of radiomic analysis has not been described. In the current study, voxels either were not resampled or were upsampled from 4.07 × 4.07 × 5.00 mm to 3-mm isotropic voxels. Upsampling requires information inference through interpolation, whereas downsampling involves information loss. Here, most features showed similar repeatability between resampling methods, though GLRLM and GLZLM features showed low mean CCC values and high differences in CCC value between resampling methods.
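Whether a resampling step adds or removes voxels follows directly from the voxel spacings. The sketch below works through the arithmetic for the spacings reported in this study; the image matrix size and the round-up convention for the new grid dimensions are assumptions made only for illustration.

```python
import math

def resampled_shape(shape, spacing_mm, new_spacing_mm):
    """Grid dimensions after resampling to a new voxel spacing.

    Assumed convention: each dimension is rounded up so that the new grid
    covers at least the original physical extent.
    """
    return tuple(
        max(1, math.ceil(n * old / new))
        for n, old, new in zip(shape, spacing_mm, new_spacing_mm)
    )

native_spacing = (4.07, 4.07, 5.00)   # mm, as reported in the Methods
target_spacing = (3.0, 3.0, 3.0)      # mm, isotropic
native_shape = (168, 168, 43)         # hypothetical image matrix

new_shape = resampled_shape(native_shape, native_spacing, target_spacing)

voxels_before = math.prod(native_shape)
voxels_after = math.prod(new_shape)

# Volume of one voxel before and after resampling, in cubic millimeters.
native_voxel_mm3 = math.prod(native_spacing)   # about 82.8 mm^3
target_voxel_mm3 = math.prod(target_spacing)   # 27 mm^3
```

Because each 3-mm voxel is roughly one third the volume of a native voxel, the resampled grid contains about 3 times as many voxels, and the additional values must be inferred by interpolation.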
Resampling of image intensity values is a crucial step in radiomic analysis. Downsampling from a nearly unlimited set of intensity values to a discrete number of bins effectively reduces image noise and allows for comparison of radiomic values between image datasets. SUV discretization has consistently been shown to have a significant impact on radiomic feature values (7,23). Though the feature value may fluctuate with varying bin numbers or sizes, consistent repeatability of a discretization method may still allow for effective comparison between repeated evaluations. We found most feature groups to be consistently repeatable across the SUV bin numbers tested, with GLCM features performing particularly well and GLZLM features performing quite poorly. On the basis of the current study and other recent radiomic repeatability studies, GLZLM features may not be useful for test–retest assessments (7,31). In this study, we explored varying a fixed number of SUV bins. Another approach is to vary the size of the bins, and future studies should explore the repeatability of this alternative method.
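The alternative mentioned above, discretizing with a fixed bin size rather than a fixed number of bins, can be sketched as follows. This approach was not evaluated in the study; the function name, the 0.5 g/mL bin width, the lower bound of 0, and the SUVs are all hypothetical.

```python
import math

def discretize_fixed_bin_size(suv_values, bin_width, lo=0.0):
    """Assign each SUV to a bin of constant width, counting from `lo`.

    Assumed convention: bin = floor((v - lo) / bin_width) + 1.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    return [math.floor((v - lo) / bin_width) + 1 for v in suv_values]

# Hypothetical SUVs (g/mL) binned with a 0.5 g/mL bin width.
suvs = [2.0, 4.6, 9.9]
bins = discretize_fixed_bin_size(suvs, bin_width=0.5)
```

With a fixed bin size, a given SUV always falls in the same bin regardless of the intensity range of the image, whereas with a fixed number of bins the bin width scales with that range.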
This study was limited by a small sample size, though the number of patients included was similar to that of other published radiomic repeatability studies (20,21). Future studies are needed to determine whether these results generalize to radiomic features extracted from cervical tumors more broadly. When performing this study, we attempted to adhere to the Imaging Biomarker Standardization Initiative guidelines, which only recently became available. Performing additional repeatability studies of other cancer types using the methods reported here, and guided by the Imaging Biomarker Standardization Initiative recommendations, may be informative. Repeatability thresholds used in this study, as in test–retest studies generally, are somewhat arbitrary and intended to illustrate only which features may be sensitive to changes in radiomic and image analysis methods. Additional sources of radiomic feature variability should also be explored, since any of the various factors that impact SUV repeatability could, in principle, affect radiomic feature stability.
CONCLUSION
Certain radiomic feature groups (shape, GLCM, and NGLDM) were repeatable, whereas others (GLRLM and GLZLM) were highly variable. Using a fixed-threshold segmentation method increased the repeatability of most features. Most often, changes in voxel size or SUV discretization parameters resulted in relatively small differences in feature value, though several features were highly sensitive to these changes.
DISCLOSURE
This project was supported by NIH grants U01CA140204, 5P30CA006973, and 1R01CA190299. No other potential conflict of interest relevant to this article was reported.
KEY POINTS
QUESTION: Are radiomic features extracted from PET images of cervical cancer repeatable?
PERTINENT FINDINGS: Using certain image preprocessing techniques, many radiomic features (especially shape, GLCM, and NGLDM) are stable and repeatable. The impact of PET reconstruction and segmentation methods on radiomic feature repeatability can be substantial.
IMPLICATIONS FOR PATIENT CARE: Certain radiomic features are repeatable and may be useful in the management of cervical cancer patients.
Footnotes
- Received for publication April 23, 2020.
- Accepted for publication September 2, 2020.
- Published online Oct. 2, 2020.
- © 2021 by the Society of Nuclear Medicine and Molecular Imaging.