Abstract
The sensitivity of radiomic features to several confounding factors, such as reconstruction settings, makes clinical use challenging. To investigate the impact of harmonized image reconstructions on feature consistency, a multicenter phantom study was performed using 3-dimensionally printed phantom inserts reflecting realistic tumor shapes and heterogeneous uptake. Methods: Tumors extracted from real PET/CT scans of patients with non–small cell lung cancer served as models for three 3-dimensionally printed inserts. Different heterogeneity patterns were realized by printing separate compartments that could be filled with different activity solutions. The inserts were placed in the National Electrical Manufacturers Association image-quality phantom and scanned several times. First, a list-mode scan was acquired and 5 statistically equal replicates were reconstructed. Second, the phantom was scanned 4 times on the same scanner. Third, the phantom was scanned on 6 PET/CT systems. All images were reconstructed using EANM Research Ltd. (EARL)–compliant and locally clinically preferred reconstructions. EARL-compliant reconstructions were performed without (EARL1) or with (EARL2) point-spread function. Images were analyzed with and without resampling to 2-mm cubic voxels. Images were discretized with a fixed bin width (FBW) of 0.25 and a fixed bin number (FBN) of 64. The intraclass correlation coefficient (ICC) of each scan setup was calculated and compared across reconstruction settings. An ICC above 0.75 was regarded as high. Results: The percentage of features yielding a high ICC was largest for the statistically equal replicates (70%–91% for FBN; 90%–96% for FBW discretization). For scans acquired on the same system, the percentage decreased, but most features still resulted in a high ICC (FBN, 52%–63%; FBW, 75%–85%). The percentage of features yielding a high ICC decreased more in the multicenter setting.
In this case, the percentage of features yielding a high ICC was larger for images reconstructed with EARL-compliant reconstructions: for example, 40% for EARL1 and 60% for EARL2 versus 21% for the clinically preferred setting for FBW discretization. When discretized with FBW and resampled to isotropic voxels, this benefit was more pronounced. Conclusion: EARL-compliant reconstructions harmonize a wide range of radiomic features. FBW discretization and resampling to isotropic voxels enhance the benefits of EARL-compliant reconstructions.
Personalized cancer treatment is one of the main promises of modern medicine. Analyzing the combinations of patient genetics and tumor phenotype in medical images can provide additional information on treatment response and diagnosis and therefore has the potential to help in clinical decision making (1). One part of this approach is the rapidly growing field of radiomics, which aims to extract a large number of feature values from medical images describing tumor phenotype and tumor inter- and intraheterogeneity (2–4). In PET/CT images, radiomics has shown promising results in the assessment of treatment response and patient survival for several cancer types, such as head-and-neck or lung cancer (5,6).
Besides these positive results, many studies reported on the limitations and challenges of radiomics, including the sensitivity of feature values to differences in reconstruction algorithm, voxel size, smoothing, and discretization method (7–9). To make radiomic studies comparable over patients, institutions, and scanners, it is essential that radiomic features be harmonized across centers. The European Association of Nuclear Medicine (EANM) attempts to reduce this variability of measurements in multicenter clinical trials through its EANM Research Ltd. (EARL) accreditation program (10). The program harmonizes basic SUV features based on the SUVmax, SUVmean, and SUVpeak by comparing phantom scans of the National Electrical Manufacturers Association (NEMA) NU2-2012 image-quality phantom. To this end, centers choose 1 reconstruction setting that is in line with the standards provided by EARL and uses an iterative reconstruction algorithm (EARL1). It has been shown that reconstructions including resolution modeling (based on the point-spread function [PSF]) can be used to harmonize PET/CT systems (EARL2) (11). In addition to the EARL-compliant reconstructions, every center usually also applies 1 reconstruction with settings leading to optimal lesion detection, which is used for clinical reads. As illustrated in Figure 1, the quality of a PET/CT image differs across these 3 reconstruction settings, which therefore have a high impact on the extracted radiomic features (Table 1).
In patient with non–small cell lung cancer, Biograph Vision PET scan reconstructed with EARL2, EARL1, and clinically preferred reconstruction (from left to right).
The EARL harmonization is based on basic SUV features. To the best of our knowledge, no multicenter experimental study has yet investigated the effect of EARL harmonization on the variability of complex radiomic features. For this purpose, 1 object that reflects realistic heterogeneity uptake has to be scanned at multiple centers, and the feature values across centers have to be compared. Commercially available phantoms such as the NEMA image-quality phantom are not optimal, as they contain only spheric and homogeneous-uptake objects. Therefore, in this study, 3-dimensionally printed phantom inserts were designed and built according to tumors extracted from typical PET scans and reflecting more realistic uptake distributions than seen with spheres. These inserts were scanned at 3 institutions on 6 different PET/CT systems. Feature values were extracted from EARL-compliant (EARL1 and EARL2) and local clinically preferred reconstructions. The reliability, repeatability, and reproducibility of radiomic features were reported.
MATERIALS AND METHODS
Phantom Design and 3-Dimensional Printing
Three 3-dimensionally printed phantom inserts were used in this study. PET scans of patients with non–small cell lung cancer served as models for the inserts. For this purpose, several non–small cell lung cancer tumors showing various heterogeneity uptake patterns were visually checked. Three tumors with different shapes and uptake characteristics were selected as models for the 3-dimensional printing. These tumors were segmented, slightly smoothed, scaled, and converted to a stereolithography file to make the printing possible. Differences in heterogeneity uptake were realized by printing 2 separate compartments that could be filled with different activity solutions. The heterogeneity uptake patterns include a homogeneous tumor (tumor 1), a tumor with heterogeneity uptake in the sagittal view (tumor 2), and a tumor with a necrotic core (tumor 3). The sizes of the inserts are displayed in Table 2. The printing was performed by a Form 2 printer (Formlabs Inc.), which relies on a stereolithography technique to cure its photopolymeric clear resin (FLGPCL02; Formlabs Inc.). A picture of the 3-dimensional inserts and the corresponding tumors is displayed in Figure 2. The inserts were placed at equal distances in the NEMA NU-2 image-quality phantom. The feature values of the phantom inserts were verified to be within the range of radiomic feature values extracted from 10 18F-FDG PET/CT studies of non–small cell lung cancer patients (12). More than 82% of the features are well within the clinically expected range, and only 1.6% show a large variation from the clinical data. Therefore, the inserts generate feature values that are representative of clinical data.
Size of 3-Dimensionally Printed Inserts
(Top) PET/CT images of original tumor (left) and phantom insert (right) for tumors 1, 2, and 3 (from left to right). (Bottom) Corresponding stereolithographed models with tumor-to-background ratio (TBR).
Phantom Scans
To obtain features comparable across institutions and PET/CT systems, only features that are reliable, repeatable, and reproducible should be used. Reliable features are defined as those yielding only marginal differences when extracted from images obtained under exactly the same conditions, and repeatable features are features that result in small differences when extracted from various scans of the same subject. Reproducibility refers to features that remain almost the same when acquired using different PET/CT systems, image acquisition settings, and reconstruction settings.
To measure reliability, the NEMA image-quality phantom containing the inserts was scanned once on a Biograph mCT64 (Siemens Healthcare). The scan was acquired in list mode, and 5 statistical replicates of 60 s were reconstructed. Three different reconstruction settings were applied: An EARL-compliant reconstruction (EARL1, time of flight [TOF] with gaussian smoothing of 5 mm in full width at half maximum), an EARL-compliant reconstruction including PSF (EARL2, PSF + TOF with gaussian smoothing of 5 mm in full width at half maximum), and the clinically preferred setting of this institution (PSF + TOF with gaussian smoothing of 7 mm in full width at half maximum). The homogeneous insert, the outer part of the necrotic core, and the lower part of the third insert were filled with an activity solution that achieved a tumor-to-background ratio of around 10:1. The upper part of the third tumor was filled with an activity solution leading to a tumor-to-background ratio of 5:1, and the necrotic core of the tumor and spheres were filled with water (Fig. 2). The 5 statistically equal replicates represent an ideal situation because the 5 images differ only in noise pattern.
To measure repeatability, the phantom was scanned 4 times on the same system (Biograph mCT64) independently. That is, for every scan, the phantom was filled with an activity solution and placed at a slightly different position in the scanner. To compensate for differences in phantom filling, the scan duration was adjusted so that statistically equal replicates were obtained. The exact amount of activity in tumors, spheres, and background is listed in Table 3 for each scan. Images were reconstructed using the same reconstruction settings as described above. For every scan, the inserts were delineated separately, which could lead to slightly different delineations. Therefore, this scenario reflects a more realistic clinical setup.
Activity in Phantom Background and Tumor Inserts for 4 Scans Acquired on Same Scanner and Multicenter Setting
Furthermore, a multicenter study was performed to measure reproducibility. The inserts were scanned at 3 institutions on 6 PET/CT systems including 4 manufactured by Siemens Healthcare (Biograph mCT40, Biograph mCT64, Horizon with an extra ring of detectors [TrueV option], and Biograph Vision), 1 by Philips Healthcare (Vereos), and 1 by GE Healthcare (Discovery MI 4 ring). The data were reconstructed with a clinically relevant scan duration of 60 s. The scan duration was adjusted for differences in phantom filling across centers. Table 3 lists the phantom fillings for each scan. Also, images were reconstructed using the scanner-defined reconstruction settings complying with the EANM standards (EARL1 and EARL2), as well as using the locally clinically preferred settings of each institution. The applied reconstruction algorithm, matrix size, and smoothing kernel for the reconstructed images are listed in Table 4. The inserts were segmented separately for each scan.
Applied Reconstruction Algorithm, Matrix Size, and Smoothing Factor for Each Scanner
PET Analysis
Segmentations were performed with in-house–developed software for the analysis and segmentation of PET images. Segmentations were done manually on the low-dose CT portion of each scan.
In-house–developed software for the calculation of radiomic features programmed in C++ was used for feature calculation (13). All calculated feature values follow the definitions of the Image Biomarker Standardization Initiative and have been tested to be in compliance with the available benchmarks (14). In total, 436 radiomic features were extracted. Before feature calculation, the images were converted to SUVs so that the phantom background had an SUVmean of 1. Features were calculated for images consisting of the original voxel size, as well as for images resampled to 2-mm cubic voxels as recommended (15). Image and binary segmentation masks were resampled using trilinear interpolation. Before the extraction of textural features, images were discretized using a fixed bin number (FBN) of 64 and a fixed bin width (FBW) of 0.25.
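The two discretization schemes named above can be illustrated with a minimal sketch. It is not the in-house C++ software used in the study. The function names are hypothetical, the bin formulas follow the general fixed-bin-number and fixed-bin-width definitions, and the lowest bin edge of SUV 0 for FBW is an assumption.

```python
import math


def normalize_to_background(values, background_values):
    """Scale intensities so that the background has a mean of 1 (SUVmean = 1)."""
    scale = sum(background_values) / len(background_values)
    return [v / scale for v in values]


def discretize_fbn(values, n_bins=64):
    """Fixed bin number: spread the ROI intensity range over bins 1..n_bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A perfectly uniform ROI collapses into a single bin.
        return [1 for _ in values]
    bins = []
    for v in values:
        b = math.floor(n_bins * (v - lo) / (hi - lo)) + 1
        bins.append(min(b, n_bins))  # the ROI maximum belongs to the last bin
    return bins


def discretize_fbw(values, bin_width=0.25, lowest=0.0):
    """Fixed bin width: bins of constant SUV width, starting at `lowest`.

    `lowest=0.0` is an assumption made for this sketch.
    """
    return [math.floor((v - lowest) / bin_width) + 1 for v in values]


# Hypothetical SUVs inside one region of interest.
roi_suv = [0.0, 0.1, 0.25, 1.0, 5.0, 10.0]
fbn_bins = discretize_fbn(roi_suv, n_bins=64)
fbw_bins = discretize_fbw(roi_suv, bin_width=0.25)
```

With FBN, the bin assigned to a voxel depends on the minimum and maximum of the region, whereas with FBW it depends only on the absolute SUV, so the same SUV always falls into the same bin.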
Statistical Analysis
Data analysis was performed with Python, version 3.6.3, using the packages NumPy and SciPy, with Matplotlib (16) for figure plotting. Statistical analysis was performed using R within the Python environment with the Python-R interface rpy2.
Feature Reliability, Repeatability, and Reproducibility
To measure feature consistency (i.e., reliability, repeatability, and reproducibility) for the 3 different scan setups, the intraclass correlation coefficient (ICC) was calculated using the irr package (version 0.84), available from the Comprehensive R Archive Network (http://www.r-project.org). A 2-way single-measure model was used to evaluate the consistency of features for all scans. Every 3-dimensionally printed insert was regarded as a tumor in a patient, and each scan was regarded as 1 observer. The ICC is defined as the ratio of the intercluster variability to the sum of the intercluster and intracluster variability. Therefore, ICCs vary from 0 to 1, with 1 representing perfect agreement. Furthermore, a high ICC implies that the intracluster variability is low when compared with the intercluster variability, indicating that a feature with a high ICC can distinguish well between inserts. An ICC higher than 0.9 is regarded as excellent; values between 0.75 and 0.9, between 0.6 and 0.75, and below 0.6 are regarded as good, moderate, and poor, respectively (17).
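As an illustration of the calculation described above, the following sketch computes a 2-way single-measure ICC of the consistency type from mean squares, with one row per insert and one column per scan. It is a stand-in written in plain Python and not the irr routine used in the study. The choice of the consistency formula, the function names, and the handling of values that fall exactly on a category boundary are assumptions.

```python
def icc_two_way_consistency(table):
    """Two-way, single-measure, consistency ICC.

    `table` holds one row per subject (insert) and one column per
    observer (scan). ICC = (MSR - MSE) / (MSR + (k - 1) * MSE).
    """
    n, k = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    col_means = [sum(row[j] for row in table) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    denominator = ms_rows + (k - 1) * ms_error
    if denominator == 0:
        raise ValueError("ICC is undefined when all values are identical")
    return (ms_rows - ms_error) / denominator


def classify_icc(icc):
    """Map an ICC to the categories used in the text (boundary handling assumed)."""
    if icc > 0.9:
        return "excellent"
    if icc >= 0.75:
        return "good"
    if icc >= 0.6:
        return "moderate"
    return "poor"


# Hypothetical feature values: 3 inserts (rows) measured in 3 scans (columns).
# Each scan only adds a constant offset, so the ranking of inserts is preserved.
offset_only = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
icc_offset = icc_two_way_consistency(offset_only)
```

In this toy table the consistency ICC equals 1 because a constant offset per scan is absorbed by the column term. A sample estimate computed this way can also fall below 0 when the variation within inserts exceeds the variation between them.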
ICCs were compared between reconstruction settings, discretization methods, and original versus resampled data using a nonparametric permutation test. A permutation test compares 2 groups by checking differences in test statistics for the groups. The test swaps elements between the 2 groups across the possible combinations and recomputes the statistic each time. If the statistic does not change appreciably after swapping, the null hypothesis cannot be rejected. All P values below 0.01 were considered statistically significant. A Benjamini–Hochberg procedure with a false discovery rate of 0.25 was performed to diminish the chance of a type I error for multiple comparisons. The permutation test was performed using the R package perm (version 1.0-0.0) for each feature group separately.
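The two statistical steps described above can be sketched as follows. This is a plain-Python illustration and not the perm package used in the study. The test statistic (absolute difference in group means), the Monte Carlo resampling in place of a full enumeration, the function names, and the example ICC values are all assumptions.

```python
import random


def permutation_p_value(group_a, group_b, n_resamples=5000, seed=0):
    """Two-sided Monte Carlo permutation test on the difference in means."""
    rng = random.Random(seed)
    n_a = len(group_a)

    def abs_mean_diff(values):
        first, second = values[:n_a], values[n_a:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    pooled = list(group_a) + list(group_b)
    observed = abs_mean_diff(pooled)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random reassignment of values to the 2 groups
        if abs_mean_diff(pooled) >= observed - 1e-12:
            hits += 1
    # Counting the observed arrangement keeps the P value above 0.
    return (hits + 1) / (n_resamples + 1)


def benjamini_hochberg(p_values, fdr=0.25):
    """Return one rejection flag per P value, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            largest_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_rank:
            rejected[idx] = True
    return rejected


# Hypothetical ICCs of one feature group under 2 reconstruction settings.
icc_setting_1 = [0.90, 0.92, 0.95, 0.91, 0.93, 0.94]
icc_setting_2 = [0.40, 0.50, 0.45, 0.42, 0.48, 0.47]
p_value = permutation_p_value(icc_setting_1, icc_setting_2)
flags = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6], fdr=0.25)
```

The Benjamini–Hochberg step rejects every hypothesis up to the largest rank whose P value lies below rank/m times the chosen false discovery rate.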
RESULTS
All calculated radiomic features are listed in Supplemental Files 1, 2, and 3 (for EARL1, EARL2, and clinical reconstructions, respectively; supplemental materials are available at http://jnm.snmjournals.org), including their ICCs for each reconstruction setting and discretization method.
Figure 3 displays the percentage of features resulting in an excellent, good, moderate, or bad ICC sorted by feature groups for the statistically equal replicates and both discretization methods. The total percentage of excellent, good, and moderate ICCs was comparable across all reconstruction settings, with the highest values being for FBW discretization (96.7% for EARL1, 97.4% for EARL2, and 97.9% for the clinically preferred setting vs. 83.2%, 94.2%, and 94.7%, respectively, for FBN discretization) (Supplemental Table 1). The EARL1 setting yielded the lowest percentage of features with an excellent ICC. When the feature groups were compared, the differences in ICCs were significant only for gray-level run-length matrix features (P < 0.01). A discretization with FBW resulted in more reliable features than FBN discretization, but the ICCs resulted in significant differences only for gray-level cooccurrence matrix features. Resampling to cubic voxels had almost no effect on reliability, although it led to a slight increase in the number of reliable features (Supplemental Fig. 1) with no significant differences in ICCs.
Percentage of features extracted from 5 statistically equal replicates yielding excellent, good, moderate, or bad ICC for FBN and FBW discretization for different feature groups. GLCM = gray-level cooccurrence matrix; GLRLM = gray-level run-length matrix; NGLDM = neighboring gray-level dependence matrix; GLSZM = gray-level size-zone matrix; GLDZM = gray-level distance-zone matrix; NGTDM = neighboring gray-tone difference matrix; Stat = intensity-based statistics; Morph = morphology; LocInt = local intensity; IntHist = intensity histogram; IntVol = intensity volume.
By comparison, the percentages of features yielding excellent, good, moderate, or bad ICCs for the 4 scans acquired on the same system are displayed in Figure 4. The number of features yielding an excellent ICC decreased when compared with the 5 statistically equal replicates. However, most features still resulted in a good or moderate ICC. Also, discretization with FBW led to the highest percentage of features with a moderate or better ICC (87.8% for EARL1, 90.3% for EARL2, and 91.8% for the clinically preferred reconstruction vs. 78.2%, 82.1%, and 77.1%, respectively, for FBN discretization), a slight increase after resampling (Supplemental Table 2), and significant differences for gray-level cooccurrence matrix features (P < 0.01). The differences between clinically preferred and EARL-compliant reconstructions also were not significant, but the clinically preferred reconstruction yielded the highest percentage, and the EARL1 setting the lowest percentage, of repeatable features. The only feature group whose features were less repeatable after resampling was the group of morphologic features (Supplemental Fig. 2).
Percentage of features extracted from 4 scans acquired on same PET/CT system yielding excellent, good, moderate, or bad ICC for FBN and FBW discretization. GLCM = gray-level cooccurrence matrix; GLRLM = gray-level run-length matrix; NGLDM = neighboring gray-level dependence matrix; GLSZM = gray-level size-zone matrix; GLDZM = gray-level distance-zone matrix; NGTDM = neighboring gray-tone difference matrix; Stat = intensity-based statistics; Morph = morphology; LocInt = local intensity; IntHist = intensity histogram; IntVol = intensity volume.
In the multicenter setting, the percentage of features yielding a moderate or better ICC was low when compared with the other scan settings (Fig. 5). Also, discretization with FBW led to the largest percentage of features with an ICC higher than 0.6 (71.7% for EARL1, 84.9% for EARL2, and 32.3% for the clinically preferred setting vs. 49.3%, 49.5%, and 38%, respectively, for FBN discretization). Significant differences in ICCs between the 2 discretization methods were found only for the EARL-compliant reconstructions and some textural feature groups (gray-level cooccurrence matrix and gray-level run-length matrix features for both EARL-compliant reconstructions, neighboring gray-level dependence matrix and gray-level size-zone matrix for EARL2). For discretization with FBN, only small and nonsignificant discrepancies could be observed between the reconstruction settings. However, for FBW discretization, the difference between EARL-compliant reconstructions and clinically preferred reconstructions led to significant differences for most textural feature groups. In the multicenter setting, the local clinically preferred reconstructions differed substantially between sites and scanners, whereas this was not the case in the single-scanner experiments described. Significant differences in ICCs between EARL1 and EARL2 were observed only for gray-level cooccurrence matrix features and gray-level run-length matrix features when discretized with FBW. A resampling to cubic voxels was beneficial, especially for textural feature groups, although the differences were not significant (Supplemental Fig. 3). In addition, the only feature group resulting in less reproducible features after resampling was the group of morphologic features, for which a significant difference was observed (Supplemental Table 3).
Percentage of features extracted from multicenter setting yielding excellent, good, moderate, or bad ICC for FBN and FBW discretization. GLCM = gray-level cooccurrence matrix; GLRLM = gray-level run-length matrix; NGLDM = neighboring gray-level dependence matrix; GLSZM = gray-level size-zone matrix; GLDZM = gray-level distance-zone matrix; NGTDM = neighboring gray-tone difference matrix; Stat = intensity-based statistics; Morph = morphology; LocInt = local intensity; IntHist = intensity histogram; IntVol = intensity volume.
DISCUSSION
To the best of our knowledge, this was the first multicenter and multivendor experimental study to investigate the impact of EARL-compliant reconstructions on the repeatability and reproducibility of radiomic features. Our results suggest that in a multicenter setting, the use of EARL-compliant reconstructions leads to a larger number of reproducible features. A reason might be that the clinically preferred reconstructions varied widely in spatial resolution and contrast recovery across PET/CT systems. Because radiomic features are sensitive to resolution and image noise, these variations could be the reason for a higher variation in radiomic features (18). This possibility is in line with the fact that differences in feature consistency between reconstruction settings were not visible in the 5 statistically equal replicates and the 4 scans acquired on the same scanner, for which the same local clinically preferred reconstruction was applied.
In the multicenter setting, EARL-compliant images yield comparable image quality. This might be the reason for the low differences in reliability, repeatability, and reproducibility for these 2 reconstruction settings. This result is in line with the findings of Kaalep et al., who reported that a harmonization of PET/CT systems using PSF reconstructions is feasible (11). Furthermore, our results support the findings of Lasnon et al., who showed that images reconstructed with PSF and in line with the EARL standard can be used for the harmonization of radiomic features (19).
Although EARL-compliant reconstructions yield similar contrast recoveries, the amount of smoothing for clinically preferred settings differed across PET/CT systems. The lower spatial resolution with EARL-compliant reconstructions seems to be beneficial in terms of repeatability and reproducibility but might also eliminate important heterogeneity information that is visible in some of the clinically preferred reconstructions. This effect is lower in the updated EARL standards (EARL2), which yield higher contrast recoveries and spatial resolution and are therefore preferred for future multicenter studies. One limitation of this study is that we do not report the accuracy of feature values. Because it was demonstrated before that radiomic features are biased as a function of acquisition parameters, image reconstruction settings, and noise (18,20,21), there is an urgent need for standardization of feature values to reduce the variability (in bias) of radiomic features across centers. Therefore, we focused on feature consistency and the feasibility of using existing harmonization procedures to improve the reproducibility of radiomic features. Nonetheless, because a high ICC also indicates that features can differentiate well between inserts, our results suggest that EARL-compliant reconstructions also result in more meaningful features, especially when using the EARL2 settings. This is in line with the findings of Aide et al., who showed that images reconstructed with higher-resolution reconstructions improved the characterization of breast tumors when compared with EARL1 (22).
Use of physical phantoms also has limitations, as the 3-dimensionally printed inserts reflect only 3 coarse heterogeneity patterns. However, they provide a more realistic scenario than publicly available phantoms containing only spheres. Furthermore, phantoms have the advantage of providing a more reproducible setting than patient scans, because the activity solution within the spheres and background can be matched closely across experiments performed in different institutions.
Moreover, our study confirms previous findings (on clinical datasets) such as the impact of image discretization on the reliability and repeatability of radiomic features. Previous studies reported better repeatability and less sensitivity to differences in delineations for FBW discretization (7,10,23). Furthermore, Orlhac et al. demonstrated that discretization with FBW led to more meaningful features—that is, features that can distinguish well between tumor types (23). Our results also confirm the benefit of discretization with FBW, as it resulted in more consistent features, especially for EARL-compliant reconstructions.
The impact of voxel size on radiomic feature values has also been studied before (24,25). Hatt et al. recommended the use of isotropic voxels with a voxel size of 2 mm (15). Our study supports this recommendation. Especially in the multicenter setting, resampling to cubic voxels led to better reproducibility of radiomic features. A possible explanation is that a common voxel size leads to more comparable features, because a large number of features are sensitive to differences in slice thickness and voxel size (26,27). The only feature group not benefiting from resampling was the group of morphologic features. This effect was observed only in the scan setups in which each scan was segmented separately. A possible reason is that the resampling of the tumor segmentation leads to different results depending on the initial position of the delineation in the image.
The impact of tumor delineation on the sensitivity of radiomic features was also reported previously (7,28,29). Our results confirm this finding, as the number of features yielding an excellent ICC decreased from the 5 statistically equal replicates to the 4 scans acquired on the same system (with repositioning and thus redefinition of tumor delineation). However, differences in the number of features resulting in a moderate or better ICC might also be caused by differences in phantom filling and phantom positioning. Mansor et al. demonstrated that basic SUV features (SUVmax, SUVpeak, and SUVmean) are affected by phantom repositioning (30), so it is likely that repositioning also affects more complex textural features. However, as patient repositioning and differences in tumor delineation across institutions are part of the general clinical workflow, it is questionable whether features highly sensitive to these changes are suitable for radiomic analysis in the clinic.
CONCLUSION
This study reports on the impact of EARL-compliant reconstructions on the reliability, repeatability, and reproducibility of radiomic features in comparison with clinically preferred reconstructions. Our results show that the use of EARL-compliant reconstructions is beneficial and leads to a larger number of reliable, repeatable, and reproducible features. Discretization with FBW and resampling to cubic 2-mm voxels increases the percentage of consistent features. The study suggests that EARL-compliant reconstructions should be used for radiomic analysis, especially in a multicenter setting. Use of the updated EARL2 standards is preferred because they have higher contrast recovery and spatial resolution while providing radiomic performance similar to the EARL1 standards (11).
DISCLOSURE
This work is part of the STRaTeGy research program (project 14929), which is (partly) financed by The Netherlands Organisation for Scientific Research (NWO). This study was financed by the POINTING project of the Dutch Cancer Society (grant 10034). No other potential conflict of interest relevant to this article was reported.
KEY POINTS
QUESTION: Which reconstruction algorithm leads to the most stable radiomic features in a multicenter and multivendor setting?
PERTINENT FINDINGS: Harmonized image reconstructions (EARL-compliant) led to a larger number of reliable, repeatable, and reproducible radiomic features. This effect increased when images were discretized with an FBW and resampled to isotropic voxels before feature extraction.
IMPLICATIONS FOR PATIENT CARE: To make radiomic features comparable across multiple centers, multicenter radiomic studies should be performed using harmonized (EARL-compliant) reconstructions, and images should be discretized using an FBW and resampled to isotropic voxels.
Acknowledgments
We thank Hinke Schokker and Johan R. de Jong for help with the phantom scans.
Footnotes
Received for publication April 11, 2019.
Accepted for publication July 24, 2019.
Published online Aug. 16, 2019.
© 2020 by the Society of Nuclear Medicine and Molecular Imaging.