Abstract
This study aimed to develop an analytic approach based on [18F]FDG PET radiomics using stacking ensemble learning to improve outcome prediction in diffuse large B-cell lymphoma (DLBCL). Methods: In total, 240 DLBCL patients from 2 medical centers were divided into the training set (n = 141), internal testing set (n = 61), and external testing set (n = 38). Radiomics features were extracted from pretreatment [18F]FDG PET scans at the patient level using 4 semiautomatic segmentation methods (SUV threshold of 2.5, SUV threshold of 4.0 [SUV4.0], 41% of SUVmax, and SUV threshold of mean liver uptake [PERCIST]). All extracted features were harmonized with the ComBat method. The intraclass correlation coefficient was used to evaluate the reliability of radiomics features extracted by different segmentation methods. Features from the most reliable segmentation method were selected by Pearson correlation coefficient analysis and the LASSO (least absolute shrinkage and selection operator) algorithm. A stacking ensemble learning approach was applied to build radiomics-only and combined clinical–radiomics models for prediction of 2-y progression-free survival and overall survival based on 4 machine learning classifiers (support vector machine, random forest, gradient boosting decision tree, and adaptive boosting). Confusion matrix, receiver-operating-characteristic curve analysis, and survival analysis were used to evaluate the model performance. Results: Among the 4 semiautomatic segmentation methods, SUV4.0 segmentation yielded the highest interobserver reliability, with 830 (66.7%) selected radiomics features. The combined model constructed by the stacking method achieved the best discrimination performance. For progression-free survival prediction in the external testing set, the area under the receiver-operating-characteristic curve and the accuracy of the stacking-based combined model were 0.771 and 0.789, respectively. For overall survival prediction, the stacking-based combined model achieved an area under the curve of 0.725 and an accuracy of 0.763 in the external testing set. The combined model also demonstrated a more distinct risk stratification than the International Prognostic Index in all sets (log-rank test, all P < 0.05). Conclusion: The combined model that incorporates [18F]FDG PET radiomics and clinical characteristics based on stacking ensemble learning could enable improved risk stratification in DLBCL.
Diffuse large B-cell lymphoma (DLBCL) is the most common subtype of aggressive non-Hodgkin lymphoma. Rituximab plus cyclophosphamide, doxorubicin, vincristine, and prednisone represents the current first-line treatment, which is effective in approximately 60%–70% of patients (1). Patients with refractory disease or relapse after initial treatment have a low probability of cure and dismal outcomes because of the modest response rates to salvage regimens (2). Therefore, early identification of these high-risk patients is essential for designing individualized therapeutic interventions. Current prognostic scoring systems, such as the International Prognostic Index (IPI) and the National Comprehensive Cancer Network–IPI, have been the basis for determining prognosis in DLBCL (3,4). However, these models are inaccurate in predicting refractory disease, possibly because they lack intratumoral metabolic and functional information.
[18F]FDG PET/CT, a type of molecular imaging and a means to “transpathology” (5), has been recommended for staging and response assessment in DLBCL (6,7). Quantitative parameters on PET/CT, particularly total metabolic tumor volume (TMTV) and total lesion glycolysis, are considered to have prognostic significance in DLBCL (8,9). These parameters may allow for the assessment of whole-body tumor burden but remain limited in their ability to characterize phenotypical profiles such as shape, morphology, spatial distribution, and heterogeneity across individual lesions. For PET/CT image analysis, radiomics has recently been proposed as a novel high-throughput, noninvasive approach that could quantify tumor phenotype at a microscale level via extracting thousands of imaging-derived features (10). With the assistance of artificial intelligence, such as machine learning, radiomics offers a promising tool for diagnosis, therapeutic response assessment, and outcome prediction in various tumor types (11), including DLBCL (12–16). Preliminary studies have suggested that the application of machine learning algorithms, such as LASSO (least absolute shrinkage and selection operator) regression (16), ridge regression (13), and random forest (17), may contribute to the improved radiomics feature selection and prognostic modeling in DLBCL. However, most of those studies focused on evaluating a single machine learning approach, whereas only a minority used cross combination of different machine learning algorithms (14) or adopted ensemble machine learning (15). Stacking, an ensemble approach that combines different base classifiers into 1 metaclassifier, has been suggested to provide optimized performance and simplicity (18). In the present study, we aimed to develop an analytic approach based on [18F]FDG PET radiomics using stacking ensemble learning to improve the outcome prediction in DLBCL.
MATERIALS AND METHODS
Study Population
We retrospectively enrolled 240 consecutive patients with newly diagnosed DLBCL at 2 medical centers, including 202 patients at center 1 (the Second Affiliated Hospital of Zhejiang University School of Medicine) and 38 patients at center 2 (the First Affiliated Hospital of Zhejiang Chinese Medical University). Detailed information about the study population is shown in the supplemental materials (available at http://jnm.snmjournals.org) (19,20). The flowchart of patient enrollment is shown in Supplemental Figure 1. This study was approved by the Institutional Review Board at each institution, and the requirement to obtain written informed consent was waived.
PET/CT Imaging Protocol
Image acquisition and reconstruction were in accordance with the guidelines of the European Association of Nuclear Medicine, version 2.0 (21). Patients fasted for at least 6 h and had a blood glucose level below 200 mg/dL before PET/CT examination. They were scanned at about 60 min after intravenous injection of [18F]FDG (3.70 MBq/kg). All PET images were corrected for attenuation using acquired low-dose CT data. Acquisitions differed between the 2 institutions in terms of PET/CT scanners, acquisition protocols, and reconstruction settings (Supplemental Table 1).
PET Image Segmentation and Feature Extraction
PET/CT images were reviewed by 2 independent nuclear medicine physicians, who were masked to patients’ clinical outcome. The volumes of interest were semiautomatically delineated using LIFEx software (version 6.30, https://www.lifexsoft.org/index.php) (22). Four different segmentation methods were applied to delineate lesions, including an SUV threshold of 2.5, an SUV threshold of 4.0 (SUV4.0), 41% of SUVmax, and SUVPERCIST (1.5 × liver SUVmean + 2 SDs) (21,23). SUV was calculated as (tissue radioactivity concentration [Bq/mL]) × (body weight [g])/(injected radioactivity [Bq]). According to the European Association of Nuclear Medicine guidelines, the liver SUVmean should be between 1.3 and 3.0 (21). Conventional PET parameters including SUVmax, SUVpeak, TMTV, and total lesion glycolysis of each patient were recorded. The distance between the largest lesion and the lesion farthest from that bulk was also recorded (16).
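The SUV definition and the 4 threshold rules above can be illustrated with a minimal Python sketch. It is not the LIFEx implementation: the function names, the use of the sample SD of liver reference voxels, and the inclusive (>=) comparison are assumptions made for illustration only.

```python
from statistics import mean, stdev


def suv(activity_bq_per_ml, body_weight_g, injected_activity_bq):
    """Body-weight SUV: tissue concentration x body weight / injected activity."""
    return activity_bq_per_ml * body_weight_g / injected_activity_bq


def segmentation_thresholds(lesion_suv_max, liver_suvs):
    """Return the SUV cutoff implied by each of the 4 segmentation methods."""
    liver_mean = mean(liver_suvs)
    liver_sd = stdev(liver_suvs)  # assumption: sample SD of liver reference voxels
    return {
        "SUV2.5": 2.5,
        "SUV4.0": 4.0,
        "41%SUVmax": 0.41 * lesion_suv_max,
        "PERCIST": 1.5 * liver_mean + 2 * liver_sd,
    }


def segment(voxel_suvs, threshold):
    """Flag voxels at or above the cutoff (inclusive comparison is an assumption)."""
    return [v >= threshold for v in voxel_suvs]


# Hypothetical 70-kg patient injected with 3.70 MBq/kg (259 MBq in total).
example_suv = suv(5000.0, 70_000.0, 259e6)
thresholds = segmentation_thresholds(lesion_suv_max=20.0,
                                     liver_suvs=[2.0, 2.2, 1.8, 2.0])
mask = segment([1.2, 3.9, 4.0, 9.5], thresholds["SUV4.0"])
```

In this toy case the liver SUVmean is 2.0, which lies inside the 1.3 to 3.0 range quoted from the guidelines, and the 41%-of-SUVmax cutoff (8.2) is far stricter than the fixed SUV4.0 cutoff for a lesion with an SUVmax of 20.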
Before feature extraction, all PET images were resampled to a voxel size of 3 × 3 × 3 mm using bilinear interpolation (24) and were discretized with a fixed bin size of 0.25 SUV (25). In total, 1,245 radiomics features were extracted from the entire segmented disease (patient level) via the open-source toolbox PyRadiomics (version 3.0.1) (16,26), consistent with the Image Biomarker Standardization Initiative (27). Detailed descriptions of the extracted features are presented in Supplemental Table 2. The radiomics workflow is shown in Figure 1.
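Fixed-bin-size discretization can be sketched as follows. The formula mirrors the fixed-bin-width convention described in the PyRadiomics documentation (bin index = floor(X / W) - floor(min(X) / W) + 1); the function name and the sample values are illustrative assumptions, and the actual toolbox operates on image arrays rather than Python lists.

```python
import math


def discretize_fixed_bin_width(suv_values, bin_width=0.25):
    """Map SUVs to integer gray levels using a fixed bin width (in SUV units).

    Bin edges are aligned to multiples of `bin_width`, and the lowest
    occupied bin is numbered 1.
    """
    lowest_bin = math.floor(min(suv_values) / bin_width)
    return [math.floor(v / bin_width) - lowest_bin + 1 for v in suv_values]


# Hypothetical voxel SUVs inside an SUV4.0 segmentation.
gray_levels = discretize_fixed_bin_width([4.0, 4.1, 4.3, 5.0, 12.6])
```

With a fixed bin size, the number of gray levels grows with the SUV range of the lesion, so the same SUV difference maps to the same gray-level difference in every patient.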
Feature Selection
The interobserver repeatability of radiomics features was evaluated using the intraclass correlation coefficient (ICC) in 100 randomly selected patients from center 1. Features with an ICC above 0.80 were considered robust and retained for subsequent analysis. The segmentation method with the maximum number of selected features was considered to be the most reliable method.
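The ICC-based filter can be sketched in plain Python. The text does not state which ICC form was used, so the two-way random-effects, absolute-agreement, single-rater form (ICC(2,1) in the Shrout and Fleiss notation) is assumed here; the feature names and values are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` holds one row per patient; each row contains the value of a
    single feature as obtained from each observer's segmentation.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between patients
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between observers
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def robust_features(feature_table, threshold=0.80):
    """Keep the names of features whose ICC exceeds the threshold."""
    return [name for name, ratings in feature_table.items()
            if icc_2_1(ratings) > threshold]


# Hypothetical values: 3 patients, 2 observers per feature.
table = {
    "feature_a": [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],  # perfect agreement
    "feature_b": [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]],  # constant offset of 1
}
retained = robust_features(table)
```

The second feature shows why an absolute-agreement form is a strict choice: a constant offset between observers leaves the ranking of patients unchanged but still lowers the ICC to about 0.67, below the 0.80 cutoff.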
The ComBat harmonization method was applied to pool all conventional PET parameters and radiomics features derived from images acquired on the 2 different PET/CT scanners (28). Pearson correlation coefficient analysis followed by the LASSO algorithm was applied to select features. Details on feature selection are presented in the supplemental materials.
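The Pearson step can be sketched as a greedy redundancy filter. The correlation cutoff and the tie-breaking rule are given only in the supplemental materials, so the 0.8 cutoff and the keep-the-first-feature rule below are placeholders; ComBat harmonization and the subsequent LASSO step are not shown.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant feature: correlation undefined, treated as 0 here
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(features, max_abs_r=0.8):
    """Greedy filter: keep a feature only if it is not highly correlated
    with any feature already kept. `max_abs_r` is a hypothetical cutoff."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson_r(values, features[k])) <= max_abs_r for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature values for 4 patients.
features = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],  # nearly proportional to "a"
    "c": [4.0, 1.0, 3.0, 2.0],  # r = -0.4 with "a"
}
selected = drop_correlated(features)
```

Because the filter is order dependent, a real pipeline would need an explicit rule for which member of a correlated pair to keep.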
Stacking Ensemble Learning–Based Model Construction
Stacking ensemble learning is a machine learning approach in which the predictions of several base learners are used as inputs to a metalearner to improve predictive accuracy (18). In this study, random forest, support vector machine, gradient boosting decision tree, and adaptive boosting were set as the base learners (first level), whereas random forest served as the metalearner (second level). The methodologic details are presented in the supplemental materials. Logistic regression was also applied to generate predictions. Confusion matrix analytics (including accuracy, F1 score, recall, and precision) were used to compare the performance of different machine learning algorithms. The detailed parameters of these algorithms are presented in Supplemental Table 3.
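The 2-level architecture can be sketched with scikit-learn's StackingClassifier. This is an illustration under stated assumptions, not the study's implementation: the data are synthetic, the hyperparameters are placeholders (the actual settings are in Supplemental Table 3), and the feature scaling before the support vector machine and the 5-fold internal cross-validation are choices made here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for selected radiomics/clinical features and a binary
# 2-y outcome label (assumption: no real patient data are used here).
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# First level: the 4 base learners named in the text.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svm", make_pipeline(StandardScaler(),
                          SVC(probability=True, random_state=0))),
    ("gbdt", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ("ada", AdaBoostClassifier(n_estimators=50, random_state=0)),
]

# Second level: a random forest metalearner trained on the out-of-fold
# predictions of the base learners.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    cv=5,
)
stack.fit(X_train, y_train)

risk_scores = stack.predict_proba(X_test)[:, 1]  # predicted event probability
predicted_labels = stack.predict(X_test)
```

Training the metalearner on out-of-fold predictions keeps it from seeing base-learner outputs for samples those learners were fitted on, which limits leakage between the 2 levels.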
We evaluated the predictive value of 5 different models: the radiomics model; the combined clinical–radiomics model; IPI; the model based on TMTV, the distance between the largest lesion and the lesion farthest from that bulk, and SUVpeak (17); and the International Metabolic Prognostic Index (29). Receiver-operating-characteristic (ROC) curve analysis was used to compare the predictive performance of different models.
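For reference, the area under the ROC curve used for this comparison equals the probability that a randomly chosen patient with an event receives a higher score than a randomly chosen patient without one. A minimal sketch of that pairwise definition follows; the function name and toy data are illustrative, and statistical packages use faster rank-based implementations.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (event, non-event) pairs ranked correctly.

    Ties between an event score and a non-event score count as half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 2 events and 2 non-events.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Here 3 of the 4 event/non-event pairs are ordered correctly, giving an AUC of 0.75.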
Statistical Analysis
All statistical analyses were performed using SPSS (version 26.0), R (version 4.0.5, http://www.R-project.org), and Python (version 3.10). Progression-free survival (PFS) was defined as the time from diagnosis until lymphoma progression or death from any cause. Overall survival (OS) was defined as the time from diagnosis to death from any cause or to the last follow-up. Patients still alive were censored at the date of last contact. The differences in clinical characteristics were assessed using the χ² test and 1-way ANOVA, as appropriate. Patients were stratified into high- and low-risk groups using ROC curve analysis and maximizing the Youden index (30). Survival curves were estimated by the Kaplan–Meier analysis, and survival distributions were compared using the log-rank test. A P value of less than 0.05 was considered statistically significant.
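The Youden-index rule for splitting patients into risk groups can be sketched as follows. The Youden index is J = sensitivity + specificity - 1; the convention that scores at or above the cutoff count as high risk, the rule of keeping the lowest cutoff when several tie, and the toy data are assumptions made for illustration.

```python
def youden_cutoff(labels, scores):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    A patient is called high risk when score >= cutoff. If several cutoffs
    tie, the lowest one is kept.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_cut, best_j = None, -1.0
    for cut in sorted(set(scores)):
        tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= cut)
        tn = sum(1 for l, s in zip(labels, scores) if l == 0 and s < cut)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


# Toy model outputs (1 = event within 2 y).
labels = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
cutoff, j_stat = youden_cutoff(labels, scores)
risk_groups = ["high" if s >= cutoff else "low" for s in scores]
```

The resulting high- and low-risk groups are what the Kaplan–Meier curves and the log-rank test then compare.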
RESULTS
Patient Characteristics and Outcome
Patients’ clinical characteristics are summarized in Table 1. No clinical characteristic differed significantly among the datasets (all P > 0.05). The median follow-up intervals for the training, internal testing, and external testing sets were 41 mo (range, 4–105 mo), 44 mo (range, 6–104 mo), and 39 mo (range, 4–69 mo), respectively. By the end of follow-up, relapse or progression had occurred in 56, 21, and 14 patients in the training, internal testing, and external testing sets, respectively, whereas 45, 16, and 10 patients, respectively, had died.
Feature Selection
Among the 4 segmentation methods, SUV4.0 segmentation showed the highest reliability, with 830 features (66.7%) retained at an ICC threshold of more than 0.8 (Supplemental Table 4). After the Pearson correlation coefficient test, 88 radiomics features were selected for SUV4.0 segmentation. The optimal features were obtained by the LASSO algorithm for construction of different stacking models (Supplemental Table 5).
Model Performance Evaluation
The model performance for 2-y PFS prediction based on different machine learning algorithms is shown in Supplemental Table 6. For the radiomics model, the stacking classifier showed better performance than the other 4 base classifiers and logistic regression, except for recall in the training set. For the combined model, the stacking classifier also demonstrated better performance than the other classifiers in the training set, internal testing set, and external testing set. Furthermore, the stacking-based combined model had higher predictive power than the radiomics model and IPI across nearly all evaluation metrics.
The model performance for 2-y OS prediction is shown in Supplemental Table 7. For the radiomics model, the stacking classifier demonstrated superior performance to the other base classifiers and logistic regression, except for precision in the internal testing set and accuracy and recall in the external testing set. For the combined model, the stacking classifier had relatively balanced performance in the training set but outperformed the other base classifiers in the internal testing set and the external testing set. Moreover, the stacking-based combined model performed better than the radiomics model and IPI.
We compared the performance of stacking-based combined models built from various combinations of base classifiers. As shown in Supplemental Tables 8 and 9, the combination of 4 base classifiers had a more balanced performance for PFS and OS prediction than did the other combinations. We also evaluated the performance of the radiomics and combined models trained on PFS prediction for predicting OS and vice versa; the results are shown in Supplemental Tables 10 and 11.
The results of ROC analysis are shown in Table 2. The combined model outperformed the other models for PFS prediction, with the area under the ROC curve (AUC) being 0.791, 0.762, and 0.771 in the training set, internal testing set, and external testing set, respectively. A similar trend was observed for OS prediction (the AUCs of the combined model were 0.843, 0.741, and 0.725 for the training set, internal testing set, and external testing set, respectively).
Survival Prediction
Kaplan–Meier survival estimates of the combined model and IPI in the training set, internal testing set, and external testing set are shown in Figures 2, 3, and 4, respectively. The Kaplan–Meier survival estimates of the radiomics model are shown in Supplemental Figure 2. The differences in survival rates between low- and high-risk groups were significant except for OS in the radiomics model in the external testing set (P = 0.053). Moreover, the combined model demonstrated a more distinct risk stratification than the radiomics model and IPI, with larger differences between subgroups for both PFS and OS prediction (all P < 0.05).
DISCUSSION
In this study, we developed an analytic approach based on [18F]FDG PET radiomics using stacking ensemble learning for outcome prediction in DLBCL. Radiomics and combined clinical–radiomics models constructed by the stacking method outperformed those built on other single machine learning classifiers. Furthermore, the combined models integrating radiomics features and clinical information exhibited predictive performance superior to that of radiomics-only models and IPI.
To the best of our knowledge, this was the first study to evaluate the prognostic effect of [18F]FDG PET radiomics through a stacking ensemble learning approach in patients with DLBCL. Several previous studies have found that machine learning–based PET radiomics could be of prognostic importance in DLBCL (12–14). A multicenter study with 317 DLBCL patients suggested that the radiomics model based on LASSO logistic regression was predictive of 2-y time to progression, with an AUC of 0.76 (16). Another study using a LASSO-Cox algorithm reported an AUC of 0.748 for the radiomics model in the test set for PFS prediction (12). In a recent study, Jiang et al. used cross combination of 7 different machine learning algorithms for feature selection and found that the radiomics signature obtained by the support vector machine–support vector machine was highly predictive of PFS (AUC, 0.757) (14). Despite these encouraging findings, a recently developed ensemble learning approach has revealed diagnostic and prognostic advantages over a single machine learning method by aggregating multiple algorithms to achieve higher prediction accuracy (31,32). In our current study, the radiomics model built on a stacking ensemble learning approach outperformed those developed by the other 4 base classifiers and logistic regression, with AUCs of 0.715 and 0.707 for PFS prediction in the internal and external testing sets, respectively. This finding is consistent with the results from a recent radiomics study on DLBCL, in which a soft voting ensemble–based model showed higher accuracy than those based on single machine learning classifiers for 2-y event-free survival prediction (15). Notably, voting considers only linear relationships among classifiers whereas stacking is able to learn complex associations when individual base classifiers are heterogeneous (33). 
In our study, the combined model developed by 4 classifiers showed a more balanced performance than the other combinations, supporting the potential of stacking ensemble learning for radiomics analysis in DLBCL.
Our study also demonstrated that the combined models incorporating patient-level PET radiomics and clinical characteristics yielded higher AUCs and more distinct risk stratifications than IPI for outcome prediction in DLBCL, which is in line with previous observations (12,14,16). Recent studies suggested that the predictive ability of IPI has been weakened in the rituximab era (4). In this context, PET radiomics might add a new perspective on the phenotypic characteristics of DLBCL through profiling the intratumoral metabolic heterogeneity. Therefore, it is likely that considering both clinical and imaging features in analysis may offer a deeper understanding of the complex biologic properties of malignancy and thereby provide a better prognosis estimation.
Radiomics analysis in lymphoma remains challenging because of the lack of a primary site and the complexity of lesion delineation, particularly for disseminated disease. To date, no consensus has been reached on which segmentation method for lesion delineation in DLBCL is preferable. Although the 41%-of-SUVmax method has been recommended by the European Association of Nuclear Medicine for TMTV evaluation (21), this method is more likely to be influenced by interobserver variability (34). Other studies indicated that the SUV4.0 method could give a good approximation of TMTV for prediction of disease progression (35). On top of these, the impact of different segmentations on radiomics features for prognosis prediction in DLBCL remains to be explored. In our study, we compared the reliability of radiomics features based on 4 different segmentation methods. The SUV4.0 method yielded the highest interobserver reliability, with 830 features (66.7%) retained in ICC analysis, which is in line with the results from a recent study suggesting that SUV4.0 is the most stable approach (with excellent reliability for 84.8% of all features) among 6 semiautomatic segmentation methods (36). By contrast, the interobserver reliability of radiomics features based on 41%-of-SUVmax segmentation was the lowest in the current study, with only 46 features (3.7%) having excellent reliability. This discrepancy may correlate with differences in TMTV delineation. Previous studies demonstrated that variations in segmentation methods could have a marked effect on the outer contour of the segmentation, thereby influencing radiomics features, especially morphologic metrics (36,37). In our study, the SUV4.0 method exhibited a higher TMTV estimation and more stable radiomics features than the 41%-of-SUVmax method, indicating that a higher TMTV may cause the segmentation method to have less of an impact on radiomics features.
Several limitations of our study deserve mention. First, since this was a retrospective study with a relatively small sample size, our results need to be further validated in prospective multicenter studies involving a larger cohort of patients. Second, we applied only patient-level radiomics analysis; further studies are required to compare the impact of different lesion selection methods on radiomics analysis. Third, we applied ICC, Pearson correlation analysis, and LASSO for feature selection; further studies will be required to assess the performance of other strategies, for example, minimum redundancy maximum relevance and ReliefF. Fourth, to facilitate comparison with previous results, we used only PET images for radiomics analysis. A combination of PET and CT images may lead to the discovery of radiomics features that are more predictive. Fifth, Ki-67 expression and MYC/BCL-2 double-hit status are established prognostic factors but were not assessed in this study because of the incompleteness of the available data.
CONCLUSION
In the present study, we proposed an analytic approach using stacking ensemble learning for outcome prediction in DLBCL based on [18F]FDG PET radiomics. The stacking-based combined model that incorporates radiomics features and clinical characteristics could enable improved risk stratification in DLBCL patients.
DISCLOSURE
This study was partially supported by the National Natural Science Foundation of China (32027802), the National Key R&D Program of China (2021YFE0108300 and 2022YFE0118000), and the Key R&D Program of Zhejiang (2022C03071). No other potential conflict of interest relevant to this article was reported.
KEY POINTS
QUESTION: Can stacking ensemble learning–based [18F]FDG PET radiomics improve outcome prediction in patients with DLBCL?
PERTINENT FINDINGS: In a retrospective study of 240 DLBCL patients, a stacking ensemble learning–based model that incorporates radiomics features and clinical characteristics enabled improved risk stratification.
IMPLICATIONS FOR PATIENT CARE: The stacking ensemble learning–based model incorporating PET radiomics and clinical information can be useful for better survival prediction and therapeutic decision making.
Footnotes
Published online Jul. 27, 2023.
- © 2023 by the Society of Nuclear Medicine and Molecular Imaging.
- Received for publication November 23, 2022.
- Revision received May 31, 2023.