Abstract
242097
Introduction: Lung cancer is one of the most common and deadly types of cancer, with a low five-year survival rate. Accurate prediction of overall survival (OS) is crucial for clinical decision-making and personalized treatment planning. However, predicting OS is challenging due to the heterogeneity and complexity of lung cancer. Supervised learning techniques require labeled patient data, which can be difficult to obtain in large numbers. We investigate a semi-supervised framework that pseudo-labels the many patients with missing outcomes, thereby incorporating both labeled and unlabeled data, while performing final testing on labeled data. To this end, we utilize hybrid machine learning systems (HMLSs) involving handcrafted radiomics features (RFs) and deep radiomics features (DFs) extracted from PET/CT images.
Methods: 221 patients with lung cancer who had PET/CT and clinical information were included from The Cancer Imaging Archive (38 patients) and our local clinical database (183 patients). PET images were first registered to CT using a rigid registration algorithm; next, Standardized Uptake Value correction, clipping, and normalization were applied to the images. We generated both RFs and DFs to improve risk modeling performance. In the RF framework, 215 quantitative RFs were extracted from each segmented tumor area using the ViSERA software, standardized in reference to the Image Biomarker Standardization Initiative. In the DF framework, a 3D autoencoder neural network was used to extract 1024 DFs from the bottleneck layer for each of 3 masks: whole (W), cropped (C; 32×32×32 mm³), and segmented (S) PET/CT images. Two approaches, supervised and semi-supervised, were used to predict continuous OS time. In the supervised approach, different HMLSs, each consisting of one of 3 feature selection algorithms (FSAs) followed by one of 10 regression algorithms (RAs), were applied to the RFs and DFs extracted from the masks mentioned above. In the semi-supervised approach, a pseudo-labeling algorithm increased the number of usable patients by labeling patients with missing outcomes (114 patients) and adding them to the labeled data (107 patients). Subsequently, all HMLSs used in the supervised approach were applied to the enlarged datasets. We compared this approach to the conventional supervised framework, which used only the 107 labeled patients. Furthermore, 3 survival prediction algorithms (SRAs) linked with the same FSAs were utilized in survival hazard ratio analysis. Three subsets of relevant features (10, 30, and 50), as selected by the FSAs, were supplied to the RAs to predict OS. Mean absolute errors (MAEs) in 5-fold cross-validation (80% of total data) and external nested testing (remaining 20% of total data) were calculated to compare models.
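The semi-supervised step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the pseudo-labeling algorithm, so the sketch assumes the simplest variant (a model fitted on labeled patients predicts OS for unlabeled patients, and those predictions are used as pseudo-labels). The feature table is synthetic, the names `make_model`, `X_unl`, and `pseudo_y` are hypothetical, and the held-out split is drawn from labeled patients only, which is also an assumption. The feature selector and regressor (mutual information with 50 features, Extra Trees) mirror one HMLS named in the abstract, using scikit-learn.

```python
from functools import partial

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for a radiomics feature table (NOT real patient data).
n_labeled, n_unlabeled, n_features = 107, 114, 200
X = rng.normal(size=(n_labeled + n_unlabeled, n_features))
# Hypothetical OS in years, driven by the first five features.
y_all = np.clip(
    2.0 + 0.5 * X[:, :5].sum(axis=1) + rng.normal(scale=0.3, size=len(X)),
    0.11, 6.6,
)
X_lab, y_lab = X[:n_labeled], y_all[:n_labeled]
X_unl = X[n_labeled:]  # outcomes of these patients are treated as missing


def make_model():
    """One hybrid system: mutual-information selection + Extra Trees."""
    selector = SelectKBest(
        score_func=partial(mutual_info_regression, random_state=0), k=50
    )
    return make_pipeline(
        selector, ExtraTreesRegressor(n_estimators=200, random_state=0)
    )


# Hold out 20% of the labeled patients for final testing (assumption).
X_tr, X_te, y_tr, y_te = train_test_split(
    X_lab, y_lab, test_size=0.2, random_state=0
)

# Supervised baseline: labeled training patients only.
supervised = make_model().fit(X_tr, y_tr)

# Pseudo-labeling: predict OS for unlabeled patients, then retrain
# on the union of labeled and pseudo-labeled patients.
pseudo_y = supervised.predict(X_unl)
X_aug = np.vstack([X_tr, X_unl])
y_aug = np.concatenate([y_tr, pseudo_y])
semi = make_model().fit(X_aug, y_aug)

mae_sup = mean_absolute_error(y_te, supervised.predict(X_te))
mae_semi = mean_absolute_error(y_te, semi.predict(X_te))
print(f"supervised MAE: {mae_sup:.2f} y, semi-supervised MAE: {mae_semi:.2f} y")
```

On synthetic data the two MAEs carry no clinical meaning; the sketch only shows how the enlarged training set is assembled and how both models are scored on the same held-out labeled patients.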
Results: HMLSs employed in the semi-supervised approach significantly outperformed those in the supervised approach (p < 0.0001, paired t-test). In the semi-supervised approach, the best 5-fold cross-validation MAE of 0.19±0.04 years [outcome range: 0.11-6.6 years] was obtained with CT-S-DF (DFs extracted from the segmented CT) linked with the HMLS Mutual Information (MI) (50 features) + Extra Trees Regressor (Fig. 1). By contrast, in the supervised approach, the best 5-fold cross-validation MAE of 0.40±0.03 years was obtained with CT-RF (RFs extracted from the segmented CT) linked with F-Regression (30 relevant features) + Bagging Regression (Fig. 2). External testing MAEs of 0.56±0.39 and 1.07±0.10 confirmed our findings in the semi-supervised and supervised approaches, respectively. In survival analysis, MI + Fast Survival Support Vector Machines applied to PET-RF (RFs extracted from the segmented PET) provided the highest c-index of 0.79±0.03, with a log-rank p-value of 0.006 (Fig. 3). External testing performance confirmed our findings.
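For readers unfamiliar with the c-index reported above, the following is a minimal sketch of Harrell's concordance index in Python. It assumes the common definition: a pair of patients is comparable when the patient with the shorter follow-up time had an observed event, and it counts the fraction of comparable pairs in which that patient was assigned the higher risk score (ties in risk count as one half). The function name `concordance_index` and the toy inputs are hypothetical, and the abstract does not state which implementation the authors used.

```python
import numpy as np


def concordance_index(time, event, risk):
    """Harrell's c-index for right-censored survival data.

    time  : follow-up times
    event : 1/True if death was observed, 0/False if censored
    risk  : predicted risk scores (higher means shorter expected survival)
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    risk = np.asarray(risk, dtype=float)

    concordant = 0.0
    comparable = 0
    for i in range(len(time)):
        if not event[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(len(time)):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example: risk scores perfectly ordered against survival time.
print(concordance_index([1, 2, 3, 4], [1, 1, 0, 1], [4, 3, 2, 1]))  # 1.0
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the reported 0.79.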
Conclusions: Use of a semi-supervised approach linked with appropriate DFs/RFs, masks, and PET/CT images significantly enhanced OS prediction in lung cancer patients compared to conventional supervised learning.