Abstract
P1069
Introduction: FDG PET/CT is widely used for staging high-grade lymphoma. The time required to evaluate studies varies with case complexity. Integrating artificial intelligence (AI) into the reporting workflow has the potential to improve efficiency and to enable the use of advanced quantification methods in a clinical setting. This study evaluated the impact of the amount of data used to train a convolutional neural network (CNN)-based deep learning (DL) model on detection and segmentation performance metrics.
Methods: A total of 6150 lymphoma lesions, considered as ground truth (GT), were segmented on pre-treatment FDG PET/CT scans of 420 patients with high-grade lymphoma by a radiologist with 7 years' experience. GT segmentation included nodal and extra-nodal disease. All segmentations were checked by a dual-certified radiologist/nuclear medicine physician with >15 years of experience. A DL model, consisting of an ensemble of patch-based 3D DenseNets, was trained using datasets of various sizes (N = 50, 100, 150, 200 and 300) randomly sampled from a total of 300 cases. The same architecture, training strategies and loss function were used for each of the 5 training sets. Technical performance was assessed on a separate evaluation dataset of 120 cases. Per-patient lesion detection performance was assessed by computing the true positive rate (TPR) and the number of false positive (FP) findings. Voxel-wise detection sensitivity and positive predictive value (PPV) were also calculated. Segmentation and quantification performance were evaluated using the DICE score, non-parametric Bland-Altman analysis, and the intraclass correlation coefficient (ICC), for SUVmax and SUVmean per lesion and for total metabolic volume (TMV) and total lesion glycolysis (TLG) per patient. Statistics reported for the Bland-Altman analysis were the median difference (bias) and the lower and upper limits of agreement (LoA), calculated as the 2.5th and 97.5th percentiles.
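For illustration, a minimal Python/NumPy sketch of the two agreement metrics as defined above: the DICE score between a predicted and a GT binary mask, and the non-parametric Bland-Altman statistics, with bias taken as the median difference and LoA as the 2.5th and 97.5th percentiles of the per-case differences. Function names and inputs are illustrative; the abstract does not describe the authors' implementation.

```python
import numpy as np

def dice_score(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """DICE overlap between two binary segmentation masks of equal shape."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, gt).sum() / denom

def nonparametric_bland_altman(pred_values, gt_values):
    """Median difference (bias) and 2.5th/97.5th-percentile limits of agreement."""
    diffs = np.asarray(pred_values, dtype=float) - np.asarray(gt_values, dtype=float)
    bias = np.median(diffs)
    loa_low, loa_high = np.percentile(diffs, [2.5, 97.5])
    return bias, (loa_low, loa_high)
```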
Results: All models demonstrated good lesion detection capability (median TPR: 83-88%), whilst the median number of FPs decreased from 9 (N=50) to 3 (N=300). Similarly, per-voxel analysis demonstrated consistent sensitivity across the 5 models (91-93%), whilst PPV increased (median: 75%, 82%, 83%, 86% and 88% for N=50-300). Agreement between predicted and GT contours, measured using the DICE score, improved with larger training datasets (median: 0.78, 0.83, 0.84, 0.85 and 0.86, respectively). Bland-Altman analysis showed significantly better agreement between predicted and GT SUVmax values for N=300 (bias = 0, LoA = [-0.03, 0.0]) than for N ≤ 200 (bias = 0, LoA varying between [-0.19, 0] and [-0.12, 0.1]); however, for N > 50, predicted and GT SUVmax were in perfect agreement in 95% of cases. LoA for SUVmean were consistent across the 5 models (between [-1.1, 1.1] and [-1.5, 1.5]), with bias between -0.11 and 0. TMV agreement consistently improved with increasing training dataset size (LoA narrowing from [-499, 461] for N=50 to [-345, 281] for N=300), whilst bias decreased from 26 (N=50) to 6 (N=300). TLG showed a similar trend to TMV. ICC for SUVmax increased significantly between N=50 (ICC=0.72) and N > 100 (ICC=0.97-0.99). ICC for the other parameters was consistent across all models: 0.93-0.95 for SUVmean, 0.91-0.94 for TMV and 0.97-0.99 for TLG. Visual assessment confirmed that the accuracy of lesion segmentation improved with larger training dataset size.
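For reference, the ICC between predicted and GT per-lesion values can be computed from a long-format table; below is a sketch assuming the pingouin package and toy SUVmax values (the abstract does not state which ICC form was used, so all names and numbers here are illustrative only).

```python
import pandas as pd
import pingouin as pg

# Toy long-format table: one row per (lesion, method) SUVmax measurement.
df = pd.DataFrame({
    "lesion": [1, 1, 2, 2, 3, 3, 4, 4],
    "method": ["GT", "DL"] * 4,
    "suvmax": [12.4, 12.4, 8.1, 7.9, 15.0, 15.0, 5.6, 5.2],
})
icc = pg.intraclass_corr(data=df, targets="lesion", raters="method", ratings="suvmax")
print(icc[["Type", "Description", "ICC"]])
```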
Conclusions: The deep learning model's ability to detect lymphoma lesions on PET/CT scans was relatively unaffected by the size of the training dataset. However, more training data reduced the FP rate and improved agreement between predicted and ground-truth segmentations for SUVmax, TMV and TLG.