Abstract
Objectives: Accurate lesion segmentation is required for quantification of volumetric and radiomic features from oncological PET images, creating an important need to clinically evaluate different segmentation methods. This evaluation is challenging because true segmentations are typically unavailable. Current approaches use reference-based metrics that measure similarities, such as spatial overlap, between estimated and manual segmentations, where the latter serve as a surrogate ground truth. However, these reference-based metrics are not designed to evaluate performance on the task of quantifying features from images [1]. Further, manual segmentations themselves suffer from inter- and intra-reader variability and may be erroneous due to partial-volume effects. No-gold-standard evaluation (NGSE) techniques provide a mechanism to address these dual issues [1-4]: they evaluate quantitative imaging methods based on how precisely each method measures the true quantitative value, without access to a gold standard. Our aim was to compare clinical evaluations of segmentation methods on two oncological PET datasets using both reference-based metrics and the NGSE technique.
Methods: Dataset 1 consisted of 69 oropharynx lesions [5]. Four segmentation methods were evaluated: (1) a generative adversarial network followed by active contours (GAN-AC) [6], (2) a GAN [7-8], (3) a V-net [9], and (4) expert-defined manual segmentation. Dataset 2 consisted of 147 lymphoma lesions, for which we evaluated: (1) a PET gradient-based method (PETedge), (2, 3) fixed thresholding at 25% and 41% of SUVmax, and (4) expert-defined manual segmentation. For evaluation, we first computed the reference-based metrics of Dice similarity coefficient (DSC), Jaccard similarity coefficient (JSC), and Hausdorff distance (HD), using manual segmentations as the surrogate ground truth. We next evaluated all methods, including manual segmentation, with the NGSE technique on the tasks of computing SUVmean, metabolic tumor volume (MTV), and total lesion glycolysis (TLG). The NGSE technique assumes a linear relationship between the true and measured quantitative values and estimates the slope, bias, and noise standard deviation terms that parameterize this relationship without knowledge of the ground truth. The noise-to-slope ratio (NSR) then quantifies the precision of each method, with lower values indicating higher precision.
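The two evaluation approaches above can be sketched on toy data. This is a minimal illustration, not the study's implementation: 2D binary masks stand in for 3D PET segmentations, and the linear-model parameters (slope, bias, noise standard deviation) are set by hand, whereas the actual NGSE technique estimates them without knowledge of the true values.

```python
# Minimal sketch of reference-based metrics (DSC, JSC) and the NGSE
# figure of merit (NSR), using hand-picked toy values; the parameters
# below are hypothetical, not those estimated in the study.
import numpy as np

def dice(a, b):
    """Dice similarity coefficient between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    intersection = np.logical_and(a, b).sum()
    return 2.0 * intersection / (a.sum() + b.sum())

def jaccard(a, b):
    """Jaccard similarity coefficient (intersection over union)."""
    a, b = a.astype(bool), b.astype(bool)
    intersection = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return intersection / union

# Toy estimated and reference (manual) masks, offset by one voxel.
est = np.zeros((8, 8), dtype=bool); est[2:6, 2:6] = True
ref = np.zeros((8, 8), dtype=bool); ref[3:7, 3:7] = True
print(f"DSC = {dice(est, ref):.4f}, JSC = {jaccard(est, ref):.4f}")
# -> DSC = 0.5625, JSC = 0.3913

# NGSE linearity assumption: measured = slope * true + bias + noise.
# Here we simulate measurements from known parameters for illustration only.
rng = np.random.default_rng(0)
true_mtv = rng.uniform(5.0, 50.0, size=1000)   # hypothetical true MTVs (mL)
slope, bias, noise_sd = 0.9, 2.0, 4.0          # hypothetical method parameters
measured_mtv = slope * true_mtv + bias + rng.normal(0.0, noise_sd, size=1000)

# Figure of merit: noise-to-slope ratio (lower = more precise method).
nsr = noise_sd / slope
print(f"NSR = {nsr:.2f}")
```

In the study itself, the slope, bias, and noise standard deviation are the unknowns that the NGSE technique recovers from the measured values alone; the simulation here only makes the assumed linear relationship concrete.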
Results: For Dataset 1, the reference-based metrics indicated that GAN-AC yielded the best performance (DSC: 0.81, SD = 0.055; JSC: 0.79, SD = 0.065; HD: 1.76, SD = 0.66). However, the NGSE technique indicated that V-net, GAN, and GAN-AC yielded the most precise SUVmean (NSR: 0.49, 95% CI: 0.42, 0.56), TLG (NSR: 17.23, 95% CI: 16.33, 18.13), and MTV (NSR: 2.40, 95% CI: 2.32, 2.48) values, respectively. Further, for all three features, manual segmentation had the highest NSR. For Dataset 2, the reference-based metrics indicated that PETedge yielded the best performance (DSC: 0.76, SD = 0.24; JSC: 0.70, SD = 0.29; HD: 3.68, SD = 4.45). However, the NGSE technique indicated that 25% SUVmax thresholding, PETedge, and manual segmentation yielded the most precise SUVmean (NSR: 0.98, 95% CI: 0.95, 1.02), MTV (NSR: 25.00, 95% CI: 23.80, 26.21), and TLG (NSR: 222.30, 95% CI: 211.76, 232.85) values, respectively. Additionally, for both datasets, the method that performed best on the reference-based metrics also yielded an NSR close to that of manual segmentation, providing confidence in the output of the NGSE technique.
Conclusion: Results from the NGSE technique indicate that manual segmentation may not yield the most precise quantitative values compared with other segmentation methods. Methods that appear inferior by reference-based metrics may in fact yield more precise quantitative values. Thus, the NGSE technique could serve as a complement to commonly used reference-based metrics when clinically evaluating segmentation methods.