Abstract
611
Objectives: Whilst several PET delineation algorithms have been proven effective under specific conditions, recent studies (Dewalle-Vignion et al., Phys Med Biol 2015; Schaefer et al., Eur J Nucl Med Mol Imaging 2016) suggested that combining several approaches may enhance the reliability of the segmentation step used in radiomics. The objective of this study was twofold: 1) to assess the impact of the number of segmentation methods (nbSM) involved in the computation of the consensus approaches, and 2) to confirm whether a simple majority vote (MV) and the simultaneous truth and performance level estimation (STAPLE) perform equally.
Methods: The study was conducted using two different populations: 44 patients suffering from pheochromocytoma (first objective) and 61 patients suffering from pediatric sarcomas (second objective) were retrospectively enrolled. All patients with pheochromocytoma underwent an 18F-FDG PET before surgery. Forty-seven lesions were delineated using 5 different algorithms (40% of SUVmax, 2 different adaptive algorithms, k-means clustering and a fixed threshold of SUV=2.5). MV and STAPLE were computed using 3, 4 or 5 of these individual algorithms. The maximum lesion size was determined on the resected tumors by experienced pathologists and compared, using a ranking approach that took into account the limited PET spatial resolution, with the results of the different segmentation algorithms (including MV and STAPLE). The second objective was assessed by adding 63 lesions from pediatric sarcoma patients who underwent 18F-FDG PET at diagnosis. The difference between MV and STAPLE (110 lesions analysed for the pooled population) was evaluated using a linear mixed model or a Friedman test corrected for multiple comparisons (Benjamini-Hochberg). All statistical tests were performed using R 3.2.5.
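As an illustration of the simpler of the two consensus approaches, a voxel-wise majority vote over binary segmentation masks can be sketched as follows. This is a minimal sketch and not the implementation used in the study: the function name `majority_vote`, the toy masks, and the tie rule (a voxel is kept only when strictly more than half of the methods include it, which matters when 4 methods are combined) are all assumptions made for the example.

```python
import numpy as np

def majority_vote(masks):
    """Voxel-wise majority vote over binary masks of identical shape.

    Assumption: ties are excluded, i.e. a voxel belongs to the consensus
    only if strictly more than half of the input masks contain it.
    """
    stack = np.stack([np.asarray(m, dtype=bool) for m in masks])
    votes = stack.sum(axis=0)            # number of methods including each voxel
    return votes * 2 > stack.shape[0]    # strict majority

# Hypothetical 1D "volumes" standing in for three segmentation results.
a = np.array([1, 1, 1, 0, 0])
b = np.array([0, 1, 1, 1, 0])
c = np.array([0, 0, 1, 1, 1])
consensus3 = majority_vote([a, b, c])

# Adding a fourth hypothetical mask shows the effect of the tie rule.
d = np.array([1, 0, 0, 0, 0])
consensus4 = majority_vote([a, b, c, d])
```

With an even number of inputs, the tie rule changes the consensus volume, so it should be stated explicitly whenever MV is computed from 4 methods.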
Results: Table 1 shows the mean error for each segmentation algorithm (in brackets: the number of times each method was ranked, respectively, as the best or the worst). MV and STAPLE achieved the highest ranking in more than 40% of cases and, interestingly, were the only two methods never ranked as the worst, regardless of the nbSM used. The STAPLE and MV errors seemed to decrease as the nbSM increased, although this trend was not statistically significant (Levene's test). Additionally, no difference was found between STAPLE and MV when pooling pheochromocytoma lesions with pediatric sarcoma lesions, regardless of the nbSM used, suggesting that both methods perform equally.
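The best/worst counts reported in brackets can be illustrated with a small tally over per-lesion errors. This is only a sketch under stated assumptions: the function `tally_best_worst`, the tolerance parameter `tol` (a hypothetical stand-in for the allowance made for PET spatial resolution) and the error values are invented for the example and are not the study's data or ranking procedure.

```python
import numpy as np

def tally_best_worst(errors, names, tol=0.0):
    """Count how often each method has the smallest / largest absolute error.

    errors: array of shape (n_lesions, n_methods), e.g. size errors in mm.
    tol:    hypothetical tolerance; methods within `tol` of the best (or
            worst) error for a lesion share that rank.
    """
    err = np.abs(np.asarray(errors, dtype=float))
    best = err.min(axis=1, keepdims=True)
    worst = err.max(axis=1, keepdims=True)
    is_best = err <= best + tol
    is_worst = err >= worst - tol
    return {
        name: (int(is_best[:, j].sum()), int(is_worst[:, j].sum()))
        for j, name in enumerate(names)
    }

# Made-up errors for three lesions and three methods (not study data).
toy_errors = [
    [2, 5, 3],
    [4, 1, 2],
    [6, 7, 1],
]
counts = tally_best_worst(toy_errors, ["A", "B", "MV"])
```

Each entry of `counts` maps a method name to a pair (times ranked best, times ranked worst), which mirrors the bracketed layout described for Table 1.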
Conclusion: These results suggest that STAPLE and MV can be used with either 3, 4 or 5 input volumes without significant differences. However, the error associated with STAPLE tended to decrease with nbSM, suggesting that the accuracy of STAPLE may increase with the nbSM involved in the computation. We also confirmed the similarity between STAPLE and MV (Schaefer et al., Eur J Nucl Med Mol Imaging 2016), but with a larger number of lesions than previously reported and regardless of the nbSM used.
Research Support: French National Agency for Research, Investissements d'Avenir, no. ANR-11-LABX-0018-01
Table 1