Abstract
Consensus about a standard segmentation method to derive metabolic tumor volume (MTV) in classical Hodgkin lymphoma (cHL) is lacking, and it is unknown how different segmentation methods influence quantitative PET features. Therefore, we aimed to evaluate the delineation and completeness of lesion selection and the need for manual adaptation with different segmentation methods, and to assess the influence of segmentation methods on the prognostic value of MTV, intensity, and dissemination radiomics features in cHL patients. Methods: We analyzed a total of 105 18F-FDG PET/CT scans from patients with newly diagnosed (n = 35) and relapsed/refractory (n = 70) cHL with 6 segmentation methods: 2 fixed thresholds on SUV4.0 and SUV2.5, 2 relative methods of 41% of SUVmax (41max) and a contrast-corrected 50% of SUVpeak (A50P), and 2 combination majority vote (MV) methods (MV2, MV3). Segmentation quality was assessed by 2 reviewers on the basis of predefined quality criteria: completeness of selection, the need for manual adaptation, and delineation of lesion borders. Correlations and prognostic performance of resulting radiomics features were compared among the methods. Results: SUV4.0 required the least manual adaptation but tended to underestimate MTV and often missed small lesions with low 18F-FDG uptake. SUV2.5 most frequently included all lesions but required minor manual adaptations and generally overestimated MTV. In contrast, few lesions were missed when using 41max, A50P, MV2, and MV3, but these segmentation methods required extensive manual adaptation and overestimated MTV in most cases. MTV and dissemination features significantly differed among the methods. However, correlations among methods were high for MTV and most intensity and dissemination features. There were no significant differences in prognostic performance for all features among the methods. 
Conclusion: A high correlation existed between MTV, intensity, and most dissemination features derived with the different segmentation methods, and the prognostic performance is similar. Despite frequently missing small lesions with low 18F-FDG avidity, segmentation with a fixed threshold of SUV4.0 required the least manual adaptation, which is critical for future research and implementation in clinical practice. However, the importance of small, low 18F-FDG–avidity lesions should be addressed in a larger cohort of cHL patients.
The 18F-FDG PET/CT scan is standard of care for staging and response evaluation in the treatment of classical Hodgkin lymphoma (cHL) (1). Optimizing baseline risk stratification contributes to the implementation of individualized treatment strategies that aim to lower toxicity in patients with favorable prognostic characteristics and to identify patients with unfavorable prognostic characteristics early for treatment with other therapies (2–4). The use of quantitative PET features to improve risk stratification could be implemented in clinical practice if workflows are optimized.
Several studies have shown that metabolic tumor volume (MTV) is a potential prognostic marker in newly diagnosed (ND) and relapsed/refractory (RR) cHL (4–11). However, there are different methods for assessing MTV, and there is no consensus on which method performs best in cHL patients in terms of prognostic performance, ease of use, and interobserver variability (12). MTV assessment is especially challenging in disseminated diseases such as lymphoma. cHL is a heterogeneous disease that is typically localized in the mediastinal and paraaortic regions, mainly affecting young patients who frequently show high physiologic 18F-FDG uptake in brown fat and muscles (1). These regions with high physiologic 18F-FDG uptake impede accurate delineation of tumor lesions nearby. Therefore, it is important to evaluate different segmentation methods specifically for cHL.
Although manual segmentation is the current standard for determining MTV, it is time-consuming and prone to interobserver variability (12). Semiautomatic segmentation includes algorithms that select regions with high 18F-FDG uptake above the threshold of a certain SUV. Segmentation of the MTV can be performed either by predefining regions of interest in which lesions will be automatically selected or by starting with automatic segmentation and deleting regions with high physiologic 18F-FDG uptake (e.g., brain, liver, kidneys) thereafter. Although the segmentation method applied can significantly impact the MTV, it is unknown how each method affects other quantitative PET radiomics features, such as patient-level dissemination parameters (13–17). In addition, no comparative studies have addressed how well the segmented MTV represents the visual interpretation of the MTV in cHL patients.
The aim of our research was to evaluate the delineation and completeness of lesion selection, and the need for manual adaptation with 6 different semiautomatic segmentation methods, and to assess the influence of the segmentation method on the prognostic value of MTV, intensity, and dissemination radiomics features in scans of cHL patients.
MATERIALS AND METHODS
Study Population
PET/CT scans from ND-cHL patients were collected from study cohorts of the Amsterdam UMC (n = 35) (2,18). PET/CT scans of patients with RR-cHL were collected from 3 clinical trials conducted in Amsterdam UMC, The Netherlands (n = 47) and Memorial Sloan Kettering Cancer Center, New York (n = 23) (2–4). All patients had biopsy-proven cHL, and the PET/CT scan was obtained before the start of therapy. All patients provided written informed consent for participation in the clinical trials (NCT02280993, NCT00255723, NCT01508312) or biobank cohort (18) of which the study protocols were approved by Institutional Review Boards and Ethics Committees of the centers that conducted the trials. For secondary use of data for this analysis, a waiver was obtained from the Ethics Committee.
18F-FDG PET/CT Scans and Quality Control
The PET/CT systems used to acquire the scans were EANM Research GmbH (EARL, Europe)– or American College of Radiology (ACR, United States)–accredited (19). PET/CT scans were deidentified at the participating centers and centrally collected. PET scans that did not meet the following 4 criteria, described by European Association of Nuclear Medicine guidelines (19), were excluded from analysis: plasma glucose < 11 mmol/L; reconstruction of attenuation-corrected PET according to guidelines described by EARL or ACR; total image activity (MBq) between 50% and 80% of the total injected 18F-FDG activity or liver SUVmean between 1.3 and 3.0; and essential PET acquisition data and clinical data available (19).
Segmentation of the Volume of Interest (VOI)
Attenuation-corrected PET scans were analyzed using the ACCURATE tool (20). Six different semiautomatic methods were used for each scan to select the VOI: 2 fixed thresholds of SUV4.0 and SUV2.5, 2 relative thresholds of 41% of SUVmax (41max) and a contrast-corrected 50% of SUVpeak (A50P), and 2 majority vote (MV) methods selecting voxels that are chosen with ≥2 (MV2) and ≥3 (MV3) of the previously mentioned fixed or relative methods, respectively. The VOI was delineated by automatic preselection of 18F-FDG–avid structures using the 6 different segmentation methods and a volume threshold of ≥3 mL. Nontumor regions were deleted and lymphoma lesions < 3 mL were added with single mouse clicks. If tumor regions were adjacent to nontumor 18F-FDG–avid regions (e.g., heart, liver, bladder), nontumor regions were either removed manually or tumor segmentation was restricted by placing a border or mask, which prevented selection of lesions outside the border (Fig. 1A). Only focal extranodal and splenic lesions were included in the VOI. A global increase in 18F-FDG uptake of the spleen or bone marrow was not included in the VOI. Delineations were performed under supervision of a nuclear medicine physician.
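The voxel-selection rules described above can be illustrated with a minimal Python sketch. It is not the ACCURATE tool and makes several assumptions: a voxel is selected when its SUV is at or above the cutoff, the relative cutoff is taken from the highest SUV in the region passed in, and the fourth input to the majority vote is a plain 50% of SUVmax stand-in because the contrast correction of A50P is not specified here. The function names, the SUV values, and the voxel volume are invented for illustration, and the ≥3 mL volume filter and all manual steps are omitted.

```python
from typing import List


def fixed_threshold_mask(suv: List[float], threshold: float) -> List[bool]:
    """Select voxels with SUV at or above a fixed threshold (e.g., 4.0 or 2.5)."""
    return [value >= threshold for value in suv]


def relative_threshold_mask(suv: List[float], fraction: float) -> List[bool]:
    """Select voxels at or above a fraction of the highest SUV in the region."""
    cutoff = fraction * max(suv)
    return [value >= cutoff for value in suv]


def majority_vote_mask(masks: List[List[bool]], min_votes: int) -> List[bool]:
    """Keep voxels that at least `min_votes` of the input masks selected."""
    return [sum(votes) >= min_votes for votes in zip(*masks)]


def volume_ml(mask: List[bool], voxel_volume_ml: float) -> float:
    """Segmented volume: number of selected voxels times the voxel volume."""
    return sum(mask) * voxel_volume_ml


# Invented SUVs for eight voxels of one region (not patient data).
suv = [1.0, 2.6, 3.0, 4.5, 8.0, 12.0, 2.0, 5.0]
VOXEL_VOLUME_ML = 0.064  # assumed 4 x 4 x 4 mm voxel

input_masks = [
    fixed_threshold_mask(suv, 4.0),      # SUV4.0
    fixed_threshold_mask(suv, 2.5),      # SUV2.5
    relative_threshold_mask(suv, 0.41),  # 41max
    relative_threshold_mask(suv, 0.50),  # stand-in only, NOT the contrast-corrected A50P
]
mv2 = majority_vote_mask(input_masks, min_votes=2)
mv3 = majority_vote_mask(input_masks, min_votes=3)

names = ["SUV4.0", "SUV2.5", "41max", "50% stand-in"]
for name, mask in zip(names, input_masks):
    print(name, sum(mask), "voxels")
print("MV2", sum(mv2), "voxels; MV3", sum(mv3), "voxels")
```

In this toy region the lower fixed threshold selects the most voxels, and MV3 keeps only voxels on which at least 3 of the 4 input masks agree.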
Quality Scores of Representativeness of Segmentations Compared with Visual Judgment
The quality of the segmentation by the 6 different methods was assessed using 3 quality score (QS) criteria (Table 1): completeness of selection of the VOI (i.e., were all tumor lesions selected?); requirement of manual adaptation after semiautomatic segmentation (i.e., manual removal of nontumor regions); and delineation quality of the VOI (i.e., does the VOI border reflect the visual interpretation of the 18F-FDG–avid tumor area on the PET scan?).
Two reviewers performed the QS assessment for each of the 6 segmentations for all scans, masked to patient outcome. Completeness of selection and delineation QS were assessed independently, followed by a consensus meeting in which the reviewers reached a consensus on all discrepancy scores and assigned a final QS to each segmentation. The manual adaptation QS was assessed in consensus between the reviewers during review of the segmentation of scans. An example of the QS assessment by the 6 segmentation methods is included in Figure 1B.
Radiomics Feature Extraction
RaCat software (developed by Professor Ronald Boellaard; Amsterdam UMC) was used to extract 18 dissemination features from the complete MTV at the patient level (21). Dissemination features included several novel features addressing interlesional heterogeneity based on distance, volume, SUVmax, and SUVpeak (the 1 mL with the highest SUV within the VOI). In addition, MTV, SUVmax, SUVpeak, SUVmean, and total lesion glycolysis were extracted from the VOI. An overview of all features and their definitions is provided in Supplemental Table 1 (supplemental materials are available at http://jnm.snmjournals.org).
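As a rough illustration of how patient-level features of this kind can be derived from per-lesion summaries, the following Python sketch computes MTV, SUVmean, SUVmax, total lesion glycolysis, Dmax, and Dvol. It is not RaCat and does not reproduce the definitions in Supplemental Table 1. It assumes that total lesion glycolysis is MTV multiplied by the volume-weighted SUVmean, that Dmax is measured between lesion centroids, and that Dvol is the largest minus the smallest lesion volume. The `Lesion` type, the `patient_features` function, and all numbers are hypothetical.

```python
from itertools import combinations
from math import dist
from typing import Dict, List, NamedTuple, Tuple


class Lesion(NamedTuple):
    volume_ml: float
    suv_mean: float
    suv_max: float
    centroid_mm: Tuple[float, float, float]


def patient_features(lesions: List[Lesion]) -> Dict[str, float]:
    """Aggregate per-lesion summaries into patient-level features."""
    if not lesions:
        raise ValueError("at least one lesion is required")
    mtv = sum(lesion.volume_ml for lesion in lesions)
    # Volume-weighted mean SUV over the whole segmented volume.
    suv_mean = sum(lesion.volume_ml * lesion.suv_mean for lesion in lesions) / mtv
    volumes = [lesion.volume_ml for lesion in lesions]
    return {
        "MTV": mtv,
        "SUVmean": suv_mean,
        "SUVmax": max(lesion.suv_max for lesion in lesions),
        "TLG": mtv * suv_mean,
        # Largest centroid-to-centroid distance; 0 for a single lesion.
        "Dmax": max(
            (dist(a.centroid_mm, b.centroid_mm) for a, b in combinations(lesions, 2)),
            default=0.0,
        ),
        "Dvol": max(volumes) - min(volumes),
    }


# Invented lesions (not patient data).
lesions = [
    Lesion(40.0, 5.0, 12.0, (0.0, 0.0, 0.0)),
    Lesion(10.0, 3.0, 6.0, (30.0, 40.0, 0.0)),
    Lesion(5.0, 2.0, 4.0, (0.0, 0.0, 120.0)),
]
features = patient_features(lesions)
for name, value in features.items():
    print(f"{name}: {value:.2f}")
```

Because Dmax and Dvol depend on which lesions are selected and on how neighboring lesions are merged, features of this type are the ones most likely to change when the segmentation method changes.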
Statistical Analysis
QS of segmentations were analyzed descriptively and compared using χ2 tests for the whole cohort and separately for ND-cHL and RR-cHL patients. MTV, intensity, and dissemination radiomics features were compared between the ND-cHL and RR-cHL cohorts using the Wilcoxon rank sum test for nonparametric data. Further analyses were performed on the whole cohort. Correlations of MTV, intensity, and dissemination radiomics features among the 6 different segmentation methods were assessed using Spearman rank correlation coefficients. Receiver-operating-characteristics analysis was used to calculate the area under the curve (AUC) for each feature per segmentation method on the whole cohort. An event was defined as the occurrence of progressive disease within 3 y, and patients who died without progression were excluded. AUCs were compared using the test for correlated receiver-operating-characteristics curves described by DeLong et al. (22).
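The two central quantities in this analysis, the Spearman rank correlation between methods and the AUC per feature, can be sketched in a few lines of Python. The study itself used R; the code below is only an illustration with invented numbers. It computes Spearman's coefficient as the Pearson correlation of average ranks and the AUC as the fraction of event/non-event pairs in which the event case has the higher feature value (ties count one half). The DeLong comparison of correlated AUCs is not implemented here, and all function and variable names are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def roc_auc(scores: Sequence[float], events: Sequence[int]) -> float:
    """AUC as the probability that an event case scores higher than a non-event case."""
    positives = [s for s, e in zip(scores, events) if e]
    negatives = [s for s, e in zip(scores, events) if not e]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Invented MTVs (mL) for five patients under two methods, plus progression events.
mtv_method_a = [20.0, 45.0, 90.0, 150.0, 300.0]
mtv_method_b = [35.0, 80.0, 140.0, 260.0, 410.0]
progressed = [0, 0, 1, 0, 1]

rho = spearman(mtv_method_a, mtv_method_b)
auc_a = roc_auc(mtv_method_a, progressed)
auc_b = roc_auc(mtv_method_b, progressed)
print(f"Spearman rho = {rho:.2f}; AUC A = {auc_a:.3f}; AUC B = {auc_b:.3f}")
```

The toy data show why absolute values can differ between methods while prognostic performance stays the same: both measures depend only on the ordering of patients, so two methods that rank patients identically yield a correlation of 1 and identical AUCs.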
Statistical analysis was performed using R software (version 4.0.3; R Core Team). A P value of < 0.05 was considered statistically significant.
RESULTS
Patient Characteristics
A total of 105 PET/CT scans of patients with ND-cHL (n = 35) and RR-cHL (n = 70) were included in the analysis (Supplemental Table 2). A comparison of radiomics features between ND-cHL and RR-cHL showed no significant differences for most features, except for MTV, SUVpeak, and Dvol (the maximum difference in volume between lesions), which were all higher in ND patients than in RR patients (Supplemental Table 3).
Quality Scores of Segmentations
Agreement of QS assessment between the 2 reviewers was high (91% for completeness of selection and 82% for delineation quality).
Segmentation resulted in complete selection of all lesions in most cases (Fig. 2A; Supplemental Table 4). SUV2.5 showed the highest rate of complete selection, followed by 41max, MV2, A50P, and MV3, whereas SUV4.0 frequently missed minor (59%) and major (10%) lesions. When the SUV4.0 method was used, 91% of scans could be segmented without any manual adaptation (Fig. 2B). The SUV2.5 method required minor adaptations in 37% of scans and major adaptations in 7%. When the 41max and MV2 methods were used, only 30% and 34% of scans, respectively, could be segmented without manual adaptation, and major manual adaptations were required in 47% and 33% of cases, respectively. When A50P and MV3 were used, about 50% of scans did not require manual adaptation. None of the methods resulted in a high percentage of representative delineation of tumor borders (Fig. 2C). SUV4.0, SUV2.5, and MV3 resulted in representative delineation in about 50% of cases; in the remaining cases, SUV4.0 tended to underestimate the MTV, whereas SUV2.5 and MV3 tended to overestimate it. The 41max, A50P, and MV2 methods resulted in representative delineation in less than 30% of cases and usually overestimated the MTV.
No significant differences were observed for QS between ND and RR patients, except for completeness of selection in which complete selection rates were higher in RR patients than in ND patients with 41max, A50P, or MV3 (Supplemental Fig. 1).
Comparison of Features
MTV differed significantly among the segmentation methods. The median MTV per method ranged between 44 and 143 mL (Fig. 3; Supplemental Table 5). SUV4.0 resulted in a significantly lower MTV than all other segmentation methods (P < 0.001). The number of lesions was significantly lower with 41max and MV2 than with SUV4.0 and SUV2.5 segmentation methods (P < 0.05). Dmax (the maximum distance between 2 lesions) was not significantly different among the segmentation methods.
MTV, the number of lesions, and Dmax showed high correlations among most methods (Fig. 4; Supplemental Table 6). For MTV and the number of lesions, the highest correlations were observed between the 2 fixed methods (SUV4.0 and SUV2.5), and between the relative and MV methods, with lower correlations between the fixed and relative or MV methods. SUVmax and SUVpeak had identical median values and were strongly correlated (R = 1) across all methods. Dissemination features addressing differences in volume or SUVpeak among lesions showed lower correlations between SUV4.0 and the other 5 segmentation methods (Supplemental Table 6).
To assess the effect of incomplete selection of lesions, several features derived with SUV4.0 were plotted against SUV2.5 (Supplemental Fig. 2). Scans that missed major lesions with SUV4.0 did not show large deviations in the correlation between SUV4.0 and SUV2.5 when compared with scans that had complete selection or missed only minor lesions.
Prognostic Performance per Method
Except for MV2, the AUC of the receiver-operating characteristics did not differ significantly among the segmentation methods for all features assessed (Fig. 5; Supplemental Table 7). The highest AUCs were observed for MTV (range, 0.62–0.65), total lesion glycolysis (range, 0.63–0.65), number of lesions (range, 0.55–0.63), spread in volume (VolSpread) (range, 0.58–0.65), and the difference in SUVpeak between the hottest lesion and all other lesions (DSUVpeakSumHot) (range, 0.56–0.63). Of all methods, MV2 showed the lowest AUC for the various features (median AUC of all variables, 0.55). The other 5 methods showed comparable median AUCs, with the highest median AUC of all variables (0.62) observed for SUV4.0.
DISCUSSION
MTV has shown prognostic value in cHL, but the use of different segmentation methods hampers direct comparisons between studies (4–10). This is especially true if a cutoff for MTV is used to divide patients into low- and high-risk groups, since absolute MTV values significantly differ between methods. Harmonization of MTV assessment enables the evaluation of MTV as a prognostic marker in cHL in a multicohort setting. The same holds for other quantitative PET features including dissemination features.
We evaluated the completeness of lesion selection, need for manual adaptations, and delineation quality of 6 semiautomatic segmentation methods to assess MTV and dissemination features in 105 cHL patients. Segmentation with SUV4.0 required the fewest manual adaptations because this method, in contrast to other methods, rarely floods into regions with high physiologic 18F-FDG uptake. SUV2.5 often required minor adaptations, but seldom major adaptations. Although segmentation using SUV4.0 frequently did not include all lesions (missing those with an SUV < 4.0), these lesions were often small, and scans with major lesions missing did not cause significant deviations in the correlation between SUV4.0 and SUV2.5, which was the most complete method. Additionally, the prognostic performance of all methods was similar, and SUV4.0 and SUV2.5 showed the highest AUCs for most variables.
The results of our evaluation suggest that small lesions with low 18F-FDG uptake that are frequently not included with SUV4.0 probably do not contain critical prognostic information, which could be partly explained by the low contribution of small lesions to total MTV. However, small lesions could still influence dissemination features, of which the prognostic value needs to be established in a larger set of patients with more progression events. Additionally, small low-uptake lesions are potentially of higher importance in response assessment; thus, SUV4.0 may be less suitable for quantitative interim PET analyses in cHL (1).
All segmentation methods, except SUV4.0, frequently overestimated the MTV assessed by visual interpretation. This overestimation may be less relevant when using only patient-level features, as correlations among methods are high; however, lesion-based radiomics analysis involving texture features may be adversely affected by oversegmentation, that is, by selection of voxels that are not part of the tumor (23). Methods that tended to overestimate the MTV also showed a lower number of lesions, because lesions close to each other were frequently clustered into 1 lesion, as illustrated in Figure 1. This explains the discrepancy that SUV4.0 often misses small or low-uptake lesions but still shows the highest number of lesions (Fig. 3).
In a recent comparison of 6 segmentation methods in diffuse large B-cell lymphoma (DLBCL), a fixed threshold of SUV4.0 was considered the best method to derive MTV (24). Similar to our findings, MTV significantly differed among the methods, but the prognostic performance was comparable. Interestingly, method performance in DLBCL at interim PET has been shown to depend on the lesional SUVmax, in which lesions with SUVmax < 10 were delineated most successfully using MV3, whereas SUV4.0 was most successful in lesions with SUVmax > 10 (25). Correlations for MTV were significantly higher in our cohort than previously described for DLBCL, possibly because our correlations were assessed after manual adaptation (24,25). Additionally, and contrary to our findings, the 41max, A50P, and MV3 methods yielded lower exact MTV values than SUV4.0 in baseline DLBCL, showing that performance of different methods can be disease-dependent. In our cohort, 41max resulted in the highest MTV, which can be explained by the lower SUV in our cHL cohort (median SUVmax, 11.3), compared with DLBCL patients (median SUVmax, 22.6) (26). Because SUVmax is a patient-level feature, and cHL shows heterogeneous 18F-FDG uptake, other lesions within a patient may have a much lower SUVmax, resulting in overestimation of the MTV and flooding with relative methods such as 41max.
Methods based on relative thresholds (e.g., 41max and A50P) are less suitable for assessing MTV in diseases with heterogeneous 18F-FDG uptake, such as cHL, because a high lesional SUVmax may exclude the less avid voxels of the lesion, causing undersegmentation. A low lesional SUVmax, however, results in a low threshold, leading to flooding into regions with physiologic 18F-FDG uptake. The MV methods could not overcome this disadvantage of the relative methods. MV2 frequently uses voxels that are selected with 41max and A50P, and although MV3 needs a third method, this did not result in better segmentation than methods with a fixed threshold.
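The arithmetic behind this argument can be made concrete with a short Python sketch. It assumes the cutoff is simply 41% of the reference SUVmax and that a voxel is selected when its SUV is at or above that cutoff. The values 22.6 and 11.3 are the median SUVmax values quoted in the Discussion; 5.0 is a hypothetical low-avidity lesion added for illustration, and the function name is invented.

```python
def relative_cutoff(suv_max: float, fraction: float = 0.41) -> float:
    """Absolute SUV cutoff implied by a relative threshold on SUVmax."""
    return fraction * suv_max


# 22.6 and 11.3 are medians quoted in the Discussion; 5.0 is hypothetical.
cutoffs = {
    suv_max: round(relative_cutoff(suv_max), 2)
    for suv_max in (22.6, 11.3, 5.0)
}
for suv_max, cutoff in cutoffs.items():
    print(f"SUVmax {suv_max}: voxels with SUV >= {cutoff} are selected")
```

For the hypothetical lesion with an SUVmax of 5.0, the cutoff falls to about 2.05. That is below the fixed SUV2.5 threshold and inside the liver SUVmean range of 1.3 to 3.0 accepted in quality control, which illustrates how a relative threshold on a low-avidity lesion can flood into tissue with physiologic uptake.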
Although the 41max method is recommended for MTV segmentation and has been used in several lymphoma studies, this method requires extensive manual adaptation, which is time-consuming and more susceptible to interobserver variation (13,15,19). Additionally, the recommendation for 41max is based on solid malignancies rather than disseminated diseases such as cHL, and 41max has not been compared directly to a fixed threshold of SUV4.0 (27–29). Therefore, this recommendation should be reconsidered for cHL.
CONCLUSION
For PET/CT segmentation in cHL, we showed a high correlation among MTV and most intensity and dissemination features derived with different segmentation methods, except for dissemination features addressing differences in volume and SUVmax/peak. The prognostic performance of all features is comparable among the methods. The SUV4.0 method required the least manual adaptation, which is critical for future research and implementation in clinical practice. Although segmentation with SUV4.0 often missed small lesions with low 18F-FDG avidity, which may in particular affect dissemination features such as Dmax, this seemed not to influence the prognostic performance of most features, including Dmax. However, to be conclusive about recommending SUV4.0 for cHL segmentation, the prognostic importance of small lesions with low uptake should be evaluated in a larger cohort of cHL patients with more progression events.
DISCLOSURE
This work was financially supported by SHOW (Dutch Foundation of hematooncologic research, a nonprofit donation fund of Amsterdam UMC). There is no financial support for this work that could have influenced the outcomes described in the article. Ronald Boellaard is a scientific advisor and chair of the EARL accreditation program. Marie José Kersten is a consultant for BMS/Celgene, Kite/Gilead, Miltenyi Biotech, Novartis, and Takeda and has received honoraria from Kite/Gilead, Novartis, and Roche as well as research funding from Kite/Gilead, Takeda. Craig H. Moskowitz is an advisor for and received research funding from Celgene, Genentech, Merck, and Seattle Genetics. Alison J. Moskowitz is a consultant for Takeda, Imbrium Therapeutics, Janpix, Merck, and Seattle Genetics and has received research funding from Incyte, Merck, Seattle Genetics, ADC Therapeutics, Beigene, Miragen, and Bristol-Myers Squibb. Josée M. Zijlstra has received research funding from Takeda.
KEY POINTS
QUESTION: Which segmentation method provides the best delineation and completeness of lesion selection with the least manual adaptation in scans of cHL patients, and what is the influence of the segmentation method on the prognostic value of MTV, intensity, and dissemination radiomics features?
PERTINENT FINDINGS: Segmentation with a fixed threshold of SUV4.0 required the least manual adaptation, with SUV2.5 resulting in the most complete selection of all lesions. The prognostic performance of features was comparable per segmentation method, and there was a high correlation for MTV and intensity features, but not for all dissemination features, assessed with the different methods.
IMPLICATIONS FOR PATIENT CARE: Semiautomated estimation of MTV, intensity, and dissemination radiomics features in cHL patients is feasible using a method with a fixed threshold.
ACKNOWLEDGMENTS
We thank the patients and collaborating investigators who kindly supplied their data.
Footnotes
Published online Jan. 6, 2022.
© 2022 by the Society of Nuclear Medicine and Molecular Imaging.
Received for publication August 17, 2021.
Revision received December 28, 2021.