Deep-Learning 18F-FDG Uptake Classification Enables Total Metabolic Tumor Volume Estimation in Diffuse Large B-Cell Lymphoma

Total metabolic tumor volume (TMTV), calculated from 18F-FDG PET/CT baseline studies, is a prognostic factor in diffuse large B-cell lymphoma (DLBCL) whose measurement requires the segmentation of all malignant foci throughout the body. No consensus currently exists regarding the most accurate approach for such segmentation. Further, all methods still require extensive manual input from an experienced reader. We examined whether an artificial intelligence–based method could estimate TMTV with a comparable prognostic value to TMTV measured by experts. Methods: Baseline 18F-FDG PET/CT scans of 301 DLBCL patients from the REMARC trial (NCT01122472) were retrospectively analyzed using a prototype software (PET Assisted Reporting System [PARS]). An automated whole-body high-uptake segmentation algorithm identified all 3-dimensional regions of interest (ROIs) with increased tracer uptake. The resulting ROIs were processed using a convolutional neural network trained on an independent cohort and classified as nonsuspicious or suspicious uptake. The PARS-based TMTV (TMTVPARS) was estimated as the sum of the volumes of ROIs classified as suspicious uptake. The reference TMTV (TMTVREF) was measured by 2 experienced readers using independent semiautomatic software. The TMTVPARS was compared with the TMTVREF in terms of prognostic value for progression-free survival (PFS) and overall survival (OS). Results: TMTVPARS was significantly correlated with the TMTVREF (ρ = 0.76; P < 0.001). Using PARS, an average of 24 regions per subject with increased tracer uptake was identified, and an average of 20 regions per subject was correctly identified as nonsuspicious or suspicious, yielding 85% classification accuracy, 80% sensitivity, and 88% specificity, compared with the TMTVREF region. 
Both TMTV results were predictive of PFS (hazard ratio, 2.3 and 2.6 for TMTVPARS and TMTVREF, respectively; P < 0.001) and OS (hazard ratio, 2.8 and 3.7 for TMTVPARS and TMTVREF, respectively; P < 0.001). Conclusion: TMTVPARS was consistent with that obtained by experts and displayed a significant prognostic value for PFS and OS in DLBCL patients. Classification of high-uptake regions using deep learning for rapidly discarding physiologic uptake may considerably simplify TMTV estimation, reduce observer variability, and facilitate the use of TMTV as a predictive factor in DLBCL patients.

Total metabolic tumor volume (TMTV) derived from 18F-FDG PET/CT baseline studies is a promising prognostic factor in diffuse large B-cell lymphoma (DLBCL) (1,2) and other types of lymphoma (3-5). DLBCL is the most frequent non-Hodgkin lymphoma, accounting for about 30%-40% of non-Hodgkin lymphoma cases worldwide. Although the prognosis of DLBCL can be improved with immunochemotherapy, more than 30% of patients are refractory or relapse after first-line treatment, with a poor outcome (6,7). Therefore, there is a need for early identification of high-risk patients who could benefit from intensive or novel therapies. Unfortunately, the role of current prognostic factors such as the International Prognostic Index (8), Revised International Prognostic Index (9), and National Comprehensive Cancer Network International Prognostic Index (10), which are based on tumor burden surrogates, is limited. Thus, baseline TMTV, which estimates the total metabolic tumor burden at diagnosis, has been proposed as an alternative prognostic tool for early risk stratification.
TMTV is not yet routinely used in clinical lymphoma patient management, in part because of a lack of consensus in the literature. Several methods have been proposed to calculate TMTV (11-13), and the cutoffs reported to detect high-risk patients differed among methods and studies. However, recent studies have suggested that, despite these differences, most methods yielded similar accuracy in predicting patient prognosis when applied in similar patient groups (11,12), emphasizing the strong prognostic power of baseline TMTV.
Regardless of the criteria used for delineating tumor regions, all methods for deriving TMTV require extensive and time-consuming manual input from an experienced reader. The reader either manually segments the tumor regions or, more commonly, uses an automated method to detect all regions with increased uptake and then manually eliminates the regions of physiologic uptake and adds undetected tumor regions (13). Recently, a machine-learning algorithm using a convolutional neural network (CNN) was trained to differentiate physiologic from nonphysiologic uptake regions in whole-body 18F-FDG PET scans acquired from an unselected population of more than 600 patients, half of whom were lymphoma patients with different disease subtypes (14,15). This CNN achieved a high degree of accuracy in characterizing increased tracer uptake in the whole body as physiologic or nonphysiologic. Such automated identification of nonphysiologic regions would facilitate TMTV measurement and clinical adoption. This study therefore sought to assess the ability of this CNN to identify regions from which TMTV could be automatically calculated and to evaluate the ability of the resulting TMTV to predict patient outcome in a large group of DLBCL patients included in an international phase III trial in which TMTV has already been demonstrated to be a strong predictor of 4-y progression-free survival (PFS) and overall survival (OS). To evaluate the CNN performance, regions with elevated tracer uptake automatically identified as physiologic or suspicious were compared with regions attributed to suspicious uptake by an expert reader using a semiautomatic method.

Patients
Patients from an ancillary study (16,17) of the REMARC trial (NCT01122472) were retrospectively analyzed. This trial is a phase III study that was designed to assess the efficacy of lenalidomide versus placebo in responding elderly DLBCL patients (60-80 y old) treated with the standard first-line rituximab, cyclophosphamide, doxorubicin hydrochloride (hydroxydaunorubicin), vincristine sulfate, and prednisone (R-CHOP) therapy approach (18). The institutional review board approval and the informed consent of the REMARC trial included all the ancillary studies. The ancillary study involved 301 patients who underwent baseline PET/CT before R-CHOP and showed that TMTV was a strong prognosticator of outcome in patients responding to first-line chemotherapy combined with monoclonal antibody treatment.

Image Acquisition and Analysis
All baseline 18F-FDG PET/CT images from the ancillary study were collected in an anonymized DICOM format. Patients whose PET or CT DICOM series had incomplete axial slices or irregular slice intervals were excluded. PET images were expressed in SUV units, accounting for injected dose and patient body weight.
PET/CT images were analyzed using an investigational software prototype (PET Assisted Reporting System [PARS]; Siemens Medical Solutions USA, Inc.) that uses artificial intelligence. The prototype first automatically located a cylindric reference region at the center of the proximal descending aorta by applying a landmarking algorithm to the CT image (19). This region was used to determine the mean blood pool SUV and the blood pool SUV standard deviation (SD), following PERCIST recommendations (20). The 3-dimensional regions of the PET image with increased tracer uptake were identified for each subject using an automated whole-body high-uptake segmentation algorithm (multi-foci segmentation, MFS) (21). In line with the PERCIST recommendations, only the regions with SUVpeak greater than twice the mean blood pool SUV plus twice the blood pool SUV SD were included. Those regions were then further segmented using a threshold of 42% of SUVmax, and the ones with volumes below 2 cm3 were discarded. The resulting regions of interest (ROIs), called ROIPARS, were then automatically processed by a CNN. Details of the training and validation of this CNN were previously reported (15). The input of the CNN was the PET/CT data together with the set of ROIPARS sites. For each ROIPARS, the output of the CNN was the anatomic localization among a set of possible anatomic sites relevant for staging and whether the ROIPARS uptake was physiologic (e.g., due to unspecific bowel uptake, muscle activation, inflammation, infection, or bone degeneration) or suspicious (i.e., due to lymphoma). The volumes of all ROIPARS sites classified as suspicious uptake were then summed to obtain the TMTVPARS.
The CNN was also used in combination with 2 other settings of the initial high-uptake ROI segmentation: the first used an initial threshold of 2.5 SUV instead of the blood-pool-based threshold, followed by thresholding with 41% of SUVmax; the second also included ROIs with a volume between 0.1 and 2 cm3.
The TMTV obtained by 2 experienced nuclear medicine physicians in the context of a previous study (16,17) was used as a reference (TMTVREF). The TMTVREF was obtained using the semiautomatic version of the Beth Israel Fiji (ImageJ) software plugin (22), which was previously used to demonstrate the prognostic value of TMTV in various lymphoma subtypes (5,23). To calculate TMTVREF, the physician combined automated and manual steps as follows. First, volumes of interest with high uptake in the PET images were segmented using an automated method, which applied in sequence an algorithm based on component trees and shape priors (24), region growing, and a final region delineation using a threshold of 41% of the region SUVmax (25). Second, the resulting ROIs were manually reviewed by the reader, who selected only the regions corresponding to lymphoma (ROIREF), adding an ROIREF wherever a lymphoma lesion had been missed by the algorithm by drawing a prism around that lesion and applying a 41% SUVmax threshold. The volumes of all lymphoma ROIREF sites were summed to obtain the reference TMTV (TMTVREF).
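The two numeric rules in this paragraph, the blood-pool-based SUVpeak cutoff and the summation of suspicious ROI volumes, can be illustrated with a minimal Python sketch. It is not the PARS implementation: the function names, the dictionary layout, and the ROI values are hypothetical, and the blood pool SD of 0.2 is an assumed within-region value chosen only for illustration.

```python
def percist_peak_threshold(blood_pool_mean, blood_pool_sd):
    """SUVpeak cutoff: twice the mean blood pool SUV plus twice its SD."""
    return 2.0 * blood_pool_mean + 2.0 * blood_pool_sd


def tmtv_from_rois(rois, min_volume_cm3=2.0):
    """Sum the volumes (cm3) of ROIs labeled suspicious.

    ROIs with a volume below min_volume_cm3 are discarded, mirroring the
    2 cm3 rule described in the text.
    """
    return sum(
        roi["volume_cm3"]
        for roi in rois
        if roi["label"] == "suspicious" and roi["volume_cm3"] >= min_volume_cm3
    )


# Hypothetical ROIs for one patient (illustrative values only).
rois = [
    {"volume_cm3": 12.5, "label": "suspicious"},
    {"volume_cm3": 40.0, "label": "physiologic"},  # excluded from TMTV
    {"volume_cm3": 3.5, "label": "suspicious"},
    {"volume_cm3": 1.0, "label": "suspicious"},    # below 2 cm3, discarded
]

threshold = percist_peak_threshold(1.6, 0.2)  # assumed blood pool mean and SD
tmtv = tmtv_from_rois(rois)                   # 12.5 + 3.5 cm3
```

In this sketch the labels are taken as given; in the study they are produced by the CNN.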

Statistical Analysis
To evaluate the performance of the CNN classification, for each patient, each ROIPARS, having been labeled as presenting suspicious or physiologic uptake by the CNN, was compared with all the ROIREF sites of that patient taken together. The ROIPARS was considered to match the ROIREF if at least 50% of its volume overlapped with one or several ROIREF sites. ROIPARS sites classified as suspicious and matching one or several ROIREF sites were considered true-positives, ROIPARS sites classified as physiologic and matching one or several ROIREF sites were considered false-negatives, ROIPARS sites classified as physiologic and not matching any ROIREF sites were considered true-negatives, and ROIPARS sites classified as suspicious and not matching any ROIREF sites were considered false-positives. The sensitivity, specificity, and accuracy of the uptake classification were calculated. The performance of the CNN classification was also assessed when a minimum overlap of 25% or 75% was required to consider an ROIPARS as matching the ROIREF.
To evaluate differences between TMTVPARS and TMTVREF, Bland-Altman analysis was performed. Since the Shapiro-Wilk test revealed a significant nonnormal distribution of the differences between TMTVPARS and TMTVREF (P < 0.001), the median bias and limits of agreement at the 2.5 and 97.5 percentiles were reported in the Bland-Altman plot. To assess the correlation between ranked TMTV values, the Spearman rank correlation coefficient was used. For each patient, the agreement between the patient set of ROIPARS sites classified as suspicious and the patient set of ROIREF sites was characterized using the Dice score, precision (the fraction of voxels in the set of ROIPARS sites classified as suspicious that were also present in the set of ROIREF sites), and recall (the fraction of voxels in the set of ROIREF sites that were also present in the set of ROIPARS sites classified as suspicious).
Survival analysis was performed for both TMTVPARS and TMTVREF with respect to PFS and OS. Receiver-operating-characteristic curves were used to determine TMTV cutoffs to predict the occurrence of events within 4 y for both PFS and OS, by maximizing the Youden index (sensitivity + specificity − 1). Survival functions were computed by Kaplan-Meier analyses and used to estimate survival time statistics (such as 4-y PFS rate and 4-y OS rate) for low- and high-TMTV groups. A log-rank test was used to assess whether differences between Kaplan-Meier survival curves were significant. Univariate Cox regression was used to calculate hazard ratios between survival groups. Statistical significance was set at a P value of less than 0.05. Statistical analysis was performed using R, version 3.6.1, with survivalROC, version 1.0.3, and pROC, version 1.15.3 (26).
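The matching rule can be sketched in Python as follows. This is an illustrative reading of the rule rather than the study code: each region is represented as a set of voxel coordinates (an assumption made here for simplicity), the reference set is the union of all reader-defined lymphoma regions of the patient, and the function names are hypothetical.

```python
def overlap_fraction(roi_voxels, reference_voxels):
    """Fraction of the ROI volume that lies inside the reference voxels."""
    if not roi_voxels:
        return 0.0
    return len(roi_voxels & reference_voxels) / len(roi_voxels)


def classify_outcome(roi_voxels, cnn_label, reference_voxels, min_overlap=0.5):
    """Return 'TP', 'FP', 'FN', or 'TN' for one automatically detected ROI.

    cnn_label is 'suspicious' or 'physiologic'; the ROI matches the
    reference when at least min_overlap of its volume overlaps it.
    """
    matches = overlap_fraction(roi_voxels, reference_voxels) >= min_overlap
    if cnn_label == "suspicious":
        return "TP" if matches else "FP"
    return "FN" if matches else "TN"


# Hypothetical 4-voxel ROI, half of which overlaps the reference regions.
roi = {(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 0, 3)}
reference = {(0, 0, 0), (0, 0, 1), (5, 5, 5)}

outcome_50 = classify_outcome(roi, "suspicious", reference)                    # matches at 50%
outcome_75 = classify_outcome(roi, "suspicious", reference, min_overlap=0.75)  # no match at 75%
```

The `min_overlap` parameter corresponds to the 25%, 50%, and 75% overlap requirements examined in the sensitivity analysis.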
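The voxel-level agreement measures defined above can be written compactly. The following Python sketch assumes each set of regions is given as a set of voxel indices; the function name and the example values are hypothetical, and returning 0 for an empty set is a convention chosen here, not one stated in the study.

```python
def voxel_agreement(predicted, reference):
    """Return (dice, precision, recall) for two sets of voxel indices.

    predicted: voxels in ROIs classified as suspicious by the algorithm.
    reference: voxels in the reader-defined lymphoma ROIs.
    Empty denominators yield 0.0 (a convention assumed for this sketch).
    """
    intersection = len(predicted & reference)
    total = len(predicted) + len(reference)
    dice = 2 * intersection / total if total else 0.0
    precision = intersection / len(predicted) if predicted else 0.0
    recall = intersection / len(reference) if reference else 0.0
    return dice, precision, recall


# Hypothetical voxel indices: 2 of 4 predicted voxels lie in the reference.
predicted = {1, 2, 3, 4}
reference = {3, 4, 5, 6, 7, 8}
dice, precision, recall = voxel_agreement(predicted, reference)
```

With these definitions, high precision combined with lower recall indicates that the automatically selected voxels mostly lie within the reference regions but cover only part of them.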
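As an illustration of cutoff selection by the Youden index, the Python sketch below scans candidate cutoffs and keeps the one maximizing sensitivity + specificity − 1. It is a simplification: the event status within 4 y is treated as a plain binary label and censoring is ignored, so it does not reproduce the R packages named above. The function name and the data are hypothetical.

```python
def youden_cutoff(values, events):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    values: TMTV per patient (cm3); events: 1 if an event occurred, else 0.
    A patient is predicted high-risk when value > cutoff.
    """
    n_pos = sum(events)
    n_neg = len(events) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both outcome classes are required")
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(values)):
        tp = sum(1 for v, e in zip(values, events) if e and v > cutoff)
        tn = sum(1 for v, e in zip(values, events) if not e and v <= cutoff)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Hypothetical TMTV values (cm3) and event indicators for 6 patients.
values = [50, 120, 200, 300, 400, 500]
events = [0, 0, 0, 1, 0, 1]
cutoff, j = youden_cutoff(values, events)
```

Patients with values above the selected cutoff would form the high-TMTV group used in the Kaplan-Meier and Cox analyses.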

RESULTS
In total, 280 patients from 124 centers were included in the analysis. Patient characteristics are reported in Table 1. All received first-line treatment with R-CHOP and were responders at the time of inclusion in the trial; 142 then received lenalidomide as maintenance, and 138 received placebo. After a median follow-up of 5 y, 86 patients presented with a PFS event and 51 patients had an OS event; the 4-y survival rates were 69% for PFS and 83% for OS. The 4-y survival rates were comparable to those of the entire trial.
PET/CT images were acquired using different scanner models from different vendors as summarized in Supplemental Table 1 (supplemental materials are available at http://jnm.snmjournals.org). The delay between injection and acquisition time was 71.7 ± 14.1 min (mean ± SD). The SUVmean in the proximal descending aorta cylindric region was 1.6 ± 0.5 (mean ± SD across subjects), resulting in an SUVpeak threshold of 3.6 ± 1.2 for detecting ROIs with increased tracer uptake.
The results below are described for the PERCIST-based setting of the initial high-uptake ROI segmentation, whereas changes observed with other settings are reported in Supplemental Tables 2-4.

Uptake Classification
In total, 6,737 ROIPARS sites exhibiting increased uptake were obtained from the 280 subjects. There were 7,996 ROIREF sites in the 280 subjects. Descriptive statistics for the number of ROIPARS and ROIREF sites per subject are summarized in Supplemental Table 5. Among the 6,737 ROIPARS sites with increased uptake, 2,831 (42%) were classified as having suspicious uptake by the CNN.
When compared with the ROIREF sites obtained by the experienced reader, the identification of the ROIPARS sites with suspicious uptake by the CNN yielded 3,317 true-negatives, 2,399 true-positives, 589 false-negatives, and 432 false-positives. Corresponding sensitivity was 80%, specificity was 88%, and accuracy was 85%. (Table 2).
There was a significant correlation between ranked TMTV estimates (ρ = 0.76; P < 0.001). The median Dice score across all patients between the patient set of ROIPARS sites labeled as suspicious and the patient set of ROIREF sites was 0.73 (IQR, 0.33-0.86), the median recall of the patient set of ROIPARS sites labeled as suspicious with respect to the patient set of ROIREF sites was 0.62 (IQR, 0.20-0.81), and the median precision was 0.96 (IQR, 0.86-0.99). The Bland-Altman plot comparing TMTVPARS and TMTVREF (Fig. 2) showed wide limits of agreement.
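The reported percentages follow directly from the four counts, as the short Python check below shows. Only the counts come from the text; the variable names are chosen for illustration.

```python
# Counts reported in the text for the 6,737 automatically detected ROIs.
tp, tn, fp, fn = 2399, 3317, 432, 589

total = tp + tn + fp + fn              # all classified ROIs
sensitivity = tp / (tp + fn)           # suspicious ROIs correctly flagged
specificity = tn / (tn + fp)           # physiologic ROIs correctly discarded
accuracy = (tp + tn) / total           # overall fraction correctly classified
suspicious_fraction = (tp + fp) / total  # ROIs labeled suspicious by the CNN

print(f"{sensitivity:.1%} {specificity:.1%} {accuracy:.1%}")
```

The sum of true-positives and false-positives (2,831) also matches the number of ROIs reported as classified suspicious.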

Survival Analysis
The area under the receiver-operating-characteristic curve for predicting the 4-y PFS was 0.63 for TMTVPARS and 0.69 for TMTVREF (Fig. 3). The optimal cutoffs for predicting the 4-y PFS were 171 cm3 for TMTVPARS and 242 cm3 for TMTVREF.
Kaplan-Meier survival curves are shown in Figure 4. The 4-y PFS rates were 79% and 54% for the low- and high-TMTVPARS groups and 83% and 55% for the low- and high-TMTVREF groups, respectively. The log-rank test indicated a significantly longer PFS time in the low-TMTV patient group for both TMTV estimation methods (P < 0.001 for TMTVPARS and TMTVREF). Cox regression for PFS resulted in hazard ratios (high-TMTV group vs. low-TMTV group) of 2.3 (95% confidence interval, 1.5-3.6; P < 0.001 for Wald test) for TMTVPARS and 2.6 (95% confidence interval, 1.6-4.1; P < 0.001) for TMTVREF. The survival results are summarized in Table 3.
For the 4-y OS, the area under the receiver-operating-characteristic curve was 0.65 for TMTVPARS and 0.68 for TMTVREF. The optimal TMTV cutoffs for predicting the 4-y OS were 148 cm3 for TMTVPARS and 223 cm3 for TMTVREF. The 4-y OS rates were 90% and 74% for the low- and high-TMTVPARS groups and 93% and 74% for the low- and high-TMTVREF groups, respectively. The log-rank test revealed a significantly longer OS time in the low-TMTV patient group for both TMTV estimation methods (P < 0.001 for TMTVPARS and TMTVREF). Cox regression for OS resulted in hazard ratios (high-TMTV group vs. low-TMTV group) of 2.8 (95% confidence interval, 1.6-5.1; P < 0.001) for TMTVPARS and 3.7 (95% confidence interval, 1.9-7.2; P < 0.001) for TMTVREF.
The sensitivity, specificity, negative predictive value, positive predictive value, and accuracy for predicting the occurrence of survival events within 4 y, determined at the optimal TMTV cutoff for each method, are reported in Supplemental Table 7 and were similar for both PFS and OS.

DISCUSSION
Our main result was that a fully automated method combining a region delineation method based on PERCIST recommendations and a CNN-based algorithm to distinguish between regions with elevated physiologic uptake and nonphysiologic regions was able to generate, in a uniform population of DLBCL patients, TMTV values predictive of 4-y PFS and OS with an accuracy comparable to that obtained when TMTV is calculated by manual selection of the tumor regions by medical experts. Although the CNN-based algorithm was trained using images obtained on only 2 scanner models from the same vendor, the algorithm was highly accurate in classifying increased uptake in patients from an international trial involving 124 centers that obtained images on different scanner models from different vendors and with variable reconstruction settings. This accuracy underlines the robustness of the CNN despite different image quality. Moreover, this algorithm was not originally trained for TMTV computation and outcome prediction and was developed with data from patients with different lymphoma subtypes and lung cancer who underwent PET at baseline and for response assessment. However, we showed that the algorithm was successful in a group of patients with a homogeneous lymphoma subtype scanned at baseline, enabling the identification of a TMTV cutoff separating high-risk from low-risk patients and predicting prognosis with accuracy comparable to that of the reference method. No subject was excluded because of failure of the initial high-uptake ROI segmentation, which identified at least one high-uptake region in all subjects. Furthermore, comparable results were obtained when different settings of the initial high-uptake ROI segmentation were applied using a lower threshold (2.5 SUV) than the PERCIST-recommended blood-pool-based threshold (Supplemental Tables 2 and 3), suggesting the robustness of the algorithm to the initial segmentation results.
Additionally, the accuracy of the high-uptake ROI classification was not substantially impacted when a different level of overlap was required to consider an ROI as matching the TMTVREF and when ROIs with volumes of less than 2 cm3 were included in the analysis (Supplemental Tables 4 and 6). The median TMTVPARS and the resulting cutoff were lower than those observed for TMTVREF. This finding could be due to multiple factors, including the higher initial SUV threshold used for TMTVPARS relative to the one used for TMTVREF, the manual addition of suspicious regions with low uptake in TMTVREF, regions being classified as physiologic in TMTVPARS but considered suspicious in TMTVREF, and differences in the contouring of suspicious regions between TMTVPARS and TMTVREF. However, the ability of the TMTVPARS estimates to be predictive of PFS and OS despite involving a TMTV range different from that of TMTVREF is consistent with what has already been reported (11,12) when comparing different TMTV estimation methods. This result supports both the validity of the CNN method and the value of TMTV as a prognostic indicator.
Our study had limitations. Since there is currently no gold standard method for TMTV calculation from 18F-FDG PET/CT images (27), the reported figures of merit supporting the uptake classification performance and accuracy of the TMTV segmentation are limited to the comparison with the reference method considered in the study. Moreover, a uniform cohort of lymphoma patients was evaluated in the current study, and results may differ for different lymphoma subtypes or different cancer types.
In the present work, we evaluated a fully automated application of PARS. However, PARS was initially intended to be used in a supervised manner, allowing the reader to correct for potentially misclassified regions when appropriate. In particular, pitfalls in PET/CT image quality, such as misalignment due to motion or image artifacts, may influence the classification output of the CNN algorithm, and the results should be validated by an expert. This is especially true when the labeling results are used to derive a prognostic index such as TMTV that can be used to stratify the risk and guide personalized therapy. Nevertheless, this approach could be used by expert readers to efficiently estimate TMTV, as the deep-learning-based method is able to automatically identify several relevant suspicious uptake sites and automatically discard physiologic uptake sites, with the expert only having to correct the potential improper classification of a limited number of regions per subject, requiring limited user interaction and potentially reducing interreader variability. This approach may introduce bias in the TMTV estimation process by relying on pregenerated results. However, this risk should be marginal, especially when a careful revision of the results is performed by an experienced reader.
To our knowledge, this was the first study showing that an artificial intelligence method can generate a TMTV value prognostic of outcome in a large series of patients with DLBCL, with results comparable to other currently used methodologies. Other machine-learning-based approaches for TMTV estimation in lymphoma patients, including some involving CNN, are being developed and evaluated (28). The automated method for TMTV segmentation assessed in the present study combined a region-delineation method based on PERCIST recommendations and a deep-learning-based classification scheme for rapidly discarding physiologic uptake. Further efforts toward developing a stricter definition of TMTV, standardizing volume-segmentation methods, and establishing guidelines for the inclusion of tumor-bearing anatomic regions are ongoing, and these will constitute a prerequisite for the optimization of a complete automated method (13).

CONCLUSION
We showed that TMTV can be estimated fully automatically using a deep-learning approach. The resulting TMTV was consistent with that obtained by independent experts and showed significant prognostic value for PFS and OS in a large cohort of DLBCL subjects.

PERTINENT FINDINGS: In a cohort of 280 DLBCL patients from the REMARC trial, a deep-learning algorithm could classify regions of interest with elevated uptake in 18F-FDG PET/CT as physiologic or suspicious in good agreement with expert human reader assessment. By aggregating the regions of interest classified as suspicious uptake by the deep-learning algorithm, the automated TMTV estimates were significantly predictive of PFS and OS.
IMPLICATIONS FOR PATIENT CARE: Estimation of TMTV with an automated method using deep learning may contribute to reproducible and accurate identification of high-risk patients with DLBCL.