Abstract
Reliable performance of PET segmentation algorithms on clinically relevant tasks is required for their clinical translation. However, these algorithms are typically evaluated using figures of merit (FoMs) that are not explicitly designed to correlate with clinical task performance. Such FoMs include the Dice similarity coefficient (DSC), the Jaccard similarity coefficient (JSC), and the Hausdorff distance (HD). The objective of this study was to investigate whether evaluating PET segmentation algorithms using these task-agnostic FoMs yields interpretations consistent with evaluation on clinically relevant quantitative tasks. Methods: We conducted a retrospective study to assess the concordance in the evaluation of segmentation algorithms using the DSC, JSC, and HD and on the tasks of estimating the metabolic tumor volume (MTV) and total lesion glycolysis (TLG) of primary tumors from PET images of patients with non–small cell lung cancer. The PET images were collected from the American College of Radiology Imaging Network 6668/Radiation Therapy Oncology Group 0235 multicenter clinical trial data. The study was conducted in 2 contexts: (1) evaluating conventional segmentation algorithms, namely those based on thresholding (SUVmax40% and SUVmax50%), boundary detection (Snakes), and stochastic modeling (Markov random field–Gaussian mixture model); (2) evaluating the impact of network depth and loss function on the performance of a state-of-the-art U-net–based segmentation algorithm. Results: Evaluation of conventional segmentation algorithms based on the DSC, JSC, and HD showed that SUVmax40% significantly outperformed SUVmax50%. However, SUVmax40% yielded lower accuracy on the tasks of estimating MTV and TLG, with a 51% and 54% increase, respectively, in the ensemble normalized bias. Similarly, the Markov random field–Gaussian mixture model significantly outperformed Snakes on the basis of the task-agnostic FoMs but yielded a 24% increased bias in estimated MTV. For the U-net–based algorithm, our evaluation showed that although the network depth did not significantly alter the DSC, JSC, and HD values, a deeper network yielded substantially higher accuracy in the estimated MTV and TLG, with a decreased bias of 91% and 87%, respectively. Additionally, whereas there was no significant difference in the DSC, JSC, and HD values for different loss functions, up to a 73% and 58% difference in the bias of the estimated MTV and TLG, respectively, existed. Conclusion: Evaluation of PET segmentation algorithms using task-agnostic FoMs could yield findings discordant with evaluation on clinically relevant quantitative tasks. This study emphasizes the need for objective task-based evaluation of image segmentation algorithms for quantitative PET.
Keywords
- task-based evaluation
- multicenter clinical trial
- segmentation
- quantitative imaging
- deep learning
- artificial intelligence
PET-derived quantitative metrics, such as tumor volumetric and radiomic features, are showing strong promise in multiple oncologic applications (1–3). Reliable quantification of these features requires accurate segmentation of tumors on the PET images. To address this need, multiple computer-aided image segmentation algorithms have been developed (4), including those based on deep learning (DL) (5–8). Clinical translation of these image segmentation algorithms requires objectively evaluating them with patient data.
Medical images are acquired for specified clinical tasks; thus, it is important that the performance of imaging and image-analysis algorithms be objectively assessed on those tasks. In this context, strategies have been proposed for task-based assessment of image quality (9–12). However, imaging algorithms, including those based on DL, are often evaluated using figures of merit (FoMs) that are not explicitly designed to measure clinical task performance (11). Recent studies conducted specifically in the context of evaluating image-denoising algorithms showed that task-agnostic FoMs may yield interpretations that are inconsistent with evaluation on clinical tasks (13–17). For example, in Yu et al. (17), a DL-based denoising algorithm for myocardial perfusion SPECT appeared significantly superior on the basis of the structural similarity index measure and mean squared error but yielded no improvement on the clinical task of detecting myocardial perfusion defects.
Similar to image denoising, algorithms for image segmentation are almost always evaluated using FoMs that are not explicitly designed to quantify clinical task performance (5,18–21). These FoMs, including the Dice similarity coefficient (DSC), the Jaccard similarity coefficient (JSC), and the Hausdorff distance (HD) (4), quantify some measure of similarity between the predicted segmentation and a reference standard such as manual delineation. For example, the DSC measures spatial overlap between the predicted segmentation and reference standard. A higher value of DSC is typically used to infer more accurate performance. However, it is unclear how these task-agnostic FoMs correlate with performance on clinically relevant tasks.
Our objective was to investigate whether evaluating PET segmentation algorithms using task-agnostic FoMs leads to interpretations that are consistent with evaluation based on clinical task performance. Performing this investigation with patient data in a multicenter setting is highly desirable because such a study offers the ability to model variabilities in both patient population and clinical scanner configurations. Toward this goal, we conducted a retrospective study using data from the American College of Radiology Imaging Network (ACRIN) 6668/Radiation Therapy Oncology Group (RTOG) 0235 multicenter clinical trial (22,23). In this trial, patients with stage IIB/III non–small cell lung cancer were imaged with 18F-FDG PET/CT studies. In the study of non–small cell lung cancer, there is a strong interest in investigating whether early changes in tumor metabolism can help predict therapy response (24). Although most studies have focused on SUV-based metrics, the findings have been inconsistent (24,25), motivating the need for new and improved metrics. In this context, metabolic tumor volume (MTV) and total lesion glycolysis (TLG) are showing strong promise as prognostic biomarkers in multiple studies (3,26,27). As introduced above, computing these features requires tumor segmentation. Thus, our study was designed to assess the concordance in evaluating various image segmentation algorithms using task-agnostic metrics (DSC, JSC, and HD) versus on the clinically relevant tasks of estimating the MTV and TLG. Initial results of this study were presented in brief previously (28); here, we provide a detailed description of the methods and study design, provide new findings, and conduct comprehensive analyses of the results.
MATERIALS AND METHODS
Study Population
This retrospective study of existing data was approved by the institutional review board, which waived the requirement to obtain informed consent. Deidentified 18F-FDG PET/CT images of 225 patients with inoperable stage IIB/III locally advanced non–small cell lung cancer were collected from the ACRIN 6668/RTOG 0235 multicenter clinical trial (22,23). The images were collected from The Cancer Imaging Archive database (29). Baseline PET/CT scans were acquired before curative-intent chemoradiotherapy for each patient. Demographics and clinical characteristics of the patient population are summarized in Supplemental Table 1 (supplemental materials are available at http://jnm.snmjournals.org). A standardized imaging protocol was detailed by Machtay et al. (23). Briefly, an 18F-FDG dose ranging from 370 to 740 MBq was administered, with image acquisition beginning 50–70 min later and including the body from the upper–mid neck to proximal femurs. The PET images were acquired from 12 ACRIN-qualified clinical scanners (30), including GE Healthcare Discovery LS/ST/STE/RX, GE Healthcare Advance, Philips Allegro/Guardian, and CTI PET Systems (marketed as Siemens scanners): models 1023/1024/1062/1080/1094. The image reconstruction procedure compensated for attenuation, scatter, randoms, normalization, decay, and dead time. Details of the reconstruction protocol for each PET scanner are provided in Supplemental Table 2.
Data Curation
Evaluation of PET segmentation algorithms required knowledge of true tumor boundaries or a surrogate for ground truth, such as tumor delineations performed by an expert human reader. For this purpose, a board-certified nuclear medicine physician with more than 10 y of experience reading PET scans was tasked with defining the boundary of the primary tumor for each patient (Fig. 1). The physician was instructed to locate the primary tumor by carefully reviewing the coregistered PET/CT images along coronal, sagittal, and transverse planes and then using an edge-detection tool (MIM Encore 6.9.3; MIM Software Inc.) to obtain an initial boundary of the primary tumor. The physician was informed explicitly about potential errors in this initial boundary and was thus advised to review this boundary carefully and make any modifications as needed. The task of segmenting the tumors in the whole dataset was split into multiple sessions to avoid reader fatigue. At the end of this process, we had expert-defined segmentations for the primary tumors in the 225 PET scans in our dataset.
FIGURE 1. Workflow to obtain manual segmentation of primary tumor (arrow) for each patient. MIM = MIM Encore 6.9.3.
Consideration of Conventional Computer-Aided Image Segmentation Algorithms
Conventional computer-aided PET segmentation algorithms are typically categorized into those based on thresholding, boundary detection, and stochastic modeling (4). From these 3 categories, respectively, we selected SUVmax thresholding (SUVmax40% and SUVmax50%) (31), Snakes (32), and the Markov random field–Gaussian mixture model (MRF-GMM) (33). A detailed description of these algorithms is provided in the supplemental materials (31–33).
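To make the thresholding category concrete, the following is a minimal sketch of fixed-percentage SUVmax thresholding, assuming the SUV image is available as a NumPy array and that a region of interest around the primary tumor (e.g., a user-placed bounding box) has already been defined; the function name and ROI handling are illustrative and are not taken from the referenced implementations (31).

```python
import numpy as np

def suvmax_threshold_segmentation(suv_volume, roi_mask, fraction=0.40):
    """Fixed-percentage SUVmax thresholding within a region of interest (sketch).

    suv_volume : 3-dimensional NumPy array of SUV values.
    roi_mask   : boolean array of the same shape restricting the search to a
                 region around the primary tumor (illustrative assumption).
    fraction   : 0.40 for SUVmax40%, 0.50 for SUVmax50%.
    """
    suv_max = suv_volume[roi_mask].max()          # maximum uptake within the ROI
    threshold = fraction * suv_max                # fixed-percentage threshold
    segmentation = (suv_volume >= threshold) & roi_mask
    return segmentation
```

For SUVmax40% and SUVmax50%, the fraction argument would be set to 0.40 and 0.50, respectively.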
Consideration of DL-Based Image Segmentation Algorithm
We next considered the evaluation of a state-of-the-art U-net–based algorithm (5,8,34,35). A detailed description of the network architecture is provided in Supplemental Figure 1. When DL-based algorithms are developed and evaluated, common factors known to impact the performance include the choice of network depth (36), network width (37), loss function (38), and data preprocessing and augmentation strategies. In this study, we focused on investigating whether evaluating the impact of network depth and loss function using the task-agnostic FoMs yields inferences that are consistent with evaluation on the tasks of estimating MTV and TLG.
Network Training
The U-net–based algorithm was implemented to segment the primary tumor on 3-dimensional PET images on a per-slice basis. During training, 2-dimensional PET images of 180 patients with the corresponding surrogate ground truth (tumor delineations performed by the physician) were input into the U-net–based algorithm. The network was trained to minimize a loss function between the true and predicted segmentations using the Adam optimization method (39). The loss function will be specified in each experiment described below. Network hyperparameters, including parameters of activation function and dropout probability, were optimized via 5-fold cross-validation on the training dataset. The final optimized U-net–based algorithm was then evaluated on the remaining independent 45 patients from the same cohort. There was no overlap between the training and test sets.
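As a rough illustration of this training setup, the sketch below shows per-slice training of a 2-dimensional segmentation network with the Adam optimizer in PyTorch. The batch size, learning rate, number of epochs, and function names are assumptions for illustration only; the hyperparameters used in the study were selected via 5-fold cross-validation as described above.

```python
import torch
from torch.utils.data import DataLoader

def train_unet(model, train_dataset, loss_fn, epochs=100, lr=1e-4,
               batch_size=16, device="cuda"):
    """Illustrative per-slice training loop for a 2-D U-net (not the study code).

    train_dataset yields (pet_slice, reference_mask) pairs of shape (1, H, W),
    where reference_mask is the physician-defined delineation.
    loss_fn(pred, target) is the segmentation loss (e.g., BCE, Dice, or a
    weighted combination; see Eqs. 1-3).
    """
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # Adam optimization (39)
    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

    for epoch in range(epochs):
        model.train()
        for pet_slice, reference_mask in loader:
            pet_slice = pet_slice.to(device)
            reference_mask = reference_mask.to(device)
            pred = model(pet_slice)            # predicted per-voxel probabilities
            loss = loss_fn(pred, reference_mask)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```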
Configuring the U-Net–Based Algorithm with Different Network Depths
We varied the network depth by setting the number of paired blocks of convolutional layers (supplemental materials) in the encoder and decoder to 2, 3, 4, and 5. The detailed network architecture that consisted of 2 paired blocks is provided in Supplemental Table 3. For each choice of depth, the network was trained to minimize a binary cross-entropy (BCE) loss between the true and predicted segmentations, denoted by $f$ and $\hat{f}$, respectively. The number of voxels in the PET image is denoted by $N$, and $f_n$ and $\hat{f}_n$ denote the values of the $n$th voxel of the true and predicted segmentations. The BCE loss is given by

$$\mathcal{L}_{\mathrm{BCE}}(f,\hat{f}) = -\frac{1}{N}\sum_{n=1}^{N}\left[f_n \log \hat{f}_n + (1-f_n)\log\left(1-\hat{f}_n\right)\right]. \qquad \text{(Eq. 1)}$$
The network with each depth choice was independently trained and cross-validated on the training dataset. After training, each network was evaluated on the 45 test patients.
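For clarity, a direct NumPy transcription of the BCE loss in Eq. 1 is sketched below; in practice, a framework-native implementation (e.g., torch.nn.BCELoss) would be used during training, and the epsilon clipping is a numerical safeguard that is not part of Eq. 1.

```python
import numpy as np

def bce_loss(f_true, f_pred, eps=1e-7):
    """Binary cross-entropy loss of Eq. 1, averaged over the N voxels.

    f_true : binary reference segmentation (0 or 1 per voxel).
    f_pred : predicted per-voxel probabilities in (0, 1).
    eps    : small constant to avoid log(0); numerical safeguard only.
    """
    f_pred = np.clip(f_pred, eps, 1.0 - eps)
    return -np.mean(f_true * np.log(f_pred)
                    + (1.0 - f_true) * np.log(1.0 - f_pred))
```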
Configuring the U-Net–Based Algorithm with Different Loss Functions
A commonly used loss function in DL-based segmentation algorithms is the combined Dice and BCE loss, which leverages the flexibility of Dice loss for handling class-imbalance problems and the use of BCE loss for curve smoothing (36). In this loss function, the weight of BCE loss is controlled by a hyperparameter, denoted by λ. We investigated whether evaluating the impact of different values of λ on the performance of the U-net–based algorithm using the task-agnostic and task-based FoMs yields consistent interpretations.
The Dice loss is denoted by $\mathcal{L}_{\mathrm{Dice}}$, such that

$$\mathcal{L}_{\mathrm{Dice}}(f,\hat{f}) = 1 - \frac{2\sum_{n=1}^{N} f_n \hat{f}_n}{\sum_{n=1}^{N} f_n + \sum_{n=1}^{N} \hat{f}_n}. \qquad \text{(Eq. 2)}$$

The combined Dice and BCE loss is defined as

$$\mathcal{L}_{\mathrm{Dice+BCE}}(f,\hat{f}) = \mathcal{L}_{\mathrm{Dice}}(f,\hat{f}) + \lambda\,\mathcal{L}_{\mathrm{BCE}}(f,\hat{f}), \qquad \text{(Eq. 3)}$$

where the term $\mathcal{L}_{\mathrm{BCE}}(f,\hat{f})$ is defined in Equation 1. In this experiment, we considered 6 different values of λ ranging from 0 to 1. We fixed the depth of the network by considering 3 paired blocks of convolutional layers in the encoder and decoder. For each value of λ, the network was independently trained and cross-validated on the same training dataset. Each trained network was then evaluated on the 45 test patients.
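The sketch below illustrates Eqs. 2 and 3, reusing the bce_loss function from the sketch after Eq. 1; the additive λ-weighted form shown here is one common way to combine the two terms and is assumed for illustration.

```python
import numpy as np

def dice_loss(f_true, f_pred, eps=1e-7):
    """Dice loss of Eq. 2: 1 minus the soft Dice overlap between the masks."""
    intersection = np.sum(f_true * f_pred)
    return 1.0 - 2.0 * intersection / (np.sum(f_true) + np.sum(f_pred) + eps)

def combined_dice_bce_loss(f_true, f_pred, lam):
    """Combined loss of Eq. 3: Dice loss plus a lambda-weighted BCE term.

    Assumes the additive weighting shown in Eq. 3; lam = 0 reduces to a pure
    Dice loss, and larger lam increases the weight of the BCE term.
    Reuses bce_loss() from the earlier sketch.
    """
    return dice_loss(f_true, f_pred) + lam * bce_loss(f_true, f_pred)
```

Setting lam to 0 recovers the pure Dice loss, which corresponds to the λ = 0 configuration evaluated in the Results.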
Evaluation FoMs
Task-Agnostic FoMs
The widely used task-agnostic FoMs of DSC, JSC, and HD were used in this study. The DSC and JSC, as defined in Taha and Hanbury (40), measure the spatial overlap between the true and predicted segmentations. The values of both DSC and JSC lie between 0 and 1, and a higher value implies more accurate performance. The HD quantifies the shape similarity between the true and predicted segmentations, and a lower value implies more accurate performance. The values of DSC, JSC, and HD are reported as mean and 95% CI. Paired sample t-tests were performed to assess whether significant differences existed.
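For reference, the following sketch computes these 3 task-agnostic FoMs for a pair of binary masks using NumPy and SciPy. The HD is computed here in voxel units; a millimeter-scale HD would additionally require scaling the voxel coordinates by the voxel spacing.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dsc(a, b):
    """Dice similarity coefficient between two binary masks."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def jsc(a, b):
    """Jaccard similarity coefficient between two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union

def hausdorff(a, b):
    """Symmetric Hausdorff distance between the voxel coordinates of two masks."""
    pts_a = np.argwhere(a)
    pts_b = np.argwhere(b)
    return max(directed_hausdorff(pts_a, pts_b)[0],
               directed_hausdorff(pts_b, pts_a)[0])
```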
Task-Based FoMs
An essential criterion in validating algorithms to extract quantitative imaging metrics such as MTV and TLG is that the measurements obtained with the algorithm are accurate (41,42), because an algorithm that yields biased measurements would not correctly reflect the underlying pathophysiology. In a population, the bias can often vary on the basis of the true value and thus should be quantified over the entire measurable range of values to provide a more complete measure of accuracy (43). Ensemble normalized bias, defined as the bias averaged over the distribution of true values, helps address this issue and provides a summarized FoM for accuracy (44,45). This FoM was thus used in this study. Detailed definitions of the ensemble normalized bias are provided in the supplemental materials (41,42,44,45).
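As a rough sketch of how such a summary FoM can be computed over a patient population, the functions below implement one plausible form of the ensemble normalized bias and the per-patient absolute normalized error (aNE) used later in the secondary analyses. The exact definitions used in this study are those given in the supplemental materials (44,45), which may differ in detail.

```python
import numpy as np

def ensemble_normalized_bias(estimated, true):
    """One plausible form of the ensemble normalized bias for MTV or TLG.

    estimated, true : 1-D arrays of per-patient estimated and reference values.
    Each patient's error is normalized by the reference value, and the
    normalized errors are averaged over the population (illustrative form).
    """
    estimated = np.asarray(estimated, dtype=float)
    true = np.asarray(true, dtype=float)
    return np.mean((estimated - true) / true)

def absolute_normalized_error(estimated, true):
    """Per-patient absolute normalized error (aNE), as used in the secondary analyses."""
    return np.abs(np.asarray(estimated, dtype=float) - np.asarray(true, dtype=float)) \
           / np.asarray(true, dtype=float)
```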
RESULTS
Evaluation of Conventional Computer-Aided Algorithms
Figures 2A and 2B present the quantitative assessment of conventional computer-aided segmentation algorithms over the 225 patients using the task-agnostic and task-based FoMs. On the basis of DSC and JSC, SUVmax40% significantly outperformed SUVmax50% (P < 0.05). However, we observed that SUVmax40% yielded increased ensemble normalized bias in the estimated MTV and TLG of 51% and 54%, respectively, indicating a much less accurate performance on the clinically relevant quantitative tasks. Similarly, the MRF-GMM significantly outperformed Snakes on the basis of the DSC, JSC, and HD (P < 0.05) but revealed a 24% increased ensemble normalized bias in the estimated MTV.
FIGURE 2. Quantitative assessment of concordance in evaluation of considered conventional PET segmentation algorithms using task-agnostic FoMs of DSC, JSC, and HD (A) and on tasks of estimating MTV and TLG of primary tumor (B). Comparisons of segmentations yielded by SUVmax40% vs. SUVmax50% (C) and MRF-GMM vs. Snakes (D) are shown for 2 representative patients. ens. norm. = ensemble normalized; abs. norm. = absolute normalized.
Figure 2C shows the visual comparison of segmentations yielded by SUVmax40% versus SUVmax50% for a representative patient. We observed that both algorithms yielded very similar DSC, JSC, and HD values. However, SUVmax40% yielded substantially higher absolute normalized error (aNE) in the estimated MTV and TLG. For another representative patient shown in Figure 2D, the MRF-GMM yielded higher DSC and JSC and lower HD values. However, this algorithm yielded less accurate estimates of MTV and TLG, as indicated by the higher aNEs.
Evaluating the U-Net–Based Algorithm
Impact of Network Depth Choice
Figure 3A shows the impact of varying network depth on the performance of the U-net–based algorithm, as evaluated using both the task-agnostic and the task-based FoMs on the 45 test patients. No significant difference was detected among the considered network depths on the basis of the DSC, JSC, and HD (P > 0.05). However, deeper networks yielded more accurate performance on the tasks of estimating MTV and TLG. Particularly, compared with the shallower network with 2 paired blocks of convolutional layers, the deeper network with 4 paired blocks yielded substantially lower absolute ensemble normalized bias in the estimated MTV and TLG, with a decrease of 91% and 87%, respectively. Segmentations of the shallower and deeper networks are shown for 1 representative test patient in Figure 3B. We observed that the deeper network yielded lower DSC and JSC and higher HD values but actually outperformed the shallower network on the tasks of estimating the MTV and TLG.
FIGURE 3. (A) Quantitative assessment of concordance between task-agnostic and task-based FoMs in evaluating impact of varying network depth on performance of U-net–based algorithm. (B) Comparison of segmentations yielded by deeper and shallower network for 1 representative test patient. abs. ens. norm. = absolute ensemble normalized; abs. norm. = absolute normalized.
Impact of Loss Function Choice
Figure 4A shows the assessment of concordance between task-agnostic versus task-based FoMs in evaluating the impact of varying the loss function on the performance of the U-net–based algorithm. On the basis of the DSC, JSC, and HD, there was no significant difference among the considered values of the hyperparameter λ. However, we observed substantial variations in performance on the tasks of estimating MTV and TLG, with up to a 73% and 58% difference between the highest and lowest ensemble normalized bias in the estimated MTV and TLG, respectively. Figure 4B compares the segmentations obtained with a λ of 0 versus a λ of 0.8 for a representative test patient. For this patient, whereas the values of DSC, JSC, and HD were similar, a λ of 0 yielded lower aNEs in the estimated MTV and TLG.
FIGURE 4. (A) Quantitative assessment of concordance between task-agnostic and task-based FoMs in evaluating impact of loss function on performance of U-net–based algorithm. (B) Comparison of segmentations yielded by U-net–based algorithm configured with 2 loss functions for 1 representative test patient. abs. ens. norm. = absolute ensemble normalized; abs. norm. = absolute normalized.
DISCUSSION
Reliable performance on clinically relevant tasks is crucial for clinical translation of image segmentation algorithms. A key task for which image segmentation is often conducted in oncologic PET is quantifying features such as MTV and TLG. However, these segmentation algorithms are almost always evaluated using FoMs that are not explicitly designed to measure clinical task performance. In this study, we investigated whether evaluating PET segmentation algorithms with the widely used task-agnostic FoMs leads to interpretations that are consistent with evaluation on clinically relevant quantitative tasks.
Results from Figure 2 indicate that evaluation of conventional computer-aided PET segmentation algorithms based on task-agnostic FoMs of DSC, JSC, and HD could yield discordant interpretations compared with evaluation on the tasks of estimating MTV and TLG of the primary tumor. When evaluating the SUVmax thresholding algorithm, initial inspection based on the task-agnostic FoMs implied that the intensity threshold of 40% SUVmax yielded a significantly superior performance. However, further investigation showed that SUVmax50% provided substantially more accurate performance on estimating MTV and TLG. This discordance was also observed when comparing the MRF-GMM and Snake algorithms. Thus, these results demonstrate the limited ability of the DSC, JSC, and HD to evaluate image segmentation algorithms on clinically relevant tasks.
The limitation in task-agnostic FoMs was again observed in evaluating the impact of network depth and loss function on the performance of a state-of-the-art U-net–based image segmentation algorithm. In Figure 3, we observed initially that the deeper networks yielded DSC, JSC, and HD values statistically similar to those in the shallower networks. Considering the requirement for computational resources when training DL-based algorithms, this may motivate the deployment of shallower networks in clinical studies. However, our task-based evaluation showed that a deeper network yielded substantially higher accuracy in the estimated MTV and TLG. Similarly, we observed from Figure 4 that based on the task-agnostic FoMs, the performance of the U-net–based algorithm was insensitive to the choice of λ (the hyperparameter controlling the weight of BCE loss in the cost function). However, differences up to 73% and 58% could exist between the highest and lowest ensemble normalized bias in the estimated MTV and TLG, respectively.
To gain further insights into the observed discordance between task-agnostic and task-based FoMs, we performed secondary analyses on a per-patient basis. In Figure 5A, for each of the 225 patients, we first calculated the difference (Δ) in DSC, JSC, and HD between SUVmax50% and SUVmax40% (e.g., ΔDSC = DSC(SUVmax50%) - DSC(SUVmax40%)). Next, we obtained the difference in the aNE (Eq. 2 of the supplemental materials) in the estimated MTV and TLG (e.g., ΔaNE(MTV) = aNE(MTV; SUVmax50%) - aNE(MTV; SUVmax40%)). We then studied the relationship between ΔDSC (and ΔJSC and ΔHD) versus ΔaNE(MTV) (and ΔaNE(TLG)) via scatter diagrams. For 36 patients, a negative value of ΔDSC was observed, implying that SUVmax50% was inferior to SUVmax40%. However, for these patients, SUVmax50% actually yielded better estimates of MTV, as indicated by the lower aNEs. Similarly, it was observed that interpretations obtained with ΔHD could be discordant with those based on ΔaNE(MTV). Additionally, even for minor changes in DSC, JSC, and HD (i.e., points close to the vertical dashed line in the scatter diagram), we observed substantial variations in the ΔaNE values. This indicates that these task-agnostic FoMs could be insensitive to even dramatic changes in quantitative task performance. This trend was again observed when comparing MRF-GMM versus Snakes (Fig. 5B) and evaluating the impact of network depth and loss function on the performance of the U-net–based algorithm (Fig. 6).
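A minimal sketch of this per-patient secondary analysis is shown below, assuming per-patient FoM and aNE values for two algorithms are available as arrays; the plotting details are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def delta_scatter(fom_a, fom_b, ane_a, ane_b, fom_name="DSC"):
    """Scatter diagram of per-patient differences (cf. Figs. 5 and 6).

    fom_a, fom_b : per-patient values of a task-agnostic FoM (e.g., DSC) for
                   algorithms A and B (e.g., SUVmax50% and SUVmax40%).
    ane_a, ane_b : per-patient aNEs in an estimated feature (e.g., MTV).
    """
    delta_fom = np.asarray(fom_a) - np.asarray(fom_b)   # e.g., Delta-DSC
    delta_ane = np.asarray(ane_a) - np.asarray(ane_b)   # e.g., Delta-aNE(MTV)
    plt.scatter(delta_fom, delta_ane)
    plt.axvline(0.0, linestyle="--")   # no difference in the task-agnostic FoM
    plt.axhline(0.0, linestyle="--")   # no difference in task performance
    plt.xlabel(f"Delta {fom_name}")
    plt.ylabel("Delta aNE")
    plt.show()
```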
FIGURE 5. Quantitative assessment of concordance between interpretations obtained with task-agnostic vs. task-based FoMs on per-patient basis for considered computer-aided PET segmentation algorithms. Each point in scatter diagram represents individual patient. Horizontal position of each point indicates difference in DSC, JSC, and HD between SUVmax50% vs. SUVmax40% (A) and MRF-GMM vs. Snakes (B). Similarly, vertical position indicates difference in aNEs in estimated MTV and TLG. abs. norm. = absolute normalized.
FIGURE 6. Quantitative assessment of concordance between interpretations obtained with task-agnostic vs. task-based FoMs on per-patient basis when evaluating impact of network depth (A) and loss function (B) on performance of U-net–based algorithm. abs. norm. = absolute normalized.
The findings of this study are not meant to suggest that the task-agnostic metrics, including the DSC, JSC, and HD, are not helpful. In fact, initial development of segmentation algorithms may not be associated with a specific task, and thus, task-agnostic FoMs are valuable for assessing the promise of these algorithms. However, for clinical application, it is important to further assess the performance of these algorithms on clinical tasks for which imaging is performed, as also emphasized in the best practices for evaluation of artificial intelligence algorithms for nuclear medicine (RELAINCE guidelines) (44). Results from our study further confirm the need for this task-based evaluation.
Our task-based evaluation focused on assessing the accuracy of image segmentation algorithms in quantifying features from PET images. In clinical studies, other criteria to evaluate quantification performance could include precision, when repeatability or reproducibility is required for clinical decision-making. When segmentation is required for radiotherapy planning, the relevant criterion is therapeutic efficacy, such as the task of improving the probability of tumor control while minimizing the chances of normal-tissue complications. For this task, Barrett et al. proposed the use of the area under the therapy operating characteristic curve (46) for evaluating segmentation algorithms. In all of these evaluation studies, clinicians (radiologists, nuclear medicine physicians, and disease specialists) have a crucial role in defining the most clinically relevant task and corresponding FoMs for the evaluation of image segmentation algorithms (11).
Evaluating PET segmentation algorithms on quantification tasks required knowledge of true quantitative values of interest. However, such ground truth is often unavailable in clinical studies. To circumvent this challenge, we considered quantitative values obtained using expert human-reader–defined manual delineations as surrogate ground truth. However, we recognize that this surrogate may be erroneous. To address the issue of a lack of ground truth in task-based evaluation of quantitative imaging algorithms, no-gold-standard evaluation techniques have been developed (47–50). These techniques have demonstrated promise in evaluating PET segmentation algorithms on clinically relevant quantitative tasks (51–53). As these techniques are validated further, they could provide a mechanism to perform objective task-based evaluation of segmentation algorithms with patient data. The findings from this study motivate further development and validation of these no-gold-standard evaluation techniques.
Other limitations of this study include the fact that the PET scanners used in the ACRIN 6668/RTOG 0235 multicenter clinical trial were relatively old and did not have time-of-flight capability. Thus, these scanners could yield substantially lower effective sensitivity compared with modern PET scanners. Conducting the proposed study with newer-generation scanners could provide further insights into the potential discordance between task-agnostic and task-based FoMs with more modern technologies. Additionally, the U-net–based algorithm was trained to segment tumors on a per-slice basis. As shown by Leung et al. (5), this strategy helped alleviate the requirement for large amounts of training data and the demand for computational resources. Results from this study motivate expanding the evaluation of 3-dimensional fully automated DL-based algorithms.
As a final remark, the purpose of this study was not to compare DL-based algorithms with conventional computer-aided algorithms. Although we observed that the considered U-net–based algorithm yielded substantially improved performance compared with conventional algorithms based on the task-agnostic and task-based metrics, this study does not intend to suggest that DL-based algorithms are preferable over conventional algorithms.
CONCLUSION
Our retrospective analysis with the ACRIN 6668/RTOG 0235 multicenter clinical trial data shows that evaluation of PET segmentation algorithms based on widely used task-agnostic FoMs could lead to findings that are discordant with evaluation on clinically relevant quantitative tasks. The results emphasize the important need for objective task-based evaluation of image segmentation algorithms for quantitative PET.
DISCLOSURE
This work was supported by the National Institute of Biomedical Imaging and Bioengineering through R01-EB031051, R01-EB031962, R56-EB028287, and R21-EB024647 (Trailblazer Award). No other potential conflict of interest relevant to this article was reported.
KEY POINTS
QUESTION: Are widely used metrics such as DSC, JSC, and HD sufficient to evaluate image segmentation algorithms for their clinical applications?
PERTINENT FINDINGS: Our retrospective analysis with the ACRIN 6668/RTOG 0235 multicenter clinical trial data shows that evaluating PET segmentation algorithms on the basis of the DSC, JSC, and HD FoMs could lead to interpretations that are discordant with evaluation on the clinically relevant quantitative tasks of estimating the MTV and TLG of primary tumors in patients with non–small cell lung cancer.
IMPLICATIONS FOR PATIENT CARE: Objective task-based evaluation of new and improved image segmentation algorithms is important for their clinical application.
Footnotes
Published online Feb. 15, 2024.
- © 2024 by the Society of Nuclear Medicine and Molecular Imaging.
Immediate Open Access: Creative Commons Attribution 4.0 International License (CC BY) allows users to share and adapt with attribution, excluding materials credited to previous publications. License: https://creativecommons.org/licenses/by/4.0/. Details: http://jnm.snmjournals.org/site/misc/permission.xhtml.
- Received for publication May 12, 2023.
- Accepted for publication December 19, 2023.