Abstract
Tumor hypoxia, an integral biomarker to guide radiotherapy, can be imaged with 18F-fluoromisonidazole (18F-FMISO) hypoxia PET. One major obstacle to its broader application is the lack of standardized interpretation criteria. We sought to develop and validate practical interpretation criteria and a dedicated training protocol for nuclear medicine physicians to interpret 18F-FMISO hypoxia PET. Methods: We randomly selected 123 patients with human papillomavirus–positive oropharyngeal cancer enrolled in a phase II trial who underwent 123 18F-FDG PET/CT and 134 18F-FMISO PET/CT scans. Four independent nuclear medicine physicians with no 18F-FMISO experience read the scans. Interpretation by a fifth nuclear medicine physician with over 2 decades of 18F-FMISO experience was the reference standard. Performance was evaluated after initial instruction and subsequent dedicated training. Scans were considered positive for hypoxia by visual assessment if 18F-FMISO uptake was greater than floor-of-mouth uptake. Additionally, SUVmax was determined to evaluate whether quantitative assessment using tumor-to-background ratios could be helpful to define hypoxia positivity. Results: Visual assessment produced a mean sensitivity and specificity of 77.3% and 80.9%, with fair interreader agreement (κ = 0.34), after initial instruction. After dedicated training, mean sensitivity and specificity improved to 97.6% and 86.9%, with almost perfect agreement (κ = 0.86). Quantitative assessment with an estimated best SUVmax ratio threshold of more than 1.2 to define hypoxia positivity produced a mean sensitivity and specificity of 56.8% and 95.9%, respectively, with substantial interreader agreement (κ = 0.66), after initial instruction. After dedicated training, mean sensitivity improved to 89.6% whereas mean specificity remained high at 95.3%, with near-perfect interreader agreement (κ = 0.86). 
Conclusion: Nuclear medicine physicians without 18F-FMISO hypoxia PET reading experience demonstrate much improved interreader agreement with dedicated training using specific interpretation criteria.
Tumor hypoxia is a prognostic and predictive cancer biomarker (1) associated with resistance to chemotherapy, radiotherapy, and immunotherapy across multiple cancers (2–7). In head and neck cancer, including human papillomavirus–positive (HPV+) oropharyngeal cancer (5,6,8), it is associated with a poor chemoradiotherapy outcome.
18F-fluoromisonidazole (18F-FMISO) is an imaging biomarker for tumor hypoxia (7,9) and has been used to predict outcome and prescribe radiotherapy (5,10); it has also been proposed for selecting patients for trials with hypoxia-activated prodrugs, hyperthermia, or high–linear-energy-transfer radiation (4,11–14). Other studies have shown the feasibility of escalating radiotherapy to hypoxic subvolumes (5,10,15–17). Over the past decade, we have completed several trials using 18F-FMISO hypoxia PET under the investigational-new-drug pathway to direct radiation deescalation (18–20).
Currently, interpretation criteria for 18F-FMISO hypoxia PET are not well defined (21–24). Given our exciting phase II data (20), there is great interest in applying 18F-FMISO hypoxia PET in the management of patients with head and neck cancer treated with chemoradiation. Although this has been implemented successfully at our institution (19,20), disparate interpretation criteria and minimal reading experience are major obstacles to the broader applicability of 18F-FMISO PET. Therefore, we developed and validated practical interpretation criteria and a dedicated training protocol for nuclear medicine physicians using 18F-FMISO hypoxia PET.
MATERIALS AND METHODS
Patients
The institutional review board approved this retrospective study and waived the need for written informed consent. We analyzed data from a phase II clinical trial (NCT03323463) performed at our institution in compliance with the Health Insurance Portability and Accountability Act. The trial included patients with HPV+ oropharyngeal cancer (American Joint Committee on Cancer, seventh edition, T1–2/N1–2c) scheduled to undergo personalized radiotherapy with concurrent platinum-based chemotherapy between October 2017 and December 2020. Every patient had baseline 18F-FDG PET/CT showing 18F-FDG–avid metastatic cervical lymph nodes (LNs) and also underwent 18F-FMISO PET/CT before and about 2 wk into chemoradiotherapy (the supplemental materials provide details; available at http://jnm.snmjournals.org). From 302 available patients, we randomly selected 123 patients with a total of 134 18F-FMISO scans for analysis and another 25 patients with a total of 25 18F-FMISO scans for training inexperienced nuclear medicine physicians (Fig. 1).
FIGURE 1. Consolidated Standards of Reporting Trials (CONSORT) diagram: our study included 302 patients from a prior 18F-FMISO phase II trial. In total, 25 18F-FMISO scans from 25 patients were used for training, which included 15 scans for initial instruction and 10 scans for dedicated training, and 123 patients were randomly selected for interpretation and analysis. Some patients had both baseline and follow-up 18F-FMISO scans; therefore, 134 18F-FMISO scans were available from 123 patients. Of these, 53 scans were used for analysis after initial instruction, after excluding 3 scans because of missing information. In total, 75 scans were used for analysis after dedicated training, after excluding 3 scans because of missing information.
Image Analysis
We recruited 4 nuclear medicine physicians with varying clinical experience (1, 3, 5, and 15 y) and no 18F-FMISO reading experience. Dedicated training and reference interpretations were performed by a fifth nuclear medicine physician with more than 25 y of clinical and 20 y of 18F-FMISO reading experience. The training and validation protocol comprised 2 parts: initial instruction, which involved the most experienced reader explaining to the inexperienced readers what positive and negative scans look like, and dedicated training, which involved a more intensive review of standard criteria for determining whether a scan was positive versus negative and the specific image qualities to analyze.
Initial Instruction
Initial instruction comprised a live web-based instruction and a group practice session. Fifteen cases were used for training in initial instruction.
In part 1, the most experienced reader (trainer) explained how to determine positive and negative scans and demonstrated this on 10 representative cases. Readers were instructed to perform a binary assignment (hypoxia-positive or -negative) as follows: baseline 18F-FDG PET is used to determine all suggestive cervical LNs (>1 cm in short-axis diameter with focal abnormally increased 18F-FDG avidity). Suggestive LNs are then assessed on 18F-FMISO PET both visually and quantitatively. In visual assessment, 18F-FMISO nodal uptake is compared with 18F-FMISO reference background uptake in the floor of the mouth (FOM). If 18F-FMISO nodal uptake is greater than FOM uptake, the scan is considered hypoxia-positive, and if it is equal to or less than FOM uptake, the scan is considered hypoxia-negative. Quantitative assessment is also performed, measuring the SUVmax of suggestive LNs (based on a 3-dimensional volume of interest over each suggestive LN) and the SUVmax and SUVmean of the FOM reference region (based on a 1.5-cm 3-dimensional volume of interest in the center of the FOM in the axial and sagittal planes) (Fig. 2). SUVmean was included as a comparison to SUVmax, as it is less affected by noise and heterogeneous tracer variation.
FIGURE 2. 18F-FMISO axial PET and sagittal PET/CT images: (A) Region-of-interest (ROI) 1 demonstrates background reference region placed centrally in FOM in axial plane. ROI 2 demonstrates proper placement of lesion region completely encompassing suggestive left level 2 cervical LN. Arrow demonstrates sternocleidomastoid uptake to be used as additional qualitative reference for background uptake. (B) ROI 1 demonstrates background reference region placed centrally in FOM in second axis, sagittal plane.
In part 2, readers practiced interpretation on 5 cases and submitted them for evaluation by the trainer. All incorrect individual scores and any discordant reading patterns were identified and discussed.
Dedicated Training and Final Readings
Since initial interreader agreement was poor, additional dedicated training was provided. Ten cases were used in dedicated training, for a total of 25 cases in both initial instruction and dedicated training.
Part 1 consisted of another web-based group practice session that used 4 questions outlining standardized qualitative image characteristics to improve interpretation: Do the LNs show focal uptake on the maximum-intensity projection (MIP) image using an SUV display range of 0–4? Is there focal 18F-FMISO uptake corresponding to any portion of the suggestive LNs? Is this focal uptake visually greater than the FOM reference background, verifying that background uptake is diffuse and within the expected range for 18F-FMISO distribution? Is the 18F-FMISO uptake in suggestive LNs greater than uptake in adjacent structures?
An increasing number of positive visual characteristics indicates a higher likelihood that the scan is hypoxia-positive. Although these criteria focus the reader’s attention on specific image characteristics, the ultimate binary assessment is based purely on whether suggestive LNs visually have uptake greater than reference tissue (FOM musculature), analogous to other PET interpretation criteria (22,23).
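The relationship between the 4 training questions and the final binary call can be sketched in a few lines of Python. This is only an illustration: the class and field names are hypothetical and are not part of the published criteria or the reporting template. The one rule taken from the text is that the final call depends solely on whether nodal uptake visually exceeds FOM uptake, whereas the other answers only add or remove supporting evidence.

```python
from dataclasses import dataclass


@dataclass
class VisualChecklist:
    """Answers to the 4 training questions for one scan (hypothetical structure)."""
    focal_on_mip: bool            # Q1: focal LN uptake on MIP (SUV display range 0-4)
    focal_in_suggestive_ln: bool  # Q2: focal uptake within any part of a suggestive LN
    greater_than_fom: bool        # Q3: uptake visually greater than FOM background
    greater_than_adjacent: bool   # Q4: uptake greater than adjacent structures

    def supporting_features(self) -> int:
        # More positive answers suggest a higher likelihood of hypoxia.
        return sum((self.focal_on_mip, self.focal_in_suggestive_ln,
                    self.greater_than_fom, self.greater_than_adjacent))

    def hypoxia_positive(self) -> bool:
        # The binary assessment rests only on the comparison with FOM uptake.
        return self.greater_than_fom


example = VisualChecklist(True, True, True, False)
print(example.supporting_features(), example.hypoxia_positive())
```

Keeping the count of supporting features separate from the binary call mirrors the text: the questions direct attention, but they do not form a score with its own cutoff.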
In part 2, readers scored a final set of 5 additional novel cases using the above criteria and submitted them for group evaluation by the trainer. After the trainer reviewed each case with each reader one-on-one, the trainer demonstrated the interpretation criteria on all cases used in training with conflicting previous reviews.
Statistical Analysis
After the initial instruction, we performed a statistical analysis. Regarding visual assessment, we estimated for each reader the hypoxia positivity rate, sensitivity, and specificity, with the interpretation by the trainer serving as the reference standard. Regarding quantitative assessment, the tumor-to-background SUVmax ratio, that is, the ratio of the LN SUVmax to the FOM SUVmax, was used to determine the presence of hypoxia. Two literature-based thresholds were evaluated separately: LNs were considered positive at an SUVmax ratio of more than 1.2 and, alternatively, of more than 1.3 (25). Additionally, interreader agreement for visual assessment was assessed using the Fleiss κ, with 95% CIs calculated using Monte Carlo approximation (26), and the correlation for quantitative SUVs between readers was assessed using the intraclass correlation coefficient (21).
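The quantitative rule and the per-reader accuracy measures described above can be illustrated with a short Python sketch. It is not the study's analysis code (the analysis was performed in R), and the function names are hypothetical. The 4 SUVmax pairs are the values quoted in the captions of the example figures in this article, and the visual reference calls are the ones stated in those captions.

```python
def suvmax_ratio(ln_suvmax: float, fom_suvmax: float) -> float:
    """Tumor-to-background ratio: LN SUVmax divided by FOM SUVmax."""
    if fom_suvmax <= 0:
        raise ValueError("FOM SUVmax must be positive")
    return ln_suvmax / fom_suvmax


def is_positive(ratio: float, threshold: float = 1.2) -> bool:
    """Positive only if the ratio is strictly more than the threshold."""
    return ratio > threshold


def sensitivity_specificity(calls, reference):
    """Sensitivity and specificity of binary calls against a reference standard.

    Assumes the reference contains at least one positive and one negative case.
    """
    tp = sum(c and r for c, r in zip(calls, reference))
    tn = sum((not c) and (not r) for c, r in zip(calls, reference))
    n_pos = sum(reference)
    n_neg = len(reference) - n_pos
    return tp / n_pos, tn / n_neg


# (LN SUVmax, FOM SUVmax) pairs quoted in the example figure captions.
example_nodes = [(3.6, 1.8), (0.9, 1.5), (1.9, 1.6), (1.8, 1.6)]
ratios = [suvmax_ratio(ln, fom) for ln, fom in example_nodes]
quant_calls = [is_positive(r) for r in ratios]

# Visual calls stated in the same captions, used here as the reference.
visual_reference = [True, False, True, True]
sens, spec = sensitivity_specificity(quant_calls, visual_reference)
print(quant_calls, sens, spec)
```

On these 4 examples the threshold misses the two nodes with ratios just under 1.2, which is the pattern of high specificity but lower sensitivity reported for the quantitative approach.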
Because of the low agreement and performance after the initial instruction, dedicated training was completed with additional 18F-FMISO PET scans, and the statistical analysis was then repeated. The same methods were used to calculate reader performance after the dedicated training. The additional scans were also used to investigate a new custom positivity threshold based on the SUV ratio. Receiver operating characteristic curve analysis was used to estimate the best threshold by maximizing the Youden index, that is, sensitivity + specificity − 1 (equivalent to maximizing the sum of sensitivity and specificity). The sensitivity for visual versus quantitative approaches was compared for each reader using the McNemar test for scans deemed hypoxia-positive by the trainer. Statistical analysis was performed using R version 4.1.1.
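Threshold selection by the Youden index can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the R code used in the study: the candidate thresholds are taken as midpoints between consecutive observed ratios, a ratio is called positive when it is strictly more than the threshold, and the toy ratios and reference labels are invented for the example.

```python
def sens_spec(calls, reference):
    """Sensitivity and specificity; assumes both classes occur in the reference."""
    tp = sum(c and r for c, r in zip(calls, reference))
    tn = sum((not c) and (not r) for c, r in zip(calls, reference))
    n_pos = sum(reference)
    n_neg = len(reference) - n_pos
    return tp / n_pos, tn / n_neg


def best_youden_threshold(ratios, reference):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    values = sorted(set(ratios))
    # Midpoints between consecutive observed ratios are the candidate cutoffs.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_t, best_j = None, -1.0
    for t in candidates:
        calls = [r > t for r in ratios]
        sens, spec = sens_spec(calls, reference)
        j = sens + spec - 1.0
        if j > best_j:  # keeps the lowest threshold when there is a tie
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical SUVmax ratios: first 5 reference-negative, last 5 reference-positive.
toy_ratios = [0.8, 0.9, 1.0, 1.1, 1.3, 1.15, 1.25, 1.4, 1.6, 2.0]
toy_reference = [False] * 5 + [True] * 5
toy_threshold, toy_j = best_youden_threshold(toy_ratios, toy_reference)
print(toy_threshold, toy_j)
```

Because subtracting 1 does not change which threshold wins, maximizing the Youden index and maximizing the plain sum of sensitivity and specificity select the same cutoff.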
RESULTS
Patients and Imaging Characteristics
The present analysis comprised 123 patients (Table 1) with 123 18F-FDG and 134 18F-FMISO scans. In total, 53 18F-FMISO scans, with 3 excluded for missing information, were used in the analysis after initial instruction, and 75 18F-FMISO scans, with 3 excluded for missing information, were used in the analysis after dedicated training. On 18F-FMISO PET, LNs had a median SUVmax of 1.7 (range, 1.0–4.5) and a median SUVmean of 1.3 (range, 0.7–2.5), whereas FOM had a median SUVmax of 1.7 (range, 1.1–2.5) and a median SUVmean of 1.4 (range, 1.0–2.2).
TABLE 1. Patient Demographics (n = 123)
Establishing a Quantitative Threshold to Determine Hypoxia Positivity After Dedicated Training
Readings after dedicated training were used to determine the best SUVmax ratio threshold to define hypoxia positivity. Receiver operating characteristic curve analysis was done for each reader separately. The best thresholds (with associated sensitivity and specificity) were 1.17 (0.90 and 0.88) for reader 1 (R1), 1.18 (1.00 and 0.93) for reader 2 (R2), 1.27 (1.00 and 0.95) for reader 3 (R3), and 1.19 (0.97 and 0.95) for reader 4 (R4). On the basis of these results, an SUVmax ratio of more than 1.2 was selected as the optimal threshold to define hypoxia positivity.
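The article does not state how the 4 reader-specific thresholds were combined into the single value of 1.2. As a quick plausibility check only, and assuming nothing about the authors' actual procedure, the following Python lines show that both the mean and the median of the reported per-reader thresholds round to 1.2.

```python
from statistics import mean, median

# Best per-reader thresholds reported in the text (R1 to R4).
reader_thresholds = [1.17, 1.18, 1.27, 1.19]

mean_threshold = mean(reader_thresholds)      # about 1.2025
median_threshold = median(reader_thresholds)  # about 1.185

print(round(mean_threshold, 1), round(median_threshold, 1))
```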
Effect of Dedicated Training on 18F-FMISO PET/CT Reading Performance
Regarding visual assessment, sensitivity (R1, 81.8%; R2, 90.9%; R3, 90.9%; R4, 45.5%) and specificity (R1, 59.5%; R2, 95.2%; R3, 69.0%; R4, 100%) varied widely among readers after initial instruction; after dedicated training, sensitivity (R1, 95.1%; R2, 100%; R3, 97.6%; R4, 97.6%) improved for every reader, and specificity (R1, 90.3%; R2, 81.8%; R3, 84.4%; R4, 90.9%) became more uniform, improving for R1 and R3 but decreasing for R2 and R4 (Table 2).
TABLE 2. Diagnostic Performance of Qualitative and Quantitative Readings After Initial Instruction and After Dedicated Training
Regarding quantitative assessment, using a threshold of more than 1.2, sensitivity (R1, 54.5%; R2, 54.5%; R3, 63.6%; R4, 54.5%) was very low whereas specificity (R1, 100%; R2, 100%; R3, 90.5%; R4, 92.9%) was very high after initial instruction; after dedicated training, sensitivity (R1, 85.4%; R2, 82.9%; R3, 95.1%; R4, 95.1%) improved substantially and specificity (R1, 90.3%; R2, 100%; R3, 93.8%; R4, 96.9%) remained very high (Table 2).
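The mean values quoted in the Abstract can be checked against the per-reader percentages given in the text. The short Python sketch below assumes that the reported means are simple unweighted averages across the 4 readers; the article does not say so explicitly, but the numbers agree to within rounding.

```python
from statistics import mean

# (per-reader values R1 to R4 in %, reported mean in %), all copied from the text.
reported = {
    "visual, initial, sensitivity": ([81.8, 90.9, 90.9, 45.5], 77.3),
    "visual, initial, specificity": ([59.5, 95.2, 69.0, 100.0], 80.9),
    "visual, trained, sensitivity": ([95.1, 100.0, 97.6, 97.6], 97.6),
    "visual, trained, specificity": ([90.3, 81.8, 84.4, 90.9], 86.9),
    "ratio > 1.2, initial, sensitivity": ([54.5, 54.5, 63.6, 54.5], 56.8),
    "ratio > 1.2, initial, specificity": ([100.0, 100.0, 90.5, 92.9], 95.9),
    "ratio > 1.2, trained, sensitivity": ([85.4, 82.9, 95.1, 95.1], 89.6),
    "ratio > 1.2, trained, specificity": ([90.3, 100.0, 93.8, 96.9], 95.3),
}


def max_discrepancy(table) -> float:
    """Largest absolute gap between a computed mean and the reported mean."""
    return max(abs(mean(values) - stated) for values, stated in table.values())


for label, (values, stated) in reported.items():
    print(f"{label}: computed {mean(values):.2f}, reported {stated}")
```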
Summary of the Scoring System
Regarding visual assessment, the mean sensitivity and specificity were 77.3% and 80.9%, respectively, after initial instruction, with fair interreader agreement (κ = 0.34 [95% CI, 0.23–0.45]); after dedicated training, the mean sensitivity and specificity improved to 97.6% and 86.9%, respectively, with almost perfect interreader agreement (κ = 0.86 [95% CI, 0.77–0.96]).
Regarding quantitative assessment, using an SUVmax ratio threshold of more than 1.2, the mean sensitivity and specificity were 56.8% and 95.9%, respectively, after initial instruction, with substantial interreader agreement (κ = 0.66 [95% CI, 0.55–0.77]); after dedicated training, the mean sensitivity improved to 89.6% and the mean specificity remained high, at 95.3%, with almost perfect interreader agreement (κ = 0.86 [95% CI, 0.76–0.94]).
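For readers unfamiliar with the Fleiss κ, the statistic can be computed from a table of binary calls as sketched below in Python. This is an illustrative implementation with invented toy ratings, not the study's R code, and it omits the Monte Carlo CIs. The verbal descriptors used in this article (fair, substantial, almost perfect) match the widely used Landis and Koch scale, which is assumed here because the article does not name the scale.

```python
def fleiss_kappa(ratings):
    """Fleiss kappa for binary calls.

    ratings: one list per scan containing each reader's call (True/False);
    every scan must be rated by the same number of readers (at least 2).
    """
    n_scans = len(ratings)
    n_readers = len(ratings[0])
    # Per-scan counts for the two categories: (positive, negative).
    counts = [(sum(row), n_readers - sum(row)) for row in ratings]
    # Observed agreement, averaged over scans.
    p_bar = sum(
        (sum(c * c for c in row) - n_readers) / (n_readers * (n_readers - 1))
        for row in counts
    ) / n_scans
    # Chance agreement from the overall category proportions.
    total = n_scans * n_readers
    p_e = sum((sum(row[j] for row in counts) / total) ** 2 for j in range(2))
    return (p_bar - p_e) / (1 - p_e)


def landis_koch_label(kappa: float) -> str:
    """Verbal descriptor on the Landis and Koch scale (assumed, see lead-in)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical calls from 4 readers on 6 scans.
toy_ratings = [
    [True, True, True, True],
    [True, True, True, True],
    [False, False, False, False],
    [False, False, False, False],
    [True, True, True, False],
    [False, False, False, True],
]
toy_kappa = fleiss_kappa(toy_ratings)
print(toy_kappa, landis_koch_label(toy_kappa))
```

Applied to the values reported here, `landis_koch_label` returns "fair" for 0.34, "substantial" for 0.66, and "almost perfect" for 0.86.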
Notably, using an SUVmax ratio threshold of more than 1.2 alone demonstrated lower mean sensitivity than visual assessment, 89.6% versus 97.6%, although the difference was significant only for R2 (R1, P = 0.13; R2, P = 0.02; R3, P > 0.99; R4, P > 0.99). Although there was good agreement between visual assessment and an SUVmax ratio threshold of more than 1.2 in very positive (Fig. 3) and very negative (Fig. 4) cases, some cases were visually assessed as positive for hypoxia despite an SUVmax ratio equal to 1.2 (Fig. 5) or less than 1.2 (Fig. 6).
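The paired comparison of visual and quantitative sensitivity can be illustrated with an exact (binomial) McNemar test in Python. Several caveats apply: the article does not state which variant of the McNemar test was used, and it does not report the discordant-pair counts. The counts below are hypothetical, chosen to be consistent with the reported sensitivities if about 41 scans were reference-positive (an inference from the percentages, not a figure given in the article).

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar P value from the discordant pair counts.

    b: scans called positive visually but negative by the ratio threshold
    c: scans called negative visually but positive by the ratio threshold
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts (see lead-in).
p_r1 = mcnemar_exact_p(4, 0)  # 0.125
p_r2 = mcnemar_exact_p(7, 0)  # 0.015625
p_r3 = mcnemar_exact_p(1, 0)  # 1.0
print(p_r1, p_r2, p_r3)
```

Under these assumed counts the results round to 0.13, 0.02, and more than 0.99, in line with the reported P values; this is suggestive of an exact test but does not confirm it.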
FIGURE 3. Visually and quantitatively positive 18F-FMISO scan: 18F-FMISO (top) and 18F-FDG (bottom) PET/CT MIP, fused, CT, and PET images. Shown is study of right level 2A cervical LN, with LN/FOM ratio of 3.6/1.8 = 2.0 (quantitatively positive). Nodal uptake is seen on MIP, is focal and increased in LN, and is visually greater than in FOM and adjacent structures (qualitatively positive).
FIGURE 4. Visually and quantitatively negative 18F-FMISO scan: 18F-FMISO (top) and 18F-FDG (bottom) PET/CT MIP, fused, CT, and PET images. Shown is study of left level 2A cervical LN, with LN/FOM ratio of 0.9/1.5 = 0.6 (quantitatively negative). Nodal uptake is not seen on MIP, is not focal or increased in LN, and is visually less than in FOM and adjacent structures (qualitatively negative).
FIGURE 5. Visually positive 18F-FMISO scan that is quantitatively negative: 18F-FMISO (top) and 18F-FDG (bottom) PET/CT MIP, fused, CT, and PET images. Shown is study of left level 2A cervical LN, with LN/FOM ratio of 1.9/1.6 = 1.2 (quantitatively negative). Nodal uptake is seen on MIP, is focal and increased in LN, and is visually greater than in FOM and adjacent structures (qualitatively positive).
FIGURE 6. Visually positive 18F-FMISO scan that is quantitatively negative: 18F-FMISO (top) and 18F-FDG (bottom) PET/CT MIP, fused, CT, and PET images. Shown is study of right level 2 cervical LN, with LN/FOM ratio of 1.8/1.6 = 1.1 (quantitatively negative). Uptake is seen on MIP, is focal and increased in LN, is visually greater than in FOM (which is heterogeneous and above expected uptake in some areas), and is greater than in adjacent structures (qualitatively positive).
DISCUSSION
We developed and validated standardized interpretation criteria and a dedicated training protocol for 18F-FMISO hypoxia PET/CT assessment in patients with oropharyngeal cancer. After dedicated training, inexperienced nuclear medicine physicians showed much-improved sensitivity and interreader agreement. This addresses a major criticism that has hindered the introduction of hypoxia imaging in radiotherapy multicenter trials: reproducible 18F-FMISO PET imaging interpretation (22,23).
18F-FMISO imaging has been used in the research setting for several decades. The University of Washington group proposed a tumor-to-blood ratio of radiotracer concentrations to identify hypoxic tissue (24), applied in a series of clinical research studies (7,27,28). Other groups have proposed imaging-based methods, including a tumor SUVmax/contralateral neck musculature SUVmean ratio of at least 1.4, a tumor SUVmax/posterior neck muscle SUVmax ratio of more than 1.3, a tumor SUVmax/posterior cervical musculature uptake ratio of more than 1.25 (25,29,30), or comparison to contralateral tissues (13). Although widely applied, the imaging-based methods have lacked consensus. Of note, clinical PET/CT interpretation also relies on visual assessment, but over time, researchers have realized the need for standardized interpretation criteria (22,31,32). Single numeric SUV cutoffs are not used in PET/CT clinical practice (22,23). Therefore, our method follows in the footsteps of previous criteria with acceptable performance and reliability. Additionally, previous 18F-FMISO investigations explored only the potential prognostic value of hypoxia imaging and, with very few exceptions (17,33), did not use imaging findings to inform patient management. In contrast, we have shown (20) that patients with HPV+ oropharyngeal cancer that is hypoxia-negative can benefit from deescalation of therapy. There is growing interest in this application, and our work addresses the need for a standardized, reproducible method for 18F-FMISO PET/CT interpretation.
In response to those who desire an arbitrary SUV cutoff for positivity in hypoxia imaging, we compared our standardized visual assessment criteria with a quantitative tumor-to-background threshold ratio. Our data suggest an inflection point from positive to negative scans around a ratio of 1.2; however, there were positive scans with ratios of as low as 1.0 (Fig. 6) and negative scans with ratios of up to 1.3. Overall, the quantitative assessment was specific but not as sensitive as the visual assessment, with greatest agreement for clearly positive (Fig. 3) or negative (Fig. 4) scans. Although quantitative assessment may seem more objective, it is still subject to physician assessment and annotation, leaving room for technical mistakes, such as inappropriate placement and sizing of regions of interest. For instance, erroneous measurements can be obtained if a region of interest is placed on the wrong LN or if the reference region includes streak artifacts. Therefore, we do not recommend a numeric cutoff for hypoxia positivity. Rather, our data support dedicated training for readers and use of the standardized visual interpretation criteria to achieve better interreader reliability; these criteria proved especially useful for subtle cases with SUV ratios of around 1.0–1.3 (Figs. 5 and 6). This is analogous to previous experience with qualitative criteria (34,35). However, sensitivity, specificity, and interreader agreement were suboptimal in the first round of interpretations, highlighting the importance of a dedicated training program and standardized reporting. Accordingly, we have also developed a structured reporting template to promote clear communication with referring physicians (Supplemental Fig. 1).
Our study had some limitations. First, whereas visual interpretation can be criticized as subjective in comparison to measurement-based tissue-to-blood ratios to define hypoxia positivity on 18F-FMISO PET, our proposed method does not require blood sampling and processing, is easily implemented into clinical PET/CT workflows anywhere, and leads to high interreader agreement. Second, our study was confined to patients with HPV+ oropharyngeal cancer, and our criteria may require some modification when studying other diseases. Finally, although we did not perform CT-guided biopsy for every patient to demonstrate the presence of hypoxia, as was done in a smaller number of patients in a prior trial (18), we emphasize that a treatment strategy using 18F-FMISO imaging and classifying patients as hypoxia-positive or -negative has proven successful in clinical practice in our institution, allowing for deescalation of radiotherapy doses in a significant portion of patients with HPV+ oropharyngeal cancer (19).
CONCLUSION
Our systematic approach to standardizing the 18F-FMISO interpretation criteria and training protocol for nuclear medicine physicians achieved high interreader agreement, setting the stage for broader application of 18F-FMISO imaging—from academic to community practice—and its use in future multicenter clinical trials.
DISCLOSURE
This research was funded in part by NIH/NCI Cancer Center Support grant P30 CA008748. Nancy Lee has served on the advisory board of or as a consultant for Merck, Merck Serono, Nanobiotix, Galera Therapeutics Inc., and Leo Cancer Care. She also owns stock options in Leo Cancer Care. She has been a consultant or speaker for Varian Inc., Shanghai JoAnn Medical Technology Co. Ltd., and Yingming. Nadeem Riaz has received research support from Pfizer, BMS, and Repare Therapeutics Inc. No other potential conflict of interest relevant to this article was reported.
KEY POINTS
QUESTION: Can reliable interpretation criteria be established for 18F-FMISO PET/CT?
PERTINENT FINDINGS: Nuclear medicine physicians need training to read 18F-FMISO hypoxia PET/CT images accurately and reliably. Visual assessment is preferred to an SUVmax cutoff. Standardized reporting based on developed 18F-FMISO interpretation criteria is sensitive and specific.
IMPLICATIONS FOR PATIENT CARE: Standardized reporting based on developed 18F-FMISO interpretation criteria will be critical for conducting future multicenter clinical trials aimed at changing patient management and for Food and Drug Administration approval of hypoxia PET radiotracers.
Footnotes
Published online Sep. 12, 2024.
- © 2024 by the Society of Nuclear Medicine and Molecular Imaging.
- Received for publication March 13, 2024.
- Accepted for publication July 23, 2024.