Abstract
The interobserver agreement for 68Ga-PSMA-11 PET/CT study interpretations in patients with prostate cancer is unknown. Methods: 68Ga-PSMA-11 PET/CT was performed in 50 patients with prostate cancer for biochemical recurrence (n = 25), primary diagnosis (n = 10), biochemical persistence after primary therapy (n = 5), or staging of known metastatic disease (n = 10). Images were reviewed by 16 observers who used a standardized approach for interpretation of local (T), nodal (N), bone (Mb), or visceral (Mc) involvement. Observers were classified as having a low (<30 prior 68Ga-PSMA-11 PET/CT studies; n = 5), intermediate (30–300 studies; n = 5), or high level of experience (>300 studies; n = 6). Histopathology (n = 25, 50%), post–external-beam radiation therapy prostate-specific antigen response (n = 15, 30%), or follow-up PET/CT (n = 10, 20%) served as a standard of reference. Observer groups were compared by overall agreement (% patients matching the standard of reference) and Fleiss' κ with mean and corresponding 95% confidence interval (CI). Results: Agreement among all observers was substantial for T (κ = 0.62; 95% CI, 0.59–0.64) and N (κ = 0.74; 95% CI, 0.71–0.76) staging and almost perfect for Mb (κ = 0.88; 95% CI, 0.86–0.91) staging. Level of experience positively correlated with agreement for T (κ = 0.73/0.66/0.50 for high/intermediate/low experience, respectively), N (κ = 0.80/0.76/0.64, respectively), and Mc staging (κ = 0.61/0.46/0.36, respectively). Interobserver agreement for Mb was almost perfect irrespective of prior experience (κ = 0.87/0.91/0.88, respectively). Observers with low experience, when compared with intermediate and high experience, demonstrated significantly lower median overall agreement (54% vs. 66% and 76%, P = 0.041) and specificity for T staging (73% vs. 88% and 93%, P = 0.032). 
Conclusion: The interpretation of 68Ga-PSMA-11 PET/CT for prostate cancer staging is highly consistent among observers with high levels of experience, especially for nodal and bone assessments. Initial training on at least 30 patient cases is recommended to ensure acceptable performance.
See an invited perspective on this article on page 1615.
The radioligand 68Ga-PSMA-11 (Glu-NH-CO-NH-Lys-(Ahx)-[68Ga(HBED-CC)]) binds with high affinity to prostate-specific membrane antigen (PSMA) (1). High PSMA expression together with little or no background uptake enables accurate imaging of prostate cancer by PET (2,3). Current evidence strongly suggests that 68Ga-PSMA-11 PET/CT adds value to current diagnostic approaches (4). Large, mainly retrospective trials demonstrate superior detection rates and higher accuracy for the localization of biochemical recurrence when compared with morphologic imaging or choline PET/CT (5–9). A recent systematic review supports the use of 68Ga-PSMA-11 PET/CT in patients with biochemical recurrence and low prostate-specific antigen (PSA) values (<2 ng/mL) (10). Moreover, there is evidence for additional value for primary staging (11–13), stratification for PSMA-targeted radioligand therapy, and management of metastatic disease (14–18).
Multicenter trials to evaluate accuracy and impact on management of 68Ga-PSMA-11 PET/CT are currently under way in Europe and the United States (e.g., NCT02940262, NCT02918357, NCT02919111).
Before widespread clinical adoption of PSMA-targeted PET imaging, its interobserver variability and agreement need to be established (19,20). This information has thus far not been available for 68Ga-PSMA-11 PET/CT interpretations. To address this unmet need, we prospectively evaluated the interobserver agreement for 68Ga-PSMA-11 PET/CT interpretations and compared findings among readers with various levels of experience.
MATERIALS AND METHODS
Patients and Standard of Reference (SOR)
From 2 institutional databases (Ludwig-Maximilians-University and Technical University Munich), 50 patients who underwent 68Ga-PSMA-11 PET/CT for the following indications were selected retrospectively: biochemical recurrence (n = 25), primary diagnosis (n = 10), biochemical persistence after primary therapy (n = 5), or staging of known metastatic disease (n = 10). Patient characteristics are given in Table 1. Twenty-five of 50 patients (50%) had histologic verification of PET/CT-positive lesions. In the remaining patients, PSA response after external-beam radiation (n = 15) or 68Ga-PSMA-11 PET/CT follow-up (n = 10) served as SOR. PET/CT-positive lesions were defined during a joint reading session by consensus of 2 expert readers, each with more than 1,000 prior clinical or research 68Ga-PSMA-11 PET/CT interpretations. Expert readers had access to all clinical data. Cases were selected to represent clinical routine, ranging from negative cases (n = 6, 12%) to extensive disease (n = 10, 20%), with typical pitfalls. Pitfalls included 68Ga-PSMA-11 PET/CT false-positive (unspecific bone uptake, n = 4; celiac ganglia, n = 2; inflammatory or postinflammatory, n = 4; benign tumor, n = 2) and false-negative lesions (n = 8), for a total of 20 challenges in 15 patients (Table 2).
The prospective study was approved by the Institutional Review Board at the Ludwig-Maximilians-University Munich, Munich, Germany, and registered in the ISRCTN registry (number ISRCTN13499475).
Image Acquisition and Reconstruction
Patient preparation and image acquisition were performed as previously described (8,13). In brief, 68Ga-PSMA-11 was injected intravenously at a median dose of 182 MBq (interquartile range, 80 MBq) along with 20 mg of furosemide. A median tracer uptake period of 57 min (interquartile range, 14 min) was allowed before imaging with either a Siemens Biograph mCT (n = 22, 44%), Siemens True Point 64 (n = 22, 44%), or GE Discovery 690 (n = 6, 12%) scanner.
In all patients, a diagnostic CT scan (reference mAs, 200–240; 120 kV) was obtained in the portal venous phase 80 s after intravenous injection of contrast agent followed by the PET scan. All patients received diluted oral contrast.
PET images were reconstructed with an axial 168 × 168 matrix based on the TrueX algorithm (3 iterations, 21 subsets; Biograph 64) and a 256 × 256 matrix based on the TrueX algorithm (4 iterations, 8 subsets; Biograph mCT) or on the VUE Point FX algorithm (2 iterations, 36 subsets; Discovery 690).
Observers
Sixteen physicians from 13 centers located in Europe (n = 9), North America (n = 2), Asia (n = 1), and Australia (n = 1) were recruited prospectively as research participants based on their training (nuclear medicine physician or radiologist) and prior experience with PET/CT. The research participants, that is, the observers, reviewed 50 68Ga-PSMA-11 PET/CT datasets. Each dataset included diagnostic CT and attenuation-corrected PET images.
Observers reported the number of previous clinical 68Ga-PSMA-11 PET/CT interpretations. On the basis of this information, observers were classified as having a low (<30 prior 68Ga-PSMA PET/CT studies; n = 5), intermediate (30–300 studies; n = 5), or high level of experience (>300 studies; n = 6).
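The grouping rule above can be written as a short function. This is a minimal sketch for illustration only; the function name `experience_level` and the example counts are hypothetical and are not part of the study materials.

```python
def experience_level(prior_studies: int) -> str:
    """Map a self-reported number of prior 68Ga-PSMA-11 PET/CT reads to a group.

    Thresholds follow the text: low (<30), intermediate (30-300), high (>300).
    """
    if prior_studies < 0:
        raise ValueError("prior_studies must be non-negative")
    if prior_studies < 30:
        return "low"
    if prior_studies <= 300:
        return "intermediate"
    return "high"


# Hypothetical counts chosen to exercise the boundaries (not study data).
for count in (12, 30, 300, 301):
    print(count, experience_level(count))
```

Both boundary values (30 and 300 studies) fall into the intermediate group under this reading of the thresholds.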
Guidelines for Visual Interpretation
A written guide (supplemental materials [available at http://jnm.snmjournals.org]), 4 teaching cases, an electronic case report form, and 1 test patient dataset with disclosed data entries were provided to each observer. In addition, observers were asked to learn about 68Ga-PSMA-11 PET/CT pitfalls (21) and the typical nomenclature for lymph node regions (22) to achieve best possible agreement.
The following patient information was disclosed to each observer before image interpretation: indication (biochemical recurrence, primary diagnosis, biochemical persistence after primary therapy, staging of metastatic disease), age (y), weight (kg), injected dose (MBq), uptake time (min), PET/CT device, and PSA level (ng/mL). Observers were masked to all other clinical data. Visual image interpretation for the presence or absence of malignant disease was reported for predefined categories (Supplemental Table 1).
Semiquantitative Measurements
Each observer recorded SUVmax for 1 diseased target region per T, N, Mb, and Mc category. The target region for SUV measurement was automatically identified in the electronic case report form.
Each observer measured background activity by defining SUVmax and SUVmean using a 1.5-cm-diameter circular region of interest placed in the center of the aortic arch and the left gluteus muscle. To exclude variability arising from the different image software packages used for interpretation, observers were asked to repeat the tumor and background SUV measurements for 1 test patient dataset and to confirm that the deviation did not exceed 10%.
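The 10% software check can be sketched as follows. The text does not state how the deviation was computed, so this example assumes a relative deviation against a disclosed reference value for the test dataset. The function names and the numeric values are hypothetical.

```python
def relative_deviation(measured: float, reference: float) -> float:
    """Absolute deviation of a repeated SUV measurement as a fraction of the reference."""
    if reference <= 0:
        raise ValueError("reference SUV must be positive")
    return abs(measured - reference) / reference


def within_tolerance(measured: float, reference: float, tolerance: float = 0.10) -> bool:
    """True if the repeated measurement deviates by no more than the tolerance (assumed 10%)."""
    return relative_deviation(measured, reference) <= tolerance


# Hypothetical SUVmax values for the test dataset (not study data).
reference_suvmax = 12.0
for observer_value in (12.9, 13.5):
    print(observer_value, within_tolerance(observer_value, reference_suvmax))
```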
Statistical Analysis and Reference Standard
For binary data, agreement among observer groups was evaluated using Fleiss’ κ (23). For nonbinary data with more than 10 observations, agreement among observer groups was evaluated by intraclass correlation coefficient (ICC) using a 2-way mixed model for absolute agreement (average measures) (24). Ninety-five percent confidence intervals (CIs) are reported for κ and ICC values. Interpretation of κ and ICC was based on a classification provided by Landis and Koch (25): <0.0, poor; 0.0–0.20, slight; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, substantial; 0.81–1.00, almost-perfect reproducibility.
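For readers unfamiliar with the statistic, the following is a minimal pure-Python sketch of Fleiss' κ together with the Landis and Koch labels listed above. It is not the `irr` routine used in the study and does not compute confidence intervals. Rounding to two decimals before binning and treating negative values as "poor" are assumptions of this sketch, and the toy ratings are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of subjects, each a list with one label per observer."""
    n_subjects = len(ratings)
    if n_subjects == 0:
        raise ValueError("at least one subject is required")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every subject needs the same number (>= 2) of ratings")

    counts = [Counter(row) for row in ratings]
    categories = set().union(*counts)

    # Observed agreement: mean proportion of agreeing observer pairs per subject.
    p_bar = sum(
        (sum(c[k] ** 2 for k in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_subjects

    # Chance agreement from the pooled share of ratings in each category.
    total = n_subjects * n_raters
    p_e = sum((sum(c[k] for c in counts) / total) ** 2 for k in categories)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1.0 - p_e)


def landis_koch(value):
    """Label a kappa or ICC value using the bins quoted in the text."""
    v = round(value, 2)  # assumption: bin after rounding to two decimals
    if v < 0:
        return "poor"
    for upper, label in ((0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")):
        if v <= upper:
            return label
    return "almost perfect"


# Hypothetical binary reads: 4 patients, 3 observers (1 = positive, 0 = negative).
toy_reads = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 0, 1]]
kappa = fleiss_kappa(toy_reads)
print(round(kappa, 2), landis_koch(kappa))
```

Each inner list holds the calls of all observers for one patient and one staging category, so κ would be computed separately for T, N, Mb, and Mc.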
Overall agreement, defined as complete agreement of an observer with the SOR for all categories (T, N, Mb, Mc), as well as sensitivity and specificity compared with the SOR, was calculated for each observer. Group median and range were reported for overall agreement, sensitivity, and specificity. Differences between 2 groups were assessed by Student t test. The significance level was 5%.
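The per-observer measures described above can be illustrated with a short sketch. The data layout (one dictionary of binary T, N, Mb, and Mc calls per patient), the function names, and the four example patients are hypothetical and serve only to make the definitions concrete.

```python
CATEGORIES = ("T", "N", "Mb", "Mc")


def overall_agreement(reads, sor):
    """Percentage of patients whose T, N, Mb, and Mc calls all match the SOR."""
    if len(reads) != len(sor) or not sor:
        raise ValueError("reads and sor must be non-empty and of equal length")
    matches = sum(all(r[c] == s[c] for c in CATEGORIES) for r, s in zip(reads, sor))
    return 100.0 * matches / len(sor)


def sensitivity_specificity(reads, sor, category):
    """Sensitivity and specificity of one observer for one staging category."""
    tp = fn = tn = fp = 0
    for r, s in zip(reads, sor):
        if s[category]:
            if r[category]:
                tp += 1
            else:
                fn += 1
        elif r[category]:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else None  # undefined without positives
    specificity = tn / (tn + fp) if (tn + fp) else None  # undefined without negatives
    return sensitivity, specificity


# Hypothetical reference standard and one observer's reads for 4 patients.
sor = [
    {"T": 1, "N": 0, "Mb": 0, "Mc": 0},
    {"T": 0, "N": 1, "Mb": 0, "Mc": 0},
    {"T": 0, "N": 1, "Mb": 1, "Mc": 0},
    {"T": 0, "N": 0, "Mb": 0, "Mc": 0},
]
reads = [
    {"T": 1, "N": 0, "Mb": 0, "Mc": 0},
    {"T": 1, "N": 1, "Mb": 0, "Mc": 0},  # false-positive local finding
    {"T": 0, "N": 1, "Mb": 1, "Mc": 0},
    {"T": 0, "N": 0, "Mb": 0, "Mc": 0},
]
print(overall_agreement(reads, sor))
print(sensitivity_specificity(reads, sor, "T"))
```

In this toy case a single false-positive T call lowers both the overall agreement and the T specificity while leaving T sensitivity unchanged.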
Discrepancies in semiquantitative measurements between observer groups and the SOR were expressed as mean difference (Δ) ± SD. Statistical analyses were performed using R software (R Core Team 2015, R Foundation for Statistical Computing) with the package “irr” (Gamer et al., version 0.84) for Fleiss’ κ and SPSS (version 15.0; SPSS Inc.) for all other statistical analyses.
At least substantial agreement for visual and semiquantitative interpretation of all scans for the 3 major staging categories (T, N, Mb) was defined as acceptable performance.
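As a worked illustration of the visual part of this criterion, the sketch below applies it to the group-level κ values reported in the Abstract. The 0.61 threshold is the lower bound of "substantial" on the Landis and Koch scale. The function name is hypothetical, and the semiquantitative part of the criterion is not covered here.

```python
SUBSTANTIAL = 0.61  # lower bound of "substantial" agreement (Landis and Koch)

# Fleiss' kappa per observer group, taken from the Abstract.
KAPPA_BY_GROUP = {
    "high": {"T": 0.73, "N": 0.80, "Mb": 0.87},
    "intermediate": {"T": 0.66, "N": 0.76, "Mb": 0.91},
    "low": {"T": 0.50, "N": 0.64, "Mb": 0.88},
}


def meets_visual_criterion(kappas, threshold=SUBSTANTIAL):
    """True if agreement is at least substantial for T, N, and Mb."""
    return all(kappas[category] >= threshold for category in ("T", "N", "Mb"))


for group, kappas in KAPPA_BY_GROUP.items():
    print(group, meets_visual_criterion(kappas))
```

Only the T value of the low-experience group (κ = 0.50) falls below the threshold, so that group alone fails the check.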
RESULTS
Patient Characteristics
Table 1 summarizes the patient characteristics. 68Ga-PSMA-11 PET/CT studies were interpreted as positive for prostate cancer presence in 44 of 50 (88%) patients by the reference readers: local tumor was present in 9 patients (18%); 30 patients (60%) had lymph node (N)–positive disease, whereas 15 (30%) and 6 (12%) were staged as bone (Mb) and organ (Mc) positive, respectively.
Image Interpretation: Interobserver Agreement
The interobserver agreement for visual image interpretation is shown in Figure 1A and Table 3. Highly experienced observers agreed substantially or almost perfectly for all categories (T, N, Mb, Mc). Intermediate- and low-experienced observers provided substantially or almost-perfectly reproducible assessments for the T, N, and Mb categories and N and Mb categories, respectively.
Interobserver agreement was analyzed separately for patients with biochemical recurrence or persistence after primary definitive treatment: high-experienced observers agreed substantially or almost perfectly for all categories (T, N, Mb, Mc) whereas intermediate- and low-experienced observers agreed substantially or almost perfectly only for the N and Mb categories, and Mb category, respectively.
Image Interpretation: Comparison to SOR
Median overall agreement with SOR for T, N, Mb, and Mc staging was 69% (range, 48%–84%) for the entire group of observers. High- or intermediate-experienced observers performed significantly better than low-experienced observers for T, N, Mb, and Mc staging (median, 76% or 66% vs. 54%, P = 0.041; Fig. 1B).
Table 4 summarizes sensitivity and specificity for the entire group and separately for low-, intermediate-, or high-experienced observers, each stratified by staging category. All observer groups were highly sensitive in detection of local tumor. However, on the basis of a higher rate of false-positive local findings, median specificity was significantly lower for observers with low versus intermediate or high experience (73% vs. 88% and 93%, P = 0.032). For lymph node and bone metastases, performance compared with the SOR was almost identical among observer groups (all P > 0.05). In assessing organ metastases, sensitivity was slightly higher for high-experienced observers (median, 58%) versus observers with intermediate or low experience (median, 50%).
Three patient examples of a low degree of observer agreement are given in Figure 2. Notable sources of disagreement were, among others, false-negative findings due to low 68Ga-PSMA-11 uptake and false-positive findings due to 68Ga-PSMA-11 uptake in benign entities (Table 2). For instance, low 68Ga-PSMA-11 uptake (SUVmax < 5) in lymph node metastases resulted in false-negative findings in 3 patients for 4 (25%), 8 (50%), and 13 (81%) observers, respectively (Fig. 2A). Degenerative or posttraumatic bone uptake resulted in false-positive Mb stage in 4 patients for 2 (13%), 2 (13%), 11 (69%), and 13 (81%) observers, respectively. Hepatic metastases resulted in false-negative Mc stage in 2 patients for 11 (69%) and 12 (75%) observers, respectively. Metastases to the thyroid cartilage and to the penis in 2 patients were missed by 9 (56%) and 16 (100%) observers, respectively, resulting in false-negative Mc stage. Celiac ganglia with high 68Ga-PSMA-11 uptake in 2 patients resulted in false-positive N stage by 1 (6%) and 9 (56%) observers, respectively.
Semiquantitative Measurements
Interobserver agreement including mean Δ differences for SUV measurement is given in Table 5. Agreement was almost perfect for SUVmax of local tumor, lymph node, and bone metastases. Agreement was not associated with tumor lesion uptake (ICC, 1.00 for SUVmax < 10; 0.94 for 10 ≤ SUVmax < 20; 0.98 for SUVmax ≥ 20). SUVmax and SUVmean of mediastinal blood pool and muscle were highly reproducible. Figure 3 illustrates agreement among individual SUV measurements.
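The ICC form named in the Methods (2-way model, absolute agreement, average measures) can be computed from the mean squares of a 2-way ANOVA without replication as ICC(A,k) = (MS_rows − MS_error) / (MS_rows + (MS_columns − MS_error) / n), where rows are lesions and columns are observers. The following pure-Python sketch is an illustration under that standard formula, not the SPSS routine used in the study. The function name and the SUVmax values are hypothetical.

```python
def icc_a_k(scores):
    """ICC(A,k) for a targets-by-observers table (list of equal-length rows)."""
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need >= 2 targets and >= 2 observers, with complete rows")

    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares for targets (rows), observers (columns), and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    denominator = ms_rows + (ms_cols - ms_error) / n
    if denominator == 0:
        raise ValueError("ICC is undefined for constant data")
    return (ms_rows - ms_error) / denominator


# Hypothetical SUVmax readings: 4 lesions (rows) measured by 3 observers (columns).
suvmax = [
    [8.1, 8.3, 8.0],
    [15.2, 15.0, 15.6],
    [22.4, 22.9, 22.1],
    [5.0, 5.2, 4.9],
]
print(round(icc_a_k(suvmax), 3))
```

Because the absolute-agreement form includes the observer (column) mean square in the denominator, a systematic offset between observers lowers the coefficient even when their rankings of lesions agree.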
Overall, observers with high or intermediate experience fulfilled our criteria for acceptable performance, whereas observers with low experience did not, based on moderate agreement for local staging.
DISCUSSION
This prospective study on 50 68Ga-PSMA-11 PET/CT scans demonstrated that readings are highly reproducible for high- and intermediate-experienced observers. Observers in the low-experience group provided highly reproducible reads for bone metastases but achieved lower agreement for local tumor, lymph node, and organ metastases assessments.
Semiquantitative analyses of tumor lesions and background activity were highly reproducible for all levels of observer experience. On the basis of our predefined criteria, we recommend initial training on at least 30 representative patient cases to reach acceptable diagnostic performance for clinical and research interpretations of 68Ga-PSMA-11 PET/CT scans. Training cases should include routine findings (ranging from unremarkable to extensive disease) and typical pitfalls, such as PET-positive ganglia or degenerative/posttraumatic bone lesions.
Interobserver agreement is an important aspect of clinical applicability. 68Ga-PSMA-11 PET/CT scan interpretation is not without pitfalls: PSMA expression has been observed in tissues other than prostate cancer. Common examples are ganglia, hemangioma, Paget’s bone disease, and other benign and malignant tumors (26–32). Sources of misinterpretations include normal and variable PSMA ligand uptake due to background activity in salivary glands, liver, spleen, small intestine, colon, and kidney or in the urinary system. In general, the list of false-positive pitfalls is still evolving, prompting any clinician to stay vigilant with the current literature. Visceral metastases to the liver can occasionally exhibit no to low uptake, which cannot always be differentiated reliably from background activity. Approximately 5%–10% of all primary prostate cancers as well as their metastases do not exhibit significant PSMA expression (11,33), stressing the importance of reader experience for interpretation of the PET/CT study.
To reduce error rates, reported studies used consensus readings by multiple physicians (7–9,11,34,35). However, this does not solve the issue of observer variability in the clinical setting. The current cases, selected from 2 databases, contained a considerable proportion of pitfalls (Table 2). This approach was chosen to also challenge readers with difficult cases. Despite this additional level of difficulty, readers with intermediate and high experience levels achieved substantial to almost-perfect agreement for all clinically relevant categories.
Intermediate- and low-experienced observers demonstrated substantial or almost-perfect agreement for the N and Mb categories. This may be a result of high tumor-to-background uptake for 68Ga-PSMA-11 and basic understanding of common metastatic pathways. False-positive findings for local involvement with potential implications for management, such as substantial changes of a salvage radiation therapy plan, occurred more often in the low-experience group. Thus, observers with low experience (<30 previous 68Ga-PSMA-11 PET/CT readings) showed only moderate interobserver agreement for T staging with somewhat reduced specificity. Indeed, the judgment of local tumor can be challenging: small recurrences frequently occur near the base of the bladder, where excreted tracer can overlay the signal, and background uptake in normal prostate, especially in benign hypertrophy and after local radiation therapy, decreases the signal-to-noise ratio (13,35,36).
Agreement for Mc staging was lower than for T, N, and Mb for all observer groups. In particular, intermediate- and low-experienced observers exhibited only fair to moderate agreement. This is likely due to a low number of observations (6 patients were true positive) combined with the relatively high portion of pitfalls: metastasis in the thyroid cartilage was missed by more than half of observers, especially those with intermediate and low experience. In general, false-negative visceral findings were triggered by reader bias due to low incidence (e.g., 5% in patients with biochemical recurrence (8)) and absent or low PSMA expression (37–39) impeding 68Ga-PSMA-11 PET interpretation.
Observer agreement levels of the current study are in line with PET procedures using high-affinity radioligands. Two recent studies reported almost-perfect reproducibility for 68Ga-DOTATATE PET/CT interpretations (κ = 0.82 and 0.80) in patients with neuroendocrine tumors (40,41). Thus, interpretations of radioligand PET/CT studies in patients with neuroendocrine and prostate cancer, respectively, are equally robust. 68Ga-DOTATATE and 68Ga-PSMA-11 PET/CT are characterized by specific and high tumor signal. These hallmarks contribute to a high level of reader agreement even after a short training period.
The present study has several limitations. First, observers were grouped on the basis of experience with 68Ga-PSMA-11 PET/CT interpretation. However, the skill of a reader is determined by multiple factors, including clinical knowledge and general experience in imaging of prostate cancer. This may have led to a relatively broad variance in overall agreement, for example, observed for the low-experienced observers in our study (Fig. 1B). Second, the sensitivities reported might be overestimated because it is difficult to identify false-negative lesions especially in the setting of recurrence when histologic validation is image driven. Third, lymph node metastases within versus outside the pelvis were not separated in our staging system, which was organ-focused to analyze findings based on their PET/CT appearance. American Joint Committee on Cancer staging focuses on patient prognosis and thus discriminates intra- from extrapelvic lymph node metastases. Fourth, intraobserver agreement was not assessed, which might have given insight into reliability and confidence for individual judgments. However, applicability of our findings is supported by selection of representative patients and pitfalls as well as inclusion of a high number of observers from Europe, the United States, Asia, and Australia.
CONCLUSION
Both visual and semiquantitative 68Ga-PSMA-11 PET/CT interpretations in prostate cancer patients are highly reproducible among observers with intermediate and high experience. Our findings indicate acceptable reader performance after initial training on at least 30 representative patient cases.
DISCLOSURE
This study was partially funded by the U.S. Department of Energy, Office of Science Award DE-SC0012353. Wolfgang Peter Fendler received a scholarship from the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG, grant 807122). Jeremie Calais received a grant from the Fondation ARC pour la recherche sur le cancer (grant no. SAE20160604150). No other potential conflict of interest relevant to this article was reported.
Footnotes
Published online Apr. 13, 2017.
- © 2017 by the Society of Nuclear Medicine and Molecular Imaging.
- Received for publication January 26, 2017.
- Accepted for publication April 3, 2017.