Abstract
Many studies demonstrate a high accuracy for PET in staging lymphoma, but few assess observer variation. This study quantified agreement for staging lymphoma with PET/CT. Methods: The PET/CT images of 100 patients with lymphoma who had been referred for staging were reviewed by 3 experienced observers, with 2 observers reviewing each series a second time. Ann Arbor stage and individual nodal and extranodal regions were assessed. Weighted κ (κw) and intraclass correlation coefficient were used to compare ratings. Results: Intra- and interobserver agreement was high for Ann Arbor stage (κw = 0.79–0.91), number of nodal regions involved (intraclass correlation coefficient, 0.83–0.93), and presence of extranodal disease (κ = 0.74–0.86). High agreement was also observed for all nodal regions (κw > 0.60) except hilar (κw = 0.56–0.82) and infraclavicular (κw = 0.14–0.55). Lower agreement was observed for bowel involvement (κw = 0.37–0.71). Conclusion: Experienced observers had a high level of agreement using PET/CT for lymphoma staging, supporting its use as a robust noninvasive staging tool. Further research is needed to evaluate observer variability for restaging during and after chemotherapy.
Many studies have assessed the sensitivity and specificity of PET and PET/CT for staging lymphoma (1), but few have analyzed variation between observers. Although a high level of reproducibility does not necessarily equate to high accuracy, low levels of reproducibility cannot be associated with high accuracy. Observer variation can be substantial, and differences between observers can outweigh purported differences between imaging techniques (2). Assessment is particularly pertinent with the newer imaging modalities that generate hundreds of images for review from a single patient encounter. Quantifying interpretative variability is important, especially in view of increasing PET/CT use in multicenter trials designed to establish how functional imaging results can be used to alter patient management.
This study quantified both intra- and interobserver variation for staging lymphoma with PET/CT. Primary outcome variables were agreement on Ann Arbor stage, number of nodal regions, and presence of extranodal involvement. Secondary outcome variables were agreement on specific nodal and extranodal sites.
MATERIALS AND METHODS
Patients
One hundred consecutive PET/CT studies of patients with biopsy-proven lymphoma who underwent PET/CT staging before therapy were reviewed. The identity of all patients was masked, and no history or correlative imaging was provided. The observers used a standardized form and rated individual nodal groups as negative, equivocal, or positive for disease. The negative category included inflammatory, reactive, or other benign etiologies. Nodal regions, as defined at the Rye Symposium in 1965, included cervical (including supraclavicular, occipital, and preauricular), axillary, infraclavicular, mediastinal, hilar, periaortic, mesenteric, pelvic, and inguinal or femoral. Extranodal sites included spleen, bone or bone marrow, lung, liver, bowel (including gastric), and other (including sites such as muscle, subcutaneous tissue, and breast). Ann Arbor stage was also assigned according to the sixth edition classification of the American Joint Committee on Cancer (3).
Three observers reviewed the same 100-patient series. Two of these observers reviewed the series a second time, in a different order, to assess intraobserver variability; the 2 reviews were separated by several weeks to reduce the effect of memory. Thus, 500 reviews were conducted in total (3 first reads and 2 repeat reads of 100 patients each). All observers were experienced in reporting PET/CT. Two were nuclear medicine physicians, and one was a radiologist. The observers had between 8 and 17 y of experience with PET reporting, and all had 3 y of experience with PET/CT reporting in a unit performing approximately 4,000 studies per year, of which 800 were for lymphoma.
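The reading design can be tallied in a few lines. This is only an illustrative sketch: the observer labels A, B, and C are hypothetical, and which 2 of the 3 observers performed the repeat read is an assumption made for the example.

```python
from itertools import combinations

N_PATIENTS = 100
observers = ["A", "B", "C"]   # hypothetical labels for the 3 observers
repeat_readers = ["A", "B"]   # assumption: which 2 observers re-read the series

# One entry per complete pass through the 100-patient series.
passes = [(obs, 1) for obs in observers] + [(obs, 2) for obs in repeat_readers]
total_reviews = len(passes) * N_PATIENTS

# Interobserver comparisons pair different observers;
# intraobserver comparisons pair the 2 reads of the same observer.
inter_pairs = list(combinations(observers, 2))
intra_pairs = [((obs, 1), (obs, 2)) for obs in repeat_readers]

print(total_reviews, len(inter_pairs), len(intra_pairs))
```

Under these assumptions the design yields 3 interobserver and 2 intraobserver comparisons per variable.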
The characteristics of the study population are summarized in Tables 1 and 2. Patient age ranged from 11 to 80 y (median, 53 y). The study population included patients with high-grade non-Hodgkin lymphoma (57%), Hodgkin lymphoma (32%), follicular lymphoma (9%), and other subtypes (2%). The average number of involved nodal regions was 4.2, with extranodal involvement in 45% of patients. The proportions of patients with Ann Arbor stages 1, 2, 3, and 4 were 15.6%, 22.8%, 28.2%, and 33.4%, respectively.
PET/CT Acquisition
The studies were performed from skull base to upper thighs 90 min after injection of 18F-FDG on a dual-modality PET/CT scanner (Discovery ST; GE Healthcare). The images were acquired in 2-dimensional mode and were reconstructed with an iterative technique.
PET/CT Interpretation
All cases were reviewed using the volume display of a Hermes workstation (Nuclear Diagnostics). Viewing conditions such as the physical environment, monitor brightness, and background lighting were not standardized. All images were scaled to a standardized uptake value upper threshold of 10 using a gray scale. Areas of increased 18F-FDG uptake not considered physiologic were generally reported as areas of nodal or extranodal involvement. Correlative low-dose CT findings were incorporated for anatomic localization and further differentiation between physiologic, inflammatory, and lymphomatous etiologies.
Our unit has used a reporting routine of 2 observers who read the scans independently. If there is disagreement, a third observer issues the consensus report. This routine is likely to result in similar thresholds for reporting, as reinforced by feedback from multidisciplinary meetings of hematologists, oncologists, and pathologists.
Statistical Analysis
Levels of agreement were quantified using weighted κ (κw) (4,5) for categorical ratings and the intraclass correlation coefficient for continuous measures (6). The weights gave zero credit to the most extreme possible discrepancy between 2 ratings and the highest partial credit to the least discrepant pairs of ratings. For analysis, a conservative interpretative threshold was used, with nodal and extranodal regions assigned a value of 0 if rated negative or equivocal and 1 if rated positive. For nodal stations, left- and right-sided scores were added together; that is, a value of 2 was assigned if both sides were involved. κ-values are reported using the benchmarks of Landis and Koch (7) (with 0.81–1 being almost perfect agreement; 0.61–0.8, substantial agreement; 0.41–0.6, moderate agreement; 0.21–0.4, fair agreement; 0.01–0.2, slight agreement; and ≤0, poor agreement). Bootstrapping was used to calculate 95% confidence intervals (8). Analyses were performed using the statistical package Stata (version 9.2; Stata Corp).
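The scoring and agreement statistics described above can be sketched as follows. This is an illustration, not the Stata analysis used in the study. Several details are assumptions: the weights are taken to be linear (1 − |i − j|/(k − 1)), which matches the description of zero credit for the most extreme discrepancy but is not confirmed by the text; the confidence interval uses a simple percentile bootstrap over patients; and the function names and the 8 toy scores are hypothetical.

```python
import math
import random

def station_score(left, right):
    """Sum left and right sides; only a 'positive' rating counts as disease."""
    return int(left == "positive") + int(right == "positive")

def weighted_kappa(first, second, n_categories):
    """Linearly weighted kappa for two lists of integer scores 0..n_categories-1."""
    k, n = n_categories, len(first)
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(first, second):
        counts[a][b] += 1
    row = [sum(counts[i]) / n for i in range(k)]
    col = [sum(counts[i][j] for i in range(k)) / n for j in range(k)]

    def weight(i, j):
        # Assumed linear weights: 1 on the diagonal, 0 for the extreme discrepancy.
        return 1 - abs(i - j) / (k - 1)

    cells = [(i, j) for i in range(k) for j in range(k)]
    p_obs = sum(weight(i, j) * counts[i][j] / n for i, j in cells)
    p_exp = sum(weight(i, j) * row[i] * col[j] for i, j in cells)
    if 1 - p_exp < 1e-12:
        return math.nan  # undefined when both readers use a single category
    return (p_obs - p_exp) / (1 - p_exp)

def bootstrap_ci(first, second, n_categories, n_boot=1000, seed=0):
    """Percentile 95% interval, resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(first)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = weighted_kappa([first[i] for i in idx],
                               [second[i] for i in idx], n_categories)
        if not math.isnan(value):
            stats.append(value)
    stats.sort()
    upper = min(len(stats) - 1, int(0.975 * len(stats)))
    return stats[int(0.025 * len(stats))], stats[upper]

# Toy station scores (0, 1, or 2 involved sides) for 8 hypothetical patients.
read_1 = [0, 0, 1, 1, 2, 2, 0, 2]
read_2 = [0, 0, 1, 2, 2, 2, 1, 2]
kw = weighted_kappa(read_1, read_2, 3)
ci_low, ci_high = bootstrap_ci(read_1, read_2, 3)
print(round(kw, 3), round(ci_low, 3), round(ci_high, 3))
```

For a binary variable such as presence of extranodal disease, calling `weighted_kappa` with `n_categories=2` reduces to the unweighted κ, because the only possible disagreement receives zero credit.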
Postanalysis Review
For those variables with the lowest κ-agreement, the 3 observers were asked to review and reach a consensus without any further information and then to review again with the aid of relevant clinical history and correlative imaging. The observers were also asked to postulate the main reasons for the observer disagreement. The instances of disagreement included 18 patients with infraclavicular nodal involvement and 8 patients with bowel involvement.
RESULTS
Results are summarized in Table 3. Intra- and interobserver agreement was high for overall Ann Arbor stage (κw = 0.79–0.91). For intraobserver agreement, 95% confidence intervals were within the “almost perfect” range of κ-values, whereas for interobserver agreement, there was crossover into the substantial-agreement range. Similarly, agreement was high for the total number of nodal groups involved (intraclass correlation coefficient, 0.83–0.93) and for presence of extranodal involvement (intra- and interobserver κw = 0.82–0.84 and 0.74–0.86, respectively).
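For a count such as the number of involved nodal regions, an intraclass correlation coefficient can be computed from a two-way analysis of variance. The sketch below assumes the two-way random-effects, absolute-agreement, single-rating form, often written ICC(2,1); the text does not state which form was used, and the function name and toy counts are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a table with one row per patient and one column per reading."""
    n = len(ratings)       # patients
    k = len(ratings[0])    # readings per patient
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: patients, readings, residual.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Toy data: nodal-region counts from 2 readings of 5 hypothetical patients.
toy_counts = [[2, 2], [4, 5], [6, 6], [1, 1], [8, 7]]
icc = icc_2_1(toy_counts)
print(round(icc, 3))
```

Because this form penalizes systematic offsets between readings as well as random scatter, a reader who consistently counts one more region than another would lower the coefficient even if the patient ranking were identical.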
For specific nodal groups, there was substantial or greater agreement for cervical (κw = 0.77–0.86), axillary (0.69–0.80), pelvic (0.65–0.82), inguinal (0.69–0.82), mediastinal (0.73–0.77), periaortic (0.75–0.81), and mesenteric (0.61–0.67) nodal groups. Agreement was lower for hilar (0.56–0.82) and infraclavicular (0.14–0.55) nodal groups. For extranodal involvement, there was substantial or greater agreement for spleen (0.69–0.84) and bone marrow (0.76–0.94). Agreement was lower for lung (0.58–0.90), liver (0.59–0.95), and bowel (0.37–0.71). The higher κ-values in these ranges reflect greater intraobserver agreement.
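To relate the reported ranges to the Landis and Koch benchmarks quoted in the statistical analysis, a small lookup helper is enough. This sketch is illustrative; the function name is hypothetical, and values falling between the printed bins (for example, 0.805) are assigned to the higher category, which is an assumption because the published bins are stated to 2 decimal places.

```python
def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch descriptive category."""
    if kappa <= 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

# Ranges taken from the results above for the sites with lower agreement.
reported = {
    "hilar": (0.56, 0.82),
    "infraclavicular": (0.14, 0.55),
    "bowel": (0.37, 0.71),
}
labels = {site: (landis_koch(lo), landis_koch(hi)) for site, (lo, hi) in reported.items()}
for site, (low_label, high_label) in labels.items():
    print(f"{site}: {low_label} to {high_label}")
```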
In the postanalysis review of patients with variability of infraclavicular nodal classification, all patients had distant disease and the variability did not change the overall stage. On consensus review, the variability was clearly due to variable definition of the boundary between infraclavicular nodes and adjacent nodal regions, including medial axillary and supraclavicular (Fig. 1). Review of the 8 patients with disagreement on bowel involvement indicated that 2 were related to gastric involvement and 6 to large-bowel involvement. On consensus review, all these patients had either stage 3e or 4 disease. In the cases of large-bowel involvement, the disagreement was related to differentiating colonic involvement from adjacent mesenteric nodal disease (Fig. 2).
DISCUSSION
Few studies have assessed observer variability in lymphoma imaging. Fletcher et al. (9) evaluated interobserver variability for CT detection of cervical–thoracic Hodgkin disease. For individual nodal stations, agreement ranged from poor to moderate, with κ-scores ranging from 0.13 for left paratracheal nodes to 0.72 for right lower cervical nodes. Agreement between the majority of reviewers and the primary report was poor (κ < 0.40) for two thirds of the sites. Zijlstra et al. (10) measured observer variation for PET staging and restaging. For experts using sensitive and conservative models, concordance was 61% and 56%, respectively, for staging and 82% and 94%, respectively, for restaging.
In this study, intra- and interobserver agreement was high for Ann Arbor stage, number of nodal regions involved, and presence of extranodal disease. Most of the study patients (89%) had high-grade lymphoma, in which high glucose use results in high lesion-to-background contrast. This facilitates easy visual perception of active sites of disease and is further assisted by the use of contemporaneous CT, which allows precise anatomic correlation and the identification of physiologic variants or other pathologic processes.
The lower agreement for extranodal bowel involvement appears to be due largely to difficulty in differentiating bowel involvement from adjacent mesenteric nodal disease. This difficulty is not unexpected on imaging alone and is unlikely to alter the management strategy. Interpretation is limited by the low prevalence of patients with bowel involvement (<5.5%); κ-statistics also tend to weigh disagreements more heavily when the prevalence of a positive finding approaches zero (11).
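The prevalence effect can be illustrated with 2 hypothetical 2 × 2 tables that share the same raw agreement but differ in how often the finding is called positive. The counts below are invented for illustration and are not study data; the function name is likewise hypothetical.

```python
def cohen_kappa_2x2(both_pos, only_first, only_second, both_neg):
    """Return (observed agreement, kappa) for 2 readers and a binary finding."""
    n = both_pos + only_first + only_second + both_neg
    p_obs = (both_pos + both_neg) / n
    p_first = (both_pos + only_first) / n    # reader 1 positive rate
    p_second = (both_pos + only_second) / n  # reader 2 positive rate
    p_exp = p_first * p_second + (1 - p_first) * (1 - p_second)
    return p_obs, (p_obs - p_exp) / (1 - p_exp)

# Same 6 disagreements in 100 patients, with about 50% vs. about 5% prevalence.
common = cohen_kappa_2x2(47, 3, 3, 47)
rare = cohen_kappa_2x2(2, 3, 3, 92)
print(common, rare)
```

In both tables the readers agree on 94 of 100 patients, yet κ falls from about 0.88 to about 0.37 when the finding is rare, because agreement expected by chance is already about 0.9.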
For specific nodal regions, agreement was lowest for infraclavicular nodes, consistent with previous findings for CT staging (9). The cause was disagreement about the definitions of infraclavicular, supraclavicular, and axillary nodal regions, but this disagreement did not change the overall stage, and locoregional radiotherapy would not have been considered. There was also lower agreement about hilar nodes than about other nodal stations. Hilar 18F-FDG uptake is not uncommon, because of inflammatory changes secondary to a reactive or granulomatous process (12). In staging lung carcinoma with PET/CT, we have previously described lower agreement in the hilar region than in mediastinal nodal stations (13). Caution is thus warranted in interpretation of hilar activity, especially if the intensity is discordant with uptake at other sites.
This study had several potential sources of error. The form-based method of data collection may have reduced errors by ensuring systematic review and standardization of terminology. PET/CT findings were not compared with histology or patient follow-up, and therefore no comment can be made on accuracy. Other studies, however, have demonstrated high sensitivity and specificity (1). In this study, reviewers were unaware of clinical history and were not provided with correlative imaging results, a departure from routine clinical practice. Availability of this information, however, would likely result in higher agreement. For example, knowledge of lymphoma subtype would help the reviewer refine the expected distribution of disease, likely improving agreement.
This study was limited by the use of experienced PET/CT observers from a single center. As such, they could be expected to show a significant degree of concordance in their approach, especially as the center adopts a dual-reporting system. The findings are still of interest, as multiinstitutional trials increasingly use a central core laboratory for reporting. It would be useful to extend the study to assess variation across different centers and also investigate agreement with less experienced observers or trainees.
Our study did not assess observer agreement for restaging lymphoma after chemotherapy. Observer agreement may be lower in these patients because the intensity of 18F-FDG uptake may be low. Although the Imaging Subcommittee of the International Harmonization Project in Lymphoma has published positivity criteria for restaging after completion of chemotherapy (14), defining positivity when using PET for restaging during a course of therapy is less well established. Defining appropriate criteria and assessing observer variability in these patients warrant further investigation.
CONCLUSION
Among experienced physicians in a single center, there was a high level of intra- and interobserver agreement using PET/CT for lymphoma staging. This result complements the results of other studies demonstrating a high accuracy and supports the use of PET/CT as a robust noninvasive staging tool. Further research is needed to evaluate observer variability for restaging during and after completion of chemotherapy.
Footnotes
- Received for publication March 12, 2009.
- Accepted for publication July 10, 2009.
- COPYRIGHT © 2009 by the Society of Nuclear Medicine, Inc.