Abstract
The ability of PET with 18F-FDG to evaluate bone marrow infiltration in patients with lymphoma has been a matter of extensive investigation with controversial results. Therefore, we aimed to evaluate systematically, with a meta-analysis, the diagnostic performance of 18F-FDG PET in this setting. Methods: Relevant studies were identified with MEDLINE and EMBASE searches (last update, August 2004). Data on the diagnostic performance of 18F-FDG PET were combined quantitatively across eligible studies. We estimated weighted summary sensitivities and specificities, summary receiver-operating-characteristic (SROC) curves, and weighted summary likelihood ratios. We also conducted separate analyses according to various subgroups. Bone marrow biopsy (BMB) was used as the reference standard. Results: Thirteen eligible nonoverlapping studies, which enrolled a total of 587 patients, were included in the meta-analysis. The independent random-effects weighted estimates of sensitivity and specificity against BMB were 51% (95% confidence interval [CI], 38%–64%) and 91% (95% CI, 85%–95%), respectively. Results were consistent in the SROC curve: a sensitivity of 51% corresponds to a specificity of 92%, whereas a specificity of 91% corresponds to a sensitivity of 55%. The weighted positive likelihood ratio (LR+) was 5.75 (95% CI, 3.48–9.48) and the negative likelihood ratio (LR−) was 0.67 (95% CI, 0.55–0.82). Six of 12 patients with positive 18F-FDG PET and negative initial biopsy were found to have bone marrow involvement when biopsy was performed at the sites with positive imaging signals. Subgroup analyses showed better sensitivity in patients with Hodgkin’s disease and in aggressive histologic types of non-Hodgkin’s lymphoma than in patients with less aggressive histologic types and in studies using unilateral BMB compared with those using bilateral biopsy.
Conclusion: This meta-analysis showed that 18F-FDG PET has good, but not excellent, concordance with the results of BMB for the detection of bone marrow infiltration in the staging of patients with lymphoma. 18F-FDG PET may complement the results of BMB and its performance may vary according to the type of lymphoma.
PET with the radiolabeled glucose analog 18F-FDG has been increasingly used in the evaluation of several malignant tumors, including lymphoma (1–4). One of the most promising applications in lymphoma is to determine the clinical stage of the disease at initial presentation or recurrence (5,6). The ability of 18F-FDG PET to evaluate both nodal and extranodal sites such as spleen, liver, and bone marrow has been a matter of extensive investigation. In particular, bone marrow infiltration is of crucial importance in staging of lymphoma, since it signifies advanced-stage disease and, thus, may affect both treatment and prognosis. Bone marrow biopsy (BMB) is the established method for the detection of bone marrow infiltration. However, BMB is a painful procedure. Moreover, sometimes only a small sample can be obtained, which may be inconclusive. Several studies have been conducted to date addressing the ability of 18F-FDG PET to evaluate bone marrow infiltration in staging of lymphoma (7–19). However, because of small sample sizes, these studies have individually been unable to establish the diagnostic accuracy of 18F-FDG PET, and a rigorous quantitative synthesis is needed. Therefore, we undertook a meta-analysis of all available studies to address the diagnostic performance of 18F-FDG PET in evaluating bone marrow infiltration in the staging of patients with primary lymphoma or recurrent lymphoma after complete remission.
MATERIALS AND METHODS
Identification and Eligibility of Relevant Studies
We considered studies examining the performance of 18F-FDG PET as a diagnostic test for detecting bone marrow infiltration in the initial staging or staging of recurrent disease before treatment in lymphoma. We considered all relevant studies that included patients with and without bone marrow infiltration according to biopsy results and that had a total sample size of at least 5 patients. Studies were included in our meta-analysis regardless of the type of lymphoma (Hodgkin’s disease [HD], non-Hodgkin’s lymphoma [NHL]). We excluded studies using 18F-FDG PET for evaluation of recurrences after treatment. We also excluded studies in which the selection to perform one test (BMB or 18F-FDG PET) was based on the results of the other test, since this would entail clear verification bias.
We conducted MEDLINE and EMBASE searches (last update, August 2004). The search strategy was based on the combination of the terms (a) PET, positron emission tomography, 18F-FDG, or fluorodeoxyglucose; (b) lymphoma, Hodgkin disease, or non-Hodgkin lymphoma; and (c) diagnosis or staging. Searches were limited to human subjects.
References of retrieved articles were also screened for additional studies. Investigators of eligible studies were contacted and asked to supply additional data when key information relevant to the meta-analysis was missing. Whenever reports pertained to overlapping patients, we retained only the largest study to avoid duplication of information. We set no language restrictions.
Data Extraction
Two investigators extracted data from eligible studies independently, discussed discrepancies, and reached consensus for all items with the help of a third investigator. We extracted data on characteristics of studies and patients, measurements performed, and results. In each report, we recorded author names, journal and year of publication, country of origin, years of patient enrollment, number of eligible patients, number of patients analyzed, reasons for exclusions from the analysis, study design (prospective, retrospective, or unclear), type of lymphoma (HD, NHL), histologic type, disease status (primary or recurrent), inclusion and exclusion criteria, demographic characteristics of patients, stage of lymphoma, location of BMB and whether it was unilateral or bilateral, time of 18F-FDG PET (before or after biopsy), technical characteristics of 18F-FDG PET, definition of positive 18F-FDG PET test (qualitative or quantitative methods), and number of experts who assessed and interpreted the results of 18F-FDG PET and biopsy. We also recorded whether there was any mention of blinding of 18F-FDG PET measurements to the BMB results and vice versa and whether any data were given on inter- or intraobserver variability.
For each report, we recorded the number of true-positive, false-positive, true-negative, and false-negative findings for 18F-FDG PET in diagnosing bone marrow infiltration, using BMB as the reference standard. These terms are used by convention to denote the concordance of the 2 diagnostic tests, since it is unlikely that BMB is a perfect gold standard. We also recorded whether a new local BMB had been performed at a site with positive 18F-FDG PET, whenever the initial BMB was negative. Rebiopsy results were not considered in the main analysis but were analyzed separately. We also recorded separate data for HD and NHL and for primary and recurrent lymphoma, whenever these data were available.
Statistical Analysis
Data on the diagnostic performance of 18F-FDG PET were combined quantitatively across eligible studies. Three approaches were used. First, we combined independently sensitivities and specificities across studies. Between-study heterogeneity was assessed with the Fisher exact test. We estimated the weighted sensitivities and specificities using a random-effects model that incorporated between-study heterogeneity. Second, we constructed summary receiver-operating-characteristic (SROC) curves. Third, we estimated the weighted positive and negative likelihood ratio (LR+, LR−) across studies using random-effects calculations.
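The first approach can be sketched in a few lines. The text does not state which random-effects estimator was used, so the DerSimonian-Laird estimator on the logit scale, the 0.5 continuity correction, and the study counts below are all assumptions made only for illustration; the function name is hypothetical.

```python
import math

def expit(x):
    """Inverse of the logit transform."""
    return 1.0 / (1.0 + math.exp(-x))

def pool_random_effects(events, totals, z=1.96):
    """DerSimonian-Laird pooling of proportions on the logit scale.

    events[i] / totals[i] is the proportion in study i, for example
    true positives over biopsy-positive patients for sensitivity.
    A 0.5 continuity correction is added to both cells of every study
    (an assumption of this sketch, not a detail taken from the paper).
    Returns (pooled proportion, lower CI, upper CI, tau squared).
    """
    y, v = [], []
    for x, n in zip(events, totals):
        a, b = x + 0.5, n - x + 0.5
        y.append(math.log(a / b))      # logit of the corrected proportion
        v.append(1.0 / a + 1.0 / b)    # approximate variance of that logit
    w = [1.0 / vi for vi in v]         # fixed-effect (inverse-variance) weights
    sw = sum(w)
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    # Cochran's Q and the method-of-moments between-study variance.
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))
    c = sw - sum(wi ** 2 for wi in w) / sw
    k = len(y)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau squared to each within-study variance.
    w_star = [1.0 / (vi + tau2) for vi in v]
    sws = sum(w_star)
    y_re = sum(wi * yi for wi, yi in zip(w_star, y)) / sws
    se = math.sqrt(1.0 / sws)
    return expit(y_re), expit(y_re - z * se), expit(y_re + z * se), tau2

# Hypothetical counts, not taken from the included studies.
true_positives = [4, 10, 2, 7, 5]
biopsy_positive = [10, 12, 9, 14, 8]
pooled, lower, upper, tau2 = pool_random_effects(true_positives, biopsy_positive)
print(f"pooled sensitivity {pooled:.2f} (95% CI {lower:.2f}-{upper:.2f}), tau^2 {tau2:.3f}")
```

When the studies agree closely, the estimated between-study variance is zero and the result coincides with a fixed-effect estimate; larger heterogeneity widens the confidence interval.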
For a diagnostic or predictive test, the sensitivity (true-positive rate) and specificity (1 − false-positive rate) are related to each other; therefore, it is not strictly correct to estimate these 2 quantities independently. To bypass this problem, one may use the SROC method. The SROC curve is estimated by the regression D = a + bS, where D is the difference of the logits of the true-positive and false-positive rates and S is the sum of these logits (20). Both weighted and unweighted regressions were estimated. The SROC curve shows the trade-off between sensitivity and specificity across the included studies.
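A minimal sketch of the unweighted version of this regression follows. The 0.5 added to each cell, the function names, and the 2 × 2 counts are assumptions for illustration only; they are not the data or the software used in this meta-analysis.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_sroc(tables):
    """Unweighted least-squares fit of D = a + b*S.

    tables: list of (tp, fp, fn, tn) tuples, one per study. 0.5 is added
    to every cell (an assumption of this sketch) so that the logits stay
    finite when a cell is zero.
    """
    d_vals, s_vals = [], []
    for tp, fp, fn, tn in tables:
        tpr = (tp + 0.5) / (tp + fn + 1.0)
        fpr = (fp + 0.5) / (fp + tn + 1.0)
        d_vals.append(logit(tpr) - logit(fpr))   # D: difference of logits
        s_vals.append(logit(tpr) + logit(fpr))   # S: sum of logits
    n = len(tables)
    s_mean = sum(s_vals) / n
    d_mean = sum(d_vals) / n
    sxx = sum((s - s_mean) ** 2 for s in s_vals)
    sxy = sum((s - s_mean) * (d - d_mean) for s, d in zip(s_vals, d_vals))
    b = sxy / sxx
    a = d_mean - b * s_mean
    return a, b

def sroc_sensitivity(a, b, specificity):
    """Sensitivity on the fitted curve at a given specificity.

    Solving D = a + b*S for logit(TPR) gives
    logit(TPR) = a / (1 - b) + ((1 + b) / (1 - b)) * logit(FPR).
    """
    fpr = 1.0 - specificity
    return expit(a / (1.0 - b) + (1.0 + b) / (1.0 - b) * logit(fpr))

# Hypothetical (tp, fp, fn, tn) counts, not taken from the included studies.
studies = [(5, 3, 5, 40), (8, 2, 4, 30), (2, 1, 8, 50), (6, 6, 6, 35)]
a, b = fit_sroc(studies)
print(f"a = {a:.2f}, b = {b:.2f}, "
      f"sensitivity at 90% specificity = {sroc_sensitivity(a, b, 0.90):.2f}")
```

A slope b close to zero means that D, the log diagnostic odds ratio, is roughly constant along the curve.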
Likelihood ratios are also metrics that combine both sensitivity and specificity in their calculation. LR+ is defined as the ratio of sensitivity over 1 − specificity, whereas LR− is defined as the ratio of 1 − sensitivity over specificity. When there is absolutely no discriminating ability for a diagnostic test, both likelihood ratios equal 1. The discriminating ability is better with higher LR+ and lower LR−. Although there is no absolute cutoff, a good diagnostic test may have LR+ above 5 and LR− below 0.2. Between-study heterogeneity in the likelihood ratios was assessed with the Q statistic (21) and was considered significant for P < 0.10 (22). We also estimated whether the LR+ and LR− were significantly different in small versus larger studies.
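These definitions translate directly into code. The helper name and the example values of sensitivity and specificity are hypothetical and serve only to illustrate the two reference points mentioned above.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a test with the given sensitivity and specificity."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg

# A test with no discriminating ability: both ratios equal 1.
print(likelihood_ratios(0.5, 0.5))   # (1.0, 1.0)

# A hypothetical test with 90% sensitivity and 90% specificity,
# which meets both rule-of-thumb thresholds (LR+ above 5, LR- below 0.2).
lr_pos, lr_neg = likelihood_ratios(0.9, 0.9)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")
```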
The main analysis combined all data regardless of the definition of 18F-FDG PET positivity, type of lymphoma (HD or NHL), disease status (primary or recurrent), type of biopsy (unilateral or bilateral), study design (prospective or retrospective), and blinding of each diagnostic test to the results of the other. However, subgroup analyses were also performed for each of these parameters.
Analyses were conducted in SPSS (SPSS, Inc.), Meta-Test (Joseph Lau, Boston, MA), and StatXact 3.0 (Cytel Inc.). P values are 2-tailed.
RESULTS
Eligible Studies
Twenty-six potentially eligible reports were retrieved. Of those, 5 were excluded because, although the authors stated that patients had undergone both 18F-FDG PET and BMB, the results of the biopsy were not reported and the authors did not respond to our attempts to contact them (23–27). Two studies were excluded because all patients had positive 18F-FDG PET results (28,29). One study (30) was excluded because all BMBs that were performed were negative. Five reports pertained to overlapping patients (8,31–34). We accepted the report with the largest sample size (8) and the remaining 4 were excluded from our analysis. Another report (35) also overlapped with a larger study (15) and was excluded. Of 2 reports from the same institution (18,19), where only one reported results on NHL (19) while both gave data on HD, we excluded the data on the smaller population of HD patients to avoid overlap (18). One study (11), including 21 patients at initial staging and 9 patients at restaging after treatment without providing separate results, was considered eligible for the meta-analysis. However, analyses excluding this study were also done and showed the same results (not shown).
Finally, 13 eligible nonoverlapping studies, which enrolled a total of 587 patients, were included in the meta-analysis (Table 1). The mean age of patients varied from 13 to 65 years across eligible studies. Seven studies included patients with primary disease (8,10,11,13,14,17,19), whereas the others included mixed populations with primary and recurrent lymphoma (7,9,12,15,16,18). Four studies recruited patients with HD (7,13,16,18), 3 studies had patients with NHL (10,15,19), and 6 studies had mixed populations (8,9,11,12,14,17). Iliac crest BMB was performed in 6 studies (7–9,11,17,19), whereas in 1 study the biopsy was either from the sternum or the iliac crest (12), and 6 studies did not report the location of the biopsy (10,13–16,18). The 18F-FDG dose varied considerably across studies (Table 1). The vast majority of studies (n = 11) used qualitative methods to evaluate the 18F-FDG PET scans (Table 1). Two studies (7,9) used quantitative methods based on standardized uptake values (SUVs), with cut-offs for positivity of 2.0 (7) and 2.5 (9). Nine studies reported blinding of 18F-FDG PET or BMB measurements to each other (8,10–12,15–19).
Data Synthesis
The sensitivity rates of 18F-FDG PET for identifying bone marrow infiltration ranged from 0% to 100% across the eligible studies (P = 0.014 for heterogeneity). The respective specificity rates ranged from 72% to 100% (P < 0.001 for heterogeneity). When all studies were considered, there were 50 patients with bone marrow infiltration and positive 18F-FDG PET findings, 53 patients with bone marrow infiltration identified as negative by 18F-FDG PET, 449 patients without bone marrow infiltration and negative 18F-FDG PET findings, and 35 patients without bone marrow infiltration identified as positive by 18F-FDG PET. The independent random-effects summary estimates of sensitivity and specificity were 51% (95% confidence interval [CI], 38%–64%) and 91% (95% CI, 85%–95%), respectively. In the SROC curve, the results were consistent with those obtained in the independent weighting of sensitivity and specificity: a sensitivity of 51% corresponded to a specificity of 92%, whereas a specificity of 91% corresponded to a sensitivity of 55% (Fig. 1). The slope of the regression of the SROC curve was negligible and nonsignificant, suggesting that the overall diagnostic performance was similar at different parts of the curve, after allowing for the trade-off between sensitivity and specificity. Likelihood ratio syntheses gave a weighted LR+ of 5.75 (95% CI, 3.48–9.48) and weighted LR− of 0.67 (95% CI, 0.55–0.82) without any statistically significant between-study heterogeneity for either metric (P > 0.10 for both). There was no evidence that the LR+ differed in small versus larger studies (τ correlation coefficient between the natural logarithm of the LR+ and the weight of each study = −0.03, P = 0.90). Conversely, there was some evidence that the LR− was less favorable in larger studies (τ correlation coefficient between the natural logarithm of the LR− and the weight of each study = 0.69; P = 0.001).
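For orientation, the aggregate counts given above can be pooled naively by summing across studies. This is only an arithmetic check and not the method used here: the summary estimates reported in this section weight each study in a random-effects model, so the crude figures below are not expected to reproduce them.

```python
# Aggregate counts across the 13 studies, as reported in the text.
tp, fn, tn, fp = 50, 53, 449, 35

total = tp + fn + tn + fp            # 587 patients
crude_sens = tp / (tp + fn)          # 50 / 103
crude_spec = tn / (tn + fp)          # 449 / 484
crude_lr_pos = crude_sens / (1.0 - crude_spec)
crude_lr_neg = (1.0 - crude_sens) / crude_spec

print(total)                         # 587
print(f"{crude_sens:.1%}")           # 48.5%
print(f"{crude_spec:.1%}")           # 92.8%
print(f"{crude_lr_pos:.2f}")         # 6.71
print(f"{crude_lr_neg:.2f}")         # 0.55
```

The crude values differ somewhat from the weighted summaries (51%, 91%, 5.75, and 0.67), which is expected when studies of different size and heterogeneous results are combined with study-level weights.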
Two studies (n = 130) reported secondary biopsy in seemingly false-positive patients (8,17). Six of 12 rebiopsied patients (50%; 95% binomial CI: 21%–79%) were actually found to have bone marrow involvement at the sites with positive PET signals. Including the results of the secondary biopsy in the reference standard, the weighted sensitivity and specificity of PET against BMB in these 2 studies became 74% (95% CI, 53%–88%) and 95% (95% CI, 72%–99%), respectively, and the weighted LR+ and LR− were 14.5 (95% CI, 2.15–98.1) and 0.29 (95% CI, 0.15–0.56). Replacing these 2 studies in the main analysis yielded summary sensitivity and specificity of 54% (95% CI, 40%–68%) and 92% (95% CI, 86%–96%), respectively, and the weighted LR+ and LR− became 6.45 (95% CI, 3.71–11.2) and 0.62 (95% CI, 0.49–0.79), respectively. Note that the 2 studies with available rebiopsy results already had a sensitivity of 65% for the PET against the first BMB results.
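The binomial interval for 6 of 12 rebiopsied patients can be checked with a short calculation. The sketch assumes that the interval is an exact (Clopper-Pearson) interval, which the text does not state explicitly; the function names are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def exact_binomial_ci(x, n, alpha=0.05):
    """Clopper-Pearson interval for x successes in n trials, by bisection."""

    def upper_tail(p):   # P(X >= x), increasing in p
        return sum(binom_pmf(k, n, p) for k in range(x, n + 1))

    def lower_tail(p):   # P(X <= x), decreasing in p
        return sum(binom_pmf(k, n, p) for k in range(0, x + 1))

    def bisect(func, target, increasing):
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2.0
            # Move the bracket toward the side where func crosses target.
            if (func(mid) < target) == increasing:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    lower = 0.0 if x == 0 else bisect(upper_tail, alpha / 2.0, True)
    upper = 1.0 if x == n else bisect(lower_tail, alpha / 2.0, False)
    return lower, upper

ci_low, ci_high = exact_binomial_ci(6, 12)
print(f"{ci_low:.0%} to {ci_high:.0%}")   # 21% to 79%
```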
Subgroup Analyses
The weighted rates showed significantly better sensitivity in studies with HD than in those with NHL patients. However, there were only 11 patients with positive BMB among cases with HD. For NHL, there was a clear difference in the sensitivity depending on the histologic type. On the basis of the available data, 18F-FDG PET identified 16 of 21 cases of bone marrow involvement (76.2%) from large lymphocytic, large B-cell, Burkitt, and centroblastic lymphocytic lymphomas, whereas it detected only 16 of 53 cases with bone marrow involvement (30.2%) from less aggressive histologic types (follicular, mantle cell, marginal zone, small lymphocytic lymphomas and mucosa-associated lymphoid tissue) (P < 0.001). There was also significantly better sensitivity in studies using unilateral BMB compared with those using bilateral biopsy, but this was also based on relatively sparse data (Table 2). No major subgroup differences were observed for prospective versus retrospective studies, studies with versus without reported blinding, and studies with qualitative versus quantitative PET measurements (Table 2).
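The comparison of detection rates by histologic type (16 of 21 versus 16 of 53) can be reproduced approximately with an exact test on the 2 × 2 table. The text does not say which test produced this P value, so the two-sided Fisher exact test below is an assumption, and the function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k):   # probability that the top-left cell equals k
        return comb(row1, k) * comb(row2, col1 - k) / denom

    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance so that ties with p_obs are included.
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_obs * (1.0 + 1e-9))

# Rows: aggressive versus less aggressive NHL histologic types.
# Columns: marrow involvement detected versus missed by 18F-FDG PET.
p_value = fisher_exact_two_sided(16, 5, 16, 37)
print(f"sensitivity {16 / 21:.1%} versus {16 / 53:.1%}, P = {p_value:.5f}")
```

Under this assumption, the P value falls below 0.001, in line with the value reported above.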
DISCUSSION
This meta-analysis including data from 587 patients showed that 18F-FDG PET has moderately good, but not excellent, concordance with the results of BMB for the detection of bone marrow infiltration in the staging of patients with lymphoma. Only about half of the patients with bone marrow infiltration detected in BMB were detected as positive by 18F-FDG PET. On the other hand, >90% of patients with a negative BMB will also have negative 18F-FDG PET. In fact, positive 18F-FDG PET in the presence of negative BMB often indicated missed bone marrow involvement that could be documented with a second BMB directed at the site of positive PET signal. On the basis of these findings, 18F-FDG PET cannot yet be recommended for replacing BMB routinely in the staging of lymphoma because many cases of bone marrow involvement would be missed. However, 18F-FDG PET could complement BMB and could occasionally identify additional cases of focal bone marrow involvement that would be missed by the BMB. It is essential to establish in future research whether this complementary information may have considerable impact on the prognosis of these patients. Also, the current meta-analysis did not address the accuracy of PET in restaged patients.
The differences that were observed in subgroup analyses could possibly be due to chance. However, 18F-FDG PET showed considerably variable sensitivity for the evaluation of bone marrow infiltration depending on the histologic type of lymphoma. Sensitivity was very good for HD, but very few patients with HD had bone marrow involvement in our accumulated sample, so this encouraging finding has to be verified in a larger number of patients with various levels of bone marrow involvement. Conversely, the aggregate sensitivity was modest in NHLs. Overall, the rates of bone marrow involvement are reported to be higher in NHL compared with HD (36–38). Scrutiny of the available data showed that sensitivity was actually very good for detection of bone marrow disease when aggressive types of NHL were involved. This is probably due to the high metabolic activity, and possibly more extensive bone marrow involvement, of these tumors. In contrast, 18F-FDG PET detected less than a third of the cases of bone marrow involvement by more indolent histologic types of NHL. These cases might have had mostly limited involvement of the bone marrow (8).
Another challenging finding in this meta-analysis was that the sensitivity of 18F-FDG PET was significantly lower in studies using bilateral BMB compared with those using unilateral biopsy as the reference standard. BMB removes a small core of marrow and, therefore, is subject to sampling errors. The patchy nature of some lymphomas may lead to discordant findings between the 2 cores in bilateral biopsies. The reported rates of unilateral involvement in bilateral biopsies range from 10% to 50% (8,19,38), clear proof of the limitations of BMB as a proposed gold standard. Cases with bone marrow infiltration missed by unilateral biopsies might be mostly those with less extensive bone marrow infiltration. Therefore, cases detected on bilateral, but not unilateral, BMB may be less likely to be identified by 18F-FDG PET.
Some limitations of this meta-analysis should be acknowledged. First, the overall sample size was limited. However, we tried to be all-inclusive and, to our knowledge, the cumulative sample size of the meta-analysis was about 6 times larger than the largest single study published to date. We tried to retrieve additional data, but it is possible that some missing data may still exist. It is unknown whether publication bias may operate in this field against the publication of small studies with less-promising results. Second, as already acknowledged, the biopsy reference standard is not perfect for the evaluation of bone marrow involvement. However, this might lead mostly to underestimation of the diagnostic performance of 18F-FDG PET. Finally, in the vast majority of studies, the interpretation of 18F-FDG PET scans was performed by qualitative methods. The qualitative interpretation of 18F-FDG PET scans was largely based on subjective evaluation and the results were given after consensus between experts. Quantitative methods were used by only 2 studies. Future studies should focus more on quantitative indices.
CONCLUSION
Allowing for these caveats, the meta-analysis suggests that 18F-FDG PET has overall good diagnostic performance for detecting bone marrow involvement, but this may depend also on the type of lymphoma. 18F-FDG PET may complement BMB in the staging of primary or recurrent lymphoma.
Acknowledgments
We thank Drs. Ralph Naumann, Martha Hoffman, and Masayuki Sasaki for providing additional data and clarifications of their studies.
Footnotes
Received Dec. 27, 2004; revision accepted Jan. 27, 2005.
For correspondence contact: John P.A. Ioannidis, MD, Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina 45110, Greece.
E-mail: jioannid@cc.uoi.gr