Abstract
Evaluating the effectiveness of cancer therapy, both in individual patients and across populations, requires a systematic and reproducible method for assessing response to treatment. Early efforts to meet this need resulted in the creation of numerous guidelines for quantifying posttherapy changes in disease extent, both anatomic and metabolic. Over the past few years, criteria for disease response classification have been developed for specific cancer histologies. To date, the spectrum of disease broadly referred to as lymphoma is perhaps the most common for which disease response classification is used. This review article provides an overview of the existing response assessment criteria for lymphoma and highlights their respective methodologies and the evidence supporting their validity. Concerns over the technical complexity and arbitrary thresholds of many of these criteria, which have impeded the long-standing endeavor of standardizing response assessment, are also discussed.
Lymphoma comprises a heterogeneous collection of lymphoproliferative malignancies with varying clinical behaviors and response profiles. These disorders are commonly categorized as either Hodgkin lymphoma (HL) or non-Hodgkin lymphoma (NHL), with the latter group constituting most cases. HL tends to be less aggressive and carries a relatively high 5-y survival rate of 85.3% (1). In 2015, this subtype of lymphoma was diagnosed in an estimated 9,050 patients and caused 1,150 deaths in the United States (2). By comparison, NHL includes dozens of distinct conditions with varying etiologies and prognoses. Together, these conditions accounted for approximately 71,850 new cases and 19,790 deaths in the United States in 2015 (2), with a 5-y survival rate of 69.3% (1). The guidelines of the World Health Organization (WHO) subdivide NHL according to cell lineage into mature B-cell neoplasms and mature T-cell and NK-cell neoplasms (3). Diffuse large B-cell lymphoma, which falls into the first classification, represents approximately 40% of all cases of NHL, making it the most common form of the disease (4).
The nodular enlargements characteristic of lymphoma were noted in the medical literature as early as 1661 (5), but the constellation of “lymph node and spleen enlargement, cachexia and fatal termination” was first described by Thomas Hodgkin in 1832 (6). The development of modern treatments occurred over a century later, when the discovery of marked lymphoid and myeloid suppression in soldiers exposed to mustard gas during the Second World War led Louis S. Goodman and Alfred Gilman to test the effects of a related compound—nitrogen mustard—on patients with lymphoma and other hematologic diseases (7).
Even these early chemotherapeutic agents required an objective means of evaluating their in vivo effectiveness in human subjects. Initially, standardized methods for the manual measurement of tumor size before and after therapy were proposed for this purpose. But with the advent of anatomic medical imaging techniques, most notably CT, an array of novel guidelines for response assessment was developed. More recently, functional information from PET has been integrated to complement the anatomic information of CT. Currently, numerous criteria that rely on CT and PET individually, as well as a handful of criteria that combine these imaging modalities, have been reported for assessing treatment response in both solid tumors and hematologic malignancies (Supplemental Table 1; supplemental materials are available at http://jnm.snmjournals.org). Although progress has recently been made toward the standardization of response assessment, the clinical and research communities remain somewhat fragmented in their use of these various criteria. This review article outlines the available criteria and highlights their differences in an attempt to facilitate a more uniform approach to response assessment.
HISTORICAL REVIEW OF RESPONSE ASSESSMENT IN SOLID TUMORS
From the development of the first chemotherapeutic agents in the 1940s to the advent of modern imaging techniques in the 1970s, objective and systematic assessment of treatment response depended largely on physical examination (8). However, palpation as a method of assessing response was imprecise, as demonstrated by a 1976 study by Moertel and Hanley in which 16 oncologists palpated and measured 12 simulated tumor masses using “variable clinical methods” (9). The authors found that criteria defining response as 25% and 50% reductions in the perpendicular diameters of these palpated tumors resulted in false-positive interpretations in 19%–25% and 6.8%–7.8% of cases, respectively.
With the goal of achieving “the standardization of reporting results of cancer treatment,” WHO held a series of meetings between 1977 and 1979 that culminated in the publication of a handbook outlining response assessment criteria, which were subsequently widely publicized and rapidly adopted (10,11). The criteria called for bidimensional tumor measurements to be obtained before and after therapy and the product of these bidimensional measurements to be calculated and summed across several sites of disease to form a single parameter by which to assess response. The changes in these parameters over time classified patients into 1 of 4 response groups: complete response, partial response, no response, and progressive disease (Supplemental Table 2).
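The bidimensional bookkeeping described above can be made concrete with a short sketch. This is an illustration only, not a restatement of the WHO handbook: the 50% decrease for partial response and 25% increase for progressive disease are the commonly cited WHO thresholds (the latter also quoted later in this review), and all function names are hypothetical.

```python
# Illustrative sketch of the WHO bidimensional method: for each lesion, the
# product of two perpendicular diameters is computed, and the products are
# summed across disease sites into a single parameter followed over time.

def who_sum_of_products(lesions):
    """lesions: list of (longest diameter, perpendicular diameter) pairs in cm."""
    return sum(a * b for a, b in lesions)

def who_response(baseline_sum, followup_sum):
    """Classify into the 4 WHO response groups (assumed thresholds)."""
    if followup_sum == 0:
        return "complete response"
    change = (followup_sum - baseline_sum) / baseline_sum
    if change <= -0.50:          # >=50% decrease in summed products
        return "partial response"
    if change >= 0.25:           # >=25% increase in summed products
        return "progressive disease"
    return "no response"

baseline = who_sum_of_products([(3.0, 2.0), (2.0, 1.5)])   # 9.0 cm^2
followup = who_sum_of_products([(1.5, 1.0), (1.0, 1.0)])   # 2.5 cm^2
print(who_response(baseline, followup))                    # partial response
```

Note that summing products of diameters, rather than the diameters themselves, makes the WHO parameter roughly area-like, which is one reason its thresholds are not directly comparable to those of later unidimensional criteria.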
Although these guidelines made strides toward standardization of response assessment, they did not explicitly specify critical factors, including the number of masses to be measured and the minimum measurable size of a tumor (12). As a result of these ambiguities, as well as the introduction of imaging modalities such as CT, the WHO criteria eventually became the subject of reinterpretation by various research organizations and clinical groups, thus undermining the very standardization they were designed to promote.
To address the gradual divergence of response assessment, institutions such as the National Cancer Institute and the European Organization for Research and Treatment of Cancer (EORTC) began revisiting the WHO criteria throughout the 1990s with the goal of developing new guidelines that would restandardize the practice of evaluating response to therapy. In 1999, the EORTC released its own recommendations for preimaging patient preparation, image acquisition and analysis, tumor sampling, and tumor response classification (13). These were among the first guidelines to use a functional imaging modality, namely PET, as a means of assessing treatment response (Supplemental Table 3). The PET radiotracer 18F-FDG was used to measure metabolic activity and tumor aggressiveness. Moreover, 18F-FDG was shown to delineate the metabolically active tumor borders, providing insight into individual tumor biology. These metabolic classifications of treatment response laid the groundwork for similar 18F-FDG–based criteria in the years that followed.
The incorporation of PET imaging helped to address the issue of residual masses detected after therapy, which frequently comprise inflammatory, necrotic, and fibrotic tissue rather than residual disease (14–16). This phenomenon proved especially problematic for lymphoma, for which the response assessment criteria relied solely on anatomic imaging. Approximately 40% of NHL patients and 20% of HL patients continue to exhibit residual mediastinal or abdominal masses on CT after therapy (17,18). In studies that restaged such patients via laparotomy, between 80% and 95% of residual masses were shown to be nonmalignant on pathology (17,19). Moreover, the presence of residual masses on imaging was found not to be associated with time to relapse or survival (18). Therefore, by shedding light on the metabolic activity and thereby viability of these masses, PET overcame a significant limitation of CT-based response assessment for lymphoma (20).
In 2000, shortly after the EORTC devised its PET-based criteria, a collaboration between the National Cancer Institute and EORTC provided a new set of CT-based guidelines called Response Evaluation Criteria in Solid Tumors (RECIST) (21). Unlike earlier anatomic criteria (11,22), RECIST assessed tumor response on the basis of unidimensional measurements made on CT along the tumor’s longest axis, rendering the process more reproducible and applicable to the clinical setting. RECIST also defined the parameters that had been the source of disagreement between groups implementing the WHO criteria: the maximum number of lesions to be measured was set at 10, with a maximum of 5 per organ, and the minimum size of a lesion to be measured was set at 1 cm. Finally, RECIST redefined the response categories that were established in the WHO criteria (Table 1). These reformulated classifications were conservative relative to the WHO criteria, placing fewer patients in the progressive disease category (21,23,24).
Tumor Response Classifications of RECIST (2000)
However, RECIST was not without shortcomings. It was widely reported to be less suitable for particular cancers, such as mesothelioma and pediatric tumors (23,25,26). Furthermore, the arbitrary number of tumor foci to be measured and the relatively narrow definition of progressive disease were points of contention (27). It was also suggested that routine clinical implementation of RECIST would significantly increase the workload of radiologists (28).
To address these limitations, the RECIST Working Group set out to amend the criteria, publishing “RECIST 1.1” in 2009 (29). There were a handful of significant changes both to simplify and clarify the criteria and to allow for application in additional cancers and modalities. First, the maximum number of measured tumors was reduced to 5, with a maximum of 2 per organ. This amendment was based on data showing that such a reduction did not result in a significant loss of information (30). Second, the definition of progressive disease was changed to require a minimum absolute increase of 5 mm in the sum of the tumor diameters, thereby preventing changes in individual small lesions from leading to unnecessary classifications of progression. Third, specific guidelines were established for the assessment of lymph node involvement, defining nodes spanning at least 15 mm on their short axis as assessable target lesions and nodes shrinking to less than 10 mm on their short axis as normal. Finally, the criteria paved the way for the incorporation of information from functional imaging modalities such as PET.
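The RECIST 1.1 amendments above reduce to a simple classification over sums of target-lesion diameters. The sketch below is a hypothetical illustration: the 20% increase (measured from the smallest sum on study) plus the 5 mm absolute minimum for progression follow the text above, while the 30% decrease from baseline for partial response is the commonly cited RECIST 1.1 value and is an assumption here.

```python
# Illustrative sketch of RECIST 1.1 response classification from sums of
# target-lesion diameters (in mm). Target-lesion selection rules (<=5
# lesions, <=2 per organ, nodes measured on the short axis) are handled
# upstream of this function.

def recist11_response(baseline_sum_mm, nadir_sum_mm, current_sum_mm):
    # Complete response: disappearance of all target lesions (nodal
    # lesions regressing to <10 mm short axis are folded into the sum).
    if current_sum_mm == 0:
        return "complete response"
    # Progressive disease: >=20% increase over the smallest sum on study
    # AND an absolute increase of at least 5 mm (the 2009 amendment).
    increase = current_sum_mm - nadir_sum_mm
    if increase >= 0.20 * nadir_sum_mm and increase >= 5:
        return "progressive disease"
    # Partial response: >=30% decrease from the baseline sum (assumed).
    if (baseline_sum_mm - current_sum_mm) >= 0.30 * baseline_sum_mm:
        return "partial response"
    return "stable disease"
```

For example, with a baseline sum of 100 mm and a nadir of 60 mm, a follow-up sum of 73 mm is progression (an increase of 13 mm, exceeding both 20% of the nadir and the 5 mm floor), whereas 73 mm versus a baseline of 100 mm alone would not have met the older 20%-only definition's intent of guarding against small-lesion noise.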
In the same year, 2009, Wahl et al. published a paper outlining “PET Response Criteria in Solid Tumors” (PERCIST) (12). These criteria followed several earlier guidelines for response assessment that used PET, namely those proposed by the EORTC in 1999 (13), Hicks et al. in 2001 (31), and Juweid et al. in 2005 (32,33). PERCIST applies criteria similar to those of RECIST, but its incorporation of metabolic information alongside anatomic information sets it apart. The authors stated that CT alone possesses “poor predictive ability” because the residual masses detected by this modality often reflect scarring that is mistaken for active tumor. As a PET-based criterion for response assessment, PERCIST was “designed to facilitate trials of drug development, but, if sufficiently robust, could be applied to individual patients” (12).
In their report outlining PERCIST, Wahl et al. specified a host of parameters that would facilitate the standardization of PET-based response assessment once the criteria were widely adopted. Among these suggestions was a proposed maximum of 5 tumor foci of the highest 18F-FDG avidity, with up to 2 foci per organ, to be measured for comparison before and after therapy. It was also recommended that patients undergo 18F-FDG PET scans at least 10 d after an early cycle of chemotherapy to maximize the prognostic value of the scan and minimize the effect of 18F-FDG–avid inflammation caused by chemotherapy and radiation. Moreover, the authors called for the SUVs derived from a PET scan to be corrected for lean body mass (SUL) and compared with reference uptake in the liver or, if necessary, background blood pool (12). Finally, PERCIST retained the same 4 response classifications that were established in RECIST but amended their respective specifications (Table 2). Although not yet fully validated, the PERCIST criteria are increasingly used in clinical trials for assessing therapy response in cancer (34). Such data will potentially help support their more widespread clinical application.
Tumor Response Classifications of PERCIST (2009)
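The lean-body-mass correction described above can be sketched in a few lines. This is a hypothetical illustration under stated assumptions: the James equations for lean body mass and the 30%/0.8-unit decline in SULpeak for partial metabolic response are the choices commonly cited in connection with PERCIST 1.0, not quantities taken from the text above, and all names are illustrative.

```python
# Sketch of the lean-body-mass-corrected SUV (SUL) used by PERCIST.
# SUL rescales a body-weight SUV by the ratio of lean body mass to
# total body weight, reducing the bias that body fat introduces into SUV.

def lean_body_mass_kg(weight_kg, height_cm, sex):
    """James equations (assumed here); height in cm, weight in kg."""
    if sex == "male":
        return 1.10 * weight_kg - 128 * (weight_kg / height_cm) ** 2
    return 1.07 * weight_kg - 148 * (weight_kg / height_cm) ** 2

def sul(suv, weight_kg, height_cm, sex):
    """Convert a body-weight-normalized SUV to a lean-body-mass SUL."""
    return suv * lean_body_mass_kg(weight_kg, height_cm, sex) / weight_kg

def partial_metabolic_response(baseline_sul_peak, followup_sul_peak):
    """Assumed PERCIST-style rule: >=30% AND >=0.8-unit decline in SULpeak."""
    drop = baseline_sul_peak - followup_sul_peak
    return drop >= 0.30 * baseline_sul_peak and drop >= 0.8
```

Comparing the lesion SUL against reference uptake in the liver (or blood pool), as Wahl et al. recommend, then provides the measurability check before these response rules are applied.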
MODERN RESPONSE ASSESSMENT CRITERIA IN LYMPHOMA
Although the guidelines included in the WHO criteria, RECIST, and PERCIST are generalizable to a wide array of cancers, several specialized criteria have also been proposed specifically for the spectrum of hematologic malignancies. As early as the late 1980s, as guidelines began to be developed for response assessment in chronic lymphocytic leukemia (35), HL (36), and acute myelogenous leukemia (37), there were calls for similar efforts toward standardization in NHL (38). However, in the decade that followed, various organizations simply adapted existing criteria to create their own guidelines for response assessment in NHL, thereby hindering the ability to compare data across different groups.
At meetings sponsored by the National Cancer Institute in February and May 1998, an international working group comprising both American and European experts reached a consensus on response assessment criteria specifically for NHL (39). The resulting International Workshop Criteria (IWC) defined anatomic parameters, obtained by clinical or radiologic examination, that could be used to group patients into the traditional classifications of complete response, partial response, stable disease, and progressive disease, as well as a new classification of “unconfirmed complete response” (Table 3). To support these anatomically based criteria, the IWC defined the upper limit for the size of a normal lymph node as 1 cm along its short axis on the basis of several prior studies (40–42). In the years after their publication, the IWC were also adopted for HL (43).
Tumor Response Classifications of International Workshop Criteria (1999)
In 2005, Juweid et al. integrated the originally CT-based IWC with 18F-FDG PET to create the IWC+PET criteria, which were initially designed and validated for NHL (32) but were subsequently validated for HL as well (44). Citing the prevalence of posttherapy residual masses and the unique ability of PET to accurately predict tumor viability in these masses, the investigators sought to establish a standardized approach that would join the anatomic information of CT with the functional information of PET. The IWC+PET criteria retained the classifications of the original IWC criteria but amended the guidelines to incorporate PET findings (Supplemental Table 4). Juweid et al. found that IWC+PET was a better predictor of progression-free survival than IWC in NHL.
Two publications in 2007, one authored by Cheson et al. (45) and the other by Juweid et al. (33), amended the existing IWC+PET criteria and, as part of the International Harmonization Project, made recommendations for their clinical use in both HL and NHL. To avoid false-positive results on PET as a result of therapy-induced inflammation, which can persist for as long as 2 wk after chemotherapy and 3 mo after radiation therapy, both reports recommended that PET acquisition occur at least 3 wk, and preferably 6–8 wk, after chemotherapy and 8–12 wk after radiation therapy. Cheson et al. also addressed the possibility of false-positive PET findings due to “rebound thymic hyperplasia, infection, inflammation, sarcoidosis, and brown fat,” as well as “[spatial] resolution…technique, and variability of 18F-FDG avidity among histologic subtypes” (45). For evaluating the tumor viability of residual masses larger than 2 cm in their greatest transverse diameters, mediastinal blood-pool activity was recommended as a reference; for residual masses smaller than 2 cm, surrounding background activity was recommended instead. Residual hepatic and splenic lesions larger than 1.5 cm detected on CT were deemed positive if their metabolic activity was higher than that of the liver and spleen, respectively. These amendments permitted the elimination of the unconfirmed complete response category of tumor response, returning the classification scheme to the classic tetrad of complete response, partial response, stable disease, and progressive disease (Table 4).
Tumor Response Classifications of International Harmonization Project (2007)
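The reference-activity recommendations of the International Harmonization Project reduce to a short decision procedure, sketched below directly from the rules stated above. Function and parameter names are illustrative, and uptake values stand in for whatever visual or quantitative comparison the reader performs.

```python
# Sketch of the International Harmonization Project reference-activity
# rules for calling residual disease positive on posttherapy PET.

def residual_mass_pet_positive(mass_diameter_cm, mass_uptake,
                               mediastinal_uptake, background_uptake):
    """Masses >2 cm are read against mediastinal blood-pool activity;
    smaller masses against surrounding background activity."""
    reference = mediastinal_uptake if mass_diameter_cm > 2.0 else background_uptake
    return mass_uptake > reference

def hepatic_splenic_lesion_positive(lesion_diameter_cm, lesion_uptake,
                                    organ_uptake):
    """Residual liver/spleen lesions >1.5 cm on CT are positive when their
    metabolic activity exceeds that of the surrounding organ."""
    return lesion_diameter_cm > 1.5 and lesion_uptake > organ_uptake
```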
An international workshop that first met in Deauville, France, in 2009 conceived of novel criteria for both HL and NHL that signaled a significant change on multiple fronts (46–49). In contrast to the predominantly quantitative guidelines proposed previously, the Deauville 5-point scoring system (D5PS) assessed treatment response qualitatively—specifically, in the form of a 5-point scale that graded the intensity of 18F-FDG uptake relative to the reference activity of the mediastinal blood pool and liver (Table 5) (50). The technical simplicity of this classification system facilitated its widespread clinical adoption. Moreover, the D5PS became a standard-bearer for the rising trend of interim response assessment, which enabled improved determinations of prognosis and earlier treatment modifications during the course of therapy.
Tumor Response Classifications of D5PS (2009)
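One way to operationalize the visual 5-point grading described above is sketched below. The D5PS is defined as a visual comparison against mediastinal blood-pool and liver activity; treating "markedly above liver" (score 5) as more than roughly twice liver uptake is an assumption of this sketch, as is the handling of new lesions.

```python
# Hypothetical numeric operationalization of the Deauville 5-point scale:
# residual lesion uptake is graded against mediastinal blood-pool and
# liver reference activity.

def deauville_score(lesion_uptake, mediastinum_uptake, liver_uptake,
                    new_lesions=False):
    if new_lesions:
        return 5                       # new disease sites (assumed handling)
    if lesion_uptake <= 0:
        return 1                       # no residual uptake
    if lesion_uptake <= mediastinum_uptake:
        return 2                       # uptake <= mediastinal blood pool
    if lesion_uptake <= liver_uptake:
        return 3                       # uptake above mediastinum, <= liver
    if lesion_uptake <= 2 * liver_uptake:
        return 4                       # moderately above liver (assumed cutoff)
    return 5                           # markedly above liver
```

The appeal of this scheme is apparent even from the sketch: it requires only two patient-internal reference tissues and no scanner-calibrated absolute quantification, which is what made the D5PS so readily adoptable in the clinic.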
The D5PS has since been modified by a comprehensive set of recommendations developed at the 11th International Conference on Malignant Lymphomas in 2011 and presented at the Fourth International Workshop on PET in Lymphoma, held in Menton, France, in 2012, and at the 12th International Conference on Malignant Lymphomas, convened in Lugano, Switzerland, in 2013 (51,52). The consensus revision of both the staging criteria and the 2007 IWG response criteria led to the development of the Lugano classification, in which separate sets of response criteria were proposed for PET and CT imaging, although the former is generally preferred for 18F-FDG–avid lymphomas. The PET-based criteria built on the 5-point categoric scale established by the D5PS by adding considerations for new or recurring involvement of lymph nodes and bone marrow as well as organomegaly (Table 6) (53). Stand-alone CT-based guidelines were also included, despite the known limitations of anatomic response assessment in 18F-FDG–avid lymphoma, for use when PET/CT imaging is unavailable or when lymphomas have low or variable 18F-FDG avidity.
Tumor Response Classifications of Lugano Criteria (2014)
METHODOLOGIC COMPARISON OF EXISTING RESPONSE ASSESSMENT CRITERIA
The various therapy response criteria discussed in this review apply varying approaches to the use of imaging modalities. The RECIST and 1999 IWC criteria primarily use CT; EORTC and PERCIST rely on PET; and the 2007 IWC and Lugano classifications make use of both modalities, with the former using the International Harmonization Project criteria and the latter the D5PS criteria for PET interpretation. The assorted definitions of the response classifications across these criteria are shown in Table 7, which presents a simplified and standardized scheme comprising 4 groupings: complete response, partial response, stable disease, and progressive disease. Although there are identifiable trends across criteria, even those using the same modality demonstrate considerable variability in their thresholds for each response classification. For example, progressive disease is defined as a tumor size increase of at least 20% by RECIST, at least 25% by the WHO criteria, and at least 50% by IWC.
Comparison of Simplified Classifications of Various Response Criteria
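The divergent progressive-disease thresholds quoted above can be illustrated with a worked example. Note that the underlying metrics differ (RECIST sums diameters, whereas the WHO criteria sum bidimensional products), so this sketch compares only the percentage cutoffs applied to a common measured change.

```python
# Worked example of the divergent progressive-disease size thresholds:
# RECIST >=20%, WHO >=25%, IWC >=50%. The same measured 22% increase in
# tumor burden is called progression under RECIST but not WHO or IWC.

baseline, followup = 100.0, 122.0
increase = (followup - baseline) / baseline          # 0.22

calls = {name: increase >= cutoff
         for name, cutoff in [("RECIST", 0.20), ("WHO", 0.25), ("IWC", 0.50)]}
print(calls)  # {'RECIST': True, 'WHO': False, 'IWC': False}
```

Such discordance is precisely why response rates reported under different criteria cannot be compared directly across trials.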
In recent years, the relative simplicity of the D5PS and the associated Lugano classification has distinguished them from their quantitative predecessors, whose technical demands and complexity often precluded their widespread clinical use. However, questions about the reproducibility of the simplified qualitative criteria remain. The literature includes several comparisons between the D5PS and other guidelines using functional imaging for response assessment. A 2010 study of diffuse large B-cell lymphoma patients by Horning et al. compared interobserver agreement between the D5PS and the International Harmonization Project–based Eastern Cooperative Oncology Group criteria, reporting κ-values of 0.502 and 0.445, respectively (54); however, this study was limited by its small population. Another study of diffuse large B-cell lymphoma, by Itti et al., found lower interobserver agreement with the D5PS (κ = 0.66) than with a semiquantitative counterpart based on SUVmax (κ = 0.83) (55). In larger standardized studies of HL that used the D5PS, Barrington et al., Furth et al., and Gallamini et al. reported κ-values of 0.79–0.85, 0.748, and 0.69–0.84, respectively, indicating substantial to excellent interobserver agreement (56–58). The implications of these findings for the reproducibility and clinical applicability of the Lugano classification have yet to be determined in prospective studies with large datasets.
FUTURE TRENDS
The recent advent and adoption of the D5PS and Lugano classification have marked a step toward standardization of interpretation and brought a relatively more objective system. However, this qualitative system of response assessment should also be tested against quantitative criteria to determine their relative effectiveness in patient management. The tradeoff between simplicity and accuracy was examined in an earlier study by Lin et al., who compared the prognostic ability of qualitative and quantitative PET analysis in patients with diffuse large B-cell lymphoma (59). Visual analysis predicted event-free survival with an accuracy of 65.2%, whereas SUV-based analysis did so with an accuracy of 76.1%. A reduction of 65.7% in the SUVmax of an interim PET scan was found to be the optimal cutoff for differentiating favorable from unfavorable responses to therapy. These results suggest that, if optimized for clinical use, standardized quantitative PET criteria may be more adept at assessing tumor response. However, various study biases and the suboptimal technical methodology inherent in the retrospective design of prior studies make it difficult to draw a firm conclusion. Moreover, multiple factors, including variability in instrumentation, scanner calibration, and human biology, complicate obtaining reliable PET measurements across medical centers (34). Thus, the most reproducible and accurate method for PET quantification remains to be determined. As techniques for automated segmentation and quantification continue to improve, they will likely become easier to implement in the clinical setting and facilitate quantitative response assessment.
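The interim-PET cutoff reported by Lin et al. amounts to a percent change in SUVmax, sketched below. The 65.7% figure is taken from the study cited above; the function names are illustrative only.

```python
# Sketch of an interim-PET dichotomization by percent reduction in SUVmax,
# using the 65.7% optimal cutoff reported by Lin et al. for diffuse large
# B-cell lymphoma.

def suvmax_reduction_pct(baseline_suvmax, interim_suvmax):
    """Percent reduction in SUVmax between baseline and interim scans."""
    return 100.0 * (baseline_suvmax - interim_suvmax) / baseline_suvmax

def favorable_interim_response(baseline_suvmax, interim_suvmax, cutoff=65.7):
    """True when the reduction meets or exceeds the study's optimal cutoff."""
    return suvmax_reduction_pct(baseline_suvmax, interim_suvmax) >= cutoff

print(favorable_interim_response(20.0, 5.0))   # a 75% reduction -> True
```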
CONCLUSION
Over the past 6 decades, the techniques used for evaluating the efficacy of cancer therapies have steadily increased both in precision and in intricacy, moving from crude manual measurement toward more complex structural and functional data acquisition, with many more advanced techniques such as heterogeneity measures, parametric mapping, and kinetic acquisitions on the way. The integration of CT and PET in particular greatly enhanced the ability to assess disease progression, adjust therapeutic regimens, and form an accurate prognosis. However, the vast array of interpretative guidelines that were introduced, each with its own protocols and thresholds, created stifling methodologic variability and sparked calls for “harmonization” that have rung out since the early days of response assessment and continue to echo to this day.
With respect to lymphoma in particular, recent PET/CT-based criteria have made significant strides toward standardization, and their simplified qualitative guidelines have remedied the technical complexity and time intensity that impeded the clinical application of prior quantitative criteria. However, their suitability for certain scenarios, such as in patients with lymphomas of low or variable 18F-FDG avidity or in those receiving immunochemotherapy or biologic therapy, remains to be determined. Moreover, as technologic advances ease the use of quantitative criteria, continued efforts to maintain harmonization in response assessment will likely be necessary to avoid renewed fragmentation.
Footnotes
Published online Apr. 28, 2016.
Learning Objectives: On successful completion of this activity, participants should be able to describe (1) the historical background to the development of response criteria; (2) the general response criteria for solid tumors and their key features; and (3) the response criteria specifically for lymphoma and their key differences.
Financial Disclosure: This work is sponsored by the PET Center of Excellence of the Society of Nuclear Medicine and Molecular Imaging. The authors of this article have indicated no other relevant relationships that could be perceived as a real or apparent conflict of interest.
CME Credit: SNMMI is accredited by the Accreditation Council for Continuing Medical Education (ACCME) to sponsor continuing education for physicians. SNMMI designates each JNM continuing education article for a maximum of 2.0 AMA PRA Category 1 Credits. Physicians should claim only credit commensurate with the extent of their participation in the activity. For CE credit, SAM, and other credit types, participants can access this activity through the SNMMI website (http://www.snmmilearningcenter.org) through June 2019.
- © 2016 by the Society of Nuclear Medicine and Molecular Imaging, Inc.
REFERENCES
- Received for publication February 19, 2016.
- Accepted for publication April 22, 2016.