Introduction

The incidence of oesophageal cancer is increasing and, despite advances in therapy, the diagnosis still carries a poor prognosis [1]. As with all tumours, the outcome for patients is strongly associated with the stage at initial diagnosis. The TNM (tumour, node, metastasis) system currently in use for the staging of oesophageal cancer does not take into account non-anatomical factors such as histopathological type, grade or various biomarkers that may be important determinants of prognosis. The pathological stage is given by surgery but this procedure is not possible for all patients because it is associated with a high risk of mortality and morbidity. Therefore a noninvasive staging method would be of great interest, and within this context the primary role of 18F-fluorodeoxyglucose (FDG) positron emission tomography (PET) in oesophageal cancer is the detection of distant metastases [2–4]. This modality is also gaining acceptance in oesophageal cancer for the assessment of therapy response [5, 6] or for radiotherapy treatment planning [7–9]. Lately, some authors have also suggested that different parameters derived from initial 18F-FDG PET images could have a role as independent prognostic factors [10–14]. The parameters studied include standardized uptake value (SUVmax as the maximum uptake in the primary tumour or in the combined primary and regional area), tumour functional longitudinal length (TL), tumour functional volume (TV), nodal uptake or FDG-avid metastases [10–14]. Although a few studies have demonstrated the interest of these indices for determining prognosis, there are conflicting results concerning the independent prognostic value of each of these indices. For example, Hyun et al. [12], analysing results from 151 patients with principally squamous cell carcinoma (SCC), have recently suggested that primary tumour SUVmax is not an independent prognostic factor, in agreement with other studies [10, 15, 16]. On the other hand, Kato et al. [17], based on the analysis of 184 patients with oesophageal SCC, have shown that SUVmax of the primary tumour is an independent prognostic factor for overall survival, in agreement with other studies [18–20]. These conflicting results may be caused by differences in the methodology used for the analysis of the PET images. Although SUVmax is less sensitive to tumour size, the conflicting results concerning its value as an independent prognostic factor may also be due to variability in the tumour sizes considered in the different studies.

Pathological TL has been shown to be an independent prognostic factor in oesophageal carcinoma [21]. Therefore, determining the functional TL in 18F-FDG PET images may be a good surrogate. The use of different thresholds for the determination of the functional TL in the existing studies may be responsible for the conflicting results regarding its value as a predictor of response to chemoradiotherapy [11, 22], while it has been shown to be an independent predictor in patients undergoing surgery [10]. On the other hand, functional TV may be more representative of overall tumour burden. The value of the functional TV has only recently been explored in a single study of patients with oesophageal carcinoma considering a heterogeneous treatment regime (76 and 24% treated by surgery and combined radiochemotherapy, respectively) [12]. In this study both the presence of metastases and the TV were found to be independent prognostic factors for overall patient survival. Tumours were delineated using a scale of three fixed thresholds chosen according to the tumour SUVmax. Although such an approach may be simple to implement in clinical practice, the use of a fixed threshold for functional TV determination suffers from multiple shortcomings which have been highlighted in different studies [23, 24]. In addition, the proposed scale is not universally applicable to the different clinical settings, which vary in acquisition protocols, scanning systems and associated image quality.

Therefore, despite early evidence that functional TL and TV may be useful parameters in predicting survival and response to therapy, there is a clear need to assess the influence of the methodology used in obtaining these indices. Finally, the determination of functional TV may allow the calculation of alternative image-derived indices such as the total lesion glycolysis (TLG) index (g), defined as the product of the TV (ml) and its associated mean activity (SUVmean) (g/ml) [25], whose value has not as yet been explored in predicting response to therapy or as a prognostic factor for survival using 18F-FDG in oesophageal carcinoma.

The objective of this study was therefore to retrospectively investigate the prognostic value of 18F-FDG PET in 45 oesophageal cancer patients treated with concomitant radiochemotherapy, considering for the first time in a single study all of the commonly used PET-derived parameters such as functional TL, TV, SUV measurements (max, peak, mean) and TLG. In addition, the impact of different tumour delineation strategies was assessed.

Materials and methods

Patients

A total of 45 patients with newly diagnosed oesophageal cancer treated between 2004 and 2008 with concomitant radiochemotherapy and without surgery were included in this study. The characteristics of the patients are given in Table 1. Of the patients, 41 were male (91%), and the mean age at the time of diagnosis was 66 ± 10 years. Most of the tumours were SCC (73%) and originated from the middle and lower oesophagus (76%). All patients were referred before treatment for an 18F-FDG PET study as part of a routine procedure for the initial staging in oesophageal cancer. The treatment included three courses of 5-fluorouracil/cisplatin and a median radiation dose of 60 Gy given in 180-cGy fractions delivered once daily, 5 days a week for 6–7 weeks. Follow-up data were prospectively collected in a database for further analysis and overall survival was calculated. The current analysis was carried out after approval by the Institutional Ethics Review Board.

Table 1 Patient demographic and clinical characteristics

18F-FDG PET acquisitions

18F-FDG PET studies were carried out prior to the treatment. Patients were instructed to fast for a minimum of 6 h before the injection of 18F-FDG. The administered dose was 5 MBq/kg, and static emission images were acquired (2 min per bed position) from head to thigh beginning 60 min after injection on a Philips GEMINI PET/CT system (Philips Medical Systems, Cleveland, OH, USA). In addition to the emission PET scan, a low-dose CT scan without IV or oral contrast was acquired for attenuation correction. Images were reconstructed with the 3-D row action maximum likelihood algorithm (RAMLA) using standard clinical protocol parameters: 2 iterations, relaxation parameter of 0.05, 5-mm 3-D Gaussian post-filtering and 4×4×4 mm3 voxels grid sampling.

PET image analysis

The following parameters were extracted from each PET image: SUVmax, SUVpeak defined as the mean of SUVmax and its 26 neighbours, mean SUV within the delineated tumour (SUVmean), functional TL in the longitudinal direction, functional TV and TLG. SUVpeak is usually defined over a 1-cm circle or sphere [26]; we used a 3×3×3 cube of 4×4×4 mm3 voxels, which roughly corresponds to the same region of interest (ROI). SUVpeak was considered in order to investigate the impact of reducing the potential bias in the SUVmax measurements that results from their sensitivity to noise.
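
The SUVpeak definition above (mean of the hottest voxel and its 26 neighbours) can be sketched as follows. This is a minimal illustration in plain Python, not the software used in the study: the function name `suv_max_and_peak`, the nested-list layout `volume[z][y][x]` and the clipping of the cube at the image border are assumptions, and the test volume is synthetic.

```python
def suv_max_and_peak(volume):
    """Return (SUVmax, SUVpeak) for a 3-D nested list volume[z][y][x] of SUVs.

    SUVpeak is taken as the mean over the 3x3x3 cube centred on the hottest
    voxel. Clipping the cube at the image border is an assumption.
    """
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])

    # Locate the hottest voxel (first occurrence if there are ties).
    best = None
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                value = volume[z][y][x]
                if best is None or value > best[0]:
                    best = (value, z, y, x)
    suv_max, zc, yc, xc = best

    # Average the hottest voxel together with its (up to) 26 neighbours.
    cube = [
        volume[z][y][x]
        for z in range(max(zc - 1, 0), min(zc + 2, nz))
        for y in range(max(yc - 1, 0), min(yc + 2, ny))
        for x in range(max(xc - 1, 0), min(xc + 2, nx))
    ]
    return suv_max, sum(cube) / len(cube)


# Synthetic 5x5x5 volume: uniform uptake of 2.0 with one hot voxel of 10.0.
volume = [[[2.0] * 5 for _ in range(5)] for _ in range(5)]
volume[2][2][2] = 10.0
suv_max, suv_peak = suv_max_and_peak(volume)
print(suv_max, round(suv_peak, 2))
```

In this synthetic case a single hot voxel drives SUVmax to 10 while SUVpeak stays close to the surrounding uptake, which is the noise-damping behaviour the text attributes to SUVpeak.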

Whereas SUVmax and SUVpeak are independent of the tumour delineation strategy used, TL, TV, SUVmean and the derived TLG were determined on delineations performed using two strategies. First, an adaptive threshold [23] was applied using a background ROI manually chosen by each of two experienced nuclear medicine physicians, leading to two different results, T 1bckgrd and T 2bckgrd. Observers were instructed to choose the ROI in the mediastinum at a sufficient distance from the lesion to avoid any overlapping. However, they were allowed to choose the size, shape and exact placement of the ROI. Second, the automatic fuzzy locally adaptive Bayesian (FLAB) algorithm [24, 27] was considered.
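
Once a delineation is available as a binary mask, the delineation-dependent indices follow directly from it. The sketch below is an illustration under stated assumptions rather than the study's implementation: the function name `delineation_indices`, the nested-list layout with z as the longitudinal axis, and the definition of TL as the span of axial slices containing tumour voxels are all hypothetical. The 4-mm isotropic voxel size and the definition of TLG as TV × SUVmean are taken from the text.

```python
VOXEL_MM = 4.0                        # isotropic voxel size reported in the text
VOXEL_ML = VOXEL_MM ** 3 / 1000.0     # 64 mm3 = 0.064 ml per voxel


def delineation_indices(volume, mask):
    """Compute TV (ml), SUVmean, TL (mm) and TLG (g) from a binary mask.

    volume[z][y][x] holds SUVs and mask[z][y][x] holds booleans. Treating z
    as the longitudinal axis and TL as the slice span is an assumption.
    """
    inside = []
    slices = set()
    for z, (vol_slice, mask_slice) in enumerate(zip(volume, mask)):
        for vol_row, mask_row in zip(vol_slice, mask_slice):
            for suv, selected in zip(vol_row, mask_row):
                if selected:
                    inside.append(suv)
                    slices.add(z)
    if not inside:
        raise ValueError("empty delineation")

    tv_ml = len(inside) * VOXEL_ML
    suv_mean = sum(inside) / len(inside)
    tl_mm = (max(slices) - min(slices) + 1) * VOXEL_MM
    return {
        "TV_ml": tv_ml,
        "SUVmean": suv_mean,
        "TL_mm": tl_mm,
        "TLG_g": tv_ml * suv_mean,    # TLG = TV x SUVmean
    }


# Synthetic example: 4 slices of 2x2 voxels, tumour occupying slices 1 and 2.
volume = [[[5.0, 5.0], [5.0, 5.0]] for _ in range(4)]
mask = [[[z in (1, 2)] * 2 for _ in range(2)] for z in range(4)]
indices = delineation_indices(volume, mask)
print(indices)
```

Because every index except SUVmax and SUVpeak is derived from the mask, any change in the delineation propagates to TV, TL, SUVmean and TLG at once.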

Statistical analysis

All quantitative data were expressed as mean ± 1 standard deviation (SD) and summary statistics are given in Table 2.

Table 2 Parameters definition and statistics

The correlation between all parameters extracted using the different methodologies was computed with Pearson coefficients. The differences between methodologies were assessed using Bland-Altman analysis [28] to define bias as the mean error, and agreement intervals (upper and lower limits) as the bias ± 1.96 times the SD of the differences. Kaplan-Meier methods were used to estimate the survival distributions [29]. Overall survival was calculated from the date of initial diagnosis to the date of death or most recent follow-up in cases of patients still alive. Survival curves were generated for each parameter considered. The most discriminating threshold value allowing differentiation of the two groups of patients was selected using receiver-operating characteristic (ROC) methodology [30]. The prognostic value of each parameter in terms of overall survival was assessed by the log-rank test. The significance of the following factors was tested: age, gender, histology type, T, N, M classifications, American Joint Committee on Cancer (AJCC) stage [31], TL, TV, SUVmax, SUVpeak, SUVmean and TLG. Independent prognostic factors for overall survival were determined using multivariate Cox regression analysis [32] by incorporating in the model all parameters that were deemed significant in the univariate analysis. However, the indices obtained by each delineation (T 1bckgrd, T 2bckgrd and FLAB) were incorporated in the multivariate analysis separately since they were found to be highly correlated (Pearson r > 0.8, r² > 0.66; see the “Correlation between image-derived indices and between methodologies” section). All tests were carried out using MedCalc™ (MedCalc Software, Mariakerke, Belgium); p values <0.05 were considered statistically significant.
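
The Bland-Altman quantities described above (bias as the mean difference, limits of agreement at 1.96 SD on either side of it) can be sketched as follows. The study used MedCalc; this plain-Python version is only an illustration. Expressing each difference as a percentage of the mean of the two measurements is an assumption, since the text does not state the normalisation, and the function name and the example volumes are hypothetical.

```python
from statistics import mean, stdev


def bland_altman_percent(method_a, method_b):
    """Return (bias, lower limit, upper limit) of the percent differences.

    Each difference is expressed relative to the mean of the two paired
    measurements (an assumption; the normalisation is not given in the text).
    """
    diffs = [
        100.0 * (a - b) / ((a + b) / 2.0)
        for a, b in zip(method_a, method_b)
    ]
    bias = mean(diffs)
    sd = stdev(diffs)                 # sample SD of the paired differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical tumour volumes (cm3) from two delineations of the same lesions.
volumes_a = [12.0, 8.0]
volumes_b = [8.0, 12.0]
bias, lower, upper = bland_altman_percent(volumes_a, volumes_b)
print(round(bias, 1), round(lower, 1), round(upper, 1))
```

The two hypothetical measurements disagree by 40% in opposite directions, so the bias is zero while the limits of agreement are wide. This is the same pattern as in the results, where high correlation coexists with large agreement intervals.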

Results

All primary lesions were detected by 18F-FDG PET. The intensity of maximum 18F-FDG uptake in the primary lesion was quite high with a normally distributed SUVmax of 10 ± 3.8. As expected, SUVpeak measurements were comparatively lower (8 ± 3). Measurements related to the dimensions of the tumours were less uniformly distributed than SUV measurements, with a larger SD with respect to the mean. For example the TV (FLAB) was 35 ± 33 cm3.

Correlation between image-derived indices and between methodologies

TL measurements were correlated with TV (p < 0.0001) although with moderate coefficients (r = 0.69, 0.58 and 0.6 for FLAB, T 1bckgrd and T 2bckgrd , respectively). No significant correlation was found between any SUV measurement (SUVmax, SUVpeak, SUVmean) and TV (p > 0.2, r < 0.3), irrespective of the delineation strategy, in line with results of other studies such as van Heijl et al. [33].

All SUVmean measurements derived from TV delineation performed using the two different methodologies considered were highly correlated (p < 0.0001) with coefficients >0.97. TV (r > 0.82), TL (r > 0.91) and TLG (r > 0.95) results were also highly correlated (p < 0.0001) for both methodologies.

Despite high correlation coefficients, large differences were observed for several patients between measurements using the two delineation methodologies considered, and between the two users of the same adaptive thresholding. Figure 1a, b illustrates such differences. In the case of adaptive thresholding these differences were the result of the two users placing the background ROI differently.

Fig. 1

Illustration of differences in tumour delineation depending on the methodology for (a) a small (<8 cm3) and low contrast (approximately 2:1) tumour and (b) a larger (30 cm3) and higher contrast (approximately 7:1) tumour

A summary of the Bland-Altman analysis carried out to compare the delineation methods and highlight potential differences is presented in Fig. 2a, b and complete results are given in Table 3. The largest differences between methodologies were observed for TV with a bias of up to 50% between the adaptive thresholding and FLAB: both users yielded globally smaller volumes (bias of −50 ± 50% and −21 ± 54% for T 1bckgrd and T 2bckgrd, respectively). Agreement intervals (upper and lower limits) were large for all parameters and for all comparisons, up to +80 and −140% (see Fig. 2b). Even between the two users of the same adaptive thresholding method (see Fig. 2a), mean differences of −30 ± 35% were seen and limits of agreement were large, from −100 to +45%. No significant trend was found regarding the correlation between TV and differences between methodologies (r < 0.2, p > 0.1).

Fig. 2

Bland-Altman analysis of differences between (a) T 1bckgrd and T 2bckgrd and (b) Tbckgrd and FLAB, for each parameter (TL, TV, SUVmean, TLG). Grey columns and error bars represent the mean differences (bias) and associated SD, respectively. Bold arrows up and down denote upper and lower limits, respectively; 95% confidence intervals for each are given in Table 3

Table 3 Bland-Altman analysis results comparing delineation strategies for all parameters

Better agreement was observed for TL and SUVmean; however, intervals of agreement were large (−50 to −25% lower limit and +20 to +40% upper limit for TL; −80 to −10% lower limit and +10 to +80% upper limit for SUVmean).

Due to the combined effect of TV and SUVmean, TLG differences were in between, with moderate bias but still large agreement intervals (lower and upper limits of −120 to −75% and +40 to +90%, respectively).

Survival analysis

At the time of last follow-up, 10 patients were alive with no evidence of disease, 9 were alive with recurrent oesophageal cancer and 26 had died from the disease. With a median follow-up of 60 months (range 9–82), the overall median survival was 15 months. The 1-year and 2-year survival rates were 63 and 34%, respectively.
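
Median survival and the 1-year and 2-year rates quoted above are read from a Kaplan-Meier curve. The following is a minimal sketch of that estimator in plain Python, not the MedCalc computation used in the study. The function names are hypothetical and the six follow-up times are invented for illustration; they are not patient data from this work.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] steps at each distinct death time.

    times: follow-up in months; events: 1 for death, 0 for censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving            # deaths and censored both leave the risk set
    return curve


def survival_at(curve, t):
    """Estimated survival probability at time t (step function)."""
    prob = 1.0
    for step_time, step_prob in curve:
        if step_time <= t:
            prob = step_prob
    return prob


def median_survival(curve):
    """First time at which estimated survival drops to 50% or below."""
    for step_time, step_prob in curve:
        if step_prob <= 0.5:
            return step_time
    return None                       # median not reached


# Invented follow-up data (months); 0 marks a censored patient.
times = [5, 8, 12, 15, 20, 24]
events = [1, 1, 0, 1, 1, 0]
curve = kaplan_meier(times, events)
print(median_survival(curve), round(survival_at(curve, 12), 2))
```

Censored patients contribute to the number at risk until their last follow-up but never cause a step down, which is why patients still alive at the last visit can be included without biasing the estimate.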

The results of the log-rank analysis of significant parameters for overall survival in univariate analysis are given in Table 4. Table 5 summarizes the prognostic value of all the parameters under investigation in this study.
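
For readers unfamiliar with the log-rank test used in the univariate analysis, the sketch below shows the standard two-group form of the statistic in plain Python. It is an illustration only, not the MedCalc routine used by the study: the function name is hypothetical, the p value relies on the chi-square distribution with one degree of freedom, and the two groups of follow-up times are invented.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square, p value) with 1 df.

    times_*: lists of follow-up times; events_*: 1 for death, 0 for censored.
    """
    all_times = times_a + times_b
    all_events = events_a + events_b
    death_times = sorted({t for t, e in zip(all_times, all_events) if e})

    observed_a = expected_a = variance = 0.0
    for t in death_times:
        n_a = sum(1 for x in times_a if x >= t)      # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)      # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the deaths in group A at time t.
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        raise ValueError("log-rank statistic is undefined for these data")
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Invented example: a short-survival group versus a long-survival group.
chi2, p_value = logrank_test([2, 3, 4, 5], [1, 1, 1, 1],
                             [20, 25, 30, 35], [1, 1, 1, 1])
print(round(chi2, 2), p_value < 0.05)
```

In the study the two groups were formed by splitting patients at a cutoff of the parameter under test (for example a TV threshold), and the cutoff itself was selected with ROC methodology.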

Table 4 Parameters with significant prognostic value after univariate analysis
Table 5 Prognostic value of all parameters

Age, gender and histology types were not significant prognostic factors in the univariate analysis. Neither were T and N classifications. In the univariate analysis, the presence of metastases [median survival of 26 months (M0) versus 12 months (M1), p = 0.01] and the clinical AJCC stage (p < 0.001) were significant prognostic factors.

Although a trend was observed, neither SUVmax nor SUVpeak was a significant prognostic factor. An SUVmax <5 or <8 tended to be associated with a better outcome, with a median survival of 14 vs 7 months (p = 0.08) or 21 vs 13 months (p = 0.1), respectively (see Fig. 3a).

Fig. 3

Kaplan-Meier survival curves obtained using SUVmax (a), TV measured by FLAB (b) and T 1bckgrd (c), and defining three groups using TV measured by FLAB (d)

Mean SUVs in the tumour were not significant prognostic factors in univariate analysis. There was however a trend for shorter survival associated with higher SUVmean. For example, the median survival was reduced by almost 40% for patients with an SUVmean higher than 5 (13 months vs 21 months, p = 0.06). This was however observed only when the FLAB methodology was used to define TV, while no similar trend was observed with SUVmean parameters obtained with adaptive thresholding.

Functional TV was a significant prognostic factor for overall survival, whatever methodology was used (p < 0.001 using FLAB and p = 0.004 for both T 1bckgrd and T 2bckgrd, see Fig. 3b, c). In addition, TV allowed us to separate our population into three groups with significantly different outcomes, independently of the delineation approach used (p = 0.002, p = 0.02 and p = 0.004 for FLAB, T 1bckgrd and T 2bckgrd, respectively). For instance, volumes defined by FLAB of less than 14 cm3, between 14 and 85 cm3 or greater than 85 cm3 were respectively associated with a median survival of 49 (19 patients), 15 (21 patients) and 5.5 (6 patients) months as illustrated in Fig. 3d. In Fig. 4a–c three examples of 18F-FDG PET baseline images of patients belonging to each of these three groups are presented.

Fig. 4

18F-FDG PET images (axial, coronal and sagittal views from top to bottom) of patients with (a) a small tumour (11 cm3, 54 months survival), (b) a medium-size tumour (22 cm3, 18 months survival) and (c) a larger tumour (92 cm3, 5 months survival)

Functional TL was also a significant prognostic factor with results similar to TV (p = 0.01, p = 0.02 and p = 0.04 for FLAB, T 1bckgrd and T 2bckgrd, respectively), except that it was not possible to significantly differentiate three groups of patients with different outcomes, whatever the delineation strategy.

Similarly, TLG was also a significant prognostic factor whatever methodology was used, while it was equally not possible to significantly differentiate three groups. The median overall survival was 10 months for patients with a TLG (FLAB) >180 g and increased to 21 months for patients with a TLG < 180 g (p = 0.01). Similar results were obtained with adaptive thresholding (20 versus 8 and 20 versus 10 months for T 1bckgrd and T 2bckgrd , respectively).

After multivariate analysis, considering each delineation methodology separately, only TV obtained using FLAB and AJCC stage were found to be independent significant prognostic factors (p = 0.0017 and 0.0021 for TV and AJCC, respectively), whereas only AJCC stage was an independent significant prognostic factor (p < 0.002) when considering TV obtained by adaptive thresholding.

Similar results were obtained when replacing TV by TL, with both TL and clinical AJCC staging found to be independent significant prognostic factors in the case of FLAB (p = 0.017 and p = 0.042 for AJCC stage and TL, respectively), whereas in the case of adaptive thresholding only AJCC staging was an independent significant prognostic factor (p = 0.0021).

On the other hand, in the case of TLG only the AJCC staging was an independent significant prognostic factor (p < 0.002), whatever delineation strategy was considered.

Discussion

Accurate staging in oesophageal cancer is essential for guiding therapy. The standard conventional modalities are endoscopic ultrasonography and computed tomography, although this combined approach suffers from several shortcomings. 18F-FDG PET is increasingly included in the initial staging because it allows a more accurate disease assessment, especially regarding the detection of distant metastases [2–4]. Since no patient underwent surgery in our study, anatomopathology data were not available. Therefore TNM classifications and AJCC stages were determined using suboptimal conventional staging and this could explain the poor prognostic value of T or N classification in our population.

As found in our study, 18F-FDG uptake is always present in oesophageal cancer if extended at least to submucosa [34]. Some authors suggested that the intensity of 18F-FDG uptake could be related to prognosis in oesophageal cancer, based on the good correlation existing between hexokinase activity or poor differentiation and tumour uptake [35] and also because increasing SUVmax values seem to correlate with T classification, which is part of the TNM staging [36].

In our study, SUV measurements were not significant prognostic factors for overall survival. While various cutoff values of SUVmax tend to be associated with a poor prognosis, none led to statistically significant differentiation. Swisher et al. reported similar results in a uniform group of highly selected patients with locally advanced oesophageal cancer treated by neoadjuvant radiochemotherapy [37]. On the other hand, these results could appear in contrast with our previous report [18], where we initially reported that an SUVmax cutoff value of 9 had an independent prognostic value of overall survival, but this difference may be explained by the different patient characteristics considered in the two studies. We previously considered [18] a daily practice population, half of which underwent curative surgery, while we included here only patients with advanced disease exclusively treated by combined radiochemotherapy.

TL established by pathological examination has been demonstrated to be an independent prognostic factor for long-term survival [21]. Some authors proposed estimating TL based on 18F-FDG PET images using different thresholds [38]. Functional TL has been studied as a predictor of response to neoadjuvant chemoradiotherapy with conflicting results [11, 22]. In a group of 69 patients with oesophageal SCC undergoing curative surgery, Choi et al. demonstrated that functional TL was an independent prognostic factor [10]. However, one may argue that functional TL is a parameter that does not reflect the real volume of the tumour but only its longitudinal extension and could be therefore considered as only a surrogate of tumour spatial extent. This argument can be supported by the data shown in this work, where only a moderate correlation (r < 0.7) was found between TV and TL, suggesting that functional TV may be more accurate in assessing actual tumour burden. In our study we also compared the prognostic value of TL with that of TV. Both parameters were found to be significant prognostic factors irrespective of the functional volume delineation strategy. In addition, both TV and TL were independent prognostic factors for survival in the multivariate analysis. However, this result was found to be dependent on the segmentation algorithm, with both parameters being independent survival prognostic factors only when determined using the automatic FLAB segmentation. This may be related to the higher overall accuracy of FLAB with respect to adaptive thresholding for tumour delineation as previously reported [24, 27, 39]. Despite the similar prognostic values of TL and TV, only TV allowed a statistically significant stratification of patients into three groups, irrespective of the segmentation methodology. 
More specifically, two different cutoff values (85 and 14 cm3) resulted in significant differentiation of two groups among the patients with median overall survival of 5–6 vs 20 months (p = 0.0005) and 49 vs 13 months (p = 0.036) for 85 and 14 cm3, respectively. Being able to provide such a finer stratification of patient groups could be of value in clinical trials assessing new therapeutic regimes.

SUVmean measured in a volume determined using the different tumour delineation approaches considered was not found to be a prognostic factor for overall survival, although a trend was seen for SUVmean associated with TV defined with FLAB, which tended to differentiate patients with poor and better prognosis (13 vs 21 months, p = 0.06).

A fundamental biological question underlying 18F-FDG PET prognostic value is whether the total volume or the metabolically active portion of the tumour is most important. Intuitively both would seem important and desirable to determine. In our study, both functional TL and TV (representative of the tumour functional spatial extent) were significant prognostic factors, whereas SUVmean (representative of the tumour glycolytic metabolism) was not. The total lesion glycolysis (TLG) index, which combines both types of information, was a prognostic factor for overall survival whatever methodology was used for tumour delineation. However, it was not an independent significant prognostic factor in the multivariate analysis. Only very few data are available on the potential clinical value of TLG in different cancer models. Xie et al. reported on the prognostic value of TLG in head and neck cancer for long-term survival [40], while Cazaentre et al. demonstrated the usefulness of TLG for predicting response to radioimmunotherapy in lymphoma [41]. To date, the limited use of TV and TLG in clinical studies could be explained by the poor accuracy, robustness and reproducibility of available tumour delineation tools affecting the clinical value of resulting measurements. The fact that TLG was not an independent prognostic factor, whereas TV alone was, suggests that the prognostic value of TLG mainly comes from the volume information and is impaired by the low prognostic value of SUVmean measurements. In addition, the value of TLG might be reduced by a loss of information when combining the TV and the SUVmean into one parameter by simple product, since large tumours with low uptake might result in the same TLG as small tumours with high uptake. 
Finally, the lack of partial volume effects (PVE) correction might also play a role in the reduced prognostic value of all SUV measurements as well as the resulting TLG, since tumour volumes across the patients range from quite small and significantly affected by PVE (<2 cm in diameter) to very large tumours for which PVE have smaller quantitative impact.

As expected, results concerning parameters dependent on the tumour delineation process were well correlated. On the other hand, our results also highlighted the potential impact of differences between existing tumour delineation methods, with TV and TL being independent survival prognostic factors only when determined using FLAB. This approach has been previously shown to be both robust and accurate [24, 27]. At present most commonly used methods are based on fixed or adaptive thresholds. Fixed thresholding has been demonstrated to be both inaccurate and non-robust [23, 24, 27, 39] and was therefore not considered in this study.

Regarding the adaptive thresholding performance, results from one observer (T 2bckgrd) were closer to those of FLAB than results from the other (T 1bckgrd), although with significant differences, as shown in Fig. 2b and Table 3. Neither TV (T 1bckgrd) nor TV (T 2bckgrd) was an independent prognostic factor, contrary to TV (FLAB). This can be explained by the behaviour of adaptive thresholding (independently of the user) for several tumours. Most of the tumours exhibited simple shapes and homogeneous tracer uptake. However, some were more complex and exhibited higher heterogeneity, or were small (<2–3 cm) with low contrast. Adaptive thresholding has been demonstrated to provide unsatisfactory delineation for such cases [24], because its final threshold is based on the ratio between an isocontour at 70% of the maximum and the background ROI. Such an isocontour tends to overestimate the actual value of the entire tumour for heterogeneous uptake, and to underestimate it for small tumours with low contrast.

Hence the adaptive thresholding led to significant underevaluation of larger heterogeneous tumours in our study, e.g. a patient with a survival of 6 months had a TV defined by FLAB of almost 97 cm3, whereas TV (T 1bckgrd ) and TV (T 2bckgrd ) were 38 cm3 (−61%) and 50 cm3 (−50%), respectively, clearly missing parts of the tumour. On the other hand, the dependency on the background ROI is higher regarding small tumours with low contrast, e.g. for a patient with 21 months survival, TV (FLAB) was 5.8 cm3, whereas TV (T 1bckgrd ) and TV (T 2bckgrd ) were 1.9 cm3 (−67%) and 26.9 cm3 (+364%), respectively. Several patients were therefore incorporated in the wrong survival curve, mostly patients with large volumes that were underestimated by the adaptive threshold.
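
The dependence of adaptive thresholding on the background ROI can be illustrated with a small sketch. The exact formula of the method in [23] is not given in this text, so the code assumes a commonly described form, threshold = beta × mean70 + background, where mean70 is the mean uptake inside the 70% isocontour. That form, the coefficient beta = 0.15, the function name and all voxel values are assumptions made for illustration only. A real implementation would also restrict the result to the connected region around the lesion.

```python
def adaptive_threshold_volume(suvs, background, beta=0.15, voxel_ml=0.064):
    """Return (threshold, volume in ml) for a flat list of voxel SUVs.

    Assumed form: threshold = beta * mean70 + background, with mean70 the
    mean of voxels at or above 70% of the maximum. beta is hypothetical.
    """
    suv_max = max(suvs)
    core = [v for v in suvs if v >= 0.7 * suv_max]
    mean70 = sum(core) / len(core)
    threshold = beta * mean70 + background
    n_inside = sum(1 for v in suvs if v >= threshold)
    return threshold, n_inside * voxel_ml


# Hypothetical heterogeneous lesion: hot core, warm rim, faint margin, background.
suvs = [10.0] * 10 + [4.0] * 20 + [2.5] * 30 + [1.0] * 40

# Two observers place the background ROI differently and measure 1.2 and 0.8.
thr_1, vol_1 = adaptive_threshold_volume(suvs, background=1.2)
thr_2, vol_2 = adaptive_threshold_volume(suvs, background=0.8)
print(round(thr_1, 2), round(vol_1, 2))
print(round(thr_2, 2), round(vol_2, 2))
```

In this toy case a difference of 0.4 SUV in the measured background moves the threshold across the faint margin and doubles the delineated volume, which mirrors the interobserver variability reported for low-contrast lesions.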

In addition, adaptive thresholding was found to be highly user dependent, since we observed a bias up to 30% between the two users measuring TV, the agreement interval being too large for clinical applications (−110 to +45%). This seems to be in agreement with results concerning the level of reproducibility in measuring functional TV from 18F-FDG imaging which can vary from 21 to 90% using automatic and threshold-based approaches, respectively [42]. If advanced segmentation algorithms are not available, the use of adaptive thresholding approaches should be preferred to manual or fixed threshold-based delineation. Automated background ROI determination could reduce the interobserver variability observed in this work.

The limitations of this study are its retrospective nature and the limited number of patients. Our results need to be confirmed through a prospective study on a larger cohort of patients. It is finally worth noting that overall survival might have been affected by other factors such as subsequent treatment for patients who relapsed, although this should have minor impact on the results of this study since it applies to all parameters considered. Other outcome measures such as progression-free survival were not investigated in this study.

Conclusion

Our results suggest that the functional tumour volume followed by length has additional value compared to commonly used SUV measurements (SUVmax, SUVpeak, SUVmean) for prognosis in patients with locally advanced oesophageal cancer treated with exclusive concomitant radiochemotherapy. Both parameters were significant prognostic factors for overall survival, independently of the approach used to delineate the tumours. However, only the automatic FLAB algorithm allowed TV and TL to be independent prognostic factors for survival in a multivariate analysis that included standard tumour staging. In addition, the total lesion glycolysis index was a statistically significant, but not independent, prognostic factor irrespective of the delineation algorithm used. Our findings confirm the potential value of 18F-FDG PET to give a useful orientation for patient management purposes in oesophageal cancer, but they also highlight the influence of the methodology used on the degree of pertinence of these PET image-derived parameters of interest as their accuracy and their clinical significance increase if they are computed using more reliable and robust tumour segmentation methods.