Introduction

Laryngeal cancer is the most common primary cancer site of the head and neck, with about 700 new cases annually in The Netherlands [1]. Because of the relatively low incidence and the specialised care required, the vast majority of laryngeal cancers in The Netherlands are treated in the eight recognised head and neck cancer centres of the Dutch Head and Neck Oncology Cooperative Group.

Parallel to developments elsewhere, non-surgical treatment modalities in our country are more common than in the past due to improved results of altered fractionation schedules in radiation therapy and the addition of chemotherapy to radiation [2–4]. The aims of non-surgical treatments are organ preservation and improvement of quality of life [5, 6]. Especially in this group of patients, early detection of residual or recurrent tumour is of critical importance because prompt salvage surgery improves control of disease.

The accuracy of the currently available techniques for diagnosing persistent or recurrent disease after (chemo)radiotherapy is limited. Post-irradiation inflammation, oedema and necrosis can hamper the detection of residual or recurrent local tumour. CT and MRI rely on structural changes and show a limited accuracy, with reported sensitivities ranging from 50% to 58% and specificities from 33% to 100% for detection of recurrent laryngeal carcinoma [7–14]. Direct laryngoscopy under general anaesthesia with biopsies runs the risk of complications (e.g., inducing necrosis, infection and further oedema) when biopsies are taken from irradiated tissue and was false negative in 31% of the initial laryngoscopies in previous research [15].

2-Deoxy-2[F-18]fluoro-d-glucose-positron emission tomography (FDG-PET) is a promising technique for tumour detection after (chemo)radiotherapy. FDG-PET seems more accurate for the detection of recurrent head and neck carcinomas than other diagnostic methods [10, 16–22]. The reported sensitivity of FDG-PET for the detection of recurrent carcinoma after (chemo)radiotherapy varies between 80% and 100% and the specificity varies between 63% and 93% [12–14, 23–28].

Evaluation of PET in these studies is typically by visual interpretation, which is prone to observer variation. To the best of our knowledge, no studies have been conducted to investigate observer variation in a multi-centre setting for this indication.

To evaluate the value of FDG-PET in the detection of recurrent laryngeal carcinoma after radiotherapy, a randomised controlled multi-centre trial was started recently within the framework of the Dutch Head and Neck Oncology Cooperative Group [29]. The extent to which its future results can be generalised, and thus the applicability of PET in daily clinical practice, depends on the degree of agreement among the observers. Therefore, we evaluated the interobserver variability in reporting among 11 observers involved in this trial, using a set of FDG-PET scans of patients suspected of having recurrent laryngeal carcinoma after radiotherapy.

Materials and Methods

From the VU University Medical Center (VUmc) PET database, we identified 30 FDG-PET scans of consecutive patients with a clinical suspicion of persistent or recurrent laryngeal carcinoma after radiotherapy from the year 1998 to 2000. This suspicion was based on clinical symptoms, office laryngoscopy or diagnostic imaging, other than FDG-PET. Patients’ T stages prior to radiotherapy were T1 (n = 4), T2 (n = 16), T3 (n = 5) and T4 (n = 5), N stages were N0 (n = 27) and N1 (n = 3) and none of the patients had distant metastases. The mean age at time of the PET scan was 60.9 ± 11.1 years. The median interval between the last radiation fraction and the PET scan was 8.7 months (range 2.4–32.1 months). As reference standard, we used the results of biopsies and histology or the absence of signs of tumour within 6 months after the PET scan.

PET Imaging

All patients underwent FDG-PET after at least 6-h fasting. Blood glucose levels were measured before scanning. All patients were non-diabetic. The median blood glucose level measured with a glucotouch stick was 5.9 ± 1.6 mmol/l. Sixty minutes after intravenous injection of 370 MBq of 18F-FDG, imaging was performed in a full-ring bismuth germanate oxide PET scanner (ECAT EXACT HR+; CRI/Siemens, Erlangen, Germany). Five patients received a dose of 555 MBq because of higher weight. A scanning track from the base of skull to the clavicles was used, i.e. two bed positions per patient, with an acquisition time of 5 min per bed position. PET imaging was done with 2D acquisition using Ordered Subset Expectation Maximization (OSEM 2-16) to reconstruct the images. Attenuation correction was not applied in our clinical practice because of the results of the systematic review by Joshi et al. [30]. The acquired images were viewed on a local PC of the participating centre with the PETViewer 2.0.10.570 (Microsoft Windows XP Professional Service Pack 2 (build 2600)).

Data Analysis

The panel of observers consisted of 11 experienced nuclear medicine physicians from the eight Dutch Head and Neck Oncology Cooperative Group medical centres. We recorded the experience of the observers with FDG-PET for this indication in terms of the estimated number of FDG-PET scans for laryngeal cancer they had assessed. Images were reviewed at each site on an available PC running Microsoft Windows (XP/2000). The scans were presented with a PET-viewer that allowed variable gamma and window tuning.

The observers were requested to interpret the PET scans as being indicative for the presence of local residue or recurrence and to classify the result as negative, equivocal or positive. Only increased uptake at the site of the initial primary tumour was used for further analyses. The observers were not provided with specific criteria for determining positivity and negativity. Evaluation was made on an overall basis rather than on a per-lesion basis.

The observers read the scans independently, and clinical information (regarding TNM stage and site of the primary tumour, last radiotherapy fraction, symptoms and the result of other diagnostic tests) was provided to imitate the clinical setting. Correlative anatomic imaging (CT or MRI) was not provided nor were the reports of these scans. There was no time restriction for the assessment. Observers were blinded for the final clinical classification until they had reported all scans; thereafter, the investigator (LvdP) provided them with these results to improve standardised reading during the following prospective randomised trial. Cases that were scored incorrectly (when compared to the reference) or equivocal by at least nine observers were regarded as difficult cases and are described in the results.

The observers were asked which criteria they used for their interpretation of the FDG-PET scan.

Statistical Analysis

Data were obtained for a sensitive and a conservative PET reading strategy: the PET results were dichotomised by assigning equivocal scores to either the PET-positive or to the PET-negative classifications, respectively. Mean sensitivity, specificity, positive predictive value and negative predictive value were determined for either strategy. A Bayesian plot was used to show the probability of proven tumour within 6 months for varying prevalences of tumour. To illustrate the probability of tumour in time after the PET scan for the different strategies and for different observers, a Kaplan Meier analysis was performed. Correlation between experience and diagnostic performance (measured with both conservative and sensitive strategies) and percentage of equivocal scores was evaluated.
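The dichotomisation and the four accuracy measures described above can be illustrated with a minimal Python sketch. The function names, the string labels for the three reading categories and the six example readings are hypothetical and serve only as illustration; they are not the study data, nor the software used for the analysis.

```python
from typing import Dict, List


def dichotomise(scores: List[str], strategy: str) -> List[bool]:
    """Map three-point readings to PET-positive (True) or PET-negative (False).

    'sensitive' counts equivocal as positive, 'conservative' as negative.
    """
    if strategy not in ("sensitive", "conservative"):
        raise ValueError("strategy must be 'sensitive' or 'conservative'")
    equivocal_is_positive = strategy == "sensitive"
    mapping = {"positive": True, "negative": False,
               "equivocal": equivocal_is_positive}
    return [mapping[s] for s in scores]  # KeyError on an unknown label


def _ratio(numerator: int, denominator: int) -> float:
    # Undefined measures (empty denominator) are returned as NaN.
    return numerator / denominator if denominator else float("nan")


def accuracy_measures(test_positive: List[bool],
                      disease_present: List[bool]) -> Dict[str, float]:
    """Sensitivity, specificity, PPV and NPV against a reference standard."""
    pairs = list(zip(test_positive, disease_present))
    tp = sum(1 for t, d in pairs if t and d)
    fp = sum(1 for t, d in pairs if t and not d)
    fn = sum(1 for t, d in pairs if not t and d)
    tn = sum(1 for t, d in pairs if not t and not d)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


# Hypothetical readings of one observer and the matching reference standard.
readings = ["positive", "equivocal", "negative",
            "equivocal", "negative", "positive"]
reference = [True, True, False, False, False, False]

for strategy in ("sensitive", "conservative"):
    print(strategy, accuracy_measures(dichotomise(readings, strategy), reference))
```

In this toy example, moving the equivocal readings from the positive to the negative class lowers sensitivity and raises specificity, which is the trade-off the two strategies are meant to expose.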

To analyse the interobserver variability, we used agreement statistics (κ) with a classification according to Landis et al. [31] (Table 1). Linear-weighted kappa was used to determine interobserver variability of the 11 observers compared to the reference standard and pairwise compared to each other, for both conservative and sensitive strategy.
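For readers less familiar with the statistic, a linear-weighted kappa for a single pair of raters can be sketched as follows. This is a minimal illustration under stated assumptions: the readings are coded 0 (negative), 1 (equivocal) and 2 (positive), the function name is hypothetical, and the sketch covers one pair only. How the pairwise values were pooled over the 11 observers and how the confidence intervals were obtained is not described here and is not reproduced.

```python
from typing import Sequence


def linear_weighted_kappa(rater_a: Sequence[int],
                          rater_b: Sequence[int],
                          n_categories: int = 3) -> float:
    """Linear-weighted kappa for two raters on an ordinal scale 0..k-1."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must score the same non-empty set of cases")
    n = len(rater_a)
    k = n_categories

    # Observed joint proportions of (score by A, score by B).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Linear disagreement weight: |i - j| / (k - 1).
    obs_disagreement = 0.0
    exp_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)
            obs_disagreement += weight * observed[i][j]
            exp_disagreement += weight * row[i] * col[j]

    if exp_disagreement == 0.0:
        raise ValueError("kappa is undefined when both raters use one category only")
    return 1.0 - obs_disagreement / exp_disagreement


# Hypothetical readings of four scans by two observers.
observer_1 = [0, 0, 1, 2]
observer_2 = [0, 1, 1, 2]
print(round(linear_weighted_kappa(observer_1, observer_2), 3))
```

With linear weights, a disagreement between negative and equivocal counts half as much as a disagreement between negative and positive, which suits a three-point ordinal scale.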

Table 1 Classification of the interobserver variability with kappa

Results

Within 6 months after the FDG-PET scan, a local recurrence was histologically proven in seven patients (23%).

For both conservative (equivocal considered negative) and sensitive strategies (equivocal considered positive), the accuracy was determined per observer and depicted in a box plot (Fig. 1). The mean data of the accuracy of the 11 observers are shown in Table 2. For the conservative reading strategy, mean sensitivity of 87% (range 57–100%), specificity of 81% (range 65–96%), positive predictive value of 61% (range 43–80%) and negative predictive value of 96% (range 88–100%) were found. For the sensitive reading strategy, mean sensitivity of 97% (range 86–100%), specificity of 63% (range 39–87%), positive predictive value of 46% (range 33–70%) and negative predictive value of 99% (range 93–100%) were found.

Fig. 1

Accuracy for conservative and sensitive reading strategies depicted in a box-and-whisker plot. The boxes contain the central half of the measurements (heavy line indicating the median). The dots are the values that are extremely far from the central box (outliers).

Table 2 Mean pooled accuracy for conservative (equivocal = negative) and sensitive (equivocal = positive) strategies

In the Bayesian plot (Fig. 2), the two strategies mainly differ in the intermediate ranges of the prior probability of proven recurrence. Also, these differences are larger for the false negatives (lower corner) than for the true (and thus false) positives.

Fig. 2

Bayesian plot with the prior and posterior probability of proven recurrence within 6 months after FDG-PET for the conservative and the sensitive strategies.
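The relation displayed in such a plot follows from Bayes' theorem. The sketch below is a minimal illustration, assuming that the mean sensitivity and specificity reported above for each strategy hold at every prior probability; the function names are hypothetical. Because the reported predictive values are means over observers, this calculation is not expected to reproduce them exactly.

```python
def posterior_given_positive(prior: float, sensitivity: float,
                             specificity: float) -> float:
    """Probability of recurrence after a positive PET reading."""
    true_pos = sensitivity * prior
    false_pos = (1.0 - specificity) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


def posterior_given_negative(prior: float, sensitivity: float,
                             specificity: float) -> float:
    """Probability of recurrence after a negative PET reading."""
    false_neg = (1.0 - sensitivity) * prior
    true_neg = specificity * (1.0 - prior)
    return false_neg / (false_neg + true_neg)


# Mean sensitivity and specificity per reading strategy, as reported in the text.
STRATEGIES = {
    "conservative": (0.87, 0.81),
    "sensitive": (0.97, 0.63),
}

# Priors strictly between 0 and 1 keep both denominators non-zero.
for prior in (0.1, 0.23, 0.5, 0.7, 0.9):
    for name, (sens, spec) in STRATEGIES.items():
        pos = posterior_given_positive(prior, sens, spec)
        neg = posterior_given_negative(prior, sens, spec)
        print(f"prior={prior:.2f} {name:12s} "
              f"P(tumour|PET+)={pos:.3f} P(tumour|PET-)={neg:.3f}")
```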

In a Kaplan Meier analysis (Fig. 3), as expected, an observer with a high accuracy (versus the reference) predicted the prognosis for local disease-free control more accurately than an observer with a low accuracy. With the conservative strategy, for both observers, a curve was established with significantly more local recurrences in the PET-positive than in the PET-negative group. Seven recurrences (71%) manifested within 6 months after PET, no recurrences were diagnosed between 6 and 12 months and the remaining two recurrences were seen between 12 and 24 months.

Fig. 3

Kaplan Meier analysis of a proven recurrence after PET, with stratification of the negative and positive assessed patients using the sensitive (a, b) and conservative (c, d) strategies by an observer with high accuracy (observer 1; a, c) and an observer with low accuracy (observer 2; b, d).
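The product-limit estimate underlying such curves can be sketched in a few lines. The example below is a minimal illustration with hypothetical follow-up times; the function name and the data are assumptions and do not correspond to the patients in this study. The comparison of curves between PET-positive and PET-negative groups (for example with a log-rank test) is not covered.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float],
                 events: List[bool]) -> List[Tuple[float, float]]:
    """Product-limit estimate of the recurrence-free proportion.

    times  : follow-up in months until recurrence or censoring
    events : True if a recurrence was observed, False if censored
    Returns (time, estimate) for every time with at least one recurrence.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    curve = []
    survival = 1.0
    for t in sorted(set(times)):
        # Patients still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        n_events = sum(1 for u, e in zip(times, events) if u == t and e)
        if n_events:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
    return curve


# Hypothetical group of six patients (months of follow-up, recurrence yes/no).
months = [3, 5, 12, 20, 24, 24]
recurrence = [True, True, False, True, False, False]

for time, estimate in kaplan_meier(months, recurrence):
    print(f"{time:>4} months: recurrence-free proportion {estimate:.3f}")
```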

The estimated total number of FDG-PET scans for suspected recurrent laryngeal cancer that the observers had assessed previously varied between 0 and 300 (experience with dual head gamma cameras included). There was no statistically significant association between this experience and the number of ‘equivocal’ scores (p = 0.610, Table 3). The equivocal category was reported at a mean of five (out of 30 cases) per observer (17%, range 1–10). Furthermore, we found no significant correlation between the experience and diagnostic performance (conservative p = 0.360, sensitive p = 0.528, Table 3).

Table 3 Correlations of the log-experience of the observers with the number of equivocal scores and with the accuracy when equivocal scores are counted as negative (conservative) or positive (sensitive)
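A correlation with log-transformed experience can be sketched as follows. Several details are assumptions made only for this illustration and are not taken from the analysis: a Pearson coefficient is used, experience is transformed as log10(1 + number of scans) so that an observer with no prior scans can be included, and the function names and example values are hypothetical. The associated p values are not computed.

```python
import math
from typing import List, Sequence


def log_experience(n_scans: Sequence[int]) -> List[float]:
    """log10(1 + n); the offset of 1 is an assumption to allow n = 0."""
    return [math.log10(1 + n) for n in n_scans]


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical observers: number of scans read before, and accuracy achieved.
scans_read = [0, 9, 99]
accuracy = [0.70, 0.80, 0.90]
print(round(pearson_r(log_experience(scans_read), accuracy), 3))
```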

The interobserver variability in comparison to the reference (local recurrence within 6 months after the PET scan) showed a moderate relation [κ = 0.55; 95% confidence interval (CI): 0.33–0.76]. The interobserver variability as pairwise comparison of the observers, which expresses the consistency between observers, also showed a moderate relation (κ = 0.54; 95% CI: 0.42–0.67).

When reducing the data from a three- to a two-point scale, the conservative strategy proved to result in a better interobserver agreement in comparison to the reference (κ = 0.59; 95% CI: 0.38–0.79) than the sensitive one (κ = 0.43; 95% CI: 0.22–0.63). The same was true for the pairwise comparison of the observers (conservative: κ = 0.58; 95% CI: 0.44–0.71, versus κ = 0.51; 95% CI: 0.37–0.65 for the sensitive strategy).

There were two difficult cases, with much discrepancy between the report of the observers and the reference (Fig. 4; patients #5 and #9). Both cases were negative according to the reference. Patient #5 underwent FDG-PET 5 months after the last radiotherapy fraction for a left-sided T3N1 supraglottic laryngeal carcinoma. Clinical suspicion of recurrence was based on unexplained otalgia. An MRI of the neck showed diffuse paraglottic swelling on both sides (mainly on the right side), which could either be post-irradiation effects or recurrent tumour according to the radiologist (Fig. 5). Five observers scored the PET scan (Fig. 5) as equivocal and six as positive. At direct laryngoscopy, irregular tissue at the left aryepiglottic fold and the epiglottis was seen, but biopsy revealed no malignancy. Three years and 2 months after PET, the patient died with lung metastases, but a local recurrence was never detected.

Fig. 4

All 30 cases and the results of the review (correct, equivocal, incorrect) compared to the reference standard (numbers of observers).

Fig. 5

Patient 5: MRI (STIR, axial) with diffuse paraglottic swelling on both sides, mainly right (arrows). PET (axial) with abnormal supraglottic ventral uptake, on the right side more than on the left side (arrows). Below the arrows is a region with abnormal uptake, probably caused by uptake in the crico-arytenoid muscle.

Patient #9 had a PET scan 2 years after completion of radiotherapy for a left-sided T2N0 glottic carcinoma. The PET scan (Fig. 6) was indicated because the left side of the glottis appeared suspicious at indirect laryngoscopy. CT scan of the neck showed a suspect area just ventral to the lesion described on the PET scan. Seven observers scored the PET scan as positive, three as equivocal and one as negative. Clinical follow-up was uneventful until, 2 years later, direct laryngoscopy revealed squamous cell carcinoma at the original tumour site. The laryngectomy specimen contained a squamous cell carcinoma of 1.5 cm in diameter located in the glottis with tumour extension into the thyroid cartilage.

Fig. 6

FDG-PET scan of case 9 (axial, coronal and sagittal images) reviewed as tumour positive by seven observers, equivocal by three observers and negative by one observer (arrows indicate region suspected of tumour).

The criteria the observers used for their interpretation were information derived from the PET scan, such as localisation, (a)symmetry and aspect of suspicious areas, diffuse versus focal lesions and the intensity of the suspicious areas compared to the intensity of the background, in combination with the clinical data (localisation of primary tumour, interval between radiation and PET).

Discussion

In the present study, we analysed the performance of 11 observers from the eight head and neck cancer centres in The Netherlands for the assessment of FDG-PET scans from patients who were suspected of having a local recurrence of laryngeal carcinoma after primary radiotherapy.

We found a reasonable chance-corrected proportional observer agreement, both in comparison to the reference standard and pairwise. It is difficult to predict how the agreement would change if a larger sample size were studied. To the best of our knowledge, this is the first study that examines the interobserver variability of more than two observers from different institutes in the detection of recurrent laryngeal carcinoma with FDG-PET. Many authors stress the importance of interobserver agreement [32, 33]. Fakhry et al. [34] studied the interobserver variability between two observers of FDG-PET in the detection of recurrent head and neck squamous cell carcinoma and found a good agreement (intraclass correlation coefficient >90). A substantial agreement was also described for metastatic disease. Bohdiewicz et al. [35] found a 90% agreement between two observers who reviewed FDG-PET scans for metastatic disease in the spinal cord, and Lim et al. [36] found a kappa of 0.68 for three observers who reviewed FDG-PET scans for peritoneal metastases. Hashimoto et al. [37] evaluated lung nodules with two observers of FDG-PET and found a kappa of 0.65. In a study performed by Zijlstra et al. [38], 11 observers reviewed FDG-PET scans for suspicion of recurrent lymphoma; their readings were in accordance with those of the experts in 82% to 94% of the tumour-positive patients and in 45% of the tumour-negative patients.

Because the observer panel in this study consisted of nuclear medicine physicians from all Dutch Head and Neck Cancer Centres, the results give a good impression of the overall diagnostic performance of PET for suspected laryngeal recurrence after radiotherapy in The Netherlands. Because the interobserver agreement was reasonable, we assume an acceptable reproducibility and thereby a general applicability of these results.

As expected, sensitivity and specificity varied inversely with the threshold of test positivity. Our data (mean sensitivity ranging from 87% to 97%, specificity from 63% to 81%) appear to reflect the distribution of such measures reported in the literature, with sensitivities ranging from 80% to 100% and specificities from 63% to 100% [12, 26–28, 39, 40].

Remarkably, no significant correlation between the accuracy and the experience of the observer was found. Also, no correlation was found between experience of the observers and the number of non-conclusive reports. At first glance, these findings may suggest that no specific experience is needed with FDG-PET for laryngeal carcinoma. Another explanation for this finding could be the lack of clinical feedback during daily practice in which a learning curve cannot be established. Therefore, regular feedback during daily practice seems essential also in situations where proof of presence or absence of disease may be obtained several months after PET. Finally, we recognise that the sample size was relatively small and that some observers reported that they were unfamiliar with interpretation of images without attenuation correction. However, the performance of these observers was not clearly different from the others. Moreover, considering the 95% confidence intervals of the correlation coefficients, it seems unlikely that a larger sample size would change these findings. Unfortunately, it was not possible to compare the correlations between accuracy and the experience of the observer or the number of equivocal scores with previous studies, as previous studies used two or three experienced nuclear physicians without differentiation of the level of experience or the relation with equivocal scores. Zijlstra et al. [38] reported that the experts did not have any equivocal scores, while the less experienced observers did have equivocal scores.

A variable number of cases was scored equivocal (a median of 17% equivocal reports per observer). Although the number of non-conclusive scores differed greatly per observer (range 0–10), this indicates that in contrast to how data are typically reported, dichotomous results of FDG-PET for recurrent carcinoma may be regarded as an artificial and unwanted simplification. To explore the effect on diagnostic performance of this phenomenon, we dichotomised the data. The conservative strategy in which the equivocal scores were analysed as negative resulted in a better overall accuracy and a better interobserver agreement (kappa 0.59 and 0.58) than the sensitive strategy (kappa 0.43 and 0.51).

In our population, the prevalence of histologically proven recurrence was 23%. Because the mean reported prevalence is 50% [14, 25, 27], we compared these prevalences in a Bayesian plot. When the prevalence is 50%, the difference between the two strategies for a negative PET scan, in favour of the sensitive strategy, is larger as compared to the prevalence in the present study.

We assume that in clinical practice, the sensitive reading is used if FDG-PET is used to select patients suspected of recurrent laryngeal carcinoma after radiotherapy for direct laryngoscopy under general anaesthesia. For the physician, the risk of missing a recurrence probably outweighs a futile direct laryngoscopy because early detection of a recurrence can be important for salvage surgery and clinical outcome. An inherent disadvantage of sensitive reading is the higher percentage of false positives and subsequently futile direct laryngoscopies under general anaesthesia and more interobserver variability. It can be expected that the interobserver agreement has improved by the feedback received after the assessment.

We used a disease-free follow-up of 6 months as reference standard of patients without recurrence because we assume that local disease manifests itself within this period. Extending this period carries the risk of including recurrent disease that developed after the PET scan. If local recurrences were not detected within the first 6 months, these were diagnosed at least 21 months after the PET scan. It seems highly unlikely that the lead-time of PET would be that long, but we admit that we cannot exclude the possibility of a very slow growing recurrence.

As was shown in the Kaplan Meier analyses (Fig. 3), a negative PET scan was highly predictive for local control, especially in the first 12 months. This suggests that patients with a negative PET scan might be spared a futile laryngoscopy under general anaesthesia and that regular follow-up might be sufficient.

While false-positive reading tends to be a problem, the negative predictive value of FDG-PET is high in both the conservative (96%) and the sensitive strategy (99%). The negative predictive value is, of course, dependent on the prevalence of disease. In the present study, the prevalence was only 23%. In the PET literature, the mean prevalence appears to be about 50% (manuscript in preparation), and high negative predictive values for different prevalence are reported [41]. Therefore, it can be anticipated that a negative FDG-PET excludes recurrent disease with a high certainty. With this unique characteristic, FDG-PET may be safely used as the first diagnostic step of triage for invasive procedures in patients suspected of recurrent laryngeal tumour. By filtering the patients with a negative PET scan out of the further diagnostic process, the percentage of futile diagnostic laryngoscopies can probably be diminished.

To further investigate the potential of FDG-PET for this indication, a prospective study with more patients is recommended. In the current study, the images were not attenuation-corrected. In the future, fused PET-CT will probably replace PET alone. Besides the anatomical information of the CT, it also offers the possibility to easily determine the ‘standardised uptake values’ for objective assessment. For the present indication, selection of patients with suspicion of recurrent laryngeal carcinoma after radiotherapy for direct laryngoscopy under general anaesthesia, detailed anatomical information is probably not essential. Uptake in the laryngeal area, which indicates further examination, can be assessed on PET alone. For this indication, no literature is available about the diagnostic value of PET-CT in comparison with PET alone. PET-CT may yield slightly different results, and this will be subject of further study [29]. Another relative disadvantage of the present study is the varying interval between the last radiation and the PET scan, with a minimum interval of 2.4 months. McGuirt et al. [41] and Ryan et al. [28] concluded that the accuracy of PET is significantly higher for an interval more than 3 months compared to 1 month.

Conclusions

While acknowledging that additional confirmation is necessary, we propose in view of the acceptable interobserver agreement that FDG-PET yields good negative predictive value for the detection of recurrent laryngeal carcinoma after radiotherapy. It could therefore be used as a first diagnostic step and may reduce the percentage of futile invasive diagnostics.