TO THE EDITOR: The excellent presentation by Scheuermann et al. (1) on how the PET community is addressing the sometimes-impaired ability to compare semiquantitative results among institutions was well worth reading. This article followed an earlier publication of many valuable recommendations (2) for multiinstitutional therapeutic response trials, including a preference for standardized uptake values (SUVs). With candor, however, the current study reports specific problems with SUVs: the scanners of one manufacturer give 20% and 4% lower SUVs than the scanners of other manufacturers for a physiologic phantom and a physical phantom, respectively. The former, somewhat of a surrogate phantom, was a rather precise population average of normal-liver 18F-FDG SUVs. More important, it is the absence of results from a quantitative measurement model of all factors controlling the magnitude of this phenomenon that undermines confidence in SUVs.
Also disturbing are results from an earlier survey (3) of normal-liver 18F-FDG SUV population averages: a rather wide range of values, from 1.5 to 3.6. This spread far exceeds the somewhat low SE expected for this physiologic phantom, SE = (liver average SUV of 2.5) × (0.2)/√n, where n is the number of subjects averaged and 0.2 is the approximate same-scanner coefficient of variation for a normal-liver population.
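As a worked instance of this expression (the value n = 25 is an assumption chosen only for illustration, not a number from the survey):

```latex
\mathrm{SE} = \frac{\overline{\mathrm{SUV}}_{\text{liver}} \times \mathrm{CV}}{\sqrt{n}}
            = \frac{2.5 \times 0.2}{\sqrt{25}} = 0.1
```

Even at this optimistic SE of 0.1, the surveyed spread of 1.5 to 3.6 spans roughly 20 such standard errors, underscoring how far the reported range exceeds what same-scanner averaging alone would predict.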
Additionally, in the current study a significant number of participating institutions had difficulty obtaining an SUV of 1.0 within a known physical phantom. This variation in accuracy occurred despite a necessarily biased sample of volunteering researchers who were making special efforts to qualify their PET quantitation methodology for clinical trials. It appears that the overall institution-dependent error magnitude would be a composite of these spurious errors (infrequent but perhaps of greater magnitude), systematic methodology errors, and instrument errors (probably of lesser magnitude).
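A minimal sketch of how such a composite might behave is given below; all error magnitudes and the 5% gross-error rate are assumptions chosen for illustration, not values taken from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # simulated phantom measurements

# Assumed, illustrative error components: rare "spurious" gross errors
# (5% of cases, SD 30%), a systematic methodology bias (SD 10%), and
# smaller random instrument noise (SD 3%), around the known phantom SUV of 1.0.
gross = np.where(rng.random(n) < 0.05, rng.normal(0.0, 0.30, n), 0.0)
systematic = rng.normal(0.0, 0.10, n)
instrument = rng.normal(0.0, 0.03, n)

measured = 1.0 * (1 + gross + systematic + instrument)
print(f"mean = {measured.mean():.3f}, SD = {measured.std():.3f}")
print(f"fraction outside ±10%: {(np.abs(measured - 1) > 0.10).mean():.3f}")
```

Under these assumed magnitudes, the infrequent gross errors noticeably fatten the tails of the measured distribution even though the typical measurement remains close to 1.0.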
It is good to see the rigor of physical phantom use being supplemented, with impressive results, by physiologic phantom data. I would like to call attention to a way to improve on the use of liver averaging. A more robust reference with better statistics might be provided by fully corrected population-averaged SUVs from a combination of several organs that individually have low coefficients of variation, similar to an approach in a mouse study (4). An atlas compilation of SUV data from many organs shows several candidates whose coefficients of variation are about as low as that of the liver (5).
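One way such a multi-organ reference could be combined is sketched below, using inverse-variance weights so that organs with lower coefficients of variation dominate the estimate; the organ list, expected SUVs, CVs, and one scanner's measured averages are hypothetical placeholders, not values from the cited atlas or survey.

```python
import numpy as np

# Hypothetical inputs for illustration only.
expected = {"liver": 2.5, "spleen": 1.9, "blood_pool": 2.0}   # population means
cv = {"liver": 0.20, "spleen": 0.22, "blood_pool": 0.18}      # same-scanner CVs
measured = {"liver": 2.1, "spleen": 1.6, "blood_pool": 1.7}   # one scanner

organs = list(expected)
ratios = np.array([measured[o] / expected[o] for o in organs])
weights = np.array([1.0 / cv[o] ** 2 for o in organs])  # inverse variance

# The weighted average of measured/expected ratios estimates a scanner
# scale factor; lower-CV organs contribute more to the estimate.
scale = np.sum(weights * ratios) / np.sum(weights)
print(f"estimated scanner scale factor ≈ {scale:.2f}")
```

Compared with a liver-only reference, pooling several low-CV organs this way shrinks the standard error of the reference in rough proportion to the effective number of independent organs included.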
A step beyond impartial reporting of SUV measurement accuracy would be to ask whether these findings suggest revisiting setting-specific decisions to choose the SUV over other markers, whether in the clinical trial setting or in the more commonly encountered single-patient diagnostic setting. Are all systematic and random measurement uncertainties being adequately considered as these judgments are made? Additionally, and more rigorously, should any preferred choice among competing markers be justified by studies (e.g., cost–benefit comparisons) for a particular setting? Further, are methodology subclasses of the SUV and other markers also being explored as options?
There are various candidate markers that compete with the SUV. In a methodology hierarchy of increasing complexity and diagnostically enhancing information, some classes (with subclass examples) to consider include the following: single-scan tissue ratios (e.g., the ratio of a region to an organ average (4), to liver, to cerebellum, or to a contralateral side); single-scan SUVs (with or without various corrections and transformations); dual-time scans (widely spaced in time or an extension of a whole-body scan, with or without patient-specific plasma tracer information for a 2-time-point Patlak plot); and dynamic scans (with a wide range of plasma tracer information options for Patlak or compartmental model analysis). The transformed SUV mentioned in this list, such as ln(SUV), stems from statistical distribution considerations when correctly quantified statistical significance plays a noticeable role in diagnostic decision making (6).
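For the dual-time class, a minimal sketch of the 2-time-point Patlak idea follows; the plasma curve shape, scan times, and tissue values are all assumed for illustration, and a real analysis would require measured, patient-specific plasma input data.

```python
import numpy as np

# Assumed plasma input function (kBq/mL) sampled over 60 min.
t = np.linspace(0.0, 60.0, 601)              # minutes
cp = 10.0 * np.exp(-0.1 * t) + 1.0           # hypothetical plasma curve

# Cumulative trapezoidal integral of the plasma curve.
dt = np.diff(t)
cp_int = np.concatenate([[0.0], np.cumsum(0.5 * (cp[1:] + cp[:-1]) * dt)])

def patlak_point(T, ct_T):
    """Patlak coordinates (x, y) at scan time T for tissue activity ct_T."""
    i = np.searchsorted(t, T)
    return cp_int[i] / cp[i], ct_T / cp[i]

# Two whole-body passes, e.g., at 40 and 60 min (tissue values assumed):
x1, y1 = patlak_point(40.0, 6.0)
x2, y2 = patlak_point(60.0, 7.5)
Ki = (y2 - y1) / (x2 - x1)  # net influx constant = the Patlak slope
print(f"two-point Patlak Ki ≈ {Ki:.4f} mL/min/mL")
```

The slope between the two normalized points is the net influx constant Ki, which, unlike a single-scan SUV, is insensitive to a uniform scaling error in the tracer calibration because that scaling cancels in the plasma-normalized coordinates.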
The message here is a suggestion to pause and reflect on whether to passively accept SUVs as presented by software, to aggressively improve the SUV methodology used, or to expend the effort to evaluate and pursue other options. This last alternative finds some support in a recommendation from a study that evaluated statistical considerations for SUV use in early clinical trials with few patients: consider the advantages of a better-performing, accurate PET procedure that, for a given statistical performance, permits fewer patients than the SUV would, even if the methodology is complex (7).
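The underlying sample-size arithmetic can be sketched with a standard two-sample normal approximation; the effect size and the two marker standard deviations below are illustrative assumptions, not values from the cited study.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(effect, sd, alpha=0.05, power=0.80):
    """Patients per arm to detect `effect` (two-sample normal approximation)."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)
    z_b = z.inv_cdf(power)
    return ceil(2 * ((z_a + z_b) * sd / effect) ** 2)

# Hypothetical illustration: detecting a 0.20 change in a response
# measure with a noisier marker (SD 0.25) versus a lower-variance
# alternative (SD 0.15).
print(n_per_arm(0.20, 0.25))  # -> 25 patients per arm
print(n_per_arm(0.20, 0.15))  # -> 9 patients per arm
```

Because the required sample size scales with the square of the marker's standard deviation, even a modest reduction in measurement variability can substantially shrink an early trial.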
Finally, returning to the endeavors of institutions to qualify their PET protocols: valuable and rarely tabulated information regarding human errors and other methodology problems, drawn from a large population of scanned patients, has been acquired for this paper (1). If these problems involve larger-magnitude errors, even ones that are infrequent and mostly controllable, they can nevertheless increase the probability that SUV measurements will be less accurate overall. If limited, ideally, to only modest random errors, the SUV might have acceptable potential in many settings. Additional specific recommendations from this work would be a beneficial resource for future PET procedural guidelines.