Abstract
The aim of this review is to provide readers with an update on the state of the art, pitfalls, solutions for those pitfalls, future perspectives, and challenges in the quickly evolving field of radiomics in nuclear medicine imaging and associated oncology applications. The main pitfalls were identified in study design, data acquisition, segmentation, feature calculation, and modeling; however, in most cases, potential solutions are available and existing recommendations should be followed to improve the overall quality and reproducibility of published radiomics studies. The techniques from the field of deep learning have some potential to provide solutions, especially in terms of automation. Some important challenges remain to be addressed but, overall, striking advances have been made in the field in the last 5 y.
Nuclear medicine in oncology was revolutionized by the deployment of combined PET/CT scanners in the 2000s (1). Subsequent hardware and software innovations have led to improvements in the spatial resolution and the signal-to-noise ratio of reconstructed images, with point-spread-function and time-of-flight information being integrated into reconstruction algorithms (2,3). PET/CT images have always been quantitative, and these improvements have further increased their quantitative accuracy. However, PET/CT images have been exploited in a limited way in most clinical publications, clinical trials and, obviously, routine clinical practice. In most cases, nuclear medicine physicians detect and anatomically localize pathologic uptake visually. Subsequently, the identified lesions are characterized by a single semiquantitative parameter corresponding to the maximum-intensity voxel, known as the SUVmax. An aggregate of several voxels in a 1-cm3 spheric region may be used (SUVpeak) to increase the robustness of the measurement with respect to statistical noise (4). Although the SUVmax has been successful in several clinical applications, including diagnosis and staging, it has also been shown to be insufficiently discriminative in several settings, such as baseline prognosis (5) or prediction of a response to therapy (6).
Nuclear medicine physicians need to go beyond such a simplistic metric, notwithstanding the fact that these data are also images. In that regard, the recent success of deep learning (DL) is a promising development, because DL is specifically aimed at learning patterns relevant for a given task (e.g., segmentation or endpoint prediction) from the data (i.e., images) themselves, instead of relying on “engineered” or “handcrafted” features (7).
In parallel to the improvements in hardware and reconstruction software, several developments in image processing, analysis, and machine learning have been applied to PET/CT and SPECT/CT. First, preprocessing algorithms such as denoising (8,9) and correction of partial-volume effects (10) have led to improvements in both qualitative and quantitative accuracy. Second, compared with experts, (semi)automated algorithms have been able to detect lesions of interest and delineate them with similar accuracy and higher reproducibility and robustness (11). Third, the extraction of quantitative metrics from PET and SPECT images to characterize tumors or organs of interest has been exponentially growing over the last 10 y, relying initially on engineered features (12,13) or, more recently, on “deep” features extracted using convolutional neural networks (CNNs) (14). Finally, the development of multiparametric models using machine learning for disease diagnosis or staging and predicting outcomes also has been exponentially increasing (15,16). These 4 methodologic foundations are key elements of the field of radiomics (17,18).
Radiomics considers images as quantitative data from which to extract information that may not be accessible to the naked eye, even the expertly trained one (19). Thus, “images are more than pictures, they are data” (20); however, images should not be forgotten—that is, data are also images. Although the content of an image can be reduced to a set of quantitative features, the entire image may still provide additional information; it is important to remember this fact with regard to the learning process of DL algorithms.
The goal of this commissioned article is to provide an update on the state of the art, pitfalls, solutions for those pitfalls, future perspectives, and challenges in the quickly evolving field of radiomics (i.e., images as data and vice versa) in nuclear medicine imaging and associated oncology applications.
PET AND SPECT RADIOMICS PUBLICATIONS
Although the radiomics approach was initially developed in the context of radiotherapy and radiology, the number of studies applying radiomics to PET or SPECT has been steadily increasing. On March 25, 2019, about 1,000 publications (excluding abstracts and meetings) using the term radiomics could be found in Web of Science databases—an exponential increase (Fig. 1). About 27% of them concerned PET or PET/CT, and only a few concerned SPECT/CT (e.g., (21)). However, almost one-quarter (22%) of them were editorials and reviews. Also, several papers published before or after the term was introduced, without using it, could be considered “PET radiomics studies” (e.g., (12,22–24)).
FIGURE 1. Evolution of the number of publications found in Web of Science (all databases; black portion) using the term radiomics and of the subset also containing the term PET, PET/CT, or positron (white portion).
MAIN PITFALLS (AND SOLUTIONS FOR THOSE PITFALLS) IN NUCLEAR MEDICINE RADIOMICS STUDIES
The work flow of radiomics analysis is the same for any imaging modality and actually corresponds to the usual machine learning pipeline (Fig. 2): data (images) are input for an extractor (e.g., software calculating features), and then a modeling step is used to map the features to the classification goal (e.g., outcome for patients). This pipeline makes every step highly dependent on the methodologic choices made in the previous steps. Thus, there are several pitfalls in each of these steps.
FIGURE 2. Radiomics pipeline compared with the usual machine learning work flow.
Study Design
Before data collection is actually begun, it is important to define the question to be answered, to determine the kind (and quantity) of data needed to answer it, and to list the study needs and requirements. Several guidelines can help in the design of future studies (25–28), avoiding pitfalls typically associated with each of the next steps. For instance, we recommend relying on the radiomics quality score (29) and the TRIPOD (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis) guidelines (30). A specific example is to ensure having datasets of sufficient size and from different sources to satisfy the training, validation, and testing requirements (31).
Data Acquisition
Image Acquisition and Reconstruction
Data, including images, must be collected (retrospectively) or acquired (prospectively). When images are retrospectively collected, the associated raw data are not available; therefore, the reconstructed images must be exploited as they are. By contrast, when images are acquired prospectively, raw data should be stored for research purposes or, at the very least, image reconstruction settings suitable for radiomics analyses should be chosen. Indeed, clinical reconstruction settings are usually optimized for visual analysis tasks that are mostly focused on detection rather than finer characterization—hence, larger voxel sizes (∼4–5 mm, often nonisotropic) and postreconstruction smoothing of images with gaussian filtering that is suboptimal for radiomics. For radiomics, smaller (facilitating delineation) and isotropic (avoiding bias in texture computation) voxel sizes (Fig. 3) (32) should be used, without postfiltering, so that if the radiomics pipeline includes preprocessing steps (e.g., denoising, correction of partial-volume effects), these can be applied to unprocessed images.
FIGURE 3. Axial slice of 18F-FDG PET image of lung tumor reconstructed with 3-dimensional row-action maximum-likelihood algorithm using standard 4 × 4 × 4 mm (A) or finer 2 × 2 × 2 mm (B) voxels.
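Where prospective acquisition or access to raw data allows radiomics-oriented reconstruction, the reconstructed volume can be resampled to small isotropic voxels before any further processing. A minimal sketch using SimpleITK follows; the file names, the 2-mm target spacing, and the choice of B-spline interpolation are illustrative assumptions rather than recommendations made in this article.

```python
# Minimal sketch: resample a reconstructed PET volume to isotropic 2-mm voxels.
import SimpleITK as sitk

def resample_isotropic(image, new_spacing_mm=2.0):
    """Resample an image to isotropic voxels of the requested size."""
    original_spacing = image.GetSpacing()
    original_size = image.GetSize()
    # Compute the output grid size so that the physical extent is preserved.
    new_size = [int(round(sz * sp / new_spacing_mm))
                for sz, sp in zip(original_size, original_spacing)]
    resampler = sitk.ResampleImageFilter()
    resampler.SetOutputSpacing((new_spacing_mm,) * 3)
    resampler.SetSize(new_size)
    resampler.SetOutputOrigin(image.GetOrigin())
    resampler.SetOutputDirection(image.GetDirection())
    resampler.SetInterpolator(sitk.sitkBSpline)  # smooth interpolation of intensities
    return resampler.Execute(image)

pet = sitk.ReadImage("pet_suv.nii.gz")          # hypothetical SUV-scaled PET volume
pet_iso = resample_isotropic(pet, 2.0)
sitk.WriteImage(pet_iso, "pet_suv_iso2mm.nii.gz")
```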
Nonimage Data Collection
Collection of information from clinical records and other analyses (e.g., histopathology, transcriptomics, genetics) is a crucial step for which curation quality checks need to be provisioned in the study design. Indeed, this information is usually retrieved from medical records by investigators and manually entered into new research databases, a process prone to errors. Such errors introduced at this level of the work flow can be highly detrimental and complex to identify a posteriori, underscoring the need for a well-designed data infrastructure (25).
Multicenter Data
The need for larger multicenter datasets was emphasized previously (33,34). Indeed, developing multiparametric models requires large, representative cohorts to train the models on relevant data and make them as clinically useful and generalizable as possible. Because sharing data in a single storage facility for a centralized analysis is complex for legal, ethical, administrative, and technical reasons, such sharing is not the reality of current radiomics studies, especially in nuclear medicine. As a result, most published radiomics models have not been properly validated (35).
Distributed learning provides a solution to train a model at each institution and update the parameters of the model in a centralized computing station without the data ever leaving clinical centers, as only the parameters of the model are exchanged (36). However, whether images are processed locally or in a centralized fashion, differences in image properties and the resulting variability of features need to be taken into account to build robust models. Indeed, several studies showed that most radiomic features are sensitive (to a variable degree) to differences in scanner models, acquisition parameters, and reconstruction settings (23,37).
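As a purely conceptual illustration of the distributed-learning idea, the sketch below mimics a single round of federated averaging: each center trains a logistic regression locally, and only its parameters and sample size are aggregated centrally. This is a simplified stand-in for the approach described in (36); the model choice and helper names are hypothetical.

```python
# Conceptual sketch: only model parameters, never patient data, leave each center.
import numpy as np
from sklearn.linear_model import LogisticRegression

def local_update(X, y):
    """Train locally and return (coefficients, intercept, n_samples)."""
    model = LogisticRegression(max_iter=1000).fit(X, y)
    return model.coef_.ravel(), model.intercept_, len(y)

def federated_average(updates):
    """Aggregate local parameters, weighting each center by its sample size."""
    weights = np.array([n for _, _, n in updates], dtype=float)
    weights /= weights.sum()
    coef = sum(w * c for w, (c, _, _) in zip(weights, updates))
    intercept = sum(w * b for w, (_, b, _) in zip(weights, updates))
    return coef, intercept

# updates = [local_update(X_center, y_center) for X_center, y_center in centers]
# global_coef, global_intercept = federated_average(updates)
```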
Several options for addressing these issues are available. First, standardizing PET/CT acquisition and reconstruction protocol settings is an important aspect of multicenter data collection, with guidelines already available for PET/CT imaging (38,39). However, these are still mainly focused on the SUV and do not include standardization recommendations regarding radiomics, for which harmonization may be more difficult to achieve. Although these efforts should be expanded for radiomics (40) and can help reduce differences in radiomic feature distributions across different sites, they may not be sufficient. In addition, this approach is feasible only for prospectively collected data, but most radiomics studies are still performed by retrospectively analyzing available data.
Second, preprocessing images can help reduce differences, for example, by interpolating them to a common voxel size and filtering them so that they have similar resolution and noise characteristics. However, this approach may be insufficient to suppress all differences in the resulting radiomic feature distributions. Implementing it is also not trivial, as there are dozens of algorithms for interpolation and filtering, and identifying an optimal combination is challenging. This approach can also introduce artifacts or degrade the quantitative information contained in the images.
Removing features identified as too sensitive to the variability of acquisition and reconstruction settings is another solution that can help build more robust models when used with an external dataset. This solution has been extensively studied for several modalities, including PET (23,41–44), but similar studies for SPECT are currently lacking. Most of these studies have shown that the robustness, repeatability, or reliability of features (with respect to acquisition, reconstruction, filtering, segmentation, or analysis and computation choices) is highly variable among features in general as well as among features of a given category or specific matrix. The main drawback of this approach is the potential loss of information, as numerous features will be discarded even though they may contain clinically relevant information. Another limitation is that identifying the optimal subset of features that is sufficiently robust and provides enough discriminative power is challenging and likely must be done for each combination of image modality, cancer type, and task. A recently evaluated method consists of dealing with the variability of radiomic features from each dataset or cohort a posteriori in the modeling step itself, by harmonizing feature distributions so that they can be pooled together.
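One simple way to operationalize this feature-selection strategy is sketched below: features are extracted repeatedly for the same lesions under different reconstruction (or segmentation) settings, and only those with a low coefficient of variation across settings are retained. The DataFrame layout and the 10% tolerance are assumptions for illustration; test-retest metrics such as the intraclass correlation coefficient are often used instead.

```python
# Illustrative robustness screen: keep features whose variability across settings is small.
import pandas as pd

def robust_features(feature_tables, cov_threshold=0.10):
    """feature_tables: list of DataFrames (lesions x features), one per setting."""
    stacked = pd.concat(feature_tables, keys=range(len(feature_tables)),
                        names=["setting", "lesion"])
    # Coefficient of variation across settings for each lesion, then averaged over lesions.
    per_lesion_cov = stacked.groupby(level="lesion").std() / \
                     stacked.groupby(level="lesion").mean().abs()
    mean_cov = per_lesion_cov.mean()
    return mean_cov[mean_cov <= cov_threshold].index.tolist()
```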
Several methods have been developed to address the same issues in genomics, for which the “batch effect” has a significant impact. The ComBat method (45) has been shown to work well for small samples and to outperform similar techniques (46). ComBat was shown to allow PET radiomics predictive models to achieve higher performance in the external validation step (47,48). This approach has several advantages: it is easy and fast, and it allows all of the information to be exploited because all of the features are retained. One limitation is that a sample population of at least 30 patients from each center dataset must be available, and it cannot be applied on an individual-patient basis. Other techniques, such as rescaling and normalization, were recently evaluated for improving multicenter modeling, with interesting results (49).
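For illustration only, the sketch below applies a stripped-down location-scale alignment of feature distributions per center. Full ComBat (45) additionally uses empirical Bayes estimation of the additive and multiplicative batch effects and can preserve covariates of interest, so a dedicated implementation should be preferred in practice; the column names here are hypothetical.

```python
# Simplified per-center harmonization sketch (NOT full ComBat): remove each center's
# shift in mean and scale, then restore the pooled scale so features remain comparable.
import pandas as pd

def align_location_scale(features: pd.DataFrame, center: pd.Series) -> pd.DataFrame:
    """features: patients x features; center: per-patient center label (same index)."""
    pooled_mean, pooled_std = features.mean(), features.std()
    harmonized = features.copy()
    for c in center.unique():
        idx = center == c
        harmonized.loc[idx] = (features.loc[idx] - features.loc[idx].mean()) \
                              / features.loc[idx].std()
    return harmonized * pooled_std + pooled_mean
```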
Segmentation
The delineation of the object of interest (e.g., a tumor) is the most time-consuming bottleneck step, as full automation is difficult to achieve. In most studies, an expert first isolates the object of interest in a manually or semiautomatically defined volume of interest, and then a (semi)automated algorithm is used for actually delineating the object. This step is more complex for diffuse disease or several lesions. PET tumor delineation has been investigated in numerous studies (11). Despite perfect repeatability and very high interobserver reproducibility, threshold-based techniques have been shown to perform poorly in terms of robustness and absolute accuracy, especially for heterogeneous uptake distributions (11). Manual delineation has well-known limitations regarding inter- and intraobserver variability and should be performed by at least 2 (preferably more) experts, with consensus. A recent MICCAI (Medical Image Computing and Computer-Assisted Intervention) challenge highlighted the poor performance of fixed thresholding as well as the ability of more advanced techniques to achieve higher accuracy (50).
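For reference, the widely used fixed-threshold delineation criticized above can be written in a few lines; the 40%-of-SUVmax cutoff, the bounding-box convention, and the array layout are illustrative assumptions.

```python
# Fixed-threshold delineation sketch: keep voxels above a fraction of SUVmax inside a VOI.
import numpy as np

def fixed_threshold_mask(suv_volume: np.ndarray, bounding_box: tuple,
                         fraction: float = 0.40) -> np.ndarray:
    """bounding_box: (slice_z, slice_y, slice_x) defining the user-selected VOI."""
    mask = np.zeros_like(suv_volume, dtype=bool)
    zs, ys, xs = bounding_box
    voi = suv_volume[zs, ys, xs]
    mask[zs, ys, xs] = voi >= fraction * voi.max()
    return mask
```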
An alternative for reducing variability in the performance of individual algorithms and obtaining a more consistent result across a given dataset is to consider the statistical consensus of several methods (51). Another potential solution is to train an algorithm to select the best method for a given case depending on image properties and other a priori information (52). DL has been especially successful in medical image segmentation tasks (7), as the learning process occurs at the voxel level and not at the whole-image level (as for classification tasks), thereby reducing the amount of learning data needed for efficient training. Recently, CNN approaches were applied to PET (50) and PET/CT segmentation (53–55). DL algorithms for PET tumor detection and segmentation (56) may provide fully automated solutions for these steps of the radiomics pipeline. Similar efforts have been made to characterize disease in PET/CT images without the use of DL methods (57).
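A minimal sketch of the consensus idea (51) is given below, using a simple majority vote across the binary masks produced by several algorithms as a simplified stand-in for more elaborate statistical estimators such as STAPLE.

```python
# Majority-vote consensus of several delineations of the same lesion.
import numpy as np

def majority_vote(masks: list) -> np.ndarray:
    """masks: list of boolean arrays of identical shape, one per algorithm."""
    stacked = np.stack(masks).astype(np.uint8)
    # A voxel is kept if more than half of the algorithms included it.
    return stacked.mean(axis=0) > 0.5
```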
Feature Calculation
Standardization and Nomenclature
The main pitfall related to this step is the lack of standardization of both nomenclature and implementation. The calculation of features involves several steps and several different choices (mostly for textural features); their implementation is therefore prone to errors.
With regard to these issues, the efforts of the Imaging Biomarkers Standardization Initiative (IBSI) (26,58) should be emphasized. This initiative involves more than 20 research groups from 8 countries and aims to establish standardized definitions of usual radiomic features (currently 172) and their calculation; a common nomenclature for the full radiomics pipeline and each step of pre- and postprocessing leading to feature extraction; recommendations regarding interpolation, discretization, and texture matrix design; a benchmark of standardized values for each radiomic feature calculated in different possible configurations, based on both a synthetic digital phantom and real clinical images; and recommendations regarding reporting. The 123-page reference document (version 9, updated May 2019) is available online and published as a preprint (26). Although the IBSI is not specifically dedicated to PET imaging, most of its recommendations and results are directly applicable to PET radiomics studies. For instance, we highly recommend checking the IBSI compliance of homemade or commercial/open-source libraries and software before using them in a study, as doing so will greatly increase the study's reproducibility (27).
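As an example of why such standardization matters, the two intensity-discretization schemes described in the IBSI reference document (fixed bin number and fixed bin size) are sketched below; the parameter values (64 bins, 0.25 SUV bin width) are illustrative only.

```python
# IBSI-style intensity discretization within a segmented volume of interest.
import numpy as np

def discretize_fixed_bin_number(suv, n_bins=64):
    """Map intensities to 1..n_bins between the VOI minimum and maximum."""
    suv = np.asarray(suv, dtype=float)
    relative = (suv - suv.min()) / (suv.max() - suv.min() + 1e-12)
    return np.clip(np.floor(relative * n_bins) + 1, 1, n_bins).astype(int)

def discretize_fixed_bin_size(suv, bin_width=0.25, min_value=0.0):
    """Map intensities to bins of constant width (e.g., 0.25 SUV) above min_value."""
    suv = np.asarray(suv, dtype=float)
    return (np.floor((suv - min_value) / bin_width) + 1).astype(int)
```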
Confounding Issues for Volume and Other Metrics
Radiomic features include standard PET metrics (e.g., functional volume, SUVmean, or SUVmax). Regarding additional, more complex quantitative measurements, such as geometric descriptors (e.g., sphericity or surface irregularity) or second- and higher-order textural features (e.g., entropy from the gray-level cooccurrence matrix [GLCM] or gray-level nonuniformity [GLNU] from the gray-level run-length matrix [GLRLM]), it is important to check their redundancy and complementarity with both clinical factors (e.g., stage or sex) and other available variables as well as standard PET metrics (e.g., volume or SUVmax). It is pointless to calculate complex image features that are simply surrogates of these (41). This issue is especially important for metabolic volume, as all radiomic features are calculated from a volume of interest previously determined through segmentation of the tumor. It has been shown that the design choices made in the calculation of features, such as the method and parameters used in the discretization of original intensities or the merging strategies of texture matrices, can have a tremendous impact on feature distributions and their correlation with volume or SUVmax (33,43,44,59,60). For example, regarding PET radiomics, it was shown by Hatt et al. (41) that the textural features previously identified by Tixier et al. (24) to predict a response to chemoradiotherapy in esophageal cancer were actually highly correlated with the corresponding volume and therefore provided little to no additional information, with a predictive ability similar to that of the volume alone. It was later shown that, with different calculation settings (discretization and texture matrix design), the same textural features can provide complementary or additional value relative to volume, including for small tumors. Thus, their combination could lead to better stratification of patients (44), contrary to previous claims that no such complementary value could be obtained for volumes of less than 45 cm3 (61).
Another metric, the so-called “heterogeneity factor”—defined as the derivative (dV/dT) of the volume–threshold function—was reported to be highly correlated with functional volume (62) and was therefore a surrogate of volume rather than an actual heterogeneity measurement (63). Similarly, the CT-derived radiomics signature for lung and head and neck cancers (64) was demonstrated to reflect mostly tumor volume rather than actual tumor heterogeneity and shape, as the shape (compactness) and textural (GLRLM GLNU) parameters selected for this signature were later shown to be highly correlated with the corresponding volume (65,66). However, it was also shown that by adopting modified feature definitions, as proposed in the IBSI (i.e., dimensionless compactness, and normalized, merged textural matrices for GLNU calculation), it was possible for the same signature to achieve higher prognostic power than volume (65).
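The redundancy check advocated above can be implemented as a simple screen that rank-correlates each feature with metabolic volume (and, analogously, with SUVmax) and flags likely surrogates. The sketch below is illustrative; the 0.8 cutoff and the tabular layout are assumptions.

```python
# Flag radiomic features that are mere surrogates of volume (or of SUVmax).
import pandas as pd
from scipy.stats import spearmanr

def volume_surrogates(features: pd.DataFrame, volume: pd.Series, rho_max=0.8):
    """Return features whose |Spearman rho| with volume exceeds rho_max."""
    flagged = {}
    for name in features.columns:
        rho, _ = spearmanr(features[name], volume)
        if abs(rho) > rho_max:
            flagged[name] = rho
    return flagged
```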
New Features
A single feature can take a large number of different values according to the choice of various parameters, including—but not limited to—the intensity discretization method and its parameter(s) or the texture matrix merging strategy (directions and averaging). Although the feature definition is always the same, the obtained value can vary greatly from one matrix design to the next and therefore creates an additional variable for the analysis. This variability can actually be a way to optimize texture analysis, as each feature may end up being more informative with specific and different calculation choices (32,60). However, the robustness of features with respect to their calculation settings should not be emphasized at the expense of their clinical discriminative power; further investigation is warranted to determine which features are indeed robust enough relative to their level of discriminative power for a given endpoint (67).
New “engineered” or “handcrafted” features with potentially higher discriminative power or better properties are continuously being developed. CoLlAGe (Cooccurrence of Local Anisotropic Gradient Orientations) (68), a metabolic gradient (69), and 3-dimensional Riesz-covariance textures (70) are examples of such new features with potentially higher differentiation power than standard textural features. A novel metric for quantifying PET heterogeneity was also proposed as a more intuitive and simple alternative to textural features; this method involves summing voxelwise distributions of differential SUVs, weighted by the distance of SUV differences among neighboring voxels from the center of the tumor (71). This metric was designed to yield increased values for tumors with peripheral subregions having high SUVs. A new gray-level cooccurrence matrix methodology was also recently developed to reduce the redundancy of the resulting features, demonstrating more accurate classification of tumor types in CT images (72). Even if some of these metrics were not specifically developed for PET imaging, they could be directly applied to PET.
Finally, DL has also been the source of new features, commonly denoted as “deep features.” These can be extracted from medical images using pretrained networks. These networks may have been trained on very large medical image datasets as well as on natural images (73). Because these networks have learned from natural images to extract rough to finer features at different scales through different layers, they can extract similar patterns and features from medical images (including PET) and can be used “off the shelf” or after an additional fine-tuning step (also called transfer learning). Most current results obtained with deep features as well as their combination with typical radiomic features have been obtained in CT and MRI applications (74–78), but the same concept can be applied to PET.
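A hedged sketch of “off the shelf” deep-feature extraction is given below: a torchvision ResNet-50 pretrained on natural images has its classification head removed and is applied to a PET slice replicated across 3 channels. The preprocessing choices (intensity rescaling, 224 × 224 input size, slice-based input) are assumptions for illustration, not a validated pipeline.

```python
# Extract "deep features" from a PET slice with a network pretrained on natural images.
import numpy as np
import torch
import torchvision

# Older torchvision versions use resnet50(pretrained=True) instead of weights="DEFAULT".
model = torchvision.models.resnet50(weights="DEFAULT")
feature_extractor = torch.nn.Sequential(*list(model.children())[:-1]).eval()

def deep_features(pet_slice: np.ndarray) -> np.ndarray:
    """Return a 2,048-dimensional deep-feature vector for one 2-dimensional slice."""
    x = torch.tensor(pet_slice, dtype=torch.float32)
    x = (x - x.min()) / (x.max() - x.min() + 1e-12)        # crude rescaling to [0, 1]
    x = x.unsqueeze(0).unsqueeze(0).repeat(1, 3, 1, 1)     # replicate to 3 channels
    x = torch.nn.functional.interpolate(x, size=(224, 224),
                                        mode="bilinear", align_corners=False)
    with torch.no_grad():
        return feature_extractor(x).flatten().numpy()
```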
Modeling
Statistical analysis for mapping the extracted features to a given endpoint (either classification or regression) is one of the most challenging steps in the entire radiomics process. The goal of this step is usually to identify the optimal combination of the fewest available variables (clinical data, radiomics, and other analyses) allowing the maximization of 1 or several criteria (usually the receiver operating characteristic area under the curve, concordance statistic, specificity, sensitivity, or accuracy). Indeed, statistical analysis was the weakest part of most texture and radiomics studies before 2015 because it tested too many hypotheses (i.e., number of features) for small patient cohorts without correction for type I errors (i.e., false discovery) and without the use of a validation dataset, thereby reporting mere (overfitted) correlations and not actual predictive power. Most radiomics or texture studies with PET have been performed with cohorts of fewer than 150 patients (48) and—because the number of features (and variables) is constantly growing, especially in the case of texture optimization (i.e., calculation of each feature with different parameters)—statistical analysis is fraught with the curse of dimensionality, a high rate of false-positive results, collinearity issues, and risk of overfitting. Choosing a machine learning method is also quite challenging. The most recent comparison studies highlighted the differences between popular methods as well as the fact that none of them performed best across the entire spectrum of datasets and tasks (15,16,79).
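As a small illustration of controlling type I errors when many features are tested univariately, the sketch below applies a Mann-Whitney test per feature followed by Benjamini-Hochberg false-discovery-rate correction; the choice of test and the data layout are assumptions.

```python
# Univariate screening of many features with false-discovery-rate control.
import pandas as pd
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

def univariate_screen(features: pd.DataFrame, outcome: pd.Series, alpha=0.05):
    """outcome: binary endpoint aligned with the feature table index."""
    p_values = [mannwhitneyu(features.loc[outcome == 0, c],
                             features.loc[outcome == 1, c]).pvalue
                for c in features.columns]
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=alpha, method="fdr_bh")
    return dict(zip(features.columns, zip(p_adjusted, reject)))
```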
Following simple guidelines for robust and reliable statistical analysis and machine learning (31) is crucially important for obtaining reproducible and reliable results. The most important guidelines are splitting the available data into a training set (i.e., for learning a model, e.g., a linear combination of 2 variables) and a validation set (i.e., for tuning the parameters of the model, e.g., the weights of each variable in the linear combination) and performing the final evaluation with a testing set (i.e., applying the trained model with fixed parameters to a dataset never used in the training and validation steps). In the context of radiomics, different strategies can be used. Splitting a single available dataset into 3 parts is a potential solution; for example, a 100-patient cohort can be split into a training set of 50 patients, a validation set of 30 patients, and a final testing set of 20 patients. Obviously, the larger the cohort, the better, as evaluating the final model with only 20 patients can provide limited evidence of its usefulness. Alternatively, if different datasets are available (e.g., in a multicenter setting), then it may be appropriate to train and validate with 1 cohort and test the resulting model with the other cohorts (47,80). However, this approach requires harmonization of the features because of differences in their distributions across centers.
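The 50/30/20 three-way split described above can be obtained, for example, with scikit-learn as sketched below; stratification on the endpoint is used so that event rates are similar across the 3 sets.

```python
# Three-way (train/validation/test) split of a single cohort, stratified on the endpoint.
from sklearn.model_selection import train_test_split

def three_way_split(X, y, seed=42):
    """Return (train, validation, test) splits of roughly 50%, 30%, and 20%."""
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        X, y, train_size=0.50, stratify=y, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(
        X_rest, y_rest, train_size=0.60, stratify=y_rest, random_state=seed)
    return (X_tr, y_tr), (X_val, y_val), (X_te, y_te)
```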
Different techniques can be used for splitting; we recommend either using stratified sampling (81) to ensure similar distributions in the splits or performing several different splits randomly and reporting the mean and SD for the results. Indeed, random splits can lead to very different distributions in the training, validation, and testing sets (e.g., all “easy” cases end up in the training set or, worse, the training set contains all of the cases to detect but the testing set contains none).
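The repeated-split alternative can be sketched as follows: stratified random splits are repeated and the mean and SD of the chosen metric are reported. The logistic model and the receiver operating characteristic area under the curve are used here only for illustration, and numpy arrays are assumed as inputs.

```python
# Repeated stratified splits with mean and SD of the performance metric.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

def repeated_split_auc(X, y, n_splits=20, test_size=0.3, seed=42):
    """X, y: numpy arrays; returns (mean AUC, SD of AUC) over the repeated splits."""
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=test_size,
                                      random_state=seed)
    aucs = []
    for train_idx, test_idx in splitter.split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        aucs.append(roc_auc_score(y[test_idx],
                                  model.predict_proba(X[test_idx])[:, 1]))
    return np.mean(aucs), np.std(aucs)
```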
Another important pitfall concerns the imbalance of the data and of the classification (or regression) task, combined with the metrics used for performance evaluation. In most radiomics studies, the clinical endpoint is not balanced; that is, 1 class (e.g., patients with recurrence) dominates the other (e.g., patients without recurrence). In such a context, a machine learning algorithm classifying each instance as the dominating class would end up being right most of the time. Therefore, it is important to implement strategies to help an algorithm learn the minority cases as well as the majority cases, despite having fewer training examples. Several strategies are available; these include synthesizing additional minority instances (e.g., with the synthetic minority oversampling technique [SMOTE] (82)), oversampling the minority class, undersampling the majority class, or tweaking the cost function to raise the cost of misclassifying minority instances. Furthermore, it is important to rely on appropriate performance metrics, especially in the case of imbalanced data. For example, the often-used accuracy or F1 score can provide a biased estimation in the case of imbalanced data; the use of balanced accuracy (the mean of sensitivity and specificity), the receiver operating characteristic area under the curve, and the Matthews correlation coefficient is recommended instead to provide a reliable estimate of the performance of the model (31). For survival analysis and regression tasks, the use of hazard ratios and the concordance statistic is appropriate for evaluating time-to-event endpoints (15).
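A minimal sketch of these recommendations is given below: SMOTE oversampling is applied to the training data only, and imbalance-aware metrics are reported on the test data. The random forest classifier is an arbitrary illustrative choice.

```python
# Handle class imbalance on the training set and report imbalance-aware metrics.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (balanced_accuracy_score, matthews_corrcoef,
                             roc_auc_score)

def train_and_evaluate(X_train, y_train, X_test, y_test, seed=42):
    # Oversample the minority class only on the training data to avoid leakage.
    X_bal, y_bal = SMOTE(random_state=seed).fit_resample(X_train, y_train)
    model = RandomForestClassifier(random_state=seed).fit(X_bal, y_bal)
    y_pred = model.predict(X_test)
    return {
        "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
        "mcc": matthews_corrcoef(y_test, y_pred),
    }
```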
Finally, as there seem to be no currently available classifiers or feature selection methods that perform best across the entire spectrum of tasks and types of data, it may be interesting to consider ensemble techniques and the fusion of several different classifiers as a way to obtain more robust models (83).
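The fusion idea can be illustrated with a soft-voting ensemble that averages the predicted probabilities of several different classifiers; the specific classifiers below are arbitrary examples.

```python
# Soft-voting ensemble of several different classifiers.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

ensemble = VotingClassifier(
    estimators=[
        ("logistic", LogisticRegression(max_iter=1000)),
        ("random_forest", RandomForestClassifier(n_estimators=200)),
        ("svm", SVC(probability=True)),  # probability=True enables soft voting
    ],
    voting="soft",
)
# ensemble.fit(X_train, y_train); probabilities = ensemble.predict_proba(X_test)[:, 1]
```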
REMAINING CHALLENGES TO ADDRESS AND HOW TO MOVE TO CLINICAL TRANSLATION
Despite numerous recent efforts, the radiomics community still has to address the following main challenges to enable both robust, reproducible research and actual clinical translation: finalize and expand standardization efforts; develop tools and methods for collecting, storing, and sharing sufficiently large databases containing images associated with contextual clinical data and other analyses for a large panel of pathologies; reach a level of full automation for the entire pipeline (especially for the detection and segmentation steps); and identify and standardize optimal methods for model building and validation.
Regarding the collection of larger datasets, the main limitation preventing multicenter studies from reaching their full potential (i.e., the sensitivity of most radiomic features to variations in scanner models, acquisition protocols, and reconstruction settings) can be considered largely resolved through the use of a posteriori harmonization methods (45,47,49). For most of the remaining challenges, DL techniques can provide potential solutions, either for each of the steps in the radiomics work flow or by entirely replacing the usual work flow with an end-to-end DL-based approach (14,84). In the latter approach, all steps performed separately and sequentially (segmentation, feature extraction, modeling) are performed by 1 (or several) neural network(s). This approach mostly replaces previous challenges with others specific to the use of DL techniques, such as the need for datasets much larger than those usually available in radiomics studies for efficient training. Therefore, techniques such as transfer learning and data augmentation become crucially important. Another requirement is to provide interpretable models by opening the “black box” that such networks, with the millions of parameters they contain, can appear to be. This requirement could be met by network visualization techniques (85), providing visual feedback to end users and explaining why and how the network reached its final prediction—for instance, by providing heat maps on the original input images to highlight the most relevant areas in the images or even within the tumor.
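As a hedged illustration of such visual feedback, the sketch below computes a simple gradient-based saliency map for a trained PyTorch classifier; Grad-CAM and related network visualization techniques (85) are more refined, and the model here is assumed rather than specified by this article.

```python
# Simple gradient-based saliency map for a trained image classifier.
import torch

def saliency_map(model: torch.nn.Module, image: torch.Tensor) -> torch.Tensor:
    """Return |d prediction / d input| for a single input image tensor."""
    model.eval()
    x = image.clone().unsqueeze(0).requires_grad_(True)   # add batch dimension
    score = model(x).max()                                # top predicted score
    score.backward()                                      # gradients w.r.t. the input
    return x.grad.abs().squeeze(0)                        # high values = influential voxels
```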
CONCLUSION
The field of radiomics has been growing exponentially, including in PET/CT imaging. It is a very active and promising field of research, but it is full of methodologic pitfalls. Until recently, the approach has mostly been to consider images as data by reducing full 3-dimensional images to a vector of relevant quantitative handcrafted radiomic features. With the advent of DL techniques to solve challenges and lift the limitations of the current radiomics work flow, the radiomics community is returning to images as a whole; in this approach, patterns are captured by multilayer neural networks that learn the relevant features instead of relying on the selection and combination of handcrafted features.
DISCLOSURE
No potential conflict of interest relevant to this article was reported.
© 2019 by the Society of Nuclear Medicine and Molecular Imaging.
REFERENCES
Received for publication February 1, 2019.
Accepted for publication March 28, 2019.