RT Journal Article
SR Electronic
T1 Best Practices for Evaluation of Artificial Intelligence-based Algorithms for Nuclear Medicine: The RELAINCE Guidelines
JF Journal of Nuclear Medicine
JO J Nucl Med
FD Society of Nuclear Medicine
SP 2725
OP 2725
VO 63
IS supplement 2
A1 Abhinav Jha
A1 Tyler Bradshaw
A1 Irene Buvat
A1 Mathieu Hatt
A1 Prabhat KC
A1 Chi Liu
A1 Nancy Obuchowski
A1 Babak Saboury
A1 Piotr Slomka
A1 John Sunderland
A1 Richard Wahl
A1 Zitong Yu
A1 Sven Zuehlsdorff
A1 Arman Rahmim
A1 Ronald Boellaard
YR 2022
UL http://jnm.snmjournals.org/content/63/supplement_2/2725.abstract
AB Introduction: Artificial intelligence (AI)-based methods are showing significant promise in multiple aspects of nuclear medicine imaging, including image acquisition, reconstruction, post-processing, dosimetry, diagnostics, prognostics, and clinical decision making. For clinical translation, rigorous and objective evaluation of these algorithms is imperative. This need is even more critical for AI-based algorithms because they learn their rules from analysis of particular training datasets and hence can suffer from limited interpretability, unpredictable output, and lack of generalizability. Our objective is to address this need by proposing best practices for the evaluation of AI algorithms for nuclear medicine. Methods: To define best practices for evaluating AI methods for nuclear-medicine imaging, the Society of Nuclear Medicine and Molecular Imaging (SNMMI) organized an Evaluation team within its AI Task Force. The team consisted of computational imaging scientists, nuclear-medicine physicians and physicists, biostatisticians, and representatives from industry and regulatory agencies. The team deliberated on the specific objectives of evaluating AI methods, the evaluation strategies that would best address those objectives, and the best practices for each of these strategies.
Results: The Evaluation team recommends that AI algorithms be evaluated objectively on clinical tasks and that the study yield a claim providing a clear and descriptive characterization of the performance of the AI algorithm and its generalizability. For this purpose, the claim should consist of five components: (a) definition of the clinical task; (b) description of the patient population for whom the task is defined; (c) definition of the imaging process; (d) the process to extract task-specific information; and (e) the figure of merit (FoM) used to evaluate task performance. The team recognized that the evaluation objectives differ at different stages of the AI product life cycle, and thus different sets of evaluation strategies are likely needed. Accordingly, the team proposes a four-class framework (Fig. 1) comprising proof of concept, technical efficacy, clinical utility, and post-deployment efficacy of AI algorithms. The goal of proof-of-concept evaluation is to demonstrate the innovation of a new AI algorithm and establish promise for further task-specific evaluation, typically through pilot studies using task-agnostic metrics. Technical evaluation should quantify the technical performance of an algorithm on a clinical task using measures such as detection and quantification accuracy, repeatability, and reproducibility. Clinical evaluation should quantify the efficacy of the algorithm in assisting clinical decisions; AI algorithms that claim improvements in making diagnostic, predictive, prognostic, or therapeutic decisions require clinical evaluation. Finally, the goal of post-deployment evaluation is to monitor algorithm performance in dynamic real-world settings after clinical deployment and to assess its clinical validity and value over time. This may also include assessing off-label use, such as the utility of the method in populations and diseases beyond the original claim.
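The five-component claim described above is essentially a structured record. As a minimal illustrative sketch (not part of the guidelines themselves; the class name, field names, and example values below are hypothetical), it could be captured in code as:

```python
from dataclasses import dataclass

# Hypothetical sketch of the five-component evaluation claim.
# Field names and example values are illustrative assumptions,
# not terminology prescribed by the guidelines.
@dataclass(frozen=True)
class EvaluationClaim:
    clinical_task: str            # (a) definition of the clinical task
    patient_population: str       # (b) population for whom the task is defined
    imaging_process: str          # (c) definition of the imaging process
    information_extraction: str   # (d) process to extract task-specific information
    figure_of_merit: str          # (e) FoM used to evaluate task performance

# Example claim for a hypothetical lesion-detection study
claim = EvaluationClaim(
    clinical_task="Lesion detection",
    patient_population="Adults referred for oncologic FDG-PET/CT",
    imaging_process="Whole-body FDG-PET/CT, standard acquisition protocol",
    information_extraction="Reader study with board-certified physicians",
    figure_of_merit="Area under the ROC curve (AUC)",
)
```

Making each component an explicit, required field mirrors the recommendation that no part of the claim be left implicit when reporting an evaluation study.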
For each class of evaluation, we provide best practices for study design, data collection, defining the reference standard, the process to extract task-specific information, and the figure of merit. The key recommendations are summarized as the RELAINCE (REcommendations for EvaLuAtion of AI for NuClear medicinE) guidelines, a set of 20 recommendations. We also advocate that AI-evaluation studies be carried out by multi-disciplinary teams, with physicians playing a key role. Conclusions: We envision that the proposed best practices will strengthen trust in AI and accelerate its clinical translation, ultimately leading to improvements in the quality of healthcare.