Abstract
Objectives: A primary barrier to the development of machine learning models in radiology is the need for high-quality physician labels for a large number of exams. Clinical PACS systems store this information, yet it is often buried within unstructured physician-dictated reports. The goal of this study was to evaluate the feasibility of training a model with labels extracted from radiology reports using natural language processing (NLP). We extracted visual Deauville scores from 18F-FDG PET lymphoma radiology reports and trained a classification model to distinguish responding from non-responding lymphoma patients based on their PET images.
Methods: A vendor-provided NLP tool was used to query the University of Wisconsin clinical PACS database for all PET exams related to lymphoma from 2009 to 2018. FDG PET images and radiology text reports were retrieved and anonymized. Radiology reports were preprocessed to remove punctuation and stop words. Stemming and n-gram analysis were performed to identify all variations in phrasing used to report Deauville scores. Deauville scores of 1-3 were grouped as responding and scores of 4-5 as non-responding. Cases that mentioned post-surgical inflammation or secondary cancer, or in which the physician's report impression was not definitive (e.g., uncertain reactive nodes), were removed. Coronal maximum intensity projections (MIPs) of the whole-body PET images were used as input to an EfficientNet convolutional neural network (CNN). Training was performed using a rectified Adam optimizer with a cyclic learning rate for 70 epochs. Data were split into training (70%), validation (15%), and test (15%) sets. A customized local interpretable model-agnostic explanations (LIME) method was used to help explain the model's incorrect classifications.
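As a rough illustration of the label-extraction step, the following Python sketch maps a single report string to a binary response label using a regular expression. The phrasing variants, exclusion terms, and function names are hypothetical placeholders; they stand in for the vendor NLP tool and the stemming/n-gram analysis actually used in the study.

```python
import re

# Minimal sketch of Deauville label extraction from one report string.
# Phrasing variants (e.g., "Deauville 4", "Deauville score of 4") are
# illustrative assumptions; the study enumerated real phrasings with
# stemming and n-gram analysis over the full report corpus.
DEAUVILLE_PATTERN = re.compile(
    r"\bdeauville(?:\s+score)?(?:\s+of)?\s*[:=]?\s*([1-5])\b",
    flags=re.IGNORECASE,
)

# Illustrative exclusion terms corresponding to the ambiguous cases removed.
EXCLUSION_TERMS = ("post-surgical inflammation", "secondary cancer")


def extract_label(report_text: str):
    """Return 0 (responding, Deauville 1-3), 1 (non-responding, 4-5), or None."""
    text = report_text.lower()
    if any(term in text for term in EXCLUSION_TERMS):
        return None  # ambiguous case, excluded from training
    match = DEAUVILLE_PATTERN.search(text)
    if match is None:
        return None  # no Deauville score reported
    score = int(match.group(1))
    return 1 if score >= 4 else 0


# Example usage
print(extract_label("Impression: Deauville score of 2, consistent with response."))  # 0
print(extract_label("Residual uptake above liver, Deauville 4."))                    # 1
```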
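The training configuration could be sketched along the lines below in PyTorch. The abstract specifies only the architecture family (EfficientNet), optimizer type (rectified Adam), a cyclic learning rate, and 70 epochs; the specific B0 variant, learning-rate bounds, half-cycle length, batch count, and use of torchvision/torch.optim.RAdam are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

# Assumed hyperparameters -- not reported in the abstract.
NUM_EPOCHS = 70
BATCHES_PER_EPOCH = 50  # assumption: depends on dataset size and batch size

# Binary classifier head on an ImageNet-pretrained EfficientNet-B0 backbone.
# Grayscale MIPs are assumed to be replicated to 3 channels by the data loader.
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 2)

optimizer = torch.optim.RAdam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=1e-5,
    max_lr=1e-3,
    step_size_up=2 * BATCHES_PER_EPOCH,  # assumed half-cycle length, in batches
    cycle_momentum=False,                # RAdam has no momentum parameter to cycle
)
criterion = nn.CrossEntropyLoss()


def train_one_epoch(loader, device="cpu"):
    """One pass over coronal MIP images with batch-wise cyclic LR updates."""
    model.to(device).train()
    for mips, labels in loader:  # (MIP image batch, binary response labels)
        mips, labels = mips.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(mips), labels)
        loss.backward()
        optimizer.step()
        scheduler.step()  # cyclic learning rate advances once per batch
```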
Results: Out of 4523 lymphoma PET exams, 1710 had reports containing Deauville scores. Following removal of ambiguous cases, there were 831 responding cases and 838 non-responding cases. In the test data set (N=249), the CNN was able to predict non-responding cases with an accuracy of 85%, sensitivity of 90%, and specificity of 81%. According to LIME, common causes of false-positive classifications included regions of brown adipose tissue uptake, injection site uptake, and kidney/ureter uptake.
Conclusions: This study demonstrated the feasibility of using NLP to extract labels from clinical radiology reports for training large-scale image classification models. The model was able to distinguish non-responding from responding lymphoma patients based on PET MIP images.