PT - JOURNAL ARTICLE
AU - Bradshaw, Tyler
AU - Cho, Steve
TI - Evaluation of large language models in natural language processing of PET/CT free-text reports
DP - 2021 May 01
TA - Journal of Nuclear Medicine
PG - 1188
VI - 62
IP - supplement 1
4099 - http://jnm.snmjournals.org/content/62/supplement_1/1188.short
4100 - http://jnm.snmjournals.org/content/62/supplement_1/1188.full
SO - J Nucl Med 2021 May 01; 62

AB - Objectives: Natural language processing (NLP) has many promising applications in nuclear medicine, including assisted report generation, synoptic reporting, and intelligent information retrieval. Recently, large transformer-based language models such as Bidirectional Encoder Representations from Transformers (BERT) have achieved state-of-the-art results on a number of NLP tasks. These models have not been explored in the domain of nuclear medicine, which has a unique vocabulary and reporting style. The goal of this study was to investigate the performance of language models in nuclear medicine through the task of report classification.

Methods: Different language models were investigated for their ability to correctly classify free-text PET/CT reports. The task was to classify reports into one of five categories based on the lymphoma FDG PET Deauville five-point visual criteria score (DS) contained in the report. PET/CT reports from 2009-2018 containing "Deauville" and "lymphoma" were identified and extracted from the University of Wisconsin-Madison clinical PACS system and anonymized. DS were automatically extracted from the reports, and all mentions of DS were then removed. The reports' findings and impression sections were combined and used for model training. A number of NLP methods were evaluated for their impact on classification performance. Two classes of language models were compared: doc2vec and BERT. For doc2vec, the report text was pre-processed using standard cleaning techniques, including stemming and removal of stop words and punctuation; custom synonym replacement was also performed. For BERT models, only custom synonym replacement was used. Three types of BERT models were investigated: the baseline BERT model, a bioBERT model trained on a corpus of medical literature, and bio-clinicalBERT, trained on a corpus of medical literature and clinical/discharge notes. The added value of appending custom nuclear medicine vocabulary (e.g., "SUV") to the BERT model's vocabulary was investigated. For all models, a DS classifier that took the language model's representation vector as input was trained on top of the language model. To determine whether models relied on confounding factors to classify reports (e.g., report length), a subset of reports was manually altered by swapping disease-positive sentences with disease-negative sentences, and vice versa, and the impact of sentence swapping on model predictions was evaluated.
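As a rough illustration of the doc2vec preprocessing described above, the sketch below applies stemming, stop-word and punctuation removal, and a toy synonym map before training a doc2vec model. It assumes gensim and NLTK; the synonym table and sample report texts are hypothetical stand-ins, not the study's actual data.

    # Minimal doc2vec preprocessing sketch (assumes gensim and NLTK;
    # synonym map and reports are illustrative stand-ins only).
    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    nltk.download("stopwords", quiet=True)  # one-time corpus download
    SYNONYMS = {"standardized uptake value": "suv"}  # toy synonym map
    STOPS = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    def preprocess(text):
        text = text.lower()
        for phrase, repl in SYNONYMS.items():  # custom synonym replacement
            text = text.replace(phrase, repl)
        text = re.sub(r"[^\w\s]", " ", text)   # strip punctuation
        return [stemmer.stem(t) for t in text.split() if t not in STOPS]

    # Toy stand-ins for anonymized report text.
    reports = ["Intensely FDG-avid nodal mass, consistent with lymphoma.",
               "No abnormal FDG uptake; complete metabolic response."]
    docs = [TaggedDocument(preprocess(r), [i]) for i, r in enumerate(reports)]
    d2v = Doc2Vec(docs, vector_size=300, min_count=1, epochs=40)
    vectors = [d2v.infer_vector(preprocess(r)) for r in reports]  # classifier inputs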
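The vocabulary-extension experiment can be sketched with the Hugging Face transformers library. The checkpoint name below is the publicly released BioBERT model on the Hugging Face hub and the token list is illustrative, so both are assumptions that may differ from the authors' exact choices.

    # Appending custom nuclear-medicine tokens to a BERT vocabulary
    # (assumes Hugging Face transformers; checkpoint and token list
    # are assumptions, not the study's exact choices).
    from transformers import AutoModel, AutoTokenizer

    name = "dmis-lab/biobert-base-cased-v1.1"  # public BioBERT checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)

    added = tokenizer.add_tokens(["SUV", "SUVmax", "FDG-avid", "Deauville"])
    # New tokens receive randomly initialized embedding rows, which are
    # then learned during fine-tuning.
    model.resize_token_embeddings(len(tokenizer))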
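The abstract does not specify the classifier architecture trained on top of the language model; a common pattern, shown here purely as an assumption, is a linear head over the [CLS] representation vector. This PyTorch sketch reuses the model from the previous sketch.

    # Hypothetical DS classifier head over the language model's
    # representation vector (the abstract does not state the head design).
    import torch
    import torch.nn as nn

    class DSClassifier(nn.Module):
        def __init__(self, encoder, hidden_size=768, n_classes=5):
            super().__init__()
            self.encoder = encoder                 # e.g., BioBERT from above
            self.head = nn.Linear(hidden_size, n_classes)

        def forward(self, input_ids, attention_mask):
            out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
            cls_vec = out.last_hidden_state[:, 0]  # [CLS] representation vector
            return self.head(cls_vec)              # logits over DS 1-5

    clf = DSClassifier(model)  # `model` from the vocabulary sketch above
    optimizer = torch.optim.AdamW(clf.parameters(), lr=2e-5)
    loss_fn = nn.CrossEntropyLoss()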
Results: A total of 1813 reports were included in the study, with 10% (181) used for validation and 20% (363) used for testing. Of the three BERT models, bioBERT had the best overall performance, although differences between models were small. The 5-class accuracy of the doc2vec model was 58% with a weighted kappa of 0.56. The 5-class accuracy of the bioBERT model was 65% with a weighted kappa of 0.64. Adding nuclear medicine vocabulary to the model's vocabulary had no consistent impact on performance, with changes in accuracy ranging from +10% to -12%. When categories were collapsed into responding (DS 1-3) and non-responding (DS 4-5) cases, doc2vec and bioBERT had the same classification accuracy of 81%. Sentence swapping changed the model predictions in approximately 40% of cases, suggesting that the models were interpreting report sentiment but also potentially relying on confounding factors.

Conclusions: Language models were able to accurately interpret the sentiment contained in free-text PET/CT reports. The large language model bioBERT outperformed doc2vec in the complex task of classifying reports into five classes, but both models performed similarly on the simpler task of binary classification. Future work will explore how confounding factors influence model performance using NLP interpretation methods.

Research support: This research was supported by GE Healthcare.
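The reported metrics can be reproduced with a short scikit-learn sketch. The abstract does not say whether linear or quadratic kappa weights were used, so the weighting below is an assumption, and the label arrays are toy values.

    # Accuracy and weighted kappa, plus the responding (DS 1-3) vs.
    # non-responding (DS 4-5) collapse (assumes scikit-learn; toy labels).
    import numpy as np
    from sklearn.metrics import accuracy_score, cohen_kappa_score

    y_true = np.array([1, 3, 4, 5, 2, 3])  # toy DS labels
    y_pred = np.array([1, 3, 5, 5, 2, 4])

    acc5 = accuracy_score(y_true, y_pred)                          # 5-class accuracy
    kappa = cohen_kappa_score(y_true, y_pred, weights="linear")    # weighting assumed
    acc2 = accuracy_score(y_true >= 4, y_pred >= 4)                # binary collapse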
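The sentence swaps themselves were performed manually, so only the downstream check lends itself to code: measuring how often predictions flip on the altered reports. The prediction arrays below are toy stand-ins.

    # Fraction of predictions changed by sentence swapping (toy values;
    # the study observed a flip rate of roughly 40%).
    import numpy as np

    pred_original = np.array([4, 2, 5, 1, 3])  # predictions on unaltered reports
    pred_swapped = np.array([1, 2, 3, 1, 4])   # predictions after manual swaps
    flip_rate = float(np.mean(pred_original != pred_swapped))
    print(f"Predictions changed for {flip_rate:.0%} of altered reports")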