Abstract
3238
Introduction: Rising cancer incidence and mortality have made it necessary for cancer research to learn from existing data by gathering as much disease-related information as possible. Medical imaging reports contain rich, essential content on disease type, stage and outcome, but exist as free text. Manually extracting this information from large volumes of such unstructured reports is laborious. Natural language processing (NLP) tools can make information extraction from radiology reports less cumbersome and more effective. In this study, we developed machine learning models for extracting lung carcinoma disease identification phrases from radiology reports.
Methods: This retrospective study was approved by the IEC of the hospital with a waiver of consent. A corpus of 1500 radiology reports was used, comprising Computed Tomography (CT) and Positron Emission Tomography/Computed Tomography (PET/CT) reports of lung cancer, oesophageal cancer, stomach cancer and soft tissue sarcomas. Report extraction, data selection, anonymisation, cleaning and text pre-processing (tokenization, stop word removal and special character removal) were carried out using Python scripts developed in-house. Three models were used: XGBoost, Bi-LSTM_Simple (5 layers) and Bi-LSTM_Dropout (13 layers, including dropout layers). The models were trained to classify reports as containing any of three concepts (Lung Carcinoma, Lung Non-Small Cell Carcinoma and Lung Small Cell Carcinoma) or none. The first model uses gradient-boosted decision trees for prediction; the other two are deep learning models based on bidirectional long short-term memory (Bi-LSTM) neural networks. For the XGBoost model, nested 5-fold stratified cross-validation (CV) was performed with 20 trials. For the two Bi-LSTM models, the corpus was split into training and test sets (70:30). We calculated accuracy, sensitivity and F1 score for each model.
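The text pre-processing steps named above (tokenization, stop word removal, special character removal) can be illustrated with a minimal sketch. This is not the in-house script: the function name `preprocess`, the tiny `STOP_WORDS` set and the whitespace tokenizer are assumptions made for illustration only.

```python
import re

# Hypothetical, deliberately small stop word list; the list used in the
# study is not given in the abstract.
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "are", "and", "for", "with", "to"}


def preprocess(report):
    """Return the cleaned tokens of one free-text radiology report."""
    text = report.lower()
    # Special character removal: keep only letters, digits and whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Tokenization: a simple whitespace split (assumed; a real pipeline
    # might use a dedicated tokenizer).
    tokens = text.split()
    # Stop word removal.
    return [token for token in tokens if token not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("Mass in the right upper lobe, suspicious for carcinoma."))
```

A list this aggressive would also discard cues such as "no" or "not" if they were added to it, which is one reason negation handling needs separate treatment.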
Results: For the XGBoost model, the mean accuracy was 0.748 (0.006), and the overall sensitivity and F1 score were 0.75 and 0.74, respectively. The Bi-LSTM_Simple model gave an overall sensitivity and F1 score of 0.75 and 0.74, respectively, with an accuracy of 0.72 on the test set. The Bi-LSTM_Dropout model gave an overall sensitivity and F1 score of 0.76 and 0.75, respectively, for identifying reports with the listed concepts, with an accuracy of 0.76 on the test set.
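For readers unfamiliar with the reported metrics, the following sketch shows one way accuracy, sensitivity (recall) and F1 score could be computed for a multi-class report classifier. The abstract does not state how the "overall" values were averaged across classes; macro-averaging is assumed here, and the function name `classification_metrics` and the label strings are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged sensitivity and F1 (averaging is assumed)."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    recalls, f1_scores = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        recalls.append(recall)
        f1_scores.append(f1)

    return {
        "accuracy": accuracy,
        "sensitivity": sum(recalls) / len(recalls),
        "f1": sum(f1_scores) / len(f1_scores),
    }


if __name__ == "__main__":
    # Toy labels for illustration only; these are not study data.
    truth = ["lung", "lung", "none", "none"]
    predicted = ["lung", "none", "none", "none"]
    print(classification_metrics(truth, predicted))
```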
Conclusions: All three models performed comparably, with the Bi-LSTM_Dropout model performing slightly better at classifying lung cancer reports based on pre-defined clinical concepts. The limitations of our study are the absence of negation detection and of external validation.