Abstract
P160
Introduction: Multimodal deep neural networks (DNNs) can integrate information from multiple modes of data (imaging, tabular, etc.) and may enable complex medical decision making and disease prediction. These networks are "black-box" models with limited interpretability, and their performance may therefore be unreliable or untrustworthy. There are two approaches to making AI more interpretable: using inherently interpretable models ("interpretable AI") or generating post-hoc explanations for AI decisions ("explainable AI"). While a multitude of post-hoc explanation methods have been applied to single-modality DNNs, these do not account for the complexity of multimodal data drawn from multiple sources. For example, it is difficult to assess the relative input importance for a DNN with both image and tabular inputs. This work applies post-hoc explanations to multimodal DNNs and is among the first to characterize feature importance not at the input layer, but at the deep layers where features from different modalities are fused.
Methods: We developed and compared four methods for modality importance estimation. In essence, we repurpose methods designed for input feature importance ranking: input gradient importance, permutation importance, LIME, and Shapley values. Input gradient and permutation importance measure the impact of each feature on a model's prediction by examining how the prediction changes when the feature's values are altered. LIME explains a model's predictions by fitting an interpretable surrogate model locally around each prediction, and Shapley values measure the contribution of each feature by averaging its marginal contribution to a given prediction. Unlike prior applications, we apply these approaches at the fusion layer (the deep layer where information from each modality is concatenated) to estimate the importance of each deep feature, then aggregate the deep feature importances contributed by each modality to estimate that modality's importance. To evaluate these methods, we use simulated data with ground-truth feature importance derived from synthetic decision functions. In total, we simulated 8 multimodal data sets, each containing 4 modalities, for a binary classification task with varying modality importance and noise levels.
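As an illustration of the fusion-layer procedure, the sketch below shows how one of the four methods (permutation importance) might be applied to the concatenated deep features and then aggregated per modality. This is a minimal sketch under assumed interfaces, not the study's implementation: the function and argument names (fusion_permutation_importance, encoders, head, slices) are hypothetical, and the model is assumed to expose per-modality encoders and a prediction head operating on the fused representation.

```python
import numpy as np
import torch

def fusion_permutation_importance(encoders, head, inputs, y, slices, n_repeats=10):
    """Hypothetical sketch: permutation importance at the fusion layer of a multimodal DNN.

    encoders : dict mapping modality name -> encoder module for that modality
    head     : module mapping the concatenated fusion features -> class logits
    inputs   : dict mapping modality name -> input tensor for that modality
    y        : ground-truth labels (1-D LongTensor)
    slices   : dict mapping modality name -> slice of columns its encoder
               occupies in the concatenated fusion vector (must match the
               concatenation order used below)
    """
    with torch.no_grad():
        # Build the fusion representation once: concatenate each modality's deep features.
        fused = torch.cat([encoders[m](inputs[m]) for m in encoders], dim=1)
        base_acc = (head(fused).argmax(1) == y).float().mean().item()

        # Per-feature importance: mean accuracy drop when a single fusion feature is permuted.
        n_features = fused.shape[1]
        feat_imp = np.zeros(n_features)
        for j in range(n_features):
            drops = []
            for _ in range(n_repeats):
                perm = fused.clone()
                perm[:, j] = perm[torch.randperm(len(perm)), j]
                acc = (head(perm).argmax(1) == y).float().mean().item()
                drops.append(base_acc - acc)
            feat_imp[j] = np.mean(drops)

    # Aggregate deep-feature importances into one score per modality, then normalize.
    modality_imp = {m: feat_imp[s].clip(min=0).sum() for m, s in slices.items()}
    total = sum(modality_imp.values()) or 1.0
    return {m: v / total for m, v in modality_imp.items()}
```

The same aggregation step (sum the deep-feature scores belonging to a modality, then normalize across modalities) would apply when the per-feature scores come from input gradients, LIME, or Shapley values instead of permutation.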
Results: A hybrid multimodal DNN was fit to each simulated data set, and all DNNs achieved a classification accuracy greater than 92% on independent test data. Estimated modality importance was close to the ground truth, as presented in Figure 1, with each importance method estimating modality importance within 8% of the ground truth. The root mean square error of the importance estimates was 0.12 for gradient, 0.09 for permutation, 0.13 for LIME, and 0.10 for Shapley values, suggesting that all methods perform similarly well for modality-level importance estimation.
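For reference, the error metric reported above is simply the RMSE between estimated and ground-truth modality importance fractions; a minimal sketch follows, using hypothetical modality names and values for illustration only (not the study's data).

```python
import numpy as np

def importance_rmse(estimated, ground_truth):
    """RMSE between estimated and ground-truth modality importance fractions."""
    err = [estimated[m] - ground_truth[m] for m in ground_truth]
    return float(np.sqrt(np.mean(np.square(err))))

# Hypothetical example with four modalities (illustrative values only).
truth = {"imaging": 0.40, "tabular": 0.30, "text": 0.20, "signal": 0.10}
est   = {"imaging": 0.36, "tabular": 0.33, "text": 0.21, "signal": 0.10}
print(importance_rmse(est, truth))
```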
Conclusions: The present study introduces a novel method for estimating the importance of each input modality in multimodal deep neural networks trained end-to-end. This method allows examination of the relative contribution of each data modality to the model's decision-making, as well as evaluation of the reliability of the model's predictions and the relevance of the information the model uses. The proposed method is compatible with a variety of feature importance techniques and has the potential to make complex multimodal models more interpretable. Future work will involve testing on and application to real-world multimodal medical data.