Abstract
We evaluated whether the artificial intelligence chatbot ChatGPT can adequately answer patient questions related to [18F]FDG PET/CT in common clinical indications before and after scanning. Methods: Thirteen questions regarding [18F]FDG PET/CT were submitted to ChatGPT. ChatGPT was also asked to explain 6 PET/CT reports (lung cancer, Hodgkin lymphoma) and answer 6 follow-up questions (e.g., on tumor stage or recommended treatment). To be rated “useful” or “appropriate,” a response had to be adequate by the standards of the nuclear medicine staff. Inconsistency was assessed by regenerating responses. Results: Responses were rated “appropriate” for 92% of 25 tasks and “useful” for 96%. Considerable inconsistencies were found between regenerated responses for 16% of tasks. Responses to 83% of sensitive questions (e.g., staging/treatment options) were rated “empathetic.” Conclusion: ChatGPT might adequately substitute for advice given to patients by nuclear medicine staff in the investigated settings. Improving the consistency of ChatGPT would further increase reliability.
The use of PET/CT is expected to increase because of a growing awareness of its value in clinical decision-making (1). With limited staff resources and mounting individual workloads, there is a need to increase efficiency, such as through use of artificial intelligence (AI) (2). Specifically, large language models such as OpenAI’s generative pretrained transformer (GPT) 4 might represent an information tool for patients to answer their questions when preparing for an examination or when reviewing the subsequent report.
However, the reliability of GPT can be undermined by false and potentially harmful responses termed hallucinations (3,4). False responses occur less often with more advanced versions such as GPT-4 (5) but have still been observed by Lee et al. (3).
Within the discipline of nuclear medicine, Buvat and Weber recently reported a brief interview with the AI chatbot ChatGPT, finding that it answered technical questions well while remaining cautious about providing recommendations or solutions (6). It is neither foreseeable nor desirable that AI tools will replace physicians for informed consent. Furthermore, use of such tools by nuclear medicine departments is currently limited by unsolved liability issues (7). However, if validated in a clinical context, such a tool might still be used by patients to obtain the information and general advice currently given by nuclear medicine staff (mainly technologists and physicians) and thereby enhance patient compliance (8).
To our knowledge, ours was the first systematic investigation of ChatGPT (with GPT-4) for patient communications related to PET/CT with [18F]FDG. We evaluated whether ChatGPT provides adequate, consistent responses and explanations to questions frequently asked by patients.
MATERIALS AND METHODS
ChatGPT Responses
OpenAI ChatGPT Plus, running the May 24 version of GPT-4, was used (https://openai.com/chatgpt). ChatGPT was accessed on May 25 and 26, 2023. All questions and PET/CT reports were entered as single prompts in separate chats. Each prompt was repeated twice using the regenerate-response function, resulting in 3 trials per prompt. In addition, ChatGPT was asked to provide references for 19 of the tasks.
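For illustration, a comparable protocol (each prompt submitted in a fresh, independent context, with 3 trials per prompt) could be approximated programmatically. The following Python sketch is not the workflow used in the study, which relied on the ChatGPT Plus web interface and its regenerate-response function; it assumes access to GPT-4 via the OpenAI Python client.

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

TRIALS_PER_PROMPT = 3  # one initial response plus 2 regenerations, as in the study

def collect_trials(prompt: str) -> list[str]:
    """Submit one prompt in independent contexts, mimicking separate chats."""
    responses = []
    for _ in range(TRIALS_PER_PROMPT):
        completion = client.chat.completions.create(
            model="gpt-4",  # the study used the May 24 version of GPT-4
            messages=[{"role": "user", "content": prompt}],  # fresh, single-prompt chat
        )
        responses.append(completion.choices[0].message.content)
    return responses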
Rating Process
Three nuclear medicine physicians, all of them native German speakers, rated the ChatGPT responses independently using the rating scale shown in Table 1. Two of the readers were board-certified nuclear medicine physicians with more than 10 y of experience in PET/CT reading; the third was a resident in nuclear medicine with 2 y of PET/CT experience. Appropriateness and usefulness were assessed on 4-point scales to prevent neutral responses and to facilitate binarization of the results. The criterion "empathetic" was used only to rate the follow-up questions related to PET/CT reports; it was a binary item to avoid ambiguous or artificial grading with a multipoint scale.
TABLE 1. Criteria and Categories Used for Rating
In addition, 1 reader rated the level of inconsistency among the 3 responses generated for each question and checked and rated the validity of all references provided by ChatGPT.
Generating Questions and PET/CT Reports
Thirteen questions frequently asked by patients concerning [18F]FDG PET/CT imaging (Table 2; Q1–Q13) were formulated using simple, nontechnical language (e.g., “PET scan”).
TABLE 2. All 25 Tasks Submitted to ChatGPT and Majority Rating
Five PET/CT reports (Table 2; R1–R5) were fictitious reports based on templates from our institution. The German reports were first translated with DeepL and then edited. Additionally, a sample report, "Sample Normal Report #2—Negative SPN," provided by the Society of Nuclear Medicine and Molecular Imaging (9), was used (R6). For R1–R6, the same prompt, "Please explain my PET report to me: [full text of the report]," was used. Since the PET/CT reports were fictitious, no ethical approval was needed.
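As a minimal sketch, the report-explanation prompt described above could be assembled as follows; report_text is a placeholder for the full text of one fictitious report (R1–R6), which is given in the supplemental files.

def build_report_prompt(report_text: str) -> str:
    # The same fixed prompt was used for all 6 reports.
    return f"Please explain my PET report to me: {report_text}"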
Statistical Analysis
The final rating for each task was selected by majority vote (except for “inconsistency” and “validity of references,” which were assessed by only 1 rater). When 3 different ratings arose, the middle category was chosen.
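For illustration, the majority-vote rule, the middle-category tie-break, and the binarization mentioned above could be expressed as follows. This is a sketch only, assuming the 4-point ratings are coded as integers from 1 (e.g., "fully inappropriate") to 4 ("fully appropriate").

from collections import Counter

def final_rating(ratings: list[int]) -> int:
    """Majority vote over the 3 raters; if all 3 differ, take the middle category."""
    value, count = Counter(ratings).most_common(1)[0]
    if count >= 2:              # at least 2 of the 3 raters agree
        return value
    return sorted(ratings)[1]   # all 3 ratings differ: choose the middle category

def binarize(rating: int) -> bool:
    """Collapse the 4-point scale: 3 and 4 count as appropriate (or helpful)."""
    return rating >= 3

assert final_rating([2, 4, 3]) == 3  # no majority, so the middle category is chosen
assert final_rating([4, 4, 2]) == 4  # simple majority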
RESULTS
All questions, PET/CT reports, and ChatGPT responses can be found in Supplemental Files 1 and 2 (supplemental materials are available at http://jnm.snmjournals.org).
Rating of ChatGPT Responses
Responses by ChatGPT to 23 of 25 tasks were deemed "quite appropriate" or "fully appropriate" (92%, Table 2), whereas responses to 2 tasks (8%), R1Q1 and R4Q1, were rated "quite inappropriate." Both questions queried the tumor stage on the basis of a PET/CT report that did not explicitly state the stage but contained sufficient information to determine it with established staging systems. In both instances, ChatGPT identified 2 potential tumor stages, one of which was correct.
ChatGPT responses were rated “very helpful” or “quite helpful” by majority vote for 24 of 25 tasks (96%). The response to Q4 was rated “quite unhelpful” because ChatGPT did not caution against breastfeeding after a PET/CT scan, which might still be relevant for patients who are caretakers of toddlers.
In 5 of 6 follow-up questions (83%) related to the potential consequences of the PET/CT findings, ChatGPT responses were rated “empathetic.”
General Observations
ChatGPT produced well-structured, intelligible responses. Information likely to cause emotional reactions, such as anxiety, was framed in a reassuring way (e.g., when an advanced stage of metastatic lung cancer was revealed [R4Q1]). The raters regarded this as one of the general signs of natural and humanlike responses (Supplemental File 3; Supplemental Table 1).
When PET/CT reports were being explained, the level of certainty conveyed by the ChatGPT responses seemed to depend on the clarity and extent of interpretation given in the report itself. We did not observe responses that were unrelated to the specific content of the PET/CT reports (hallucinations).
In 1 response, ChatGPT was able to provide a correct interpretation when explaining the PET/CT report of metastatic lung cancer (R5), although this interpretation was not explicitly provided in the report (Supplemental File 3; Supplemental Table 1).
Variation Among Trials
In responses to 21 of 25 tasks (84%), the 3 trials showed “irrelevant” or “minor” differences (Table 2). Responses to 4 tasks (16%)—3 of which were follow-up questions—were rated as showing “considerable” inconsistencies because ChatGPT addressed the specific tumor stage of the patient inconsistently.
Validity of References
In 2 of the 19 tasks (11%), 1 reference was considered invalid (hallucination) because the article could not be found by a manual search (details in Supplemental File 3; Supplemental Table 2).
References were fully valid in only 4 of 19 investigated tasks (21%). In 11 tasks (58%), at least 1 reference contained an outdated uniform resource locator or was only generic (e.g., “National Institute of Health’s U.S. National Library of Medicine”). In responses to 2 of 19 tasks (11%), the referenced article could be found only via a manual search.
DISCUSSION
None of the answers generated by ChatGPT would have caused harm or left the patient uninformed if the questions and PET/CT reports had been real patient inquiries.
Specifically, ChatGPT responses to more than 90% of questions were adequate and useful even by the standards expected of general advice given by nuclear medicine staff. For the 3 responses rated "quite unhelpful" or "quite inappropriate," at least one of the repeated trials produced a precise and correct answer. Although this observation shows that ChatGPT is in principle capable of providing appropriate answers to all 25 tasks, the variation between trials led to ratings of "considerable inconsistency." With future advances in AI models, the focus should be on reducing variation between responses so as to increase predictability and thus reliability.
The question of liability for AI-generated content still needs to be addressed. In a medical context, ChatGPT may be best regarded as an information tool rather than an advisory or decision tool. Every response from ChatGPT included a statement that the findings and their consequences should always be discussed with the treating physician (Supplemental File 3; Supplemental Table 3). Questions targeting crucial information, such as staging or treatment, were answered with the necessary empathy and an optimistic outlook.
We focused on the most common PET/CT tracer and on indications with a relatively large base of information available to GPT-4. The responses might be less helpful or reliable for rare indications or new tracers, especially if the relevant literature was published after the model's training cutoff (September 2021 for GPT-4) (6). Validation in other contexts will therefore be required.
The issue of 2 invalid references to original articles that seem to have been hallucinated also demands further investigation.
CONCLUSION
ChatGPT may offer an adequate substitute for the informational counseling currently provided to patients by nuclear medicine staff in the investigated setting of [18F]FDG PET/CT for Hodgkin lymphoma or lung cancer. With ever-decreasing time available for communication between staff and patients, readily accessible AI tools might provide a valuable means of improving patient involvement, the quality of patient preparation, and the patient's understanding of nuclear medicine reports. The predictability and consistency of responses from AI tools should be further increased, such as by restricting their sources of information to peer-reviewed medical databases.
DISCLOSURE
No potential conflict of interest relevant to this article was reported.
KEY POINTS
QUESTION: Might ChatGPT substitute for advice given to patients on [18F]FDG PET/CT?
PERTINENT FINDINGS: ChatGPT responses were appropriate and useful, but we observed some inconsistency between trials.
IMPLICATIONS FOR PATIENT CARE: Proper use of AI tools might improve patients’ involvement and their understanding of PET/CT reports.
Footnotes
Published online Sep. 14, 2023.
- © 2023 by the Society of Nuclear Medicine and Molecular Imaging.
Immediate Open Access: Creative Commons Attribution 4.0 International License (CC BY) allows users to share and adapt with attribution, excluding materials credited to previous publications. License: https://creativecommons.org/licenses/by/4.0/. Details: http://jnm.snmjournals.org/site/misc/permission.xhtml.
- Received for publication June 2, 2023.
- Revision received August 22, 2023.