
Harnessing AI for Psychiatric Use Requires More Nuanced Discussion

Abstract

If psychiatrists are going to consider whether AI technologies are fit for clinical care, they need to understand how they operate and how to use them.

The landscape of psychiatric care and learning is changing rapidly due to artificial intelligence (AI). A thoughtful discussion of AI in psychiatry requires an in-depth look into the technology itself. AI is a catch-all term that refers to technology that aims to match or exceed human ability and intellect. In this article, I will discuss aspects of AI that I believe have been lacking in our current discourse, with a specific focus on natural language processing (NLP), which is responsible for the power of popular tools like ChatGPT.

NLP is the process of “giving computers the ability to understand text and spoken words in much the same way human beings can,” according to IBM. Large language models (LLMs) are a type of AI and the most salient example of NLP in everyday life. They are AI-powered tools that have been trained on large amounts of written language to predict the most likely response to an input. OpenAI explains how a model, when asked to complete the sentence “Instead of turning left, she turned ... ,” would initially respond randomly, but after training, it correctly predicts “right” as the next word. This response reflects pattern recognition learned from training on written language, not isolated logical reasoning.
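
To make this concrete, the brief Python sketch below shows a model scoring candidate next words for that same sentence. It uses the openly available GPT-2 model through the Hugging Face transformers library, chosen here only because it can be run freely; it is not the model behind ChatGPT, and the printed probabilities are illustrative.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load a small, openly available language model and its tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Instead of turning left, she turned"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # a score for every vocabulary token at every position

# Look only at the scores for the word that would come next, and convert them to probabilities.
next_word_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_word_probs, k=5)

for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: {prob.item():.3f}")

Running this prints the five words the model considers most likely to follow the prompt, which is the pattern-completion behavior described above rather than reasoning about left and right.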

To increase the nuance of our discussions about how to include this technology in mental health care, we first need to understand the diversity of existing LLMs and the variables that affect their performance, explore the concept of prompt engineering, and examine methods to increase the external validity of LLM responses.

Although ChatGPT has received much attention, it is only one of many LLMs. Others include Claude, PaLM, PaLM-2, and LLaMA. Because of its accessibility, a majority of academic medical investigations into LLMs have focused on ChatGPT. When I searched PubMed for articles related to ChatGPT, there were 854 results; when I searched for PaLM-2, there were only 27. Most opinion pieces by physicians, whether in favor of or against the use of LLMs in medicine, cite only ChatGPT. Furthermore, most of these commentaries fail to appreciate the differences between updated versions of each model. OpenAI has released several versions of its GPT models, each iteration increasing in power: GPT-1 in 2018, GPT-2 in 2019, GPT-3 in 2020, and GPT-4 in 2023. Given the variable efficacy of different versions of ChatGPT in answering user questions, discussions of this technology in medicine need to specify the model and version to which they refer.

Understanding that LLMs have many adjustable variables (hyperparameters), such as “temperature,” is critical. Although somewhat technical, a brief description of how one variable functions can illustrate the complexity of this technology. Temperature controls the variability of a model’s responses. At zero, the model gives the same response to a question every time; at higher settings, responses may vary. For example, if “pizza” is the most common answer to “what is the best food,” an LLM with low variability will always answer “pizza.” Increase the temperature, and it might sometimes answer “sushi.” For LLMs in psychiatric care or education, lower variability is desirable for patient safety and consistent learning.
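
For illustration, the Python sketch below shows where temperature is set when a model is called programmatically. It assumes the OpenAI Python client and an API key in the environment; the model name and question are placeholders, not recommendations.

from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

response = client.chat.completions.create(
    model="gpt-4",   # illustrative; any available chat model name could be substituted
    temperature=0,   # 0 yields the most consistent answers; higher values allow more varied ones
    messages=[{"role": "user", "content": "What is the best food?"}],
)
print(response.choices[0].message.content)

With temperature set to 0, repeated calls with the same question should return essentially the same answer, which is the kind of consistency argued for above in clinical and educational settings.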

The manner in which a question is asked of an LLM can impact its performance (“prompt engineering”). Many research studies cite LLMs’ abilities to perform well on board certification exams. The methods sections of many of these papers indicate that the researchers copied and pasted board exam questions into a model and asked it to select the appropriate response. The LLMs often performed well on these tasks; however, this is not how one optimizes LLM performance. OpenAI, in collaboration with Andrew Ng’s DeepLearning.AI, has released free courses in prompt engineering that clearly indicate how to ask questions of a model to optimize its success. For best results, one should first specify a role for the LLM, which can come in the form of “You are a … .” This provides context for the LLM. Next, the question must be clear and precise, and one should provide clear steps for the model to follow to give it time to “think.”

For example, after specifying a role, you could instruct ChatGPT: “Please read the provided question and its multiple-choice options. Then, refer back to your relevant psychiatric knowledge. Next, select which response most accurately answers the question.” This is a much more reliable way to ensure optimal performance than simply pasting the entire question into the model on its own. Furthermore, reliability can be enhanced by providing examples of correct solutions to prompts. When an LLM underperforms on a task, one should be curious as to whether these best practices were followed; if they were not, the model’s underperformance may reflect user error rather than an inherent flaw.
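
A rough Python sketch of what such a structured prompt might look like when sent programmatically appears below. It again assumes the OpenAI Python client; the role, instructions, and bracketed question text are illustrative placeholders, not a validated protocol or a real exam item.

from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

# The system message assigns a role; the user message spells out explicit steps for the model.
messages = [
    {
        "role": "system",
        "content": "You are a board-certified psychiatrist answering multiple-choice exam questions.",
    },
    {
        "role": "user",
        "content": (
            "Please read the provided question and its multiple-choice options. "
            "Then, refer back to your relevant psychiatric knowledge. "
            "Next, select which response most accurately answers the question, "
            "explaining your reasoning before giving the final letter.\n\n"
            "Question: <exam question text>\n"
            "Options: A) <...>  B) <...>  C) <...>  D) <...>"
        ),
    },
]

response = client.chat.completions.create(model="gpt-4", temperature=0, messages=messages)
print(response.choices[0].message.content)

Worked examples of correctly answered questions could be appended to the user message in the same way to further steer the model, as described above.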

Lastly, there are concerns about LLMs providing false information as though it were accurate, a behavior referred to as “hallucination.” Hallucinations can occur because the model is trained to predict the next word in a sentence, not to produce factual information. Various strategies have been used to decrease the likelihood of hallucinations. For a time, premium users of ChatGPT could ask a question and ChatGPT would search the Internet to answer it, providing a coherent response with citations to existing websites to support its assertions. The method was not infallible, but, as reported by OpenAI, it did lend ChatGPT’s responses a validity that had been lacking. This beta feature was later removed, but it represented a concrete example of how companies are attempting to increase the reliability of their models.
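
OpenAI has not published the details of that feature, but the general strategy of grounding a response in retrieved sources and asking for citations can be sketched as follows. The function, model name, and prompt wording below are assumptions for illustration, not OpenAI’s implementation.

from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set


def answer_with_citations(question: str, sources: dict[str, str]) -> str:
    """Ask the model to answer only from supplied passages and to cite their URLs.

    `sources` maps a URL to a passage retrieved by a separate search step;
    constraining the model to that material is one way to make its claims checkable.
    """
    source_block = "\n\n".join(f"[{url}]\n{text}" for url, text in sources.items())
    messages = [
        {"role": "system", "content": "Answer using only the provided sources and cite the URLs you relied on."},
        {"role": "user", "content": f"Sources:\n{source_block}\n\nQuestion: {question}"},
    ]
    response = client.chat.completions.create(model="gpt-4", temperature=0, messages=messages)
    return response.choices[0].message.content

Because every claim is tied back to a supplied source, a reader can verify the answer against the cited material rather than taking the model’s word for it.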

I am providing all of this context so that discussions about the validity and reliability of LLMs in psychiatry can be more nuanced. Discussions should not be reduced to claims such as “We cannot use ChatGPT” or “ChatGPT will fix everything.” Rather, psychiatric opinion writers and investigators should formulate their thoughts in more specific language, just as we do in all other aspects of scientific and clinical inquiry, and highlight what is working, what is not, and what is of concern. Only then can we safely and responsibly discuss how AI can best fit into our mental health care system. ■

Declan Grabb, M.D.

Declan Grabb, M.D., is a fourth-year psychiatry resident at Northwestern Memorial Hospital in Chicago. He is interested in the intersection of technology and mental health and has published and presented on artificial intelligence (AI) in the field. He was the inaugural recipient of APA’s Paul O’Leary Award for Innovation in Psychiatry for his work on AI-related tools in mental health.