Proffering what appears to be a box of treats to a partially clothed patient seated on the examination table, the white-coated doctor invites the patient to choose one, saying, “Diagnosis cookie?”
The cartoon, appearing in the New Yorker in 2009, speaks in a sardonic way to a suspicion that the diagnoses physicians give their patients might be as arbitrary as the forecasts in a randomly selected fortune cookie. In fact the reliability of diagnosis—the certainty with which it can be predicted that different clinicians will apply the same diagnosis to the same patient—is rarely if ever absolute in any field of medicine, and in some areas it is far lower than is assumed by the health care–using public.
Yet diagnostic reliability is a subject of enormous importance. “If two clinicians give a patient two different diagnoses, you know at least one of them has to be wrong,” Helena Kraemer, Ph.D., a member of the DSM-5 Task Force with special expertise in study design and measurement issues, told Psychiatric News. “And the clinician who was wrong may have given the patient a treatment that was unnecessary for a condition the patient didn’t have.
“From a research perspective, the ability to detect good treatments for a condition and to find the risk factors for a disorder all depend on the diagnosis,” Kraemer said. “And policymakers should be concerned if patients are being inappropriately treated for conditions they don’t have.”
Kraemer is coauthor of an article in the January American Journal of Psychiatry (AJP) describing the goals and methods of reliability testing in the DSM-5 field trials. The other coauthors are DSM-5 Task Force Chair David Kupfer, M.D.; William Narrow, M.D., M.P.H., APA associate director of research; Darrel Regier, M.D., M.P.H., APA director of research; and Diana Clarke, Ph.D., research statistician with the APA Division of Research.
Titled “DSM-5: How Reliable Is Reliable Enough?,” the article does not present results from the field trials, which are still being analyzed; rather, the aims of Kraemer and colleagues were these:
The authors emphasized that the attention the DSM-5 Task Force is devoting to reliability (a field that the Lancet once described as “a backwater of medical research”) is unusual in medicine.
“Many medical diagnoses go into common use without any evaluation, and many believe that the rates of reliability and validity of diagnoses in other areas of medicine are much higher than they are,” Kraemer and colleagues wrote. “Indeed, psychiatry is the exception in that we have paid considerable attention to the reliability of our diagnoses.”
Like everything associated with DSM-5, Kraemer’s work has garnered considerable attention. It has also attracted critics who say that the AJP article hints at rates of reliability for DSM-5 diagnoses that should be regarded as unacceptably low.
But in an interview with Psychiatric News, Kraemer (who was also a consultant to the DSM-IV Task Force) said that one purpose of the article was, in fact, to give an honest accounting of the process of determining diagnostic reliability for the proposed diagnoses. More refined methods of testing reliability have been developed for DSM-5, she said, and they are bound to give a more accurate picture of real-world clinical reliability than did the field trials for previous editions. (For a description of innovations in reliability testing in DSM-5 trials, see DSM Developers Employ Strategies to Strengthen Diagnostic Reliability.)
“We have learned a lot about medical test evaluations in the two decades since DSM-IV was published, and there is an acknowledgement that testing methods used in the past likely produced inflated results,” Kraemer said. “So we want people to have realistic expectations.”
Some criticism—and misunderstanding of fact—has also surrounded the subject of comparative prevalences of DSM-IV diagnoses and the proposed diagnoses for DSM-5. One critic, in communications with APA leadership, lamented that prevalence rates for the proposed diagnoses haven’t been estimated in the field trials and predicted that the new criteria would wildly increase prevalence and “false-positive” diagnoses.
In fact, prevalence rates for proposed diagnoses have been estimated where they have a corresponding DSM-IV diagnosis. And Kraemer reported that the median difference in prevalence for 49 diagnoses that could be evaluated across 11 field trial sites indicated that prevalence was actually lower using DSM-5 criteria than using DSM-IV criteria. “If there were big differences in prevalence, I would be worried too,” Kraemer said. “But the fact is that there aren’t.”
Telling as that may be to critics who have feared that proposed criteria will artificially inflate prevalence, Kraemer emphasized that a prevalence rate is, in fact, irrelevant to the reliability of a diagnosis: improvements in the sensitivity of criteria will result in higher prevalence rates, while improvements in specificity will result in lower prevalence rates.
So what can psychiatrists and other mental health professionals reasonably expect about the reliability of proposed diagnostic criteria?
Kraemer and colleagues said that realistic goals for many proposed diagnoses are scores reflecting moderate reliability. For diagnoses that are rare, lower scores may be acceptable (for a description of how reliability scores are estimated and interpreted, see below).
“A rare diagnosis represents a weak signal,” Kraemer told Psychiatric News. “So it’s very easy for it to be overwhelmed by even a moderate amount of noise. For rare diagnoses, you really need biomarkers or some form of objective data that we don’t have. So it may be that in some cases the proposed criteria may be the best we can do.”
But for most diagnoses, Kraemer said the expectations for reliability are in line with what is typically found in evaluations of general medical diagnoses, despite the fact that psychiatric diagnosis—in marked contrast to medical diagnosis—continues to be based on inferences derived from patient self-reports or observations of patient behavior.
“The greater the subjective component, the greater the unreliability, and the more objective the data, the better the reliability,” Kraemer said. “If you have an X-ray, for instance, reliability should be very high.
“I think psychiatric diagnosis is a harder challenge than for a lot of medical diagnoses,” she said. “And most people appreciate that the successive editions of DSM are efforts to incrementally improve diagnoses. One of the goals of the new criteria is to push the science forward—more sensitive and specific diagnoses will improve research on disorders, which will in turn produce even better diagnoses in the next generation.”
The statistical measure of reliability, known as a kappa score, expresses the probability of agreement between raters, correcting for the probability that their agreement is random chance. The resulting score is a proportion: the numerator expressing actual observed agreement minus the probability of chance agreement; the denominator expressing perfect agreement minus the probability of chance agreement.
A kappa score of 1, in which the numerator and denominator are equal, represents perfect reliability: the raters can always, without fail, be expected to agree. (A kappa score of 0, at the other extreme, means that the first diagnosis does not predict the second at all.)
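The calculation described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the simplest case, Cohen's kappa for two raters who each see the same patients; the article does not say which variant of kappa the field trials use, and the function name and the ten hypothetical ratings below are invented for the example.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same patients.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty lists of labels")
    n = len(rater_a)

    # Observed agreement: share of patients given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability that the raters would match if each
    # assigned labels at random in proportion to how often he or she used them.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical ratings of ten patients by two clinicians (not field trial data).
first = ["present", "present", "absent", "absent", "present",
         "absent", "absent", "absent", "present", "absent"]
second = ["present", "absent", "absent", "absent", "present",
          "absent", "present", "absent", "present", "absent"]

print(round(cohen_kappa(first, second), 2))
```

In this made-up example the two clinicians agree on 8 of 10 patients, but because chance alone would produce agreement 52 percent of the time, the kappa score works out to about .58 rather than .80.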
Perfect reliability is found in virtually no scenario that involves human judgment. Short of perfect reliability, kappa scores are invariably rendered as fractions of 1 in decimal form. Interpretation of kappa scores (what constitutes a good score or a bad score) is not an exact science, in part because the research literature is comparatively small and in part because reliability standards should change depending on the context. For example, very rare disorders tend to have lower reliability than more common diagnoses.
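One way to see why rare disorders tend to score lower is a small worked calculation. The sketch below rests on assumptions that are not from the article: two clinicians who err independently of each other, each with the same hypothetical sensitivity and specificity of 90 percent, rating a disorder at several made-up prevalence levels. The function name is likewise invented for illustration.

```python
def expected_kappa(prevalence, sensitivity, specificity):
    """Expected kappa for two raters with identical accuracy whose errors
    are independent of each other (an assumed, simplified model)."""
    # Probability that any one rater calls a patient positive.
    p_positive = prevalence * sensitivity + (1 - prevalence) * (1 - specificity)

    # Probability that both raters say "positive" or both say "negative".
    both_positive = (prevalence * sensitivity ** 2
                     + (1 - prevalence) * (1 - specificity) ** 2)
    both_negative = (prevalence * (1 - sensitivity) ** 2
                     + (1 - prevalence) * specificity ** 2)
    p_observed = both_positive + both_negative

    # Agreement expected by chance, given how often each rater says "positive".
    p_chance = p_positive ** 2 + (1 - p_positive) ** 2
    return (p_observed - p_chance) / (1 - p_chance)


# Same hypothetical rater accuracy at every prevalence level.
for prevalence in (0.50, 0.20, 0.05, 0.02):
    kappa = expected_kappa(prevalence, sensitivity=0.9, specificity=0.9)
    print(f"prevalence {prevalence:.2f}: kappa {kappa:.2f}")
```

Under these assumptions the raters are equally accurate throughout, yet the kappa score falls from about .64 when half of patients have the disorder to about .12 when only 2 percent do.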
But a 1977 article (“The Measurement of Observer Agreement for Categorical Data”) in the journal Biometrics by J.R. Landis, a professor of biostatistics at the University of Pennsylvania, and G.G. Koch established the following standard that is still generally accepted:
Scores of .01–.20 are deemed to have weak reliability.
.21–.40 represents fair reliability.
.41–.60 equates to moderate reliability.
.61–.80 represents substantial reliability.
.81 and above is regarded as almost perfect reliability.
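The bands listed above can be expressed as a simple lookup. The sketch below is illustrative only: the function name is hypothetical, scores are rounded to two decimals to match how the bands are written, and the label returned for scores below .01 is an assumption, since the list does not cover them.

```python
def interpret_kappa(kappa):
    """Map a kappa score to the reliability bands listed in this article."""
    if kappa > 1:
        raise ValueError("kappa cannot exceed 1")
    score = round(kappa, 2)  # the bands are stated to two decimal places
    if score >= 0.81:
        return "almost perfect"
    if score >= 0.61:
        return "substantial"
    if score >= 0.41:
        return "moderate"
    if score >= 0.21:
        return "fair"
    if score >= 0.01:
        return "weak"
    # Assumed label: the listed bands start at .01.
    return "below the listed bands"


# A few kappa values at different points on the scale.
for value in (0.07, 0.26, 0.36, 0.60, 0.85):
    print(value, interpret_kappa(value))
```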
Helena Kraemer, Ph.D., and colleagues, in the January American Journal of Psychiatry, said that reliability scores for general medical diagnoses have sometimes been in the .6 to .8 range but have more commonly reflected moderate reliability of .4 to .6.
Moreover, examples in the medical literature of the kind of test-retest reliability used in DSM-5 field trials (see DSM Developers Employ Strategies to Strengthen Diagnostic Reliability) are rare, they wrote. The diagnosis of anemia based on conjunctival inspection using the test-retest design was associated with moderate kappa values (.36 and .60), and the test-retest reliability of various findings of bimanual pelvic examinations was quite low, with kappa values from .07 to .26, according to Kraemer and colleagues.