A 2004 article in this Journal1 argues correctly that clinical judgment will always play an important role in the application of diagnostic test results to individual patients. The basic point made there is that because post-test predictive probabilities (obtained via Bayes' theorem) rely on measurements that can only be known and applied relatively imprecisely, "clinical judgment" will have to be used to weigh up various sources of imprecision and ambiguity to arrive at point or interval assessments of final health risks for the patient.

But Wong's concluding inference, that "diagnostic tests are indeed testing of clinicians", is incomplete in two ways. First, by refusing to open and examine the black box of "clinical judgment", he understates the real risk assessment and risk communication problems clinicians and their patients face. Second, he ignores the important practical clinical question: what can be done to help clinicians and their patients use "clinical judgment" to resolve the inference and decision-making issues created by these ambiguous "testing" tests?

There is a serious problem for clinicians in calculating and interpreting the results of diagnostic test information, even when diagnostic information is precise. Research2,5 on the statistical innumeracy of physicians shows that an astonishing 70-75% of medical students, house physicians and practicing physicians cannot correctly calculate the inverse probabilities required to generate post-test predictive probabilities.
These statistical innumeracy problems appear to get worse for clinicians the longer the elapsed time since graduation from medical school.

As Gigerenzer5 puts it, we live in an age of statistical innumeracy, with both clinicians and their patients facing four endemic problems: the illusion of certainty; an ignorance of quantitative risks; an inability to communicate risk information effectively; and clouded thinking about how to use quantitative diagnostic information to draw informed conclusions. The inference problems facing statistically challenged clinicians and their patients will become much worse as we enter the era of "personalised medicine".

The fundamental idea behind personalised medicine involves widespread use of diagnostic tests based on newly discovered genes and proteins to better predict individual patients' clinical responses to specific drug therapies. The directors of the National Institutes of Health (NIH) and the Food and Drug Administration (FDA) in the US6 are preparing for an explosion in the growth of these kinds of new diagnostic tests. But we also live in an era of "informed consent", where patients, many of whom are well educated, want to be directly involved in the decision-making processes and risk assessments that inform and guide their health care and treatment.

The combination of (1) a rapid increase in the supply of new diagnostic tests, (2) increased demand for these new diagnostics, and (3) intelligent but statistically innumerate clinicians and patients (especially when an understanding and calculation of inverse conditional probabilities via Bayes' theorem is required) is a recipe for trouble.

We have developed a software program that may be of use to clinicians and their patients in solving this problem. The program has an interactive visual display specifically designed to facilitate both the understanding and communication of risk information when the underlying information is imprecise.
That is, it allows for quick robustness checks ("what-if" reasoning) with immediate visual feedback on relevant uncertainties.

Suppose that a patient in a clinician-patient consultation asks: "I am different - older, younger, sicker, healthier, etc - what difference would that make?" The clinician - alone or in consultation with a patient - can easily change relevant input values within reasonable bounds and use the immediate visual feedback to communicate and explore the implications of the patient's question for post-test positive or negative predictive probabilities.

The display is user-friendly, involving only sliders, menu buttons and input fields. No mathematical sophistication is required or presupposed, so the program can be used by a very wide range of clinicians and patients, from the statistically sophisticated through to the statistically innumerate. While simple and intuitive, the program interface also gives users the ability to verify and analyse different scenarios. This allows patients and clinicians to explore any uncertainty they have about the accuracy and reliability of the diagnostic test results and also about the frequency of medical conditions in the population.

The interface instantaneously generates updated graphs and frequency numbers, permitting patients and clinicians to discuss and communicate the risks of treating or not treating any potential medical conditions that patients might have (see Gigerenzer4,5). And last but not least, the software tool itself is free and based on freely available, open source software that can be easily installed and run on all of the popular PC operating systems (Windows, Mac, Linux) used by clinicians and patients.

Figure 1 below is an annotated screenshot of the interface of the program. Once installed, this interface is all that users see.
Of course a static, black and white picture cannot do justice to a fully interactive environment with color coding, so we have developed a short tutorial video7 which shows the interface in action. But the general idea about how the interface facilitates diagnostic inferences about health states based on the outcome of both precise and imprecise diagnostic test results is captured in the static screenshots below.

Figure 1 has three basic interconnected parts. The left panel consists of user-controlled diagnostic input variables. The right panel contains two outputs, one tabular, the other graphical, both derived from the input sliders in the left panel. A brief explanation of each of the panels in Figure 1 is followed by examples in Figure 2 and Figure 3 of how clinicians and their patients can use the software to vary the input parameters and observe the effect on diagnostic inferences in order to apply imprecise estimates to individual patients.

Figure 1. An annotated picture of the basic interface

The table in the top right of Figure 1, essentially a logician's Truth Table for two propositions augmented by natural frequency information, sets out the basic logical possibilities for combinations of health states and test results. The proposition "the patient has the specific disease" is represented by a binary variable D on the left side of the table, a variable that has two possible values: D=1 if the proposition is true and D=0 if that proposition is false. The proposition "the diagnostic test on this patient is positive for the disease" is represented by a binary variable T on the left side of the table, a variable that can also take on two possible values: T=1 if the proposition is true and the test is positive for the disease, T=0 if the proposition is false and the test is negative (non-positive) for the disease.
The four columns of 1s and 0s in the table identify the four logically possible combinations for disease states {present, absent} and test outcomes {positive, negative}, each labeled with its conventional epidemiological name. The frequency information in the bottom row of the Truth Table is a way of expressing uncertainty about the logical possibilities for (D,T) in the form of hypothetical counts of cases in a hypothetical population. The initial default values in the table of Figure 1 assume a population of 100 cases, 16 of which are true positives, four of which are false negatives, 24 of which are false positives, and 56 of which are true negatives. Empirical research2,5 demonstrates improved accuracy of reasoning, less clouded thinking, and better understanding and communication of risk information for many (but not all) people when uncertainties are expressed as whole number frequencies in a natural sampling format, as compared to using the language of conditional, marginal, and joint probabilities expressed as decimals or percentages. Of course the table can also be used with the language of probability to express uncertainties, since scaling whole number frequencies (the column counts) by dividing by 100 or its multiples is a transparent operation for clinicians or patients familiar with the language of probability.

The initial values for the frequency numbers are controlled by the sliders and menu buttons on the left hand side of Figure 1. These sliders represent numerical inputs for conventional ways of expressing uncertainty about disease states and diagnostic test outcomes: test sensitivity, test specificity, and the base rate or disease prevalence.8 For example, test sensitivity, the proportion of cases with a positive test outcome (T=1) among those cases with the disease (D=1), is 16 out of 16+4=20 in Figure 1, or 80%.
Similarly, test specificity is the proportion of cases with a negative test outcome (T=0) among those cases without the disease (D=0), 56 out of 24+56=80 in Figure 1, or 70%. The base rate or prevalence of the disease in Figure 1 is the proportion of cases in the population who have the disease (D=1), 16+4=20 out of 100 cases, or 20%. There are actually two rows of frequencies, corresponding to the two sets of sliders in the left panel. Having two sets of sliders and associated frequency representations comes in handy when exploring the implications of imprecise inputs into the inference process.

The graphical display in the bottom right panel of Figure 1 shows the positive (T=1) post-test predictive probabilities (solid red line) and negative (T=0) post-test predictive probabilities (dashed blue line) for having the disease (D=1) for every possible prevalence rate from 0 to 1 along the x-axis. The curves are calculated from Bayes' theorem based on the values of the sensitivity and specificity of the test, but the user does not see the calculation. The vertical line in the figure through 0.2 or 20%, the base rate or prevalence rate of the disease set by the base rate slider, intersects the two post-test predictive probability curves. The upper intersection point identifies the numerical value (0.4) of the predictive probability that the disease is present given a positive diagnostic outcome, denoted P(D=1|T=1). The notation P(D=1|T=1) is just a shorthand way of expressing uncertainty about the chances of having the disease, D=1, when a positive test result, T=1, is observed. Reading the symbols from left to right: P is the probability, representing the degree of uncertainty; D=1 is what we are uncertain about (we want to know if we have the disease); and T=1 captures the information we have (we know the test has come back positive).
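For readers who do want to see the hidden calculation, the two curve values at the 20% base rate can be written out from Bayes' theorem. This is a worked sketch only: the symbols Se (sensitivity), Sp (specificity) and p (base rate) are shorthand introduced here for illustration and are not notation taken from the interface.

```latex
\begin{align*}
P(D=1 \mid T=1) &= \frac{Se \cdot p}{Se \cdot p + (1 - Sp)(1 - p)}
  = \frac{0.8 \times 0.2}{0.8 \times 0.2 + 0.3 \times 0.8}
  = \frac{0.16}{0.40} = 0.4, \\[4pt]
P(D=1 \mid T=0) &= \frac{(1 - Se)\, p}{(1 - Se)\, p + Sp\,(1 - p)}
  = \frac{0.2 \times 0.2}{0.2 \times 0.2 + 0.7 \times 0.8}
  = \frac{0.04}{0.60} \approx 0.07.
\end{align*}
```

Multiplying each numerator and denominator by the population of 100 turns these fractions into ratios of the whole-number counts in the frequency table.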
The positive predictive probability P(D=1|T=1)=0.4 can be calculated by taking the count of 16 in column 1, where D=1 and T=1, and dividing by 16+24=40, the sum of the counts associated with T=1 in columns 1 and 3. The lower intersection point identifies the numerical value (0.07) of the predictive probability that the disease is present given a negative diagnostic outcome, denoted P(D=1|T=0). This notation P(D=1|T=0) is just a shorthand way of expressing uncertainty about the chances of having the disease, D=1, when a negative test result, T=0, is observed, that is, the chances that a negative result is a false negative. Analogous to the previous case, P represents the degree of uncertainty, D=1 is what we are uncertain about, and T=0 captures the information we have. The difference is that here the test has come back negative. The negative predictive probability P(D=1|T=0)=0.07 can be calculated by taking the count of four in column 2, where D=1 and T=0, and dividing by 4+56=60, the sum of the counts associated with T=0 in columns 2 and 4.

The specific numeric values shown in Figure 1 are derived from the slider settings, which simultaneously determine the counts in the columns of the natural frequency table. Figure 2 below shows the impact on post-test predictive probabilities of decreasing the base rate from 20% to 5% when test sensitivity and specificity remain unchanged. As the base rate slider is manipulated, the position of the vertical line changes (with a thinner, more opaque line keeping the original base rate at a benchmark level for comparison purposes). In the example of Figure 2 both post-test predictive probabilities decrease with the decrease in the base rate, and markedly so, from 40% to 12%, for the positive predictive probability. Not only are the levels of the new post-test probabilities instantly recalculated on the graph as the base rate slider is manipulated, but the changing difference between the two post-test probabilities - the vertical gap between the two curves - is transparent.
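The count-based arithmetic described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' software: the names `FrequencyTable`, `frequency_table`, `prob_disease_given_positive` and `prob_disease_given_negative` are hypothetical, and the counts are left unrounded, whereas the interface displays whole numbers.

```python
from dataclasses import dataclass


@dataclass
class FrequencyTable:
    """Expected counts for the four (D, T) combinations (hypothetical helper)."""
    true_pos: float   # D=1, T=1 (column 1)
    false_neg: float  # D=1, T=0 (column 2)
    false_pos: float  # D=0, T=1 (column 3)
    true_neg: float   # D=0, T=0 (column 4)


def frequency_table(sensitivity, specificity, base_rate, population=100):
    """Turn the three slider-style inputs into natural-frequency counts."""
    diseased = population * base_rate
    healthy = population - diseased
    return FrequencyTable(
        true_pos=diseased * sensitivity,
        false_neg=diseased * (1 - sensitivity),
        false_pos=healthy * (1 - specificity),
        true_neg=healthy * specificity,
    )


def prob_disease_given_positive(table):
    """P(D=1|T=1): true positives as a share of all positive tests."""
    return table.true_pos / (table.true_pos + table.false_pos)


def prob_disease_given_negative(table):
    """P(D=1|T=0): false negatives as a share of all negative tests."""
    return table.false_neg / (table.false_neg + table.true_neg)


if __name__ == "__main__":
    # Sensitivity 80% and specificity 70%, at base rates of 20% and 5%.
    for base_rate in (0.20, 0.05):
        table = frequency_table(0.80, 0.70, base_rate)
        print(
            f"base rate {base_rate:.0%}: "
            f"P(D=1|T=1) = {prob_disease_given_positive(table):.2f}, "
            f"P(D=1|T=0) = {prob_disease_given_negative(table):.2f}"
        )
```

With a base rate of 20% the counts come out at 16, 4, 24 and 56, matching the default table, and lowering the base rate to 5% reproduces the drop in the positive predictive probability from 40% to about 12%.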
The vertical gap at any specification of the base rate is a measure of the amount of information to be gained from actually doing a diagnostic test. Since diagnostic tests are usually costly in time and resources, if not also in the downstream implications of reacting to false positives (anxiety and further unnecessary interventions) and not reacting to false negatives (misplaced assurance and foregone helpful interventions), it is helpful to have an idea of "how much" or "how little" information one can expect to learn from a test before any test is actually performed. The vertical gap between the solid and the dashed curves provides one such measure. When the gap between the solid and dashed lines at relevant pre-test prevalence rates is small, it may not be worthwhile undergoing costly and risky testing in the first place, an important message to get across to patients eager for subsidised diagnostic tests. There are various combinations of sensitivity, specificity, and pre-test base rate that can lead to small differences between the post-test predictive probabilities, all of which can easily be explored with the interface. Figure 2 itself illustrates a very general principle: the gap between post-test predictive probabilities - and therefore the additional information provided by a diagnostic test - will be small whenever the base rate or prevalence of the condition D=1 is either very small or very large. The truth table in Figure 2 helps explain why. For example, at a low prevalence rate or pre-test probability of 5% (the patient is unlikely to have the disease), many of the positive test results (29 out of 4+29=33 from columns 1 and 3 of the table) will be false positives.

Figure 2. Exploring the effects of changing the base rate keeping test sensitivity and specificity unchanged

The sensitivity and specificity values used for the calculations underlying Figures 1 and 2 are low, 80% and 70% respectively.
What if the test were more (or less) sensitive or more (or less) specific, or more or less on both counts? The software tool can be used to investigate all of these combinations quickly and correctly. Figure 3 shows the impact of one pair of those changes. Test sensitivity and test specificity are now each close to 95%. Notice that the positive predictive probability (what to infer from a positive test result) now increases above 80% for all but the lowest base rates for diseases, and approaches probabilities of 95% when the pre-test probabilities are greater than 50/50 or 50%. Similarly the negative predictive probability (what to infer from a negative test result) decreases to less than 20% for all but the highest prevalence rates, and to less than 5% for all pre-test probabilities lower than 50/50 or 50%. Overall the gap between the two curves has increased dramatically (in comparison to the lower sensitivity and specificity in the benchmark case). This reveals that the diagnostic test has more discriminatory power when test sensitivity and specificity are improved, and therefore testing may be more worthwhile.

Figure 3. Exploring the effects of changing the test sensitivity and specificity keeping the base rate unchanged

Of course other combinations of sensitivity and specificity can easily be explored. For example, there is often a tradeoff between sensitivity and specificity when disagreements arise about setting the threshold levels for classifying test results.5,9 This tradeoff can easily be explored - and explained to interested patients - using sliders in the software tool interface.
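The comparison between the benchmark test and the more accurate test can also be sketched numerically. The Python below is a minimal illustration under stated assumptions, not the program described in this article: the function names `post_test_probabilities` and `gap` are hypothetical, the second test is taken to have sensitivity and specificity of exactly 95%, and base rates are assumed to lie strictly between 0 and 1.

```python
def post_test_probabilities(sensitivity, specificity, base_rate):
    """Return (P(D=1|T=1), P(D=1|T=0)) from Bayes' theorem.

    Assumes 0 < base_rate < 1 and an imperfect test, so that neither
    denominator below is zero.
    """
    # Overall chance of a positive test: true positives plus false positives.
    p_positive = sensitivity * base_rate + (1 - specificity) * (1 - base_rate)
    # Overall chance of a negative test: false negatives plus true negatives.
    p_negative = (1 - sensitivity) * base_rate + specificity * (1 - base_rate)
    given_positive = sensitivity * base_rate / p_positive
    given_negative = (1 - sensitivity) * base_rate / p_negative
    return given_positive, given_negative


def gap(sensitivity, specificity, base_rate):
    """Vertical distance between the two post-test probability curves."""
    given_positive, given_negative = post_test_probabilities(
        sensitivity, specificity, base_rate
    )
    return given_positive - given_negative


if __name__ == "__main__":
    # Benchmark test (80%, 70%) versus an assumed more accurate test (95%, 95%).
    tests = {"benchmark": (0.80, 0.70), "improved": (0.95, 0.95)}
    for label, (sens, spec) in tests.items():
        for base_rate in (0.01, 0.05, 0.20, 0.50, 0.80, 0.99):
            print(f"{label:9s} base rate {base_rate:4.0%}: "
                  f"gap = {gap(sens, spec, base_rate):.2f}")
```

Printing the gap at several base rates shows the two patterns discussed above: for either test the gap shrinks towards the extremes of prevalence, and at any given base rate the more accurate test yields the wider gap.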
Once a clinician masters the interface (a very straightforward task, as illustrated in the web tutorials), he or she can use it to communicate in an easily understandable way with a patient about how relatively imprecise or ambiguous information about that patient and the testing procedure might affect their post-test chances of having a disease.

The format that we present can be seen as a logical extension of Gerd Gigerenzer's insight as to why natural frequency formats help people make better inferences from diagnostic information in uncertain situations: "the representation does part of the reasoning" (p48).5 We have augmented the standard natural frequency representation of inference task problems in three ways that are useful for clinicians and patients. First, our interface works in a way that is compatible with standard clinical ways of representing and communicating uncertainties about health risks and diagnostic tests (sensitivity, specificity, and base rate). Second, all calculations, and the many recalculations that are necessary when exploring the implications (for post-test health risks) of imprecision and ambiguities in underlying information sources, are done electronically and correctly. Third, the interactive graphical interface provides visually clear and immediate, dynamically updated representations of both inputs to and outputs of the inference task, with an ability to check (via the natural frequency table) relevant calculations and gain understanding of how diagnostic information and health state risks are related. As Edward Tufte says: "...clarity and excellence in thinking is very much like the clarity and excellence in the display of data. When principles of design replicate the principles of thought, the act of arranging information becomes an act of insight".10

Both clinicians and patients experience difficulty with the statistical reasoning required to make inferences about health states on the basis of information derived from diagnostic tests. This problem will grow in importance as we move into the era of personalised medicine, where an increasing supply of imprecise diagnostic tests meets an increasing demand to use such tests on the part of intelligent but statistically innumerate clinicians and patients. We describe a user-friendly, interactive, graphical interface for calculating, visualising, and communicating accurate inferences about uncertain health states when diagnostic information (test sensitivity and specificity, and health state prevalence) is imprecise and ambiguous in its application to a specific patient. The software is free, open-source, and runs on all popular PC operating systems (Windows, Mac, Linux).

1. Wong M-L. Rheumatologic diagnostic serology: tests which test clinicians. N Z Med J. 2004;117(1203). http://journal.nzma.org.nz/journal/117-1203/1095/content.pdf
2. Barbey AK, Sloman SA. Base-rate respect: from ecological rationality to dual processes. Behav Brain Sci. 2007;30:241-54.
3. Berwick D, Fineberg H, Weinstein M. When doctors meet numbers. Am J Med. 1981;71:991-998.
4. Gigerenzer G, Gaissmaier W, Kurz-Milcke E, et al. Helping doctors and patients make sense of health statistics. Psychol Sci Public Interest. 2008;8:53-98.
5. Gigerenzer G. Calculated Risks: How to Know When Numbers Deceive You. New York: Simon & Schuster; 2003.
6. Hamburg M, Collins F. The path to personalized medicine. N Engl J Med. 2010;363:301-4.
7. Fountain J. Discrete medical Tests: communicating and visualizing diagnostic information. 2010. http://uctv.canterbury.ac.nz/post/4/1130
8. Bernardo J, Smith A. Bayesian Theory. New York: Wiley; 1994, p44, Figure 2.
9. Iles S, Beckert L, Than M, Town I. Making a diagnosis of pulmonary embolism - new methods and clinical issues. N Z Med J. 2003;116(1177). http://journal.nzma.org.nz/journal/116-1177/499/content.pdf
10. Tufte E. Visual Explanations: Images and Quantities, Evidence and Narrative. Connecticut: Graphics Press; 1997, p11.


A 2004 article in this Journal1 argues correctly that clinical judgment will always play an important role in the application of diagnostic test results to individual patients. The basic point made there is that because post-test predictive probabilities (obtained via Bayes theorem) rely on measurements that can only be known and applied relatively imprecisely, cclinical judgmentd will have to be used to weigh up various sources of imprecision and ambiguity to arrive at point or interval assessments of final health risks for the patient.But Wongs concluding inference, that cdiagnostic tests are indeed testing of cliniciansd, is incomplete in two ways. First, by refusing to open and examine the black box of cclinical judgmentd, he understates the real risk assessment and risk communication problems clinicians and their patients face. Second, he ignores the important practical clinical question: what can be done to help clinicians and their patients use cclinical judgmentd to resolve the inference and decision making issues created by these ambiguous ctestingd tests.There is a serious problem for clinicians in calculating and interpreting the results of diagnostic test information, even when diagnostic information is precise. Research2,5 on the statistical innumeracy of physicians shows that an astonishing 70-75% of medical students, house physicians and practicing physicians cannot correctly calculate the inverse probabilities required to generate post-test predictive probabilities. 
These statistical innumeracy problems appear to get worse for clinicians the longer the elapsed time since graduation from medical school.As Gigerenzer5 puts it, we live in an age of statistical innumeracy, with both clinicians and their patients facing four endemic problems: the illusion of certainty; an ignorance of quantitative risks; an inability to communicate risk information effectively; and clouded thinking about how to use quantitative diagnostic information to draw informed conclusions. The inference problems facing statistically challenged clinicians and their patients will become much worse as we enter the era of cpersonalised medicined.The fundamental idea behind personalised medicine involves widespread use of diagnostic tests based on newly discovered genes and proteins to better predict individual patients clinical responses to specific drug therapies. The directors of the National Institute of Health (NIH) and The Food and Drug Administration (FDA) in the US6 are preparing for an explosion in the growth of these kinds of new diagnostic tests. But we also live in an era of cinformed consentd, where patients, many of whom are well educated, want to be directly involved in the decision making processes and risk assessments that inform and guide their health care and treatment.The combination of (1) a rapid increase in the supply of new diagnostic tests, (2) increased demand for these new diagnostics, and (3) intelligent but statistically innumerate clinicians and patients (especially when an understanding and calculation of inverse conditional probabilities (Bayes theorem) is required) is a recipe for trouble.We have developed a software program that may be of use to clinicians and their patients in solving this problem. The program has an interactive visual display specifically designed to facilitate both the understanding and communication of risk information when the underlying information is imprecise. 
That is, it allows for quick robustness checks (cwhat-ifd reasoning) with immediate visual feedback on relevant uncertainties.Suppose that a patient in a clinician patient consultation asks: cI am different - older, younger, sicker, healthier, etc - what difference would that make?d The clinician - alone or in consultation with a patient - can easily change relevant input values within reasonable bounds and use the immediate visual feedback to communicate and explore the implications of the patients question for post-test positive or negative predictive probabilities.The display is user-friendly, involving only sliders, menu buttons and input fields. No mathematical sophistication is required or presupposed, so the program can be used by a very wide range of clinicians and patients, from the statistically sophisticated through to the statistically innumerate While simple and intuitive, the program interface also gives users the ability to verify and analyse different scenarios. This allows patients and clinicians to explore any uncertainty they have about the accuracy and reliability of the diagnostic test results and also about the frequency of medical conditions in the population.The interface instantaneously generates updated graphs and frequency numbers permitting patients and clinicians to discuss and communicate the risks of treating or not treating any potential medical conditions that patients might have (see Gigerenzer4,5). And last but not least, the software tool itself is free and based on freely available, open source software that can be easily installed and run on all of the popular PC operating systems (Windows, Mac, Linux) used by clinicians and patients.Figure 1 below is an annotated screenshot of the interface of the program. Once installed, this interface is all that users see. 
Of course a static , black and white picture cannot do justice to a fully interactive environment with color coding, so we have developed a short tutorial video7 which shows the interface in action. But the general idea about how the interface facilitates diagnostic inferences about health states based on the outcome of both precise and imprecise diagnostic test results is captured in the static screenshots below.Figure 1 has 3 basic interconnected parts. The left panel consists of user controlled diagnostic input variables. The right panel contains two outputs, one tabular, the other graphical, both derived from the input sliders in the left panel. A brief explanation of each of the panels in Figure 1 is followed by examples in Figure 2 and Figure 3 of how clinicians and their patients can use the software to vary the input parameters and observe the effect on diagnostic inferences in order to apply imprecise estimates to individual patients. Figure 1. An annotated picture of the basic interface The table in the top right of Figure 1, essentially a logicians Truth Table for two propositions augmented by natural frequency information, sets out the basic logical possibilities for combinations of health states and test results. The proposition cthe patient has the specific diseased is represented by a binary variable D on the left side of the table, a variable that has two possible values: D=1 if the proposition is true and D=0 if that proposition is false. The proposition cthe diagnostic test on this patient is positive for the diseased is represented by a binary variable T on the left side of the table, a variable that can also take on two possible values, T=1 if the proposition is true and the test is positive for the disease, T=0 if the proposition is false and the test is negative (non-positive) for the disease. 
The four columns of 1s and 0s in the table identify the four logically possible combinations for disease states {present, absent} and test outcomes {positive, negative}, each labeled with their conventional epidemiological names. The frequency information in the bottom row of the Truth Table is a way of expressing uncertainty about the logical possibilities for (D,T) in the form of hypothetical counts of cases in a hypothetical population. The initial default values in the table of Figure 1 assume a population of 100 cases, 16 of which are true positives, four of which are false negatives, 24 of which are false positives, and 56 of which are true negatives. Empirical research2,5 demonstrates improved accuracy of reasoning, less clouded thinking, and better understanding and communication of risk information for many (but not all) people when uncertainties are expressed as whole number frequencies in a natural sampling format as compared to using the language of conditional, marginal, and joint probabilities using decimals or percentages. Of course the table can also be used with the language of probability to express uncertainties, since scaling whole number frequencies (the column counts) by dividing by 100 or its multiples is a transparent operation for clinicians or patients familiar with the language of probability. The initial values for the frequency numbers are controlled by the sliders and menu buttons on the left hand side of Figure 1. These sliders represent numerical inputs for conventional ways of expressing uncertainty about disease states and diagnostic test outcomes: test sensitivity, test specificity, and the base rate or disease prevalence.8 For example, test sensitivity, the proportion of cases with disease (D=1) among those cases with a positive test outcome (T=1) is 16 out of 16+4=20 in Figure 1, or 80%. 
Similarly, test specificity is the proportion of cases without the disease (D=0) among those cases with a negative (T=0) test outcome, 24 out of 24+56=70 in Figure 1, or 70%. The base rate or prevalence of the disease in Figure 1 is the proportion of cases in the population who have the disease (D=1), 16+4=20 out of 100 cases, or 20%. There are actually two rows of frequencies, corresponding to the two sets of sliders in the left panel. Having two sets of sliders and associated frequency representations comes in handy when exploring the implications of imprecise inputs into the inference process. The graphical display in the bottom right panel of Figure 1 shows the positive (T=1) post-test predictive probabilities (solid red line) and negative (T=0) post-test predictive probabilities (dashed blue line) for having the disease (D=1) for every possible prevalence rate from 0 to 1 along the x-axis. The curves are calculated from Bayes theorem based on the values of the sensitivity and specificity of the test, but the user does not see the calculation. The vertical line in the figure through 0.2 or 20%, the base rate or prevalence rate of the disease set by the base rate slider, intersects the two post-test predictive probability curves. The upper intersection point identifies the numerical value (0.4) of the predictive probability that the disease is present given a positive diagnostic outcome, denoted P(D=1|T=1). The notation P(D=1|T=1) is just a shorthand way of expressing uncertainty about the chances of having the disease, D=1, when a positive test result, T=1, is observed. Reading the symbols from left to right: P is the probability, representing the degree of uncertainty; D=1 is what we are uncertain about, we want to know if we have the disease; and T=1 captures the information we have, we know the test has come back positive. 
The positive predictive probability P(D=1|T=1)=0.4 can be calculated by the count of 16 in column 1 where D=1 and T=1, divided by 16+24, the sum of the counts associated with T=1 in columns 1 and 3. The lower intersection point identifies the numerical value (0.07) of the predictive probability that the disease is present given a negative diagnostic outcome, denoted P(D=1|T=0). This notation P(D=1|T=0) is just a shorthand way of expressing uncertainty about the chances of a false negative test result. Analogous to the previous case, P represents the degree of uncertainty, D=1 is what we are uncertain about, and T=0 captures the information we have. The difference is that here the test has come back negative. The negative predictive probability P(D=1|T=0) =0.07 can be calculated by the count of four in column 2 where D=1 and T=0 divided by 4+56=60, the sum of the counts associated with T=0 in columns 2 and 4.The specific numeric values shown in Figure 1 are derived from the slider settings, which simultaneously determine the counts in the columns of the natural frequency table. Figure 2 below shows the impact on post-test predictive probabilities of decreasing the base rate from 20% to 5% when test sensitivity and specificity remain unchanged. As the base rate slider is manipulated, the position of the vertical line changes (with a thinner, more opaque line keeping the original base rate at a benchmark level for comparison purposes). In the example of Figure 2 both post-test predictive probabilities decrease with the decreases in the base rate, and markedly so, from 40% to 12%, for the positive predictive probability. Not only are the levels of the new post-test probabilities instantly recalculated on the graph as the base rate slider is manipulated, but the changing difference between the two post-test probabilities - the vertical gap between the two curves - is transparent. 
The vertical gap at any specification of the base rate is a measure of the amount of information to be gained from actually doing a diagnostic test. Since diagnostic tests are usually costly in time and resources, if not also in the downstream implications of reacting to false positives (anxiety and further unnecessary interventions) and not reacting to false negatives (misplaced assurance and foregone helpful interventions), it is helpful to have an idea of "how much" or "how little" information one can expect to learn from a test before any test is actually performed. The vertical gap between the solid and the dashed curves provides one such measure. When the gap between the solid and dashed lines at relevant pre-test prevalence rates is small, it may not be worthwhile undergoing costly and risky testing in the first place, an important message to get across to patients eager for subsidised diagnostic tests. There are various combinations of sensitivity, specificity, and pre-test base rate that can lead to small differences between the post-test predictive probabilities, all of which can easily be explored with the interface.

Figure 2 itself illustrates a very general principle: the gap between post-test predictive probabilities - and therefore the additional information provided by a diagnostic test - will be small whenever the base rate or prevalence of the condition D=1 is either very small or very large. The truth table in Figure 2 helps explain why. For example, at a low prevalence rate or pre-test probability of 5% (the patient is unlikely to have the disease) many of the positive test results (29 out of 4+29=33 from columns 1 and 3 of the table) will be false positives.

Figure 2. Exploring the effects of changing the base rate keeping test sensitivity and specificity unchanged

The sensitivity and specificity values used for the calculations underlying Figures 1 and 2 are low, 80% and 70% respectively.
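Using those same values (sensitivity 80%, specificity 70%), the vertical gap between the two curves can be tabulated across a range of base rates. This is a minimal sketch under stated assumptions: the function names are hypothetical, the chosen base rates are arbitrary illustration points, and the gap is taken simply as the difference between the two post-test probabilities given by Bayes' theorem.

```python
def post_test_probabilities(sensitivity, specificity, base_rate):
    """Bayes' theorem: return (P(D=1|T=1), P(D=1|T=0)) at a given base rate."""
    p = base_rate
    positive = sensitivity * p / (sensitivity * p + (1 - specificity) * (1 - p))
    negative = (1 - sensitivity) * p / ((1 - sensitivity) * p + specificity * (1 - p))
    return positive, negative


def information_gap(sensitivity, specificity, base_rate):
    """Vertical distance between the positive and negative curves."""
    positive, negative = post_test_probabilities(sensitivity, specificity, base_rate)
    return positive - negative


# Arbitrary illustrative base rates, from very rare to very common conditions.
for base_rate in (0.01, 0.05, 0.20, 0.50, 0.80, 0.95, 0.99):
    gap = information_gap(0.80, 0.70, base_rate)
    print(f"base rate {base_rate:4.0%}: gap = {gap:.2f}")
```

The printed gaps shrink toward zero at both ends of the range, which is the general principle that Figure 2 illustrates for the 5% case.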
What if the test were more (or less) sensitive or more (or less) specific, or more or less on both counts? The software tool can be used to investigate all of these combinations quickly and correctly. Figure 3 shows the impact of one pair of those changes. Test sensitivity and test specificity are now each close to 95%. Notice that the positive predictive probability (what to infer from a positive test result) now increases above 80% for all but the lowest base rates for diseases, and approaches probabilities of 95% when the pre-test probabilities are greater than 50/50 or 50%. Similarly, the negative predictive probability (what to infer from a negative test result) decreases to less than 20% for all but the highest prevalence rates, and to less than 5% for all pre-test probabilities lower than 50/50 or 50%. Overall the gap between the two curves has increased dramatically (in comparison to the lower sensitivity and specificity in the benchmark case). This reveals that the diagnostic test has more discriminatory power when test sensitivity and specificity are improved, and therefore testing may be more worthwhile.

Figure 3. Exploring the effects of changing the test sensitivity and specificity keeping the base rate unchanged

Of course other combinations of sensitivity and specificity can easily be explored. For example, there is often a tradeoff between sensitivity and specificity when disagreements arise about setting the threshold levels for classifying test results.5,9 This tradeoff can easily be explored - and explained to interested patients - using sliders in the software tool interface.
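The statement that the positive predictive probability exceeds 80% "for all but the lowest base rates" can be made concrete by solving Bayes' theorem for the base rate at which that level is reached. The sketch below is illustrative only: the function names are hypothetical, and sensitivity and specificity are assumed to be exactly 95%, whereas the text says only that they are close to 95%.

```python
def positive_predictive_probability(sensitivity, specificity, base_rate):
    """P(D=1|T=1) from Bayes' theorem."""
    p = base_rate
    return sensitivity * p / (sensitivity * p + (1 - specificity) * (1 - p))


def base_rate_for_target(sensitivity, specificity, target):
    """Base rate at which P(D=1|T=1) equals `target`.

    Obtained by rearranging Bayes' theorem for the base rate p:
    p = t(1 - Sp) / (Se(1 - t) + t(1 - Sp)), with t the target probability.
    """
    false_positive_rate = 1 - specificity
    return target * false_positive_rate / (
        sensitivity * (1 - target) + target * false_positive_rate
    )


# Benchmark test (80%, 70%) versus the improved test (assumed exactly 95%, 95%).
for sensitivity, specificity in ((0.80, 0.70), (0.95, 0.95)):
    threshold = base_rate_for_target(sensitivity, specificity, target=0.80)
    check = positive_predictive_probability(sensitivity, specificity, threshold)
    print(f"sensitivity {sensitivity:.0%}, specificity {specificity:.0%}: "
          f"P(D=1|T=1) reaches {check:.0%} at a base rate of {threshold:.1%}")
```

Under these assumptions the improved test reaches an 80% positive predictive probability at a base rate of roughly 17%, whereas the benchmark test needs a base rate of 60%.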
Once a clinician masters the interface (a very straightforward task, as illustrated in the web tutorials) he or she can use it to communicate in an easily understandable way with a patient about how relatively imprecise or ambiguous information about that patient and the testing procedure might affect the patient's post-test chances of having a disease.

The format that we present can be seen as a logical extension of Gerd Gigerenzer's insight as to why natural frequency formats help people make better inferences from diagnostic information in uncertain situations: "the representation does part of the reasoning" (p48).5 We have augmented the standard natural frequency representation of inference task problems in three ways that are useful for clinicians and patients. First, our interface works in a way that is compatible with standard clinical ways of representing and communicating uncertainties about health risks and diagnostic tests (sensitivity, specificity, and base rate). Second, all calculations, and the many recalculations that are necessary when exploring the implications (for post-test health risks) of imprecision and ambiguities in underlying information sources, are done electronically and correctly. Third, the interactive graphical interface provides visually clear, immediate, and dynamically updated representations of both the inputs to and the outputs of the inference task, with an ability to check relevant calculations (via the natural frequency table) and to gain insight into how diagnostic information and health state risks are related. As Edward Tufte says: "...clarity and excellence in thinking is very much like the clarity and excellence in the display of data. When principles of design replicate the principles of thought, the act of arranging information becomes an act of insight."10

Both clinicians and patients experience difficulty with the statistical reasoning required to make inferences about health states on the basis of information derived from diagnostic tests. This problem will grow in importance as we move into the era of personalised medicine, where an increasing supply of imprecise diagnostic tests meets an increasing demand to use such tests on the part of intelligent but statistically innumerate clinicians and patients. We describe a user-friendly, interactive, graphical interface for calculating, visualising, and communicating accurate inferences about uncertain health states when diagnostic information (test sensitivity and specificity, and health state prevalence) is imprecise and ambiguous in its application to a specific patient. The software is free, open-source, and runs on all popular PC operating systems (Windows, Mac, Linux).

1. Wong M-L. Rheumatologic diagnostic serology: tests which test clinicians. N Z Med J. 2004;117(1203). http://journal.nzma.org.nz/journal/117-1203/1095/content.pdf
2. Barbey AK, Sloman SA. Base-rate respect: from ecological rationality to dual processes. Behav Brain Sci. 2007;30:241-54.
3. Berwick D, Fineberg H, Weinstein M. When doctors meet numbers. Am J Med. 1981;71:991-998.
4. Gigerenzer G, Gaissmaier W, Kurz-Milcke E, et al. Helping doctors and patients make sense of health statistics. Psychol Sci Public Interest. 2008;8:53-98.
5. Gigerenzer G. Calculated Risks: How to Know When Numbers Deceive You. New York: Simon & Schuster; 2003.
6. Hamburg M, Collins F. The path to personalized medicine. N Engl J Med. 2010;363:301-4.
7. Fountain J. Discrete medical Tests: communicating and visualizing diagnostic information. 2010. http://uctv.canterbury.ac.nz/post/4/1130
8. Bernardo J, Smith A. Bayesian Theory. New York: Wiley; 1994, p44, Figure 2.
9. Iles S, Beckert L, Than M, Town I. Making a diagnosis of pulmonary embolism - new methods and clinical issues. N Z Med J. 2003;116(1177). http://journal.nzma.org.nz/journal/116-1177/499/content.pdf
10. Tufte E. Visual Explanations: Images and Quantities, Evidence and Narrative. Connecticut: Graphics Press; 1997, p11.

