Doctor, It Hurts When I p
Ronald L. Wasserstein, Executive Director, ASA
NJ ASA Chapter Annual Symposium, May 25, 2017

The Talk
They think they know all about it already, because they learned about it from others like them. It is not nearly as interesting as they thought it would be.

They've stopped listening before you've stopped talking. Chances are, they now understand it even less.

Why did the ASA issue a statement on p-values and statistical significance?

"It has been widely felt, probably for thirty years and more, that significance tests are overemphasized and often misused and that more emphasis should be put on estimation and prediction." Cox, D.R. 1986. Some general aspects of the theory of statistics. International Statistical Review 54: 117-126.

A world of quotes illustrating the long history of concern about this can be viewed at David F. Parkhurst, School of Public and Environmental Affairs, Indiana University: http://www.indiana.edu/~stigtsts/quotsagn.html

Let's be clear. Nothing in the ASA statement is new. Statisticians and others have been sounding the alarm about these matters for decades, to little avail. (Wasserstein and Lazar, 2016)

Why did the ASA issue a statement on p-values and statistical significance?
Science fails to face the shortcomings of statistics

A journal went so far as to ban p-values.

P-value clarified (in the ASA Statement): "Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value."

"That definition is about as clear as mud." Christie Aschwanden, lead writer for science, FiveThirtyEight
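The definition becomes less muddy when each piece is made explicit in code. The sketch below (not from the talk; all data are invented) uses a permutation test: the "specified statistical model" is that group labels are exchangeable (the null), and the "statistical summary" is the difference in group means.

```python
# A permutation test makes each piece of the definition explicit: the
# "specified statistical model" is that group labels are exchangeable (the
# null), and the "statistical summary" is the difference in group means.
# The data below are invented for illustration.
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Fraction of label shufflings whose |mean difference| is at least as
    extreme as the observed one -- the p-value, by the definition above."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            hits += 1
    return hits / n_permutations

a = [5.1, 4.9, 6.2, 5.8, 5.5, 6.0]   # hypothetical measurements, group A
b = [4.2, 4.8, 5.0, 4.6, 4.4, 5.1]   # hypothetical measurements, group B
print(permutation_p_value(a, b))      # small: the data sit far out under the null
```

Note what the number is and is not: it is a tail probability computed under one assumed model, not the probability that the model (or the hypothesis) is true.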

Perhaps this is clearer (Stark, 2016).

What goes into the p-value? Many things! The assumption that the null hypothesis is true is typically the only thing considered. However, much more than that goes into the p-value; many choices by the researcher can affect it.

The ASA statement articulates six principles:
1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

Does the ASA statement go far enough? The

ASA statement does not go as far as it should go. However, it goes as far as it could go.

Biggest takeaway message from the ASA statement: bright-line thinking is bad for science. (S)cientists have embraced and even avidly pursued meaningless differences solely because they are statistically significant, and have ignored important effects because they failed to pass the screen of statistical significance... It is a safe bet that people have suffered or died because scientists (and editors, regulators, journalists and others) have used

significance tests to interpret results, and have consequently failed to identify the most beneficial courses of action. (Rothman)

p equal or nearly equal to 0.06:
almost significant
almost attained significance
almost significant tendency
almost became significant
almost but not quite significant
almost statistically significant
almost reached statistical significance
just barely below the level of significance
just beyond significance

"... surely, God loves the .06 nearly as much as the .05." (Rosnow and Rosenthal 1989)

p equal or nearly equal to 0.08:
a certain trend toward significance
a definite trend
a slight tendency toward significance
a strong trend toward significance
a trend close to significance
an expected trend
approached our criteria of significance
approaching borderline significance
approaching, although not reaching, significance

And, God forbid, p close to but not less than 0.05:
hovered at nearly a significant level (p=0.058)
hovers on the brink of significance (p=0.055)
just about significant (p=0.051)
just above the margin of significance (p=0.053)
just at the conventional level of significance (p=0.05001)
just barely statistically significant (p=0.054)
just borderline significant (p=0.058)

Thanks to Matthew Hankins for these quotes: https://mchankins.wordpress.com/2013/04/21/still-not-significant-2/

Hypothesis: Focus on statistical significance levels leads researchers to misinterpret data.
Participants: Authors of articles published in the 2013 volume of the NEJM.

Below is a summary of a study from an academic paper: The study aimed to test how different interventions might affect terminal cancer patients' survival. Participants were randomly assigned to one of two groups. Group A was instructed to write daily about positive things they were blessed with, while Group B was instructed to write daily about misfortunes that others had to endure. Participants were then tracked until all had died. Participants in Group A lived, on average, 8.2 months post-diagnosis, whereas participants in Group B lived, on average, 7.5 months post-diagnosis.

Response wording 1: Which statement is the most accurate summary of the results?
A. Speaking only of the subjects who took part in this particular study, the average number of post-diagnosis months lived by the participants who were in Group A was greater than that lived by the participants who were in Group B.
B. Speaking only of the subjects who took part in this particular study, the average number of post-diagnosis months lived by the participants who were in Group A was less than that lived by the participants who were in Group B.
C. Speaking only of the subjects who took part in this particular study, the average number of post-diagnosis months lived by the participants who were in Group A was no different than that lived by the participants who were in Group B.
D. Speaking only of the subjects who took part in this particular study, it cannot be determined whether the average number of post-diagnosis months lived by the participants who were in Group A was greater/no different/less than that lived by the participants who were in Group B.

Response wording 2: Which statement is the most accurate summary of the results?
A. The average number of post-diagnosis months lived by the participants who were in Group A was greater than that lived by the participants who were in Group B.
B. The average number of post-diagnosis months lived by the participants who were in Group A was less than that lived by the participants who were in Group B.
C. The average number of post-diagnosis months lived by the participants who were in Group A was no different than that lived by the participants who were in Group B.
D. It cannot be determined whether the average number of post-diagnosis months lived by the participants who were in Group A was greater/no different/less than that lived by the participants who were in Group B.

Response wording 3: Which statement is the most accurate summary of the results?
A. The participants who were in Group A tended to live longer post-diagnosis than the participants who were in Group B.
B. The participants who were in Group A tended to live shorter post-diagnosis than the participants who were in Group B.
C. Post-diagnosis lifespan did not differ between the participants who were in Group A and the participants who were in Group B.

D. It cannot be determined whether the participants who were in Group A tended to live longer/no different/shorter post-diagnosis than the participants who were in Group B.

Results recall: A. group A > group B B. group A
In the case of p=0.27, among the great majority of people who did not choose A, there was generally a split between C and D, with C favored by most.

Robustness checks: similar study conducted with the editorial board of Psychological Science,

Marketing Science Institute Young Scholars, statistically trained undergraduates, and undergraduates without statistical training. There was no substantial change to the pattern of results.

Study 2: Does this pattern of results extend from

descriptive statements to the evaluation of evidence via likelihood judgments?
Participants: Authors of articles published in the 2013 volume of the American Journal of Epidemiology.
Participants completed a likelihood judgment question followed by a choice question. Participants were randomly assigned to one of sixteen conditions following a four-by-two-by-two design. The first level of the design varied whether the p-value was set to 0.025, 0.075, 0.125, or 0.175, and the second level of the design varied the magnitude of the treatment difference (52% and 44% versus 57% and 39%).

Below is a summary of a study from an academic paper: The study aimed to test how two different drugs impact whether a patient recovers from a certain disease. Subjects were randomly drawn from a fixed population and then randomly assigned to Drug A or Drug B. Fifty-two percent (52%) of subjects who took Drug A recovered from the disease, while forty-four percent (44%) of subjects who took Drug B recovered from the disease. A test of the null

hypothesis that there is no difference between Drug A and Drug B in terms of probability of recovery from the disease yields a p-value of 0.025, 0.075, 0.125, or 0.175, depending on the condition.

Likelihood judgment question: Assuming no prior studies have been conducted with these drugs, which of the following statements is most accurate?
A. A person drawn randomly from the same population as the subjects in the study is more likely to recover from the disease if given Drug A than if given Drug B.
B. A person drawn randomly from the same population as the subjects in the study is less likely to recover from the disease if given Drug A than if given Drug B.
C. A person drawn randomly from the same population as the subjects in the study is equally likely to recover from the disease if given Drug A as if given Drug B.
D. It cannot be determined whether a person drawn randomly from the same population as the subjects in the study is more/less/equally likely to recover from the disease if given Drug A or if given Drug B.

Personal choice question: If you were to advise a loved one who was a patient from the same population as those in the study, what drug would you advise him or her to take?
A. I would advise Drug A.

B. I would advise Drug B.
C. I would advise that there is no difference between Drug A and Drug B.

Distant choice question: If you were to advise physicians treating patients from the same population as those in the study, what drug would you advise these physicians prescribe for their patients?
A. I would advise Drug A.
B. I would advise Drug B.
C. I would advise that there is no difference between Drug A and Drug B.

Why did they make these choices?

Participants were provided a text box to explain their reasoning. The vast majority who incorrectly answered the judgment question responded with an answer about not achieving statistical significance.

By the way: this second study was tested on multiple populations with similar results. The size of the effect (the proportions of success for the two drugs) was also varied, with no change in the pattern.

Dichotomizing evidence leads to strange behaviors!

Does screen time affect sleep habits of school-age children?
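Before the next example, a quick illustration of that strange behavior, using Study 2's 52% vs. 44% recovery rates: hold the observed effect fixed and vary only the sample size, and the p < 0.05 verdict flips. The sample sizes below are hypothetical and the two-proportion z-test is a standard textbook calculation, not something from the talk.

```python
# Two-proportion z-test (pooled standard error), a standard textbook test.
from math import sqrt, erf

def two_proportion_p(p1, p2, n1, n2):
    """Two-sided p-value for H0: equal recovery probabilities."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = abs(p1 - p2) / se
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))  # 2 * P(Z > |z|)

# Same observed effect (52% vs. 44% recovery), different hypothetical n per arm:
for n in (100, 400):
    p = two_proportion_p(0.52, 0.44, n, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n={n} per arm: p={p:.3f} -> {verdict}")
```

The evidence about the drugs' relative merits points the same way in both rows; only the dichotomized verdict changes.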

The researchers had hypotheses, based on previous research: "We hypothesized that use of any form of electronic media would be negatively associated with sleep duration. Furthermore, we expected that the strength of the association would vary based on the level of interactivity of the screen type. More specifically, we hypothesized that interactive forms of screen time, such as computer use and video gaming, would be associated with shorter bedtime sleep duration compared to passive forms of screen time, such as watching television."

Why were they interested? Lack of sleep (insufficient sleep duration) increases risk of poor academic performance as well as certain adverse health outcomes. Is there a relationship between weekday nighttime sleep duration and screen exposure (television, chatting, video games)?

Who were the subjects? "We used age 9 data from an ethnically diverse national birth cohort study, the Fragile Families and Child Wellbeing Study, to assess the association between screen time and sleep duration among 9-year-olds, using screen time data reported by both the child (n = 3269) and by the child's primary caregiver (n = 2770)."

Fragile Families and Child Wellbeing Study: The FFCW is a longitudinal cohort study that has followed approximately 5000 children, born between 1998 and 2000, since birth. Data were collected in 20 cities with populations of at least 200,000 across the United States. The sample was designed to include a high number of unmarried parents and racial minorities, along with a high proportion of low socioeconomic status.

What did the researchers find? Children who watched more than 2 hours/day of TV had shorter sleep duration compared with those who watched less than 2 hours/day (P<.001), by about 11 minutes. Children who spent more than 2 hours/day chatting on the computer had shorter sleep duration than those who chatted less than 2 hours/day (P<.05), by about 16 minutes. The researchers did not find a significant association between playing video games/working on the computer for more than 2 hours per day and weekday nighttime sleep duration.

When the researchers adjusted for other factors: Children who watched more than 2 hours/day of TV had shorter sleep duration compared with those who watched less than 2 hours/day (P<.05), by about 6 minutes. No other significant associations were found.

This is a fairly typical type of study: typical scientifically, typical statistically, atypical in communication. Unfortunately, it makes all-too-typical mistakes.

"Children who watched more than 2 hours/day of TV had shorter sleep duration compared with those who watched less than 2 hours/day (P<.001) by about 11 minutes."

This means that, if all of the assumptions are correct, including the null hypothesis, there is less than a 1 in 1000 chance that the researchers would have observed the result they did or one even larger. (The result they observed is an average difference of about 11 minutes from one group to the other.)

Hypothesis testing logic: A 1 in 1000 chance is not very likely. So it is not likely that, if all of the assumptions are correct, we would have observed the outcome we observed (11 minutes difference in sleep time) or one even larger. Therefore, we should evaluate these assumptions, including the null hypothesis.

What do people tend to conclude in these situations? (What will the blogs say?) "Research shows that children who watch TV more during the weekday sleep less than those who don't." And from there it is a short walk to "TV is not good for kids and should be limited" or "TV is causing poor performance in school because it makes kids sleep less."

Authors' conclusion in abstract: "No specific type or use of screen time resulted in significantly shorter sleep duration than another, suggesting that caution should be advised against excessive use of all screens." In other words, though not demonstrated in the study, all screen usage is bad.

There is no p-value transitivity property. They argue (in effect):

TV = chatting = video games in this study.
TV results in less sleep in this study.
Therefore, we should watch out for all the other things, too.
But the study does not and cannot prove the first assertion!
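A quick calculation shows why "one effect significant, the other not" does not establish a difference between effects. The estimates below are in minutes of sleep; the standard errors are hypothetical (the talk reports only p-value ranges), loosely echoing the TV (significant) and video-game (not significant) comparisons.

```python
# Sketch of the "difference between 'significant' and 'not significant' is
# not itself significant" fallacy. Standard errors are hypothetical.
from math import sqrt, erf

def two_sided_p(estimate, se):
    """Two-sided p-value for a normal (z) test of estimate / se."""
    z = abs(estimate) / se
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))  # 2 * P(Z > |z|)

effect_tv, se_tv = -11.0, 3.0        # minutes of sleep; clears p < .001
effect_games, se_games = -4.0, 3.0   # does not clear p < .05
diff = effect_tv - effect_games                # -7 minutes
se_diff = sqrt(se_tv**2 + se_games**2)         # assumes independent estimates

print(two_sided_p(effect_tv, se_tv))           # ≈ 0.0002: "significant"
print(two_sided_p(effect_games, se_games))     # ≈ 0.18:   "not significant"
print(two_sided_p(diff, se_diff))              # ≈ 0.099:  their difference is NOT significant
```

One estimate clears the threshold and the other does not, yet a direct test of their difference is nowhere near significant, so the data do not establish that TV and video games differ in their effects.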

"If there is enough evidence that one effect is significant, but not enough evidence for the second being significant, that doesn't mean that the two effects are different from each other. Analogously, if you can prove that one suspect was present at a crime scene, but can't prove the other was, that doesn't mean that you have proved that the two suspects were in different places." (emphasis mine)

(http://mindhacks.com/2015/10/03/statistical-)

What is scientifically appropriate to conclude? The children in this study who watched more than 2 hours/day of TV had shorter sleep duration compared with those who watched less than 2 hours/day, by about 11 minutes.

If all of our assumptions, including those about the representativeness of the sample, are correct, the study suggests that nine-year-old children from this population who watch more than 2 hours/day of TV sleep about 11 minutes less, on average.

In the sleep research, even if all of our assumptions are correct: Does 11 minutes less sleep really matter? Why? Furthermore, the 11-minute figure is an estimate that has variance, and we learn nothing about that variance from the way the data summary is reported (i.e., via a p-value).

And what if THIS had happened: Suppose the study showed that children who watched 2 or more hours of TV slept on average 90 minutes per night less than those who did not, but the p-value was 0.09. Is this result insignificant?

A fundamental problem: We want P(H|D), but p-values give P(D|H).

The problem illustrated (Carver 1978): "What is the probability of obtaining a dead person (D) given that the person was hanged (H); that is, in symbol form, what is p(D|H)? Obviously, it will be very high, perhaps .97 or higher. Now, let us reverse the question: What is the probability that a person has been hanged (H) given that the person is dead (D); that is, what is p(H|D)? This time the probability will undoubtedly be very low, perhaps .01 or lower. No one would be likely to make the mistake of substituting the first estimate (.97) for the second (.01); that is, to accept .97 as the probability that a person has been hanged given that the person is dead." Carver, R.P. 1978. The case against statistical testing. Harvard Educational Review 48: 378-399.
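Carver's example is Bayes' rule in miniature: P(D|H) is fixed by the model, but P(H|D) also depends on base rates, which the p-value ignores. In the sketch below, only the .97 comes from Carver; the base rates are invented purely for illustration.

```python
# Carver's hanged/dead example, in Bayes' rule form. Only the 0.97 comes from
# Carver; the base rates P(H) and P(D) below are invented for illustration.
p_dead_given_hanged = 0.97   # P(D|H): near-certain, as Carver says
p_hanged = 0.0001            # P(H): hypothetical base rate of being hanged
p_dead = 0.01                # P(D): hypothetical base rate of being dead

# Bayes' rule: P(H|D) = P(D|H) * P(H) / P(D)
p_hanged_given_dead = p_dead_given_hanged * p_hanged / p_dead
print(p_hanged_given_dead)   # ≈ 0.0097: tiny, even though P(D|H) is near 1
```

The p-value plays the role of P(D|H): no matter how extreme it is, it cannot by itself tell you P(H|D), the quantity researchers actually want.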

Inference is hard work. Simplistic (cookbook) rules and procedures are not a substitute for this hard work. Cookbook + artificial threshold for significance = appearance of objectivity.

Gelman and Carlin (2017) (to appear in JASA): "we think the solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation."

In a world where p<0.05 carried no meaning: What would you have to do to get your paper published, your research grant funded, your drug approved, your policy or business recommendation accepted? You'd have to be convincing!
You will also have to be transparent.

Small steps (Gelman and Carlin): Say No to binary conclusions in our collaboration and consulting projects: resist giving clean answers when that is not warranted by the data. Instead, do the work to present statistical conclusions with uncertainty rather than as dichotomies.
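As a minimal sketch of what "conclusions with uncertainty rather than dichotomies" can look like in practice (the numbers are hypothetical, echoing the sleep study's 11-minute estimate; the standard error is invented):

```python
# Report the estimate with its uncertainty instead of a significant /
# not-significant verdict. Numbers are hypothetical.
estimate, se = -11.0, 4.0                            # minutes of sleep per night
lo, hi = estimate - 1.96 * se, estimate + 1.96 * se  # approximate 95% CI
print(f"Estimated difference: {estimate:.0f} minutes (95% CI {lo:.1f} to {hi:.1f})")
```

The interval tells the reader both the size of the effect and how uncertain it is; whether the interval crosses zero is visible, but it is not the headline.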

Also, remember that most effects can't be zero (at least in social science and public health), and that an effect is usually a mean in a population (or something similar, such as a regression coefficient), a fact that seems to be lost from consciousness when researchers slip into binary statements about there being "an effect" or "no effect" as if they are writing about constants of nature. It will be difficult to resolve the many problems with p-values and statistical significance without addressing the mistaken goal of certainty which such methods have been used to pursue.

Wrapping up: P-values themselves are not the problem, but
They are hard to explain.
They are easy to misunderstand.
They don't directly address the question of interest.
When mixed with bright-line thinking, they lead to bad science.

So, maybe if you have only been dating p-values, it's time to start seeing some other statistics.

Haiku:
Little p-value
what are you trying to say
of significance?
-Steve Ziliak

Questions? [email protected]org @RonWasserstein