The other day I posted a short survey (super representative, perfect questionnaire and the best respondents) on Twitter asking what political science should do with this thing called the p-value:
Let’s start with some background on why I asked this question. There is an ongoing debate among some of the finest scholars in the discipline about how we can prevent fraud (e.g. the Mike LaCour case; upper body strength and self-interest) and come up with better means to judge the meaningfulness of our results (see: [here], [here], [here], [here]). Currently the standard approach is to rely on p-values to judge whether we reject our null hypothesis. P-values thereby tell us something about the statistical significance of our findings. Of course p-values are by no means the only problem leading to fraud in academia. But they appear to provide a crucial incentive structure for it.
What’s the matter with p-values?
In my opinion there is not one single issue with them, but many. Here I will focus on the issues that are crucial for this post. First, p-values are counterintuitive: in essence, a p-value tells us how likely our data are, assuming a true null hypothesis. Scholars frequently conflate this with information about the probability of our hypotheses. Make no mistake: even well-trained researchers misunderstand and misinterpret p-values. Now, we can keep telling ourselves that if we improve the teaching of p-values, people will eventually get things straight. I suspect that this will never happen. P-values have been around for a while now, teaching has changed, and misinterpretations are still around. But this is just my own observation, not a randomized controlled trial. Maybe I’m wrong.
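To make the "how likely our data are, assuming a true null" point concrete, here is a minimal simulation sketch (in Python with numpy/scipy; the setup and numbers are my own illustration, not from any study). It runs many two-sample t-tests where the null is true by construction, and shows that about 5% of them still come out "significant" at the conventional threshold — by chance alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulate 10,000 two-sample t-tests where the null hypothesis is TRUE:
# both groups are drawn from the exact same distribution.
p_values = []
for _ in range(10_000):
    a = rng.normal(0, 1, size=50)
    b = rng.normal(0, 1, size=50)
    p_values.append(stats.ttest_ind(a, b).pvalue)

p_values = np.array(p_values)

# Under a true null, p-values are (approximately) uniform on [0, 1]:
# roughly 5% fall below 0.05 even though there is no effect at all.
print(f"share of p < 0.05 under a true null: {(p_values < 0.05).mean():.3f}")
```

The p-value says nothing directly about whether the hypothesis is true — here the null is true in every single run, and "significant" results show up anyway at the expected base rate.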
Second, the standard approach when using p-values is to make a binary decision based on an arbitrary threshold to judge whether a finding is significant or not. As Thomas Willi mentioned in a conversation on the matter, perfect objectivity is unreachable. But we can be more or less arbitrary, and using binary decisions to judge the significance of our results is too arbitrary. The magic number is p<0.05 (the 3rd option in the survey). P<0.05 crucially affects what is being published in political science. Given the importance of this magic number, scholars have strategic incentives to run a lot of models in an effort to clear the magic threshold. Gelman calls this “the garden of forking paths” (see: [here], [here], [here]). By pure chance, scholars are eventually likely to succeed, report that finding, and get published. Gelman and others rightfully emphasize that this is a main driver of the current replication crisis in social science.
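The forking-paths incentive can be sketched with back-of-the-envelope arithmetic: if a researcher tries k independent model specifications and the null is true in all of them, the chance of stumbling on at least one p<0.05 result grows quickly with k. (Real specifications are correlated rather than independent, so treat this as a rough upper-bound sketch, not a model of any particular study.)

```python
# Probability of at least one "significant" result across k independent
# tests of a true null, at the conventional alpha = 0.05 threshold:
#   P(at least one) = 1 - (1 - alpha)^k
alpha = 0.05
for k in (1, 5, 10, 20, 50):
    p_at_least_one = 1 - (1 - alpha) ** k
    print(f"{k:2d} specifications -> P(at least one p < 0.05): {p_at_least_one:.2f}")
```

With 20 specifications the chance of at least one spurious "finding" is already around 64% — which is why running models until something clears the threshold works so reliably.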
Following this criticism, some scholars propose to lower this arbitrary threshold (the 1st option in the survey). As outlined in detail in Benjamin et al., this might make it more difficult to successfully exploit the garden of forking paths. Yet, as discussed by McShane et al., the threshold remains arbitrary; researchers would become even more confident about the certainty of their findings; and publication bias towards papers clearing the threshold would persist. On top of that, funding would become more crucial due to the need to increase power (N) in most applications. This would benefit well-funded, rich institutions and potentially widen the inequality gap between universities. Also, as Lucas Leemann correctly pointed out to me, it would barely affect scholars using survey experiments, because there the sample size is rarely fixed but controllable to a large extent by the researcher. But it would crucially affect scholars using observational data (based for instance on country time-series data), since frequently the sample cannot be increased by the researcher: it is given, unchangeable.
A very radical solution would be the 4th option in the survey. Instead of relying on a frequentist approach to statistics, we could require people to use Bayesian approaches. This would mean getting rid of p-values entirely and instead evaluating our hypotheses based on probabilities. Intuitively this makes a lot of sense. Yet, as we know from election predictions, people tend to misinterpret probabilities as well, maybe even more than p-values. For instance, take the predictions for the last presidential election in the US (see: [here]). Pundits and the public alike overinterpreted the probability that Clinton would win and Trump would lose. And the rest is history.
Not that it matters, but what is my opinion?
My opinion is that we should go for the 2nd option. Let’s get rid of p-values and stars in our papers entirely. We should report standard errors and leave it at that. In essence, the information on statistical significance is still there, but readers need to figure out for themselves, with a fairly simple test, whether a coefficient is significant: is the coefficient at least two standard errors away from zero? Thus, readers are forced to look carefully into tables. They will no longer be able to judge based on * or p<0.05 in the blink of an eye. I suppose this gives authors more leverage to convince reviewers/readers based on something other than statistical significance alone.
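The two-standard-errors reading rule above is simple enough to sketch in a few lines. The coefficients and variable names below are made up purely for illustration; the rule itself (|estimate| ≥ 2 × SE, roughly equivalent to p < 0.05 for a normally distributed estimate, where the exact cutoff would be 1.96) is the only substantive part.

```python
# Hypothetical regression table: variable -> (coefficient, standard error).
# The numbers are invented for illustration only.
coefficients = {
    "education": (0.42, 0.15),
    "income":    (0.08, 0.06),
    "age":       (-0.30, 0.11),
}

for name, (beta, se) in coefficients.items():
    # The reader's test: is the estimate at least two SEs away from zero?
    significant = abs(beta) >= 2 * se
    verdict = "significant" if significant else "not significant"
    print(f"{name:10s}: |{beta:+.2f}| vs 2*SE = {2 * se:.2f} -> {verdict}")
```

The point of the proposal is precisely that this check takes a moment of engagement with the table, rather than a glance at a star.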
We need to do more than judge the meaningfulness of our findings based on an arbitrary threshold, and there are other means than p-values to do so. Pre-registration should eventually become a must to protect us from fraud. Instead of writing single-study papers, we should strive to test our hypotheses on different datasets in the same paper. And we should aim to look as carefully as possible into the mechanisms behind our effects. I think the key to overcoming replication issues and misinterpretation of p-values is to get rid of the focus on p-values. Let’s not judge based on a single metric!
The piece by McShane et al. is super close to my opinion. The major difference is that they argue p-values should still be reported, just without relying on arbitrary thresholds or using stars to mark significance. I forgot to add this option to my survey, stupid me:
My worry is that even without stars, reviewers/readers would still focus too much on p-values and thereby (unconsciously) keep judging based on the arbitrary thresholds.
Finally, the 3rd option is not an option at all in my opinion. We cannot stay put when we are aware of all the issues that come with the status quo. The status quo pretty much combines all the major criticisms outlined above without adding much value. We need to re-think this thing called p-value!
PS: Did I miss any other proposals? Let me know!