The p-value is ubiquitous across all levels of scientific and clinical research. It is a statistical measure used to decide whether a null hypothesis should be rejected, by testing the significance of a particular result. It is the probability of obtaining a sample outcome at least as extreme as the one observed, given that the null hypothesis is true. The null hypothesis is assumed to be true and is rejected only when sufficient evidence against it emerges.
We can never determine, with absolute certainty, that an intriguing result reflects what we think we have observed. Our results could have been influenced by sampling error, confounding bias, or a chance event. The p-value addresses the last of these: it measures how likely it is that chance alone would produce a result at least as extreme as the one observed, if the null hypothesis were true. Thus, the smaller the p-value, the more evidence we have against the null hypothesis. Suppose, for instance, a clinical trial is conducted to test whether a particular drug reduces pain. The null hypothesis would be that there is no difference between the treatment and the control groups, and that any difference that emerges is due to chance events. The p-value would then be the probability that chance events produce a difference at least as large as the one seen in our hypothetical clinical trial, given that the null hypothesis is true. It provides a measure of evidence against the null hypothesis.
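The drug trial described above can be sketched as an exact permutation test. This is only an illustration under stated assumptions: the pain scores are invented for the example, the function name `permutation_p_value` is hypothetical, and the test is one-sided (it asks only whether the treatment group reports less pain). Under the null hypothesis the group labels are interchangeable, so the p-value is the fraction of all possible relabelings that give a difference at least as large as the observed one.

```python
from itertools import combinations

def permutation_p_value(treatment, control):
    """One-sided exact permutation p-value for 'treatment scores are lower'.

    With fixed group sizes, a smaller treatment sum is equivalent to a larger
    (control mean - treatment mean), so comparing sums is sufficient.
    """
    pooled = list(treatment) + list(control)
    observed = sum(treatment)
    n_splits = 0
    as_extreme = 0
    # Enumerate every way of assigning the 'treatment' label to the pooled scores.
    for idx in combinations(range(len(pooled)), len(treatment)):
        n_splits += 1
        if sum(pooled[i] for i in idx) <= observed:
            as_extreme += 1
    return as_extreme / n_splits

# Hypothetical pain scores (0-10 scale), invented purely for illustration.
treatment_scores = [3, 4, 2, 5, 3]
control_scores = [6, 5, 7, 4, 6]

p = permutation_p_value(treatment_scores, control_scores)
print(round(p, 4))  # 5 of the 252 relabelings are as extreme: about 0.0198
```

Enumerating every relabeling is only feasible for small groups; larger trials would typically sample relabelings at random or use a parametric test instead.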
As previously mentioned, a small p-value indicates that the observed results would be unlikely if the null hypothesis were true. A cut-off of 0.05 has traditionally been used to determine whether the null hypothesis should be rejected. A p-value of 0.05 means that, if the null hypothesis were true, a difference at least as large as the one observed would occur by chance about 5% of the time. It does not mean that there is a 95% likelihood that the observed difference is real. If the p-value is less than 0.05, the results are considered statistically significant and we reject the null hypothesis. If the p-value is greater than 0.05, the results are not statistically significant and we fail to reject the null hypothesis.
The use of the p-value in statistical analysis owes its origins to one of the 20th century’s greatest scientists, Sir Ronald Fisher. Fisher was an eminent English statistician and geneticist, who pioneered and popularized practical applications of statistics in data analysis. In his 1935 book, The Design of Experiments, Fisher describes an actual incident at a tea party, where a lady claimed to be able to tell, if served a cup of tea, whether the tea or the milk was poured first. To test the lady’s claim, Fisher devised a classic experiment. He gave the lady eight cups of tea, four with tea added first and four with milk added first, and asked her to partition the cups into the two correct groups. The null hypothesis was that the order in which tea or milk was added did not affect the lady’s judgement. All the cups had to be judged correctly in order to yield a significant result. The probability of classifying all cups correctly by guessing would be 1/70, since there are 8!/(4!4!) = 70 possible partitions. Correctly identifying three of the four cups to which milk had been added first corresponds to a p-value of 17/70, or 0.24, which is higher than the significance cut-off of 0.05. Therefore, the null hypothesis cannot be rejected at this significance level with three correct out of four.
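The counting behind 1/70 and 17/70 can be checked with a few lines of Python. This is a minimal sketch, assuming the lady must pick exactly four cups as milk-first and is purely guessing under the null hypothesis; the function name `tea_p_value` is made up for this example.

```python
from fractions import Fraction
from math import comb

def tea_p_value(cups_per_type: int, correct: int) -> Fraction:
    """P(at least `correct` milk-first cups are picked) under pure guessing."""
    total = comb(2 * cups_per_type, cups_per_type)  # all possible partitions
    favorable = sum(
        # k correct picks from the milk-first cups, the rest from tea-first cups
        comb(cups_per_type, k) * comb(cups_per_type, cups_per_type - k)
        for k in range(correct, cups_per_type + 1)
    )
    return Fraction(favorable, total)

print(tea_p_value(4, 4))         # 1/70
print(tea_p_value(4, 3))         # 17/70
print(float(tea_p_value(4, 3)))  # about 0.243
```

The 17 favorable partitions are the single perfect one plus the 4 × 4 = 16 ways of getting exactly three milk-first cups right.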
This experiment is a great demonstration of the early uses of the p-value in statistical analysis. But how meaningful is it? Suppose we were to extend the experiment and give the lady 12 cups of tea, six of each kind. The probability of the lady classifying all 12 cups correctly would be 1/924, since there are 12!/(6!6!) = 924 possible partitions. The probability of her correctly judging at least 5 cups of each treatment would be 37/924, or 0.04, which is less than the 0.05 cut-off. This would mean that the result would still be significant even with one misjudgment of each kind. An increase in the size of the experiment can thus allow a significant result to be achieved even when some cups are misjudged.
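To see how the room for error grows with the size of the experiment, the same counting argument can be applied to any number of cups. The sketch below assumes equal numbers of milk-first and tea-first cups and pure guessing under the null hypothesis; `tea_p_value` and `min_correct_for_significance` are illustrative names, not taken from Fisher.

```python
from fractions import Fraction
from math import comb

def tea_p_value(cups_per_type: int, correct: int) -> Fraction:
    """P(at least `correct` milk-first cups are picked) under pure guessing."""
    total = comb(2 * cups_per_type, cups_per_type)
    favorable = sum(
        comb(cups_per_type, k) * comb(cups_per_type, cups_per_type - k)
        for k in range(correct, cups_per_type + 1)
    )
    return Fraction(favorable, total)

def min_correct_for_significance(cups_per_type: int, alpha: float = 0.05) -> int:
    """Smallest number of correct milk-first picks that still gives p < alpha."""
    for correct in range(cups_per_type + 1):
        if tea_p_value(cups_per_type, correct) < alpha:
            return correct
    raise ValueError("even a perfect score is not significant at this alpha")

print(tea_p_value(6, 6))  # 1/924
print(tea_p_value(6, 5))  # 37/924
for n in (4, 6, 8):
    print(2 * n, "cups: need", min_correct_for_significance(n), "of", n)
```

With 8 cups a perfect score is required, while with 12 cups five correct out of six is already enough to fall below 0.05.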
This is perhaps one of the limitations of using the p-value in statistical analysis. A demonstration of statistical significance is not necessarily a demonstration of practical significance or practical meaningfulness. As the size of each comparison group increases, a given difference between sample means becomes more statistically significant. Accordingly, if the sample size is large enough, a statistically significant difference can be detected even when it is too small to matter in practice. On the other hand, if the sample size is small, a difference that might be of practical importance might not be statistically significant. Thus, a significant p-value says nothing about the magnitude of an effect. And a non-significant p-value, one greater than 0.05, merely says that the observed result is no larger than what could plausibly have occurred through chance events alone.
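The dependence on sample size can be illustrated with a simple two-sample z-test. This is a sketch under simplifying assumptions that are not in the original text: both groups have the same known standard deviation, the groups are of equal size, and the observed difference in means happens to be exactly 0.01 standard deviations, a difference that would rarely matter in practice. The function name `two_sample_z_p_value` is hypothetical.

```python
from math import erfc, sqrt

def two_sample_z_p_value(diff_in_sd: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in means, expressed in SD units."""
    # Standard error of the difference of two means with unit SD is sqrt(2/n).
    z = diff_in_sd / sqrt(2.0 / n_per_group)
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return erfc(abs(z) / sqrt(2.0))

# The same tiny observed difference, evaluated at growing sample sizes.
for n in (100, 10_000, 1_000_000):
    print(f"n per group = {n:>9,}: p = {two_sample_z_p_value(0.01, n):.3g}")
```

The effect size never changes in this loop; only the sample size does, yet the p-value moves from clearly non-significant to far below 0.05.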
Even though Fisher did suggest 0.05 as a significance level, that choice was quite arbitrary. There is no principled reason to prefer 0.05 to 0.01, or to any nearby level for that matter. Fisher says that “this is an arbitrary but convenient level of significance for the practical investigator, but it does not mean that he allows himself to be deceived once in every twenty experiments. The test of significance only tells him what to ignore, namely all experiments in which significant results are not obtained.” Treating the cut-off as a bright line puts forth the rather naive suggestion that a p-value of 0.049 is meaningfully more significant than a p-value of 0.051. But a cut-off had to be chosen, and 0.05 looks rather decisive to the human eye.
The p-value method in statistical analysis is liable to many misuses in scientific and clinical research. For instance, researchers can construct hypotheses from data that are already present, or collect tangential data unrelated to the primary analysis and perform incidental comparisons between sub-cohorts, for the sole purpose of cherry-picking a significant p-value. The more comparisons are made, the more likely it is that a significant result will come out, even if no real association exists. At a significance level of 0.05, a researcher who performs 20 comparisons can expect, on average, one of them to be statistically significant by chance alone. In other words, you torture the data until it confesses!
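The arithmetic behind the 20-comparison example can be made explicit. The sketch below assumes the comparisons are independent and that every null hypothesis is true, which is a simplification; the function names are illustrative.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of 'significant' results when every null is true."""
    return n_tests * alpha

def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent tests falls below alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

for m in (1, 5, 20, 100):
    print(
        f"{m:>3} tests: expected false positives = {expected_false_positives(m):.2f}, "
        f"P(at least one) = {prob_at_least_one_false_positive(m):.3f}"
    )
```

For 20 tests the expected count is 1 and the chance of at least one spurious finding is about 0.64, so a single significant sub-cohort comparison among many is weak evidence on its own.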
Moreover, the p-value does not provide any information about the alternative hypothesis. The acceptance of one hypothesis is made contingent upon the improbability of another. Furthermore, the p-value never tells us how different the treatment groups actually are. It is important to note, however, that Fisher never intended it to be used as a rigid decision rule in hypothesis testing in the first place.
“A man who ‘rejects’ a hypothesis provisionally, as a matter of habitual practice, when the significance is at the 1% level or higher, will certainly be mistaken in not more than 1% of such decisions. For when the hypothesis is correct he will be mistaken in just 1% of these cases, and when it is incorrect he will never be mistaken in rejection. This inequality statement can therefore be made. However, the calculation is absurdly academic, for in fact no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas. Further, the calculation is based solely on a hypothesis, which, in the light of the evidence, is often not believed to be true at all, so that the actual probability of erroneous decision, supposing such a phrase to have any meaning, may be much less than the frequency specifying the level of significance.” – Ronald Fisher, Statistical Methods for Research Workers
Fisher, R. A. (1971). The Design of Experiments, 9th ed., New York: Hafner.
Fisher, R. A. (1970). Statistical Methods for Research Workers, 14th ed., Edinburgh: Oliver & Boyd.