Unlike the null hypothesis, the alternative hypothesis has received little methodological attention from social scientists. It is argued that a complete understanding and appreciation of hypothesis-testing requires a thorough study of the nature of the alternative hypothesis. Inferences based on rejecting the null hypothesis are inadequate if an inappropriate alternative is inferred. Examples of inappropriate inferences are drawn from psychology, cosmology, medicine, education and games of chance to highlight the generality of the problem. The primary purpose of the article is to stimulate thought about the alternative hypothesis by considering: 1) the differences between the conceptual and statistical alternatives; 2) how the alternative is construed; and 3) how best to characterize the alternative hypothesis. It is concluded that the alternative is best characterized as a "next best guess explanation" given a rejected null hypothesis.
Cohen (1994) and others (e.g., Bakan, 1966; Loftus, 1993; Meehl, 1978) have long exposed the serious methodological and philosophical problems associated with null hypothesis significance testing (NHST). Some (e.g., Hunter, 1997) have even advocated the extreme position that the procedure be banned from psychology journals altogether. Although most articles evaluating social science’s primary means of statistical inference focus unduly on the “Fisherian” null hypothesis, few have discussed the nature of the alternative hypothesis in any detail. Largely a product of the Neyman-Pearson (1928) model of hypothesis-testing, the alternative hypothesis has received relatively little attention from methodologists. This is despite the fact that the alternative, not the null, is the hypothesis of primary interest to the investigator (Cohen, 1994). Indeed, few have taken a close look at what constitutes a “methodologically sound” alternative to the null.
The purpose of this article is to clarify the nature of the alternative hypothesis. The first priority is to distinguish between the “statistical” and “conceptual” alternatives. It will be argued that this distinction is seldom recognized in research circles. Too often, the statistical and conceptual hypotheses are conflated, treated as though they made the same claim rather than as substantially distinct statements. The second priority is to discuss the actual “construction” of the alternative hypothesis. Through examples from various fields, it will be shown that the conceptual alternative is born out of experimental control rather than out of any statistical procedure. Again, the differences between the two types of alternatives (i.e., statistical vs. conceptual) will be used to substantiate this argument. The examples were chosen from a wide array of fields (e.g., cosmology, medicine, psychology, education, chance) to demonstrate the generality of the problem. Regardless of the discipline, wherever there is a null hypothesis to be rejected, there is the even more daunting task of inferring a correct alternative.1 Finally, what is thought to be an ideal characterization of the alternative hypothesis will be presented: the alternative as a “next best guess explanation” in light of a rejected null hypothesis.
The importance of a firm theoretical and methodological understanding of the alternative hypothesis cannot be overstated. It is no exaggeration to say that researchers in the social sciences continually rely on the alternative hypothesis for “answers.” Should these answers be wrong, the foundation on which scientific progress rests comes under serious scrutiny. Indeed, multiple “wrong answers” lead not to progress at all, but possibly to a regress of knowledge or, quite literally, utter confusion. The significance of properly considering the alternative hypothesis can be demonstrated by the historic astronomical debate between the geocentric and heliocentric theories of the universe. Drawing largely on evidence gathered by measurement, Ptolemy propounded the geocentric theory in a form that prevailed for 1400 years. In the language of null hypothesis testing, one could say that Ptolemy, upon gathering his measurement data, rejected the possibility that these measurements could have arisen by mere chance. Instead, he proposed the existence of a “structure” (i.e., a universal structure) behind them. This structure formed the basis of his alternative hypothesis, in which the sun orbited the earth. His alternative to chance appeared very reasonable at the time, even though it was later shown to be incorrect by Copernicus and others (e.g., Galileo). The point is that although Ptolemy’s alternative appeared correct for its time, and was socially acceptable given the beliefs of the masses, it was nevertheless replaced by Copernicus’s heliocentric theory. Both men can be said to have rejected chance as an account of their measurements, yet they arrived at competing and virtually opposite alternatives.
This is the nature of the problem not only in cosmology but in all sciences: one cannot “prove” an alternative hypothesis; one can only reject chance as a reasonable explanation and infer what is considered the more likely explanation. For Ptolemy, the alternative to chance was that the sun revolves around the earth. For Copernicus, the alternative was that the earth revolves around the sun. There is perhaps no greater example in history of the importance of carefully considering inferred alternatives.2 Before discussing further examples, it is worth reviewing the two primary types of alternative hypotheses.
A first distinction that is paramount to understanding the alternative hypothesis is that between a “statistical” and a “conceptual” alternative hypothesis, the latter also known as the “substantive” (or again, “scientific”) alternative.3 As noted by Bolles (1962), although a rejection of the null implies the statistical alternative, it does not necessarily imply the scientific alternative. These are two very different hypotheses. The difference stems largely from the fact that statistical inference is not the same as scientific inference (Morrison & Henkel, 1969). Chow (1996) correctly argues that although the statistical alternative may be inferred quite easily, the inference is much more difficult for the conceptual alternative because of what he calls the “reality of multiple explanations” (p. 53). Indeed, a multitude of explanations could exist for why the null hypothesis is rejected, that is, for why the observed data are improbable under the null. According to the Neyman-Pearson model of hypothesis-testing, if the null is rejected, the alternative is inferred; that is, the statistical alternative. However, this does not, and should not, directly imply an inference of the conceptual alternative. In fact, the inference of the statistical alternative simply suggests the possibility of a conceptual hypothesis that may account for why the null was rejected. Without a conceptual alternative, rejecting the null has little meaning. As Chow notes, “multiple conceptual alternative hypotheses give rise to their respective statistical alternative hypotheses” (p. 55). This means that in order to have any statistical alternative, one must first conceive of a conceptual hypothesis. Without the assumption that some conceptual hypothesis can account for the rejected null, one would hardly be interested in inferring a statistical alternative.
Once the conceptual alternative is formulated, one may deduce its statistical alternative, but inferring that statistical alternative does not necessarily imply the truth or confirmation of the conceptual hypothesis. In short, if the null is rejected, the statistical alternative is sure to be inferred; this much is clear. What is not a “given” is the inference of the conceptual hypothesis. If the “truth” of the two hypotheses (i.e., the statistical and the conceptual) were equated, one could easily infer conceptual alternatives that have no scientific meaning. Consider the following example illustrating this problem.
In a coin-flip paradigm, let the null hypothesis be that the coin is fair. If, after many trials, the null is indeed rejected, what shall we infer? Any conceptual alternative we hold necessarily implies a statistical alternative hypothesis. But although the statistical alternative may be easily inferred upon rejection of the null, it does not follow that the conceptual alternative can be inferred with equal ease. If p < .05, we will reject the null and infer the statistical alternative; this follows from the Neyman-Pearson model of hypothesis-testing. However, we may not be so willing to infer the conceptual alternative, especially if it is not a plausible explanation for why the null was rejected. For instance, given many consecutive heads, we would hardly infer that the coin is governed by spiritual agents that are invisible in this room. Such a conceptual alternative is unlikely in that it would probably not be suggested by a social scientist. The scientist would more likely infer that the coin is biased because of some physical defect, a more “common-sense” alternative to chance as the cause of the successive heads. What is crucial to note is that the alternative could be almost anything and is not restricted to a particular number of hypotheses. Indeed, the number of logically possible alternative hypotheses is practically unlimited. The point at hand is that although the “spiritual agents” hypothesis would likely receive little attention, it cannot be refuted on the logic of hypothesis-testing alone, and certainly not by any statistical procedure. Simply having inferred a statistical alternative in no way guarantees that we have inferred the correct conceptual alternative. Indeed, whether the statistical alternative provides any justification at all for the conceptual alternative is debatable.
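To make the division of labor concrete, here is a minimal Python sketch of the coin-flip test. It assumes an exact two-sided binomial test, which is our choice since the text does not name a specific test; the function names are hypothetical. Notice what the test accepts as input: only the number of heads and the number of flips. No conceptual alternative (a physical defect, invisible agents) appears as an argument, so nothing the test returns can favor one explanation over another.

```python
from math import comb


def binomial_p_value(heads: int, flips: int, p_null: float = 0.5) -> float:
    """Exact two-sided binomial p-value under H0: P(heads) = p_null.

    Sums the probabilities of every outcome no more likely than the observed one.
    """
    probs = [comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
             for k in range(flips + 1)]
    observed = probs[heads]
    # A tiny relative tolerance keeps outcomes that tie with the observed one.
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-7)))


def decide(heads: int, flips: int, alpha: float = 0.05) -> str:
    """Neyman-Pearson decision: its only possible conclusions are statistical."""
    if binomial_p_value(heads, flips) < alpha:
        return "reject H0; infer the statistical alternative P(heads) != 0.5"
    return "retain H0"


for heads in (14, 15, 20):
    print(f"{heads}/20 heads: p = {binomial_p_value(heads, 20):.3g} -> {decide(heads, 20)}")
```

Under these assumptions, 20 heads in 20 flips gives p = 2 × 0.5^20 (about 1.9 × 10^-6), 15 of 20 gives about 0.041, and 14 of 20 about 0.115. Each call can report only "reject" or "retain". Choosing between a defective coin and invisible agents is left entirely to the researcher.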
To summarize, the statistical alternative can be regarded as a “numerical” alternative to the null hypothesis (e.g., see Harcum, 1990; McClure & Suen, 1994), while the conceptual alternative may be regarded as an “explanation” or “theory” of why the null hypothesis has been rejected. Bolles (1962) offers an excellent explanation of this distinction:
The statistician is confronted with just two hypotheses [i.e., the null and the statistical alternative], and the decision which he makes is only between these two. Suppose he has two samples and is concerned with whether the two means differ. The observed difference can be attributed either to random variation (the null hypothesis) or to the alternative hypothesis that the samples have been drawn from two populations with different means. Ordinarily these two alternatives exhaust the statistician’s universe. The scientist, on the other hand, being ultimately concerned with the nature of natural phenomena, has only started his work when he rejects the null hypothesis. (p. 639)
The differences between the statistical and conceptual alternatives should now be clear. Simply put, rejecting the null provides justified grounds for inferring the statistical alternative, but only minimal grounds (if any) for inferring the conceptual alternative hypothesis. Supposing the null is rejected and the statistical alternative inferred, how shall we arrive at an appropriate conceptual alternative? There is no formal logic (and certainly no statistical logic) for selecting the alternative hypothesis. What, then, constitutes a logical reason for inferring one alternative over another? When inferring the conceptual alternative, the researcher assumes (or hopes) that all extraneous variables have been controlled and accounted for. It is this element of experimental control that gives the alternative any plausible claim to being “correct” or “valid,” hence the preference for controlled laboratory settings in research. The rationale is that if we can control all variables except a single manipulated variable, then we can assert with confidence that our hypothesized result is correct. This hypothesized result is stated in the alternative hypothesis, and upon rejecting the null, we confidently assume the alternative to be correct. In theory, if every variable could be controlled, inferring the alternative hypothesis would not be such “risky business.” I stress, however, that there is always some “guesswork” (especially, but not exclusively, in the social sciences) when inferring the alternative. Regardless of how much control we impose in our experiments, there is always the chance that we have overlooked a crucial variable that is affecting the dependent variable. Cowles (1989) gives an excellent historical example of how this can occur:
The names malaria [i.e., “bad air”], marsh fever, and paludism all reflect the view that the cause of the disease was the breathing of damp, noxious air in swamp lands. The relationship between swamp lands and the incidence of malaria is quite clear. The relationship between swamp lands and the presence of mosquitoes is also clear. But it was not until the turn of the century that it was realized that the mosquito was responsible for the transmission of the malarial parasite and only 20 years earlier, in 1880, was the parasite actually observed. . . . This episode is an interesting example of the control of a concomitant or correlated bias or effect that was the direct cause of the observations. (p. 149)
The preceding passage is a perfect example of how the logic of null hypothesis significance testing typically works. The null is that diseases such as malaria are caused by chance, that is, that the disease strikes individuals at random, governed by mere chance factors. The alternative hypothesis is that the disease is caused by breathing noxious swamp air. Had an actual experiment been performed in this case, we would surely have rejected the null hypothesis, since individuals living close to the swamp would have shown a higher incidence of malaria than those living farther away. Hence, we would have rejected the null hypothesis and had sufficient grounds to infer the statistical alternative (i.e., not the null). However, by inferring the conceptual alternative, that swamp air causes malaria, we would potentially have overlooked other possible predictors (in this case, mosquitoes) that may have produced the observed difference in our dependent variable (i.e., the incidence of malaria). Thus, choosing the “correct” conceptual alternative hypothesis can be a “hit-or-miss” affair.
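The malaria case can be mimicked with a small simulation. This is a hypothetical sketch: the exposure and infection rates are invented for illustration and are not historical figures. By construction, swamp air has no effect at all. Malaria depends only on mosquito bites, which are simply more common near the swamp. A pooled two-proportion z test (our choice of test) still rejects the null decisively, and that statistical conclusion is correct. What the test cannot do is signal that "noxious air" is the wrong conceptual alternative.

```python
import random
from math import erfc, sqrt


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z statistic and its two-sided normal-approximation p-value."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, erfc(abs(z) / sqrt(2))


def simulate_village(n_per_group: int = 2000, seed: int = 7) -> dict[bool, int]:
    """Hypothetical malaria counts for people living near vs. far from a swamp.

    By construction, infection depends ONLY on being bitten by a mosquito;
    living near the swamp merely makes a bite more likely. Swamp air plays no role.
    """
    rng = random.Random(seed)
    cases = {}
    for near_swamp in (True, False):
        p_bitten = 0.6 if near_swamp else 0.1  # assumed exposure rates
        count = 0
        for _ in range(n_per_group):
            bitten = rng.random() < p_bitten
            p_infected = 0.30 if bitten else 0.01  # assumed infection rates
            if rng.random() < p_infected:
                count += 1
        cases[near_swamp] = count
    return cases


n = 2000
cases = simulate_village(n)
z, p = two_proportion_z(cases[True], n, cases[False], n)
print(f"near swamp: {cases[True]}/{n}, far: {cases[False]}/{n}, z = {z:.1f}, p = {p:.1e}")
```

Rejecting the null here is the right statistical call. The error would lie entirely in reading the result as support for "bad air", because the mosquito explanation predicts exactly the same statistical alternative.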
A second example, this one taken from the field of medical research, will help further elucidate the magnitude of the problem.
Beauchemin and Hays (1996) undertook a study of SAD (seasonal affective disorder) in which they investigated the effect of natural sunlight on length of hospital stay in a sample of psychiatric inpatients. The investigators hypothesized that patients in brighter rooms would be discharged sooner than those in dim rooms, presumably because of the light (i.e., treatment) provided naturally by the sun. Their research hypothesis was that patients residing in well-lit rooms would recover from their seasonal depression faster than those residing in dimly-lit rooms. Abstracting data from the two previous years, they found the mean length of stay to be 16.9 days for “bright-room patients” and 18.1 days for “dim-room patients.” A t test of the difference between these means gave p < .05; that is, a difference at least this large would be unlikely if chance alone were operating. The researchers concluded that sunny hospital rooms reduce the time needed to recover from depression. In other words, the conceptual alternative hypothesis was given full credit for the difference in means.
The obvious problem with the above study is that many hypotheses other than the “bright-light” hypothesis could account equally well for the difference in means. This is not to deny that bright light may be associated with the true cause of the difference; it is merely to say that rejecting the null in this case in no way justifies the alternative chosen by the researchers. For instance, it could very well be that increased light is associated with increased reading (since reading requires a certain amount of light), so that patients in well-lit rooms read more than those in dimly-lit rooms. Patients who had the opportunity to read might then have recovered faster than those who did not, with light having only a trivial influence (i.e., providing a suitable environment for reading). If this were the true alternative, then presumably confining future patients to bright rooms with no reading material would have no effect! Another possibility is that patients residing on the side of the hospital where light was prominent received more cordial care from their doctors and nurses; perhaps the rooms on the east side simply happened to have a more affectionate and caring staff than those on the west side. This becomes even more interesting when one considers that the sunlight may actually have “brightened” the day of the staff, making care of the depressed a more enjoyable task. It would then seem perfectly reasonable that bright-room patients got better faster than dim-room patients, since their staff was more uplifting and encouraging. Again, the point is that if the claim of this study is taken literally by psychiatrists, the recommendation is to place depressed patients in sunlit rooms for a speedier recovery. An overstep of the study’s conclusive power? Indeed.
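The reading hypothesis can be written as a hypothetical simulation in Python. Every number here is an assumption made for illustration: the reading rates, a three-day benefit of reading, and the noise level. None of them comes from Beauchemin and Hays (1996). Light has no direct effect on length of stay; it only makes reading more likely. The naive bright-versus-dim comparison nonetheless yields a large Welch t statistic. Holding reading constant, by comparing bright and dim rooms separately among readers and among non-readers, makes the difference essentially vanish. That stratification is a crude form of the experimental control discussed above.

```python
import random
import statistics
from math import sqrt


def welch_t(a: list[float], b: list[float]) -> float:
    """Welch's t statistic for mean(a) - mean(b) (unequal variances)."""
    se = sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def simulate_ward(n_per_room_type: int = 2000, seed: int = 1):
    """Hypothetical ward in which light shortens stays ONLY via reading."""
    rng = random.Random(seed)
    patients = []  # tuples of (bright_room, reads, length_of_stay_days)
    for bright in (True, False):
        for _ in range(n_per_room_type):
            reads = rng.random() < (0.7 if bright else 0.2)  # assumed reading rates
            stay = 19.0 - (3.0 if reads else 0.0) + rng.gauss(0, 3)  # no light term
            patients.append((bright, reads, stay))
    return patients


def stays(patients, bright, reads=None):
    """Lengths of stay for one room type, optionally within one reading level."""
    return [s for b, r, s in patients if b == bright and (reads is None or r == reads)]


ward = simulate_ward()
dim, lit = stays(ward, False), stays(ward, True)
naive_diff = statistics.mean(dim) - statistics.mean(lit)
naive_t = welch_t(dim, lit)

# Hold reading constant: compare dim vs. bright within readers and within non-readers.
within = {
    reads: statistics.mean(stays(ward, False, reads)) - statistics.mean(stays(ward, True, reads))
    for reads in (True, False)
}

print(f"naive: dim - bright = {naive_diff:.2f} days, Welch t = {naive_t:.1f}")
for reads, diff in within.items():
    print(f"reads={reads}: dim - bright = {diff:.2f} days")
```

Stratifying on a variable one already suspects is only a stand-in for control. A confounder nobody thought to measure, such as the staff explanation, would survive it untouched, which is precisely the danger of overlooked variables.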
However, a naive consumer of research, one who is not intimately aware of the multitude of potential alternative hypotheses and who does not know the difference between statistical and conceptual hypotheses, may take the study’s title as fact: “Sunny hospital rooms expedite recovery from severe and refractory depressions.” My argument is that, based on their study, the authors have absolutely no justification for making such a claim. The title implies causation, and one need not be a methodologist to see that an inference of the type made here constitutes not a small leap but a gigantic and largely inappropriate one. Studies of this type can hardly be considered real science.
The preceding discussion suggests that the inferred alternative is nothing more than a “next best guess explanation” given a rejected null hypothesis. This was Fisher’s (1966) main objection to positing an alternative: it is not “exact.” Fisher argued that we have no way of knowing whether the alternative is correct. Although we attempt to control the variables, other than the one hypothesized, that might account for a departure from chance, we are still left with inconclusive support for the alternative hypothesis. Thus the alternative hypothesis can only be inferred, and it can rarely, if ever, be shown to be true.
Referring again to the coin paradigm, recall that should the null (i.e., the hypothesis that the coin is fair) be rejected, we infer the statistical alternative. Given, say, 20 consecutive heads, we are prepared to conclude that the result is not due to chance: something else must account for data as extreme as these. That “something” is what we include in the conceptual alternative hypothesis. In other words, we devise an explanation for why the null was rejected. That the coin is not fair is the most likely explanation. However, there is little direct support for this conclusion. It follows neither from the null nor directly from the statistical alternative. The most we can say is that there is “reason” to infer a conceptual alternative, the reason being that the statistical alternative has been inferred. Yet showing the conceptual alternative hypothesis to be true is a next-to-impossible task. Although an unfair coin may be a reasonable conclusion for the scientist, a reasonable conclusion for an astrologer may be that the planets are aligned so as to produce such a run, regardless of the fairness of the coin. The astrologer could argue that the coin is perfectly fair but is turning up heads because of some astrological event. How can hypothesis-testing logic dismiss this latter possibility? It cannot. Although both the scientist and the astrologer may reject chance (i.e., the null hypothesis) as a plausible explanation, inferring a legitimate alternative hypothesis requires more than mere hypothesis-testing. It requires, among other things, a judgment by the relevant community (whether scientific or astrological, in this case) about which explanation most plausibly accounts for the rejected null.
Inferring the alternative is risky business, and, as Cowles’ malaria example shows, a seemingly correct alternative may later be found incorrect under different circumstances and research interests. In short, an inference of the (conceptual) alternative hypothesis is much less “scientific” or “statistical” than many of us may first assume.
Perhaps most discouraging is that the distinction between the statistical and conceptual alternatives still appears to elude researchers, some statisticians, and even those teaching the subject. My final example is taken from a statistics textbook (Moore & McCabe, 1999) that, apart from the following “glitch,” is an excellent introductory text. In discussing the matched-pairs t test, the authors set up a good study but misinform the student in their explanation of its conclusions. In the example, the National Endowment for the Humanities seeks to improve the skills of high school teachers in understanding foreign languages, specifically French. Before enrolling in the program, 20 teachers are given the Modern Language Association listening test, on which a higher score indicates greater skill in understanding French. Upon completing the program, the teachers take the test again to determine whether improvement occurred. Because this is a matched-pairs t test, the null hypothesis is that the mean difference score (i.e., the difference between post-test and pre-test scores) is 0, so that any departure of the observed mean difference from 0 is due to chance alone. The statistical alternative is that the observed difference is not due to chance alone. Note that this latter statement exhausts the content of the statistical alternative hypothesis. That is, the only thing we can statistically conclude from a rejection of the null is that the difference scores are probably not due to chance. Now consider how the authors explain the results to introductory statistics students: “Software gives the value p = 0.00053. The improvement in listening scores is very unlikely to be due to chance alone. We have strong evidence that the institute was effective in raising scores” (p. 513). False! On the basis of our statistical test alone, we have absolutely no evidence that the institute was effective, any more than we can claim that all the teachers secretly vacationed in Paris between tests.
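The matched-pairs test itself can be sketched in a few lines of Python. The scores below are hypothetical (six teachers rather than the textbook's 20), chosen so that the arithmetic is easy to check by hand. The point is visible in the function signature: the statistic is computed from the difference scores alone. Whatever produced those differences (the institute, a trip to Paris, practice with the test) is not an input, so the test cannot choose among them.

```python
import statistics
from math import sqrt


def paired_t(pre: list[float], post: list[float]) -> float:
    """Matched-pairs t statistic for H0: mean(post - pre) = 0."""
    diffs = [b - a for a, b in zip(pre, post)]
    se = statistics.stdev(diffs) / sqrt(len(diffs))
    return statistics.mean(diffs) / se


# Hypothetical listening-test scores for six teachers, before and after.
pre = [20, 22, 18, 25, 24, 19]
post = [23, 24, 21, 27, 27, 20]

t = paired_t(pre, post)
print(f"t = {t:.2f} on {len(pre) - 1} df")  # t = 7.00 on 5 df
```

With 5 df the two-sided 5% critical value is about 2.571, so H0 would be rejected here. By itself, that rejection licenses only the statistical alternative.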
Of course, once we consider the methodology (the experimental controls imposed, the random selection of teachers, the steps taken to overcome practice effects, etc.), we can begin to build support for our conceptual alternative, especially now that we have rejected chance as an explanation of our data. My point is that the authors present the move from rejecting the null to inferring the conceptual alternative as a single continuous step, which it is far from being. The student takes from the example the lesson that by rejecting the null we have effectively amassed support for the conceptual alternative. As I have shown, this is unequivocally incorrect. This is but one of many examples in statistics textbooks where the distinction between the two hypotheses is not properly explained, if it is mentioned at all.
What should be concluded from the above discussion? That we should not even attempt to infer alternative hypotheses? Certainly not. The lesson of this short exposition is to recognize the “leap” made when inferring conceptual alternative hypotheses, and to be intimately aware of it. Too often the statistical and conceptual alternatives are conflated, and the latter is assumed to be correct merely on the strength of the “statistical truth” of the former. The primary goal of this note was to clarify that these two hypotheses cannot and should not be equated; to do so is a methodological error. Furthermore, as the above examples demonstrate, construing the alternative requires rigorous isolation of experimental variables, and even then it is difficult to conclude that the correct alternative hypothesis (out of a presumably infinite supply) has been selected.
Is there an ideal strategy or method for arriving at the true conceptual alternative? Unfortunately, there is not, beyond ensuring a maximal degree of variable control in our experiments. Every researcher should heed the following principle: an increase in experimental control of variables proportionately increases the probability that the correct conceptual alternative hypothesis will be inferred.4 Hence, more time spent designing research, rather than merely executing it, could pay dividends once the null is rejected. Researchers are urged to pay due attention to the differences between statistical and conceptual hypotheses, and the error of equating the two should be corrected early in the training of aspiring researchers. More attention needs to be given to considering many alternatives given a rejected null, not only the alternative hypothesized by the experimenter. Anyone can reject a null, to be sure. The real skill of the scientist lies in arriving at the true alternative.
Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66, 423-437.
Beauchemin, K., & Hays, P. (1996). Sunny hospital rooms expedite recovery from severe and refractory depressions. Journal of Affective Disorders, 40, 49-51.
Bolles, R. C. (1962). The difference between statistical hypotheses and scientific hypotheses. Psychological Reports, 11, 639-645.
Chow, S. L. (1996). Statistical significance: Rationale, validity and utility. London: Sage Publications.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cowles, M. (1989). Statistics in psychology: An historical perspective. Hillsdale, New Jersey: Lawrence Erlbaum Associates, Publishers.
Fisher, R. A. (1966). The design of experiments. New York: Hafner Publishing Company.
Harcum, E. R. (1990). Distinction between tests of data or theory: Null versus disconfirming results. American Journal of Psychology, 103, 359-366.
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.
Loftus, G. R. (1993). A picture is worth a thousand p values: On the irrelevance of hypothesis testing in the microcomputer age. Behavior Research Methods, Instruments, & Computers, 25, 250-256.
McClure, J., & Suen, H. K. (1994). Interpretation of statistical significance testing: A matter of perspective. Topics in Early Childhood Special Education, 14, 88-100.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.
Moore, D. S., & McCabe, G. P. (1999). Introduction to the practice of statistics (3rd ed.). New York: W. H. Freeman and Company.
Morrison, D. E., & Henkel, R. E. (1969). Significance tests reconsidered. The American Sociologist, 4, 131-140.
Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika, 20A, 175-240.
1. As will be seen, however, the idea of controlling variables is what gives the experimenter confidence in inferring the correct alternative. Not surprisingly, scientists working with rats in well-controlled laboratory settings enjoy this advantage more than do many social scientists studying individual differences in humans. Ethics no doubt plays a role.
2. Obviously there are many factors that led to each theorist’s conclusions about the universe and I in no way mean to overlook these details by the brevity of my discussion. Nor am I saying that Ptolemy was capricious in making the inference of his alternative hypothesis. My point is merely to highlight the importance of serious consideration of the alternative hypothesis, given that it can appear completely correct, yet later be found incorrect.
3. The name “research alternative hypothesis” is of course also commonly used, but it will be avoided in this paper where possible, because it is often interpreted as representing the statistical and the research hypotheses combined. This is exactly the conflation error I seek to expose, and to avoid in my own presentation. Thus, throughout this paper, I use the term “conceptual” when referring to the non-statistical alternative hypothesis.
4. It should be noted once again, however, that even with much control we may still infer an alternative that turns out to be false. The point is that by exercising experimental control to its full extent, we increase our chances of affirming the correct conceptual hypothesis, even if we cannot guarantee that it is correct.
Theory & Science