Risky Business

Unlike the null hypothesis, the alternative hypothesis has been given little methodological attention by social scientists. It is argued that for a complete understanding and appreciation of hypothesis-testing, the nature of the alternative hypothesis must be thoroughly studied. Inferences based on rejecting the null hypothesis are inadequate if an inappropriate alternative is inferred. Examples of inappropriate inferences are discussed from the areas of psychology, cosmology, medicine, education and chance to highlight the generality of the problem. The primary purpose of the article is to stimulate thought about the alternative hypothesis by considering: 1) differences between the conceptual and statistical alternatives; 2) the construing of the alternative; and 3) how best to characterize the alternative hypothesis. It is concluded that it is best characterized as a “next best guess explanation” given a rejected null hypothesis.

Cohen (1994) and others (e.g., Bakan, 1966; Loftus, 1993; Meehl, 1978) have long exposed the serious methodological and philosophical problems associated with null hypothesis significance testing (NHST). Some (e.g., Hunter, 1997) have even advocated the extreme position that the procedure be banned completely from psychology journals. Although most articles aimed at evaluating social science’s primary means of statistical inference focus unduly on the “Fisherian” null hypothesis, few have discussed in much detail the nature of the alternative hypothesis. Largely a product of the Neyman-Pearson (1928) model of hypothesis-testing, the alternative hypothesis has received relatively little attention from methodologists. This is despite the fact that the alternative, not the null, is the primary hypothesis of interest to the investigator (Cohen, 1994). Indeed, few have taken a close look at what constitutes a “methodologically sound” alternative to the null.

The purpose of this article is to clarify the nature of the
alternative hypothesis. Of first priority will be to distinguish
between the “statistical” and “conceptual” alternatives. It will be
argued that this distinction is seldom recognized in research circles.
Too often, the statistical and conceptual hypotheses are conflated to
imply similar statements and are not treated as substantially distinct
from one another. A second priority will be to discuss the actual
“construction” of the alternative hypothesis. Through examples from
various fields, it will be shown that the conceptual alternative is
born out of *experimental control* rather than by any statistical
procedure. Again, the differences between both types of alternatives
(i.e., statistical vs. conceptual) will be used to substantiate this
argument. The examples were chosen from a wide array of scientific
fields (e.g., cosmology, medicine, psychology, education, chance) for
the purpose of demonstrating the generality of the problem. Regardless
of the discipline, where there is a null hypothesis to be rejected,
there is an even more daunting task of inferring a correct alternative.^{1}
Finally, what is thought to be an ideal characterization of the
alternative hypothesis will be presented, that of it being a “next best
guess explanation” in light of a rejected null hypothesis.

The importance of having a firm theoretical and methodological
understanding of the nature of the alternative hypothesis cannot be
overstated. It is no exaggeration to say that researchers continually
rely on the alternative hypothesis for “answers” in the social
sciences. Should these answers be wrong, the foundation on which
scientific progress is based comes under serious scrutiny. Indeed,
multiple “wrong answers” lead not to progress at all, but rather into
a possible regress of knowledge, or quite literally, utter confusion.
The significance of properly considering the alternative hypothesis can
be demonstrated in the historic astronomical debate between geocentric
and heliocentric theories of the universe. Ptolemy, based largely on
evidence gathered by measurement, propounded the geocentric theory in a
form that prevailed for 1400 years. In the realm of null hypothesis
testing, one could say that Ptolemy, upon gathering evidence in the
form of measurement data, rejected the probability that these
measurements could have arisen by mere chance. Instead, he proposed the
existence of a “structure” (i.e., universal structure) behind them.
This structure formed the basis of his alternative hypothesis in which
the sun orbited the earth. Indeed, his alternative to chance appeared
very reasonable at the time, despite the fact that it was later found
utterly incorrect by Copernicus, among others (e.g., Galileo). The
point is that although Ptolemy’s alternative appeared correct for the
given time period and socially acceptable given the beliefs of the
masses, it was nevertheless replaced by Copernicus’s heliocentric
theory. Both famous men can be said to have rejected chance as an explanation of their
measurements, yet they arrived at two competing and virtually opposite
alternatives. Such is the nature of the problem not only for
cosmology, but for all sciences, since one cannot “prove” an
alternative hypothesis; one can only reject chance as a reasonable
explanation, and infer what is considered the more likely explanation.
For Ptolemy, the alternative to chance was that the sun revolves around
the earth. For Copernicus, the alternative was that the earth revolves
around the sun. There is perhaps no greater example in history that
highlights the importance of carefully considering inferred
alternatives.^{2} Before discussing further examples, it will do well to review the two primary types of alternative hypotheses.

A first distinction that is paramount to understanding the
alternative hypothesis is the existence of both a “statistical” and a
“conceptual,” also known as “substantive” (or again, “scientific”)
alternative hypothesis.^{3} As noted
by Bolles (1962), although a rejection of the null implies the
statistical alternative, it does not necessarily imply the *scientific* alternative. These are two very different hypotheses. This difference stems largely from the fact that statistical inference *is not equal to*
scientific inference (Morrison & Henkel, 1969). Chow (1996)
correctly argues that although the statistical alternative may be quite
easily inferred, this inference is much more difficult with regard to
the conceptual alternative, because of what he calls the “reality of
multiple explanations” (p. 53). Indeed, a multitude of explanations
could exist for why the null hypothesis is rejected. In other words, a
number of reasons could account for why a null is found improbable.
According to the Neyman-Pearson model of hypothesis-testing, if the
null is rejected, the alternative is inferred -- that is, the *statistical* alternative. However, this does not and should not directly imply an inference of the *conceptual* alternative. In fact, the inference of the statistical alternative simply suggests the *possibility*
of a conceptual hypothesis that may account for why the null was
rejected. Without a conceptual alternative, rejecting the null has
little meaning. As Chow notes, “multiple conceptual alternative
hypotheses give rise to their respective statistical alternative
hypotheses” (p. 55). This means that in order to have *any*
statistical alternative, you first need to conceive of a conceptual
hypothesis. Without the assumption that a conceptual hypothesis can
account for the rejected null, one would hardly be interested in
inferring a statistical alternative. Once the conceptual alternative is
formulated, you may deduce its statistical alternative, but this does
not necessarily imply the truth or confirmation of the conceptual
hypothesis. In short, if the null is rejected, the statistical
alternative is sure to be inferred -- this much is clear. What is not a
“given” is the inference of the conceptual hypothesis. Should the
“truth” of both hypotheses be equated (i.e., that of the statistical
and conceptual), one could easily infer conceptual alternatives that
have no scientific meaning. Consider the following example illustrating
this problem.

In a coin-flip paradigm, let the null hypothesis be that the coin
is fair. After many trials, if indeed the null is rejected, what shall
we infer? Assuming we have an alternative conceptual hypothesis, this
necessarily implies a statistical alternative hypothesis. Although the
statistical hypothesis may be easily inferred (upon rejection of the
null), this does not necessarily suggest the conceptual can be inferred
with equal ease. If *p* < .05, we will reject the null and infer the *statistical* alternative. This follows from the Neyman-Pearson model of hypothesis-testing. However, we may not be so willing to infer the *conceptual* alternative, especially if it is something that is not a *plausible explanation*
for why the null was rejected. For instance, we would hardly infer an
alternative such as the coin being governed by invisible spiritual agents
in this room (given, say, many consecutive heads). This
conceptual alternative is unlikely in that it would probably not be
suggested by a social scientist. The scientist would more likely infer
something of the nature that the coin is biased, due to a physical
defect. This would be a more “common-sense” alternative to chance
factors having produced the successive heads. What is crucial to note
is that the alternative could consist of almost *anything* and is not restricted to a particular number of hypotheses. Indeed, the number of *logically possible* alternative hypotheses is practically *infinite*.
The point at hand is that although the “spiritual agents” hypothesis
would likely receive little attention, it cannot be refuted based
solely on the logic of hypothesis-testing, and certainly not by any
statistical procedure. Simply because we have inferred a statistical
alternative in no way guarantees that we have inferred the correct
conceptual alternative. Indeed, whether the statistical alternative
implies *any* justification for the conceptual alternative is debatable.
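The coin-flip paradigm can be sketched in a few lines of code. What follows is only an illustration under stated assumptions: the data (16 heads in 20 flips), the function name `binomial_two_sided_p`, and the list of candidate explanations are all hypothetical and are not drawn from any study cited here. The sketch computes an exact two-sided binomial *p* value and then lists several conceptual alternatives, each of which predicts the same statistical alternative.

```python
from math import comb

def binomial_two_sided_p(heads: int, flips: int, p_fair: float = 0.5) -> float:
    """Exact two-sided p-value for the null 'the coin is fair'.

    Sums the probabilities of all outcomes that are no more likely
    than the observed one under the null.
    """
    probs = [comb(flips, k) * p_fair**k * (1 - p_fair)**(flips - k)
             for k in range(flips + 1)]
    observed = probs[heads]
    # Small tolerance guards against floating-point ties.
    return sum(p for p in probs if p <= observed * (1 + 1e-9))

# Hypothetical data: 16 heads in 20 flips.
p_value = binomial_two_sided_p(heads=16, flips=20)
reject_null = p_value < 0.05

# The test licenses only the statistical alternative: P(heads) != 0.5.
# Each conceptual alternative below predicts that same statistical
# alternative, so the p-value cannot rank them.
conceptual_alternatives = [
    "the coin has a physical defect",
    "the flipper's technique favors heads",
    "spiritual agents govern the coin",
]
print(f"p = {p_value:.4f}, reject null: {reject_null}")
for explanation in conceptual_alternatives:
    print("consistent with the rejection:", explanation)
```

Nothing in the computation refers to the content of any explanation. Choosing among them is left entirely to considerations outside the test.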

To summarize, the statistical alternative can be regarded as a
“numerical” alternative (e.g., see Harcum, 1990; McClure & Suen,
1994) to the null hypothesis, while the conceptual alternative may be
regarded as an “explanation” or “theory” as to the *reason* why the null hypothesis has been rejected. Bolles (1962) offers an excellent explanation of this distinction:

The statistician is confronted with just two hypotheses [i.e., the null and the statistical alternative], and the decision which he makes is only between these two. Suppose he has two samples and is concerned with whether the two means differ. The observed difference can be attributed either to random variation (the null hypothesis) or to the alternative hypothesis that the samples have been drawn from two populations with different means. Ordinarily these two alternatives exhaust the statistician’s universe. The scientist, on the other hand, being ultimately concerned with the nature of natural phenomena, has only started his work when he rejects the null hypothesis. (p. 639)

The differences between the statistical and conceptual alternatives
should now be clear. Simply put, a rejection of the null represents
justified reason for inferring the statistical alternative, but
presents only minimal reason (if any) for inferring the conceptual
alternative hypothesis. Supposing the null is rejected and the
statistical alternative inferred, how shall we arrive at an appropriate
conceptual alternative? There is no formal logic (and certainly no
statistical logic) concerning how to select or choose the alternative
hypothesis. What then, constitutes a logical reason for inferring one
alternative over another? When inferring the conceptual, the researcher
assumes (or hopes) that all extraneous variables have been controlled
and accounted for. It is this element of *experimental control* that gives the alternative *any*
plausible sense of being “correct” or “valid” -- thus the preference
for the laboratory “controlled” settings for research. The rationale is
that if we can control all variables except for a single manipulated
variable, then we can assert with confidence that our hypothesized
result is correct. This hypothesized result is stated in the
alternative hypothesis, and upon rejecting the null, we confidently
assume the alternative to be correct. Theoretically, if *every*
variable were able to be controlled, then inferring the alternative
hypothesis would not be such “risky business.” I stress however that
there is always some “guesswork” (especially, but not exclusively in
the social sciences) when inferring the alternative. Regardless of how
much experimental control we impose in our experiments, there is always
the chance that we have overlooked a crucial variable that may be
having an effect on the dependent variable. Cowles (1989) gives an
excellent historical example of how this can occur:

The names *malaria* [i.e., “bad air”], *marsh fever*, and *paludism* all reflect the view that the cause of the disease was the breathing of damp, noxious air in swamp lands. The relationship between swamp lands and the incidence of malaria is quite clear. The relationship between swamp lands and the presence of mosquitoes is also clear. But it was not until the turn of the century that it was realized that the mosquito was responsible for the transmission of the malarial parasite and only 20 years earlier, in 1880, was the parasite actually observed. . . . This episode is an interesting example of the control of a concomitant or *correlated* bias or effect that was the direct cause of the observations. (p. 149)

In the preceding selection, we have a perfect example of how the logic of null hypothesis significance testing typically works. The null is that diseases such as malaria are caused by chance, that is, that the disease occurs in individuals randomly, governed by mere chance factors. The alternative hypothesis is that the disease is caused by breathing in noxious swamp air. In this case, had an actual experiment been performed, we would surely have rejected the null hypothesis, since after all, those individuals living close to the swamp would have had a higher incidence of malaria than those individuals living farther away from the swamp. Hence, we would have rejected the null hypothesis and would have on sufficient grounds inferred the statistical alternative (i.e., not the null). However, by inferring the conceptual alternative, that of swamp air causing malaria, we have potentially overlooked other possible predictors (such as mosquitoes, in this case) that may have produced the observed difference in our dependent variable (i.e., incidence of malaria). Thus, choosing the “correct” conceptual alternative hypothesis can be a “hit-or-miss” affair.
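The malaria episode can be mimicked with a small simulation. The sketch below is purely illustrative: the probabilities, the sample size, and the function names `simulate_person` and `incidence` are assumptions of mine, not historical figures. By construction, swamp air has no effect at all in this simulated world; only mosquito bites transmit the disease, and swamps merely raise the chance of being bitten.

```python
import random
from math import sqrt

random.seed(0)  # fixed seed so the sketch is reproducible
N = 5000        # hypothetical number of people per group

def simulate_person(near_swamp: bool) -> bool:
    """Return True if the simulated person contracts malaria.

    Assumed mechanism: swamps raise exposure to mosquitoes, and only
    mosquito bites transmit the disease. Swamp air itself does nothing.
    """
    bitten = random.random() < (0.60 if near_swamp else 0.05)
    return bitten and random.random() < 0.50

def incidence(near_swamp: bool, n: int = N) -> float:
    """Proportion of n simulated people who contract malaria."""
    return sum(simulate_person(near_swamp) for _ in range(n)) / n

near = incidence(near_swamp=True)
far = incidence(near_swamp=False)

# Two-proportion z statistic (equal group sizes, pooled variance).
pooled = (near + far) / 2
z = (near - far) / sqrt(pooled * (1 - pooled) * 2 / N)

print(f"incidence near swamp: {near:.3f}, far from swamp: {far:.3f}")
print(f"z = {z:.1f}")
```

Under these assumptions the difference in incidence is large and chance is easily rejected, yet the “bad air” explanation is false in the very world that produced the data. The statistical alternative is justified; the conceptual one is not.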

A second example, this one taken from the field of medical research, will help further elucidate the magnitude of the problem.

Beauchemin and Hays (1996) undertook a study on SAD (seasonal
affective disorder) in which they investigated the effect of natural
sunlight on the duration of hospital residence in a sample of
psychiatric inpatients. The investigators hypothesized that those
patients in brighter rooms would be discharged sooner than those in dim
rooms, presumably because of light (i.e., treatment) provided naturally
by the sun. Their research hypothesis was that those patients who
resided in well-lit rooms would recover at a faster rate from their
seasonal depression than those residing in dimly-lit rooms. By
abstracting data from two previous years, they found the mean length of
stay for “bright-room patients” to be 16.9 days and the mean length of
stay for “dim-room patients,” 19.5 days. The resulting *t* statistic, testing the difference between means, had a probability of *p*
< .05 of occurring by chance alone. The researchers concluded that
sunny hospital rooms reduce the latency in recovering from depression.
In other words, the alternative hypothesis was given full credit in
accounting for the difference in means.

The obvious problem in the above study is that there are an
infinite number of very good hypotheses, other than the “bright-light”
hypothesis that can equally account for the difference in means. This
is not to say that bright light may not be a *confounding variable*
to the true cause behind the difference, but rather is merely to say
that rejecting the null in this case in absolutely no way justifies the
alternative chosen by the researchers. For instance, it could very well
be that increased light is associated with increased reading (since
reading requires a certain amount of light), and those patients in
well-lit rooms read more than those in dimly-lit rooms. Thus, patients
who had the opportunity to read recovered at a faster rate than those
who did not read, and light had only a trivial influence (i.e.,
providing a suitable environment for reading). If this were the true
alternative, then presumably enclosing future patients in bright rooms
with no literature would have no effect! Another possibility is that
those patients residing on the side of the hospital where light was
prominent received more cordial care from their doctors and nurses.
Thus, those rooms on the East side happen to have a more affectionate
and caring staff than those on the West side. This becomes even more
interesting when one considers the possibility that maybe the sunlight
actually “brightened” the day of the staff, and thus made caring for
the depressed a more enjoyable task. Hence, it would seem perfectly
reasonable that these bright-room depressed patients would get better
faster than dim-room depressed patients, since they had a staff that
was more uplifting and encouraging than the staff attending the dim
rooms. Again, the point is that if the claim of this study is taken
literally by psychiatrists, the recommendation is to place depressed
patients in sun-lit rooms for a speedier recovery. An overstep of the
study’s conclusive power? Indeed. However, a naive interpreter of
research, one who is aware neither of the multitude of potential
alternative hypotheses that *may* exist nor of
the differences between statistical and conceptual hypotheses, may
take the study’s title as factual: “Sunny hospital rooms expedite
recovery from severe and refractory depressions.” My argument is that
based on their study, the authors have absolutely no justification in
making such a claim. Causation is implied in the title, and one need
not be a methodologist to know that making an inference of the type
made here constitutes not a small leap, but rather a gigantic and
largely inappropriate one. Studies of this type can hardly be
considered real science.

The preceding discussion can be summarized to suggest that the
inferred alternative is nothing more than a “next best guess
explanation” given the falsity of the null hypothesis. That was
Fisher’s (1966) main problem with positing an alternative, that of it
not being “exact.” Fisher argued that we have no way of knowing if the
alternative is correct. Although we attempt to control for variables
that may be responsible for discounting chance, we are still left with
inconclusive support for the alternative hypothesis. Thus, literally,
the alternative hypothesis must be *inferred* and can rarely, if ever, be shown to be true.

Referring again to the coin paradigm, recall that should the null
be rejected (i.e., the hypothesis that the coin is fair), we
consequently infer the statistical alternative. Given say, 20
consecutive heads, we are prepared to conclude that the result is not
due to chance -- something else must account for data as extreme as
these. That “something” we include in the alternative *conceptual* hypothesis. In other words, we devise an explanation for *why* the null was rejected. That the coin is not fair is a most likely explanation. However, there is little *direct* support for this conclusion. It is not supported by the null and is not *directly*
supported by the statistical alternative. The most we can say is that
there is “reason” to infer a conceptual alternative. The reason is that
the statistical alternative has been inferred. Yet, showing the
conceptual alternative hypothesis to be true is a next to impossible
task. Although the coin being unfair may be a reasonable conclusion for
the scientist, a reasonable conclusion for an astrologer may be that
the planets are aligned in such a way as to produce such a succession,
regardless of the fairness of the coin. The astrologer could argue that
the coin is perfectly fair, but is turning up heads because of some
astrological event. How can hypothesis-testing logic dismiss this
latter possibility? It cannot. Although both the scientist and the
astrologer may reject chance (i.e., the null hypothesis) as a plausible
explanation, to infer a legitimate alternative hypothesis requires more
than mere hypothesis-testing. It requires among other things, what the
general community (whether scientific or astrological, in this case)
considers to be a *plausible explanation* that best accounts for the rejected null.
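To make the coin example concrete, the probability of the observed run under the null can be written out. This is only a worked figure under the usual assumption of independent flips of a fair coin; the symbol $H_0$ for the null is my notation.

```latex
P(\text{20 heads in 20 flips} \mid H_0)
  = \left(\tfrac{1}{2}\right)^{20}
  = \frac{1}{1{,}048{,}576}
  \approx 9.5 \times 10^{-7}
```

This figure is a statement about the null alone. The scientist and the astrologer can both accept it, and it offers no guidance as to which conceptual alternative should replace chance.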

Inferring the alternative is risky business, and as evinced by Cowles’ malaria example, a seemingly correct alternative may be later found incorrect given different circumstances and research interests. In short, an inference of the alternative hypothesis (i.e., conceptual) is much less “scientific” or “statistical” than many of us may first assume.

Perhaps most discouraging is the fact that the distinction between the
statistical and conceptual alternatives appears to still elude
researchers, some statisticians, and even those teaching the subject.
My final example is taken from a statistics textbook (Moore &
McCabe, 1999) which, other than the following “glitch,” is an
excellent introductory text. In discussing the matched-pairs *t* test,
the authors set up a good study, but misinform the student in their
explanation of its conclusions. In the paradigm, the National Endowment
for the Humanities seeks to improve the skills of high school teachers
in understanding foreign languages, specifically French. Before being
enrolled in the program, 20 teachers are given the Modern Language
Association listening test. A higher score indicates a greater skill in
understanding French. Upon completion of the program, the teachers are
once again given the test to discern whether improvement occurred.
Because it is a matched-pairs *t* test, the null hypothesis is
that any departure of the difference scores (i.e., differences between pre- and
post-test scores) from 0 is due to chance alone. The statistical
alternative is that any calculated difference is *not* due to
chance alone. Note that in this latter statement we have effectively
exhausted the statistical possibilities of the statistical alternative
hypothesis. That is, the only thing we can statistically conclude from
a rejection of the null is that the difference scores are probably not
due to chance. Now, consider how the authors explain the results to
introductory statistics students: “Software gives the value *p* =
0.00053. The improvement in listening scores is very unlikely to be due
to chance alone. We have strong evidence that the institute was
effective in raising scores” (p. 513). False! Based on our statistical
test, we have *absolutely no evidence to conclude the institute was effective*,
no more than we can claim that all teachers secretly vacationed in
Paris between tests. Of course, if we speak about the methodology, the
nature of experimental controls imposed, the random selection of
teachers, the steps taken to overcome practice effects, etc., we can
begin to build a supportive claim for our conceptual alternative,
especially now that we’ve rejected chance as a plausible explanation for our
data. My point is that the authors make the transition from rejecting
the null to the inference of the conceptual alternative a *continuous*
process, which it is not. The student takes from the example the
knowledge that by rejecting the null, we have effectively amassed
support for the conceptual alternative. As I have shown, this is
unequivocally incorrect. This is but one of many examples in statistics
textbooks where the distinction between hypotheses is not properly
explained, if mentioned at all.
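The matched-pairs computation itself shows how little the test knows about the study. The following is a minimal sketch, assuming made-up pre- and post-test scores for eight teachers (these are not the textbook’s data) and a hypothetical helper named `paired_t`. It computes the *t* statistic from the difference scores and nothing else.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(pre: list[float], post: list[float]) -> tuple[float, int]:
    """Return the matched-pairs t statistic and its degrees of freedom."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    # Mean difference divided by its estimated standard error.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Hypothetical pre- and post-test scores for 8 teachers.
pre = [30, 28, 31, 26, 20, 30, 34, 15]
post = [33, 29, 35, 27, 24, 31, 36, 18]

t, df = paired_t(pre, post)
print(f"t = {t:.2f} on {df} degrees of freedom")
```

The function receives two lists of numbers. Whether the gains came from the institute, from practice with the test, or from a holiday in Paris never enters the calculation, so the resulting statistic would be identical under every one of those explanations.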

What should be concluded from the above discussion? Is it that we should not even attempt inferences of alternative hypotheses? Certainly not. What this short exposition asks is that researchers recognize the “leap” made when inferring conceptual alternative hypotheses, and remain intimately aware of it. Too many times the statistical and conceptual alternatives are conflated, and the latter is assumed to be correct based merely on the “statistical truth” of the former. The primary goal of this note was to clarify that these two hypotheses cannot and should not be equated. To do so constitutes a methodological error. Furthermore, as demonstrated by the above examples, construing the alternative requires rigid isolation of experimental variables, and even then it is difficult to conclude that the correct alternative hypothesis (out of a presumably infinite supply) has been selected.

Is there an ideal strategy or method for arriving at the true
conceptual alternative? Unfortunately, the answer to this is no. There
is no strategy except for ensuring a maximal degree of variable-control
in our experiments. Every researcher should heed the following
principle: an increase in experimental control of variables
proportionately increases the probability that the correct conceptual
alternative hypothesis will be inferred.^{4}
Hence, more time spent designing research rather than merely executing
research could pay dividends once the null is rejected. It is
recommended that researchers pay due attention to the differences
between statistical and conceptual hypotheses. Furthermore, the error
of equating both hypotheses should be corrected early in the training
of aspiring researchers. More attention needs to be aimed at
considering many alternatives given a rejected null, and not only the
alternative hypothesized by the experimenter. Anyone can reject a null,
to be sure. The real skill of the scientist is arriving at the true
alternative.

Bakan, D. (1966). The test of significance in psychological research. *Psychological Bulletin*, *66*, 423-437.

Beauchemin, K., & Hays, P. (1996). Sunny hospital rooms expedite recovery from severe and refractory depressions. *Journal of Affective Disorders*, *40*, 49-51.

Bolles, R. C. (1962). The difference between statistical hypotheses and scientific hypotheses. *Psychological Reports*, *11*, 639-645.

Chow, S. L. (1996). *Statistical significance: Rationale, validity and utility*. London: Sage Publications.

Cohen, J. (1994). The earth is round (*p* < .05). *American Psychologist*, *49*, 997-1003.

Cowles, M. (1989). *Statistics in psychology: An historical perspective*. Hillsdale, New Jersey: Lawrence Erlbaum Associates, Publishers.

Fisher, R. A. (1966). *The design of experiments*. New York: Hafner Publishing Company.

Harcum, E. R. (1990). Distinction between tests of data or theory: Null versus disconfirming results. *American Journal of Psychology*, *103*, 359-366.

Hunter, J. E. (1997). Needed: A ban on the significance test. *Psychological Science*, *8*, 3-7.

Loftus, G. R. (1993). A picture is worth a thousand *p* values: On the irrelevance of hypothesis testing in the microcomputer age. *Behavior Research Methods, Instruments, & Computers*, *25*, 250-256.

McClure, J., & Suen, H. K. (1994). Interpretation of statistical significance testing: A matter of perspective. *Topics in Early Childhood Special Education*, *14*, 88-100.

Meehl, P. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. *Journal of Consulting and Clinical Psychology*, *46*, 806-834.

Moore, D., & McCabe, G. (1999). *Introduction to the practice of statistics* (3rd ed.). New York: W. H. Freeman and Company.

Morrison, D. E., & Henkel, R. E. (1969). Significance tests reconsidered. *The American Sociologist*, *4*, 131-140.

Neyman, J., & Pearson, E. S. (1928). On the use and
interpretation of certain test criteria for purposes of statistical
inference (part 1). *Biometrika*, *20A*, 175-240.

1. As will be seen however, the idea of controlling variables is what gives the experimenter the confidence in inferring the correct alternative. Not surprisingly, those scientists working with rats in well-controlled lab settings enjoy this advantage more than do many social scientists studying individual differences in humans. Ethics no doubt plays a role.

2. Obviously there are many factors that led to each theorist’s conclusions about the universe and I in no way mean to overlook these details by the brevity of my discussion. Nor am I saying that Ptolemy was capricious in making the inference of his alternative hypothesis. My point is merely to highlight the importance of serious consideration of the alternative hypothesis, given that it can appear completely correct, yet later be found incorrect.

3. The name “research alternative hypothesis” is of course also commonly used, but it will be avoided in this paper where possible. The reason for this is because it is often interpreted as representing both the statistical and the research hypotheses combined. This is exactly the conflation error I seek to expose, but at the same time avoid in my own presentation. Thus, throughout this paper, I consistently use the term “conceptual” when referring to the non-statistical alternative hypothesis.

4. It should once again be noted however, that even given much control, we may still infer an alternative that will turn out to be false. The point is that by exercising experimental control to its full extent, we increase our chances of affirming the correct conceptual hypothesis, even if we cannot fully guarantee it is correct.

Theory & Science