Journal of Clinical Epidemiology 59 (2006) 964–969
Testing multiple statistical hypotheses resulted in spurious associations:
Peter C. Austin, Muhammad M. Mamdani, David N., Janet E.
a Institute for Clinical Evaluative Sciences, G1 06, 2075 Bayview Avenue, Toronto, Ontario, M4N 3M5 Canada
b Department of Public Health Sciences, University of Toronto, Toronto, Ontario, Canada
c Department of Health Policy, Management and Evaluation, University of Toronto, Canada
d Faculty of Pharmacy, University of Toronto, Canada
e Clinical Epidemiology and Health Care Research Program (Sunnybrook & Women’s College Site), Canada
f Division of General Internal Medicine, Sunnybrook & Women’s College Health Sciences Centre and the University of Toronto, Canada
Objectives: To illustrate how multiple hypotheses testing can produce associations with no clinical plausibility.
Study Design and Setting: We conducted a study of all 10,674,945 residents of Ontario aged between 18 and 100 years in 2000. Residents were randomly assigned to equally sized derivation and validation cohorts and classified according to their astrological sign. Using the derivation cohort, we searched through 223 of the most common diagnoses for hospitalization until we identified two for which subjects born under one astrological sign had a significantly higher probability of hospitalization compared to subjects born under the remaining signs combined (P < 0.05).
Results: We tested these 24 associations in the independent validation cohort. Residents born under Leo had a higher probability of gastrointestinal hemorrhage (P = 0.0447), while Sagittarians had a higher probability of humerus fracture (P = 0.0123) compared to all other signs combined. After adjusting the significance level to account for multiple comparisons, none of the identified associations remained significant in either the derivation or validation cohort.
Conclusions: Our analyses illustrate how the testing of multiple, non-prespecified hypotheses increases the likelihood of detecting implausible associations. Our findings have important implications for the analysis and interpretation of clinical studies. © 2006 Elsevier Inc. All rights reserved.
Keywords: Subgroup analyses; Multiple comparisons; Hypothesis testing; Astrology; Data mining; Statistical methods
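The adjustment for multiple comparisons mentioned in the Results can be illustrated with a short calculation. The following is a minimal sketch, not the authors' code: it assumes the 24 tests are independent and that every null hypothesis is true, and the function names are hypothetical. It computes the chance of at least one false positive at a per-test level of 0.05, the per-test level that would hold the overall type I error at 0.05, and the Bonferroni level, and then checks the first figure by simulating uniformly distributed P-values.

```python
import math
import random


def familywise_error(alpha: float, m: int) -> float:
    """Probability of at least one false positive among m independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** m


def exact_per_test_level(alpha: float, m: int) -> float:
    """Per-test level that keeps the overall type I error at alpha (independent tests).

    Equal to 1 - (1 - alpha)**(1/m); expm1/log1p keep precision when m is large.
    """
    return -math.expm1(math.log1p(-alpha) / m)


def bonferroni_level(alpha: float, m: int) -> float:
    """Bonferroni per-test level: alpha divided by the number of tests."""
    return alpha / m


def simulated_familywise_error(alpha: float, m: int, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo check: under the null, each P-value is uniform on (0, 1)."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < alpha for _ in range(m))
        for _ in range(trials)
    )
    return hits / trials


# 24 hypotheses carried from the derivation cohort to the validation cohort.
print(f"P(at least one false positive), 24 tests: {familywise_error(0.05, 24):.3f}")
print(f"Simulated value:                          {simulated_familywise_error(0.05, 24):.3f}")
print(f"Exact per-test level, 24 tests:           {exact_per_test_level(0.05, 24):.5f}")
print(f"Bonferroni per-test level, 24 tests:      {bonferroni_level(0.05, 24):.5f}")
# 223 diagnoses x 66 pairwise comparisons = 14,718 implicit comparisons.
print(f"Exact per-test level, 14,718 tests:       {exact_per_test_level(0.05, 223 * 66):.9f}")
```

Under these assumptions the analytic values correspond to the 70.8%, 0.00213, 0.00208, and 0.000003485 figures quoted in Section 4.2; the simulated value varies slightly with the seed and the number of trials.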
* Corresponding author. Tel.: +1-416-480-6131; fax: +1-416-480-
0895-4356/06/$ – see front matter © 2006 Elsevier Inc. All rights reserved. doi: 10.1016/j.jclinepi.2006.01.012
The second International Study of Infarct Survival (ISIS-2) demonstrated that the use of aspirin during the acute phase of acute myocardial infarction reduced mortality in a group of more than 17,000 patients. A subgroup analysis demonstrated that aspirin increased mortality of patients born under the astrological sign of Gemini or Libra. This biologically implausible finding reinforced the authors’ contention that frivolous subgroup analyses should
Although the subgroup analysis in the ISIS-2 trial was intended as an amusing illustration of a fundamental statistical construct, other investigators have examined the effect of astrologic signs more rigorously. For example, Gurm and Lauer conducted a study to examine the belief that those born under the sign of Leo are “big-hearted” and at increased risk for heart disease. They examined 32,386 patients who underwent exercise stress testing at the Cleveland Clinic between 1990 and 1999 and found a slight excess of deaths among Leos (9.6% vs. 8.7%). This effect disappeared in a matched propensity score analysis (P = 0.3). Furthermore, they found no correlation between astrological signs and
While an undue reliance on astrologic phenomena as a guide to health and healthcare may put subjects at risk for adverse outcomes, we examined the relationship between birth sign and health outcomes with a different intent. The purpose of the current study was to demonstrate the pitfalls of multiple hypothesis testing and of conducting analyses without prespecified hypotheses. We hypothesized
that we could generate numerous statistically significant associations, but that these would be neither reproducible nor biologically plausible. For illustrative purposes, we studied the association between astrological signs and health.
We conducted a population-based retrospective cohort study using administrative databases covering 10,674,945 residents of Ontario aged 18–100 years. The Registered Person’s Database (RPDB) contains basic demographic data on all residents of Ontario, Canada. We extracted information on all residents of Ontario who were between the ages of 18 and 100 in 2000 and who were alive on their birthday in 2000. We then randomly assigned these individuals to equally sized derivation and validation cohorts. From the birth date, we determined the astrological sign under which
The Canadian Institute for Health Information (CIHI) hospital discharge abstract database contains data on all hospital separations in the province of Ontario. We examined all admissions to Ontario hospitals among subjects aged 18 to 100 years during a 2-year period (January 1, 2000 to December 31, 2001) that were classified as either urgent or emergent admissions (i.e., elective or planned admissions were excluded). Each admission was classified according to the most responsible diagnosis, using the first three digits of the ICD-9 coding scheme. Diagnoses were then ranked from most frequent to least frequent. Both the CIHI discharge abstract database and the RPDB database contain encrypted versions of residents’ health card numbers, permitting the two databases to be deterministically linked in an anonymous fashion.
Beginning with the most frequently occurring urgent or emergent diagnosis for hospitalization, we determined whether persons in the derivation cohort were hospitalized with that diagnosis in the 365 days following their birthday in 2000. We then determined the proportion of subjects born under each astrological sign who were hospitalized with that same diagnosis in the year subsequent to their birthday in 2000. We then identified the astrological sign with the highest hospitalization rate for that diagnosis. We then determined whether the probability of admission for that diagnosis was statistically significantly different for residents born under this astrological sign than for residents born under all other astrological signs combined (i.e., we compared the probability of admission between residents born under one astrological sign and residents born under all other signs, which is a two-sample comparison of binomial proportions). Statistical significance was assessed using Fisher’s exact test, and a two-tailed significance level of 0.05 was used to denote statistical significance. This process was repeated for all diagnoses, beginning with the most frequent, until two diagnoses were identified for each astrological sign. This phase of the study served as the hy-
In the validation cohort, we explicitly tested the 24 hypotheses associating astrological sign and illness that were
The number of Ontario residents who were aged between 18 and 100 years and who were alive on their birthday in 2000 was 10,674,945. The derivation cohort included 5,337,472 residents and the validation cohort included 5,337,473 residents. There were 895 diagnoses for which patients had emergent and urgent hospitalizations between January 1, 2000 and December 31, 2001.
In the derivation cohort, it was necessary to search sequentially through admissions for the 223 most common causes for hospitalization to identify two diagnoses for which the probability of hospitalization was statistically significantly greater for residents born under each astrological sign compared to residents born under the remaining 11 astrological signs. These 223 diagnoses accounted for 91.8% of all urgent and emergent hospitalizations in Ontario in 2000 and 2001. Of these 223 diagnoses, there were 72 (32.3%) for which residents born under one astrological sign had a significantly higher probability of hospitalization compared to residents born under the other astrological signs combined (P < 0.05). The number of diagnoses for which residents born under a given astrological sign had a significantly higher probability of hospitalization compared to residents born under the 11 other astrological signs combined ranged from a low of 2 (Scorpio) to a high of 10 (Taurus), with a mean of 6 diagnoses for each astrological sign. The P-values for the 72 significant associations ranged from 0.0003 to 0.0488. The two most frequently occurring diagnoses for which each astrological sign had a higher probability of hospitalization compared to the other astrological signs combined are described in Table 1. The P-values for testing the significance of the association between a particular astrological sign and the probability of the diagnosis-specific admission ranged from 0.0006 to 0.0475 among these 24 potential associations. In Table 1, we also report the relative risk comparing the probability of hospital admission for residents born under the given astrological sign with the probability of hospital admission for residents born under all other astrological signs combined. The relative risks ranged from a low of 1.10 to a high of 1.80. For example, the probability of hospitalization for lymphoid leukemia was 80% greater for Scorpios than it was for residents born under the 11 other astrological signs combined.
We tested the associations identified in Table 1 in the validation cohort. Of the 24 associations identified in the derivation cohort, only 2 remained statistically significant in the validation cohort. In the validation cohort, residents born under the sign of Leo had a significantly higher probability of hospitalization due to gastrointestinal hemorrhage
Table 1. Diagnoses for which residents with given astrological sign had a higher probability of hospitalization compared to residents born under the remaining astrological signs combined: results from derivation cohort
Intestinal infections due to other organisms
Intestinal obstruction without mention of hernia
Encounter for other and unspecified procedure and aftercare
Other ill-defined and unknown causes of morbidity and mortality
Other acute and subacute forms of ischemic heart disease
Abbreviation: NEC = not elsewhere classified.
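Each association in Table 1 is summarized by a relative risk and a P-value from Fisher's exact test. The sketch below shows one way such quantities can be computed from a 2 × 2 table of counts. It is an illustration only: the counts are hypothetical, the function names are invented for this example, and the two-sided P-value uses the common convention of summing the probabilities of all tables that are no more likely than the observed one, which the article does not specify.

```python
import math


def relative_risk(a: int, b: int, c: int, d: int) -> float:
    """Risk in the index group divided by risk in the comparison group.

    a, b: hospitalized / not hospitalized among residents born under the sign
    c, d: hospitalized / not hospitalized among residents born under all other signs
    """
    return (a / (a + b)) / (c / (c + d))


def _log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k), stable for large n."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test P-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def log_p(x: int) -> float:
        # Hypergeometric log-probability that the top-left cell equals x,
        # with all row and column totals held fixed.
        return _log_comb(row1, x) + _log_comb(row2, col1 - x) - _log_comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    observed = log_p(a)
    # Sum every table at least as extreme (no more probable) than the observed one;
    # the small tolerance absorbs floating-point noise between tied tables.
    total = sum(
        math.exp(lp)
        for lp in (log_p(x) for x in range(lo, hi + 1))
        if lp <= observed + 1e-7
    )
    return min(1.0, total)


# Hypothetical counts, chosen only to illustrate the calculation.
sign_hosp, sign_not = 60, 99_940            # residents born under one sign
others_hosp, others_not = 500, 1_099_500    # residents born under the other 11 signs

rr = relative_risk(sign_hosp, sign_not, others_hosp, others_not)
p = fisher_exact_two_sided(sign_hosp, sign_not, others_hosp, others_not)
print(f"relative risk = {rr:.2f}, two-sided P = {p:.4f}")
```

Working with log-probabilities keeps the calculation stable at cohort sizes in the millions, at the cost of a loop over every admissible value of the top-left cell.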
compared to other residents of Ontario, with a relative risk of 1.15 (P = 0.0483). Similarly, residents born under the sign of Sagittarius had a significantly higher probability of hospitalization for fractures of the humerus compared to residents born under the remaining 11 astrological signs, with a relative risk of 1.38 (P = 0.0125). The remaining 22 associations were no longer significant in the validation cohort.
We identified at least two diagnoses for which Ontario residents born under each astrological sign had a significantly higher probability of hospitalization compared to residents born under the remaining astrological signs combined. Two of these 24 associations remained statistically significant when tested in an independent validation cohort. These observations yield several important lessons about hypothesis testing, study design, and the interpretation of
4.1. The pitfalls of multiple significance tests
First, it was relatively simple to generate numerous statistically significant associations when we examined a large number of potential associations. We began the study with no prespecified hypotheses. Rather, we searched sequentially through a list of diagnosis codes until at least two diagnoses had been found for each astrological sign, for which residents born under that sign were significantly more likely to be hospitalized compared to residents born under the remaining astrological signs combined. This exercise implicitly involved multiple comparisons for each diagnosis. For each astrological sign, we computed the proportion of persons born under that sign who were hospitalized for that diagnosis in the year subsequent to their birthday in 2000. We then selected the astrological sign for which persons born under that sign had the highest probability of hospitalization. This implicitly involved 66 pairwise comparisons, because there are $\binom{12}{2} = 66$ ways of selecting distinct pairs from a set of 12 objects.
The finding that 22 of 24 statistically significant findings generated in the derivation cohort were not confirmed in the validation cohort illustrates the dangers inherent in studies involving multiple, non-prespecified hypotheses.
4.2. Adjusting P-values for multiple comparisons
Second, our observation that two of the associations identified in the derivation set were confirmed in the
validation set does not necessarily provide evidence that those born under the sign of Leo have a significantly higher probability of hospitalization for gastrointestinal hemorrhage, or that those born under the sign of Sagittarius have a higher probability of hospitalization for fractures of the humerus. Under the null hypothesis, P-values are uniformly distributed between 0 and 1. The likelihood of a type I error (identifying a statistically significant association where none exists) is 5% when using a 0.05 significance level. When testing 24 hypotheses in which the null hypothesis is true, the likelihood that at least one will be found to be significant simply by chance is 70.8%. Thus, by not making appropriate adjustments for the testing of multiple hypotheses, we greatly increased our risk of falsely “uncovering” an association between astrological sign and illness. Had we instead endeavored to preserve an overall type I error rate of 0.05, we would have had to use a significance level of 0.00213 for each of the 24 individual hypothesis tests (this is marginally less conservative than a Bonferroni correction, which would have used a significance level of 0.05/24 = 0.00208; both methods require that the multiple comparisons be independent of one another). Using this significance level, none of the 24 hypothesized associations would have been significant in the validation cohort. Sankoh et al. discuss the relative merits of different methods in adjusting for the testing of multiple endpoints in clinical trials. In particular, they note that the Bonferroni adjustment (which is an approximation to our exact method) ignores most of the information from the data and is too conservative when there are many outcomes. Bender and Lange provide an overview of methods to adjust for multiple testing in medical and epidemiologi-
Similarly, in the derivation cohort, there were implicitly 14,718 comparisons (223 diagnoses × 66 pairwise comparisons per diagnosis). To retain an overall 5% type I error rate, one would need to use a significance level of 0.000003485 for an individual hypothesis test. Using this significance level, none of the 72 associations identified in the derivation cohort would have been identified as statistically significant. We should note that there were 72 diagnoses for which the astrological sign with the highest probability of hospitalization had a significantly higher probability of hospitalization compared to that for the remaining astrological signs combined. It is highly likely that there were other astrological signs (but not the one with the highest probability of hospitalization) that had a significantly higher probability of hospitalization compared to residents born under the remaining 11 astrological signs combined. While these comparisons were implicitly considered in our design, they were not reported on in the current study. Our study illustrates that in a trial with multiple hypothesis tests (either secondary outcomes or subgroup analyses), the significance level used should be adjusted to preserve an overall type I error of a desired level. It is common in randomized clinical trials to examine one primary outcome or endpoint and multiple secondary endpoints. However, as the number of secondary endpoints or subgroup analyses increases, the risk of erroneously identifying a significant association also increases. To quantify the prevalence of subgroup analyses and the number of endpoints in clinical trials, we examined all 131 randomized clinical trials published in the Journal of the American Medical Association, the New England Journal of Medicine, the Lancet, and the British Medical Journal between January 1 and June 30 of 2004. The mean and median number of subgroups in which endpoints were compared between treatment arms were 5.1 and 2, respectively (IQR = 0–6), while the mean and median number of significance tests of efficacy and safety endpoints were 26.5 and 19, respectively (IQR = 9–32). The maximum number of distinct subgroups in which endpoints were compared between treatment arms was 68, while the maximum num-
4.3. The importance of biologic plausibility
Third, none of the hypotheses generated using the derivation cohort had any apparent biologic plausibility. Despite confirming 2 of the 24 prespecified hypotheses in the validation cohort, there is no currently apparent mechanism by which Leos might be predisposed to gastrointestinal hemorrhage or Sagittarians to humeral fractures. In interpreting the subgroup analyses from the ISIS-2 trial, the authors argued that the results were not biologically plausible, and should be ignored. Caution is required in interpreting results that do not have apparent biological plausibility. In particular, it is important that biologically plausible associations be specified during the design of the study, because it is tempting to construct biologically plausible reasons for observed subgroup effects after having observed them. Our study demonstrates that data-driven statistical methods may result in conclusions that are neither reproducible nor biologically plausible.
4.4. Subgroup analyses in clinical trials
Subgroup analyses are common in randomized controlled trials. Indeed, the subgroup analysis reported by the ISIS-2 investigators motivated the current study. Many investigators have cautioned against subgroup analyses in randomized controlled trials. It has been argued that such analyses should be prespecified, and that there should be a pre-specified biologically plausible explanation for the proposed subgroup analysis. Furthermore, it has been suggested that one should not be guided by statistical significance, but rather by trends and consistency, because such analyses are frequently underpowered. Similarly, Sleight cautions against subgroup analyses in randomized clinical trials, suggesting that plausible explanations for specific findings can often be found for conclusions that were, in reality, spurious. If our categorization of residents
had been based upon clinical criteria or demographic characteristics rather than astrological sign, it is likely that post hoc plausible explanations could have been constructed for many of the associations identified. Both Yusuf et al. and Oxman and Guyatt provide guidelines for interpreting the results of subgroup analyses. Freemantle suggested that a purist approach would be to examine subgroup analyses and secondary endpoints only if the primary endpoint is statistically significant. Recently, Rothwell discussed arguments for and against subgroup analyses and provided guidelines for designing and interpreting subgroup analyses. There are increasing calls for the registration of trial protocols prior to the start of randomized clinical trials, an initiative that could reduce the number of frivolous subgroup analyses. The current study adds a cautionary note concerning the practice of conducting numerous significance tests, such as those often performed in the setting of a randomized trial.
The current study used both derivation and validation datasets. Only 2 of the 24 significant associations that were identified using the derivation cohort remained statistically significant in the validation cohort. The use of derivation and validation datasets has been frequently advocated in the statistical literature. The use of a validation dataset allows one to assess the reproducibility of findings obtained in the derivation cohort, and serves to protect oneself from identifying spurious findings in a single dataset. We suggest that when surprising associations are obtained, either as a result of subgroup analyses or analysis of secondary outcomes in clinical trials, researchers seek to reproduce these
This concept is nicely illustrated by two major clinical trials. The Prospective Randomized Amlodipine Survival Evaluation (PRAISE) study examined the effect of amlodipine in patients with congestive heart failure and found no benefit in the primary analysis. In a prespecified subgroup analysis, amlodipine reduced the risk of fatal and nonfatal events in patients with severe nonischemic heart failure (P = 0.04). Furthermore, amlodipine seemed to prevent a secondary outcome (mortality) in the same patients (P < 0.001). The PRAISE-2 trial, which was explicitly designed to examine the effect of amlodipine in nonischemic heart failure patients, found no effect on mortality or cardiac events. This trial was never reported in detail. Similarly, the Evaluation of Losartan in the Elderly (ELITE) trial suggested a survival benefit in elderly heart failure patients treated with the angiotensin II antagonist losartan compared to the ACE inhibitor captopril. This finding was not replicated in the ELITE II trial. The results of the PRAISE/PRAISE-2 and ELITE/ELITE II trials illustrate that subgroup analyses, even when specified, can result in findings that are not subsequently
Finally, there is an increasing interest in “data mining” as a means of hypothesis generation, particularly in commercial endeavors. Data mining has been variously described as “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data” and as a “semi-automatic extraction of patterns, changes, associations, anomalies, and other statistically significant structures from large data sets”. Data mining is often conducted in large datasets and often does not involve prespecified hypotheses. In the current study, we began with no prespecified hypotheses, and used automated methods to detect apparently significant associations. Despite the addition of a validation cohort, two unanticipated associations remained significant. Our study therefore serves as a cautionary note regarding the interpretation of findings generated by data mining, and suggests that conclusions obtained from data mining should be viewed with a healthy degree of skepticism.
In conclusion, we were able to identify multiple significant associations, all of them clinically implausible, between astrological sign and the probability of hospitalization for specific diagnoses. Two of these associations remained statistically significant when tested in an independent validation cohort. Our study emphasizes the hazards of testing multiple, non-prespecified hypotheses.
The Institute for Clinical Evaluative Sciences (ICES) is supported in part by a grant from the Ontario Ministry of Health and Long-Term Care. The opinions, results, and conclusions are those of the authors and no endorsement by the Ministry of Health and Long-Term Care or the Institute for Clinical Evaluative Sciences is intended or should be inferred. Drs. Austin, Mamdani, and Juurlink are supported by New Investigator awards from the Canadian Insti-
[1] ISIS-2 Collaborative Group. Randomized trial of intravenous streptokinase, oral aspirin, both, or neither among 17187 cases of suspected acute myocardial infarction: ISIS-2. Lancet 1988;2(8607):349–60.
[2] Gurm HS, Lauer MS. Predicting incidence of some critical events by sun signs – the PISCES Study. ACC Curr J Rev 2003;Jan/Feb:22–4.
[3] Philips DP, Ruth TE, Wagner LM. Psychology and survival. Lancet
[4] Sankoh AJ, Huque MF, Dubey SD. Some comments on frequently used multiple endpoint adjustment methods in clinical trials. Stat
[5] Bender R, Lange S. Adjusting for multiple testing – when and how? J
[6] Topol EJ, Califf RM, Van de Werf F, Simoons M, Hampton J, Lee KL, et al. Perspectives on large-scale cardiovascular clinical trials for the new millennium. Circulation 1997;95:1072–82.
[7] Sleight P. Debate: subgroup analyses in clinical trials – fun to look at, but don’t believe them? Curr Control Trials Cardiovasc Med
[8] Yusuf S, Wittes J, Probstfield J, Tyroler HA. Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials. J Am Med Assoc 1991;266:93–8.
[9] Oxman AD, Guyatt GH. A consumer’s guide to subgroup analysis.
[10] Freemantle N. Interpreting the results of secondary end points and subgroup analyses in clinical trials: should we lock the crazy aunt
[11] Rothwell PM. Subgroup analysis in randomized controlled trials: importance, indications, and interpretation. Lancet 2005;365:176–86.
[12] DeAngelis CD, Drazen JM, Frizelle FA, Haug C, Hoey J, Horton R, et al. International Committee of Medical Journal Editors. Clinical trial registration: a statement from the International Committee of Medical Journal editors. J Am Med Assoc 2004;292:1363–4.
[13] Picard RR, Berk KN. Data splitting. Am Stat 1990;44:140–7.
[14] Packer M, O’Connor CM, Ghali JK, Pressler ML, Carson PE, Belkin RN, et al. Effect of amlodipine on morbidity and mortality in severe chronic heart failure. N Engl J Med 1996;335:1107–14.
[15] Thackray S, Witte K, Clark AL, Cleland JGF. Clinical trials update: OPTIME-CHF, PRAISE-2, ALL-HAT. Eur J Heart Fail 2000;2:209–12.
[16] Pitt B, Segal R, Martinez FA, Meurers G, Cowley AJ, Thomas I, et al. Randomized trial of losartan versus captopril in patients with heart failure (Evaluation of Losartan in the Elderly Study, ELITE). Lancet
[17] Pitt B, Poole-Wilson PA, Segal R, Martinez FA, Dickstein K, Camm AJ, et al. Effect of losartan compared with captopril on mortality in patients with symptomatic heart failure: randomized trial – the Losartan Heart Failure Survival Study ELITE II. Lancet
[18] Everitt BS. The Cambridge dictionary of statistics, 2nd edition. Cambridge: Cambridge University Press; 1998.
