All is not well in the biomedical sciences. In 1986, Drummond Rennie, the deputy editor of the Journal of the American Medical Association, offered the research community an execration biblical in its intensity.
There seems to be no study too fragmented, no hypothesis too trivial, no literature too biased or too egotistical, no design too warped, no methodology too bungled, no presentation of results too inaccurate, too obscure, and too contradictory, no analysis too self-serving, no argument too circular, no conclusions too trifling or too unjustified, and no grammar and syntax too offensive for a paper to end up in print.1
In 2005, an eminence at the Stanford University School of Medicine, John Ioannidis, published a paper in PLoS Medicine entitled “Why Most Published Research Findings Are False.”2 In the history of the journal, no paper has been more frequently downloaded. Ioannidis defended the assertion that, so far as research goes, false is more likely than true.3 It is easy enough to get to what is false
when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance.4
These are the very conditions met by most research in the biomedical sciences. No wonder so many research reports are false. It is a miracle that any of them are true.
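The arithmetic behind the assertion is not mysterious. The sketch below works through the standard positive predictive value calculation on which Ioannidis leans; the particular numbers plugged in are illustrative assumptions, not figures taken from his paper.

```python
# A rough sketch of the positive predictive value (PPV) reasoning behind
# Ioannidis (2005). The parameter values are illustrative assumptions,
# not figures from the paper itself.

def ppv(prior, power=0.8, alpha=0.05, bias=0.0):
    """Probability that a statistically significant finding is true.

    prior -- pre-study probability that the tested relationship is real
    power -- probability of detecting a real effect (1 - beta)
    alpha -- false-positive rate of the significance test
    bias  -- fraction of otherwise-negative results reported as positive
             (design flexibility, selective analysis, and so on)
    """
    true_positives = prior * (power + bias * (1 - power))
    false_positives = (1 - prior) * (alpha + bias * (1 - alpha))
    return true_positives / (true_positives + false_positives)

# A well-powered trial of a plausible hypothesis:
print(round(ppv(prior=0.5, power=0.8, alpha=0.05), 2))            # ~0.94
# A small exploratory study chasing one of many long shots, with some bias:
print(round(ppv(prior=0.1, power=0.2, alpha=0.05, bias=0.2), 2))  # ~0.14
```

Under the conditions Ioannidis enumerates, the second case is the typical one, and a positive finding is far more likely to be false than true.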
Six years later, John Arrowsmith found that the success rates for new development projects in Phase II trials had fallen from 28 to 18 percent.5 The most frequent reason for failure would appear to be that the drugs do not work. Bayer AG cannot replicate about two-thirds of published studies identifying possible drug targets.6 Amgen found an even higher rate of failure. Writing in Nature in 2011, Amgen researchers observed that 65 percent of medical studies were inconsistent when re-tested, and the main data set was only completely reproducible for seven percent of studies.7 In a remarkable example of auto-infection, it would seem that papers regarding reproducibility may themselves not be reproducible.8
One reason that so few results are reproducible is just that there are so many results. More than 25,000,000 scientific papers were published between 1996 and 2011.9 Societies and commercial publishers have grown rich publishing the ensuing wad (hereinafter the Wad), which doubles roughly every nine years.10 The system offers few obvious incentives leading anyone to question anything.11 Needless to say, no one can read all of the Wad. In time freed from the fairway, clinicians devote between three and four hours each week to reading the biomedical literature.12 Those performing research are overwhelmed.13
If the sheer size of the Wad is one problem, promiscuous authorship is another. Some scientists sign more than 900 papers a year, thus suggesting an assembly line far more than a research laboratory. Most of these papers are short enough, but some report laboratory research on which these prodigies of endeavor are listed as the lead researcher.14 It is strictly for show. Graduate students and postdoctoral fellows conduct the preponderance of biomedical research.15 They work like dogs and are treated like donkeys. Ioannidis affirms the obvious: that “[m]any otherwise seemingly independent, university-based studies may be conducted for no other reason than to give physicians and researchers qualifications for promotion or tenure.”16 To make matters worse, some scientific journals now charge their contributors, thus bringing the benefits of vanity publication to the academic marketplace.17 Harvard biologist John Bohannon submitted a pseudonymous paper on the effects of a lichen-derived chemical on cancer cells to 304 such journals, all of them proudly boasting of peer review. The paper was nonsense, the chief researcher fictitious, and the university where the research was conducted, non-existent. More than half the journals accepted the paper for publication.18 Vanity publication goes knife-in-hand with salami slicing, as when researchers parcel out the same data among different journals.19 Slices inevitably become smaller, even as the salami grows larger. Authors may find themselves citing one article ten times under the mistaken impression that they are citing ten articles one time.20
Many findings cannot be reproduced because researchers refuse to provide, or in any case do not provide, the necessary data. A 2011 review by Ioannidis found that of 500 randomly selected papers published in the world’s 50 leading journals, 149 were not subject to any data availability policy: 60 appeared in journals without a specific data-sharing statement, and 89 contained data not covered by the specific public deposition policies of their journals. Of the remaining 351 papers, all of them subject to some data-sharing policy, only 47 deposited their full primary data online.21 Christine Laine, the editor of the Annals of Internal Medicine, observed in 2013 that five years earlier, six researchers out of ten said they would share their raw data if asked; now fewer than half say as much—or do as little.22
It is a common misconception that biologists can now do with animals what physicists do with bosons. In fact, even clones, unlike bosons, may be vastly different in size, color, and temperament.23 Problems occur even at the cellular level. In the most rigorous of studies, results might be reproducible only under very specific conditions. Suppose that 50 mice were used in each of two studies of a potential anti-carcinogenic molecule. In the first, 28 mice went into remission; in the second, 30. By the standards of physics, the second study would not have reproduced the first; by the standards of biology, it would have been a cause for celebration. Outside the academic world, there are no identical rodents.
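A reader who prefers numbers to rhetoric can check the point with any standard comparison of two proportions; the counts below are simply those of the hypothetical example above, and Fisher’s exact test is one reasonable choice among several.

```python
# Comparing the two hypothetical mouse studies (28/50 vs. 30/50 remissions)
# with Fisher's exact test; the counts come from the example in the text.
from scipy.stats import fisher_exact

study_one = [28, 50 - 28]   # remissions, non-remissions
study_two = [30, 50 - 30]

odds_ratio, p_value = fisher_exact([study_one, study_two])
# The p-value lands far above any conventional threshold: the two results
# are statistically indistinguishable, which is to say that the second study
# "reproduces" the first in the only sense biology can offer.
print(f"p = {p_value:.2f}")
```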
Flexibility is the great enemy of rigor, allowing researchers the leeway to transform what would otherwise be negative results into positive ones. The more standardization, the better. Fields without standard designs lend themselves to reporting only their best results.24 In order to declare environmental tobacco smoke a Group A carcinogen (known to cause cancer in humans), the Environmental Protection Agency (EPA) abandoned the standard 95 percent confidence level used in epidemiology in favor of a 90 percent level.25 What became noxious was not the smoke, but the EPA.
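How much mischief can five percentage points of confidence buy? A small illustration, with hypothetical figures rather than the EPA’s actual data, shows how a relative risk whose 95 percent confidence interval straddles 1.0 can be pronounced significant once the interval is narrowed to 90 percent.

```python
# Hypothetical relative risk and standard error, chosen only to illustrate
# the effect of relaxing the confidence level from 95 to 90 percent.
import math

rr = 1.19           # assumed point estimate of the relative risk
se_log_rr = 0.10    # assumed standard error of log(rr)

for level, z in [(95, 1.96), (90, 1.645)]:
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    verdict = "excludes 1.0" if lower > 1.0 else "includes 1.0"
    print(f"{level}% CI: ({lower:.2f}, {upper:.2f}) -- {verdict}")

# Prints roughly:
#   95% CI: (0.98, 1.45) -- includes 1.0
#   90% CI: (1.01, 1.40) -- excludes 1.0
```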
Peer review, although widely admired as a form of scholarly saintliness, does not appear to improve reproducibility. It is a flawed system. For one thing, no one quite knows how peer review should be done. Send out the scribbled-over manuscript? The pre-publication version with Schmetterling’s name misspelled? After we get the damn thing in print? Isn’t that like putting a crash helmet on a corpse?26 Peer review is supposed to screen out flawed experiments, but who has the time, the inclination, or the money to reproduce an experiment in which a thousand albino mice were taught to enjoy the better variety of Cuban cigars?
Look, we’re lucky if our reviewers just read the manuscripts to make sure they’re written in something like a human language.
The Cochrane Collaboration exists, its website modestly affirms, “so that healthcare decisions get better.”27 It is a service “for anyone who is interested in using high-quality information to make health decisions.” Doctors, nurses, patients, carers, and researchers are welcome; funders are especially welcome. The Cochrane Collaboration is pleased to recognize itself as “the gold standard in evidence-based health care.” In 2007, the Collaboration, a name that somehow suggests the Borg on Star Trek, reported that “[a]t present, little empirical evidence is available to support the use of editorial peer review as a mechanism to ensure quality of biomedical research.”28 It added, however, that studying peer review is quite complex, thus indicting and excusing peer review in one florid gesture.
Multiple studies have shown that if several reviewers are asked to assess the same paper, their agreement on whether it should be published is little higher than would be expected by chance.29 Errors? Outright bloopers? Not a bit of it:
At the British Medical Journal we took a 600 word study that we were about to publish and inserted eight errors. We then sent the paper to about 300 reviewers. The median number of errors spotted was two, and 20% of the reviewers did not spot any. We did further studies of deliberately inserting errors, some very major, and came up with similar results.30
Another problem with pre-publication peer review is bias. In one well-known study, the authors took 12 papers from prestigious institutions that had already been published in psychology journals, retyped them, made minor changes to the titles, abstracts, and introductions, and changed the authors’ names and institutions. Thereafter, the papers were resubmitted to the journals that had first published them. In three instances, journals recognized that they had already published the papers; eight of the remaining nine papers were rejected.31
In biomedical research, good news is not only admired but indispensable. Negative results are less likely to be published than positive results, and the imbalance is growing at a rate of about six percent per year.32 A 2014 survey of social scientists found them largely unwilling to pass along bad news: “Why bother?” was the most commonly given explanation.33
Far from ensuring higher quality, the more prestigious journals attract the opposite. A 2014 commentary in Proceedings of the National Academy of Sciences observes that the very fact that some journals have extraordinary reputations puts “pressure on authors to rush into print, cut corners, exaggerate their findings, and overstate the significance of their work.”34
The most sensational findings are the least likely to be reproducible. If it sounds too good to be true, it probably is. The managing editor of Science-Based Medicine, David Gorski, put the matter with more finesse but no less force: “clinical trials examining highly improbable hypotheses are far more likely to produce false positive results than clinical trials examining hypotheses with a stronger basis in science.”35 The more extreme the conclusion, perhaps the greater its relevance if true, but the less likely it is to be true. One explanation of why sensational usually means wrong is simple enough: a sensational finding is a deviation from the mean, and the mean exists for a reason. Established science certainly can be wrong, especially if it has only recently been established. But a paper finding evidence that aspirin prevents foot-and-mouth disease should be looked at askance.
The hard sciences are mathematics, physics, chemistry, and parts of biology. All the rest are soft.36 What do the hard sciences have in common besides their hardness? For one thing, they have much less of a bias against publishing negative results. Daniele Fanelli, at the University of Edinburgh, found that
the odds of reporting a positive result were around five times higher among papers in Psychology and Psychiatry and Economics and Business than in Space Science, … 2.3 times significantly higher for papers in the social sciences compared to the physical sciences, and about 3.4 times significantly higher for behavioral and social studies on people compared to physical–chemical studies on non-biological material.37
Whatever comparisons he made, the harder sciences were more open to negative studies.
Psychologists are aware of their reputation for conducting studies that no one can reproduce. They are said to be interested in self-improvement. In March 2013, Dr. Brian Nosek unveiled the Center for Open Science, a new independent laboratory, endowed with $5.3 million from a single foundation, aiming to make replication more respectable. This is a well-funded step in the right direction. If psychology is bad but getting better, epidemiology is just plain rotten. Even high-powered epidemiological studies, Ioannidis notes glumly, may have only a one in five chance of being true.38
A commentary published in Nature in 2012 called for improvements along these lines, noting a need for higher standards of experimental design and reporting, and, in particular, far more diligent oversight from the National Institutes of Health and other funding bodies:
An important gatekeeper of quality remains the peer review of grant applications and journal manuscripts. We therefore call upon funding agencies and publishing groups to take actions to reinforce the importance of methodological rigor and reporting.39
We shall see; time will tell; yes, of course.