Danyl McLauchlan reviews Stuart Ritchie’s Science Fictions, which outlines the staggering systemic flaws in the funding and publication of scientific papers.
Back in August of 2006 a number of New Zealand scientists were caught up in a media controversy about whether Māori had a genetic predisposition towards violent crime. It kicked off when an epidemiologist at ESR presented research about his work on a gene called monoamine oxidase A, or MAOA, which codes for an enzyme that plays a key role in regulating neurotransmitters and is vital to brain function. There are different versions of MAOA and one in particular was linked to a variety of antisocial behavioural traits, including aggression. This variant became known as “the warrior gene” and this epidemiologist’s study found that this variant was overrepresented among Māori men (“historically Māori were fearless warriors”).
There was an outcry: the study’s findings and methodologies were criticised, with other scientists pointing out that the sample size was tiny, the researcher ignored environmental factors, ethnicity is a very slippery notion in genetics, also … this all seemed pretty racist. The researcher walked back his comments about violence and insisted his work had been taken out of context by the media. The goal of his research, he announced, was to assist in strategies “to reduce Māori over-representation in smoking and alcohol addiction statistics”.
No one disputed that there was a gene that caused antisocial behaviour. The link between MAOA and violent crime was well established in the literature. There had been numerous studies dating back to the early 1990s that found the “warrior gene” variant was overrepresented in families with histories of criminal offending. Boys who carried it were more likely to join gangs. It was implicated in alcoholism and numerous other traits, including depression and schizophrenia.
By the mid 2000s there was also an extensive body of work on the biology behind the gene’s function and malfunction. And all of this was part of a much wider field of research called candidate gene association studies, a methodology that enjoyed massive media exposure from the early 90s to the late 2000s as biologists identified the genes they claimed were responsible for obesity, mental illness, diabetes, addiction, gambling, cancer, crime, and numerous other behavioural disorders and diseases.
This was a golden age for high-profile genetics research. Almost every week brought another announcement that a candidate gene for some social problem or illness had been found. And, these announcements concluded, now that the gene causing the problem was known and the biological nature of the malady identified, a cure was surely close. Grants were funded and biotech companies were launched, billions of dollars were invested and exciting new technologies were trialed, all based on candidate gene discoveries.
And none of them worked because all of this was bullshit.
Warrior genes, depression genes, obesity genes – all the peer-reviewed papers in famous journals, the lavish media coverage, which went on for decades … None of it was true. There are rare cases in which variations in single genes do cause a disease (trinucleotide repeats in the huntingtin gene being a famous example). But in almost all cases, genes just don’t work the way candidate gene researchers claimed they did. Most gene effects are “polygenic”, meaning they’re the result of many genes – usually hundreds, sometimes thousands – interacting with each other. Genes are vast orchestras rather than lone guitar soloists. They also interact with transcription factors and other gene regulators, and environmental causes. Most individual gene variants have absolutely minute effects. Just as I was writing this review a study on the genetics of height, based on the genomes of four million people, revealed that about 50% of human height is attributable to 10,000 common genetic variants, each of them with tiny individual effects.
We know this now because in the mid 2000s the cost of genome sequencing decreased dramatically. A new research methodology called GWAS (Genome Wide Association Studies) allowed molecular biologists and statisticians to use the larger datasets now available to them to look for linkages between genes and traits. Instead of getting together a group of 46 guys and checking to see if they had “the warrior gene”, researchers could look at the genotypes of tens of thousands of people and work backwards from the traits they were studying. You could now reasonably ask – and answer – “given that this group are obese, or depressed, or have a history of violent crime, what differentiates their genotypes from everyone else’s?”
You would expect that a GWAS analysis of the genetics of aggression would throw up the warrior gene variant of MAOA. After all, hundreds of studies had verified that MAOA was a candidate gene for violent and antisocial behaviour, and thousands of additional papers based on those studies had worked out in great detail how the gene variant caused these problems. Instead, GWAS demonstrated that variants in MAOA had no measurable impact on antisocial behaviour. And they showed the same for the candidate gene studies into obesity, depression and schizophrenia. Very little of the candidate gene research – decades of high-profile, well-funded science – held up.
Science isn’t magic. Scientific findings are always provisional; they’re always subject to revision. In September a team of astronomers announced that they might have found phosphine in the atmosphere of Venus, and that this was a strong indication of biological life on another planet. More recently they admitted they’d looked closer at the data and they might not have found phosphine after all. That’s science.
But imagine if they’d made that first announcement, then thousands of astronomers spent years obtaining additional data confirming the existence of phosphine and publishing papers and studies describing the nature of lifeforms in the atmosphere of Venus in great detail, before admitting they’d taken a new measurement using better tech and they were wrong about the existence of phosphine and there were no lifeforms and never were … would that have been science? That’s pretty much what happened with candidate genes. What was going on there?
Here’s a more haunting question. Molecular biology was going through a revolutionary phase in the 2000s. New technologies, new methodologies, exponentially bigger datasets. It’s quite rare for a field to experience that level of disruption so suddenly, and in a way that sweepingly disproves a widely held, high-profile hypothesis. So: how much scientific research, carried out by credentialed experts, verified by peer review and published in reputable journals, is equivalent to candidate gene studies prior to GWAS? Which is to say: how much science only looks like science but has no actual truth value because it’s junk science that just hasn’t been debunked yet?
The answer, according to the psychologist and statistician Stuart Ritchie in his book Science Fictions: Exposing Fraud, Bias, Negligence and Hype in Science, is: a lot. Like, a lot. So much that it almost defies belief. Ritchie is a lecturer at the Institute of Psychiatry, Psychology and Neuroscience at King’s College London. He’s part of a new generation of scientists, many of them grounded in psychology and statistics, who are critical of what they perceive to be a shocking lack of rigour across the natural and social sciences.
And they cite an embarrassingly rich caseload of terrible science as evidence for their critique. On almost every page of Science Fictions is a description of a fascinating and important study that you’ve read about in the media or a popular science bestseller (Matthew Walker’s Why We Sleep and Daniel Kahneman’s Thinking, Fast and Slow are both called out), or heard cited in a TED talk, or even studied at school or university, but which has subsequently – and often very quietly – proved to be untrue. And on every other page of Science Fictions is an argument that there are many more instances of similar fraudulent, or hyped or biased or simply just bad science in that same category.
In 2011, back when Ritchie was a PhD student in psychology, he read a paper by the Cornell professor of psychology Daryl Bem, about an experiment in which Bem provided statistical evidence for the existence of psychic powers. The paper was published in the Journal of Personality and Social Psychology. This experiment was easily replicated, so Ritchie and a few fellow researchers carried it out, exactly how Bem described it, and found no statistical evidence for (cough) precognition. They wrote up their findings and sent them to the Journal of Personality and Social Psychology, which refused to publish their work, because they didn’t publish replications of previously published papers.
That same year the Dutch psychologist Diederik Stapel – who became world famous after his discovery that people were more prone to racial prejudice and racist stereotypes when making judgements in a messier or dirty environment – was suspended from Tilburg University, after revelations that he’d fabricated the data for most of his findings. His papers – all peer-reviewed, most published in prestigious, high-impact journals – were retracted.
The cases of Bem and Stapel reminded some researchers of a 2005 essay by the physician and scientist John Ioannidis. It was titled “Why Most Published Research Findings Are False”. In it Ioannidis proposed a mathematical model demonstrating that, given the very low barriers to meeting the criteria for “statistical significance” in a research paper, and all the varied ways in which research could go wrong but still meet those criteria, any given paper was more likely to be false than true. The essay created a great deal of discussion, but not much reform. Considering the vast amount of money, time, ingenuity and effort that went into scientific research around the world, the notion that most of the papers that all this diligent work produced were wrong was unthinkable. (Ioannidis has, in the past year, become a Covid contrarian, publishing papers filled with the same sloppy inaccuracies and errors he once critiqued in his fellow scientists).
But Bem’s psychic powers study and Stapel’s fake findings looked like the kind of thing Ioannidis’ paper predicted. Both of them were easy to spot: Bem’s work defied basic scientific understandings of how reality worked, and Stapel just made his numbers up, arousing the suspicion of his colleagues when they noticed that he had all this astonishing data but never conducted experiments. So if they made it past peer review and into prestigious journals, didn’t that suggest there could be more false papers that weren’t so blatantly fraudulent?
Many social psychology experiments are comparatively cheap and easy to carry out (as opposed to, say, a clinical trial of a cancer drug). You round up a bunch of undergrad students, tell them they won’t get credit for a course unless they “volunteer” to be experimental subjects, and then turn your postgraduate students loose on them. That made many published psychology papers good candidates for replication studies. So in the early 2010s psychologists began attempting to replicate well-known findings in the literature to see how many of them stood up.
The highest-profile study, the results of which were published in Science in 2015, chose 100 studies from three top psychology journals and tried to replicate them using larger sample sizes. Only 39% replicated successfully. Another study, replicating papers published in Science and Nature, the world’s most prestigious science journals, found a replication rate of 62%. And almost all replication studies, even when successful, found that original studies exaggerated the size of their effects.
This problem is now known as the replication crisis. It’s sometimes referred to as “psychology’s replication crisis” which is rather unfair: the psychologists have been more diligent at diagnosing the scale of the problem but the real crisis seems to stretch across all scientific domains. Ritchie cites an attempt by researchers at the biotechnology company Amgen to replicate 53 landmark preclinical cancer studies that had been published in top scientific journals. Eleven percent were successful. More ominously, further attempts to replicate findings in medical or biological research found that the methodologies described in the studies were so vague – and, when contacted, the scientists who conducted the original experiments were so unhelpful at clarifying the nature of their work – that they couldn’t even attempt to replicate them. Ritchie cites another study that found 54% of biomedical papers failed to identify the animals, chemicals or cells they used in their experiment. The work itself could not be reproduced, let alone the results.
How could this be possible? How could science, a calling and a profession – an entire global enterprise – that is synonymous with precision and accuracy and scrupulous attention to detail be churning out such a massive volume of hype, disinformation and nonsense? Ritchie’s book explains how, in eviscerating detail.
He sorts the problem into four categories. The first is fraud. This is, he argues, much more widespread than scientists like to think. He documents a handful of high-profile fraudsters, like Andrew Wakefield, with his bogus studies alleging a link between vaccines and autism, who misrepresented or altered the clinical details of all 12 children that he studied. Or Paolo Macchiarini, the Italian surgeon who falsified his work on thoracic reconstruction and became a superstar of medicine, hired by the prestigious Karolinska Institute in Sweden – which awards the Nobel Prize for Medicine – while his patients were systematically dying.
But more concerning are the studies that suggest fraudulent publication of research data is widespread. In 2016 the microbiologist Elisabeth Bik led a study to examine western blot images (blotting is a widespread technique that biologists use to separate out and identify the molecules they’re interested in). She found that 3.8% of papers containing blot images had been dishonestly manipulated. Based on this finding she estimates there are 35,000 papers in the biology literature that should be retracted because of fraudulent gel images alone.
Fraud, Ritchie suggests, is not something that happens in isolated cases, or substandard institutions. He quotes the neuroscientist Charles Gross who, in a review of scientific fraud, describes the archetypal fraudster as “a bright and ambitious young man working at an elite institution in a rapidly moving and highly competitive branch of modern biology or medicine, where results have important theoretical, clinical or financial implications.”
Adjacent to fraud is hype. Scientists inevitably blame the media for the endless cycle of news stories about astonishing new cancer cures (in mice), or nutrition studies about what may or may not give you cancer, or heart disease, or obesity, or make you live longer. (Ritchie is especially scathing about nutritional epidemiology, a field that often seems to generate more fake news and misinformation than every conspiracy theorist on Facebook combined [Ed: please note this piece was written in October 2020]). He points out that most hyperbolic media coverage of science can be traced back to the press release in which the findings were announced; documents that are generally co-written by the lead researcher and a comms advisor, and signed off by the researcher before release.
And there’s a chapter on negligence: the dumb mistakes that anyone can make, but which seem to routinely slip through the peer review process. The star of this chapter is the Reinhart-Rogoff paper of 2010, a hugely influential, now infamous macroeconomics paper arguing that a nation’s economic growth suffers when debt reaches a certain point. It was often cited as evidence for the austerity economics of the post-GFC era, and its findings were predicated on a simple error in Excel. When the coding error was corrected the paper’s findings were invalidated.
As usual Ritchie provides evidence suggesting that this high-profile case is just a signifier of a deeper, more systemic problem: that science is riddled with obvious errors. One of his examples is a mistake I’ve been caught out on: when you cut and paste lists of genes into Excel the software will helpfully rename anything that looks like a date. So MARCH1 (Membrane Associated Ring-CH-Type Finger 1) becomes 1-Mar. Which is not a gene. A 2016 study found the error affected about 20% of all genetics papers. Genes with date-like names have all been renamed; it seemed easier than getting Microsoft to fix Excel.
Fraud, hype and negligence are all problems, and they can be amusing or salacious or awkward to read about. But they tend to be problems with people rather than problems with science. The most important chapter in Ritchie’s book is about bias, which points to deep, systemic problems within science itself.
There’s been a lot of attention paid to scientific bias over the last few years. There’s overconfidence bias (ubiquitous); gender and racial bias which skews countless results (including the famous facial recognition algorithms that only work on white people because that’s who they were tested on); there are debates about political bias and ideological orthodoxy, especially in the social sciences. In 2010 the evolutionary biologist Joseph Henrich pointed out that most findings in the psychology literature are based on research conducted on people who are WEIRD (Western, Educated, Industrialised, Rich and Democratic) and that these people are a very unusual minority of humans overall, so experiments based on them probably don’t generalise to the rest of our species.
All of these are problems, but for Ritchie the huge challenges in contemporary science are positivity bias, publication bias and the statistical bias known as p-hacking.
Imagine you’re a genetic epidemiologist back in 2005, and you decided to do a study showing that some candidate gene didn’t have an effect on behaviour. In theory that’s a valid hypothesis and a valid experiment. In practice, it’s just not the sort of work most scientists do. There is a strong bias towards novel discoveries and positive effects; towards finding genes or techniques or chemicals or treatments or effects that do something. We know with the benefit of hindsight that a study demonstrating that a candidate gene didn’t have a major effect would have been more accurate, and thus more valuable than all the rest of the studies in that field at the time, but it’s just not what gets funded.
This is positivity bias, and it creates problems when it meets with publication bias. The publication of research papers is the primary metric of success in almost every research scientist’s career. And scientific journals are far more likely to publish positive or novel results. The flip side is a bias against studies with null results, i.e. studies that find nothing. If your study fails to prove anything then it’s very unlikely to get accepted for publication. And this creates a strong incentive for scientists to make their studies prove something.
If our hypothetical 2005 epidemiologist got a grant to study candidate genes and their role in, say, addiction, they’d ideally get a sample group of addicts and a control group of non-addicts, choose their candidate and sequence their genomes. Let’s say they start out looking at variants of DRD2, the dopamine receptor D2 gene, often nominated as a candidate gene for addiction. If they compared the gene variants of the addicts and non-addicts and found no differences, then in publication terms the study would be a failure; a career-damaging waste of time and money.
So a hardworking and ambitious scientist might wonder: what if there’s some other candidate gene involved in addiction? Maybe I can find something else that’s statistically significant about my dataset? This term “statistical significance” is crucial. For many scientific studies, significance is determined by the calculated probability value, universally known as the p value. This is arrived at via computing a statistical formula (or, more realistically, running the p value function over your dataset in R or Excel or some other statistical software).
What the p value is measuring is the probability that you’d find a result at least that strong in any given dataset by chance, if the effect you were measuring didn’t exist. So if candidate genes don’t do anything (and they mostly don’t) there’s still a chance that you’ll randomly find them overrepresented in your sample. The p value measures that chance. And the standard criterion for statistical significance, i.e. that what you’ve found probably isn’t just there by random chance, is a p value below 0.05: if there were no real effect, you’d see a result like yours less than 5% of the time. (That is not quite the same as a 95% probability that your finding is real, although it is often read that way.) If you look for a candidate gene variant and find it in your sample group, and your statistical software tells you the p value for that finding is less than 0.05, then your result is statistically significant and your experiment is publishable.
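For readers who like to see the machinery, here is a minimal sketch of one way such a p value could be computed. It is an illustration rather than the method any of the studies above used: it assumes a simple one-sided binomial test, and the carrier count, sample size and background rate are made-up numbers, not figures from any real study.

```python
from math import comb


def one_sided_p_value(carriers: int, sample_size: int, background_rate: float) -> float:
    """Chance of seeing at least `carriers` carriers in a sample of
    `sample_size` people if the variant is no more common in this group
    than its (assumed) background rate in the wider population."""
    return sum(
        comb(sample_size, k)
        * background_rate**k
        * (1 - background_rate) ** (sample_size - k)
        for k in range(carriers, sample_size + 1)
    )


# Hypothetical numbers, for illustration only.
p = one_sided_p_value(carriers=20, sample_size=46, background_rate=0.30)
print(f"p = {p:.3f}")
print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```

The function only answers the narrow question "how surprising is this count if nothing is going on?". It says nothing about how many other questions were asked of the same dataset first.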
But p values can be hacked, and doing this is so easy many scientists who do it don’t even know they’re doing anything wrong. Here’s how it works.
If you roll a dice then the chances that it’ll come up six are one-sixth, or 17%. If you roll two dice then the chances they’ll both come up six at once are 2.78%, which is fairly close to the p value of 0.05.
If you carried out your addiction gene study but didn’t find anything, you could then go looking for some other gene that’s overrepresented in your sample. But! Your chance of randomly finding something increases with each new search. If you keep rerolling your dice then your chance of randomly rolling two sixes with each individual roll is always the same (2.78%), but the chances that you’ll eventually hit that combination rapidly climb towards 100%, at which point you can say “Wow! Two sixes! What are the odds?” or, “Wow! I found overrepresentation of monoamine oxidase A in my sample group! This proves that the warrior gene is implicated in addictive behaviour!” And when you write your paper up you don’t say “I started out looking at DRD2 and didn’t find anything so I searched for dozens of gene variants until I randomly found one.” You say “This study examines the role of MAOA in addiction.”
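The dice arithmetic is easy to check. The sketch below assumes every roll, or every gene you test, is independent of the others, which real genetic data will not quite satisfy; the function name is invented for this illustration.

```python
def chance_of_at_least_one_hit(per_try: float, tries: int) -> float:
    """Probability that a fluke with chance `per_try` shows up at least
    once in `tries` independent attempts: 1 - (1 - per_try) ** tries."""
    return 1 - (1 - per_try) ** tries


DOUBLE_SIX = 1 / 36  # about 2.78% on any single roll of two dice
THRESHOLD = 0.05     # the conventional significance cut-off

for tries in (1, 5, 14, 25, 50, 100):
    dice = chance_of_at_least_one_hit(DOUBLE_SIX, tries)
    genes = chance_of_at_least_one_hit(THRESHOLD, tries)
    print(f"{tries:>3} tries: double six {dice:6.1%} | false positive {genes:6.1%}")
```

Under that independence assumption, 25 rolls are enough to make a double six more likely than not, and testing 14 unrelated gene variants at the 0.05 threshold gives better than even odds of at least one “significant” result when none of them does anything.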
Here’s why all this matters. In 2018 the psychiatrist Ymkje Anna de Vries and her colleagues examined 105 clinical trials for antidepressants approved by the Food and Drug Administration in the US. The results came back almost 50/50: 53 studies found that the drug worked better than placebo or control; the other 52 registered results that were either “questionable” or null.
Further, 98% of the positive results were published compared to only 48% of the negative findings (publication bias). Of the 25 negative studies that did get published, 10 were p-hacked, just the way our hypothetical epidemiologist’s were: the researchers switched the outcomes when they saw their initial study wasn’t statistically significant, so their paper flipped from a negative to positive result.
There are now only 15 published studies left in the dataset with clear negative outcomes. And 10 of them were written up in papers that spun the research to present it in a positive light (positivity bias). Imagine you’re a health ministry or drug funding agency trying to decide what antidepressants you’ll approve. The first thing you’ll do is look for a meta-analysis based on the published literature, which is completely corrupted by bias.
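To see how far the published record drifts from the trial record, the figures quoted above can be run through a few lines of arithmetic. This is a back-of-envelope sketch using only the numbers in this review; rounding the percentages to whole studies is my assumption, not something taken from the de Vries paper.

```python
trials = 105
positive = 53
negative = trials - positive                 # 52 questionable or null

# Publication bias: 98% of positive and 48% of negative trials published.
published_positive = round(0.98 * positive)  # 52
published_negative = round(0.48 * negative)  # 25

outcome_switched = 10  # negative trials p-hacked into positive papers
spun = 10              # negative trials written up in a positive light
clearly_negative = published_negative - outcome_switched - spun

published = published_positive + published_negative
looks_positive = published_positive + outcome_switched + spun

print(f"Trials that actually came out positive: {positive / trials:.0%}")
print(f"Published papers that read as positive: {looks_positive / published:.0%}")
print(f"Published papers that are clearly negative: {clearly_negative}")
```

On these figures roughly half the trials were positive, but more than nine in ten of the published papers read that way, and only five clearly negative papers survive.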
Many people with long-term mood disorders have gone through the ordeal of being prescribed a promising new drug only to find it does nothing, or makes things worse. And their physician will usually handwave the failure away to the complexities of personalised medicine. “We just have to find the right drug for you.” But what if the drug didn’t work for you because it never worked at all, and it got approved because the clinical literature was garbage? What if most fields of research are just as bad, or worse, and we can only see the dysfunction in medicine because they make you pre-register your intentions for clinical trials?
The problems of p-hacking, publication bias and positivity bias extend across the sciences. And we are a scientific culture: decisions around policy and investment are supposed to be evidence based. If you’re a corrections department looking at solutions for criminal reform, or an education ministry looking at new educational techniques, a hospital looking at new surgical procedures, or an investment fund considering which biotechnology companies to back, you’ll look at the scientific literature before you make a decision. And if the literature is filled with things that sound credible but don’t work, because those are the papers that positivity bias, publication bias and p-hacking incentivise scientists to produce, then you can’t make good decisions, and all of the policies and investments and techniques and interventions will fail.
Science works. Ritchie’s book is not postmodern; the name “Foucault” does not appear in the index. In his epilogue he points to a host of recent scientific breakthroughs: successful gene therapy for immunocompromised children; engineers teleporting information using quantum entanglement; astronomers photographing a black hole. While I was writing this review Pfizer announced positive results in clinical trials for a Covid-19 vaccine with 90% efficacy. We should not abandon the pursuit of scientific knowledge, or see that knowledge as “a function of power” or just another metanarrative. The solution to science’s problems, Ritchie argues, is more science: and it should take the form of the scientific study of science. Or, as it’s now called, “metascience”.
Most books that critique something as vast and formidable as science end with a final, rather vague chapter in which the author briefly gestures at solutions: “We should improve society somewhat”. Science Fictions is very specific. Science needs to constantly scrutinise itself. All the elegant metascience studies detecting fraud and hype and bias throughout the book help address the problems Ritchie describes, and he ends with a long and specific list of demands targeted at the institutions that fund and conduct scientific research.
Funders need to fund replication studies. Research institutes need to conduct them. Journals need to publish them, along with null results. Scientists need to pre-register their study intentions so they can’t p-hack their datasets. When papers are published the data should be open access so that other researchers can reproduce and validate the findings. All of this should be mandatory; all taught at universities, preconditions for grants, funding and publication. If scientists aren’t doing it, Ritchie argues, many of them aren’t doing actual research, just something that looks like research but has no value.
I talked to a few New Zealand scientists about local awareness of these issues; they indicated there is awareness but no widespread momentum for the kind of reforms Ritchie advocates. There are pre-registration criteria for clinical trials, and the Health Research Council – one of our primary funders for biological and medical research – has a data-monitoring committee that scrutinises applicant intentions for statistical analysis.
I emailed the Auckland University statistician Thomas Lumley – who enjoys a brief but distinguished appearance in Science Fictions, debunking the analysis of a study on the microbiomics of autism – and he replied that he supports these kinds of reforms in fields in which they’re useful, and that enforcement by funding agencies is important, but worries that the scope of the measures is being overestimated.
Lumley points out that although mandatory “open data” is fine for many psychology experiments, it runs up against both medical confidentiality and indigenous data sovereignty in many clinical and population health studies. He added, “Perhaps more importantly, clinical trials need pre-registration, publication of negative results, and replication, because it’s unethical to carry out clinical trials that are much larger than the minimum needed, and because theory is relatively unhelpful in predicting which interventions will work. Clinical trials can only ever provide fairly weak evidence, so they need all the safeguards we can come up with. Other areas should ideally be upgrading the quality of evidence instead.” Which is a diplomatic way of saying that papers should be more rigorous so that researchers don’t have to spend their time replicating garbage.
But he also argues, “In areas of science with stronger theories, you get more increase in knowledge by doing research that improves the theory. The statistical problems are more likely to be about the fit of the analysis and the theory than about error in hypothesis testing. For example, some of my colleagues work on signal detection in gravitational wave astronomy, and have for years been modelling the expected ‘sound’ of black hole collisions and core-collapse supernovas. There was no point in replication funding and no real possibility of pre-registration for this sort of work, and there isn’t such a thing as a ‘null result’. We know they were successful because LIGO is now picking up large numbers of black-hole collisions, and that was always going to be the only real test.”
Peter Thiel famously said that we wanted flying cars, instead we got 140 characters. He was talking about “the great stagnation”: the slump in technological innovation lasting from the 1970s until today. The world experienced massive and rapid technological change from the 1820s to the 1960s, the economic historians tell us, and then it slowed way down, with a handful of industries and fields – digital technology, energy, molecular biology – coasting onwards, elaborating on breakthroughs made in the mid 20th century. The period in which we transitioned to a scientific culture saw a dramatic decline in the rate of scientific discovery.
No one knows why this happened. There are many theories and the most popular is that the 19th and early 20th centuries picked all the low-hanging fruit of technological modernity. They discovered all the obvious stuff: everything from then on just got harder. And maybe that’s true, but Ritchie’s book points to another piece of the puzzle: that we just aren’t as scientific as we thought we were. Which would explain a lot, and is also a cause for optimism. Maybe the 21st century can be the period in which we stop deluding ourselves that we’re doing science, and actually start doing it.