There is now enough evidence to say what many have long thought: that any claim coming from an observational study is most likely to be wrong – wrong in the sense that it will not replicate if tested rigorously.
Thus opens the 2011 paper “Deming, data and observational studies” by S. Stanley Young and Alan Karr. Observational studies frequently cover medical and biological topics, such as cancer risks or nutritional effects, which means they may be consulted by medical doctors and popularized by mass media. The relevance of Young & Karr’s paper therefore extends well beyond the academic community. First, some examples of their rather shocking findings, variations of which have been known since at least 1988:
In a small sample in 2005, of 49 claims coming from highly cited studies, 14 either failed to replicate entirely or the magnitude of the claimed effect was greatly reduced (a regression to the mean). Six of these 49 studies were observational studies, and in these six, in effect, randomly chosen observational studies, five failed to replicate. This last is an 83% failure rate.
We ourselves carried out an informal but comprehensive accounting of 12 randomised clinical trials that tested observational claims […]. The 12 clinical trials tested 52 observational claims. They all confirmed no claims in the direction of the observational claims. […] To put it another way, 100% of the observational claims failed to replicate. In fact, five claims (9.6%) are statistically significant in the clinical trials in the opposite direction to the observational claim.
Studies claim meaningful results on the basis of a sufficiently small p-value, which indicates that a result at least as extreme would be highly unlikely to arise from coincidence alone. But the statistical analysis is performed by the same people who conduct the study, which is an open invitation to game the system.
There is general recognition that a paper has a much better chance of acceptance if something new is found. This means that, for publication, the claim in the paper has to be based on a p-value less than 0.05. From Deming’s point of view, this is quality by inspection. The journals are placing heavy reliance on a statistical test rather than examination of the methods and steps that lead to a conclusion. As to having a p-value less than 0.05, some might be tempted to game the system through multiple testing, multiple modelling or unfair treatment of bias, or some combination of the three that leads to a small p-value. Researchers can be quite creative in devising a plausible story to fit the statistical finding.
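To get a feel for how much room multiple testing leaves, here is a minimal Python sketch. It is my own illustration, not taken from the paper, and it rests on simplifying assumptions: every tested hypothesis is truly null, the tests are independent, and each p-value is therefore uniformly distributed between 0 and 1. The function names and parameters are hypothetical.

```python
import random


def familywise_false_positive_rate(num_tests, alpha=0.05):
    """Chance that at least one of num_tests independent null tests has p < alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_studies(num_tests, alpha=0.05, num_studies=20_000, seed=0):
    """Monte Carlo estimate of the same quantity.

    Assumption: under a true null hypothesis a p-value is uniform on [0, 1],
    so each test is modelled as a single draw from rng.random().
    """
    rng = random.Random(seed)
    studies_with_a_finding = 0
    for _ in range(num_studies):
        # A "study" reports a finding if any one of its tests crosses the threshold.
        if any(rng.random() < alpha for _ in range(num_tests)):
            studies_with_a_finding += 1
    return studies_with_a_finding / num_studies


if __name__ == "__main__":
    for k in (1, 5, 20, 60):
        exact = familywise_false_positive_rate(k)
        simulated = simulate_studies(k)
        print(f"{k:3d} tests: exact {exact:.3f}, simulated {simulated:.3f}")
```

Under these assumptions, a study that quietly tries 20 unrelated questions has roughly a 64% chance of producing at least one "significant" result even when nothing real is going on. Real analyses are messier, since tests are often correlated, but the direction of the effect is the same.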
The solution Young & Karr propose is to control the process of producing a study, rather than merely reviewing the final result, by analogy with W. Edwards Deming’s industrial process control.
Deming’s insight was to control each step of the process where errors occur so that the final frequency of bad product is greatly reduced. Now, world-wide, industrial production is process control. Control the steps of the process and the final product will largely take care of itself. Consider the production of an observational study: Workers – that is, researchers – do data collection, data cleaning, statistical analysis, interpretation, writing a report/paper. It is a craft with essentially no managerial control at each step of the process. […] [J]ournal editors and referees inspect only the final product of the observational study production process and they release a lot of bad product.
Process control for observational studies implies reproducible research:
Reproducible research is research where the study protocol, the electronic data set used for the paper, and the analysis code are all publicly available. Managers can also require split-sample analysis strategies and other methods to protect against false positives. At present, researchers – and, just as important, the public at large – are being deceived, and are being deceived in the name of science. This should not be allowed to continue.
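The split-sample idea mentioned in the quote can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' procedure: the data are pure noise (no candidate has any real effect), each candidate is tested with a simple two-sided z-test assuming known unit variance, and a claim counts as confirmed only if the held-out half is also significant in the same direction. All names and parameters are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def z_test(sample):
    """Two-sided z-test of mean 0, assuming known unit variance.

    Returns (p_value, sign of the observed effect).
    """
    z = fmean(sample) * len(sample) ** 0.5
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return p_value, (1 if z > 0 else -1)


def split_sample_screen(num_candidates=200, n_per_half=100, alpha=0.05, seed=1):
    """Screen noise-only candidates on one half, then re-test on the other half.

    Returns (discovered, confirmed): how many candidates looked significant in
    the exploration half, and how many of those held up in the holdout half.
    """
    rng = random.Random(seed)
    discovered = 0
    confirmed = 0
    for _ in range(num_candidates):
        # Assumption: no real effect, so every observation is standard normal noise.
        data = [rng.gauss(0.0, 1.0) for _ in range(2 * n_per_half)]
        explore, holdout = data[:n_per_half], data[n_per_half:]
        p_explore, sign_explore = z_test(explore)
        if p_explore < alpha:
            discovered += 1
            p_holdout, sign_holdout = z_test(holdout)
            # Confirm only if the holdout agrees in both significance and direction.
            if p_holdout < alpha and sign_holdout == sign_explore:
                confirmed += 1
    return discovered, confirmed


if __name__ == "__main__":
    found, held_up = split_sample_screen()
    print(f"significant in exploration half: {found}")
    print(f"also confirmed in holdout half:  {held_up}")
```

With noise-only data, about 5% of candidates should pass the first screen by chance, and very few of those should survive the second look. The price is statistical power: each half has only half the data, which is why the split has to be fixed in the study protocol before anyone sees the results.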
So far the “deception” continues unabated, and you would be well advised to keep that in mind when you see the latest amazing health news derived from observational studies, as opposed to controlled experiments.