How far can statistical errors go?

Errors behind fluke results

Can we really believe what we read in the press - not the tabloids, but scientific papers in world-renowned research journals? This disturbing question has been raised by the discovery of several statistical blunders in two such journals, Nature and the British Medical Journal (BMJ), by researchers at the University of Girona, Spain.


Emili García-Berthou and Carles Alcaraz took a sample of papers from the two journals and checked the calculations in them - specifically, those used to measure "statistical significance", a widely used criterion for deciding which findings are worth taking seriously.

The researchers found at least one error in one-quarter of the papers in the BMJ and more than one-third of those in Nature. Most were trivial slips, but the researchers suspect that about 4 per cent could have a big impact on the conclusions drawn.

The results, published in the online journal BMC Medical Research Methodology, have sparked controversy; The Economist last month condemned "sloppy stats which shame science".

Many scientists will see this as an over-reaction. Yet both responses are misplaced. The reality is that these occasional slips are just a small part of a scientific scandal that has been rumbling on for years.

It centres not on computational blunders, but on the routine abuse by scientists of the very concept of statistical significance. Its impact raises questions about far more than the 4 per cent of results that triggered the recent hand-wringing.

Introduced in the 1920s by the eminent British statistician R.A. Fisher (later Sir Ronald Fisher), "significance testing" has become a ritual for scientists trying to back up claims that they have made an interesting discovery. Experimental data - say, the cure-rates of two competing drugs - are fed into a formula, which spits out a number called a P-value. If this is less than 0.05, the result is said to be "statistically significant".
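
To make the ritual concrete, here is a minimal sketch in Python. The cure counts are invented for illustration, and the choice of Fisher's exact test is an assumption - the article does not specify which formula researchers use - but the shape of the procedure is the same for any of them: data in, P-value out, compare against 0.05.

```python
# A minimal sketch of the significance-testing ritual, using invented
# cure counts for two hypothetical drugs. The choice of Fisher's exact
# test is illustrative; any standard test ends the same way - a P-value
# compared against the 0.05 threshold.
from scipy.stats import fisher_exact

table = [
    [60, 40],  # drug A: 60 of 100 patients cured, 40 not
    [45, 55],  # drug B: 45 of 100 patients cured, 55 not
]

_, p_value = fisher_exact(table)
print(f"P-value: {p_value:.4f}")
print("statistically significant" if p_value < 0.05 else "not significant")
```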

The technique was seized on by researchers looking for a hard-and-fast rule for making sense of their data. The P-value has since become a standard feature of research papers in many fields, especially softer sciences such as psychology.

Yet almost as soon as it was introduced, significance testing caused alarm among statisticians. Their concern stemmed from the fact that the whole concept of the P-value rests on an assumption of which many scientists seemed unaware.

At first sight, a P-value appears to be the probability that a finding is just a fluke: a "statistically significant" P-value of 0.05 seems to imply a 95 per cent chance that the finding is not a fluke, and thus worth taking seriously.

Statisticians warned, however, that this was a dangerous misconception. The theory behind P-values, they pointed out, assumes as a precondition that every finding is a fluke. It then asks how probable it would be to observe a result at least as extreme as the one actually seen, given that assumption.

This probability cannot measure the chances of the finding being a fluke in the first place, as that assumption has already been made. Yet this is precisely what scientists began to use P-values for - with potentially disastrous consequences for the reliability of research.
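
In symbols, the gap is easy to state. Writing H0 for the hypothesis that the finding is a fluke, H1 for the hypothesis that the effect is real, and D for the event of observing data at least as extreme as those actually seen, the two quantities being conflated are (a sketch in notation of my own choosing, not the statisticians' own):

```latex
\text{P-value} = \Pr(D \mid H_0)
\qquad\text{whereas}\qquad
\Pr(H_0 \mid D)
  = \frac{\Pr(D \mid H_0)\,\Pr(H_0)}
         {\Pr(D \mid H_0)\,\Pr(H_0) + \Pr(D \mid H_1)\,\Pr(H_1)}
```

The right-hand quantity - the one scientists actually want - depends on the prior plausibility Pr(H0) of the finding being a fluke, which the P-value machinery never supplies.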

As long ago as 1963, a team of statisticians at the University of Michigan warned that P-values were "startlingly prone" to see significance in fluke results. Such warnings have been repeated many times since. During the 1980s, James Berger, a professor at Purdue University, and colleagues published a series of papers showing that P-values exaggerate the true significance of implausible findings by a factor of 10 or more - implying that vast numbers of "statistically significant" results in the scientific literature are actually meaningless flukes.
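
The mechanism behind such exaggeration can be shown with a back-of-envelope calculation. The numbers below are assumptions chosen for illustration (a 10 per cent base rate of genuine effects and 80 per cent statistical power), not figures from Berger's papers:

```python
# Illustrative, assumed inputs - not Berger's own figures.
prior_real = 0.10  # fraction of tested hypotheses that are genuine effects
power      = 0.80  # chance a genuine effect is declared significant
alpha      = 0.05  # chance a fluke is declared significant (the 0.05 cutoff)

true_hits  = prior_real * power          # genuine effects passing the test
false_hits = (1 - prior_real) * alpha    # flukes passing the test

fluke_share = false_hits / (true_hits + false_hits)
print(f"Share of 'significant' findings that are flukes: {fluke_share:.0%}")
```

Under these assumptions some 36 per cent of "statistically significant" findings are flukes - about seven times the 5 per cent that a naive reading of the threshold suggests.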

Despite this, scientists still routinely use P-values to assess new findings - not least because getting "statistically significant" results is virtually a sine qua non of having papers accepted by leading journals.

My own study of all the papers in a recent volume of the leading journal Nature Medicine shows that almost two-thirds cite P-values. Of these, more than 30 per cent show clear signs that the paper's authors do not understand their meaning.

Nor is this atypical: in a study to be published this summer, Gerd Gigerenzer and colleagues at the Max Planck Institute for Human Development in Berlin describe a survey showing that, even among academics teaching statistics in six German universities, 80 per cent had a flawed understanding of significance testing.

The impact of this on the practice and reliability of scientific research is disturbing. By exaggerating the real "significance" of findings, P-values have led to a host of spurious assertions gaining false credibility, from health scares to claims for paranormal phenomena (see below). Attempts to replicate such findings have wasted untold time, effort and money.

Prof Gigerenzer and his colleagues will call for a radical overhaul in statistics education to wean scientists off reliance on P-values. Yet experience suggests it is the editors of leading journals who hold the key to bringing about change. In 1986, Prof Kenneth Rothman of Boston University, editor of the American Journal of Public Health, declared he would no longer accept results based on P-values. His stance led to changes in statistics courses at leading public health schools, which began teaching their students more sophisticated statistical methods.

Medical journals have also begun to require that authors do more than calculate P-values to back up their claims. Dr Juan Carlos Lopez, chief editor of Nature Medicine, says that while the journal has no plans to eliminate P-values, it is carrying out an investigation into the scale of the problem before deciding on action.

After decades of warnings, scientists may finally be waking up to the dangers of P-values. The technique's ability to exaggerate significance may have led to many headline-grabbing findings - but the price has been to make academic journals barely more credible than the tabloids.

The writer is visiting reader in science at Aston University, Birmingham

BIZARRE CLAIMS BASED ON FLUKE RESULTS

The ability of P-values to exaggerate the real "significance" of meaningless fluke results has led to a host of implausible discoveries entering the scientific literature.

Many headline-grabbing health stories are based on evidence backed by P-values: last month, Japanese researchers used P-values to claim that women who lose their teeth in later life are at higher risk of heart disease.

Concern about use of P-values to back implausible claims is mounting. In March, a team from the US National Cancer Institute, Bethesda, Maryland, warned: "Too many reports of associations between genetic variants and common cancer sites and other complex diseases are false positives." It added: "A major reason for this unfortunate situation is the strategy of declaring statistical significance based on a P-value alone."

The use of P-values has proved valuable to those seeking scientific backing for such flaky notions as the existence of bio-rhythms and the effectiveness of wishing for good weather. One of the most bizarre examples centres on a study by Leonard Leibovici of the Rabin Medical Centre, Israel, purporting to show the effectiveness of "retroactive prayer". Some early research has hinted that patients may benefit from being prayed for. According to Prof Leibovici's study, published in the British Medical Journal in 2001, prayers even helped patients who had already recovered.

The findings, whose supposed significance was demonstrated using P-values, sparked calls for a complete overhaul in notions of space and time. To statisticians, however, the results are just further proof of the dangers of misunderstanding P-values.
