An English Vicar and Anti-Spam

A Good Sort

An 18th-century English vicar can save you from spam
February 24, 2006
Here's this week's tip. If you want to sort the wheat from the chaff -- whether that means separating email from spam, hot stocks from duds, or great movies from bombs -- you'll need the help of an 18th-century vicar.

Take, for example, the experience of Matthew Prince, chief executive of Utah-based antispam consultancy Unspam Technologies Inc. Hooked on the annual Sundance Film Festival since 1996, he has had the same problem that faces everyone who attends a big cinema festival: Which of the 200 or so films being shown are worth watching? So he and a group of friends began looking for ways to pick the best ones based on reviews of the films being screened. One day, while Mr. Prince was chatting with a fellow software engineer, the two realized that picking the best films was really the same problem as deciding whether an email message was spam.

What's with that, I hear you say? Let me take a moment to explain why most of the world's spam -- and there's a lot of it -- doesn't end up in your inbox. It's all because of Thomas Bayes, an 18th-century English vicar, who came up with a theorem for calculating the probability of a future event based on past events. His theorem forms the basis of modern-day spam filters used by most Internet Service Providers and email services. Put simply, if a piece of spam you receive contains the word "Viagra," chances are high that subsequent emails you receive containing that word also will be junk. A Bayesian filter inspects all the words in an email -- including hidden formatting, the headers and other telltale signs of spam -- and assigns a probability that the message is junk. All you have to do is train the filter by showing it a handful of junk and legitimate messages, telling it "this one is spam, this one isn't," and it quickly starts filtering out the rubbish.
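For the curious, the train-then-score loop described above can be sketched in a few lines of Python. This is only a toy, not the code any ISP or POPFile actually runs: the class name `NaiveBayesFilter`, the add-one smoothing, the simple word tokenizer and the four sample messages are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


class NaiveBayesFilter:
    """Toy word-based naive Bayes classifier (illustrative sketch only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.message_counts = Counter()          # bucket -> messages seen
        self.vocabulary = set()                  # every word seen in training

    @staticmethod
    def tokenize(text):
        # Lowercase the text and split it into simple word tokens.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, bucket):
        # "This one is spam, this one isn't": record the words under the bucket.
        words = self.tokenize(text)
        self.word_counts[bucket].update(words)
        self.message_counts[bucket] += 1
        self.vocabulary.update(words)

    def probabilities(self, text):
        # Return {bucket: probability} for a new message.
        words = self.tokenize(text)
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for bucket, seen in self.message_counts.items():
            # Start from the prior: how common this bucket was in training.
            score = math.log(seen / total_messages)
            bucket_total = sum(self.word_counts[bucket].values())
            for word in words:
                count = self.word_counts[bucket][word]
                # Add-one smoothing so an unseen word never gives probability zero.
                score += math.log((count + 1) / (bucket_total + len(self.vocabulary)))
            log_scores[bucket] = score
        # Convert log scores back into probabilities that sum to 1.
        top = max(log_scores.values())
        weights = {b: math.exp(s - top) for b, s in log_scores.items()}
        norm = sum(weights.values())
        return {b: w / norm for b, w in weights.items()}


# Made-up training messages, purely for demonstration.
mail_filter = NaiveBayesFilter()
mail_filter.train("Cheap Viagra, buy now", "spam")
mail_filter.train("Viagra discount, limited offer", "spam")
mail_filter.train("Meeting moved to Monday at noon", "ham")
mail_filter.train("Lunch on Monday?", "ham")

print(mail_filter.probabilities("Viagra offer"))
print(mail_filter.probabilities("Monday meeting"))
```

Nothing in the sketch is specific to spam. The bucket names are just labels, so the same class would accept "Personal" and "Work", or three film categories, without any change.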

Mr. Prince's bright idea was to apply the same filtering technique to film reviews. Would it be possible, he wondered, to run film reviews from the past few Sundance festivals through a Bayesian filter and see whether it could pick the likely winners? The U.S. Sundance festival, which is a leading showcase for independent cinema, releases a guide to the films being screened every year. Mr. Prince and some colleagues gathered 10 years of guides to more than 360 Sundance films. Based on each film's success at the festival and subsequently, it was assigned to one of three categories, or baskets: below average, average and above average. Their findings gave birth to the Web site (http://deconstructingsundance.com).

What Mr. Prince and his colleagues found was that, among other things, words were a pretty good indicator of success. But not necessarily the words you might expect in a review: Best. Fascinating. Emotional. Inspired. Great. All are, in the words of the Deconstructing Sundance Web site, "the kiss of death" for a movie. "Riveting," for example, appeared in 46% of reviews for what turned out to be below-average movies, as opposed to 22% of reviews for above-average movies. How so? Why would a reviewer call a dud "riveting"? Mr. Prince has his own theory: "Maybe writers, when they struggle with something good to say about something, revert to adjectives like 'riveting' rather than actually describing the movie in a more tangible way?"
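To see how a filter would use those two percentages, here is a rough worked example of Bayes' rule in Python. It rests on assumptions the article does not state: it compares only the below-average and above-average baskets, and it treats the two as equally likely before any review is read. The function name `posterior` is made up for this sketch.

```python
def posterior(likelihoods, priors):
    """Bayes' rule: turn P(word | bucket) into P(bucket | word)."""
    joint = {bucket: likelihoods[bucket] * priors[bucket] for bucket in likelihoods}
    total = sum(joint.values())
    return {bucket: value / total for bucket, value in joint.items()}


# Share of reviews containing "riveting", as quoted in the article.
likelihoods = {"below_average": 0.46, "above_average": 0.22}

# Assumption: both buckets are equally likely before reading the review.
priors = {"below_average": 0.5, "above_average": 0.5}

result = posterior(likelihoods, priors)
print(result)
```

Under those assumptions, a review that uses "riveting" points to a below-average film with a probability of roughly 68% (0.46 divided by 0.46 + 0.22). Different priors would shift the number, but the direction stays the same.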

Pretty neat. But why stop there? If an 18th-century cleric can help you figure out which movies are going to make it, why not use the technique to predict other things, such as stock market movements, blood clots, or volcanic eruptions? Well, actually, there are people thinking like this. U.S. shopping search engine Shopzilla.com uses a Bayesian filter to sift customer emails according to topic and, where relevant, fire back canned responses.

But what can this do for you? Well, if your ISP or office network isn't filtering out your spam, you can set up your own Bayesian filter. I suggest going with POPFile (http://popfile.sourceforge.net), a free, all-platform version of a commercial product called PolyMail developed by John Graham-Cumming. (It was he and POPFile who made the whole Deconstructing Sundance thing possible.) It's relatively easy to set up.

I've used POPFile for a few years and it's kept the spam at bay. Recently I decided to make it work harder. Like Mr. Prince and his crew, I figured that if the software did such a good job with spam, it could sort all my email for me. Email's big problem, you see, isn't just about filtering out spam. It's about sorting everything that comes in, so it doesn't all land (and usually stay) in one big oversized inbox.

My advice is to set up two baskets -- say, Personal and Work -- and a Bayesian filter will quickly figure out where your email should go. Instead of having to write a rule for every sender, or for every email with the words "Loan Shark" in the subject field, you can just teach it where a few sample emails go, and then leave it alone. I'm now experimenting with three baskets: 1) what I need to deal with now, 2) stuff I can save for later, and 3) stuff I'll never need. So far it's working pretty well.

Of course, in all honesty, we don't know quite why the Bayesian system works. It just does. Expect the good vicar's theorem to spread beyond spam control to other applications on the Internet.

Oh, and the Deconstructing Sundance project got it right in shortlisting some of the potential winners at this year's Sundance festival, which finished a few weeks ago. They tracked the buzz on two films, for example, that ultimately won the festival's two top awards: "Quinceañera" (dramatic) and "God Grew Tired of Us" (documentary).