Re: [OT] Confirmation Spam Blocking was: List 'linux-dvb' closed to public posts

From: Kevin O'Connor
Date: Sat Jan 24 2004 - 15:17:35 EST


On Thu, Jan 22, 2004 at 10:35:54AM -0800, David Lang wrote:
> On Thu, 22 Jan 2004, David Ford wrote:
> > Considering that Bayesian filters are useless against the new spam that
> > is proliferating these days, that's laughable. Spam now comes with a
> > good 5-10K of random dictionary words.

I'm curious what Bayesian filters you're using. The filter I use
(bogofilter.sf.net) regularly catches and properly categorizes these
spams.

A good Bayesian spam filter isn't nearly as susceptible to random words as
some people think. Words that strongly indicate spam (along with words that
strongly indicate "ham") are given _exponentially_ more weight than neutral
words. The only way a block of random words is likely to sway the score is
if it happens to include enough "ham" words to outweigh the message's "spam"
words, and a random pick is just as likely to land on a "spam" word as on a
"ham" word. In any case, you'd need random word blocks _much_ bigger than
5-10K before they became statistically likely to catch enough "ham" tokens.
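
To make the weighting argument concrete, here is a rough sketch of a
Graham-style combination step. It is not bogofilter's actual algorithm; the
probability table, the function name, and the top_n cutoff are all made up
for illustration. Tokens near 0.5 add the same amount to both sides of the
ratio, so neutral filler cancels out.

```python
import math

# Hypothetical per-token spam probabilities, as a trained filter might
# hold them. 0.5 is neutral; values must stay strictly between 0 and 1.
TOKEN_PROBS = {
    "viagra": 0.99, "mortgage": 0.97, "unsubscribe": 0.95,
    "kernel": 0.01, "patch": 0.02,
}

def spam_score(tokens, token_probs=TOKEN_PROBS, unknown=0.5, top_n=15):
    """Combine per-token probabilities into one spam probability."""
    probs = [token_probs.get(t, unknown) for t in set(tokens)]
    # Keep only the tokens farthest from neutral; the rest are ignored.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    strongest = probs[:top_n]
    # Sum logs instead of multiplying, to avoid floating-point underflow.
    log_spam = sum(math.log(p) for p in strongest)
    log_ham = sum(math.log(1.0 - p) for p in strongest)
    # Equivalent to prod(p) / (prod(p) + prod(1 - p)).
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

spam = ["viagra", "mortgage", "unsubscribe"]
padding = [f"filler{i}" for i in range(1000)]  # words the filter has never seen

plain = spam_score(spam)
padded = spam_score(spam + padding)
print(plain, padded)
```

Under these assumed numbers both scores stay above 0.99, and a thousand
unknown filler words leave the result essentially unchanged. Padding only
matters if it happens to contain tokens the filter already rates as
strongly "ham".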

> so we need to extend the Bayesian filters to deal with multi-word combos.
> how much legit mail has those dictionary words in it? properly trained,
> their presence should help identify the spam.

If filters start looking for grammatically correct phrases or sentences,
the spammers will just start pasting in random sections of books or web
pages. Multi-word and "Markov chain" tests will only be helpful if the
filter also does proper weighting of the results. And, since my filter
works fine today, I'm in no rush to upgrade to a more complex one.
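
For reference, the multi-word idea amounts to feeding adjacent word pairs to
the filter as extra tokens. A minimal sketch, assuming plain whitespace
tokenization (the function name is hypothetical and no particular filter is
claimed to work exactly this way):

```python
def tokens_with_pairs(text):
    """Return single words plus adjacent word pairs as filter tokens."""
    words = text.lower().split()
    # Each pair joins a word with the one that follows it.
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs

print(tokens_with_pairs("Buy cheap pills"))
```

The pair tokens still have to go through the same weighting as single
words; pairs lifted from a book or web page would mostly be unseen or
neutral and would add little either way.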

-Kevin

--
---------------------------------------------------------------------
| Kevin O'Connor "BTW, IMHO we need a FAQ for |
| kevin@xxxxxxxxxxxx 'IMHO', 'FAQ', 'BTW', etc. !" |
---------------------------------------------------------------------