Re: [OT] Confirmation Spam Blocking was: List 'linux-dvb' closed to public posts

From: Linus Torvalds
Date: Thu Jan 22 2004 - 18:00:26 EST




On Thu, 22 Jan 2004, jw schultz wrote:
>
> Bayes is the wrong approach for those random words from the
> dictionary blocks.

Bayes is not wrong per se, but doing Bayes on pure word statistics is
wrong. It always was. People knew how it could be broken. The current rash
of spams is just the obvious way to do it.

> Those I've seen seem to be a long string of words all longer
> than 4 characters. A rule that gave a score based on the
> number of consecutive words longer than some number of
> characters would catch those fairly easily. If I get
> annoyed enough I may figure out how to write such a rule.

Don't. That's easily broken too, as you realized yourself.

> What we need is a bounty on these scum. $1000 fine per
> reported recipient with half going to the reporter would be
> nice.

What you should aim for, and which should be much harder to break, is to
realize that random words that make no sense give a really unlikely
score when you build up a Markov chain of them.

So to avoid the random words problem, do Bayes on the _chain_ of words
instead.
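
A minimal sketch of that idea in Python. It is not from the original
message: the names train_chain and avg_log_prob, the add-one smoothing
and the toy training text are all assumptions made for illustration. It
scores a message by how likely each word is to follow the previous one,
according to a chain built from known-good mail. A real filter would
presumably build one such table for spam and one for ham and compare
the two.

```python
import math
from collections import Counter, defaultdict

def train_chain(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        follows[a][b] += 1
    return follows, set(words)

def avg_log_prob(text, follows, vocab):
    """Average log-probability of each word given the word before it.

    Add-one smoothing (an assumption of this sketch) keeps unseen
    pairs from producing log(0).
    """
    words = text.lower().split()
    v = len(vocab) + 1
    total, n = 0.0, 0
    for a, b in zip(words, words[1:]):
        seen = follows.get(a, Counter())  # .get avoids growing the table
        p = (seen[b] + 1) / (sum(seen.values()) + v)
        total += math.log(p)
        n += 1
    return total / n if n else 0.0

# Toy training text standing in for a corpus of legitimate mail.
ham = "the patch fixes the bug in the driver and the patch is small"
follows, vocab = train_chain(ham)

coherent = avg_log_prob("the patch fixes the bug", follows, vocab)
salad = avg_log_prob("bug the fixes patch the", follows, vocab)
print(coherent > salad)
```

Both test messages contain exactly the same words, so a filter that
looks only at per-word statistics cannot tell them apart. The chain
score can, because none of the word pairs in the shuffled message ever
occur in the training text.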

Now, you can try to overcome this by spamming with something that makes
"sense" from the Markov chain standpoint, but by then that spam is going
to be hilarious. Once I start getting spams that are generated by Markov
generators and read like "real" email, I might stop filtering them, just
because they are bound to be a lot of fun to read.

Have you played with Markov chains? What happens is that you don't just
build up a list of words and their likelihood of being spam or ham, you
build up a list of word _combinations_ and the likelihood of one
particular word following another one.

That's how a lot of the "random phrase" generators on the web work.
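
A rough sketch of such a generator, again in Python and again not from
the original message (build_chain, generate and the seed parameter are
hypothetical names chosen for this example). It records which words
follow each word in some training text, then walks that table, picking
each next word at random from the recorded followers.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words seen directly after it."""
    words = text.split()
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)  # duplicates kept, so common followers win more often
    return chain

def generate(chain, start, length, seed=None):
    """Walk the chain from `start`, emitting at most `length` words."""
    rng = random.Random(seed)
    word = start
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: nothing ever followed this word
            break
        word = rng.choice(followers)
        out.append(word)
    return " ".join(out)

# Toy training text; real generators are fed far more material.
text = "the cat sat on the mat and the dog sat on the cat"
print(generate(build_chain(text), "the", 8, seed=1))
```

Every adjacent pair of words in the output occurs somewhere in the
training text, which is why the result almost reads like a sentence
even though nobody wrote it.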

They can be absolutely hilarious, exactly because the sentences they
generate actually _almost_ make sense. Sometimes you get an almost
readable story, but one that reads like somebody having a bad trip and his
reality just shifted 90 degrees. (Usually the best stories come if the
training material is coherent, which email sadly usually isn't).

Do a Google search for "Mark V Shaney", and you should get some idea
about this.

Linus