Fun with spam

    geiseri's picture
    10 Mar 2004

    Well, now that I have been working on getting Kolab to handle my spam and virus issues, I think I have finally found a nice solution. A few weeks ago, when I was testing my email spam filters here, I went out on Usenet and posted my name a few times. Now that I get about 1500 spam messages a day, I have a very nice test set ;)

    So I installed both SpamAssassin and dSpam on my server so I could give them a whirl. SpamAssassin was easy to install, but it was a real hog on resources and seemed to catch more real mail than spam... and as I tried to tune it, it just drove me nuts.

    Next we installed dSpam on our server. This was a little bit harder to install, and training took about two weeks. Now about 10-20 spams a day get through, and about 1500 are caught. dSpam also seems very light and fast compared to other spam filter solutions. So far there have been no false positives past the first few days, which is a nice bonus.

    So here is the issue... dSpam is maintenance intensive: you have to train it, clean the databases, etc. It's annoying, but damn effective. SpamAssassin was easy to install, but proved itself pretty useless within a few days. Given the amount of tweaking SpamAssassin needed, dSpam soon became the less offensive of the two. But will users train their dSpam? Are they annoyed enough with spam that they will train their filter and keep on top of the 5 or 6 spams that get through? Not sure... we need to figure this out.

    I think if KMail would make using a remote Bayesian filter easier it could be doable... but still a "drop in" solution like Spam Assassin is attractive...

    Any opinions from the peanut gallery?

    Comments

    dominicc's picture

    it's already possible

    I was tinkering with KMail today trying to make it integrate properly with dspam, and finally found that it can be done in a very acceptable way, although the solution was not immediately obvious.

    Create an action that pipes to a command (if you run dspam locally) or forwards/attaches the mail (if it is run remotely), and then deletes the spam; name it "-- Mark as Spam"; make it a named action (add to the apply filter menu) and make sure it never gets run in any of the normal ways; give it a nice icon (I chose the picture of the turkey -- user agent ;-).

    Now you have an action that can be executed, but it is in a stupid place (beneath Apply Filter Actions). Instead, configure the toolbar and you will find that you can add the icon directly to the toolbar -- I put it next to the delete bin. That's it; a bit tricky to set up but easy to operate.
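
    As a rough illustration of what the "pipe to a command" action above amounts to, here is a minimal Python sketch that feeds one raw message to a command's standard input. The function name `pipe_message` and the stand-in trainer are assumptions made for illustration; the actual dspam invocation depends on the installed version and is deliberately not spelled out here.

```python
import subprocess
import sys


def pipe_message(message: bytes, command: list[str]) -> subprocess.CompletedProcess:
    """Feed one raw message to a command's stdin, roughly what a
    'pipe through' filter action does in a mail client."""
    return subprocess.run(command, input=message, capture_output=True, check=True)


# Hypothetical stand-in for the real training command, so the sketch runs
# anywhere: it only counts the bytes it receives on stdin.
STAND_IN_TRAINER = [
    sys.executable,
    "-c",
    "import sys; print(len(sys.stdin.buffer.read()))",
]

if __name__ == "__main__":
    raw = b"Subject: test\n\nbuy now\n"
    done = pipe_message(raw, STAND_IN_TRAINER)
    print(done.stdout.decode().strip())
```

    Swapping `STAND_IN_TRAINER` for the real trainer command is the only change a local setup should need; the remote case described above forwards the mail instead of piping it.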

    pfremy's picture

    Spam is definitely becoming t

    Spam is definitely becoming the number 1 problem of every internet user. Of course people will do whatever they have to do to get rid of spam. They would actually be ready to do a lot more than you imagine. So spam filtering in KMail is strongly needed.

    As for me, I use bogofilter, which catches my 300 daily spams and 50 viruses without any problem. It took less than one week to train it.

    andre's picture

    Spamassassin

    I find it odd you are having such problems with SA. I am using it all the time, and it catches about 98% of my spam (some 200 a day), without false positives.
    On the resource intensiveness: try running the spamd/spamc combo. It basically keeps SA loaded, making it much faster for checking your bulk email.
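
    For what the spamd/spamc combo looks like from a script, here is a minimal Python sketch. It assumes a spamd daemon is already running and relies on `spamc -c`, which prints `score/threshold` and exits with status 1 when the message is judged spam. The helper names are made up for this example.

```python
import subprocess


def parse_spamc_check(output: str) -> tuple[float, float]:
    """Split spamc's 'score/threshold' check output into two floats."""
    score, threshold = output.strip().split("/")
    return float(score), float(threshold)


def check_message(message: bytes, command=("spamc", "-c")) -> tuple[bool, float, float]:
    """Send one raw message to spamc and report (is_spam, score, threshold).

    No check=True here: in check mode an exit status of 1 means 'spam',
    not a failure.
    """
    done = subprocess.run(list(command), input=message, capture_output=True)
    score, threshold = parse_spamc_check(done.stdout.decode())
    return done.returncode == 1, score, threshold
```

    The daemon keeps the rules loaded in memory, so each call only pays for the small spamc client, which is where the speed-up comes from.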

    geiseri's picture

    One false positive...

    is too many... also, 98% of 200 still lets 4 messages through, and that's only 200. I'm running about 1500 spams a day here; 98% sucks ass :) dSpam got to 99% in a week, and if we can find a more sane training approach we might get to 99.9%. This isn't some kid's basement, this is hundreds of users and thousands of emails a day.

    ralsina@drupal.org's picture

    You are exaggerating

    If 99% is good, 98% can't "suck ass". It's almost the same any way you look at it.

    If you are getting 2000 spams a day, 99% lets 20 get through, 98% lets 40.

    Now, I will assume that 2000 spams a day are not for your account only, so if it's 10 accounts, it's the difference between 2 and 4 spams. If you personally are getting 2000 spams a day, in anything but a honeypot account, well, you have my condolences ;-)
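
    The arithmetic being traded back and forth in this thread can be checked with a few lines of Python. This is only a sketch; the function name is made up, and the figures are the ones quoted in the comments.

```python
def spam_getting_through(spam_per_day: int, catch_rate: float) -> float:
    """Number of spams per day that slip past a filter with the given catch rate."""
    return round(spam_per_day * (1 - catch_rate), 1)


if __name__ == "__main__":
    # 2000 spams a day at 99% and at 98%
    print(spam_getting_through(2000, 0.99))
    print(spam_getting_through(2000, 0.98))
    # the same load spread over 10 accounts
    print(spam_getting_through(2000, 0.99) / 10)
    print(spam_getting_through(2000, 0.98) / 10)
    # 200 spams a day at 98%
    print(spam_getting_through(200, 0.98))
```

    Going from 98% to 99% halves the number of misses, so whether the gap matters depends on the absolute volume per mailbox.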

    On other issues: spamassassin is not any heavier than the alternatives, *if* you use spamd/spamc (maybe you need to disable the DNS checks: they don't tax the computer, but they do add delay, although they also make it more effective).

    I have installed spamassassin in a server handling 75K messages a day, with a threshold of 7.5, and it has caught 98.5% of spam with a negligible amount of false positives (10 in a month :-).

    You can also train spamassassin. If you are comparing untrained spamassassin to trained competition, how do you expect it to do well?

    You can even let your users train it! Just create them spam/nospam folders in their imap accounts, and they can train them for themselves by moving mails around.
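
    A minimal sketch of how the folder-based training could be wired up, assuming the users' folders are reachable on disk and that SpamAssassin's `sa-learn` tool is used with its `--spam` and `--ham` options. The mail root path and the folder names are hypothetical, and the sketch only builds the command lines rather than running them.

```python
from pathlib import Path


def build_training_commands(mail_root, spam_folder="spam", ham_folder="nospam"):
    """Build one sa-learn command per training folder.

    mail_root and the folder names are assumptions about the mail store
    layout; adjust them to match the actual IMAP server.
    """
    root = Path(mail_root)
    return [
        ["sa-learn", "--spam", str(root / spam_folder)],
        ["sa-learn", "--ham", str(root / ham_folder)],
    ]


if __name__ == "__main__":
    for argv in build_training_commands("/home/alice/Maildir"):
        print(" ".join(argv))
```

    Run from a nightly job per user, something like this would pick up whatever the user moved into either folder during the day.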

    Spamassassin can even self-train (although it hasn't done too much for me) by learning from high-scored messages.

    geiseri's picture

    my main address is...

    will be 10 years old as of August 28th. Luckily my FidoNet node address has been dead for a few years, or I suppose that would be worse :)

    This account on its own gets 500 spams a day. The rest come from forwards from postmaster and webmaster at about 10 domains. Throw in a few mailing lists and I get about the same number of legitimate messages a day.

    Now my "honeypot" account forwards to my main account for the sole reason of testing. We have clients who want spam filtering on their Kolab server, but don't want to lose messages. 1 false positive is too much: a) you will never find it, and b) if it was important, you just defeated the entire system.

    SpamAssassin's self trainer proved to be unreliable enough for larger sites that dSpam was quickly able to overtake it. Having users manually train their spam filter is a pain, but 0 false positives and over 99% success rates are attractive. Once it started catching kde-cvs commits to the website, that was too much of a risk to run in a real environment.

    For a large site, less than 99.9% really is not an option. Ideally dSpam will get there soon. As for overhead, dSpam is a small C app versus a big Perl app no matter how you cut it; dSpam usually exits on our mailserver here before top can pick it up. It's pretty slick. If I could work out how to integrate the training better with mailers, it would be perfect in my book.

    aseigo's picture

    how generic is the learning t

    how generic is the learning that dSpam does? could its rules be packaged up once you hit your desired 99.9% and used by others? could the learning input of others be used to augment your efforts?

    i know very little about the state of the art in spam filtering, but i would think it can't be too individual-specific and that having a shared, replicated database would be a great thing, no?

    geiseri's picture

    still up in the air...

    We are still trying to figure that out... the big problem is that when you have a group corpus file, people need to agree on what counts as spam. This, though, is imho the only way to make filters like dSpam work. Users cannot be bothered to train their filters; they just want them to work. By the same token, I'm not sure I want them removed from the equation of deciding what spam is and what isn't. The "false positive" issue effectively scares most companies away from implementing anything though...
