Kalysto - /Utilities/mail-scripts - (dspam learning wrapper)

README

SPECIAL THANKS TO John Seifarth FOR HIS SUPPORT.

What is mail-scripts ?
-----------------------

This package holds the 'dspam-learn' bash script, which makes dspam learning
much easier by scanning mailbox directories instead of relying on forwarding.
This is much like what you may have set up with SpamAssassin and sa-learn.

I had a terrible experience installing dspam, because I wanted to set it up
in a way it doesn't seem to have been designed for: I wanted to use it only
through procmail recipes. This appears to me much more logical and less messy
than the default installation.

In fact, dspam can mark mail through procmail recipes pretty much as
SpamAssassin does: by adding a special header line to the mail stating its
conclusion (is it spam or not?). But the learning part of dspam, as described
in the official dspam documentation, involves forwarding mails to other
mailboxes. This creates lots of extra mailboxes, which I didn't want because
it felt quite messy.

What I had understood of spam filtering and learning was really clear with
SpamAssassin, but dspam managed to break all the simple concepts and to
introduce new, complicated ones (such as the quarantine box, or the
spam/false-positive distinction, which is trickier than the spam/non-spam
distinction) where there seems to be no need for them.

Do I need mail-scripts ?
------------------------

Only in very special cases:

Required:

- You want to use dspam for spam detection.
- You are using procmail as MDA (mail delivery agent).
- You have mail "directories" where mail is supposedly classified.
- You have/use (or you are willing to create) mail "dir(s)" such as:
    - spam boxes: whose purpose is to contain SPAM
    - ham boxes: which contain non-SPAM mail
- You want to trigger dspam learning depending on where mails are classified
  in your mail directory structure, letting the user teach dspam by moving
  mail between directories.

NOTE: THIS MEANS THE USER MUST HAVE IMAP ACCESS TO HIS MAIL ACCOUNT.

- It worked with a courier IMAP / procmail / spamassassin / dspam combination.
- It is compatible with mbox files (all mails in one file) and with standard
  maildir (with its new/ cur/ tmp/ internal layout).

Not required but possible:

- You prefer to invoke dspam from procmail recipes.
- You have SpamAssassin, or another automatic spam detector that moves mail
  around in your IMAP dirs, and want dspam to learn from that other method.
- You would like the learning phase not to alter (delete or move) mail...
- Or, on the contrary, you would like mail to be deleted after the learning
  phase.

What does mail-scripts do ?
---------------------------

mail-scripts holds the dspam-learn script, and that's all for now. :)

The dspam-learn script wraps the dspam binary to automagically learn what is
spam and what isn't from where you actually filed your mail in your mail dirs.

It ensures that you won't feed dspam twice with the same SPAM by storing the
MD5 of each mail already fed to dspam. It calls dspam with the correct
arguments depending on whether or not dspam had previously marked the mail
correctly.

This means that if dspam does not catch all your SPAM, you only have to move
the missed spam to a "SPAM" directory in your IMAP structure... Conversely,
if dspam wrongly marked an email as spam, you have to move the mail to the
proper directory (one that contains no spam) to feed it as a "false-positive".

You could also "copy" the mail you want to feed to dspam into 2 IMAP dirs:
one for SPAM and one for false-positives. But this seems more confusing to
me.

How does it work ?
------------------

It simply goes through all the mail in the directories specified in the
configuration file. When it finds a mail, it checks that the mail wasn't
already fed to dspam by looking for its MD5 in its list. It also checks that
dspam hasn't already marked the mail as "Spam" if it must be taught as spam,
or as "Innocent" if it must be taught as innocent.

If there's no mark, it sends the mail as a "corpus" mail. If the mail carries
the wrong mark, it sends it as a classification error with the "--spam" or
"--false-positive" argument.

You can safely launch dspam-learn several times. The MD5 list ensures that
you won't teach the same mail twice.
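
The decision logic described above can be pictured with the following minimal
sketch. It is not the actual dspam-learn source: the function names, the
location of the MD5 list and the one-mail-per-file (maildir-style) layout are
all assumptions made for illustration, and it relies on the md5sum tool from
GNU coreutils. It only prints what should be done with a mail instead of
calling dspam, since the exact dspam command line depends on your version.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the decision logic; NOT the real dspam-learn code.

# List of checksums already taught (file name and location are assumptions).
seen_list="${SEEN_LIST:-$HOME/.dspam/dspam-learn.md5}"

# Print the MD5 checksum of the mail file $1.
mail_md5() {
    md5sum < "$1" | cut -d' ' -f1
}

# Succeed if the checksum of mail file $1 is already in the list.
already_taught() {
    [ -f "$seen_list" ] && grep -qxF "$(mail_md5 "$1")" "$seen_list"
}

# Print the value of the first X-DSPAM-Result header of mail file $1.
# Only the header part is read (sed quits at the first empty line).
dspam_mark() {
    sed -n -e '/^$/q' -e 's/^X-DSPAM-Result: *//p' "$1" | head -n 1
}

# Decide what to do with a mail.
#   $1 = mail file, $2 = "spam" or "ham" (kind of box it was found in)
# Prints one of: skip, corpus, error
learn_action() {
    local file=$1 box=$2 mark
    if already_taught "$file"; then
        echo skip
        return
    fi
    mark=$(dspam_mark "$file")
    case "$box:$mark" in
        spam:Spam|ham:Innocent) echo skip ;;    # dspam already agrees
        *:)                     echo corpus ;;  # no mark: corpus mail
        *)                      echo error ;;   # wrong mark: classification error
    esac
}

# Record mail file $1 as taught, so that later runs skip it.
remember() {
    mail_md5 "$1" >> "$seen_list"
}
```

A caller would loop over the files of each box, run learn_action on each one,
invoke dspam accordingly, and call remember once the mail has been taught.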

How do I use it ?
-----------------

You must install it correctly (this involves setting up a correct config
file); see the installation section that follows this one.

Then you'll have to launch it :

# dspam-learn

That's all. It uses the configuration file to fetch its information, which
helps if you want to run it as a cron job.

Calling dspam-learn feeds dspam with all messages that weren't already fed,
and thus improves dspam's experience with the mail found in the ham (non-spam)
and spam locations you listed in the configuration file.

You will notice that it makes heavy use of pretty ASCII colors, which are
actually not pretty at all in mail output (such as cron might send you).

You can deactivate the ASCII colors by setting the environment variable
'ascii_color' to "no", for example:

# export ascii_color=no
# dspam-learn

Or shorter :

# ascii_color=no dspam-learn

That can fit neatly in your cron job.
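
As an illustration, here is a minimal sketch of a wrapper that a cron job
could call. The function name run_dspam_learn and the default log file path
are assumptions made for this example; only the 'ascii_color' variable and
the dspam-learn command come from this package.

```shell
#!/usr/bin/env bash
# Hypothetical cron wrapper; function name and log path are assumptions.

# Run dspam-learn without colors and append its output to a log file.
#   $1 = log file (optional)
run_dspam_learn() {
    local log=${1:-$HOME/.dspam/dspam-learn.log}
    {
        # Timestamp each run so successive runs can be told apart.
        printf '=== %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')"
        ascii_color=no dspam-learn
    } >> "$log" 2>&1
}
```

Saved as a script that ends with a call to run_dspam_learn, this keeps the
cron mail empty and the log free of color escape sequences.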

How do I install it ?
---------------------

This is a GNU-style package, so a simple:

# ./configure && make && make install

should do the trick. It'll install a single dspam-learn script.

Next, you should take a look in the source package at

src/sample/dspam-learn.rc

which is a well-commented template for creating a correct configuration file.

The configuration file is expected to be found at "~/.dspam/dspam-learn.rc".

Note: this can be tweaked depending on your configuration. You might even
be able to use a single general "dspam-learn.rc" somewhere else. Just look
at the corresponding section.

So you could:

$ mkdir -p ~/.dspam
$ cp src/sample/dspam-learn.rc ~/.dspam/dspam-learn.rc

Note: these commands assume that your current working directory is the
package source directory, and that you are logged in as the destination user
that will use dspam.

Then edit ~/.dspam/dspam-learn.rc.

When finished, you can do the first run with

$ dspam-learn

If you have lots of mail to teach to dspam, this could take time.

What procmail rules should I use ?
----------------------------------

With this setup, you should make all your mail pass through dspam without
interfering with delivery: dspam should therefore be called at the top of your
procmailrc:

:0fw: dspam.lock
* < 256000
| dspam --user username --stdout

You should carefully read the dspam configuration documentation and the
--deliver-spam and --deliver-fp runtime options. They might be of use, since
you want both innocent mail AND spam to be delivered.

When you feel that dspam has enough experience (by looking at the headers and
checking that it marks Innocent and Spam correctly, or by launching
dspam-learn and looking at its output), you can add a rule to delete spam or,
as in this example, to move spam detected by dspam to a special dir:

:0 H:
* $ ^(X-DSPAM-Result: Spam)
${MAILDIR}/.SPAM.dspam/

This is for maildir format (mails are in separate files in a folder).
Or

:0 H:
* $ ^(X-DSPAM-Result: Spam)
${MAILDIR}/spam

This is for mbox format (mails are concatenated in the same file).

Note: the MAILDIR variable must have been defined before these rules are used.

Can I use SpamAssassin AND Dspam ?
----------------------------------

Yes, of course. In fact, this seems a good way to teach dspam in the
beginning, and SpamAssassin uses a totally different way of detecting spam
(with the exception of its Bayesian system).

I use SpamAssassin to automatically move mail rated above 8.0 points to my
spam dir. When my cron job launches dspam-learn, these mails are checked and
learned if dspam didn't spot them as spam.

I think this is a great combination. I actually have less than one spam a day
managing to get through the two filters, out of 100-200 spams a day. The spam
that gets through is very special: usually viruses (labeled "your file") or
empty mails.

I HAVEN'T THOUGHT ABOUT WHETHER SPAMASSASSIN AND DSPAM MIGHT MISLEAD EACH
OTHER WITH THEIR MARKS.

WHAT CAN BE TAKEN FOR SURE IS THAT THESE SYSTEMS WORK REALLY WELL TOGETHER ON
MY CURRENT SYSTEM. THIS COULD POSSIBLY BE EXPLAINED BY THE "LEARNING"
ALGORITHMS OF DSPAM, AND COMBINING THE QUALITIES OF EACH FILTER COULD EVEN
PRODUCE BETTER RESULTS.

TO DRAW A CONCLUSION, AN EXHAUSTIVE TEST REMAINS NECESSARY. If you've come
across any tests on this topic, please drop me a mail about them.

How must I set up dspam / procmail for a proper installation ?
--------------------------------------------------------------

Go to: http://splodge.fluff.org/docs/dspam-for-sa-users
It covers dspam/procmail integration for non-IMAP setups, but a great part of
the information found there also applies to an IMAP config.

The ascii-colors display annoys me ! can I remove it ?
------------------------------------------------------

Yep, this is new in version 0.0.2. You can just set the shell variable
'ascii_color' to 'no'. So this would be a correct call:

ascii_color=no dspam-learn

This is highly recommended if the output is going to be mailed, as it may be
when called by a cron job, or if you want a clean log when redirecting the
output to a logfile.

Why do you make sure that a mail isn't taught two times to dspam ?
------------------------------------------------------------------

I've received mail stating that it wasn't useful to check that a mail isn't
taught twice to dspam. Here's my answer:

It is clearly stated in the dspam README that teaching dspam until the
mail is correctly filtered could lead to strange behavior. This was one
reason why I did the MD5 check stuff. It was also really helpful when I
first launched dspam-learn on my thousands of spams to learn: I had to cancel
it several times, so dspam-learn had to look at each mail again, and it
didn't reteach them to dspam. This in fact provides a measure of stability:
you can Ctrl-C or kill dspam-learn whenever you want, and it'll resume its job
without any drawbacks.

There's a dspam configure-time option to force mail into the Bayesian engine
until it is correctly filtered. If you want this behavior, this might be
a solution.

I have a question, can I mail you ?
-----------------------------------

Of course, I'll try to reply quickly. Here's my email: <vaab@free.fr>

I found a bug, or want to modify the script... what should I do ?
-----------------------------------------------------------------

Contact me at <vaab@free.fr>.

I have installed vlfs-shlib, is dspam-learn using them ?
--------------------------------------------------------

Yes, these are included statically by default, but if you have installed
vlfs-shlib, you could do:

# shlib d dspam-learn

This will greatly reduce the size of the script and improve its readability.

What is this vlfs-shlib all about ?
-----------------------------------

These are shell libraries I use quite often. Look for the package
"vlfs-shlib"; there's some info there.

YOU DO NOT NEED vlfs-shlib TO USE/INSTALL mail-scripts. The libraries used
by dspam-learn are included by default in the shell script.

In some respects, you could see this as if "vlfs-shlib" were linked
statically into mail-scripts...

How do I modify the default location of the config file ?
---------------------------------------------------------

You can easily change the default location of the configuration file at
run time by specifying your path to the configuration file in the environment
variable DSPAMLEARN_RC.

You could set, for example:

# export DSPAMLEARN_RC="/etc/mail/dspam-learn.rc"
# dspam-learn

This makes it possible to use one general config file.
Or shorter:

# DSPAMLEARN_RC="/etc/mail/dspam-learn.rc" dspam-learn

You can also modify the defaults in the bash script. (I'll think about a
configure-time option for the next releases.)

Finally, you can specify several locations separated by spaces in
DSPAMLEARN_RC; they are looked up in that order, and the first file found
takes precedence.

So you could :

# export DSPAMLEARN_RC="~/.dspam/dspam-learn.rc /etc/mail/dspam-learn.rc"
# dspam-learn

or shorter :

# DSPAMLEARN_RC="~/.dspam/dspam-learn.rc /etc/mail/dspam-learn.rc" dspam-learn

Note: all the configuration files are read if present. They are read in the
order they are listed in DSPAMLEARN_RC. For options defined several times,
only the first definition takes effect. (This rule does not apply to the
'hambox' and 'spambox' options, whose values are all concatenated.)

Hint: you could set DSPAMLEARN_RC to a global file and a local file. The
global file holds the defaults, and the local file leaves the user free to
redefine some variables locally. In this case you have to list the local
file first (e.g. ~/.dspam-learn.rc) and the server-wide config file last
(e.g. /etc/mail/dspam-learn.rc). This would give:

# DSPAMLEARN_RC="~/.dspam-learn.rc /etc/mail/dspam-learn.rc" dspam-learn
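
To make the precedence rule concrete, here is a minimal sketch of a lookup
that behaves as described: the first definition wins, except for 'hambox'
and 'spambox', whose values accumulate. It is only an illustration, not the
code used by dspam-learn. The function name rc_lookup and the simple
"name=value" line format are assumptions; check the sample dspam-learn.rc for
the real syntax.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the lookup rule; NOT the real dspam-learn code.
# Assumes one "name=value" definition per line in each configuration file.

# Print the effective value(s) of an option.
#   $1 = option name, $2 = space-separated list of configuration files
rc_lookup() {
    local name=$1 list=$2 file values
    for file in $list; do
        # A quoted "~" is not expanded by the shell, so this sketch
        # expands a leading "~/" by hand.
        case $file in
            "~/"*) file="$HOME/${file#"~/"}" ;;
        esac
        [ -r "$file" ] || continue          # missing files are skipped
        values=$(sed -n "s/^$name=//p" "$file")
        [ -n "$values" ] || continue        # option not defined in this file
        case $name in
            hambox|spambox)
                printf '%s\n' "$values"     # keep every value of every file
                ;;
            *)
                printf '%s\n' "$values" | head -n 1
                return 0                    # first definition wins
                ;;
        esac
    done
}
```

With the local file listed first, an option set in both files takes its
local value, while the boxes declared in both files are all scanned.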