This is a new distribution based on a plugin developed by Christian Holler decoder_at_own-hero_dot_net
To end the confusion about versions of this plugin, there is now an SVN repository that holds the latest version. This version removes the Image::Magick dependency, which was very troublesome for a lot of people, but adds gifsicle as a required dependency for handling animated GIF images.
Even though this new version is still considered in development, it is very stable.
svn://fuzzyocr.own-hero.net/trunk/devel
Fixed:
sh: $efile: ambiguous redirect
This message was being generated when using complex scansets, because
the 'value' was only translated once. In complex scansets, this value
may be specified multiple times.
FuzzyOcr.cf
Fixed outstanding errors. Variable mismatches are now fixed.
FuzzyOcr.pm
Traps ImageMagick errors better and logs them.
When processing Animated-GIF files, due to the algorithm, it is possible
to discard all frames, leaving an empty image. Now, this special case
is treated as a corrupt image, and triggers FUZZY_OCR_CORRUPT_IMG with
$Score{corrupt} points (2.5 by default).
Changed:
Option: focr_personal_wordlist
Now, if the option value begins with '/', the value is treated as a fixed
path rather than as relative to the effective user's HOME directory.
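The path rule above can be sketched as follows. This is a minimal Python illustration, not the plugin's Perl code; the function name `resolve_wordlist` and the example file names are hypothetical.

```python
import os


def resolve_wordlist(value: str, home: str) -> str:
    """Return the wordlist path for a focr_personal_wordlist value."""
    # A leading '/' marks a fixed path; anything else is relative to HOME.
    if value.startswith("/"):
        return value
    return os.path.join(home, value)


# Hypothetical values, for illustration only.
print(resolve_wordlist("/etc/mail/wordlist", "/home/alice"))
print(resolve_wordlist(".spamassassin/wordlist", "/home/alice"))
```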
Added:
Option: 'focr_score_ham' Default: 0.0
When set to 1, images that are below the 'focr_counts_required' threshold
are scored with the formula $Score{Add} * $cnt. This gives marginally bad
images some positive score instead of just allowing them without score.
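As a rough illustration of the formula, the sketch below computes the score for an image that stayed below the threshold. It is written in Python rather than the plugin's Perl; the function name and the value 1.0 used for $Score{Add} are assumptions made only for the example.

```python
def below_threshold_score(cnt: int, score_add: float, score_ham: bool) -> float:
    """Score for an image whose word count is below focr_counts_required.

    The caller is assumed to have already checked the threshold.
    """
    if score_ham:
        # focr_score_ham enabled: $Score{Add} * $cnt
        return score_add * cnt
    # Option disabled: the image passes without any score.
    return 0.0


# One matched word, with an assumed $Score{Add} of 1.0.
print(below_threshold_score(1, 1.0, True))
print(below_threshold_score(1, 1.0, False))
```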
Removed:
Util: gif2anim
This script is no longer used in the plugin, so it is removed from the
distribution, although if needed, it may be found in the previous version.
Fixed:
The plugin got stuck in an infinite loop when there was more than one
attachment with the same name, because the tie-breaking was not working.
When processing GIF files, extra care has to be taken so that ImageMagick
properly recognizes the files as GIF images, otherwise, an error occurs
because ImageMagick cannot properly determine the image 'type' and cannot
determine the image size, resulting in an invalid hash. Code is now in place
to prevent this, and in the case where invalid image size is encountered,
the processing of this image is skipped.
Changed:
When the plugin determines that words from the lists are found in the images,
it now stores these words in 'focr_db_hash'. When the same image hash is
encountered in another message, the report lists the words that were found,
giving the end user more information than the FOCR_KNOWN_IMAGE_HASH rule
firing with the previous score alone.
Require:
New Perl Module
Image::Magick;
Added:
Option: 'focr_anim_delay' Default: 100
This option is used with animated GIF files, and keeps all frames
that are displayed for at least this long (the default, 100, is 1 sec).
Option: 'focr_anim_max_frames' Default: 2
This option is used with animated GIF files, and keeps the N
largest frames.
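One way to picture the two options working together is sketched below in Python. How the plugin actually combines them is not stated here; this sketch assumes a frame is kept if it satisfies either rule, measures frame size as width times height, and uses a hypothetical `Frame` record.

```python
from typing import List, NamedTuple


class Frame(NamedTuple):
    index: int
    width: int
    height: int
    delay: int  # same unit as focr_anim_delay (100 = 1 sec)


def select_frames(frames: List[Frame], anim_delay: int = 100,
                  anim_max_frames: int = 2) -> List[Frame]:
    # Rule 1: frames displayed for at least anim_delay.
    slow = {f.index for f in frames if f.delay >= anim_delay}
    # Rule 2: the N largest frames by area (an assumed size measure).
    by_area = sorted(frames, key=lambda f: f.width * f.height, reverse=True)
    largest = {f.index for f in by_area[:anim_max_frames]}
    # Assumption: keep the union of both sets, in original order.
    keep = slow | largest
    return [f for f in frames if f.index in keep]


frames = [Frame(0, 10, 10, 5), Frame(1, 100, 100, 5),
          Frame(2, 50, 50, 200), Frame(3, 80, 80, 5)]
print([f.index for f in select_frames(frames)])
```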
Fixed:
Option: 'focr_digest_hash'
Fixed internal parameter to reflect option from original plugin (Thanks Bill).
Option: 'focr_db_hash'
Updated FuzzyOcr.cf to reflect plugin option.
Option: 'focr_db_safe'
Updated FuzzyOcr.cf to reflect plugin option.
Option: 'focr_counts_required'
Fixed the default value: it should be '2' but was set to '5'. It now behaves as in the original plugin.
Removed:
Option: 'focr_bin_identify'
Option: 'focr_bin_convert'
These options are no longer valid, since the external programs are no longer called;
the Perl module is used instead. Makes things 'simpler'.
Option: 'focr_bin_gifasm'
Option: 'focr_bin_tifftopnm'
These external programs are not used anymore.
Changed:
The plugin now uses the Image::Magick module to access ImageMagick functions from Perl instead
of running external programs. This means fewer system calls to launch external programs.
(Idea from Eric Yiu)
Added:
Option: focr_keep_bad_images
The default value for this option is zero (0).
When set to 1, the plugin will not remove a tempdir whenever it registers
an error or timeout from any of the 'helper' apps.
When set to 2, the plugin will always keep the tempdir. Beware that on heavily
loaded systems, this might fill your /tmp partition.
Util: fuzzy-cleantmp
This utility can be used to remove tempdirs left behind if the plugin was
configured to save them. It takes one parameter: hours to keep (12 by default)
This can safely be placed inside CRON to prune /tmp.
Util: gif2anim
This utility (from ImageMagick) extracts images from animated GIFs and also
reports delays and image sizes. Requires identify and
convert to work (these are required anyway, so not a problem).
Fixed:
Bug: 'convert'
An invalid parameter was specified when using 'convert' to assemble animated gifs
resulting in an error message, and the image was not scanned.
Bug: 'safe_db'
When checking for images in the safe_db hash, because we score them as zero (0),
we did not 'short circuit' correctly. This has now been fixed.
Bug: wrong_ctype
The wrong index into the Score hash was used, so the 'focr_wrongctype_score'
parameter did not take effect. This has now been fixed.
Changed:
known_image_hash
This procedure was called with two parameters: $digest and $score.
$digest was not used, so it has been removed. Also, on the off chance
that $score is zero, it uses $Score{base} to score the image.
fuzzyocr_check
Added code to better determine the name of the attachment. Sometimes, the name
is hidden in the 'content-id' header of the image/* MIME part, so we extract
it from there if no name is given when this header is available. It also makes
sure that problematic characters are changed so that they no longer give Perl
grief or, in rare instances, cause spamd to time out.
A copy of the original message is now saved in the tempdir created, so that
when we instruct the plugin to keep the created tempdir, we have a copy of the
original message to further assist in troubleshooting problems.
A file is created in tempdir containing all the expanded commands used to
process the images. This can help to troubleshoot invalid command errors.
Removed some debuglog lines to reduce the lines logged.
Uses gif2anim (if available) to extract images from animated gifs.
TODO:
I will try to use the generated anim file to root out animated gif spam where
the spam message is not in the largest frame, or is in the frame with the
largest delay, as well as other tricks...
Fixed:
Properly initialized $h and $w to zero so that when getting the height and width
from an image, if the size parameters cannot be parsed, they can get properly
tested.
Hashing now works. $digest was getting reset because it went out of scope. grrr.
$efile was only being replaced for first occurrence in complex scansets generating
$efile: ambiguous redirect errors.
Various bugs where 'Use of uninitialized value' warnings were reported.
Fixed:
Option: 'focr_db_safe'
This option was not included in the @pgm_options array.... oops (thanks UxBoD)
Score: wrongctype
This was not used correctly, thus it was not scoring... (thanks Eric)
Changed:
It now works with tempfiles only
This hopefully reduces the need to read/write image data from memory after each
'filter', which should lower IO and memory usage for the plugin.
Scanset Syntax: $pfile
Because of the use of tempfiles, there is a need to specify the image file to be
used as input. '$pfile' must be used to specify the input filename. Please note
that in cases where scansets use pipes, only specify $pfile as the input to the
first 'filter' program.
Scanset Syntax: $efile
With every scanset, stderr is redirected to '$efile', which is different for each
image. When using multiple filters in a scanset, use '$efile' to redirect the stderr
of each filter to this file, making sure the plugin correctly recognizes an error
when it occurs.
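The placeholder expansion for a piped scanset can be sketched as below. This is a Python illustration, not the plugin's Perl code; `filter_a`, `filter_b`, the redirection style and the file names are hypothetical. The point is that every occurrence of '$efile' has to be replaced, since a piped scanset may mention it more than once.

```python
def expand_scanset(command: str, pfile: str, efile: str) -> str:
    """Expand $pfile and $efile in a scanset command template."""
    # str.replace substitutes every occurrence, not just the first one.
    return command.replace("$pfile", pfile).replace("$efile", efile)


# Hypothetical two-filter scanset: $pfile feeds only the first filter,
# while each filter sends its stderr to $efile.
template = "filter_a $pfile 2>$efile | filter_b 2>>$efile"
print(expand_scanset(template, "/tmp/demo/img.pnm", "/tmp/demo/img.err"))
```

If only the first '$efile' were replaced, the shell would see a literal, unset `$efile` after the second redirect, which is the kind of situation that produces an "ambiguous redirect" message.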
Require:
Plugin officially requires SA 3.1.4 or higher
New Perl Modules
DB_File
Storable
MLDBM
Previous
String::Approx
Removed:
Option: 'focr_pre314'
Not used as it now requires SA 3.1.4
Added:
Option: 'focr_path_bin'
Its value is treated as path for searching of @bin_utils, potentially
requiring less configuration options;
Directories in the path that don't exist are skipped;
Default value: /usr/local/netpbm/bin:/usr/local/bin:/usr/bin
Option: 'focr_db_hash'
Its value holds the filename to use for storing hash database; See below.
Default value: /etc/mail/spamassassin/FuzzyOcr.db
Option: 'focr_db_safe'
Its value holds the filename to use for storing hash database; See below.
Default value: /etc/mail/spamassassin/FuzzyOcr.safe.db
Option: 'focr_db_max_days'
Its value holds the number of days after which records in the hash database expire if not matched; See below.
Default value: 35
Option: 'focr_keep_bad_images'
If this is set to 1, the plugin will not remove the temporary directory
in which the images are stored and processed if it determines that the
image was corrupt, or if an error occurred in any of the auxiliary
programs that process the images. Useful while debugging.
Default value: 0
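The 'focr_path_bin' lookup described above amounts to a path search that skips missing directories. A minimal Python sketch follows; the function name `find_binary` is hypothetical and the plugin's own Perl code may differ in detail.

```python
import os
from typing import Optional


def find_binary(name: str, search_path: str) -> Optional[str]:
    """Return the first executable called `name` found in a ':'-separated path."""
    for directory in search_path.split(":"):
        # Directories that do not exist are skipped.
        if not os.path.isdir(directory):
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


# Default search path from the option description; prints None if not installed.
print(find_binary("gifsicle", "/usr/local/netpbm/bin:/usr/local/bin:/usr/bin"))
```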
Changed:
Option: 'focr_logfile'
Defaults to 'stderr' so that logging goes there
Option: 'focr_enable_image_hashing' if set to 2:
Use MLDBM to store Hash info in true DB file for faster access.
Stores hashes of images that exceed set thresholds in file
specified by option focr_db_hash
Stores hashes of 'clean' images (without matching words)
specified by option focr_db_safe to also cache good images.
Keeps statistics of Hash-Hits and displays #times matched in log.
Saves name of attachment and content/type as reference
Automatically imports known-hashes from focr_digest_db into focr_db_hash
Automatically expire 'old' records if not matched in more than
the number of days specified in option 'focr_db_max_days'
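The expiry rule can be pictured with the short Python sketch below. The record layout (a dictionary with a `last_match` timestamp per image hash) is an assumption for illustration; the plugin stores its records through MLDBM and its actual fields are not shown here.

```python
import time
from typing import Dict, Optional

SECONDS_PER_DAY = 86400


def expire_old(records: Dict[str, dict], max_days: int = 35,
               now: Optional[float] = None) -> Dict[str, dict]:
    """Drop records not matched in more than max_days days."""
    now = time.time() if now is None else now
    cutoff = now - max_days * SECONDS_PER_DAY
    return {digest: rec for digest, rec in records.items()
            if rec["last_match"] >= cutoff}


now = 100 * SECONDS_PER_DAY
records = {
    "hash-a": {"last_match": 99 * SECONDS_PER_DAY},  # matched yesterday
    "hash-b": {"last_match": 10 * SECONDS_PER_DAY},  # matched 90 days ago
}
print(sorted(expire_old(records, max_days=35, now=now)))
```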
Instead of having a 'global' timeout, 'focr_timeout' is now applied per
external program. This ensures that no timeouts are recorded because of
complex scansets or temporary spikes in load. The plugin also displays the
name and return code information for the binary that timed out, making it
easier to debug problems.
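The per-program timeout can be sketched as below. This Python illustration uses `subprocess.run` in place of the plugin's Perl and Mail::SpamAssassin::Timeout; the function name `run_helper` is hypothetical, and in this sketch no return code is available when the helper is killed on timeout.

```python
import subprocess
import sys
from typing import List, Optional, Tuple


def run_helper(argv: List[str], timeout: float) -> Tuple[str, Optional[int], bool]:
    """Run one helper with its own timeout; return (name, returncode, timed_out)."""
    try:
        proc = subprocess.run(argv, capture_output=True, timeout=timeout)
        return argv[0], proc.returncode, False
    except subprocess.TimeoutExpired:
        # The child is killed; report which program exceeded its limit.
        return argv[0], None, True


# Each helper gets the full timeout, so a long chain of fast helpers
# does not trip a shared global limit.
name, code, timed_out = run_helper(
    [sys.executable, "-c", "import time; time.sleep(5)"], timeout=0.5)
print(name, code, timed_out)
```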
Fixed:
A bug where option focr_counts_required was not recognized;
Logging to file when option 'focr_logfile' set now works;
Individual word scores are now applied correctly
Storing only images with matched words to hash database (Thanks to Robert LeBlanc)
Explicitly use Mail::SpamAssassin::Timeout (Thanks Eric Yiu)
Ignores empty lines in wordlists (global and local)
Ignores comments starting with (#) to EOL
Require:
Plugin officially requires SA 3.1.1 or higher
Added:
Support for BMP/TIFF Images
Changed:
Major internal restructuring
Use SpamAssassin Logging Facility instead of own logfile
Fixed:
A bug related to database hashing
Updated: Nov 2, 2006