version 2.3j:
  Fixed:
    sh: $efile: ambiguous redirect
      This message was generated when using complex scansets, because
      the 'value' was only translated once. In complex scansets, this
      value may be specified multiple times.
    FuzzyOcr.cf
      Fixed outstanding errors. Variable mismatches are now fixed.
    FuzzyOcr.pm
      Traps ImageMagick errors better, and logs them.
      When processing animated GIF files, the algorithm can end up
      discarding all frames, leaving an empty image. This special case
      is now treated as a corrupt image, and triggers
      FUZZY_OCR_CORRUPT_IMG with $Score{corrupt} points (2.5 by default).
  Changed:
    Option: focr_personal_wordlist
      If the option value begins with '/', the value is now treated as
      a fixed path instead of as relative to the effective user's HOME
      directory.

version 2.3i:
  Added:
    Option: 'focr_score_ham'
      Default: 0.0
      When set to 1, images that are below the 'focr_counts_required'
      threshold are scored with the formula: $Score{Add} * $cnt;
      this gives marginally bad images some positive score instead of
      just allowing them without score.
  Removed:
    Util: gif2anim
      This script is no longer used in the plugin, so it is removed
      from the distribution. If needed, it may be found in the previous
      version.
  Fixed:
    The plugin was stuck in an infinite loop when there was more than
    one attachment with the same name. The tie-breaking was not working.
    When processing GIF files, extra care has to be taken so that
    ImageMagick properly recognizes the files as GIF images. Otherwise,
    an error occurs because ImageMagick cannot determine the image
    'type' or the image size, resulting in an invalid hash. Code is now
    in place to prevent this, and when an invalid image size is
    encountered, processing of that image is skipped.
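The 'focr_score_ham' formula from version 2.3i can be illustrated with a small sketch. This is not the plugin's code (the plugin is written in Perl); the function name and the example values are hypothetical, and only the formula $Score{Add} * $cnt and the option names come from the notes above.

```python
def below_threshold_score(cnt, counts_required, add_score, score_ham=0):
    """Illustrative only: score for an image whose matched-word count
    is below the focr_counts_required threshold.

    cnt             -- number of matched words found in the image
    counts_required -- value of focr_counts_required
    add_score       -- stands in for $Score{Add} (value is an assumption)
    score_ham       -- value of focr_score_ham (0 or 1)
    """
    if cnt >= counts_required:
        # At or above the threshold the regular scoring applies,
        # which this sketch does not cover.
        return None
    if score_ham == 1:
        # Formula quoted in the changelog: $Score{Add} * $cnt
        return add_score * cnt
    # Option disabled: the image passes without any score.
    return 0.0
```

With a hypothetical add_score of 0.5 and a threshold of 5, an image with 3 matched words would get 1.5 points when focr_score_ham is 1, and 0.0 when it is left at its default.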
  Changed:
    When the plugin determines that words from the lists are found in
    the images, it now stores these words in 'focr_db_hash'. When the
    same image hash is encountered in another message, the report
    includes the words found, giving the end user more information
    than just the FOCR_KNOWN_IMAGE_HASH rule firing with the previous
    score.

version 2.3h:
  Require:
    New Perl Module: Image::Magick
  Added:
    Option: 'focr_anim_delay'
      Default: 100
      This option is used with animated GIF files, and keeps all
      images that are displayed for at least 1 sec.
    Option: 'focr_anim_max_frames'
      Default: 2
      This option is used with animated GIF files, and keeps the top N
      largest frames.
  Fixed:
    Option: 'focr_digest_hash'
      Fixed internal parameter to reflect option from original plugin
      (Thanks Bill).
    Option: 'focr_db_hash'
      Updated FuzzyOcr.cf to reflect plugin option.
    Option: 'focr_db_safe'
      Updated FuzzyOcr.cf to reflect plugin option.
    Option: 'focr_counts_required'
      The default value of '2' was set to '5', making it behave as in
      the original plugin.
  Removed:
    Option: 'focr_bin_identify'
    Option: 'focr_bin_convert'
      These options are no longer valid, since the external programs
      are no longer called; the Perl module is used instead. Makes
      things 'simpler'.
    Option: 'focr_bin_gifasm'
    Option: 'focr_bin_tifftopnm'
      External programs not used anymore.
  Changed:
    The plugin now uses the Image::Magick module to access ImageMagick
    functions from Perl instead of calling external programs. This
    makes for fewer system calls. (Idea from Eric Yiu)

version 2.3g:
  Added:
    Option: focr_keep_bad_images
      The default value for this option is zero (0).
      When set to 1, the plugin will not remove a tempdir whenever it
      registers an error or timeout from any of the 'helper' apps.
      When set to 2, the plugin will always keep the tempdir. Beware
      that on heavily loaded systems, this might fill your /tmp
      partition.
    Util: fuzzy-cleantmp
      This utility can be used to remove tempdirs left behind if the
      plugin was configured to save them. It takes one parameter:
      hours to keep (12 by default). It can safely be placed inside
      CRON to prune /tmp.
    Util: gif2anim
      This utility (from ImageMagick) extracts images from animated
      GIFs and reports delays and image sizes. It requires identify
      and convert to work (these are required anyway, so not a
      problem).
  Fixed:
    Bug: 'convert'
      An invalid parameter was specified when using 'convert' to
      assemble animated GIFs, resulting in an error message, and the
      image was not scanned.
    Bug: 'safe_db'
      When checking for images in the safe_db hash, because we score
      them as zero (0), we did not 'short circuit' correctly. This has
      now been fixed.
    Bug: wrong_ctype
      The wrong index into the Score hash was used, so the
      'focr_wrongctype_score' parameter did not take effect. This has
      now been fixed.
  Changed:
    known_image_hash
      This procedure was called with two parameters: $digest and
      $score. $digest was not used, so it has been removed. Also, on
      the off chance that $score is zero, it uses $Score{base} to
      score the image.
    fuzzyocr_check
      Added code to better determine the name of the attachment.
      Sometimes the name is hidden in the 'content-id' header of the
      image/* MIME part, so when no name is given and this header is
      available, the name is extracted from there. It also makes sure
      that problematic characters are changed so as to not give Perl
      any more grief.
      A copy of the original message is now saved in the tempdir
      created, so that when the plugin is instructed to keep the
      tempdir, the original message is available to assist in
      troubleshooting problems.
      A file is created in the tempdir containing all the expanded
      commands used to process the images. This can help to
      troubleshoot invalid command errors.
      Removed some debuglog lines to reduce the lines logged.
      Uses gif2anim (if available) to extract images from animated
      GIFs.
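The kind of pruning that fuzzy-cleantmp is described as doing (remove leftover tempdirs older than a number of hours, 12 by default) might look like the sketch below. It is not the utility's actual code: the function name, the 'prefix' parameter and the use of directory modification time are assumptions, since the notes above do not say how the tempdirs are named or aged.

```python
import os
import shutil
import time


def prune_tempdirs(base, prefix, hours=12, now=None):
    """Remove directories under 'base' whose names start with 'prefix'
    (hypothetical naming) and whose mtime is older than 'hours' hours.
    Returns the list of removed paths."""
    now = time.time() if now is None else now
    cutoff = now - hours * 3600
    removed = []
    for entry in sorted(os.listdir(base)):
        path = os.path.join(base, entry)
        # Only touch real directories matching the assumed prefix.
        if not entry.startswith(prefix):
            continue
        if os.path.islink(path) or not os.path.isdir(path):
            continue
        if os.stat(path).st_mtime < cutoff:
            shutil.rmtree(path, ignore_errors=True)
            removed.append(path)
    return removed
```

Restricting the sweep to a known prefix is a deliberate choice here: /tmp is shared, so a cron job should only delete directories it can attribute to the plugin.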
  TODO:
    I will try to use the generated anim file to root out animated GIF
    spam where the spam message is not in the largest frame, or is in
    the frame with the largest delay, as well as other tricks...

version 2.3f:
  Fixed:
    Properly initialized $h and $w to zero, so that when the size
    parameters of an image cannot be parsed, the height and width can
    still be tested properly.
    Hashing now works. $digest was getting reset because it went out
    of scope. grrr.
    $efile was only being replaced for the first occurrence in complex
    scansets.
    Various bugs where 'Use of uninitialized value' warnings were
    reported.

version 2.3e:
  Fixed:
    Option: 'focr_db_safe'
      This option was not included in the @pgm_options array.... oops
      (thanks UxBoD)
    Score: wrongctype
      This was not used correctly, thus it was not scoring...
      (thanks Eric)
  Changed:
    It now works with tempfiles only
      This should reduce the need to read/write image data from memory
      after each 'filter', and will hopefully reduce IO and memory
      usage for the plugin.
    Scanset Syntax: $pfile
      Because of the use of tempfiles, the image file to be used as
      input must be specified. '$pfile' must be used to specify the
      input filename. Please note that in cases where scansets use
      pipes, $pfile should only be specified as the input to the first
      'filter' program.
    Scanset Syntax: $efile
      With every scanset, stderr is redirected to '$efile', which is
      different for each image. When using multiple filters in a
      scanset, use '$efile' to redirect stderr to this file, making
      sure the plugin will correctly recognize an error when it
      occurs.
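The $pfile and $efile placeholders, and the requirement that every occurrence in a complex scanset be substituted, can be sketched as follows. This is an illustration only: the function name, the filter names 'filter_a' and 'filter_b', and the file paths are hypothetical, and the plugin itself performs this expansion in Perl.

```python
def expand_scanset(command, pfile, efile):
    """Substitute every occurrence of the $pfile and $efile
    placeholders in a scanset command line.

    str.replace substitutes all occurrences, which is the behaviour a
    piped scanset needs: each filter redirects stderr to the same
    per-image $efile.
    """
    return command.replace("$pfile", pfile).replace("$efile", efile)


# Hypothetical two-stage scanset: only the first filter reads $pfile,
# both stages append their stderr to $efile.
scanset = "filter_a $pfile 2>>$efile | filter_b - 2>>$efile"
expanded = expand_scanset(scanset, "/tmp/img.pnm", "/tmp/img.err")
```

If only the first $efile were replaced, the shell would see a literal, unset '$efile' as a redirect target in the second stage, which matches the 'ambiguous redirect' symptom described for complex scansets.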

version 2.3d:
  Require:
    Plugin officially requires SA 3.1.4 or higher
    New Perl Modules:
      DB_File
      Storable
      MLDBM
    Previous:
      String::Approx
  Removed:
    Option: 'focr_pre314'
      Not used, as the plugin now requires SA 3.1.4
  Added:
    Option: 'focr_path_bin'
      Its value is treated as the path for searching @bin_utils,
      potentially requiring fewer configuration options;
      Directories in the path that do not exist are skipped;
      Default value: /usr/local/netpbm/bin:/usr/local/bin:/usr/bin
    Option: 'focr_db_hash'
      Its value holds the filename to use for storing the hash
      database; See below.
      Default value: /etc/mail/spamassassin/FuzzyOcr.db
    Option: 'focr_db_safe'
      Its value holds the filename to use for storing the hash
      database of 'clean' images; See below.
      Default value: /etc/mail/spamassassin/FuzzyOcr.safe.db
    Option: 'focr_db_max_days'
      Its value holds the number of days after which unmatched hash
      records are expired; See below.
      Default value: 35
    Option: 'focr_keep_bad_images'
      If this is set to 1, the plugin will not remove the temporary
      directory where the images are stored and processed if it
      determines that the image was corrupt, or an error occurred with
      any of the auxiliary programs that process the images. Useful
      while debugging.
      Default value: 0
  Changed:
    Option: 'focr_logfile'
      Defaults to 'stderr' so that logging goes there
    Option: 'focr_enable_image_hashing'
      If set to 2:
        Uses MLDBM to store hash info in a true DB file for faster
        access.
        Stores hashes of images that exceed set thresholds in the file
        specified by option focr_db_hash.
        Stores hashes of 'clean' images (without matching words) in
        the file specified by option focr_db_safe, to also cache good
        images.
        Keeps statistics of hash hits and displays the number of times
        matched in the log.
        Saves the name of the attachment and content/type as reference.
        Automatically imports known hashes from focr_digest_db into
        focr_db_hash.
        Automatically expires 'old' records if not matched in more
        than the number of days specified in option
        'focr_db_max_days'.
    Instead of having a 'global' timeout, 'focr_timeout' is now
    applied to each external program used. This ensures that no
    timeouts are recorded because of complex scansets, or because of
    temporary spikes in load. The plugin also displays the name and
    return code of the binary that timed out, making it easier to
    debug problems.
  Fixed:
    A bug where option focr_counts_required was not recognized;
    Logging to file when option 'focr_logfile' is set now works;
    Individual word scores are now applied correctly
    Storing only images with matched words to hash database
    (Thanks to Robert LeBlanc)
    Explicitly use Mail::SpamAssassin::Timeout (Thanks Eric Yiu)
    Ignores empty lines in wordlists (global and local)
    Ignores comments starting with (#) to EOL

version 2.3c:
  Require:
    Plugin officially requires SA 3.1.1 or higher
  Added:
    Support for BMP/TIFF Images
  Changed:
    Major internal restructuring
    Use SpamAssassin Logging Facility instead of own logfile
  Fixed:
    A bug related to database hashing
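
The wordlist handling noted under version 2.3d (empty lines ignored, comments from '#' to end of line ignored) can be sketched as below. This is an assumption-laden illustration, not the plugin's Perl code: the function name is hypothetical, and any further structure inside an entry (for example a per-word score) is left unparsed because the notes do not describe the file format.

```python
def parse_wordlist(text):
    """Return the non-empty entries of a wordlist, with comments
    (from '#' to end of line) and surrounding whitespace removed."""
    entries = []
    for line in text.splitlines():
        # Drop everything from the first '#' to the end of the line.
        line = line.split("#", 1)[0].strip()
        if line:
            entries.append(line)
    return entries
```

For example, a file containing 'alpha', a blank line, a full-line comment and 'beta # trailing note' would yield the two entries 'alpha' and 'beta'.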