FuzzyOcr 2.3g

This is a new distribution based on a plugin developed by Christian Holler (decoder_at_own-hero_dot_net).

Changes

version 2.3g

    Added:
        Option: focr_keep_bad_images
            The default value for this option is zero (0).
            When set to 1, the plugin will not remove a tempdir whenever it registers
                an error or timeout from any of the 'helper' apps.
            When set to 2, the plugin will always keep the tempdir. Beware that on heavily
                loaded systems, this might fill your /tmp partition.
        
        Util: fuzzy-cleantmp
            This utility can be used to remove tempdirs left behind if the plugin was
            configured to save them. It takes one parameter: hours to keep (12 by default).
            It can safely be run from cron to prune /tmp.
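
The pruning idea behind fuzzy-cleantmp can be sketched as follows. This is only an illustration in Python, not the utility's actual code: the function name, the "demo." directory prefix, and the use of modification time as the age test are all assumptions.

```python
import os
import shutil
import time


def prune_tempdirs(root, prefix, keep_hours=12.0, now=None):
    """Remove directories directly under `root` whose name starts with
    `prefix` (hypothetical naming) and whose mtime is older than
    `keep_hours`. Returns the sorted list of removed paths."""
    now = time.time() if now is None else now
    cutoff = now - keep_hours * 3600
    removed = []
    for entry in os.scandir(root):
        # Only consider real directories, never symlinks.
        if not entry.is_dir(follow_symlinks=False):
            continue
        if not entry.name.startswith(prefix):
            continue
        if entry.stat(follow_symlinks=False).st_mtime < cutoff:
            shutil.rmtree(entry.path)
            removed.append(entry.path)
    return sorted(removed)
```

Matching on a name prefix keeps the sweep away from unrelated directories in /tmp; the real prefix would have to be taken from the plugin itself.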

        Util: gif2anim
            This utility (from ImageMagick) extracts images from animated GIFs and
            reports delays and image sizes. It requires identify and convert, which
            the plugin already requires, so this is not a problem.

    Fixed:
        Bug: 'convert'
            An invalid parameter was passed to 'convert' when assembling animated GIFs.
            This produced an error message, and the image was not scanned.

        Bug: 'safe_db'
            When checking for images in the safe_db hash, because we score them as zero (0),
            we did not 'short circuit' correctly. This has now been fixed.
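
The changelog does not show the faulty code, so the following is only a guess at the general pattern: a lookup where a stored score of zero is a valid hit and must not be mistaken for "not found". The function and variable names are hypothetical, and Python stands in for the plugin's Perl.

```python
def lookup_score(safe_db, digest):
    """Return (found, score). A stored score of 0 still counts as a hit."""
    score = safe_db.get(digest)
    # Test for absence explicitly; `if not score:` would also reject 0
    # and fall through to a full scan of a known-safe image.
    if score is None:
        return (False, None)
    return (True, score)
```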

        Bug: wrong_ctype
            The wrong index into the Score hash was used, which prevented the
            'focr_wrongctype_score' parameter from taking effect. This has now been fixed.

    Changed:
        known_image_hash
            This procedure was called with two parameters: $digest and $score.
            $digest was not used, so it has been removed. Also, on the off chance
            that $score is zero, it now uses $Score{base} to score the image.

        fuzzyocr_check
            Added code to better determine the name of the attachment. Sometimes the name
            is hidden in the 'content-id' header of the image/* MIME part, so when no name
            is given and this header is available, we extract it from there. It also makes
            sure that problematic characters are replaced, so that they no longer give Perl
            grief or, in rare instances, cause spamd to time out.

            A copy of the original message is now saved in the tempdir created, so that
            when we instruct the plugin to keep the created tempdir, we have a copy of the
            original message to further assist in troubleshooting problems.

            A file is created in tempdir containing all the expanded commands used to
            process the images. This can help to troubleshoot invalid command errors. 

            Removed some debuglog lines to reduce the lines logged.

            Uses gif2anim (if available) to extract images from animated gifs.
            TODO:
                I will try to use the generated anim file to root out animated gif spam where
                the spam message is not in the largest frame, or is in the frame with the
                largest delay, as well as other tricks...
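
The attachment-name handling described under fuzzyocr_check (fall back to the 'content-id' header when no name is given, then replace problematic characters) might look roughly like this. It is a Python sketch rather than the plugin's Perl; the function name, the "unnamed" fallback, and the exact set of characters treated as safe are assumptions.

```python
import re
from email import message_from_string


def attachment_name(part, fallback="unnamed"):
    """Pick a name for a MIME part: the declared filename if present,
    otherwise the Content-ID value, with unsafe characters replaced."""
    name = part.get_filename()
    if not name:
        # Content-ID values are usually wrapped in angle brackets.
        name = part.get("Content-ID", "").strip().strip("<>")
    # Assumed safe set: letters, digits, dot, underscore, hyphen.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    return name or fallback


part = message_from_string(
    "Content-Type: image/gif\nContent-ID: <pic 01@example>\n\nGIF89a"
)
print(attachment_name(part))
```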

version 2.3f

    Fixed:
        $h and $w are now properly initialized to zero, so that if the size parameters
            of an image cannot be parsed, the height and width can still be tested
            properly.
        
        Hashing now works. $digest was getting reset because it went out of scope. grrr.

        $efile was only being replaced at its first occurrence in complex scansets,
            generating '$efile: ambiguous redirect' errors.

        Various bugs where 'Use of uninitialized value' warnings were reported.

version 2.3e

    Fixed:
        Option: 'focr_db_safe'
            This option was not included in the @pgm_options array.... oops (thanks UxBoD)

        Score: wrongctype
            This was not used correctly, thus it was not scoring... (thanks Eric)

    Changed:
        It now works with tempfiles only
            This removes the need to read/write image data from memory after each
            'filter', which will hopefully reduce I/O and memory usage for the plugin.

        Scanset Syntax: $pfile
            Because of the use of tempfiles, there is a need to specify the image file to be
            used as input. '$pfile' must be used to specify the input filename. Please note
            that in cases where scansets use pipes, only specify $pfile as the input to the
            first 'filter' program.

        Scanset Syntax: $efile
            With every scanset, stderr is redirected to '$efile', which is different for each
            image. When using multiple filters in a scanset, use '$efile' to redirect the
            stderr of each filter to this file, so that the plugin correctly recognizes an
            error when it occurs.
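
As an illustration of the two placeholders, the sketch below expands every occurrence of $pfile and $efile in a scanset command. It is written in Python for illustration only; the filter names filter_a and filter_b, the shell redirections, and the file paths are hypothetical and are not taken from a real scanset.

```python
import shlex


def expand_scanset(command, pfile, efile):
    """Replace every $pfile and $efile placeholder in a scanset command.
    str.replace substitutes all occurrences, not only the first one."""
    return (command
            .replace("$pfile", shlex.quote(pfile))
            .replace("$efile", shlex.quote(efile)))


# $pfile feeds only the first filter; every filter sends stderr to $efile.
template = "filter_a $pfile 2>$efile | filter_b - 2>>$efile"
print(expand_scanset(template, "/tmp/d/img.pnm", "/tmp/d/img.err"))
```

If any placeholder were left unexpanded, the shell would see a literal $efile as the redirect target, which is the kind of 'ambiguous redirect' error such commands can produce.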
            

version 2.3d

    Require:
        Plugin officially requires SA 3.1.4 or higher
        New Perl Modules
            DB_File
            Storable
            MLDBM
        Previous
            String::Approx

    Removed:
        Option: 'focr_pre314'
            Not used as it now requires SA 3.1.4

    Added:
        Option: 'focr_path_bin'
            Its value is treated as a search path for the @bin_utils, potentially
                requiring fewer configuration options;
            Directories in the path that do not exist are skipped;
            Default value: /usr/local/netpbm/bin:/usr/local/bin:/usr/bin
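
A path search of the kind described for focr_path_bin can be sketched like this. The Python function below is an illustration under assumptions (its name and return shape are invented here); it only mirrors the two stated behaviours: directories are tried in order, and directories that do not exist are skipped.

```python
import os


def find_in_path(names, search_path):
    """Map each program name to the first executable found in the
    colon-separated search_path, or None if it is not found."""
    found = {name: None for name in names}
    for directory in search_path.split(":"):
        if not os.path.isdir(directory):
            continue  # skip directories that do not exist
        for name in names:
            if found[name] is not None:
                continue  # an earlier directory already supplied it
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                found[name] = candidate
    return found
```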

        Option: 'focr_db_hash'
            Its value holds the filename to use for storing hash database; See below.
            Default value: /etc/mail/spamassassin/FuzzyOcr.db

        Option: 'focr_db_safe'
            Its value holds the filename to use for storing the database of 'clean' image hashes; See below.
            Default value: /etc/mail/spamassassin/FuzzyOcr.safe.db

        Option: 'focr_db_max_days'
            Its value holds the number of days after which records that have not been matched are expired; See below.
            Default value: 35
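
The expiry controlled by focr_db_max_days (records not matched in more than that many days are dropped) can be sketched as below. This is a Python illustration only: the record layout with a 'last_match' timestamp is an assumption and not the plugin's actual MLDBM schema.

```python
import time

SECONDS_PER_DAY = 86400


def expire_old(records, max_days=35, now=None):
    """Return only the records matched within the last max_days days.
    `records` maps digest -> dict holding a 'last_match' epoch time
    (hypothetical layout)."""
    now = time.time() if now is None else now
    cutoff = now - max_days * SECONDS_PER_DAY
    return {d: r for d, r in records.items() if r["last_match"] >= cutoff}
```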

        Option: 'focr_keep_bad_images'
            If this is set to 1, the plugin will not remove the temporary directory
                in which the images are stored and processed if it determines
                that the image was corrupt, or that an error occurred in any
                of the auxiliary programs that process the images. Useful while
                debugging.
            Default value: 0
            

    Changed:
        Option: 'focr_logfile'
            Defaults to 'stderr' so that logging goes there
        Option: 'focr_enable_image_hashing' if set to 2:
            Use MLDBM to store Hash info in true DB file for faster access.
            Stores hashes of images that exceed set thresholds in file
                specified by option focr_db_hash
            Stores hashes of 'clean' images (without matching words)
                specified by option focr_db_safe to also cache good images.
            Keeps statistics of Hash-Hits and displays #times matched in log.
            Saves name of attachment and content/type as reference
            Automatically imports known-hashes from focr_digest_db into focr_db_hash
            Automatically expire 'old' records if not matched in more than
                the number of days specified in option 'focr_db_max_days'
        Instead of having a 'global' timeout, 'focr_timeout' is now applied to each
            external program separately. This ensures that no timeouts are
            recorded merely because of complex scansets or temporary spikes
            in load. It also now displays the name and return code information
            for the binary that timed out, making it easier to debug problems.
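
A per-program timeout, as opposed to one global timeout, can be sketched as follows. The plugin itself relies on Mail::SpamAssassin::Timeout; this Python version is only an analogy, and the function name and return values are assumptions.

```python
import subprocess


def run_with_timeout(argv, timeout):
    """Run one helper program under its own timeout (in seconds).
    Returns ('ok', 0), ('error', returncode) or ('timeout', program)."""
    try:
        result = subprocess.run(
            argv,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return ("timeout", argv[0])
    if result.returncode != 0:
        return ("error", result.returncode)
    return ("ok", 0)
```

Because each call carries its own limit, a long chain of helpers cannot exhaust a shared budget, and the result names the program that failed.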

    Fixed:
        A bug where option focr_counts_required was not recognized;
        Logging to file when option 'focr_logfile' set now works;
        Individual word scores are now applied correctly
        Storing only images with matched words to hash database (Thanks to Robert LeBlanc)
        Explicitly use Mail::SpamAssassin::Timeout (Thanks Eric Yiu)
        Ignores empty lines in wordlists (global and local)
        Ignores comments starting with (#) to EOL
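
The last two fixes (empty lines and '#' comments in wordlists) amount to a small parsing rule, sketched here in Python for illustration. The function name is hypothetical, and each remaining line is treated as an opaque entry because the wordlist entry format is not described here.

```python
def parse_wordlist(text):
    """Return the wordlist entries, skipping empty lines and stripping
    comments that run from '#' to the end of the line."""
    entries = []
    for line in text.splitlines():
        # Drop the comment part, then surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if line:
            entries.append(line)
    return entries
```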

version 2.3c

    Require:
        Plugin officially requires SA 3.1.1 or higher
    
    Added:
        Support for BMP/TIFF Images

    Changed:
        Major internal restructuring
        Use SpamAssassin Logging Facility instead of own logfile

    Fixed:
        A bug related to database hashing

Updated: Sep 12, 2006