FuzzyOcr 3.4(svn)

This is a new distribution based on a plugin developed by Christian Holler decoder_at_own-hero_dot_net

SVN

News

To remove the confusion regarding versions of this plugin, there is now an SVN repository where the latest version of the plugin resides. This latest version, in particular, has removed the Image::Magick dependency, which was very troublesome for a lot of people, but adds gifsicle as a required dependency to deal with animated GIF images.

Even though this new version is still considered in development, it is very stable.

Latest

    svn://fuzzyocr.own-hero.net/trunk/devel

Changes

version 2.3j

    Fixed:
        sh: $efile: ambiguous redirect
        This message was being generated when using complex scansets, because
        the 'value' was only translated once. In complex scansets, this value
        may be specified multiple times.

        FuzzyOcr.cf
        Fixed outstanding errors. Variable mismatches are now fixed.

        FuzzyOcr.pm
        Traps ImageMagick errors better and logs them.

        When processing Animated-GIF files, due to the algorithm, it is possible
        to discard all frames, leaving an empty image.  Now, this special case
        is treated as a corrupt image, and triggers FUZZY_OCR_CORRUPT_IMG with
        $Score{corrupt} points (2.5 by default).

    Changed:
        Option: focr_personal_wordlist
        Now, if the option value begins with '/', the value is not treated as
        relative to the effective user's HOME directory, but as a fixed path.
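
The path rule above can be illustrated with a short sketch. The plugin itself is written in Perl; this Python version only mirrors the described behavior, and the function name and example paths are hypothetical.

```python
import posixpath


def resolve_wordlist_path(option_value, home_dir):
    """Return the file path for a focr_personal_wordlist value.

    A value starting with '/' is used as a fixed path; anything else is
    taken relative to the effective user's HOME directory.
    """
    if option_value.startswith("/"):
        return option_value
    return posixpath.join(home_dir, option_value)


# Hypothetical values, for illustration only.
print(resolve_wordlist_path("/etc/mail/wordlist", "/home/alice"))
print(resolve_wordlist_path(".spamassassin/wordlist", "/home/alice"))
```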

version 2.3i

    Added:
        Option: 'focr_score_ham'  Default: 0.0
        When set to 1, images that are below the 'focr_counts_required' threshold
        are scored with the formula $Score{Add} * $cnt. This gives marginally bad
        images some positive score instead of just allowing them without score.
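
A minimal sketch of the formula above, in Python rather than the plugin's Perl. It assumes $cnt is the number of matched words and that images at or above the threshold go through the normal scoring path, which is not modeled here; the function and parameter names are hypothetical.

```python
def below_threshold_score(cnt, counts_required, add_score, score_ham):
    """Score an image whose match count is below the spam threshold.

    Returns add_score * cnt when focr_score_ham is enabled, else 0.0.
    Images at or above the threshold are outside this sketch, so None
    is returned for them.
    """
    if cnt >= counts_required:
        return None
    if not score_ham:
        return 0.0
    return add_score * cnt
```

With a hypothetical add score of 1.0 and a threshold of 2, an image with one matched word would receive 1.0 point when the option is enabled and 0.0 otherwise.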
        
    Removed:
        Util: gif2anim
        This script is no longer used in the plugin, so it is removed from the
        distribution, although if needed, it may be found in the previous version.

    Fixed:
        The plugin was stuck in an infinite loop when more than one attachment
        had the same name. The tie-breaking was not working.

        When processing GIF files, extra care has to be taken so that ImageMagick
        properly recognizes the files as GIF images, otherwise, an error occurs 
        because ImageMagick cannot properly determine the image 'type' and cannot
        determine the image size, resulting in an invalid hash. Code is now in place
        to prevent this, and in the case where invalid image size is encountered,
        the processing of this image is skipped.

    Changed:
        When the plugin determines that words from the lists are found in the images,
        it now stores these words in 'focr_db_hash'. When the same image hash is
        encountered in another message, the report lists the words that were found,
        giving the end user more information, instead of just the
        FOCR_KNOWN_IMAGE_HASH rule firing with the previous score.

version 2.3h

    Require:
        New Perl Module
            Image::Magick;
    Added:
        Option: 'focr_anim_delay'  Default: 100
            This option is used with animated GIF files, and keeps all frames
            that are displayed for at least this long (1 sec. with the default value).

        Option: 'focr_anim_max_frames' Default: 2
            This option is used with animated GIF files, and keeps the N
            largest frames.
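
The two options above can be pictured with the following Python sketch (the plugin itself is Perl). How the plugin combines the two rules is not stated here, so taking their union is an assumption, as are the Frame type, the function name, and the reading of the delay as hundredths of a second.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    index: int   # position in the animation
    width: int
    height: int
    delay: int   # display time, assumed to be in 1/100 s


def select_frames(frames, anim_delay=100, anim_max_frames=2):
    """Keep frames shown for at least anim_delay, plus the N largest frames.

    Combining the two rules as a union is an assumption of this sketch.
    """
    long_shown = {f.index for f in frames if f.delay >= anim_delay}
    by_area = sorted(frames, key=lambda f: f.width * f.height, reverse=True)
    largest = {f.index for f in by_area[:anim_max_frames]}
    keep = long_shown | largest
    # Preserve the original frame order in the result.
    return [f for f in frames if f.index in keep]
```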

    Fixed:
        Option: 'focr_digest_hash'
            Fixed internal parameter to reflect option from original plugin (Thanks Bill).

        Option: 'focr_db_hash'
            Updated FuzzyOcr.cf to reflect plugin option.

        Option: 'focr_db_safe'
            Updated FuzzyOcr.cf to reflect plugin option.

        Option: 'focr_counts_required'
            Fixed the default value: it was set to '5' instead of '2'. It now behaves as in the original plugin.

    Removed:
        Option: 'focr_bin_identify'
        Option: 'focr_bin_convert'
            These options are no longer valid, since the external programs are no longer called;
            the Perl module is used instead. Makes things 'simpler'.

        Option: 'focr_bin_gifasm'
        Option: 'focr_bin_tifftopnm'
            These external programs are not used anymore.

    Changed:
        The plugin now uses the Image::Magick module to access ImageMagick functions from Perl instead
        of calling external programs. This makes for fewer system calls to run external programs.
        (Idea from Eric Yiu)
        

version 2.3g

    Added:
        Option: focr_keep_bad_images
            The default value for this option is zero (0).
            When set to 1, the plugin will not remove a tempdir whenever it registers
                an error or timeout from any of the 'helper' apps.
            When set to 2, the plugin will always keep the tempdir. Beware that on heavily
                loaded systems, this might fill your /tmp partition.
        
        Util: fuzzy-cleantmp
            This utility can be used to remove tempdirs left behind if the plugin was 
            configured to save them.  It takes one parameter: hours to keep (12 by default)
            This can safely be placed inside CRON to prune /tmp.

        Util: gif2anim
            This utility (from ImageMagick) extracts images from animated GIFs and
            reports delays and image sizes. Requires identify and convert to work
            (these are already required, so not a problem).
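
The pruning done by fuzzy-cleantmp, described above, could look roughly like this Python sketch. It is not the utility's actual code: the function name, the directory-name prefix used to recognize the plugin's tempdirs, and the use of modification time as the age are all assumptions.

```python
import os
import shutil
import time


def prune_tempdirs(base_dir, prefix, hours=12, now=None):
    """Remove directories under base_dir whose name starts with prefix
    and whose modification time is older than `hours` hours.

    Returns the sorted names of the removed directories.
    """
    now = time.time() if now is None else now
    cutoff = now - hours * 3600
    # Collect candidates first, then delete, so the directory listing
    # is not modified while it is being read.
    with os.scandir(base_dir) as entries:
        stale = [
            entry.path
            for entry in entries
            if entry.name.startswith(prefix)
            and entry.is_dir(follow_symlinks=False)
            and entry.stat(follow_symlinks=False).st_mtime < cutoff
        ]
    for path in stale:
        shutil.rmtree(path)
    return sorted(os.path.basename(path) for path in stale)
```

Matching on a name prefix keeps the sketch from touching unrelated directories in a shared location such as /tmp.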

    Fixed:
        Bug: 'convert'
            An invalid parameter was specified when using 'convert' to assemble animated gifs
            resulting in an error message, and the image was not scanned.

        Bug: 'safe_db'
            When checking for images in safe_db hash, because we score them as zero (0),
            we did not 'short circuit' correctly. This has now been fixed.

        Bug: wrong_ctype
            The wrong index to the Score hash was used, not allowing the 'focr_wrongctype_score'
            parameter to take effect. This has now been fixed.

    Changed:
        known_image_hash
            This procedure was called with two parameters: $digest and $score.
            $digest was not used, so it has been removed. Also, just in the off chance
            that $score is zero, it uses $Score{base} to score the image.

        fuzzyocr_check
            Added code to better determine the name of the attachment. Sometimes, the name
            is hidden in the 'content-id' header of the image/* MIME part, so when no name
            is given and this header is available, we extract it from there. It also makes
            sure that problematic characters are changed, so that they no longer give Perl
            grief or, in rare instances, time out spamd.

            A copy of the original message is now saved in the tempdir created, so that
            when we instruct the plugin to keep the created tempdir, we have a copy of the
            original message to further assist in troubleshooting problems.

            A file is created in tempdir containing all the expanded commands used to
            process the images. This can help to troubleshoot invalid command errors. 

            Removed some debuglog lines to reduce the lines logged.

            Uses gif2anim (if available) to extract images from animated gifs.
            TODO:
                I will try to use the generated anim file to root out animated gif spam where
                the spam message is not in the largest frame, or is in the frame with the
                largest delay, as well as other tricks...

version 2.3f

    Fixed:
        Properly initialized $h and $w to zero so that when getting the height and width
            from an image, if the size parameters cannot be parsed, they can get properly
            tested.
        
        Hashing now works. $digest was getting reset because it went out of scope. grrr.

        $efile was only being replaced for first occurrence in complex scansets generating
            $efile: ambiguous redirect errors.

        Various bugs where 'Use of uninitialized value' warnings were reported.

version 2.3e

    Fixed:
        Option: 'focr_db_safe'
            This option was not included in the @pgm_options array.... oops (thanks UxBoD)

        Score: wrongctype
            This was not used correctly, thus it was not scoring... (thanks Eric)

    Changed:
        It now works with tempfiles only
            This hopefully reduces the need to read/write image data from memory after each
            'filter', and with it the IO and memory usage of the plugin.

        Scanset Syntax: $pfile
            Because of the use of tempfiles, there is a need to specify the image file to be
            used as input. '$pfile' must be used to specify the input filename. Please note
            that in cases where scansets use pipes, only specify $pfile as the input to the
            first 'filter' program.

        Scanset Syntax: $efile
            With every scanset, stderr is redirected to '$efile', which is different for each
            image. When using multiple filters in a scanset, use '$efile' to redirect stderr
            to this file, making sure the plugin will correctly recognize an error when it
            occurs.
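
As an illustration of the scanset syntax above, the following Python sketch (the plugin itself is Perl) expands both placeholders in a command line. The function name, the filter names filter_a and filter_b, and the paths are hypothetical. Every occurrence is substituted: if a literal $efile were left in the command, the shell would see an unset variable in the redirection and report an ambiguous redirect.

```python
import shlex


def expand_scanset(command, pfile, efile):
    """Substitute every $pfile and $efile placeholder in a scanset command."""
    # str.replace substitutes all occurrences, not just the first one.
    command = command.replace("$pfile", shlex.quote(pfile))
    return command.replace("$efile", shlex.quote(efile))


# Hypothetical scanset with two filters; only the first one reads $pfile.
scanset = "filter_a $pfile 2>>$efile | filter_b 2>>$efile"
print(expand_scanset(scanset, "/tmp/work/img.pnm", "/tmp/work/img.err"))
```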
            

version 2.3d

    Require:
        Plugin officially requires SA 3.1.4 or higher
        New Perl Modules
            DB_File
            Storable
            MLDBM
        Previous
            String::Approx

    Removed:
        Option: 'focr_pre314'
            Not used as it now requires SA 3.1.4

    Added:
        Option: 'focr_path_bin'
            Its value is treated as a search path for the @bin_utils programs, potentially
                requiring fewer configuration options;
            Directories in the path that don't exist are skipped;
            Default value: /usr/local/netpbm/bin:/usr/local/bin:/usr/bin

        Option: 'focr_db_hash'
            Its value holds the filename to use for storing hash database; See below.
            Default value: /etc/mail/spamassassin/FuzzyOcr.db

        Option: 'focr_db_safe'
            Its value holds the filename to use for storing hash database; See below.
            Default value: /etc/mail/spamassassin/FuzzyOcr.safe.db

        Option: 'focr_db_max_days'
            Its value holds the number of days after which unmatched records in the hash database expire; See below.
            Default value: 35

        Option: 'focr_keep_bad_images'
            If this is set to 1, then this plugin will not remove the temporary image
                directory created where the images are stored and processed if it 
                determines that the image was corrupt, or an error occurred with any
                of the auxiliary programs that process the images. Useful while
                debugging.
            Default value: 0
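
The 'focr_path_bin' lookup described above might be pictured as follows. This is a Python sketch under assumptions, not the plugin's Perl code: the function name is hypothetical, and the helper program names passed in stand for whatever @bin_utils contains.

```python
import os


def find_binaries(names, search_path):
    """Map each program name to the first executable found on search_path.

    Directories that do not exist are skipped; names that are not found
    are left out of the result.
    """
    dirs = [d for d in search_path.split(":") if d and os.path.isdir(d)]
    found = {}
    for name in names:
        for directory in dirs:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                found[name] = candidate
                break
    return found
```

Called with the default value, for example find_binaries(names, "/usr/local/netpbm/bin:/usr/local/bin:/usr/bin"), a missing /usr/local/netpbm/bin would simply be ignored.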
            

    Changed:
        Option: 'focr_logfile'
            Defaults to 'stderr' so that logging goes there
        Option: 'focr_enable_image_hashing' if set to 2:
            Use MLDBM to store Hash info in true DB file for faster access.
            Stores hashes of images that exceed set thresholds in file
                specified by option focr_db_hash
            Stores hashes of 'clean' images (without matching words)
                specified by option focr_db_safe to also cache good images.
            Keeps statistics of Hash-Hits and displays #times matched in log.
            Saves name of attachment and content/type as reference
            Automatically imports known-hashes from focr_digest_db into focr_db_hash
            Automatically expire 'old' records if not matched in more than
                the number of days specified in option 'focr_db_max_days'
        Instead of having a 'global' timeout, the 'focr_timeout' is used per
            external program. This ensures that no timeouts are recorded
            because of complex scansets, or because of temporary spikes
            in load. Also, it now displays the name and return code information
            for the binary that timed out, making it easier to debug problems.
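
A per-program timeout of the kind described above can be sketched in Python as follows. The plugin itself uses Perl and Mail::SpamAssassin::Timeout; this version uses Python's subprocess module instead, the function name is hypothetical, and a timed-out program is reported by name only.

```python
import subprocess


def run_with_timeout(argv, timeout_seconds):
    """Run one helper program with its own timeout.

    Returns (name, status, returncode), where status is 'ok', 'failed',
    or 'timeout'; returncode is None when the program timed out.
    """
    name = argv[0]
    try:
        proc = subprocess.run(argv, capture_output=True, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return (name, "timeout", None)
    status = "ok" if proc.returncode == 0 else "failed"
    return (name, status, proc.returncode)
```

Because each stage gets its own time budget, a long chain of filters does not exhaust a single shared limit.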

    Fixed:
        A bug where option focr_counts_required was not recognized;
        Logging to file when option 'focr_logfile' set now works;
        Individual word scores are now applied correctly
        Storing only images with matched words to hash database (Thanks to Robert LeBlanc)
        Explicitly use Mail::SpamAssassin::Timeout (Thanks Eric Yiu)
        Ignores empty lines in wordlists (global and local)
        Ignores comments starting with (#) to EOL
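
The two wordlist fixes above (empty lines and '#' comments) amount to parsing along these lines. This is a Python sketch rather than the plugin's Perl code; the function name is hypothetical, and each remaining entry is returned unchanged without interpreting anything else on the line.

```python
def parse_wordlist(text):
    """Return the entries of a wordlist, ignoring blank lines and # comments."""
    entries = []
    for line in text.splitlines():
        # Drop everything from '#' to the end of the line, then trim.
        line = line.split("#", 1)[0].strip()
        if line:
            entries.append(line)
    return entries
```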

version 2.3c

    Require:
        Plugin officially requires SA 3.1.1 or higher
    
    Added:
        Support for BMP/TIFF Images

    Changed:
        Major internal restructuring
        Use SpamAssassin Logging Facility instead of own logfile

    Fixed:
        A bug related to database hashing

Updated: Nov 2, 2006