[Spambayes] Windows compatibility - OCR [was: Unwanted stock solicitations]

Vibe Grevsen grevsen at gmail.com
Sun Nov 5 23:16:08 CET 2006


Hi there :)


>    Vibe> 4. Change this
> 
>    Vibe>    for line in open(orf):
>    Vibe>        if line.startswith("lines"):
>    Vibe>            nlines = int(line.split()[1])
>    Vibe>            if nlines:
>    Vibe>                ctokens.add("image-text-lines:%d" %
>    Vibe>                            int(log2(nlines)))
> 
> 
>    Vibe> into this
> 
>    Vibe>    nlines = ctext.count('\n')
>    Vibe>    if nlines:
>    Vibe>        ctokens.add("image-text-lines:%d" %
>    Vibe>                    nlines )
 

> Not the same:
...
> Note that the out.txt file suggests there is only one line in the file while
> the actual file contains two.  It appears that's simply an off-by-one issue
> (maybe ocrad always adds a blank line to the end of its output text), though
> I've only looked at the above case and one other.

You're right, it is simply an off-by-one. I tested this on five images:

    nlines = ctext.count('\n') - 1
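
One thing to watch with the subtraction: if ctext is empty, the count becomes -1, which is still "true" in the if test, so a token would be added for an image with no text at all. A minimal sketch of a guarded version follows. It assumes the OCR text always carries exactly one extra trailing newline, and the helper name image_line_tokens is made up for illustration (it is not in the SpamBayes source).

```python
def image_line_tokens(ctext):
    """Return line-count tokens for OCR text (hypothetical helper)."""
    ctokens = set()
    # Assumption: the OCR output ends with one extra newline, so subtract one.
    # Clamp at zero so that empty text does not produce a "-1" token.
    nlines = max(ctext.count('\n') - 1, 0)
    if nlines:
        ctokens.add("image-text-lines:%d" % nlines)
    return ctokens
```

With this, image_line_tokens("") returns an empty set instead of a set holding "image-text-lines:-1".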

I also noticed that the counted number of lines often differs from the perceived line count
(i.e. what you would estimate by looking at the image).
Python supports regexps (the re module), so we could strip empty lines from the output before counting.
It may be a good idea, but I suspect it would not make a significant difference.
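
Counting only the non-blank lines could look roughly like the sketch below. It uses the standard re module; the function name count_nonblank_lines is hypothetical, and treating whitespace-only lines as empty is my assumption about what "empty" should mean here.

```python
import re

# Matches any line that contains at least one non-whitespace character.
NONBLANK_LINE = re.compile(r'^.*\S.*$', re.MULTILINE)

def count_nonblank_lines(ctext):
    """Count lines in the OCR text that are not empty or whitespace-only."""
    return len(NONBLANK_LINE.findall(ctext))
```

For example, count_nonblank_lines("first\n\n  \nsecond\n\n") gives 2, so a trailing blank line no longer matters and the "- 1" correction would not be needed.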



Happy coding :)

Vibe