[Spambayes] Windows compatibility - OCR [was: Unwanted stock solicitations]
Vibe Grevsen
grevsen at gmail.com
Sun Nov 5 23:16:08 CET 2006
Hi there :)
> Vibe> 4. Change this
>
> Vibe> for line in open(orf):
> Vibe>     if line.startswith("lines"):
> Vibe>         nlines = int(line.split()[1])
> Vibe>         if nlines:
> Vibe>             ctokens.add("image-text-lines:%d" %
> Vibe>                         int(log2(nlines)))
>
>
> Vibe> into this
>
> Vibe> nlines = ctext.count('\n')
> Vibe> if nlines:
> Vibe>     ctokens.add("image-text-lines:%d" %
> Vibe>                 nlines)
> Not the same:
...
> Note that the out.txt file suggests there is only one line in the file while
> the actual file contains two. It appears that's simply an off-by-one issue
> (maybe ocrad always adds a blank line to the end of its output text), though
> I've only looked at the above case and one other.
You're right, it is simply an off-by-one. I tested on five images, and this gives the matching count:
    nlines = ctext.count('\n') - 1
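A minimal sketch of that adjusted count, assuming (as suggested above) that the OCR output ends with one extra blank line. The helper name count_text_lines and the sample string are made up for illustration:

```python
def count_text_lines(ctext):
    # Assumption: the OCR output carries one extra trailing blank line,
    # so the raw newline count is one too high. Never go below zero.
    return max(ctext.count('\n') - 1, 0)

# Invented sample: two lines of text followed by one blank line.
sample = "CHEAP STOCK\nBUY NOW\n\n"
print(count_text_lines(sample))
```

The max() guard only matters for empty output, where subtracting one would otherwise give -1.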
I also noticed that the reported line count often differs from the perceived one
(i.e. the number of lines you would estimate by looking at the image).
Python has regular expressions (the re module), so we could strip empty lines from the output before counting.
That may be a good idea, but I suspect the difference is not significant.
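A rough sketch of counting only the non-empty lines with a regular expression. The helper name count_nonblank_lines is hypothetical, and treating whitespace-only lines as empty is my assumption:

```python
import re

def count_nonblank_lines(ctext):
    # In MULTILINE mode ^ and $ match at every line boundary, and "."
    # never crosses a newline, so each match is one line that contains
    # at least one non-whitespace character.
    return len(re.findall(r'^.*\S.*$', ctext, flags=re.MULTILINE))

# Invented sample: two text lines, one empty line, one whitespace-only line.
sample = "CHEAP STOCK\n\n   \nBUY NOW\n"
print(count_nonblank_lines(sample))
```

This also copes with Windows line endings, since a stray '\r' counts as whitespace. The same count can be had without a regexp: sum(1 for line in ctext.splitlines() if line.strip()).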
Happy coding :)
Vibe