UnicodeError with OCR Text

Gilles Lenfant glenfant at NOSPAM.bigfoot.com
Fri May 23 15:55:09 EDT 2003


"Paradox" <JoeyTaj at netzero.com> wrote in message news:
924a9f9c.0305230909.4e6c293d at posting.google.com...
> I am extracting OCR data from SQL Server Text field through ADO and
> putting it into a string called fileContent. For some reason it thinks
> that every record is a UNICODE string which it is not. For most
> records the following line of code will work to get it back to normal
> thinking but eventually it will throw a unicode error on one of the
> records.
>
> fullText = fullText + fileContent.encode('ascii') + '\n'
> UnicodeError: ASCII encoding error: ordinal not in range(128)
>
> I think I isolated it to the degree character "º" HEX is BA, ASCII is
> 186.
>
> The fileContent will not even print to the output screen. But what is
> strange is this code
>
> test = '125º'
> print test # This prints well enough
> test = test.encode('ascii') #This throws a UnicodeError exception.
>
> Any workarounds to this problem would be appreciated.
>
>                                    Thanks Joey

"°" is an ISO-8859-1 character, not ASCII. ASCII only covers the values
0 to 127.

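Try using xxx.encode('iso-8859-1') or xxx.encode('utf-8').
Both are ASCII compatible: characters 0 to 127 come out as the same single
bytes as in ASCII. ISO-8859-1 can still raise UnicodeError for characters
above U+00FF. UTF-8 can encode every Unicode character, so it should
(crossing fingers) never raise UnicodeError.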
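
A minimal sketch of the difference between the three codecs, assuming
Python 3 (where every str is Unicode text) and using the '125º' string
from the question. The variable names are only for illustration:

```python
# Sketch for Python 3, where str is always Unicode text.
text = "125\u00ba"  # '125º' from the question; U+00BA is byte 0xBA in ISO-8859-1

latin1_bytes = text.encode("iso-8859-1")  # one byte per character
utf8_bytes = text.encode("utf-8")         # two bytes for the non-ASCII character

# ASCII has no code for U+00BA, so this is the call that fails.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(latin1_bytes)  # b'125\xba'
print(utf8_bytes)    # b'125\xc2\xba'
print(ascii_failed)  # True
```

Which codec to pick depends on what will read the bytes afterwards:
ISO-8859-1 keeps one byte per character, UTF-8 covers all of Unicode.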

Another solution: tune the OCR app so that it reads only ASCII characters
and discards chars > 127.
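
If the OCR app cannot be tuned, the same filtering can be approximated on
the Python side after the text has been read. This is a sketch assuming
Python 3. The to_ascii helper is hypothetical, not part of any library; it
relies on the standard 'ignore' and 'replace' error handlers of
str.encode:

```python
def to_ascii(text, errors="ignore"):
    """Return text with non-ASCII characters dropped ('ignore')
    or turned into '?' ('replace'). Hypothetical helper."""
    return text.encode("ascii", errors).decode("ascii")

sample = "125\u00ba"  # '125º' from the question

dropped = to_ascii(sample)            # non-ASCII characters removed
marked = to_ascii(sample, "replace")  # non-ASCII characters become '?'

print(dropped)  # 125
print(marked)   # 125?
```

Note that this loses information: once the character is dropped, there is
no way to tell that the OCR text ever contained it.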

HTH

--Gilles




