Trouble fixing a broken ASCII string - "replace" mode in codec not working.
Neil Cerutti
horpner at yahoo.com
Tue Feb 6 21:14:01 EST 2007
On 2007-02-06, Robert Kern <robert.kern at gmail.com> wrote:
> John Nagle wrote:
>> File "D:\projects\sitetruth\InfoSitePage.py", line 285, in httpfetch
>> sitetext = sitetext.encode('ascii','replace') # force to clean ASCII
>>
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in
>> position 29151: ordinal not in range(128)
>>
>> Why is that exception being raised when the codec was told 'replace'?
>
> The .encode('ascii') takes unicode strings to str strings.
> Since you gave it a str string, it first tried to convert it to
> a unicode string using the default codec ('ascii'), just as if
> you were to have done unicode(sitetext).encode('ascii',
> 'replace').
>
> I think you want something like this:
>
> sitetext = sitetext.decode('ascii', 'replace').encode('ascii', 'replace')
This is the cue for the translate method, which will be much
faster and simpler for cases like this. You can build the
translation table yourself, or use maketrans.
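>>> import string
>>> # Map each of the 128 non-ASCII byte values (128..255) to '?'.
>>> # maketrans requires both arguments to have the same length.
>>> asciitable = string.maketrans(''.join(chr(a) for a in xrange(128, 256)),
...                               '?'*128)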
You'd only want to build the table once. Then, to replace the
non-ASCII bytes with '?':
sitetext = sitetext.translate(asciitable)
I used a similar solution in an application I'm working on that
must use a Latin-1 byte encoding internally, but displays on
stdout in ASCII.
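
For anyone reading this on Python 3, where byte strings no longer
have an .encode method, the same table idea might look roughly
like the sketch below. It assumes the page text is held as bytes;
NON_ASCII, ASCII_TABLE and force_ascii are names made up for the
illustration, not anything from the original code.

```python
# Python 3 sketch of the translation-table approach (illustrative names).

# The 128 byte values that fall outside the ASCII range.
NON_ASCII = bytes(range(128, 256))

# Build the table once; both arguments must have the same length.
ASCII_TABLE = bytes.maketrans(NON_ASCII, b"?" * len(NON_ASCII))


def force_ascii(data: bytes) -> bytes:
    """Return a copy of data with every byte >= 0x80 replaced by b'?'."""
    return data.translate(ASCII_TABLE)


sample = b"It\x92s a caf\xe9"
print(force_ascii(sample))  # b'It?s a caf?'
```

For the ascii codec this should give the same bytes as
data.decode('ascii', 'replace').encode('ascii', 'replace'), since
each invalid byte turns into one replacement character and then
into one '?'.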
--
Neil Cerutti