Trouble fixing a broken ASCII string - "replace" mode in codec not working.

Tue Feb 6 21:14:01 EST 2007

On 2007-02-06, Robert Kern <robert.kern at gmail.com> wrote:
> John Nagle wrote:
>>   File "D:\projects\sitetruth\InfoSitePage.py", line 285, in httpfetch
>>      sitetext = sitetext.encode('ascii','replace')  # force to clean ASCII
>> 
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in
>> position 29151: ordinal not in range(128)
>> 
>>     Why is that exception being raised when the codec was told 'replace'?
>
> The .encode('ascii') takes unicode strings to str strings.
> Since you gave it a str string, it first tried to convert it to
> a unicode string using the default codec ('ascii'), just as if
> you were to have done unicode(sitetext).encode('ascii',
> 'replace').
>
> I think you want something like this:
>
>   sitetext = sitetext.decode('ascii', 'replace').encode('ascii', 'replace')
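
For reference, here is a small sketch of how that two-step version behaves. The sample bytes are made up for illustration; 0x92 is the byte from the traceback. The decode step is where the bad byte gets replaced, and the encode step then turns the replacement character into '?':

```python
# Made-up sample data containing the offending byte 0x92.
sample = b'it\x92s'

# Decoding with 'replace' turns the undecodable byte into U+FFFD.
as_text = sample.decode('ascii', 'replace')

# Encoding with 'replace' turns U+FFFD (not ASCII) into '?'.
cleaned = as_text.encode('ascii', 'replace')
```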

This is the cue for the translate method, which will be much
faster and simpler for cases like this. You can build the
translation table yourself, or use maketrans.

>>> import string
>>> asciitable = string.maketrans(''.join(chr(a) for a in xrange(128, 256)),
...    '?'*128)

You'd only want to do that once. Then, to replace the non-ASCII
bytes with '?':

sitetext = sitetext.translate(asciitable)
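
If you'd rather build the table yourself, something like the following sketch should do. The names ASCII_TABLE and force_ascii are just made up for illustration, and the sample string is invented; the table is a 256-byte string in which the low half maps to itself and the high half maps to '?':

```python
# Build a 256-byte table: bytes 0-127 map to themselves, 128-255 map to '?'.
# ASCII_TABLE and force_ascii are illustrative names, not from the original code.
ASCII_TABLE = bytes(bytearray(b if b < 128 else ord('?') for b in range(256)))

def force_ascii(data):
    """Return data (a byte string) with every non-ASCII byte replaced by '?'."""
    return data.translate(ASCII_TABLE)

# Made-up sample; 0x92 is the byte from the traceback.
sample = b'it\x92s a test'
cleaned = force_ascii(sample)
```

Building the table once at import time keeps the per-page cost down to a single translate call.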

I used a similar solution in an application I'm working on that
must use a Latin-1 byte encoding internally, but displays on
stdout in ASCII.

-- 
Neil Cerutti