Trouble fixing a broken ASCII string - "replace" mode in codec not working.

Tue Feb 6 15:05:45 EST 2007

John Nagle wrote:
>     I'm trying to clean up a bad ASCII string, one read from a
> web page that is supposedly in the ASCII character set but has some
> characters above 127.  And I get this:
> 
>   File "D:\projects\sitetruth\InfoSitePage.py", line 285, in httpfetch
>      sitetext = sitetext.encode('ascii','replace')  # force to clean ASCII
> 
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 29151: 
> ordinal not in range(128)
> 
>     Why is that exception being raised when the codec was told 'replace'?

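The .encode('ascii') method converts unicode strings to str strings. Since you
gave it a str string, Python first converted it to a unicode string using the
default codec ('ascii') in strict mode, just as if you had written
unicode(sitetext).encode('ascii', 'replace'). The 'replace' argument applies
only to the encode step, so the implicit decode fails on byte 0x92 before the
encode ever runs. That is why the exception is a UnicodeDecodeError.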
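
To make the hidden step visible, here is a minimal sketch that performs the
strict ASCII decode explicitly. The sample bytes and the helper name
strict_ascii_decode_error are made up for illustration; they are not from the
original code. It is written with a b"" literal and "except ... as" so that it
also runs on later Python versions.

```python
# Hypothetical sample data: 0x92 is not a valid ASCII byte.
raw = b"It\x92s broken"

def strict_ascii_decode_error(data):
    """Return the UnicodeDecodeError from a strict ASCII decode, or None."""
    try:
        # This is the step that str.encode() performs implicitly on a str.
        data.decode('ascii')
    except UnicodeDecodeError as exc:
        return exc
    return None

err = strict_ascii_decode_error(raw)
if err is not None:
    # err.start is the offset of the first offending byte.
    print("decode failed at position %d" % err.start)
```

The exception carries the same kind of position information shown in the
traceback above.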

I think you want something like this:

  sitetext = sitetext.decode('ascii', 'replace').encode('ascii', 'replace')
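
As a rough, self-contained sketch of the same round trip: the function name
force_ascii and the sample bytes are assumptions for illustration only. The
decode step turns each non-ASCII byte into U+FFFD, and the encode step then
turns each U+FFFD into '?'.

```python
def force_ascii(data):
    """Replace every non-ASCII byte in a byte string with '?'."""
    # Step 1: bytes -> unicode; undecodable bytes become U+FFFD.
    text = data.decode('ascii', 'replace')
    # Step 2: unicode -> bytes; unencodable characters become '?'.
    return text.encode('ascii', 'replace')

# Hypothetical sample data containing one byte above 127.
cleaned = force_ascii(b"It\x92s broken")
print(cleaned)
```

Input that is already pure ASCII should pass through unchanged.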

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco