SAX unicode and ascii parsing problem

Ulrich Eckhardt ulrich.eckhardt at dominolaser.com
Wed Dec 1 03:57:54 EST 2010


goldtech wrote:
> I tried this but nothing changed, I thought this might convert it and
> then I'd parse the new file - didn't work:
> 
> uc = open(r'E:\sc\ppb4.xml').read().decode('utf8')
> ascii = uc.decode('ascii')
> mex9 = open( r'E:\scrapes\ppb5.xml', 'w' )
> mex9.write(ascii)

This doesn't work either. decode() converts bytes into (Unicode)
characters, and after the first decode('utf8') you already have those, so
calling decode('ascii') on the result makes no sense. If you want ASCII, as
the assignee suggests, you need to _encode_ the string. Be aware that not
all characters can be represented in ASCII, though, and the presence of such
a character seems to have caused your initial problem.
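
A rough sketch of the decode-then-encode order, in case it helps. This is
only an illustration, not a drop-in fix: the helper name
utf8_bytes_to_ascii and the sample string are made up, I'm assuming a
current Python where bytes and text are separate types, and the choice of
the 'xmlcharrefreplace' error handler is my assumption about what you'd
want for XML output.

```python
def utf8_bytes_to_ascii(raw, errors="xmlcharrefreplace"):
    """Decode UTF-8 bytes to text, then encode that text as ASCII.

    Hypothetical helper for illustration. With the default handler,
    characters that ASCII cannot represent become numeric character
    references (e.g. &#228;), which an XML parser reads back correctly.
    With errors="strict" such a character raises UnicodeEncodeError.
    """
    text = raw.decode("utf-8")          # bytes -> Unicode characters
    return text.encode("ascii", errors)  # Unicode characters -> ASCII bytes


# Made-up sample data containing two non-ASCII characters.
sample = u"Gesch\u00e4ftsf\u00fchrer".encode("utf-8")
converted = utf8_bytes_to_ascii(sample)
print(converted)
```

Passing errors="strict" instead makes the non-ASCII characters raise an
exception, which is probably the kind of failure you ran into at first.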

BTW:
- XML is not necessarily UTF-8, but that's a different issue.
- I would suggest you open files with 'rb' or 'wb' in order to suppress any
conversion of line endings. Writing UTF-16 in particular would fail if that
conversion were active.
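
To illustrate the second point, here is a small sketch. The file name and
sample text are made up, and the replace() call is only my simulation of
what a byte-level "\n" to "\r\n" translation would do to the data; it is
not how any particular platform implements text mode.

```python
import os
import tempfile

text = u"line one\nline two\n"
payload = text.encode("utf-16")  # BOM followed by two bytes per character

# Binary mode writes and reads the bytes exactly as given.
path = os.path.join(tempfile.mkdtemp(), "sample.xml")  # hypothetical file
with open(path, "wb") as out:
    out.write(payload)
with open(path, "rb") as src:
    round_tripped = src.read()
binary_ok = round_tripped.decode("utf-16") == text

# Simulated line-ending conversion: one extra byte is inserted before each
# 0x0A byte, which shifts the two-byte units that follow out of alignment.
mangled = payload.replace(b"\n", b"\r\n")
try:
    mangled_ok = mangled.decode("utf-16") == text
except UnicodeDecodeError:
    mangled_ok = False

print(binary_ok, mangled_ok)
```

The binary round trip gives back the original text, while the simulated
conversion does not survive decoding.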

Good luck!

Uli

-- 
Domino Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932



