encoding ascii data for xml

Tino Wildenhain tino at wildenhain.de
Sat Oct 4 02:36:28 EDT 2008


harrelson wrote:
> I have a large amount of data in a postgresql database with the
> encoding of SQL_ASCII.  Most recent data is UTF-8 but data from
> several years ago could be of some unknown other data type.  Being
> honest with myself, I am not even sure that the most recent data is
> always UTF-8-- data is entered on web forms and I wouldn't be
> surprised if data of other encodings is slipping in.

First, I would strongly recommend cleaning up the database and converting
everything to UTF-8, then re-running initdb on the cluster with a proper
UTF-8 locale and the database encoding "unicode" (PostgreSQL's older name
for UTF8), and then cleanly restoring the data. That way the database can
enforce the correct encoding on future inserts, and you only have to do
the cleanup once, not every time your XML interface gets used.
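
For the cleanup itself, a minimal sketch of the usual "try UTF-8 first,
then fall back" approach (Python 3 syntax). The helper name to_utf8_text
and the fallback encodings cp1252 and latin-1 are assumptions for
illustration only; which legacy encodings actually occur in your data is
something you would have to find out by inspecting it.

```python
def to_utf8_text(raw, fallbacks=("cp1252", "latin-1")):
    """Decode bytes of unknown encoding into text (hypothetical helper).

    Valid UTF-8 is accepted as-is. Anything else is decoded with the first
    fallback encoding that accepts it. The fallbacks are a guess, not
    something the database can tell you.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    for encoding in fallbacks:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached if every fallback rejected the bytes.
    return raw.decode("utf-8", errors="replace")


# Bytes that are already UTF-8 pass through unchanged.
utf8_value = to_utf8_text(b"caf\xc3\xa9")
# A lone 0xE9 is not valid UTF-8, so the cp1252 fallback is used.
legacy_value = to_utf8_text(b"caf\xe9")
```

Both calls yield the same text, which can then be written back as UTF-8.
Rows that only decode via a fallback are worth logging, since the guess
may be wrong.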

...
> 
> import xml.dom.minidom
> print chr(3).encode('utf-8')
> dom = xml.dom.minidom.parseString( "<test>%s</test>" %
> chr(3).encode('utf-8') )

> chr(3) is the ascii character for "end of line".  I would think that
> trying to encode this to utf-8 would fail but it doesn't-- I don't get

Nope, ASCII (ord(x) < 128) is a subset of UTF-8, so U+0003 is a perfectly
valid code point and encodes to the single byte 0x03. (By the way, chr(3)
is ETX, "end of text", not end of line.)
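
A quick check of this, written in Python 3 syntax (in Python 2, chr(3) is
already a byte string, but the resulting byte is the same):

```python
# U+0003 is a C0 control character; encoding it to UTF-8 succeeds.
ctrl = chr(3)
encoded = ctrl.encode("utf-8")

# Every ASCII character encodes to the identical single byte in UTF-8.
ascii_text = "".join(chr(i) for i in range(128))
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

print(repr(encoded), same_bytes)
```

So encode() has nothing to complain about; it only maps code points to
bytes and knows nothing about XML.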

> a failure till we get into xml land and the parser complains.  My
> question is why doesn't encode() blow up?  It seems to me that
> encode() shouldn't output anything that parseString() can't handle.

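It just can't be put literally into XML - that is a separate step.
Note that for a control character like this one a character reference
does not help either: XML 1.0 rejects &#3; just as it rejects the raw
character (only XML 1.1 allows the reference). You basically need to strip
or replace such characters before building the document, or transport the
value in an encoded form such as base64. Ordinary markup characters like
& and < are the ones your XML library can escape for you.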

It seems a little googling turns up this one, which might be helpful:

http://www.xml.com/pub/a/2002/11/13/py-xml.html

Regards
Tino