encoding ascii data for xml

Sat Oct 4 04:51:23 EDT 2008

On Oct 4, 7:41 am, harrelson <harrel... at gmail.com> wrote:
> I have a large amount of data in a postgresql database with the
> encoding of SQL_ASCII.  Most recent data is UTF-8 but data from
> several years ago could be of some unknown other data type.  Being
> honest with myself, I am not even sure that the most recent data is
> always UTF-8-- data is entered on web forms and I wouldn't be
> surprised if data of other encodings is slipping in.
>
> Up to the point I have just ignored the problem-- on the web side of
> things everything works good enough.  But now I am required to stuff
> this data into xml datasets and I am, of course, having problems.  My
> preference would be to force the data into UTF-8 even if it is
> ultimately an incorrect encoding translation but this isn't working.
> The below code represents my most recent problem:
>
> import xml.dom.minidom
> print chr(3).encode('utf-8')
> dom = xml.dom.minidom.parseString( "<test>%s</test>" %
> chr(3).encode('utf-8') )
>
> chr(3) is the ascii character for "end of line".  I would think that
> trying to encode this to utf-8 would fail but it doesn't-- I don't get
> a failure till we get into xml land and the parser complains.  My
> question is why doesn't encode() blow up?  It seems to me that
> encode() shouldn't output anything that parseString() can't handle.

The encode method is doing its job, which is to encode ANY and EVERY
unicode character as utf-8, so that it can be transported reliably
over an 8-bit-wide channel. encode is *not* supposed to guess what you
are going to do with the output.
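
A minimal sketch of that split in responsibilities, written in Python 3 syntax (an assumption; the quoted code is Python 2). The names `encoded` and `parsed` are only for illustration. The encode step succeeds, and it is the XML parser that rejects the result:

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Encoding U+0003 works: UTF-8 can represent every Unicode character.
encoded = "\x03".encode("utf-8")
print(repr(encoded))

# The XML parser is the component that refuses the control character.
try:
    xml.dom.minidom.parseString(b"<test>" + encoded + b"</test>")
    parsed = True
except ExpatError as exc:
    parsed = False
    print("parser rejected the document:", exc)
```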

Perhaps instead of "forcing the data into utf-8", you should be
thinking about what is actually in your data. What is the context that
chr(3) appears in? Perhaps when you get around to running print
repr(some_data), you might see things like "\x03harlie \x03haplin" --
it's a common enough keyboarding error to hit the Ctrl key instead of
the Shift key, and unfortunately a common enough design error for there
to be no input checking at all.
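
One way to look for such characters might be something like the following sketch (Python 3 syntax; `find_control_chars` is a hypothetical helper, and the pattern covers only the C0 control characters that XML 1.0 disallows, not every character the spec excludes):

```python
import re

# C0 control characters other than tab, LF and CR.
# XML 1.0 does not allow these in a document.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def find_control_chars(text):
    """Return (offset, character) pairs for each control character found."""
    return [(m.start(), m.group()) for m in _CONTROL_CHARS.finditer(text)]

some_data = "\x03harlie \x03haplin"
print(repr(some_data))
for offset, char in find_control_chars(some_data):
    print("offset %d: %r" % (offset, char))
```

What to do with the hits (drop them, or repair them by hand as with Ctrl-C typed for Shift-C) depends on what the surrounding data turns out to be.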

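BTW, there's no forcing involved -- chr(3) (ETX, "end of text", not "end
of line") is plain ASCII, so it is *already* valid utf-8. The parser
complains because XML 1.0 does not allow that control character, not
because of its encoding.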
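
A quick check of that, again in Python 3 syntax (an assumption) with illustrative variable names: every code point below 128 encodes to the same single byte in ASCII and in UTF-8.

```python
# ASCII and UTF-8 agree byte-for-byte on code points 0..127.
same_bytes = all(
    chr(i).encode("ascii") == chr(i).encode("utf-8") for i in range(128)
)
print(same_bytes)

# So the byte 0x03 decodes cleanly as UTF-8.
round_trip = b"\x03".decode("utf-8")
print(repr(round_trip))
```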

HTH,
John