[Python-3000] str/unicode tests: pyexpat.c and read(n)

Talin talin at acm.org
Sat Jul 21 19:36:05 CEST 2007


James Y Knight wrote:
> On Jul 21, 2007, at 12:25 AM, Fred L. Drake, Jr. wrote:
> 
>> On Saturday 21 July 2007, Joe Gregorio wrote:
>>> Should xml.parsers.expat.XMLParser.ParseFile(file) operate on
>>> both text and binary streams?
>> No.  XML is a serialization of a markup language containing Unicode
>> characters into an encoded stream.
> 
> Well...there's many reasons why it is useful to be able to parse an  
> already-decoded unicode stream into XML, and to serialize XML into a  
> unicode string. For example, if combining into a larger unicode  
> document, or parsing from a literal string in the source code.
> 
> Sure, normally XML is serialized to bytes, but it is also  
> serializable to unicode, and that's a useful feature to have (if  
> implementable).

The general use case for XML is reading or writing a document, where 
"document" means a bytestream from either a file or a socket.

The question is whether it would also be useful to parse Python strings 
that contain XML markup, or format an XML document into a Python string.

Some care needs to be taken here, because XML has its own way of 
specifying the character encoding. For example, suppose I have a Python 
string that contains the characters:

    '<?xml version="1.0" encoding="utf-8" ?>'

Well, the problem with this is that the encoding *isn't* UTF-8. A Python 
3000 string is a sequence of Unicode characters, stored internally as 
UTF-16 (or UCS-4, depending on how the interpreter was built), although 
Python generally hides that fact from you so most of the time you don't 
have to care.

Suppose then that you write this string out to a file (perhaps after 
combining it with other strings). If you happen to write the file as 
UTF-8, then everything is fine, but if you pick some other encoding that 
doesn't match the encoding attribute in the XML declaration, the 
declaration no longer describes the bytes in the file, and a parser that 
trusts it may misread or reject the document.
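
To make the mismatch concrete, here is a minimal sketch. The names DOC 
and parses_cleanly are made up for illustration, and Latin-1 is just one 
arbitrary example of a "wrong" encoding. The same text is encoded twice, 
once as the declaration promises and once not, and handed to expat as 
bytes:

```python
import xml.parsers.expat

# A document whose declaration promises UTF-8 and whose body contains a
# non-ASCII character (e with acute accent).
DOC = '<?xml version="1.0" encoding="utf-8" ?><greeting>caf\u00e9</greeting>'


def parses_cleanly(data):
    """Return True if expat accepts the given bytes as a complete document."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError:
        return False
    return True


# The bytes match the declaration, so expat is happy.
matching = parses_cleanly(DOC.encode("utf-8"))

# The bytes are Latin-1 but the declaration still says UTF-8, so the
# lone 0xE9 byte is not valid UTF-8 and the parse fails.
mismatched = parses_cleanly(DOC.encode("latin-1"))
```

With an ASCII-only body the mismatch would go unnoticed, which is part 
of why this kind of bug tends to show up late.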

This matters because there are lots of people who write XML documents 
with print statements (and many of them forget to handle things like 
escaping special characters such as & and <).
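
As a small illustration of the escaping problem, here is a sketch using 
xml.sax.saxutils.escape from the standard library. The variable names 
and the sample title are invented for the example:

```python
from xml.sax.saxutils import escape

title = "Tom & Jerry <uncut>"

# Naive string pasting: the raw & and < end up in the markup, so the
# result is not well-formed XML.
naive = "<title>" + title + "</title>"

# Escaping the text first replaces &, < and > with entity references.
safe = "<title>" + escape(title) + "</title>"
```

The same module also provides quoteattr() for attribute values, which 
need their quote characters handled as well.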

This also matters because the Python XML parsing libraries are mostly 
based on expat, which is C code that doesn't have any special knowledge 
of Python strings - it only works on the encodings that it can detect, 
or which you tell it to use.

So if you wanted to directly parse a Python string as XML, you would 
probably have to treat it as a byte array and override the encoding 
detection, telling it explicitly to use UTF-16.
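
A rough sketch of that approach, under the assumption that the string is 
first encoded to UTF-16 bytes by hand. The helper name 
parse_text_as_utf16 is hypothetical. The encoding argument to 
ParserCreate is documented as overriding whatever encoding the document 
itself declares:

```python
import xml.parsers.expat


def parse_text_as_utf16(text):
    """Parse a Python string by handing expat UTF-16 bytes.

    Returns the concatenated character data, just to show that the
    text survived the round trip.
    """
    collected = []
    # Tell expat which encoding to use instead of letting it trust the
    # encoding="..." attribute in the XML declaration.
    parser = xml.parsers.expat.ParserCreate(encoding="utf-16")
    # Character data may arrive in several pieces, so collect and join.
    parser.CharacterDataHandler = collected.append
    parser.Parse(text.encode("utf-16"), True)
    return "".join(collected)


# The declaration claims UTF-8, but the override makes that irrelevant.
doc = '<?xml version="1.0" encoding="utf-8" ?><greeting>caf\u00e9</greeting>'
result = parse_text_as_utf16(doc)
```

The declaration is still wrong about the bytes actually being parsed, 
so this only works because the caller knows better than the document.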

> James