Any reason why cStringIO in 2.5 behaves differently from 2.4?

Mon Jul 30 04:33:39 EDT 2007

Stefan Scholl wrote:
> Michael L Torrie <torriem at chem.byu.edu> wrote:
>> xml.sax's use of parseString() is exactly correct.  xml.sax should
>> *never* parse python unicode strings as by definition XML must be
>> encoded as a *byte stream*, which is what a python string is.
> 
> I don't care about the definition of XML at this point of the
> program.

Maybe you should, as you're dealing with XML. But then, maybe you're lucky and
no-one will have to use your software.

> It's not parseString() that tells me something is wrong with the
> parameter. It's cStringIO, which is used on platforms where it is
> available. On other platforms no exceptions are thrown, because
> then StringIO is used, which behaves the same in Python 2.4 and
> Python 2.5 regarding unicode strings.
> 
> Other libraries like LXML (not included) parse unicode strings.

But only in very well defined cases. lxml actually checks whether the string
carries an XML declaration that names an encoding. If it does, you will get an
exception.

lxml's parser only accepts a unicode object as input if the unicode object is
the only way to determine the encoding. That's lxml's interpretation of
transport-provided encoding information, which is allowed by the spec.
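
The check described above can be sketched roughly like this. It is only an
illustration of the policy, not lxml's actual code: the helper name
check_text_input and the regular expression are made up for this example, and
it is written in Python 3 spelling, where str plays the role of 2.x unicode.

```python
import re

# Hypothetical pattern: an XML declaration at the very start of the text
# that contains an encoding pseudo-attribute. Not taken from lxml.
_DECLARED_ENCODING = re.compile(r'<\?xml\s[^>]*?\bencoding\s*=')


def check_text_input(text):
    """Return text unchanged unless it declares its own byte encoding.

    A text (unicode) string has no byte encoding any more, so a declaration
    such as encoding="iso-8859-1" inside it would contradict the input.
    """
    if not isinstance(text, str):
        raise TypeError("expected a text (unicode) string")
    if _DECLARED_ENCODING.match(text):
        raise ValueError("text input must not carry an encoding declaration")
    return text
```

With this sketch, check_text_input('<a/>') passes the string through, while a
string starting with '<?xml version="1.0" encoding="utf-8"?>' is rejected.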

If you want PyXML to behave the same, go ahead and send a patch to python-dev
or xml-sig.

> And these are two additional lines in my code now:
> 
>     if isinstance(string, unicode):
>             string = string.encode("utf-8")

This will only work as long as your XML does not have an encoding declaration.
Does your code guarantee that? Because otherwise it is broken. Have you
documented that?
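
To make the failure concrete, here is a minimal sketch of what happens when
the declaration and the blind UTF-8 encoding disagree. It assumes Python 3
(str is the text type, bytes is the byte stream) and the expat-based default
parser; the names TextCollector and parse_text are invented for the example.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collect all character data reported by the parser."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so keep them all.
        self.chunks.append(content)


def parse_text(xml_bytes):
    """Parse a byte string and return the concatenated character data."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks)


# One non-ASCII character (e with acute accent) and a declaration that
# promises ISO-8859-1 bytes.
document = '<?xml version="1.0" encoding="iso-8859-1"?><a>\u00e9</a>'

# Encoding as declared: the parser gets the single character back.
honoured = parse_text(document.encode("iso-8859-1"))

# Encoding blindly as UTF-8: the parser still trusts the declaration and
# decodes the two UTF-8 bytes as two separate ISO-8859-1 characters.
blind = parse_text(document.encode("utf-8"))
```

No exception is raised in the second case. The data is silently wrong, which
is worse than a traceback.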

>> A python /unicode/ string could be held internally in any number of
>> ways, 2, 3, 4, or even 8 bytes per character if the implementation
>> demanded it (a bit contrived, I admit).  Since the xml parser is only
>> ever intended to parse *XML*, why should it ever know what to do with
>> python unicode strings, which could be stored any number of ways, making
>> byte-parsing impossible?
> 
> xml.sax is no external parser.

Right, it's a package. But it contains an *XML* parser.

> The program doesn't have to
> communicate with the outside world at this point of execution.
> The Python program calls a Python function of a Python class and
> passes a Python unicode string as parameter.

A sequence of unicode characters, right.

Why not just pass XML? Would make your life easier.

> The behavior of cStringIO (the original topic of this thread) is
> correct and documented. parseString() uses the old idiom where
> cStringIO is imported as StringIO, when available. Despite the
> fact that they behave differently.
> 
> In my personal opinion: If parseString() shouldn't support
> unicode strings, then it should check for it and throw a
> meaningful exception.

Well, it does. Not because you passed a unicode string, though, but because
you passed a unicode string that does not map 1:1 to plain single-byte ASCII,
which is the common subset of the standard XML encodings. A sequence of plain
ASCII characters passed as a unicode string is perfectly ok.

So the API is even forgiving enough to accept unicode strings. It just obeys
the XML spec after that, that's all.
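
A small sketch of what "maps 1:1" means here, again in Python 3 spelling and
with a made-up helper name (maps_one_to_one): a text string is harmless
exactly when every character fits into one ASCII byte, because then ASCII,
UTF-8 and ISO-8859-1 all produce the same bytes.

```python
def maps_one_to_one(text):
    """Return True if text encodes to the same bytes in the usual encodings."""
    try:
        as_ascii = text.encode("ascii")
    except UnicodeEncodeError:
        # At least one character needs more than plain ASCII, so the byte
        # representation depends on which encoding is chosen.
        return False
    # For pure ASCII text these encodings agree byte for byte.
    return as_ascii == text.encode("utf-8") == text.encode("iso-8859-1")
```

For example, maps_one_to_one("<a>abc</a>") is true, while a string containing
u"\u00e9" fails the check and needs an explicit, declared encoding.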

> At the moment the code just looks as if someone has overlooked
> the fact that unicode strings (with non-ascii characters in them)
> cause a problem. Missing test?

No, just a wrong assumption on your side. Read the spec, learn, think, understand.

>> So your code is faulty in its assumptions, not xml.sax.
> 
> As I said in the conclusion, a few messages before: Undocumented,
> implementation dependent behavior.

Well, the implementation was correct *under the assumption* that cStringIO
behaved as expected. But as cStringIO deviated from its documentation,
xml.sax.parseString could not work as expected.

> Or maybe just a bug, considering the following on
> http://docs.python.org/lib/module-xml.sax.html
> 
>         A typical SAX application uses three kinds of objects:
>         readers, handlers and input sources. ``Reader'' in this
>         context is another term for parser, i.e. some piece of
>         code that reads the bytes or characters from the input
>         source, and produces a sequence of events.
> 
> 
> Bytes _or_ characters.

I think they were just trying to have more people understand what they wanted
to say, not only those who know XML.

Stefan