XML: minidom toxml() does not work for non English files! :-(

Trent Mick trentm at ActiveState.com
Mon May 6 16:03:07 EDT 2002


Michael, Marc-Andre,

Perhaps you could help me shed some light on this. There are two issues
that I see:
    1. The actual problem that Jarosław reported.
        > from xml.dom import minidom  
        > xmldoc = minidom.parse('myfile.xml')
        > print xmldoc.toxml() 
        > 
        > It works for 7-bit text fine. But the problem is it works ONLY for
        > pure ASCII text. :-( If I try to use any of non English characters,
        > Python raise an exception:
        > 
        >   UnicodeError: ASCII encoding error: ordinal not in range(128)

       Jarosław mentions that the problem goes away if he replaces
       ActivePython 2.2.1's StringIO.py with the one from the PythonLabs
       distro. That would be fine (a bug in ActivePython) except that
       ActivePython has the more *recent* StringIO.py. So is Jarosław
       misusing StringIO.py, is this StringIO.py checkin incorrect, or
       am I confused? The checkin in question:
            MAL's checkin on the trunk:
            http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Lib/StringIO.py.diff?r1=1.19&r2=1.20
            Michael's backport to Python 2.2:
            http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Lib/StringIO.py.diff?r1=1.19&r2=1.19.12.1
    
    2. It looks to me like Python 2.2.1 does *not* include the
       StringIO.py that is part of the 'r221' Python CVS tag. Am I
       wrong?


Any insight would be appreciated.

Thanks,
Trent



Further information:

- StringIO CVS log:
  http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Lib/StringIO.py


- diff of Python Labs' 2.2.1 StringIO.py with StringIO.py in CVS at the
  'r221' tag:
    
    C:\>diff -u C:\PythonLabs22\Lib\StringIO.py D:\cvs\python-r221\dist\src\Lib\StringIO.py
    --- C:\PythonLabs22\Lib\StringIO.py     Mon Sep 24 13:34:52 2001
    +++ D:\cvs\python-r221\dist\src\Lib\StringIO.py Mon Mar 18 05:31:30 2002

    @@ -28,7 +28,7 @@
       bytes that occupy space in the buffer.
     - There's a simple test set (see end of this file).
     """
    -
    +import types
     try:
         from errno import EINVAL
     except ImportError:
    @@ -38,8 +38,10 @@

     class StringIO:
         def __init__(self, buf = ''):
    -        # Force self.buf to be a string
    -        self.buf = str(buf)
    +        # Force self.buf to be a string or unicode
    +        if type(buf) not in types.StringTypes:
    +            buf = str(buf)
    +        self.buf = buf
             self.len = len(buf)
             self.buflist = []
             self.pos = 0
    @@ -135,8 +137,9 @@
             if self.closed:
                 raise ValueError, "I/O operation on closed file"
             if not s: return
    -        # Force s to be a string
    -        s = str(s)
    +        # Force s to be a string or unicode
    +        if type(s) not in types.StringTypes:
    +            s = str(s)
             if self.pos > self.len:
                 self.buflist.append('\0'*(self.pos - self.len))
                 self.len = self.pos
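
For anyone who wants to see what the two coercion rules in the diff do to a non-ASCII value, here is a minimal sketch. The helper names old_coerce and new_coerce are hypothetical and are not part of StringIO.py. The sketch uses isinstance rather than the types.StringTypes test from the diff so that it also runs on current Python, and it models the implicit ASCII encoding that str() applied to a unicode object in Python 2.2 with an explicit .encode('ascii'). It only illustrates the two rules; it does not settle which StringIO.py is behind the reported traceback.

```python
def old_coerce(buf):
    # Models the pre-patch rule "self.buf = str(buf)".
    # Assumption: on Python 2, str() of a unicode object implicitly encoded
    # it with the ASCII codec; that step is spelled out explicitly here.
    if isinstance(buf, type(u'')):
        return buf.encode('ascii')  # raises UnicodeError for non-ASCII text
    return str(buf)


def new_coerce(buf):
    # Models the patched rule: byte strings and unicode strings pass through
    # untouched, anything else is converted with str().
    if not isinstance(buf, (type(b''), type(u''))):
        buf = str(buf)
    return buf


name = u'Jaros\u0142aw'

try:
    old_coerce(name)
except UnicodeError as exc:
    print('old rule fails: %s' % exc.__class__.__name__)

# The patched rule hands the unicode object back unchanged.
print(new_coerce(name) is name)
```

Under the old rule a unicode value with any character outside 7-bit ASCII cannot be stored at all; under the patched rule it is stored as-is, so any encoding step moves to whoever reads the buffer back out.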



[Jarosław Zabiełło wrote]
> I have a small code:
> 
> from xml.dom import minidom  
> xmldoc = minidom.parse('myfile.xml')
> print xmldoc.toxml() 
> 
> It works for 7-bit text fine. But the problem is it works ONLY for
> pure ASCII text. :-( If I try to use any of non English characters,
> Python raise an exception:
> 
>   UnicodeError: ASCII encoding error: ordinal not in range(128)
> 
> It does NOT work even on utf-8 xml files with any character outside
> 7-bit ASCII character set. It is strange, because utf-8 should be
> correctly parsed by all xml tools.
> 
> Is it mean toxml() or toprettyxml() methods of minidom are useless for
> non English strings? I need them to cut one big xml file into smaller
> pieces and write them into several files.
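
One possible way to approach the splitting task described above, sketched under assumptions: the sample document, the 'person' tag name, and the split_people helper are all made up for illustration. Each piece is encoded to UTF-8 explicitly, so the result does not depend on the default ASCII codec. This is untested against the ActivePython 2.2.1 build discussed in this thread.

```python
from xml.dom import minidom

# Hypothetical sample input with one non-ASCII character (U+0142).
SAMPLE = u'<people><person>Jaros\u0142aw</person><person>Trent</person></people>'


def split_people(xml_text, tag='person'):
    """Return one UTF-8 encoded byte string per <tag> element."""
    # Hand the parser bytes; without an XML declaration they are read as UTF-8.
    doc = minidom.parseString(xml_text.encode('utf-8'))
    pieces = []
    for node in doc.getElementsByTagName(tag):
        # toxml() returns text here; encode it explicitly instead of relying
        # on an implicit conversion when the text is printed or written.
        pieces.append(node.toxml().encode('utf-8'))
    return pieces


pieces = split_people(SAMPLE)
print(len(pieces))
```

Each entry in pieces could then be written to its own file opened in binary mode ('wb'), which avoids any further implicit encoding on output.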


[Jarosław Zabiełło wrote]
> I found a solution.
> 
> The last release of ActivePython has bad StringIO.py file! I compare
> it with Python 2.2.1rc from www.python.org _which works fine_.



-- 
Trent Mick
TrentM at ActiveState.com




