XML: minidom toxml() does not work for non English files! :-(

M.-A. Lemburg mal at lemburg.com
Tue May 7 05:33:59 EDT 2002


Trent Mick wrote:
> 
> Michael, Marc-Andre,
> 
> Perhaps you could help me shed some light on this. There are two issues
> that I see:
>     1. The actual problem that Jarosław reported.
>         > from xml.dom import minidom
>         > xmldoc = minidom.parse('myfile.xml')
>         > print xmldoc.toxml()
>         >
>         > It works fine for 7-bit text. But the problem is that it works ONLY for
>         > pure ASCII text. :-( If I try to use any non-English characters,
>         > Python raises an exception:
>         >
>         >   UnicodeError: ASCII encoding error: ordinal not in range(128)
> 
>        Jarosław mentions that the problem goes away if he replaces
>        ActivePython 2.2.1's StringIO.py with the one from the PythonLabs
>        distro. That would be fine (a bug in ActivePython) except that
>        ActivePython has the more *recent* StringIO.py. So is Jarosław
>        misusing StringIO.py or is this StringIO.py checkin incorrect or
>        am I confused:
>             MAL's checkin on the trunk:
>             http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Lib/StringIO.py.diff?r1=1.19&r2=1.20
>             Michael's backport to Python 2.2:
>             http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Lib/StringIO.py.diff?r1=1.19&r2=1.19.12.1

I think you have to provide more information here, e.g.
the traceback and a dump of the local variables.
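
A minimal sketch of how such a report could be collected. It is written for a current Python 3 interpreter (an assumption; the thread itself concerns Python 2.2, where the idioms differ), and `report_failure` and `encode_ascii` are hypothetical names used only for illustration. The stand-in function simply forces an ASCII encode so that there is something to fail.

```python
import sys
import traceback


def report_failure(func, *args):
    """Run func(*args); on error return (traceback text, innermost locals)."""
    try:
        func(*args)
    except Exception:
        exc_type, exc_value, tb = sys.exc_info()
        text = "".join(traceback.format_exception(exc_type, exc_value, tb))
        # Walk to the innermost Python frame, where the exception surfaced.
        while tb.tb_next is not None:
            tb = tb.tb_next
        # repr() each value so the dump can be pasted into a mail safely.
        local_vars = {
            name: repr(value) for name, value in tb.tb_frame.f_locals.items()
        }
        return text, local_vars
    return None, {}


def encode_ascii(text):
    # Hypothetical stand-in for the failing call: encode with the ASCII codec.
    return text.encode("ascii")


trace_text, trace_locals = report_failure(encode_ascii, "Jaros\u0142aw")
print(trace_text)
for name, value in sorted(trace_locals.items()):
    print(name, "=", value)
```

Posting both the traceback text and the locals of the innermost frame shows which object was being converted when the error was raised.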

The patch only re-enables Unicode handling in StringIO,
so I can't see why this fails. It could be that minidom mixes
Unicode and 8-bit strings, and that the implicit coercion
going on inside StringIO then triggers the UnicodeError.
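
To illustrate the kind of implicit coercion meant here: in Python 2, combining a Unicode object with an 8-bit string decodes the 8-bit string with the default ASCII codec. The sketch below emulates that rule explicitly on Python 3 (an assumption, since Python 3 no longer coerces implicitly). `join_like_py2` is a hypothetical helper, not part of StringIO or minidom, and this shows only the general mechanism, not a confirmed reproduction of the reported failure.

```python
def join_like_py2(pieces):
    """Join buffer pieces following Python 2's coercion rule (emulated).

    If any piece is text, every byte string is first decoded with the
    'ascii' codec, which fails on any byte >= 0x80.
    """
    if any(isinstance(p, str) for p in pieces):
        return "".join(
            p if isinstance(p, str) else p.decode("ascii") for p in pieces
        )
    return b"".join(pieces)


name_utf8 = "Jaros\u0142aw".encode("utf-8")

# Only 8-bit pieces: nothing is coerced, so nothing fails.
all_bytes = join_like_py2([b"<name>", name_utf8, b"</name>"])

# Text pieces mixed with a non-ASCII byte string: the decode step fails.
try:
    join_like_py2(["<name>", name_utf8, "</name>"])
    mixed_error = None
except UnicodeDecodeError as exc:
    mixed_error = exc

print(all_bytes)
print(type(mixed_error).__name__)
```

The point of the sketch is that the failure depends on the mix of types in the buffer, not on any single write: a buffer holding only 8-bit strings or only Unicode pieces joins without error.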

>     2. It looks to me like Python 2.2.1 does *not* include the
>        StringIO.py that is part of the 'r221' Python CVS tag. Am I
>        wrong?
> 
> Any insight would be appreciated.
> 
> Thanks,
> Trent
> 
> Further information:
> 
> - StringIO CVS log:
>   http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Lib/StringIO.py
> 
> - diff of Python Labs' 2.2.1 StringIO.py with StringIO.py in CVS at the
>   'r221' tag:
> 
>     C:\>diff -u C:\PythonLabs22\Lib\StringIO.py D:\cvs\python-r221\dist\src\Lib\StringIO.py
>     --- C:\PythonLabs22\Lib\StringIO.py     Mon Sep 24 13:34:52 2001
>     +++ D:\cvs\python-r221\dist\src\Lib\StringIO.py Mon Mar 18 05:31:30 2002
> 
>     @@ -28,7 +28,7 @@
>        bytes that occupy space in the buffer.
>      - There's a simple test set (see end of this file).
>      """
>     -
>     +import types
>      try:
>          from errno import EINVAL
>      except ImportError:
>     @@ -38,8 +38,10 @@
> 
>      class StringIO:
>          def __init__(self, buf = ''):
>     -        # Force self.buf to be a string
>     -        self.buf = str(buf)
>     +        # Force self.buf to be a string or unicode
>     +        if type(buf) not in types.StringTypes:
>     +            buf = str(buf)
>     +        self.buf = buf
>              self.len = len(buf)
>              self.buflist = []
>              self.pos = 0
>     @@ -135,8 +137,9 @@
>              if self.closed:
>                  raise ValueError, "I/O operation on closed file"
>              if not s: return
>     -        # Force s to be a string
>     -        s = str(s)
>     +        # Force s to be a string or unicode
>     +        if type(s) not in types.StringTypes:
>     +            s = str(s)
>              if self.pos > self.len:
>                  self.buflist.append('\0'*(self.pos - self.len))
>                  self.len = self.pos
> 
> [Jarosław Zabiełło wrote]
> > I have a small piece of code:
> >
> > from xml.dom import minidom
> > xmldoc = minidom.parse('myfile.xml')
> > print xmldoc.toxml()
> >
> > It works fine for 7-bit text. But the problem is that it works ONLY for
> > pure ASCII text. :-( If I try to use any non-English characters,
> > Python raises an exception:
> >
> >   UnicodeError: ASCII encoding error: ordinal not in range(128)
> >
> > It does NOT work even on UTF-8 XML files containing any character outside
> > the 7-bit ASCII character set. That is strange, because UTF-8 should be
> > parsed correctly by all XML tools.
> >
> > Does this mean the toxml() and toprettyxml() methods of minidom are useless
> > for non-English strings? I need them to cut one big XML file into smaller
> > pieces and write them into several files.
> 
> [Jarosław Zabiełło wrote]
> > I found a solution.
> >
> > The latest release of ActivePython has a bad StringIO.py file! I compared
> > it with the one in Python 2.2.1rc from www.python.org, _which works fine_.
> 
> --
> Trent Mick
> TrentM at ActiveState.com

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/




