XML: minidom toxml() does not work for non English files! :-(
Trent Mick
trentm at ActiveState.com
Mon May 6 16:03:07 EDT 2002
Michael, Marc-Andre,
Perhaps you could help me shed some light on this. There are two issues
that I see:
1. The actual problem that Jarosław reported.
> from xml.dom import minidom
> xmldoc = minidom.parse('myfile.xml')
> print xmldoc.toxml()
>
> It works for 7-bit text fine. But the problem is it works ONLY for
> pure ASCII text. :-( If I try to use any of non English characters,
> Python raise an exception:
>
> UnicodeError: ASCII encoding error: ordinal not in range(128)
Jarosław mentions that the problem goes away if he replaces
ActivePython 2.2.1's StringIO.py with the one from the PythonLabs
distro. That would be fine (a bug in ActivePython) except that
ActivePython has the more *recent* StringIO.py. So is Jarosław
misusing StringIO.py, is this StringIO.py checkin incorrect, or
am I confused?
MAL's checkin on the trunk:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Lib/StringIO.py.diff?r1=1.19&r2=1.20
Michael's backport to Python 2.2:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Lib/StringIO.py.diff?r1=1.19&r2=1.19.12.1
2. It looks to me like Python 2.2.1 does *not* include the
StringIO.py that is part of the 'r221' Python CVS tag. Am I
wrong?
Any insight would be appreciated.
Thanks,
Trent
Further information:
- StringIO CVS log:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Lib/StringIO.py
- diff of Python Labs' 2.2.1 StringIO.py with StringIO.py in CVS at the
'r221' tag:
C:\>diff -u C:\PythonLabs22\Lib\StringIO.py D:\cvs\python-r221\dist\src\Lib\StringIO.py
--- C:\PythonLabs22\Lib\StringIO.py Mon Sep 24 13:34:52 2001
+++ D:\cvs\python-r221\dist\src\Lib\StringIO.py Mon Mar 18 05:31:30 2002
@@ -28,7 +28,7 @@
   bytes that occupy space in the buffer.
 - There's a simple test set (see end of this file).
 """
-
+import types
 try:
     from errno import EINVAL
 except ImportError:
@@ -38,8 +38,10 @@
 class StringIO:
     def __init__(self, buf = ''):
-        # Force self.buf to be a string
-        self.buf = str(buf)
+        # Force self.buf to be a string or unicode
+        if type(buf) not in types.StringTypes:
+            buf = str(buf)
+        self.buf = buf
         self.len = len(buf)
         self.buflist = []
         self.pos = 0
@@ -135,8 +137,9 @@
         if self.closed:
             raise ValueError, "I/O operation on closed file"
         if not s: return
-        # Force s to be a string
-        s = str(s)
+        # Force s to be a string or unicode
+        if type(s) not in types.StringTypes:
+            s = str(s)
         if self.pos > self.len:
             self.buflist.append('\0'*(self.pos - self.len))
             self.len = self.pos
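
The two hunks in the diff swap one coercion rule for another. A rough
Python 3 approximation of both rules, for comparison only, might look
like the sketch below. The helper names coerce_always and
coerce_unless_string are made up for illustration, and the explicit
encode("ascii") is an assumption standing in for the implicit ASCII
encoding that str() applied to unicode objects in Python 2.

```python
# Python 3 approximation; the helper names are hypothetical and are not
# part of StringIO.py.

def coerce_always(value):
    """Older rule from the diff: always call str() on the value.

    Under Python 2, str() on a unicode object implicitly encoded it as
    ASCII. The explicit encode("ascii") below stands in for that step.
    """
    if isinstance(value, bytes):
        return value
    return str(value).encode("ascii")


def coerce_unless_string(value):
    """Newer rule from the diff: convert only values that are not strings."""
    if not isinstance(value, (str, bytes)):
        value = str(value)
    return value


# Compare both rules on a non-string, on ASCII text, and on non-ASCII text.
results = {}
for sample in (42, "plain", "Jaros\u0142aw"):
    try:
        old = coerce_always(sample)
    except UnicodeEncodeError:
        old = "UnicodeEncodeError"
    results[sample] = (old, coerce_unless_string(sample))
    print(repr(sample), "->", results[sample])
```

This only illustrates what the diff changes. It does not settle which
behaviour minidom actually needs, which is the open question here.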
[Jarosław Zabiełło wrote]
> I have a small code:
>
> from xml.dom import minidom
> xmldoc = minidom.parse('myfile.xml')
> print xmldoc.toxml()
>
> It works for 7-bit text fine. But the problem is it works ONLY for
> pure ASCII text. :-( If I try to use any of non English characters,
> Python raise an exception:
>
> UnicodeError: ASCII encoding error: ordinal not in range(128)
>
> It does NOT work even on utf-8 xml files with any character outside
> 7-bit ASCII character set. It is strange, because utf-8 should be
> correctly parsed by all xml tools.
>
> Is it mean toxml() or toprettyxml() methods of minidom are useless for
> non English strings? I need them to cut one big xml file into smaller
> pieces and write them into several files.
[Jarosław Zabiełło wrote]
> I found a solution.
>
> The last release of ActivePython has bad StringIO.py file! I compare
> it with Python 2.2.1rc from www.python.org _which works fine_.
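
For readers on a current interpreter, the task described in the quoted
messages (serializing a non-ASCII document and cutting it into pieces)
can be sketched as follows. This is a Python 3 sketch, not the Python
2.2 code under discussion: the encoding argument to toxml() is, as far
as I know, not available in the 2.2 releases, and the SAMPLE document is
a hypothetical stand-in because myfile.xml is not shown in the thread.

```python
from xml.dom import minidom

# Hypothetical sample input; "\u0142" is the Polish letter l with stroke.
SAMPLE = "<people><person>Jaros\u0142aw</person><person>Zo\u00eb</person></people>"

# Parse UTF-8 bytes, as they would arrive from a file on disk.
doc = minidom.parseString(SAMPLE.encode("utf-8"))

# Whole document as UTF-8 bytes; the XML declaration names the encoding.
whole = doc.toxml(encoding="utf-8")

# One serialized fragment per child element, each encoded explicitly so
# that no implicit ASCII conversion is involved.
pieces = [
    node.toxml().encode("utf-8")
    for node in doc.documentElement.childNodes
    if node.nodeType == node.ELEMENT_NODE
]

print(whole)
for piece in pieces:
    print(piece)
```

Each entry in pieces is a bytes object, so it can be written to its own
file opened in binary mode.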
--
Trent Mick
TrentM at ActiveState.com