From janssen at parc.com  Tue Jun 10 21:13:35 2008
From: janssen at parc.com (Bill Janssen)
Date: Tue, 10 Jun 2008 12:13:35 PDT
Subject: [XML-SIG] "encoding" argument to xml.dom.minidom.toxml()?
Message-ID: <08Jun10.121338pdt."58698"@synergy1.parc.xerox.com>

I've been using the minidom to produce little properly-formatted XML
documents, by building a DOM tree, then calling "toxml" to generate
the actual XML.  But I tripped over the optional "encoding" argument
to that function.

I figured that the only point of having an encoding argument would be
to allow the user to control the output character set encoding, but it
turns out that specifying an encoding of, say, "ASCII", doesn't do
that.  It just raises encoding exceptions when you attempt to encode a
non-ASCII character.  What's the point of having an encoding argument
when it always has to be "UTF-8"?

Especially since it seems that this could be made useful by changing
one line of code.  In xml/dom/minidom.py, in the class Node, in the
method "toprettyxml", change the line

    writer = codecs.lookup(encoding)[3](writer)

to

    writer = codecs.lookup(encoding)[3](writer, "xmlcharrefreplace")
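
To make the difference concrete, here is a minimal sketch of what the
extra argument does.  It assumes Python 3 and wraps a plain io.BytesIO
rather than minidom's own writer; the helper name write_encoded is
hypothetical and is not part of minidom.

```python
import codecs
import io


def write_encoded(text, encoding, errors="strict"):
    """Push text through the codec's StreamWriter and return the bytes."""
    raw = io.BytesIO()
    # Index 3 of the codecs.lookup() result is the StreamWriter class.
    writer = codecs.lookup(encoding)[3](raw, errors)
    writer.write(text)
    return raw.getvalue()


sample = "<root>caf\u00e9 \u20ac</root>"

# Current behaviour: the strict handler raises on the first non-ASCII character.
strict_error = None
try:
    write_encoded(sample, "ascii")
except UnicodeEncodeError as exc:
    strict_error = exc

# Proposed behaviour: unencodable characters become numeric character references.
replaced = write_encoded(sample, "ascii", "xmlcharrefreplace")
print(replaced)  # b'<root>caf&#233; &#8364;</root>'
```

The sketch only covers text content; whether the same substitution is
safe everywhere in a document is a separate question.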

What am I missing here?

Bill


From stefan_ml at behnel.de  Tue Jun 10 21:33:31 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 10 Jun 2008 21:33:31 +0200
Subject: [XML-SIG] "encoding" argument to xml.dom.minidom.toxml()?
In-Reply-To: <08Jun10.121338pdt."58698"@synergy1.parc.xerox.com>
References: <08Jun10.121338pdt."58698"@synergy1.parc.xerox.com>
Message-ID: <484ED70B.8060107@behnel.de>

Hi,

Bill Janssen wrote:
> I figured that the only point of having an encoding argument would be
> to allow the user to control the output character set encoding, but it
> turns out that specifying an encoding of, say, "ASCII", doesn't do
> that.  It just raises encoding exceptions when you attempt to encode a
> non-ASCII character.

Well, what did you expect? That it magically transmogrifies your non-ASCII
data into plain ASCII data?


> What's the point of having an encoding argument
> when it always has to be "UTF-8"?

Did you try any other encoding besides "ASCII"?


> Especially since it seems that this could be made useful by changing
> one line of code.  In xml/dom/minidom.py, in the class Node, in the
> method "toprettyxml", change the line
> 
>     writer = codecs.lookup(encoding)[3](writer)
> 
> to
> 
>     writer = codecs.lookup(encoding)[3](writer, "xmlcharrefreplace")

Could be done, yes. ElementTree and lxml do it that way. It's not required,
though. If you say you want to serialise plain ASCII data, nothing keeps an
XML serialiser from shouting at you when it finds non-ASCII data. Same for
latin1 data or Cyrillic data, or ...
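
For comparison, a small sketch of the ElementTree behaviour, assuming
the xml.etree.ElementTree module from the Python 3 standard library
(lxml is not shown here):

```python
import xml.etree.ElementTree as ET

root = ET.Element("root")
root.text = "\u20ac"  # EURO SIGN, not representable in ASCII

# Serialising to an ASCII target does not raise; the character is
# written as a numeric character reference instead.
serialized = ET.tostring(root, encoding="us-ascii")
print(serialized)
```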

Stefan


From janssen at parc.com  Tue Jun 10 22:59:06 2008
From: janssen at parc.com (Bill Janssen)
Date: Tue, 10 Jun 2008 13:59:06 PDT
Subject: [XML-SIG] "encoding" argument to xml.dom.minidom.toxml()?
In-Reply-To: <484ED70B.8060107@behnel.de> 
References: <08Jun10.121338pdt."58698"@synergy1.parc.xerox.com>
	<484ED70B.8060107@behnel.de>
Message-ID: <08Jun10.135914pdt."58698"@synergy1.parc.xerox.com>

Stefan,

> >     writer = codecs.lookup(encoding)[3](writer, "xmlcharrefreplace")
> 
> Could be done, yes. ElementTree and lxml do it that way. It's not required,
> though. If you say you want to serialise plain ASCII data, nothing keeps an
> XML serialiser from shouting at you when it finds non-ASCII data. Same for
> latin1 data or Cyrillic data, or ...

I'm not sure what you're saying.  The "encoding" parameter is about
the character set encoding of the XML output file; it has little or
nothing to do with the input data, which in my case is all unicode
strings.  Clearly, with XML, one can use ASCII, for instance, as a
character set encoding.  Why not make this parameter work?
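
As an illustration of that point, here is a sketch (assuming Python 3
and the standard xml.dom.minidom parser) of a document whose bytes are
all ASCII yet whose content is not:

```python
from xml.dom import minidom

# Every byte below is ASCII; the non-ASCII characters are carried as
# numeric character references.
ascii_doc = (b'<?xml version="1.0" encoding="US-ASCII"?>'
             b'<root>caf&#233; &#8364;</root>')
assert all(byte < 128 for byte in ascii_doc)

root = minidom.parseString(ascii_doc).documentElement
text = "".join(child.data for child in root.childNodes)
print(text)  # café €
```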

Bill

From janssen at parc.com  Tue Jun 10 23:00:17 2008
From: janssen at parc.com (Bill Janssen)
Date: Tue, 10 Jun 2008 14:00:17 PDT
Subject: [XML-SIG] "encoding" argument to xml.dom.minidom.toxml()?
In-Reply-To: <484ED70B.8060107@behnel.de> 
References: <08Jun10.121338pdt."58698"@synergy1.parc.xerox.com>
	<484ED70B.8060107@behnel.de>
Message-ID: <08Jun10.140019pdt."58698"@synergy1.parc.xerox.com>

> Well, what did you expect? That it magically transmogrifies your non-ASCII
> data into plain ASCII data?

Yep.  And there's no reason I can see why it can't do exactly that.  I
think the "encoding" argument should either be removed, or made to
work.

Bill

From stefan_ml at behnel.de  Tue Jun 10 23:05:44 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 10 Jun 2008 23:05:44 +0200
Subject: [XML-SIG] "encoding" argument to xml.dom.minidom.toxml()?
In-Reply-To: <08Jun10.140019pdt."58698"@synergy1.parc.xerox.com>
References: <08Jun10.121338pdt."58698"@synergy1.parc.xerox.com>
	<484ED70B.8060107@behnel.de>
	<08Jun10.140019pdt."58698"@synergy1.parc.xerox.com>
Message-ID: <484EECA8.9000508@behnel.de>

Hi,

Bill Janssen wrote:
>> Well, what did you expect? That it magically transmogrifies your non-ASCII
>> data into plain ASCII data?
> 
> Yep.  And there's no reason I can see why it can't do exactly that.  I
> think the "encoding" argument should either be removed, or made to
> work.

That's why I asked if you tried other encodings. Obviously, you only tried
"UTF-8" and "ASCII". There's tons of other encodings out there, and I bet they
work just fine - as does "ASCII" (for ASCII data, that is).

Stefan

From bkline at rksystems.com  Tue Jun 10 23:51:12 2008
From: bkline at rksystems.com (Bob Kline)
Date: Tue, 10 Jun 2008 17:51:12 -0400
Subject: [XML-SIG] "encoding" argument to xml.dom.minidom.toxml()?
In-Reply-To: <484EECA8.9000508@behnel.de>
References: <08Jun10.121338pdt."58698"@synergy1.parc.xerox.com>	<484ED70B.8060107@behnel.de>	<08Jun10.140019pdt."58698"@synergy1.parc.xerox.com>
	<484EECA8.9000508@behnel.de>
Message-ID: <484EF750.8020907@rksystems.com>

Stefan Behnel wrote:
> Hi,
>
> Bill Janssen wrote:
>   
>>> Well, what did you expect? That it magically transmogrifies your non-ASCII
>>> data into plain ASCII data?
>>>       
>> Yep.  And there's no reason I can see why it can't do exactly that.  I
>> think the "encoding" argument should either be removed, or made to
>> work.
>>     
>
> That's why I asked if you tried other encodings. Obviously, you only tried
> "UTF-8" and "ASCII". There's tons of other encodings out there, and I bet they
> work just fine - as does "ASCII" (for ASCII data, that is).
>
> Stefan
>   

I suspect there's a certain amount of unarticulated assumptions on both 
sides of this exchange.  I'm guessing that Bill might be thinking 
something like: "it's possible to represent any Unicode character in XML 
as &#<code-position-for character>"; and was hoping that the method 
would do just that for the non-ASCII characters if he asks for ASCII 
encoding.  Stefan is (if he even realizes that Bill might be thinking 
this) himself possibly thinking "no way is the method going to do that 
much work for the caller."  Of course, I realize that it's always risky 
trying to guess what people are thinking, but I throw this out as a 
possibility in the hopes that, if I turn out to be right for even just 
one side of the exchange, this might help clear the air a little bit. :-)

-- 
Bob Kline
http://www.rksystems.com
mailto:bkline at rksystems.com


From stefan_ml at behnel.de  Wed Jun 11 00:16:54 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 11 Jun 2008 00:16:54 +0200
Subject: [XML-SIG] "encoding" argument to xml.dom.minidom.toxml()?
In-Reply-To: <484EF750.8020907@rksystems.com>
References: <08Jun10.121338pdt."58698"@synergy1.parc.xerox.com>	<484ED70B.8060107@behnel.de>	<08Jun10.140019pdt."58698"@synergy1.parc.xerox.com>
	<484EECA8.9000508@behnel.de> <484EF750.8020907@rksystems.com>
Message-ID: <484EFD56.5020908@behnel.de>

Hi,

Bob Kline wrote:
> Stefan is (if he even realizes that Bill might be thinking
> this) himself possibly thinking "no way is the method going to do that
> much work for the caller."

No, I'm actually just saying that this is not a bug and maybe not even a
missing feature. It's a design decision. Saying that the "encoding" keyword
doesn't work, just because it detects the error that the user passed an
encoding target that cannot represent the data, is pretty obviously wrong.
Some people may expect this error.

Stefan

From janssen at parc.com  Wed Jun 11 01:32:53 2008
From: janssen at parc.com (Bill Janssen)
Date: Tue, 10 Jun 2008 16:32:53 PDT
Subject: [XML-SIG] "encoding" argument to xml.dom.minidom.toxml()?
In-Reply-To: <484EF750.8020907@rksystems.com> 
References: <08Jun10.121338pdt."58698"@synergy1.parc.xerox.com>
	<484ED70B.8060107@behnel.de>
	<08Jun10.140019pdt."58698"@synergy1.parc.xerox.com>
	<484EECA8.9000508@behnel.de> <484EF750.8020907@rksystems.com>
Message-ID: <08Jun10.163259pdt."58698"@synergy1.parc.xerox.com>

> I suspect there's a certain amount of unarticulated assumptions on both 
> sides of this exchange.  I'm guessing that Bill might be thinking 
> something like: "it's possible to represent any Unicode character in XML 
> as &#<code-position-for character>"; and was hoping that the method 
> would do just that for the non-ASCII characters if he asks for ASCII 
> encoding.

Yep, that's what I was thinking.  I don't see any other reason to have
that parameter there.

The reason I asked on this list (instead of just committing the change
:-) is that I don't really know much about the grubby details of XML,
and wanted to engage some minds to consider possible nasty
side-effects of making that change.  For instance, would emitting
charrefs in a CDATA section or a Processing Instructions section
really be a good idea?
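
A quick sketch of why CDATA is a concern, assuming Python 3 and the
standard minidom parser (text_of is a hypothetical helper): a
character reference inside a CDATA section is not expanded by the
parser, so it would come back as literal text.

```python
from xml.dom import minidom


def text_of(xml_bytes):
    """Return the concatenated character data of the root element."""
    root = minidom.parseString(xml_bytes).documentElement
    return "".join(child.data for child in root.childNodes)


in_text = text_of(b"<root>&#8364;</root>")
in_cdata = text_of(b"<root><![CDATA[&#8364;]]></root>")

print(repr(in_text))   # the euro sign itself
print(repr(in_cdata))  # the seven characters '&#8364;'
```

So a blanket "xmlcharrefreplace" on the output stream would change the
meaning of CDATA content that contains unencodable characters.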

Bill

From janssen at parc.com  Wed Jun 11 01:41:19 2008
From: janssen at parc.com (Bill Janssen)
Date: Tue, 10 Jun 2008 16:41:19 PDT
Subject: [XML-SIG] "encoding" argument to xml.dom.minidom.toxml()?
In-Reply-To: <484EFD56.5020908@behnel.de> 
References: <08Jun10.121338pdt."58698"@synergy1.parc.xerox.com>
	<484ED70B.8060107@behnel.de>
	<08Jun10.140019pdt."58698"@synergy1.parc.xerox.com>
	<484EECA8.9000508@behnel.de> <484EF750.8020907@rksystems.com>
	<484EFD56.5020908@behnel.de>
Message-ID: <08Jun10.164124pdt."58698"@synergy1.parc.xerox.com>

> Hi,
> 
> Bob Kline wrote:
> > Stefan is (if he even realizes that Bill might be thinking
> > this) himself possibly thinking "no way is the method going to do that
> > much work for the caller."
> 
> No, I'm actually just saying that this is not a bug and maybe not even a
> missing feature. It's a design decision. Saying that the "encoding" keyword
> doesn't work, just because it detects the error that the user passed an
> encoding target that cannot represent the data, is pretty obviously wrong.

That's not what I'm saying.  I'm objecting to the fact that the
encoding target *can* represent the data, but the code isn't written
to do that.  That's the bug I'm pointing out.

If in fact the "encoding" argument is about "type-checking" the input
data against some character set, rather than being about the XML
character set encoding, then both the code and the documentation are
broken.  But that's a different bug, and to my way of thinking a much
less interesting and less useful way to perceive the situation.

Bill

From bkline at rksystems.com  Wed Jun 11 03:47:13 2008
From: bkline at rksystems.com (Bob Kline)
Date: Tue, 10 Jun 2008 21:47:13 -0400
Subject: [XML-SIG] "encoding" argument to xml.dom.minidom.toxml()?
In-Reply-To: <08Jun10.164124pdt."58698"@synergy1.parc.xerox.com>
References: <08Jun10.121338pdt."58698"@synergy1.parc.xerox.com>
	<484ED70B.8060107@behnel.de>
	<08Jun10.140019pdt."58698"@synergy1.parc.xerox.com>
	<484EECA8.9000508@behnel.de> <484EF750.8020907@rksystems.com>
	<484EFD56.5020908@behnel.de>
	<08Jun10.164124pdt."58698"@synergy1.parc.xerox.com>
Message-ID: <484F2EA1.2060307@rksystems.com>

I believe there are reasonable grounds for both Bill's and Stefan's 
interpretation of the somewhat ambiguous documentation for the method, 
and further, that the documentation would benefit from some 
clarification one way or another.  I don't think Bill is correct in
thinking that there is no other possible reason for having the encoding
parameter than to induce the method to use numeric character references 
for those characters which don't directly fit in the selected encoding.  
I have used similar methods which are known to raise an exception for 
mismatched encodings/values to determine the most widely supported 
encoding which adequately handles all the characters in a Unicode 
string, and I've seen others do the same.  Of course, it's more 
justifiable to rely on such behavior when the documentation makes it 
clear exactly when the exceptions will be raised.  In this case, given 
the current wording, it would technically be up to the whim of the 
implementor.  In general, this page of the standard library 
documentation could use a little cleanup (for example, the docs for the 
next method refer to "the encoding argument" which - according to the
signature for the method - doesn't accept that argument at all).  Once 
it's clear that there is consensus as to which behavior is the most 
appropriate for toxml(), I'll be happy to contribute to a documentation 
patch which will nail things down more clearly.
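
The exception-probing technique I mentioned looks roughly like this
sketch.  It uses plain str.encode() from Python 3 rather than toxml(),
and both the function name and the candidate list are only
illustrative assumptions:

```python
def narrowest_encoding(text, candidates=("ascii", "iso-8859-1", "utf-8")):
    """Return the first candidate that can encode text without errors."""
    for name in candidates:
        try:
            text.encode(name)  # strict error handling raises on a mismatch
        except UnicodeEncodeError:
            continue
        return name
    return None


print(narrowest_encoding("plain"))      # ascii
print(narrowest_encoding("caf\u00e9"))  # iso-8859-1
print(narrowest_encoding("\u20ac"))     # utf-8
```

This only works if the encoder is guaranteed to raise on a mismatch,
which is why the documentation needs to say which behavior applies.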

-- 
Bob Kline
http://www.rksystems.com
mailto:bkline at rksystems.com


From stefan_ml at behnel.de  Wed Jun 11 06:57:34 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 11 Jun 2008 06:57:34 +0200
Subject: [XML-SIG] "encoding" argument to xml.dom.minidom.toxml()?
In-Reply-To: <08Jun10.163259pdt."58698"@synergy1.parc.xerox.com>
References: <08Jun10.121338pdt."58698"@synergy1.parc.xerox.com>
	<484ED70B.8060107@behnel.de>
	<08Jun10.140019pdt."58698"@synergy1.parc.xerox.com>
	<484EECA8.9000508@behnel.de> <484EF750.8020907@rksystems.com>
	<08Jun10.163259pdt."58698"@synergy1.parc.xerox.com>
Message-ID: <484F5B3E.9050307@behnel.de>

Hi,

Bill Janssen wrote:
>> I suspect there's a certain amount of unarticulated assumptions on both 
>> sides of this exchange.  I'm guessing that Bill might be thinking 
>> something like: "it's possible to represent any Unicode character in XML 
>> as &#<code-position-for character>"; and was hoping that the method 
>> would do just that for the non-ASCII characters if he asks for ASCII 
>> encoding.
> 
> Yep, that's what I was thinking.  I don't see any other reason to have
> that parameter there.

Have you considered that it may be there to allow other encodings than UTF-8?
Check the codecs module to see how many others there are.

Stefan

From janssen at parc.com  Wed Jun 11 18:14:36 2008
From: janssen at parc.com (Bill Janssen)
Date: Wed, 11 Jun 2008 09:14:36 PDT
Subject: [XML-SIG] "encoding" argument to xml.dom.minidom.toxml()?
In-Reply-To: <484F5B3E.9050307@behnel.de> 
References: <08Jun10.121338pdt."58698"@synergy1.parc.xerox.com>
	<484ED70B.8060107@behnel.de>
	<08Jun10.140019pdt."58698"@synergy1.parc.xerox.com>
	<484EECA8.9000508@behnel.de> <484EF750.8020907@rksystems.com>
	<08Jun10.163259pdt."58698"@synergy1.parc.xerox.com>
	<484F5B3E.9050307@behnel.de>
Message-ID: <08Jun11.091436pdt."58698"@synergy1.parc.xerox.com>

Stefan, I think we are talking past each other.  I know it's there to
allow encodings other than UTF-8, and I'm familiar with the codecs
module, and I like the parameter, in general.  The problem is that if
you ignore the documentation, which seems to know that it's broken,
and specify an encoding other than UTF-8, the generated XML sometimes
doesn't conform to that encoding.  Instead, an exception is raised
from deep inside Python, which contains no indication of what piece of
input data caused it.  And there's no need for that to happen.  XML
can fully support any output encoding for any Unicode input stream,
and it should do that.

Bill



From sap28 at kent.ac.uk  Wed Jun 11 17:50:48 2008
From: sap28 at kent.ac.uk (sap28 at kent.ac.uk)
Date: Wed, 11 Jun 2008 16:50:48 +0100 (BST)
Subject: [XML-SIG] installing PyXML for Python 2.5
Message-ID: <1472.129.12.16.180.1213199448.squirrel@webmail.cs.kent.ac.uk>

Hello!

I'm trying to install PyXML for Python 2.5 on windows XP to be able to use
the validating parser but I have been unsuccessful so far. I tried to
compile the PyXML from the any platform package but I got the following
error:
"running build_ext
error: Python was built with Visual Studio 2003;
extensions must be built with a compiler than can generate compatible
binaries.
Visual Studio 2003 was not found on this system. If you have Cygwin
installed,
you can try compiling with MingW32, by passing "-c mingw32" to setup.py."
(The Python package was added by the installation of the Haptic API H3D
Beta).
I then installed Cygwin and tried compiling using the indication provided
but either the -c command wasn't recognized or the mingw32 wasn't defined.
I've never really used Cygwin either so I don't quite know if I've
inputted the wrong command or if it just doesn't work.

I also naively tried to just copy the files to check if it could work
without compilation, but I get the following error:
    from xml.parsers import expat
  File "C:\Python25\lib\site-packages\_xmlplus\parsers\expat.py", line 4,
in <module>
    from pyexpat import *
ImportError: DLL load failed: The specified module could not be found.

I really need to have it working for my PhD work, so I would be
grateful if you could help me.

Thanks for your help,
Best regards,
Sabrina


From stefan_ml at behnel.de  Wed Jun 11 19:47:46 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 11 Jun 2008 19:47:46 +0200
Subject: [XML-SIG] installing PyXML for Python 2.5
In-Reply-To: <1472.129.12.16.180.1213199448.squirrel@webmail.cs.kent.ac.uk>
References: <1472.129.12.16.180.1213199448.squirrel@webmail.cs.kent.ac.uk>
Message-ID: <48500FC2.9000100@behnel.de>

Hi,

sap28 at kent.ac.uk wrote:
> I'm trying to install PyXML for Python 2.5 on windows XP

Sounds like a FAQ to me.


> to be able to use the validating parser

Try lxml.

http://codespeak.net/lxml/

Stefan


From jeanmarc.chourot at free.fr  Fri Jun 20 22:40:42 2008
From: jeanmarc.chourot at free.fr (jeanmarc.chourot at free.fr)
Date: Fri, 20 Jun 2008 22:40:42 +0200
Subject: [XML-SIG] elementtree and uncomplete parsing
Message-ID: <1213994442.485c15ca1aaca@imp.free.fr>


Hi all,

As a noob, I cannot find the way to make an incomplete parse of a tree.
For instance, please consider the following xml file

<node>
This text <thistag> is completely crap </thistag> because <anothertag> blabla
</anothertag>
</node>
<node>
This is another <thisnotag> node </thisnotag> with <anothertaggy> random tags
</anothertaggy>
</node>

I would like to retrieve what is between the tags <node> ...</node> into
strings, the "subelements" being considered as simple strings and not processed
by ElementTree.
In other words, this could be badly formed HTML, left unprocessed and embedded
in well-formed XML tags.

i.e. :
string1 = "This text <thistag> is completely crap </thistag> because
<anothertag> blabla </anothertag>"
string2="This is another <thisnotag> node </thisnotag> with <anothertaggy>
random tags </anothertaggy>"

Could anyone help me with this  ?
Thanks a lot

From stefan_ml at behnel.de  Sat Jun 21 07:39:17 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 21 Jun 2008 07:39:17 +0200
Subject: [XML-SIG] elementtree and uncomplete parsing
In-Reply-To: <1213994442.485c15ca1aaca@imp.free.fr>
References: <1213994442.485c15ca1aaca@imp.free.fr>
Message-ID: <485C9405.7010606@behnel.de>

Hi,

jeanmarc.chourot at free.fr wrote:
> <node>
> This text <thistag> is completely crap </thistag> because <anothertag> blabla
> </anothertag>
> </node>
> <node>
> This is another <thisnotag> node </thisnotag> with <anothertaggy> random tags
> </anothertaggy>
> </node>
> 
> I would like to retrieve what is between the tags <node> ...</node> into
> strings, the "subelements" being considered as simple string and not processed
> by ElementTree.

You are trying to make an XML parser not parse XML, that's bound to fail.


> In other words, this could be badly formed HTML  not processed embedded into
> well formed xml tags.

If you really have something like "embedded HTML", it must be escaped in your
data to be parsable. There is no way an XML parser can return what you want
without modifying your 'data' (at least losing whitespace etc.).

I think the easiest option (if you have it) is to talk to the idiots who sent
you the data and have them fix it.

Stefan

From jeanmarc.chourot at free.fr  Sat Jun 21 10:23:08 2008
From: jeanmarc.chourot at free.fr (Jean-Marc Chourot)
Date: Sat, 21 Jun 2008 10:23:08 +0200
Subject: [XML-SIG]  elementtree and uncomplete parsing
In-Reply-To: <485C9405.7010606@behnel.de>
References: <1213994442.485c15ca1aaca@imp.free.fr> <485C9405.7010606@behnel.de>
Message-ID: <1214036588.6525.9.camel@jeanmarc-laptop>


> Hi,
> 
> jeanmarc.chourot at free.fr wrote:
> > <node>
> > This text <thistag> is completely crap </thistag> because <anothertag> blabla
> > </anothertag>
> > </node>
> > <node>
> > This is another <thisnotag> node </thisnotag> with <anothertaggy> random tags
> > </anothertaggy>
> > </node>
> > 
> > I would like to retrieve what is between the tags <node> ...</node> into
> > strings, the "subelements" being considered as simple string and not processed
> > by ElementTree.
> 
> You are trying to make an XML parser not parse XML, that's bound to fail.
> 
> 
> > In other words, this could be badly formed HTML  not processed embedded into
> > well formed xml tags.
> 
> If you really have something like "embedded HTML", it must be escaped in your
> data to be parsable. There is no way an XML parser can return what you want
> without modifying your 'data' (at least losing whitespace etc.).
> 
> I think the easiest option (if you have it) is to talk to the idiots who sent
> you the data and have them fix it.
> 
> Stefan
> 
Thanks for your help,
The real problem is not about "badly formed HTML": each node will
correspond to a leaf of a wx.TreeCtrl, and the data associated with the
leaf will be the content of a wx.RichTextCtrl. When saving the whole
tree content in one file, I want to be able to get the structure of the
tree and relocate the data to each leaf, and definitely not touch the
content, which is parsed by the wx.RichTextCtrl.
I was hoping ElementTree could help with this, but maybe I am wrong and
should think of a simpler system of tags in the text.


From stefan_ml at behnel.de  Sat Jun 21 12:02:06 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 21 Jun 2008 12:02:06 +0200
Subject: [XML-SIG] elementtree and uncomplete parsing
In-Reply-To: <1214036588.6525.9.camel@jeanmarc-laptop>
References: <1213994442.485c15ca1aaca@imp.free.fr> <485C9405.7010606@behnel.de>
	<1214036588.6525.9.camel@jeanmarc-laptop>
Message-ID: <485CD19E.7020708@behnel.de>

Hi,

Jean-Marc Chourot wrote:
> The real problem is not about "badly formed HTML" : each node will
> correspond to a leaf of a wx.TreeCtrl and the data associated to the
> leaf will be the content of a wx.RichTextCtrl. When saving the whole
> tree content in one file, I want to be able to get the structure of the
> tree and relocate the data to each leaf and definitely not touch the
> content which is parsed by the wx.RichTextCtrl.

If I understand correctly, your XML-like string content comes from user input
in the RichTextCtrl. Meaning: when you copy it into the XML tree, it should
get escaped (i.e. '<' replaced by '&lt;' etc.). Then every XML parser will
read this as you expect.

ElementTree will do the escaping for you when you set the ".text" property of
a leaf node.
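
A minimal sketch of that round trip, assuming Python 3 and the
standard-library xml.etree.ElementTree (the "tree" root element and
the variable names are made up for the example):

```python
import xml.etree.ElementTree as ET

# Markup-like strings as they might come out of the text control.
contents = [
    "This text <thistag> is completely crap </thistag> because "
    "<anothertag> blabla </anothertag>",
    "This is another <thisnotag> node </thisnotag> with "
    "<anothertaggy> random tags </anothertaggy>",
]

# Saving: assigning to .text stores the string as character data, and
# the serializer escapes '<', '>' and '&' on output.
root = ET.Element("tree")
for content in contents:
    ET.SubElement(root, "node").text = content
data = ET.tostring(root, encoding="unicode")

# Loading: the parser unescapes, so each leaf gets its string back untouched.
restored = [node.text for node in ET.fromstring(data).findall("node")]
print(restored == contents)  # True
```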

Or did you mean to say that wxPython saves the broken XML for you?

Stefan


From martin at v.loewis.de  Mon Jun 23 06:54:25 2008
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Mon, 23 Jun 2008 06:54:25 +0200
Subject: [XML-SIG] "encoding" argument to xml.dom.minidom.toxml()?
In-Reply-To: <08Jun11.091436pdt."58698"@synergy1.parc.xerox.com>
References: <08Jun10.121338pdt."58698"@synergy1.parc.xerox.com>	<484ED70B.8060107@behnel.de>	<08Jun10.140019pdt."58698"@synergy1.parc.xerox.com>	<484EECA8.9000508@behnel.de>
	<484EF750.8020907@rksystems.com>	<08Jun10.163259pdt."58698"@synergy1.parc.xerox.com>	<484F5B3E.9050307@behnel.de>
	<08Jun11.091436pdt."58698"@synergy1.parc.xerox.com>
Message-ID: <485F2C81.8050003@v.loewis.de>

> Stefan, I think we are talking past each other.  I know it's there to
> allow encodings other than UTF-8, and I'm familiar with the codecs
> module, and I like the parameter, in general.  The problem is that if
> you ignore the documentation, which seems to know that it's broken,
> and specify an encoding other than UTF-8, the generated XML sometimes
> doesn't conform to that encoding.

Can you give an example? I'm unable to reproduce the behavior you
are seeing; it works just fine for me:

py> import xml.dom.minidom
py> d=xml.dom.minidom.getDOMImplementation().createDocument(None,"root",None)
py> t=d.createTextNode(u"\u20ac")
py> x=d.documentElement.appendChild(t)
py> d.toxml(encoding="iso-8859-15")
'<?xml version="1.0" encoding="iso-8859-15"?><root>\xa4</root>'

AFAICT, this is the correct byte string. Can you give an example where
toxml returns an incorrect byte string?
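
The session above is Python 2.  A rough equivalent as a script,
assuming Python 3 (where text is Unicode by default and toxml()
returns bytes when an encoding is given), would be:

```python
from xml.dom import minidom

doc = minidom.getDOMImplementation().createDocument(None, "root", None)
doc.documentElement.appendChild(doc.createTextNode("\u20ac"))

# ISO-8859-15 has a euro sign at byte 0xA4, so the target encoding can
# represent the character directly.
latin9 = doc.toxml(encoding="iso-8859-15")
print(latin9)
```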

Regards,
Martin