From uche.ogbuji at fourthought.com  Mon Aug  2 21:17:03 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Mon Aug  2 21:17:05 2004
Subject: [XML-SIG] favicon in XBEL
In-Reply-To: <410AC45B.4070504@comcast.net>
References: <LOBBJAPPIEJKBPAKDHOPIEBCCKAA.ahmad@gharbeia.org>
	<200407301527.14592.fdrake@acm.org>  <410AC45B.4070504@comcast.net>
Message-ID: <1091474222.3479.220.camel@borgia>

On Fri, 2004-07-30 at 15:57, Thomas B. Passin wrote:
> Fred L. Drake, Jr. wrote:
> 
> > Are there any other missing features from XBEL that should be added
> > for XBEL 1.2?  Two things I found when checking my archives were:
> > 
> > 1.  Specify how URLs should be encoded in XBEL.
> > 2.  Some sort of merge/include feature.
> >
> >   -Fred
> 
> Currently I merge bookmarks from a number of browsers.  I do it with
> xslt, which also handles de-duplicating to some degree.  Good merging 
> and sorting in an xbel utility would be nice.

At least a start:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/135131


> My biggest problem when working with bookmarks, and even more from sets 
> of them, was the encoding of the bookmark titles.  The web pages the 
> titles come from can have different encodings, and depending on the 
> browser, those encodings may end up in the titles, resulting in 
> inconsistent encoding.

This is clearly a bug in the browsers.  If browsers don't generate XML
in a sane manner, there really is no way to solve the resulting
problems.  I'm sure you know that, but I did have to mention this fact,
and just how big a shame it is.

Maybe we should add a para or two on the XBEL pages exhorting
implementors not to be careless with their character model.


> Well, maybe that doesn't happen so often anymore (better browsers?), but 
> I had to do some hacking on the current xbel code to get it to use 
> unicode and stop halting with encoding errors on titles.  I haven't had 
> time to post my changes yet, but maybe in a couple of weeks ...

Well, not halting can be bad if you don't know what the encodings
actually are.  Maybe the utilities would have to take some sort of
default encoding param from the user?  But I really hate to make
crutches for such insidious problems.
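
A minimal sketch of what such a user-supplied default encoding could look like, written for present-day Python 3 rather than the Python 2 of this thread. The name decode_title and its parameter are hypothetical, not part of any XBEL utility. It decodes strictly, so a wrong guess still halts instead of silently producing garbage:

```python
def decode_title(raw_title, default_encoding):
    """Turn a bookmark title into text, using a caller-supplied encoding.

    raw_title may already be text (str) or may be undecoded bytes.
    default_encoding is the user's guess; decoding is strict, so a
    wrong guess raises UnicodeDecodeError rather than being hidden.
    """
    if isinstance(raw_title, str):
        return raw_title
    return raw_title.decode(default_encoding)  # errors="strict" is the default
```

With this shape the utility never guesses on its own; the caller has to state the encoding explicitly.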


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Perspective on XML: Steady steps spell success with Google - http://www.adtmag.com/article.asp?id=9663
Use XML namespaces with care - http://www-106.ibm.com/developerworks/xml/library/x-namcar.html
Managing XML libraries - http://www.adtmag.com/article.asp?id=9160
Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090
Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html
A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/

From and-xml at doxdesk.com  Tue Aug  3 07:53:23 2004
From: and-xml at doxdesk.com (Andrew Clover)
Date: Tue Aug  3 07:53:24 2004
Subject: [XML-SIG] value error when parsing XML
In-Reply-To: <410B7277.3000609@mail.usyd.edu.au>
References: <410B7277.3000609@mail.usyd.edu.au>
Message-ID: <40EE32F9.1080809@doxdesk.com>

Ajay Brar <abra9823@mail.usyd.edu.au> wrote:

> i get a value error when parsing an xml file.

With what are you parsing the XML file?

> can someone please tell me how i can workaround this problem.

Do you really need the .dtd? If you don't need default attribute values 
or entities from the DTD external subset, you are best off using a 
simple non-validating, non-external-entity-reading parser.

Otherwise, depending on what you are using to parse the XML file, you 
may have to give it an absolute URI to tell it where the resource is 
supposed to be located, so that it can work out where, exactly, relative 
URLs are relative to - relative URIs should be relative to the XML file 
that used them, *not* your OS's current working directory.

If the relative URI given in the <!DOCTYPE> is actually wrong (ie. it 
points to a non-existent path), you'd have to use an entity resolver to 
redirect it elsewhere. (With SAX you'd use resolveEntity; with DOM3LS 
you'd use an LSResourceResolver.)
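
A minimal sketch of the SAX route, assuming present-day Python 3 and the standard xml.sax module. The document, the entity name "note" and the class names are made up for illustration. Recent Python 3 releases leave external general entities switched off by default, so the sketch enables that feature explicitly:

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Hypothetical document: the entity's system ID points at a path that does not exist.
DOCUMENT = (
    b'<!DOCTYPE um [<!ENTITY note SYSTEM "missing/note.xml">]>'
    b"<um>&note;</um>"
)
REPLACEMENT = b"<note>redirected</note>"


class RedirectingResolver(EntityResolver):
    """Answer every entity request with replacement bytes instead of the real path."""

    def __init__(self, replacement):
        self.replacement = replacement
        self.requested = []  # system IDs the parser asked for

    def resolveEntity(self, publicId, systemId):
        self.requested.append(systemId)
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(self.replacement))
        return source


class TextCollector(ContentHandler):
    """Collect all character data so the result is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_with_redirect(document, replacement):
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, True)  # off by default in recent releases
    resolver = RedirectingResolver(replacement)
    collector = TextCollector()
    parser.setEntityResolver(resolver)
    parser.setContentHandler(collector)
    parser.parse(io.BytesIO(document))
    return "".join(collector.chunks), resolver.requested
```

A real resolver would map the bad system ID to the correct file rather than to fixed bytes; the in-memory replacement only keeps the sketch self-contained.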

-- 
Andrew Clover
mailto:and@doxdesk.com
http://www.doxdesk.com/
From abra9823 at mail.usyd.edu.au  Tue Aug  3 12:53:12 2004
From: abra9823 at mail.usyd.edu.au (Ajay Brar)
Date: Tue Aug  3 12:53:21 2004
Subject: [XML-SIG] value error when parsing XML
In-Reply-To: <40EE32F9.1080809@doxdesk.com>
References: <410B7277.3000609@mail.usyd.edu.au> <40EE32F9.1080809@doxdesk.com>
Message-ID: <410F6E98.4080803@mail.usyd.edu.au>

Andrew Clover wrote:

> Ajay Brar <abra9823@mail.usyd.edu.au> wrote:
>
>> i get a value error when parsing an xml file.
>
>
> With what are you parsing the XML file?

i am using a SAX parser. i use the make_parser in xml.sax to make the 
parser. i have my own content handler
parser = make_parser()
parser.setFeature(feature_namespaces, 0)
umXML = umXMLHandler.UM_XML_Handler()

>> can someone please tell me how i can workaround this problem.
>
>
> Do you really need the .dtd? If you don't need default attribute 
> values or entities from the DTD external subset, you are best off 
> using a simple non-validating, non-external-entity-reading parser.

while i don't need the DTD immediately, in the long term i would like to 
validate the XML against the DTD.

>
> Otherwise, depending on what you are using to parse the XML file, you 
> may have to give it an absolute URI to tell it where the resource is 
> supposed to be located, so that it can work out where, exactly, 
> relative URLs are relative to - relative URIs should be relative to 
> the XML file that used them, *not* your OS's current working directory.

the script actually works if the dtd is in the same directory as the 
script. if i put it with the xml, that's when i get the error.

>
> If the relative URI given in the <!DOCTYPE> is actually wrong (ie. it 
> points to a non-existent path), you'd have to use an entity resolver 
> to redirect it elsewhere. (With SAX you'd use resolveEntity; with 
> DOM3LS you'd use an LSResourceResolver.)

would this be the correct way to specify the uri, if it is in the same 
directory as the xml file?
<!DOCTYPE um SYSTEM 'um.dtd'>

i think its something to do with the way i call the parser
parser.parse("../um_xml/um_ajay.xml")
and it seems to me that for some reason, when parsing, it resolves the 
name to ../um_xml/<name>, which in this case is um.dtd
Is that why?
i am a newbie to python, XML and XML in Python, so its hard to figure 
out what i am doing wrong.

thanks

cheers
-- 

Ajay Brar
CS Honours 2004
Smart Internet Technology Research Group

http://www.it.usyd.edu.au/~abrar1

From aconrad.tlv at magic.fr  Tue Aug  3 18:46:13 2004
From: aconrad.tlv at magic.fr (Alexandre CONRAD)
Date: Tue Aug  3 18:46:14 2004
Subject: [XML-SIG] get the abolute path for a node
Message-ID: <410FC155.2000802@magic.fr>

Hello,

in xpath, is there a way I can get the absolute path for a node ?

I would need some function that would be able to return a string looking 
like this :

function(sub_node5)

would return :
"/rootnode/node/sub_node5/"

I've been looking around, and apparently, there is a function that 
returns all the ancestors of a node. But I need this as a string path.

Any ideas ?

Best regards,
-- 
Alexandre CONRAD - TLV
Research & Development
tel : +33 1 30 80 55 05
fax : +33 1 30 56 55 06
6, rue de la plaine
78860 - SAINT NOM LA BRETECHE
FRANCE

From and-xml at doxdesk.com  Tue Aug  3 20:13:15 2004
From: and-xml at doxdesk.com (Andrew Clover)
Date: Tue Aug  3 20:12:38 2004
Subject: [XML-SIG] Re: value error when parsing XML
In-Reply-To: <410F6E98.4080803@mail.usyd.edu.au>
References: <410B7277.3000609@mail.usyd.edu.au> <40EE32F9.1080809@doxdesk.com>
	<410F6E98.4080803@mail.usyd.edu.au>
Message-ID: <410FD5BB.1080306@doxdesk.com>

Ajay Brar <abra9823@mail.usyd.edu.au> wrote:

> i am using a SAX parser.

I don't do a lot of SAX, but it looks to me like there's a bug in the 
xml.sax.saxutils InputSource which is likely to be the cause of your 
trouble. (Details to follow.)

 > i think its something to do with the way i call the parser
 > parser.parse("../um_xml/um_ajay.xml")

Yes. I would suggest passing in a URI instead:

   import os, urllib
   filename= '../um_xml/um_ajay.xml'
   uri= 'file:'+urllib.pathname2url(os.path.abspath(filename))
   parser.parse(uri)
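
For reference, a rough present-day Python 3 equivalent of the same idea, using pathlib instead of urllib.pathname2url. The helper name filename_to_uri is made up for this sketch:

```python
from pathlib import Path


def filename_to_uri(filename):
    """Return an absolute file: URI for a possibly relative filename.

    resolve() anchors the name at the current working directory once,
    up front, so later relative lookups (such as the DTD named in the
    DOCTYPE) can be resolved against the document's own location.
    """
    return Path(filename).resolve().as_uri()


uri = filename_to_uri("../um_xml/um_ajay.xml")
# parser.parse(uri) would then receive an unambiguous absolute URI
```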

-- 
Andrew Clover
mailto:and@doxdesk.com
http://www.doxdesk.com/
From and-xml at doxdesk.com  Tue Aug  3 20:53:37 2004
From: and-xml at doxdesk.com (Andrew Clover)
Date: Tue Aug  3 20:53:00 2004
Subject: [XML-SIG] saxutils bug (was: value error when parsing XML)
Message-ID: <410FDF31.1070809@doxdesk.com>

 From Ajay's report I've been looking at problems in the saxutils 
function prepare_input_source:

   def prepare_input_source(source, base = ""):
     [...]
     sysid = source.getSystemId()
     if os.path.isfile(sysid):
       basehead = os.path.split(os.path.normpath(base))[0]
       source.setSystemId(os.path.join(basehead, sysid))
       f = open(sysid, "rb")

This allows a systemId to be either a filename or a URI, and tries to 
guess when it's a filename by sniffing to see if a file with the given 
name exists.

However the filename-sniffing is done *before* the source's systemId is 
resolved relative to its baseURI, and the non-resolved systemId is used 
to open the file, thus ignoring the baseURI passed in completely and 
calculating any relative URIs relative to the current working directory 
instead of the enclosing baseURI.

For this reason, a document in a different directory to the CWD may have 
trouble using external entities and the external DTD subset. If the 
systemId is relative and does not exist relative to the CWD instead of 
the baseURI, the function will assume it is a URI and attempt to urlopen 
it, resulting in the ValueError reported by Ajay.

This is the case when a filename is passed in to prepare_input_source 
(and hence, to the original parse() call), but it's also the case for 
file streams due to this line earlier in the function:

   if hasattr(f, "name"):
     source.setSystemId(f.name)

f.name is the filename the stream was opened with, which can also be 
relative. I believe it would be more appropriate to abspath the filename 
(not normpath as, I believe erroneously, used above) and convert it to 
an unambiguous file: URI.

However, I believe the approach of detecting the difference between URI 
and filename by file-sniffing on every entity access to be broken in 
general. For example a document at http://www.example.com/xml/foo.xml 
that referenced the system ID 'foo.ent' would get the wrong external 
entity if there just happened to be a 'foo.ent' in the current working 
directory.

I would prefer to keep all InputSource systemIds as URIs; even when a 
filename was originally passed in it should be converted to a URI. 
Otherwise we cannot reliably deal with relative systemIds.
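
To illustrate what "keep everything as URIs" buys us, here is a small sketch in present-day Python 3 (not a patch to saxutils; the function name resolve_system_id and the file path are made up). Once both the base and the systemId are URIs, resolution is a pure string operation and never looks at the current working directory:

```python
from urllib.parse import urljoin


def resolve_system_id(base_uri, system_id):
    """Resolve a (possibly relative) systemId against the URI of the
    entity that referenced it, without touching the filesystem."""
    return urljoin(base_uri, system_id)


# The remote document gets the remote entity, whatever happens to be in the CWD.
remote = resolve_system_id("http://www.example.com/xml/foo.xml", "foo.ent")

# A local document given as a file: URI finds the DTD next to itself.
local = resolve_system_id("file:///home/ajay/um_xml/um_ajay.xml", "um.dtd")
```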

However as I have not played much with SAX I'm hesitant to drop patches 
to sourceforge just yet. Discussion of any potential problems with this 
approach, and any better ways of detecting the difference between a 
filename and a URI, would be appreciated.

cheers,

-- 
Andrew Clover
mailto:and@doxdesk.com
http://www.doxdesk.com/
From fdrake at acm.org  Wed Aug  4 17:42:34 2004
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Wed Aug  4 17:42:41 2004
Subject: [XML-SIG] Expat 1.95.8 has been released
Message-ID: <200408041142.34122.fdrake@acm.org>

For those who don't read the expat-discuss list, this is the announcement for 
Expat 1.95.8; it went to that list on July 24.  I've updated the Expat 
included in Python 2.4, but haven't updated PyXML yet.  The upcoming Python 
2.4a2 release will include the new Expat.

Expat is a fast XML parser written in C based on code written by XML and SGML 
guru James Clark.  A new version, Expat 1.95.8, has been released by the 
current maintainers of the package, fixing still more minor problems caught 
by picky compilers and improving the package's cross-platform support. One 
rather nice new feature has been introduced as well.

Changes include:

1. Major new feature: suspend/resume. Handlers can now request that a parse be 
suspended for later resumption or aborted altogether. See "Temporarily 
Stopping Parsing" in the documentation for more details.

2. Some mostly minor bug fixes, but compilation should no longer generate 
warnings on most platforms. SF issues include: 827319, 840173, 846309, 
888329, 896188, 923913, 928113, 961698, 985192.

See the Expat home page, http://www.libexpat.org/, for more information on the 
changes in this release and on Expat in general.


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>


From n.youngman at ntlworld.com  Thu Aug  5 11:45:11 2004
From: n.youngman at ntlworld.com (n.youngman@ntlworld.com)
Date: Thu Aug  5 11:47:28 2004
Subject: [XML-SIG] XML Unicode and UTF-8
Message-ID: <20040805094651.UGJK7107.mta01-svc.ntlworld.com@[10.137.100.68]>

I'm trying to create an XML document, containing mostly ASCII, but potentially containing some unicode characters. I want to convert this all to UTF-8, but no matter what I try, I get an ASCII codec error.

I have tried using codecs.open( filename, "w", "utf-8" )

I have tried converting the unicode inline with string.encode( "utf-8").

I have tried various combinations of the above.

I have tried UTF-7

I always get an ASCII codec error.

My environment is Python 2.3.4 built on redHat 7.3

What's the correct approach to this problem? 

Has anyone done this successfully?


-----------------------------------------
Email provided by http://www.ntlhome.com/


From martin at v.loewis.de  Thu Aug  5 12:41:59 2004
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Thu Aug  5 12:41:53 2004
Subject: [XML-SIG] XML Unicode and UTF-8
In-Reply-To: <20040805094651.UGJK7107.mta01-svc.ntlworld.com@[10.137.100.68]>
References: <20040805094651.UGJK7107.mta01-svc.ntlworld.com@[10.137.100.68]>
Message-ID: <41120EF7.8000804@v.loewis.de>

n.youngman@ntlworld.com wrote:

> I'm trying to create an XML document, containing mostly ASCII, but
> potentially containing some unicode characters. I want to convert
> this all to UTF-8, but no matter what I try, I get an ASCII codec
> error.

It would be good if you had shown what precisely you have tried.

> I have tried using codecs.open( filename, "w", "utf-8" )

This works fine for me.

> I have tried converting the unicode inline with string.encode(
> "utf-8").

This also.

> I have tried various combinations of the above.

This is not a good idea.

> I have tried UTF-7

This is worse.

> What's the correct approach to this problem?

State all the information that you have, preferably in the form:
1. this is what I did
2. this is what happened
3. this is what I expected to happen instead.

Regards,
Martin

From martin at v.loewis.de  Thu Aug  5 12:44:00 2004
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Thu Aug  5 12:43:57 2004
Subject: [XML-SIG] Expat 1.95.8 has been released
In-Reply-To: <200408041142.34122.fdrake@acm.org>
References: <200408041142.34122.fdrake@acm.org>
Message-ID: <41120F70.9090204@v.loewis.de>

Fred L. Drake, Jr. wrote:
> For those who don't read the expat-discuss list, this is the announcement for 
> Expat 1.95.8; it went to that list on July 24.  I've updated the Expat 
> included in Python 2.4, but haven't updated PyXML yet.  The upcoming Python 
> 2.4a2 release will include the new Expat.

I'd like to release PyXML at the end of next week. I'd be happy to 
synchronize PyXML with Python - unless you do it faster.

Regards,
Martin
From martin at v.loewis.de  Thu Aug  5 12:49:51 2004
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Thu Aug  5 12:49:46 2004
Subject: [XML-SIG] get the abolute path for a node
In-Reply-To: <410FC155.2000802@magic.fr>
References: <410FC155.2000802@magic.fr>
Message-ID: <411210CF.5090300@v.loewis.de>

Alexandre CONRAD wrote:
> I've been looking around, and apparently, there is a function that 
> returns all the ancestors of a node. But I need this as a string path.
> 
> Any ideas ?

There is no such function, and it would be difficult to define one.
For example, /rootnode/node/sub_node5 might refer to a different node,
if node has multiple children with a name of sub_node5. So one could
try to find a better-matching string, such as /rootnode/node/sub_node5[3].

Or, such a function might generate something like 
/following::node()[1564], which is probably not what you want, but
would match what you have requested.
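
A sketch of the first variant, with positional predicates, for xml.dom.minidom under present-day Python 3. The function name node_path is made up, and the sketch ignores namespaces, attributes and text nodes:

```python
from xml.dom import Node, minidom


def node_path(node):
    """Build a path such as /rootnode/node/sub_node5[3] for an element.

    A positional predicate is added only when the parent has several
    child elements with the same tag name.
    """
    steps = []
    while node is not None and node.nodeType == Node.ELEMENT_NODE:
        same_name = [
            sibling
            for sibling in node.parentNode.childNodes
            if sibling.nodeType == Node.ELEMENT_NODE
            and sibling.tagName == node.tagName
        ]
        if len(same_name) > 1:
            # XPath positions are 1-based
            position = next(
                i for i, sibling in enumerate(same_name, 1) if sibling is node
            )
            steps.append("%s[%d]" % (node.tagName, position))
        else:
            steps.append(node.tagName)
        node = node.parentNode
    return "/" + "/".join(reversed(steps))


doc = minidom.parseString(
    "<rootnode><node><sub_node5/><sub_node5/><sub_node5/></node></rootnode>"
)
third = doc.getElementsByTagName("sub_node5")[2]
path = node_path(third)
```

The path is only meaningful for the tree as it stands; inserting or removing siblings changes the positions.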

Regards,
Martin
From n.youngman at ntlworld.com  Thu Aug  5 13:03:17 2004
From: n.youngman at ntlworld.com (n.youngman@ntlworld.com)
Date: Thu Aug  5 13:05:32 2004
Subject: [XML-SIG] XML Unicode and UTF-8
Message-ID: <20040805110455.ZZZP7107.mta01-svc.ntlworld.com@[10.137.100.68]>


> 
> From: "Martin v. Löwis" <martin@v.loewis.de>
> Date: 2004/08/05 Thu AM 10:41:59 GMT
> To: n.youngman@ntlworld.com
> CC: xml-sig@python.org
> Subject: Re: [XML-SIG] XML Unicode and UTF-8

<SNIP>

> State all the information that you have, preferably in the form:
> 1. this is what I did
> 2. this is what happened
> 3. this is what I expected to happen instead.

Well, I was trying to state the problem and not impose my own preconceptions of how it should be done, but if you want to go straight into debugging that's fine with me.

First Pass:

                segment_tag.appendChild( charset_tag )
                unicode_tag = doc.createElement( 'unicode' )
                unicode_tag.appendChild( doc.createTextNode( segment[0] ) )
                segment_tag.appendChild( unicode_tag )

Inserts binary data into the segment/unicode tag

Saving with 

    XMLFILE = open( filename, "w" )

    xml.documentElement.writexml( XMLFILE, indent="", addindent="", newl="")

    XMLFILE.close()

Leaves binary data in the document. I have assumed that this was raw Unicode; maybe that's a flawed assumption?

Second Pass:

Save with
    XMLFILE = open( filename, "w" )
    XMLFILE.write( xml.documentElement.toxml( "utf-8" ) )
    XMLFILE.close()

results in:

Traceback (most recent call last):
  File "./storemail.py", line 347, in ?
    save_message( message, raw_message, savedir + "/" + filename + ".xml" )
  File "./storemail.py", line 135, in save_message
    XMLFILE.write( xml.documentElement.toxml( "utf-8" ) )
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 48, in toxml
    return self.toprettyxml("", "", encoding)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 62, in toprettyxml
    self.writexml(writer, "", indent, newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 1039, in writexml
    _write_data(writer, "%s%s%s"%(indent, self.data, newl))
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 304, in _write_data
    writer.write(data)
  File "/usr/local/lib/python2.3/codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 0: ordinal not in range(128)

I hoped this would convert everything to UTF-8 and save it . The appearance of an ASCII codec was a complete surprise to me.

3rd pass:

    XMLFILE = codecs.open( filename, "w", "utf-8" )
    xml.documentElement.writexml( XMLFILE, indent="", addindent="", newl="")
    XMLFILE.close()

produces

Traceback (most recent call last):
  File "./storemail.py", line 347, in ?
    save_message( message, raw_message, savedir + "/" + filename + ".xml" )
  File "./storemail.py", line 137, in save_message
    xml.documentElement.writexml( XMLFILE, indent="", addindent="", newl="")
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 820, in writexml
    node.writexml(writer,indent+addindent,addindent,newl)
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 1039, in writexml
    _write_data(writer, "%s%s%s"%(indent, self.data, newl))
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 304, in _write_data
    writer.write(data)
  File "/usr/local/lib/python2.3/codecs.py", line 400, in write
    return self.writer.write(data)
  File "/usr/local/lib/python2.3/codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 0: ordinal not in range(128)

I hoped this would convert everything to UTF-8 and save it . The appearance of an ASCII codec was a complete surprise to me.

I won't bore you with other combinations, which I didn't expect to work. They didn't.

Neil Youngman




From n.youngman at ntlworld.com  Thu Aug  5 13:14:57 2004
From: n.youngman at ntlworld.com (n.youngman@ntlworld.com)
Date: Thu Aug  5 13:17:12 2004
Subject: [XML-SIG] XML Unicode and UTF-8
Message-ID: <20040805111635.XRQ7107.mta01-svc.ntlworld.com@[10.137.100.68]>


> 
> From: "Martin v. Löwis" <martin@v.loewis.de>
> Date: 2004/08/05 Thu AM 10:41:59 GMT
> To: n.youngman@ntlworld.com
> CC: xml-sig@python.org
> Subject: Re: [XML-SIG] XML Unicode and UTF-8

<SNIP>

> State all the information that you have, preferably in the form:
> 1. this is what I did
> 2. this is what happened
> 3. this is what I expected to happen instead.
> 
> Regards,
> Martin

I missed out pass 4:

Create the node with

   unicode_tag.appendChild( doc.createTextNode( segment[0].encode( "utf-8") ) )

and print with 

    XMLFILE = open( filename, "w" )
    xml.documentElement.writexml( XMLFILE, indent="", addindent="", newl="")
    XMLFILE.close()

Produces the error
Traceback (most recent call last):
  File "./storemail.py", line 347, in ?
    save_message( message, raw_message, savedir + "/" + filename + ".xml" )
  File "./storemail.py", line 130, in save_message
    xml = message_to_xml( message, raw_message )
  File "./storemail.py", line 179, in message_to_xml
    entity_tag = entity_to_xml( entity, doc )
  File "./storemail.py", line 215, in entity_to_xml
    unicode_tag.appendChild( doc.createTextNode( segment[0].encode( "utf-8") ) )
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 0: ordinal not in range(128)




From martin at v.loewis.de  Thu Aug  5 13:35:18 2004
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Thu Aug  5 13:35:13 2004
Subject: [XML-SIG] XML Unicode and UTF-8
In-Reply-To: <20040805110455.ZZZP7107.mta01-svc.ntlworld.com@[10.137.100.68]>
References: <20040805110455.ZZZP7107.mta01-svc.ntlworld.com@[10.137.100.68]>
Message-ID: <41121B76.6090603@v.loewis.de>

n.youngman@ntlworld.com wrote:
> First Pass:
> 
> segment_tag.appendChild( charset_tag )
> unicode_tag = doc.createElement( 'unicode' )
> unicode_tag.appendChild( doc.createTextNode( segment[0] ) )
> segment_tag.appendChild( unicode_tag )
> 
> Inserts binary data into the segment/unicode tag

What is segment[0] here? In XML, there is no notion of "binary data".

> Leaves binary data in the document. I have assumed that this was raw
> Unicode; maybe that's a flawed assumption?

There is nothing that could be called "raw Unicode", either. Again,
XML does not support binary data.

> data, consumed = self.encode(object, self.errors)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 0: ordinal not in range(128)
> 
> I hoped this would convert everything to UTF-8 and save it . The
> appearance of an ASCII codec was a complete surprise to me.

You can only encode Unicode objects. Since you apparently have put
a byte string object (<type 'str'>) into the DOM tree, it needs to
convert the byte string into a Unicode string first, before it
can encode the Unicode string as UTF-8. For that, it uses the system
default encoding, which is us-ascii.

Now, the byte string contains the byte '\xee', which is not supported
in ASCII.

> 3rd pass:
> 
> XMLFILE = codecs.open( filename, "w", "utf-8" )
> xml.documentElement.writexml( XMLFILE, indent="", addindent="", newl="")
> XMLFILE.close()
> 
> produces
> 
> Traceback (most recent call last): File "./storemail.py", line 347,

The problem is that your DOM tree is already ill-formed. You should
not put binary data into a DOM tree.

 > I missed out pass 4:
 >
 > Create the node with
 >
 >   unicode_tag.appendChild( doc.createTextNode(
 >       segment[0].encode( "utf-8") ) )

Same issue: Apparently, segment[0] is a byte string, but you can only
encode Unicode strings. *If* segment[0] is a UTF-8 encoded byte string,
you should write

    segment[0].decode( "utf-8")
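
Put together, the order of operations is: decode the bytes first, build the tree from text, and encode only when serializing. A sketch in present-day Python 3, where the two types are called bytes and str rather than str and unicode. The function name build_document is made up, and the charset is an assumption the caller has to supply:

```python
from xml.dom import minidom


def build_document(raw_bytes, charset):
    """Decode raw_bytes, store the text in a DOM tree, serialize as UTF-8."""
    text = raw_bytes.decode(charset)  # bytes -> text, before touching the DOM
    doc = minidom.Document()
    element = doc.createElement("unicode")
    element.appendChild(doc.createTextNode(text))
    doc.appendChild(element)
    # Encoding happens once, at output time; the result is a bytes object.
    return doc.toxml(encoding="utf-8")


# The byte 0xee from the traceback, assuming it was ISO-8859-1 ("i" with circumflex).
output = build_document(b"\xee", "iso-8859-1")
```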

Regards,
Martin
From n.youngman at ntlworld.com  Thu Aug  5 14:22:43 2004
From: n.youngman at ntlworld.com (n.youngman@ntlworld.com)
Date: Thu Aug  5 14:24:58 2004
Subject: [XML-SIG] XML Unicode and UTF-8
Message-ID: <20040805122422.FVRU7107.mta01-svc.ntlworld.com@[10.137.100.68]>

> From: "Martin v. Löwis" <martin@v.loewis.de>
> Date: 2004/08/05 Thu AM 11:35:18 GMT
> To: n.youngman@ntlworld.com
> CC: xml-sig@python.org
> Subject: Re: [XML-SIG] XML Unicode and UTF-8
> 
> n.youngman@ntlworld.com wrote:
> > First Pass:
> > 
> > segment_tag.appendChild( charset_tag ) unicode_tag =
> > doc.createElement( 'unicode' ) unicode_tag.appendChild(
> > doc.createTextNode( segment[0] ) ) segment_tag.appendChild(
> > unicode_tag )
> > 
> > Inserts binary data into the segment/unicode tag
> 
> What is segment[0] here? In XML, there is no notion of "binary data".

Sorry, I left out a key point. segment[0] is the decoded part of the output from email.Header.decode_header(). I believed this was a Unicode string, but on checking the documentation it doesn't actually say that, so I guess at least part of the problem is that I'm getting some sort of binary data that I thought was Unicode but isn't.

> > Leaves binary data in the document. I have assumed that this was raw
> > Unicode, may be that's a flawed assumption?
> 
> There is nothing that could be called "raw Unicode", either. Again,
> XML does not support binary data.

XML doesn't, Python does. If I ask it to print without encoding it, I don't know whether it's passed through unchanged. Raw Unicode seems to me like a reasonable term for the data in a unicode string.

> > consumed = self.encode(object, self.errors) UnicodeDecodeError:
> > 'ascii' codec can't decode byte 0xee in position 0: ordinal not in
> > range(128)
> > 
> > I hoped this would convert everything to UTF-8 and save it . The
> > appearance of an ASCII codec was a complete surprise to me.
> 
> You can only encode Unicode objects. Since you apparently have put
> a byte string object (<type 'str'>) into the DOM tree, it needs to
> convert the byte string into a Unicode string first, before it
> can encode the Unicode string as UTF-8. For that, it uses the system
> default encoding, which is us-ascii.
> 
> Now, the byte string contains the byte '\xee', which is not supported
> in ASCII.

OK. That kind of makes sense, but I now have to figure out what is in the byte string and how to transform it to UTF-8. I guess that it's actually raw data in the character set given by the other part of the pair. Assuming it's a string in koi8-r, I have to get a codec that will transform koi8-r to UTF-8, probably via Unicode.

OK. I read the opaque documentation^W^W fine manual for a while, then googled for a while, and finally decided to just hack about with what I had.

I now have

    charset_tag.appendChild( doc.createTextNode( segment[1] ) )
    unicode = segment[0].decode( segment[1] ).encode( "utf-8")
    unicode_tag = doc.createElement( 'unicode' )
    unicode_tag.appendChild( doc.createTextNode( unicode ) )

This appears to be working, or at least it doesn't generate any errors.

Martin

You have neatly pinpointed where I was confused. Your assistance is much appreciated.

Many Thanks

Neil Youngman




From xmlsig at codeweld.com  Thu Aug  5 14:51:09 2004
From: xmlsig at codeweld.com (xmlsig@codeweld.com)
Date: Thu Aug  5 14:51:12 2004
Subject: [XML-SIG] get the abolute path for a node
In-Reply-To: <411210CF.5090300@v.loewis.de>
References: <410FC155.2000802@magic.fr> <411210CF.5090300@v.loewis.de>
Message-ID: <1091710269.41122d3db3cec@webmail.codeweld.com>

> Alexandre CONRAD wrote:
> > I've been looking around, and apparently, there is a function that
> > returns all the ascensor of a node. But I need this as a string path.
> >
> > Any ideas ?
>
> There is no such function, and it would be difficult to define one.
> For example, /rootnode/node/sub_node5 might refer to a different node,
> if node has multiple children with a name of sub_node5. So one could
> try to find a better-matching string, such as /rootnode/node/sub_node5[3].
>
> Or, such a function might generate something like
> /following::node()[1564], which is probably not what you want, but
> would match what you have requested.
>
> Regards,
> Martin
Does this help?

def abs_path( node ):
    successors = 1
    parent = node.previousSibling
    while parent:
        if parent.nodeName == node.nodeName: successors += 1
        parent = parent.previousSibling
    name = node.nodeName == '#text' and 'text()' or node.nodeName
    path = successors>1 and '/%s[%s]'%(name,successors) or '/%s'%name
    if node.parentNode and node.parentNode.nodeName != '#document':
        return abs_path( node.parentNode )+path
    return path

Kind Regards
Florian


From paul.boddie at ementor.no  Thu Aug  5 15:26:34 2004
From: paul.boddie at ementor.no (Paul Boddie)
Date: Thu Aug  5 15:26:42 2004
Subject: [XML-SIG] XML Unicode and UTF-8
Message-ID: <FD72AF7813F1294C95279EC6D9784A2F01571152@100NOOSLMSG004.common.alpharoot.net>

n.youngman@ntlworld.com wrote:
>
> OK. That kind of makes sense, but I now have to figure out what is in
> the byte string and how to transform it to UTF-8. I guess that it's
> actually raw data in the character set given by the other part of the
> pair. Assuming it's a string in koi8-r, I have to get a codec that witll
> transform koi8-r to UTF-8, probably via unicode.

I've only been following this thread in a vague way, but the easiest way
to approach this problem and many others that you might have with
character encodings is to convert input data to Unicode objects as soon
as possible. Note that there's a distinction between Unicode (which you
can think of as a scheme where any character value can be stored and
addressed) and UTF-8 (which is a way of serialising most of those
character values in a byte stream). When you're converting to Unicode
you aren't converting to UTF-8 or any other such representation - you're
actually putting the data in Python Unicode objects. Meanwhile, UTF-8 is
a side issue which you only need to think about when you're producing
textual output for other systems to process - you should be able to keep
UTF-8 data out of your program.

> OK. I read the opaque documentation^W^W fine manual for a while, then
> googled for a while, and finally decided to just hack about with what
> I had.
>
> I now have
>
>     charset_tag.appendChild( doc.createTextNode( segment[1] ) )
>     unicode = segment[0].decode( segment[1] ).encode( "utf-8")

This actually produces a byte (normal Python) string containing a UTF-8
representation of the text. This is not the same as having that text in
a Unicode object, which is the most useful form to have it in. Consider
checking the length of the text - you won't necessarily get the true
number of characters. (Moreover, you're trampling on the unicode
function here.)

Do this instead:

      utext = segment[0].decode( segment[1] )

>     unicode_tag = doc.createElement( 'unicode' )
>     unicode_tag.appendChild( doc.createTextNode( unicode ) )

And this:

      unicode_tag.appendChild( doc.createTextNode( utext ) )

When you need to serialise this, the serialiser should then be able to
choose a suitable character encoding (eg. UTF-8) without running into
the problems you were experiencing.
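
Put together, the approach might look like the sketch below. It is written
for Python 3 with xml.dom.minidom (both assumptions; the original code uses a
different Python version and possibly another DOM), and the helper name
build_segment and the sample text are made up for illustration.

```python
from xml.dom.minidom import getDOMImplementation

def build_segment(raw, charset):
    """Return a small document holding the decoded text of one header segment."""
    # Convert bytes to Unicode text as early as possible.
    utext = raw.decode(charset)
    doc = getDOMImplementation().createDocument(None, "segment", None)
    unicode_tag = doc.createElement("unicode")
    unicode_tag.appendChild(doc.createTextNode(utext))
    doc.documentElement.appendChild(unicode_tag)
    return doc

# Sample input: a Russian word (written with escapes), encoded as koi8-r.
sample = "\u041f\u0440\u0438\u0432\u0435\u0442"
doc = build_segment(sample.encode("koi8-r"), "koi8-r")

# The character encoding is chosen only at serialisation time.
xml_bytes = doc.toxml(encoding="utf-8")
```

Note that len() of the decoded text counts characters, while len() of the
UTF-8 bytes counts bytes, which is the difference mentioned above.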

Paul

From martin at v.loewis.de  Thu Aug  5 15:30:48 2004
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Thu Aug  5 15:30:43 2004
Subject: [XML-SIG] XML Unicode and UTF-8
In-Reply-To: <20040805122422.FVRU7107.mta01-svc.ntlworld.com@[10.137.100.68]>
References: <20040805122422.FVRU7107.mta01-svc.ntlworld.com@[10.137.100.68]>
Message-ID: <41123688.5000600@v.loewis.de>

n.youngman@ntlworld.com wrote:
> Sorry, I missed a key point out. Segment[0] is the decoded part of
> the output from email.Header.decode_header(). I believed this was a
> unicode string, but checking back in the documentation it doesn't
> actually say that, so I guess at least part of the problem is I'm
> getting some sort of binary data, which I thought was Unicode, but
> isn't.

Indeed. decode_header gives you a list of (byte, encoding) pairs
precisely because it does not attempt to decode them. In turn, it
does not try to decode them because Python might not have a codec
for some of the encodings. Generally, you would do

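from email import Header

def u_decode_header(header):
   result = []
   for h, enc in Header.decode_header(header):
       # enc is None for parts that were not encoded; treat those as ASCII
       result.append(h.decode(enc or 'us-ascii'))
   return u"".join(result)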

which will raise a LookupError if there is an unsupported encoding.
As you are going to put the header into an XML document, you really
have little choice what to do in that case - if giving up is not
acceptable,

      try:
        result.append(h.decode(enc))
      except LookupError:
        result.append(h.decode('us-ascii', 'replace'))

might be your next best choice: this will assume that any encoding
is an ASCII superset, and replace all non-ASCII bytes with the Unicode
replacement character (U+FFFD).

All that decode_header does is decode the transfer encoding (i.e.
Q or B).
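
The two fragments above can be combined into one function. The version below
is a sketch for Python 3 (an assumption; there the module is spelled
email.header, and decode_header may hand back str rather than bytes for a
header without encoded words). Catching UnicodeDecodeError as well is an
addition of this sketch, not something the fragments above do.

```python
from email.header import decode_header

def u_decode_header(header):
    """Decode a raw header value into one Unicode string."""
    parts = []
    for chunk, charset in decode_header(header):
        if isinstance(chunk, str):
            # Header had no encoded words; nothing to decode.
            parts.append(chunk)
            continue
        try:
            # charset is None for unencoded parts; treat those as ASCII.
            parts.append(chunk.decode(charset or "us-ascii"))
        except (LookupError, UnicodeDecodeError):
            # Unknown codec or bad bytes: keep the ASCII part and turn
            # everything else into U+FFFD replacement characters.
            parts.append(chunk.decode("us-ascii", "replace"))
    return "".join(parts)

subject = u_decode_header("=?koi8-r?Q?=F0=D2=C9=D7=C5=D4?=")
```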

>>> Leaves binary data in the document. I have assumed that this was
>>> raw Unicode, may be that's a flawed assumption?
[...]
> XML doesn't, Python does. If I ask it to print without encoding it, I
> don't know whether it's passed through unchanged. Raw Unicode seems
> to me like a reasonable term for the data in a unicode string.

Ah, that. Don't worry about the internal representation of a Unicode
string. It may use 2 or 4 bytes per character, and be big or little endian. You
are never going to see that directly, as there is *always* an encoding
going on to convert the Unicode object into a byte string. Of course,
you could create a buffer object to really find out, but that should
not be done.

> You have neatly pinpointed where I was confused. Your assistance is
> much appreciated.

You are welcome!

Martin
From fdrake at acm.org  Thu Aug  5 15:52:31 2004
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Thu Aug  5 15:52:41 2004
Subject: [XML-SIG] Expat 1.95.8 has been released
In-Reply-To: <41120F70.9090204@v.loewis.de>
References: <200408041142.34122.fdrake@acm.org> <41120F70.9090204@v.loewis.de>
Message-ID: <200408050952.31595.fdrake@acm.org>

On Thursday 05 August 2004 06:44 am, Martin v. Löwis wrote:
 > I'd like to release PyXML at the end of next week. I'd be happy to
 > synchronize PyXML with Python - unless you do it faster.

Sounds like a good plan.  It's not ready to sync yet; some of the changes to 
Expat will allow more efficient exiting of the parse when exceptions occur, 
but I've not yet made the changes to pyexpat to make that happen.

I'd also like to expose the suspend/resume capability we've added to the 
parser.


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>

From aconrad.tlv at magic.fr  Thu Aug  5 16:22:16 2004
From: aconrad.tlv at magic.fr (Alexandre CONRAD)
Date: Thu Aug  5 16:22:18 2004
Subject: [XML-SIG] get the abolute path for a node
In-Reply-To: <1091710269.41122d3db3cec@webmail.codeweld.com>
References: <410FC155.2000802@magic.fr> <411210CF.5090300@v.loewis.de>
	<1091710269.41122d3db3cec@webmail.codeweld.com>
Message-ID: <41124298.6090705@magic.fr>

> Does this help?
> 
> def abs_path( node ):
>     successors = 1
>     parent = node.previousSibling
>     while parent:
>         if parent.nodeName == node.nodeName: successors += 1
>         parent = parent.previousSibling
>     name = node.nodeName == '#text' and 'text()' or node.nodeName
>     path = successors>1 and '/%s[%s]'%(name,successors) or '/%s'%name
>     if node.parentNode and node.parentNode.nodeName != '#document':
>         return abs_path( node.parentNode )+path
>     return path


Because I always strip out spaces in XML documents, and because I want 
to show the 1st node with node[1], I changed your code so:

- name = node.nodeName == '#text' and 'text()' or node.nodeName
- path = successors>1 and '/%s[%s]'%(name,successors) or '/%s'%name
+ path = '/%s[%s]' % (node.nodeName, successors)


This function is pretty neat. But still, there is 1 more little thing 
that I'm having a hard time figuring out how to fix. I keep getting 
"/playlist[2]" as the root node. I can't have 2 root nodes anyway...

<?xml version='1.0' encoding='UTF-8'?>
<playlist>
   <group> <-- shows: /playlist[2]/group[1]
     <video>foo.mpg</video> <-- shows: /playlist[2]/group[1]/video[1]
     <video>bar.mpg</video> <-- shows: /playlist[2]/group[1]/video[2]
   </group>
</playlist>

And this looping code inside the function makes me lose track of what's
going on every time. Well done though.

Best regards,
-- 
Alexandre CONRAD - TLV
Research & Development
tel : +33 1 30 80 55 05
fax : +33 1 30 56 55 06
6, rue de la plaine
78860 - SAINT NOM LA BRETECHE
FRANCE

From martin at v.loewis.de  Thu Aug  5 16:30:49 2004
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Thu Aug  5 16:30:46 2004
Subject: [XML-SIG] Expat 1.95.8 has been released
In-Reply-To: <200408050952.31595.fdrake@acm.org>
References: <200408041142.34122.fdrake@acm.org> <41120F70.9090204@v.loewis.de>
	<200408050952.31595.fdrake@acm.org>
Message-ID: <41124499.3080200@v.loewis.de>

Fred L. Drake, Jr. wrote:
> Sounds like a good plan.  It's not ready to sync yet; some of the changes to 
> Expat will allow more efficient exiting of the parse when exceptions occur, 
> but I've not yet made the changes to pyexpat to make that happen.

I'm very much in favour of many small sync steps, instead of a single 
large one - the time needed to synchronise them grows with the number
of changes (at least the way I do it normally, change by change). So
I'll see what I can do.

Regards,
Martin
From fdrake at acm.org  Thu Aug  5 16:57:32 2004
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Thu Aug  5 16:57:42 2004
Subject: [XML-SIG] Expat 1.95.8 has been released
In-Reply-To: <41124499.3080200@v.loewis.de>
References: <200408041142.34122.fdrake@acm.org>
	<200408050952.31595.fdrake@acm.org> <41124499.3080200@v.loewis.de>
Message-ID: <200408051057.32432.fdrake@acm.org>

On Thursday 05 August 2004 10:30 am, Martin v. Löwis wrote:
 > I'm very much in favour of many small sync steps, instead of a single
 > large one - the time needed to synchronise them grows with the number
 > of changes (atleast the way I do it normally, change by change). So

Ok, if you want to use small steps, then go ahead and pick up my last two 
changes: 

- Update the Expat sources to Expat 1.95.8
- Expose additional error constants in pyexpat


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>

From xmlsig at codeweld.com  Thu Aug  5 17:29:38 2004
From: xmlsig at codeweld.com (xmlsig@codeweld.com)
Date: Thu Aug  5 17:29:40 2004
Subject: [XML-SIG] get the abolute path for a node
In-Reply-To: <41124298.6090705@magic.fr>
References: <410FC155.2000802@magic.fr> <411210CF.5090300@v.loewis.de>
	<1091710269.41122d3db3cec@webmail.codeweld.com>
	<41124298.6090705@magic.fr>
Message-ID: <1091719778.41125262b3d17@webmail.codeweld.com>

Quoting Alexandre CONRAD <aconrad.tlv@magic.fr>:

> > Does this help?
> >
> > def abs_path( node ):
> >     successors = 1
> >     parent = node.previousSibling
> >     while parent:
> >         if parent.nodeName == node.nodeName: successors += 1
> >         parent = parent.previousSibling
> >     name = node.nodeName == '#text' and 'text()' or node.nodeName
> >     path = successors>1 and '/%s[%s]'%(name,successors) or '/%s'%name
> >     if node.parentNode and node.parentNode.nodeName != '#document':
> >         return abs_path( node.parentNode )+path
> >     return path
>
>
> Because I always strip out spaces in XML documents, and because I want
> to show the 1st node with node[1], I changed your code so:
>
> - name = node.nodeName == '#text' and 'text()' or node.nodeName
> - path = successors>1 and '/%s[%s]'%(name,successors) or '/%s'%name
> + path = '/%s[%s]' % (node.nodeName, successors)
>
>
> This is function is pretty neat. But still, there is 1 more little thing
> that I'm having a hard time figuring out how to fix. I keep getting
> "/playlist[2]" as the root node. I can't have 2 root nodes anyway...
>
> <?xml version='1.0' encoding='UTF-8'?>
> <playlist>
>    <group> <-- shows: /playlist[2]/group[1]
>      <video>foo.mpg</video> <-- shows: /playlist[2]/group[1]/video[1]
>      <video>bar.mpg</video> <-- shows: /playlist[2]/group[1]/video[2]
>    </group>
> </playlist>
>
> And this looping code inside the function everytime makes me loose track
> of what's doing on. Well done though.
>
> Best regards,
> --
> Alexandre CONRAD - TLV
> Research & Development
> tel : +33 1 30 80 55 05
> fax : +33 1 30 56 55 06
> 6, rue de la plaine
> 78860 - SAINT NOM LA BRETECHE
> FRANCE

The line that translates '#text' to 'text()' has the advantage that it turns
the result into a valid XPath. The other line, which drops the [1], still
preserves a valid XPath, and I thought it was nicer to look at :).
I found the source of the problem and the cure. The source (as you can
easily verify with http://www.codeweld.com/files/dom_view.pyw, just use
'file://yourfile.xml') is that the Sax2 reader for some reason puts in a
second node with the same nodeName. The cure is to compare the localName
instead, as this name seems to be different for those two nodes. Additionally,
it is also different for some other nodes that might otherwise cause trouble
in borderline cases. This is the new function. (I also gave one variable a
more reasonable name; it was confusing otherwise.)

def abs_path( node ):
    successors = 1
    previous = node.previousSibling
    while previous:
        if previous.localName == node.localName: successors += 1
        previous = previous.previousSibling
    path = '/%s[%s]' % (node.nodeName, successors)
    if node.parentNode.nodeName != '#document':
        return abs_path( node.parentNode )+path
    return path
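
For comparison, here is an iterative variant, as a sketch only. It assumes
Python 3 and xml.dom.minidom, and instead of comparing localName it counts
only preceding siblings that are element nodes, so any non-element sibling
that happens to share the name is ignored. It handles element nodes only.

```python
from xml.dom import Node

def abs_path(node):
    """Build an XPath-like absolute path for an element node."""
    steps = []
    while node is not None and node.nodeType == Node.ELEMENT_NODE:
        # 1-based position among preceding element siblings of the same name.
        position = 1
        sibling = node.previousSibling
        while sibling is not None:
            if (sibling.nodeType == Node.ELEMENT_NODE
                    and sibling.nodeName == node.nodeName):
                position += 1
            sibling = sibling.previousSibling
        steps.append("/%s[%d]" % (node.nodeName, position))
        # Walk up; the loop stops at the document node.
        node = node.parentNode
    return "".join(reversed(steps))
```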

Kind Regards
Florian
From aconrad.tlv at magic.fr  Thu Aug  5 18:21:41 2004
From: aconrad.tlv at magic.fr (Alexandre CONRAD)
Date: Thu Aug  5 18:21:43 2004
Subject: [XML-SIG] get the abolute path for a node
In-Reply-To: <1091719778.41125262b3d17@webmail.codeweld.com>
References: <410FC155.2000802@magic.fr> <411210CF.5090300@v.loewis.de>
	<1091710269.41122d3db3cec@webmail.codeweld.com>
	<41124298.6090705@magic.fr>
	<1091719778.41125262b3d17@webmail.codeweld.com>
Message-ID: <41125E95.5070204@magic.fr>


xmlsig@codeweld.com wrote:

> Quoting Alexandre CONRAD <aconrad.tlv@magic.fr>:
> 
> 
>>>Does this help?
>>>
>>>def abs_path( node ):
>>>    successors = 1
>>>    parent = node.previousSibling
>>>    while parent:
>>>        if parent.nodeName == node.nodeName: successors += 1
>>>        parent = parent.previousSibling
>>>    name = node.nodeName == '#text' and 'text()' or node.nodeName
>>>    path = successors>1 and '/%s[%s]'%(name,successors) or '/%s'%name
>>>    if node.parentNode and node.parentNode.nodeName != '#document':
>>>        return abs_path( node.parentNode )+path
>>>    return path
>>
>>
>>Because I always strip out spaces in XML documents, and because I want
>>to show the 1st node with node[1], I changed your code so:
>>
>>- name = node.nodeName == '#text' and 'text()' or node.nodeName
>>- path = successors>1 and '/%s[%s]'%(name,successors) or '/%s'%name
>>+ path = '/%s[%s]' % (node.nodeName, successors)
>>
>>
>>This is function is pretty neat. But still, there is 1 more little thing
>>that I'm having a hard time figuring out how to fix. I keep getting
>>"/playlist[2]" as the root node. I can't have 2 root nodes anyway...
>>
>><?xml version='1.0' encoding='UTF-8'?>
>><playlist>
>>   <group> <-- shows: /playlist[2]/group[1]
>>     <video>foo.mpg</video> <-- shows: /playlist[2]/group[1]/video[1]
>>     <video>bar.mpg</video> <-- shows: /playlist[2]/group[1]/video[2]
>>   </group>
>></playlist>
>>
>>And this looping code inside the function everytime makes me loose track
>>of what's doing on. Well done though.
>>
>>Best regards,
>>--
>>Alexandre CONRAD - TLV
>>Research & Development
>>tel : +33 1 30 80 55 05
>>fax : +33 1 30 56 55 06
>>6, rue de la plaine
>>78860 - SAINT NOM LA BRETECHE
>>FRANCE
> 
> 
> The line that ranslates '#text' to 'text()' has the advantage that it translates
> the path to a valid xpath the other line that eliminates [1] still preserves
> this valid xpath, and I thought it's nicer to look at :).
> I found the source and the cure of the problem. The source is ( as you can
> easely verify with http://www.codeweld.com/files/dom_view.pyw, just use
> 'file://yourfile.xml' ) that the Sax2 reader for some reason puts a second node
> with the same nodeName in. The cure is to take for comparision the localName, as
> this name seems to be different for those. Additionaly he's also different for
> some other nodes which might otherwise in border situations made trouble. This
> is the new function. ( I also gave one variable a more reasonable name, was
> confusing otherwise )
> 
> def abs_path( node ):
>     successors = 1
>     previous = node.previousSibling
>     while previous:
>         if previous.localName == node.localName: successors += 1
>         previous = previous.previousSibling
>     path = '/%s[%s]' % (node.nodeName, successors)
>     if node.parentNode.nodeName != '#document':
>         return abs_path( node.parentNode )+path
>     return path
> 
> Kind Regards
> Florian

Ur da man !! :D

I fixed the prob on my side but was doing a dirty trick :

if (parent.nodeName == node.nodeName and
        parent.nodeName != node.ownerDocument.firstChild.nodeName):
    successors += 1

Uuugh ! I don't like that. I feel better that you have found the 
solution. Don't like to know there's dirty code in my application. ;)

Thank you so much for your help. That's a great function to be able to 
build the xpath of a given node.

Very best regards,
-- 
Alexandre CONRAD - TLV
Research & Development
tel : +33 1 30 80 55 05
fax : +33 1 30 56 55 06
6, rue de la plaine
78860 - SAINT NOM LA BRETECHE
FRANCE

From rsalz at datapower.com  Thu Aug  5 18:37:39 2004
From: rsalz at datapower.com (Rich Salz)
Date: Thu Aug  5 18:37:09 2004
Subject: [XML-SIG] get the abolute path for a node
In-Reply-To: <41125E95.5070204@magic.fr>
References: <410FC155.2000802@magic.fr>
	<411210CF.5090300@v.loewis.de>	<1091710269.41122d3db3cec@webmail.codeweld.com>	<41124298.6090705@magic.fr>	<1091719778.41125262b3d17@webmail.codeweld.com>
	<41125E95.5070204@magic.fr>
Message-ID: <41126253.8050107@datapower.com>

FYI, here is how ZSI does it; walking *up* from an element to a provided 
root:

def _backtrace(elt, dom):
     '''Return a "backtrace" from the given element to the DOM root,
     in XPath syntax.
     '''
     s = ''
     while elt != dom:
         name, parent = elt.nodeName, elt.parentNode
         if parent is None: break
         matches = [ c for c in _child_elements(parent)
                         if c.nodeName == name ]
         if len(matches) == 1:
             s = '/' + name + s
         else:
             i = matches.index(elt) + 1
             s = ('/%s[%d]' % (name, i)) + s
         elt = parent
     return s

-- 
Rich Salz, Chief Security Architect
DataPower Technology                           http://www.datapower.com
XS40 XML Security Gateway   http://www.datapower.com/products/xs40.html
XML Security Overview  http://www.datapower.com/xmldev/xmlsecurity.html
From webworldl at yahoo.com  Thu Aug  5 22:21:56 2004
From: webworldl at yahoo.com (Luke Bradley)
Date: Thu Aug  5 22:21:58 2004
Subject: [XML-SIG] need help: Sax can't read w3 dtds?
Message-ID: <20040805202156.62158.qmail@web53504.mail.yahoo.com>

Hi, I am looking for help with processing XHTML
documents in python with SAX or DOM. If this is not
the right place to ask, could you please refer me to a
good place?

My problem is that when I try to parse XHTML 1.1
documents with Python's SAX implementation, it throws
an error claiming that there are errors in the W3C's
DTDs. Given an XHTML page generated by the W3C's Tidy
generator, called hello.html:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="generator" content=
      "HTML Tidy for Windows (vers 1st June 2004), see
www.w3.org" />
    <meta http-equiv="Content-Type" content=
      "text/html; charset=us-ascii" />
    <title>Hello World</title>
  </head>
  <body>
    <p>Hello World!</p>
  </body>
</html>

and the python code:

import xml.sax.handler
xml.sax.parse("hello.html",
    xml.sax.handler.ContentHandler()
              )

a fatal error occurs with the following stacktrace:

Traceback (most recent call last):
  File "D:/projects/pyper/saxtest.py", line 4, in -toplevel-
    xml.sax.handler.ContentHandler()
  File "D:\PYTHON23\Lib\site-packages\_xmlplus\sax\__init__.py", line 31, in parse
    parser.parse(filename_or_stream)
  File "D:\PYTHON23\Lib\site-packages\_xmlplus\sax\expatreader.py", line 109, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "D:\PYTHON23\Lib\site-packages\_xmlplus\sax\xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "D:\PYTHON23\Lib\site-packages\_xmlplus\sax\expatreader.py", line 220, in feed
    self._err_handler.fatalError(exc)
  File "D:\PYTHON23\Lib\site-packages\_xmlplus\sax\handler.py", line 38, in fatalError
    raise exception
SAXParseException: http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-framework-1.mod:89:0: error in processing external entity reference

Any ideas? Am I missing something basic? Thanks.


From mike at skew.org  Thu Aug  5 22:27:29 2004
From: mike at skew.org (Mike Brown)
Date: Thu Aug  5 22:27:26 2004
Subject: [XML-SIG] XML Unicode and UTF-8
In-Reply-To: <FD72AF7813F1294C95279EC6D9784A2F01571152@100NOOSLMSG004.common.alpharoot.net>
	"from Paul Boddie at Aug 5, 2004 03:26:34 pm"
Message-ID: <200408052027.i75KRT01076110@chilled.skew.org>

Paul Boddie wrote:
> Do this instead:
> 
>       utext = segment[0].decode( segment[1] )

The resulting Unicode object may contain characters which are not allowed in 
XML, and thus the text may not be serializable (at least not in a way that 
would produce well-formed XML).

To embed arbitrary bytes in XML, the usual advice is to first convert the 
bytes into a character sequence that is permitted in XML. Base64 is a popular 
and easily implemented option, albeit inefficient. The article at 
http://www.javaworld.com/javaworld/javatips/jw-javatip117-p2.html suggests 
that a custom Huffman implementation is nearly 1:1. I've mapped bytes into the 
Private Use Area of Unicode before, too, although that's definitely not 
efficient.
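
As a rough sketch of both points (Python 3 assumed; the helper names are made
up for illustration), one can test text against the Char production of XML
1.0 and fall back to Base64 when the data is really arbitrary bytes:

```python
import base64
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NOT_XML_CHAR = re.compile(
    "[^\t\n\r\u0020-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)

def is_xml_safe(text):
    """True if every character of text may appear in an XML 1.0 document."""
    return _NOT_XML_CHAR.search(text) is None

def bytes_to_xml_text(raw):
    """Represent arbitrary bytes as Base64 text, which is always XML-safe."""
    return base64.b64encode(raw).decode("ascii")

encoded = bytes_to_xml_text(b"\xee\x00\xff")
```

The Base64 form is about a third larger than the raw bytes, which is the
inefficiency referred to above.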
From pb3 at bizbuzz.pbf.gatech.edu  Fri Aug  6 01:06:05 2004
From: pb3 at bizbuzz.pbf.gatech.edu (Paula_Britton)
Date: Fri Aug  6 01:06:08 2004
Subject: [XML-SIG] Away from the Office until 8/16/04
Message-ID: <200408052306.i75N65910704@bizbuzz.pbf.gatech.edu>

I will be out of the office starting August 5th and returning August 
16th.  Please contact Judy Whitfield with any issues at 404-894-9054 or 
judy.whitfield@business.gatech.edu.

Thank You.
Paula Britton
From and-xml at doxdesk.com  Fri Aug  6 10:06:05 2004
From: and-xml at doxdesk.com (Andrew Clover)
Date: Fri Aug  6 10:05:28 2004
Subject: [XML-SIG] need help: Sax can't read w3 dtds?
In-Reply-To: <20040805202156.62158.qmail@web53504.mail.yahoo.com>
References: <20040805202156.62158.qmail@web53504.mail.yahoo.com>
Message-ID: <41133BED.7010108@doxdesk.com>

Luke Bradley <webworldl@yahoo.com> wrote:

> My problem is that when I try to parse XHTML1.1
> documents with pythons SAX implementation, it throws
> an error claiming that there are errors in the W3C's
> DTD's.

It's right - there are. Many other parsers won't accept them either. The 
(first) error is at line 37 char 20 of 
http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-special.ent:

   <!ENTITY lt "&#38;&#60;" ><!-- less-than sign, U+003C ISOnum -->

Since character references are decoded once at entity-definition time 
this actually defines the entity lt as containing '&<', which is grossly 
ill-formed as well as being incompatible with &lt;'s canonical content.

Exactly how much of an error this is in XML is an arguable point, given 
that this entity is not actually used after its declaration. However 
parsers that need to report the declared entity content independently of 
their references (such as DOM implementations) cannot possibly allow it.

This is a bug in XHTML Modularization that makes handling today's XHTML 
1.1 with validation a bit of a non-starter (along with all the other 
problems connected with XHTML 1.1). Unfortunately W3C process has 
prevented the error from being fixed before the forthcoming XHTML 
Modularization Second Edition.

If you need to handle XHTML 1.1 at the moment, do it without 
validation/external entities.
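
A sketch of what that could look like, assuming Python 3 and the standard
library's expat-based SAX driver (PyXML's _xmlplus may behave differently);
the handler class and function names are made up for illustration:

```python
import io
import xml.sax
from xml.sax.handler import (ContentHandler, feature_external_ges,
                             feature_external_pes)

class ElementNames(ContentHandler):
    """Collect element names in document order."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)

def parse_without_external_entities(data):
    """Parse XML bytes without loading the DTD or other external entities."""
    parser = xml.sax.make_parser()
    # Do not fetch external general or parameter entities.
    parser.setFeature(feature_external_ges, False)
    parser.setFeature(feature_external_pes, False)
    handler = ElementNames()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler.names
```

With the DTD skipped, named entities such as &nbsp; are no longer declared,
so this only suits documents that stick to numeric references and the five
predefined entities.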

-- 
Andrew Clover
mailto:and@doxdesk.com
http://www.doxdesk.com/
From mike at skew.org  Fri Aug  6 10:14:26 2004
From: mike at skew.org (Mike Brown)
Date: Fri Aug  6 10:14:26 2004
Subject: [XML-SIG] need help: Sax can't read w3 dtds?
In-Reply-To: <41133BED.7010108@doxdesk.com> "from Andrew Clover at Aug 6, 2004
	05:06:05 pm"
Message-ID: <200408060814.i768EQkR078907@chilled.skew.org>

Andrew Clover wrote:
> It's right - there are. Many other parsers won't accept them either. The 
> (first) error is at line 37 char 20 of 
> http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-special.ent:
> 
>    <!ENTITY lt "&#38;&#60;" ><!-- less-than sign, U+003C ISOnum -->
> 

That's not an error. Read the spec carefully.

From and-xml at doxdesk.com  Fri Aug  6 15:44:21 2004
From: and-xml at doxdesk.com (Andrew Clover)
Date: Fri Aug  6 15:43:48 2004
Subject: [XML-SIG] need help: Sax can't read w3 dtds?
In-Reply-To: <200408060814.i768EQkR078907@chilled.skew.org>
References: <200408060814.i768EQkR078907@chilled.skew.org>
Message-ID: <41138B35.3050007@doxdesk.com>

Mike Brown <mike@skew.org> wrote:

>>   <!ENTITY lt "&#38;&#60;" ><!-- less-than sign, U+003C ISOnum -->

> That's not an error.

It *is* an error, regardless of your opinion of whether XML technically 
allows "&#38;&#60;" as a literal entity value(*). XML 1.0 SE 4.6 says:

   If the entities lt or amp are declared, they must be declared as
   internal entities whose replacement text is a character reference to
   the respective character (less-than sign or ampersand) being escaped

The entity value "&#38;&#60;" yields replacement text "&<" which clearly 
is not a character reference to the less-than sign.

This is acknowledged and fixed in m12n SE:

   http://www.w3.org/TR/2004/WD-xhtml-modularization-20040218/
   dtd_module_defs.html#a_module_XHTML_Special_Characters

   <!ENTITY lt "&#38;#60;" ><!-- less-than sign, U+003C ISOnum -->
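
The difference between the two declarations can be sketched in a few lines
of Python. This is only an illustration, assuming a modern Python 3; the
helper name replacement_text is hypothetical, and it models just the one
step at issue (numeric character references in the literal entity value
being expanded once when the entity is declared), not a full XML parser.

```python
import re

def replacement_text(literal):
    """Expand numeric character references once, as happens when the
    literal entity value is turned into replacement text."""
    def expand(match):
        digits = match.group(1)
        if digits.startswith("x"):
            return chr(int(digits[1:], 16))
        return chr(int(digits))
    return re.sub(r"&#(x[0-9a-fA-F]+|[0-9]+);", expand, literal)

# The published declaration: both references are expanded, leaving '&<'.
print(repr(replacement_text("&#38;&#60;")))

# The corrected declaration: only '&#38;' is a reference, leaving '&#60;',
# which is a character reference to the less-than sign.
print(repr(replacement_text("&#38;#60;")))
```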


* - IMO such a replacement text is technically allowable by implication of
XML 1.0 SE 2.3:

   Although the EntityValue production allows the definition of an entity
   consisting of a single explicit < in the literal (e.g., <!ENTITY mylt
   "<">), it is strongly advised to avoid this practice since any
   reference to that entity will cause a well-formedness error.

but it's incompatible with tools like DOM which require the replacement 
text to be parsed as-is without an explicit entity reference, to form 
the content of the Entity node.

-- 
Andrew Clover
mailto:and@doxdesk.com
http://www.doxdesk.com/
From mike at skew.org  Fri Aug  6 17:56:40 2004
From: mike at skew.org (Mike Brown)
Date: Fri Aug  6 17:56:38 2004
Subject: [XML-SIG] need help: Sax can't read w3 dtds?
In-Reply-To: <41138B35.3050007@doxdesk.com> "from Andrew Clover at Aug 6, 2004
	10:44:21 pm"
Message-ID: <200408061556.i76FueSO081407@chilled.skew.org>

Andrew Clover wrote:
> >>   <!ENTITY lt "&#38;&#60;" ><!-- less-than sign, U+003C ISOnum -->
> 
> > That's not an error.
> 
> It *is* an error

Sorry, I am used to correcting people on that one.
I thought the issue was the leading "&#38;".
You're right, though; I overlooked the extra "&".
I apologize for firing off that terse email 5 minutes before going to bed :)

-Mike
From and-xml at doxdesk.com  Fri Aug  6 20:28:59 2004
From: and-xml at doxdesk.com (Andrew Clover)
Date: Fri Aug  6 20:28:24 2004
Subject: [XML-SIG] need help: Sax can't read w3 dtds?
In-Reply-To: <200408061556.i76FueSO081407@chilled.skew.org>
References: <200408061556.i76FueSO081407@chilled.skew.org>
Message-ID: <4113CDEB.1050707@doxdesk.com>

Mike Brown <mike@skew.org> wrote:

> I thought the issue was the leading "&#38;".
> You're right, though; I overlooked the extra "&".

You can be forgiven - so did W3C!

-- 
Andrew Clover
mailto:and@doxdesk.com
http://www.doxdesk.com/
From n.youngman at ntlworld.com  Sat Aug  7 08:48:18 2004
From: n.youngman at ntlworld.com (Neil Youngman)
Date: Sat Aug  7 08:48:20 2004
Subject: [XML-SIG] XML Unicode and UTF-8
In-Reply-To: <200408052027.i75KRT01076110@chilled.skew.org>
References: <200408052027.i75KRT01076110@chilled.skew.org>
Message-ID: <200408070748.18432.n.youngman@ntlworld.com>

On Thursday 05 Aug 2004 9:27 pm, Mike Brown wrote:
> Paul Boddie wrote:
> > Do this instead:
> >
> >       utext = segment[0].decode( segment[1] )
>
> The resulting Unicode object may contain characters which are not allowed
> in XML, and thus the text may not be serializable (at least not in a way
> that would produce well-formed XML).

Yes, but it's being written out through a UTF-8 codec to a file which 
specifies 'charset="utf-8"'. AIUI the python UTF-8 codec can detect that it's 
got a unicode string and convert it to utf-8 with no further programmer
intervention. 

Of course a week ago, Python was just another buzzword to me, so I could be 
wrong.

> To embed arbitrary bytes in XML, the usual advice is to first convert the
> bytes into a character sequence that is permitted in XML. Base64 is a
> popular and easily implemented option, albeit inefficient. The article at
> http://www.javaworld.com/javaworld/javatips/jw-javatip117-p2.html suggests
> that a custom Huffman implementation is nearly 1:1. I've mapped bytes into
> the Private Use Area of Unicode before, too, although that's definitely not
> efficient.

All neat ideas, but as I want UTF-8 encoding, they would just add an 
unnecessary layer of complexity.

Thanks for trying to help, but I think I've got what I need.

Neil Youngman

From fredrik at pythonware.com  Sat Aug  7 16:42:56 2004
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Sat Aug  7 16:41:20 2004
Subject: [XML-SIG] Re: XML Unicode and UTF-8
References: <200408052027.i75KRT01076110@chilled.skew.org>
	<200408070748.18432.n.youngman@ntlworld.com>
Message-ID: <cf2pmc$h48$1@sea.gmane.org>

Neil Youngman wrote:

> Yes, but it's being written out through a UTF-8 codec to a file which
> specifies 'charset="utf-8"'. AIUI the python UTF-8 codec can detect that it's
> got a unicode string and convert it to utf-8 with no further programmer
> intervention.

Python's UTF-8 codec takes a Unicode object, and generates an 8-bit string
object.  If you attempt to "encode" an 8-bit string object, it is converted to a
Unicode object first.  This conversion only works if the 8-bit string contains
ASCII characters only.

There's no such thing as an 8-bit Unicode string.
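
For readers on a modern Python 3 (an assumption; this thread predates it),
the same distinction is spelled out in the type names: str holds Unicode
text, bytes holds the 8-bit form, and there is no implicit conversion
between them. A minimal sketch:

```python
text = "na\u00efve"                 # Unicode text (str)
encoded = text.encode("utf-8")      # the codec produces an 8-bit object (bytes)
print(type(encoded).__name__, encoded)

# Going the other way only works with the right codec: bytes holding
# non-ASCII data cannot be treated as ASCII text.
try:
    encoded.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError as exc:
    ascii_ok = False
    print("not ASCII:", exc.reason)

print(encoded.decode("utf-8") == text)
```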

</F>


From mike at skew.org  Sat Aug  7 19:59:43 2004
From: mike at skew.org (Mike Brown)
Date: Sat Aug  7 19:59:50 2004
Subject: [XML-SIG] XML Unicode and UTF-8
In-Reply-To: <200408070748.18432.n.youngman@ntlworld.com> "from Neil Youngman
	at Aug 7, 2004 07:48:18 am"
Message-ID: <200408071759.i77HxhXG087217@chilled.skew.org>

Neil Youngman wrote:
> On Thursday 05 Aug 2004 9:27 pm, Mike Brown wrote:
> > The resulting Unicode object may contain characters which are not allowed
> > in XML, and thus the text may not be serializable (at least not in a way 
> > that would produce well-formed XML).
> 
> Yes, but it's being written out through a UTF-8 codec 
 
Perhaps I wasn't being clear. It doesn't matter what encoding you use. XML 
places restrictions on what characters can be in the *decoded* (Unicode) 
version of the document. The encoded version of the document is just an 
alternative representation of the Unicode one.

In Python's notation, each character in the document must be one of:
\t  (tab)
\n  (linefeed)
\r  (carriage return)
\u0020-\ud7ff
\ue000-\ufffd
\U00010000-\U0010ffff

You are not allowed to have any other characters in your document, not even
by reference (e.g., you can't write &#0; to represent \u0000).
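
As a rough sketch of that rule (assuming Python 3, where strings are
sequences of code points; the names _NOT_XML_CHAR and illegal_xml_chars are
made up for this example), a regular expression can report every character
that falls outside the allowed ranges:

```python
import re

# Anything NOT in the ranges listed above: tab, linefeed, carriage return,
# U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_NOT_XML_CHAR = re.compile(
    "[^\t\n\r\u0020-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)

def illegal_xml_chars(text):
    """Return (index, character) pairs for characters XML 1.0 forbids."""
    return [(m.start(), m.group()) for m in _NOT_XML_CHAR.finditer(text)]

print(illegal_xml_chars("a\x00b\x1fc"))
print(illegal_xml_chars("tab\tnewline\n caf\u00e9"))
```

An empty list means the text can be serialized as-is; anything else has to
be removed, replaced, or transported some other way (Base64, for instance).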

So let's say you have 256 bytes of binary data, just byte values 0-255:

>>> bytestring = ''.join(map(chr,range(256)))

How do you put this into your document? You have to make it be Unicode,
so you could try

>>> ustring = unicode(bytestring)

but that would give you an error because by default it's going to assume
bytestring is ascii (actually, what is returned by sys.getdefaultencoding(),
I think), whereas you've got bytes higher than \x7f.

You could try

>>> ustring = unicode(bytestring, 'utf-8')

but you will get errors because the bytes aren't valid UTF-8 sequences.
They're valid iso-8859-1, though, (iso-8859-1 allows any byte value) so
you could do

>>> ustring = unicode(bytestring, 'iso-8859-1')

and now you've got u'\u0000\u0001\u0002...\u00fe\u00ff'.

Note that some of those characters are not allowed in XML. The DOM 
implementations will accept them, because they don't check for illegal 
characters.

>>> from xml.dom.minidom import parseString
>>> doc = parseString('<data/>')
>>> t = doc.createTextNode(ustring)
>>> doc.childNodes[0].appendChild(t)

They'll even blindly serialize them for you.

>>> xmlstring = doc.toxml('utf-8')
>>> xmlustring = doc.toxml()

In all 3 cases (doc, xmlstring, xmlustring), illegal characters are in the XML.
Want proof?

>>> doc2 = parseString(xmlstring)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.3/xml/dom/minidom.py", line 1925, in parseString
    return expatbuilder.parseString(string)
  File "/usr/local/lib/python2.3/xml/dom/expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "/usr/local/lib/python2.3/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 2, column 6

If you try these examples yourself and go looking at the variables created, 
take note that Python's representation of Unicode strings uses '\x00'-'\xff'
for '\u0000'-'\u00ff'. It's just a cosmetic thing; if the string is Unicode,
everything in it is Unicode characters, not bytes.
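
For anyone repeating the experiment on a current interpreter, here is a
condensed sketch of the same steps, assuming Python 3 (where bytes and str
replace the 8-bit string and unicode types used above). The exact error
message and position are not shown because they may differ between
versions.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# 256 bytes of "binary" data, decoded as iso-8859-1 so every byte maps
# to the code point with the same number.
byte_values = bytes(range(256))
text = byte_values.decode("iso-8859-1")

# minidom accepts the text and serializes it without checking for
# characters that XML does not allow.
doc = parseString("<data/>")
doc.documentElement.appendChild(doc.createTextNode(text))
serialized = doc.toxml("utf-8")

# Parsing the result back shows that it is not well-formed.
try:
    parseString(serialized)
    reparsed_ok = True
except ExpatError as exc:
    reparsed_ok = False
    print("re-parse failed:", exc)
```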

-Mike
From neil.youngman at youngman.org.uk  Sat Aug  7 21:11:08 2004
From: neil.youngman at youngman.org.uk (Neil Youngman)
Date: Sat Aug  7 21:11:11 2004
Subject: [XML-SIG] Re: XML Unicode and UTF-8
In-Reply-To: <cf2pmc$h48$1@sea.gmane.org>
References: <200408052027.i75KRT01076110@chilled.skew.org>
	<200408070748.18432.n.youngman@ntlworld.com>
	<cf2pmc$h48$1@sea.gmane.org>
Message-ID: <200408072011.09008.neil.youngman@youngman.org.uk>

On Saturday 07 Aug 2004 3:42 pm, Fredrik Lundh wrote:
> Neil Youngman wrote:
> > Yes, but it's being written out through a UTF-8 codec to a file which
> > specifies 'charset="utf-8"'. AIUI the python UTF-8 codec can detect that
> > it's got a unicode string and convert it to utf-8 with no further
> > programmer intervention.
>
> Python's UTF-8 codec takes a Unicode object, and generates an 8-bit string
> object.  If you attempt to "encode" an 8-bit string object, it is converted
> to a Unicode object first.  This conversion only works if the 8-bit string
> contains ASCII characters only.
>
> There's no such thing as an 8-bit Unicode string.

I never said there was. The string comes from decode, which I believe returns 
a Unicode string. AIUI the Python type system preserves that information 
until it reaches the codec, which therefore treats it correctly. My use of 
the phrase "the python UTF-8 codec can detect that it's got a unicode string" 
might have been a poor choice, but I don't think I'm disagreeing with you.

Neil Youngman

From n.youngman at ntlworld.com  Sat Aug  7 21:36:58 2004
From: n.youngman at ntlworld.com (Neil Youngman)
Date: Sat Aug  7 21:37:01 2004
Subject: [XML-SIG] XML Unicode and UTF-8
In-Reply-To: <200408071759.i77HxhXG087217@chilled.skew.org>
References: <200408071759.i77HxhXG087217@chilled.skew.org>
Message-ID: <200408072036.58754.n.youngman@ntlworld.com>

On Saturday 07 Aug 2004 6:59 pm, Mike Brown wrote:
> Neil Youngman wrote:
> > On Thursday 05 Aug 2004 9:27 pm, Mike Brown wrote:
> > > The resulting Unicode object may contain characters which are not
> > > allowed in XML, and thus the text may not be serializable (at least not
> > > in a way that would produce well-formed XML).
> >
> > Yes, but it's being written out through a UTF-8 codec
>
> Perhaps I wasn't being clear. It doesn't matter what encoding you use. XML
> places restrictions on what characters can be in the *decoded* (Unicode)
> version of the document. The encoded version of the document is just an
> alternative representation of the Unicode one.
>
> In Python's notation, each character in the document must be one of:
> \t  (tab)
> \n  (linefeed)
> \r  (carriage return)
> \u0020-\ud7ff
> \ue000-\ufffd
> \U00010000-\U0010ffff
>
> You are not allowed to have any other characters in your document, not even
> by reference (e.g., you can't write &#0; to represent \u0000).
>
> So let's say you have 256 bytes of binary data, just byte values 0-255:
> >>> bytestring = ''.join(map(chr,range(256)))

OK. I think we're starting from different assumptions here. The data comes 
from decoding an RFC1522 header. It is therefore assumed to be text, albeit 
in a non-ASCII character set. It should not be an arbitrary chunk of binary 
data. 

I'm assuming, possibly incorrectly, that the standards are set up in such a
way that if it's valid text, it should be possible to insert the UTF-8
equivalent in XML.

While I theoretically could get something that's not valid text, encoded in an 
RFC1522 header, it's only going to cause me real concern if it's a security 
flaw. If we can't adequately process invalid data, that's not a major concern 
for me. If you are saying that there may be text in character sets supported 
in Python (with CJK codecs), that I can't insert as plain UTF-8 into a UTF-8 
XML document that would be a concern.

Neil Youngman

From martin at v.loewis.de  Sun Aug  8 09:54:22 2004
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Sun Aug  8 09:54:22 2004
Subject: [XML-SIG] XML Unicode and UTF-8
In-Reply-To: <200408072036.58754.n.youngman@ntlworld.com>
References: <200408071759.i77HxhXG087217@chilled.skew.org>
	<200408072036.58754.n.youngman@ntlworld.com>
Message-ID: <4115DC2E.8050004@v.loewis.de>

Neil Youngman wrote:
> I'm assuming, possibly incorrectly, that the standards are set up in such a
> way that if it's valid text, it should be possible to insert the UTF-8
> equivalent in XML.

That's, strictly speaking, incorrect - the notion of "valid text" is 
really flawed. Valid text, e.g. in iso-8859-5, might contain control
characters which are not allowed in XML.
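
A small illustration of that point, assuming Python 3 and its standard
email.header module; the header value below is invented for the example and
is not taken from any real message. It decodes cleanly as iso-8859-5 text,
yet ends in a BEL control character that XML 1.0 does not permit.

```python
from email.header import decode_header

# Hypothetical encoded-word: six Cyrillic letters in iso-8859-5 followed
# by the byte 0x07 (BEL).
raw_header = "=?iso-8859-5?q?=BF=E0=D8=D2=D5=E2=07?="

raw_bytes, charset = decode_header(raw_header)[0]
text = raw_bytes.decode(charset)

# C0 controls other than tab, linefeed and carriage return are not
# XML characters, even though the decode step accepted them.
bad = [c for c in text if ord(c) < 0x20 and c not in "\t\n\r"]
print(repr(text), bad)
```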

Regards,
Martin
From n.youngman at ntlworld.com  Sun Aug  8 10:36:57 2004
From: n.youngman at ntlworld.com (Neil Youngman)
Date: Sun Aug  8 10:37:00 2004
Subject: [XML-SIG] XML Unicode and UTF-8
In-Reply-To: <4115DC2E.8050004@v.loewis.de>
References: <200408071759.i77HxhXG087217@chilled.skew.org>
	<200408072036.58754.n.youngman@ntlworld.com>
	<4115DC2E.8050004@v.loewis.de>
Message-ID: <200408080936.57431.n.youngman@ntlworld.com>

On Sunday 08 Aug 2004 8:54 am, Martin v. Löwis wrote:
> Neil Youngman wrote:
> > I'm assuming, possibly incorrectly, that the standards are set up in such
> > a way that if it's valid text, it should be possible to insert the
> > UTF-8 equivalent in XML.
>
> That's, strictly speaking, incorrect - the notion of "valid text" is
> really flawed. Valid text, e.g. in iso-8859-5, might contain control
> characters which are not allowed in XML.

OK. At the moment I'm just prototyping. I can see that it's a messy area and 
there are some tricky issues I'll have to study before I can produce any real 
software.

Thanks

Neil

From tpassin at comcast.net  Sun Aug  8 18:52:08 2004
From: tpassin at comcast.net (Thomas B. Passin)
Date: Sun Aug  8 18:51:00 2004
Subject: [XML-SIG] favicon in XBEL
In-Reply-To: <1091474222.3479.220.camel@borgia>
References: <LOBBJAPPIEJKBPAKDHOPIEBCCKAA.ahmad@gharbeia.org>	
	<200407301527.14592.fdrake@acm.org> <410AC45B.4070504@comcast.net>
	<1091474222.3479.220.camel@borgia>
Message-ID: <41165A38.7060009@comcast.net>

Uche Ogbuji wrote:

> On Fri, 2004-07-30 at 15:57, Thomas B. Passin wrote:

>>Well, maybe that doesn't happen so often anymore (better browsers?), but 
>>I had to do some hacking on the current xbel code to get it to use 
>>unicode and stop halting with encoding errors on titles.  I haven't had 
>>time to post my changes yet, but maybe in a couple of weeks ...
> 
> 
> Well, not halting can be bad if you don't know what the encodings
> actually are.  Maybe the utilities would have to take some sort of
> default encoding param from the user?  But I really hate to make
> crutches for such insidious problems.
> 

One of the problems was that I would get a non-ascii error for xbel
python code when titles contained certain iso-8859-1 characters.  Not 
surprising, of course, but it had to be dealt with.  For maybe the last 
year, since I hacked my xbel code to include encodings, I have had 
reliable results using iso-8859-1 for IE and utf-8 for my Mozilla-based 
browsers.

Of course, that would be specific to my personal browser settings.  I 
just wanted to bring out that one has to pay attention to these issues 
when contemplating merging bookmarks from various sources.  Since it was 
very annoying for me until I got it handled, we want to make sure that 
any update to the xbel code gets it right.

Cheers,

Tom P

-- 
Thomas B. Passin
Explorer's Guide to the Semantic Web (Manning Books)
http://www.manning.com/catalog/view.php?book=passin
From noreply at sourceforge.net  Mon Aug  9 00:07:19 2004
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Aug  9 00:07:22 2004
Subject: [XML-SIG] [ pyxml-Patches-1005669 ] prepare_input_source for bugs
	616431, 788931
Message-ID: <E1Btvp1-0005Yp-00@sc8-sf-web1.sourceforge.net>

Patches item #1005669, was opened at 2004-08-08 22:07
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=306473&aid=1005669&group_id=6473

Category: SAX
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Andrew Clover (bobince)
Assigned to: Nobody/Anonymous (nobody)
Summary: prepare_input_source for bugs 616431, 788931

Initial Comment:
First version of replacement prepare_input_source
function as described in bug 616431. Seems to work with
existing code I've tried whilst solving this problem,
but wider testing appreciated.


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=306473&aid=1005669&group_id=6473
From paul.boddie at ementor.no  Mon Aug  9 12:07:28 2004
From: paul.boddie at ementor.no (Paul Boddie)
Date: Mon Aug  9 12:07:32 2004
Subject: [XML-SIG] XML Unicode and UTF-8
Message-ID: <FD72AF7813F1294C95279EC6D9784A2F015712BE@100NOOSLMSG004.common.alpharoot.net>

Neil Youngman [mailto:n.youngman@ntlworld.com] wrote:
>
> OK. I think we're starting from different assumptions here. The data
> comes from decoding an RFC1522 header. It is therefore assumed to be
> text, albeit in a non-ASCII character set. It should not be an
> arbitrary chunk of binary data.

That's why I was slightly puzzled by the remark about invalid Unicode
values. But then I wasn't following the discussion that closely.

> I'm assuming, possibly incorrectly, that the standards are set up in
> such a way that if it's valid text, it should be possible to insert
> the UTF-8 equivalent in XML.

I think it's best to think of the problem with the following
terminology:

 * The original text is a normal Python string with a known encoding.
   We refer to that as a byte string.

 * You want to convert that string to a Unicode object and insert it
   into a DOM representation of an XML document. We refer to this as
   Unicode in the DOM.

 * You want to serialise the document using a UTF-8 encoding. We can
   refer to the content as UTF-8 in XML.

As has been mentioned already, you might well be able to put UTF-8
encoded byte strings into the DOM, but then you'll experience problems
with serialisation. If you put Unicode objects into the DOM,
serialisation should proceed successfully.

And as far as opening a file and serialising to it is concerned, I've
had most success with the following sequence of operations:

 * Open a file using Python's "open" built-in function - this exposes
   an output stream which should be considered as accepting byte
   values (as opposed to streams exposed by "codecs.open" which
   accept Unicode values).

 * Serialise to the stream using the various XML toolkit functions or
   methods. These functions or methods are able to produce an
   encoding declaration in the serialised document consistent with
   the actual encoding employed. They will also convert the Unicode
   values to the appropriate byte sequences for the output stream.

 * Close the file. ;-)

There may be a better way of doing this, but that's the most sane way
I've discovered so far.
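
The sequence above might look like this with minidom. This is a sketch
under assumptions, not the only way to do it: it uses Python 3, a temporary
directory, and a made-up file name out.xml, and it relies on toxml() with
an encoding argument to return bytes with a matching encoding declaration.

```python
import os
import tempfile
from xml.dom.minidom import parseString

# Unicode in the DOM: the text node holds characters, not bytes.
doc = parseString("<data/>")
doc.documentElement.appendChild(doc.createTextNode("caf\u00e9 \u00b5"))

# Open a byte stream and let the toolkit do the encoding.
path = os.path.join(tempfile.mkdtemp(), "out.xml")
with open(path, "wb") as stream:
    stream.write(doc.toxml(encoding="utf-8"))

# Reading the file back and parsing it recovers the same Unicode text.
with open(path, "rb") as stream:
    raw = stream.read()
roundtrip = parseString(raw).documentElement.firstChild.data
print(raw)
print(roundtrip)
```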

Paul

From tom.dalglish at verizon.net  Mon Aug  9 16:15:31 2004
From: tom.dalglish at verizon.net (tom.dalglish@verizon.net)
Date: Mon Aug  9 16:16:17 2004
Subject: [XML-SIG] Installing PyXML in PYTHONPATH instead of site-packages...
Message-ID: <20040809141531.XKKH22270.out012.verizon.net@outgoing.verizon.net>

Hi,

We have a strong lock-down policy on Windows directories and I am not permitted to install in the traditional $PYTHON/Lib/site-packages.


The Installshield app does not allow you to override the setting,
which it reads from the Registry (ack!). How can I install it in a directory
that I own?


Thanks,


From matt.price at utoronto.ca  Mon Aug  9 19:45:43 2004
From: matt.price at utoronto.ca (Matt Price)
Date: Mon Aug  9 19:45:45 2004
Subject: [XML-SIG] unicode and xml/xsl
Message-ID: <20040809174543.GA9033@utoronto.ca>

(cross-posted to python-list)
Hello,

I'm a python (& xml, & unicode!) newbie working on an interface to a
bibliographic reference server (refdb); I'm running into some encoding
problems & am finding the plethora of tools a little confusing.  Here
is the basic situation:

I connect to the server and receive an xml document whose content is a
bibliographic dataset.  The document can be encoded in two ways:
ISO-8859-1 or unicode.  My program simply takes the document and
passes it to an xsl stylesheet using libxslt & libxml2.  Here's the
relevant code:  

# this is how I get the results & generate either a string or a
# unicode string
    def getref (self, query = ':ID:>0',  cmd = 'getref ', 
                reftype = default_reftype): 
        cmd += ' ' + query 
        self.send(cmd + self.CS_TERM) 
        results = self.tread() 
        if self.encoding == 'UNICODE': 
            print ' decoding unicode string: ' 
            results = results.decode('utf-8', 'replace') 
        return results 


# this is where I generate the html:
    def risx_to_html (self, risxSet, xsl = xsl_ss,  
                    css=css_url, use_css = 1): 
        styledoc = libxml2.parseFile(xsl) 
        style = libxslt.parseStylesheetDoc(styledoc) 
        doc = libxml2.parseDoc(risxSet) 
        result = style.applyStylesheet(doc, None) 
        # style.saveResultToFilename("results.html", result, 0) 
        htmlString = style.saveResultToString(result) 
        style.freeStylesheet() 
        doc.freeDoc() 
        result.freeDoc() 
        return htmlString 

The server's default encoding is iso-8859-1, and since I mostly use
english-language references, this usually works fine; but occasionally
the server will pass me an entity like '&mu;' (for Greek letter mu).
This generates an error like this:  

Entity: line 57: parser error : Entity 'mu' not defined

This is not so bad, because the parsing continues nonetheless.  With
unicode it's worse.  In this case there are several errors depending
on how I set the system up:  

with iso-8859-1 set as default encoding in sitecustomize.py:

  File "/home/matt/bibpython/refdbclient.py", line 268, in risx_to_html
    doc = libxml2.parseDoc(risxSet)
  File "/usr/lib/python2.3/site-packages/libxml2.py", line 1149, in parseDoc
    ret = libxml2mod.xmlParseDoc(cur)
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256)

with utf-8 set as default encoding: 
  File "/home/matt/bibpython/refdbclient.py", line 268, in risx_to_html
    doc = libxml2.parseDoc(risxSet)
  File "/usr/lib/python2.3/site-packages/libxml2.py", line 1149, in parseDoc
    ret = libxml2mod.xmlParseDoc(cur)
TypeError: xmlParseDoc() argument 1 must be string without null bytes or None, not unicode

So I guess I have two questions:

(1) am I using the right python tools for this job?  My excellent
python books unfortunately don't cover either unicode or xml in much
depth, so I'm a little uncertain as to whether I'm doing the right
thing.

(2) Is there a way to make libxml2 parse unicode documents?  Do I need
to pass it more information alerting it that it's getting unicode?  
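
One direction worth trying for (2), offered as an assumption rather than a
confirmed fix: both tracebacks above complain about receiving a unicode
object, which suggests the parser wants encoded bytes. The sketch below
(Python 3 syntax; as_parser_input is a hypothetical helper, and no libxml2
call is made) only shows the conversion step. Note that the encoding named
in the document's own XML declaration, if any, would need to agree with the
bytes actually handed over.

```python
def as_parser_input(document, encoding="utf-8"):
    """Return bytes suitable for a byte-oriented XML parser.

    Text is encoded; data that is already bytes is passed through
    unchanged on the assumption that its declaration is correct.
    """
    if isinstance(document, str):
        return document.encode(encoding)
    return document

decoded = "<title>\u00b5-calculus</title>"
payload = as_parser_input(decoded)
print(payload)
```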

Anyway, thanks very much for your help.  Much appreciated,  

Matt


-------------------------------------------
Matt Price	    matt.price@utoronto.ca
History Department, University of Toronto
(416) 978-2094
--------------------------------------------
From uche.ogbuji at fourthought.com  Mon Aug  9 20:42:37 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Mon Aug  9 20:42:52 2004
Subject: [XML-SIG] favicon in XBEL
In-Reply-To: <200407301527.14592.fdrake@acm.org>
References: <LOBBJAPPIEJKBPAKDHOPIEBCCKAA.ahmad@gharbeia.org>
	<200407301527.14592.fdrake@acm.org>
Message-ID: <1092076957.810.7.camel@borgia>

On Fri, 2004-07-30 at 13:27, Fred L. Drake, Jr. wrote:
> On Friday 30 July 2004 09:15 am, Ahmad Gharbeia wrote:
>  > Storing and handling book marks in a cross platform/browser format has
>  > been a long time interest for me. Only when I started thinking of
>  > undertaking the task myself in XML that I found your work, which I greatly
>  > admire.
> 
> Thanks!
> 
>  > Allow me to bring one suggestion to your attention:
>  > Why not add the ability to store an encoded 'favicon', or a URI to it in a
>  > <bookmark> element?
> 
> This has been discussed before, and is of interest to the Konqueror crew as 
> well.  I'll have to dig back in my archives to see what was said.

To me, this should be something users handle through extensibility.  I
don't think favicon is important enough for the XBEL core.

-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Decomposition, Process, Recomposition - http://www.xml.com/pub/a/2004/07/28/py-xml.html
Perspective on XML: Steady steps spell success with Google - http://www.adtmag.com/article.asp?id=9663
Managing XML libraries - http://www.adtmag.com/article.asp?id=9160
Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090
Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html
A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/

From uche.ogbuji at fourthought.com  Mon Aug  9 20:44:49 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Mon Aug  9 20:45:17 2004
Subject: [XML-SIG] Re: value error when parsing XML
In-Reply-To: <410FD5BB.1080306@doxdesk.com>
References: <410B7277.3000609@mail.usyd.edu.au>
	<40EE32F9.1080809@doxdesk.com> <410F6E98.4080803@mail.usyd.edu.au>
	<410FD5BB.1080306@doxdesk.com>
Message-ID: <1092077088.810.10.camel@borgia>

On Tue, 2004-08-03 at 12:13, Andrew Clover wrote:
> Ajay Brar <abra9823@mail.usyd.edu.au> wrote:
> 
> > i am using a SAX parser.
> 
> I don't do a lot of SAX, but it looks to me like there's a bug in the 
> xml.sax.saxutils InputSource which is likely to be the cause of your 
> trouble. (Details to follow.)
> 
>  > i think its something to do with the way i call the parser
>  > parser.parse("../um_xml/um_ajay.xml")
> 
> Yes. I would suggest passing in a URI instead:

Precisely.  People too often mix up file names with URIs, and it causes
no end of trouble.

>    filename= '../um_xml/um__ajay.xml'
>    uri= 'file:'+urllib.pathname2url(os.path.abspath(filename))
>    parser.parse(uri)

I think filename should be absolutized before it gets to your "uri="
line.
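
A minimal sketch of doing the two steps separately, assuming a modern
Python 3 where pathlib is available (it was not in 2004); the function name
file_uri is made up for the example. The path is made absolute first, and
only then turned into a file: URI.

```python
import os
from pathlib import Path

def file_uri(filename):
    """Absolutize a file name, then convert it to a file: URI."""
    absolute = os.path.abspath(filename)   # step 1: resolve against the cwd
    return Path(absolute).as_uri()         # step 2: percent-encode as a URI

uri = file_uri("../um_xml/um_ajay.xml")
print(uri)
```

The resulting string is what would be handed to parser.parse() in place of
the bare relative file name.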


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Decomposition, Process, Recomposition - http://www.xml.com/pub/a/2004/07/28/py-xml.html
Perspective on XML: Steady steps spell success with Google - http://www.adtmag.com/article.asp?id=9663
Managing XML libraries - http://www.adtmag.com/article.asp?id=9160
Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090
Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html
A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/

From martin at v.loewis.de  Mon Aug  9 22:56:34 2004
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Mon Aug  9 22:56:32 2004
Subject: [XML-SIG] Installing PyXML in PYTHONPATH instead of
	site-packages...
In-Reply-To: <20040809141531.XKKH22270.out012.verizon.net@outgoing.verizon.net>
References: <20040809141531.XKKH22270.out012.verizon.net@outgoing.verizon.net>
Message-ID: <4117E502.1040305@v.loewis.de>

tom.dalglish@verizon.net wrote:
> The Installshield app does not allow you to override the setting,
> which it reads from the Registry (ack!). How can I install it in a directory
> that I own?

It's not Installshield, but bdist_wininst.

To install elsewhere, run "python setup.py install" on the source 
distribution.

Regards,
Martin
From tpassin at comcast.net  Mon Aug  9 23:30:05 2004
From: tpassin at comcast.net (Thomas B. Passin)
Date: Mon Aug  9 23:28:53 2004
Subject: [XML-SIG] Installing PyXML in PYTHONPATH instead
	of	site-packages...
In-Reply-To: <4117E502.1040305@v.loewis.de>
References: <20040809141531.XKKH22270.out012.verizon.net@outgoing.verizon.net>
	<4117E502.1040305@v.loewis.de>
Message-ID: <4117ECDD.7020402@comcast.net>

Martin v. Löwis wrote:
> tom.dalglish@verizon.net wrote:
> 
>> The Installshield app does not allow you to override the setting,
>> which it reads from the Registry (ack!). How can I install it in a
>> directory that I own?
> 
> 
> It's not Installshield, but bdist_wininst.
> 
> To install elsewhere, run "python setup.py install" on the source 
> distribution.

Except for Windows users ...  I have actually temporarily changed the 
address in the registry to persuade pyxml to install in the distribution 
I want (e.g., Python2.3, Zope 2.7, Plone, etc.).  Just export the 
original settings to a file, and you can restore them afterwards.

I wish that the Python installer would provide for multiple 
installations of the same version on Windows, but it doesn't.

Cheers,

Tom P

-- 
Thomas B. Passin
Explorer's Guide to the Semantic Web (Manning Books)
http://www.manning.com/catalog/view.php?book=passin
From fdrake at acm.org  Mon Aug  9 23:53:53 2004
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Mon Aug  9 23:54:04 2004
Subject: [XML-SIG] Installing PyXML in PYTHONPATH instead
	of site-packages...
In-Reply-To: <4117ECDD.7020402@comcast.net>
References: <20040809141531.XKKH22270.out012.verizon.net@outgoing.verizon.net>
	<4117E502.1040305@v.loewis.de> <4117ECDD.7020402@comcast.net>
Message-ID: <200408091753.53150.fdrake@acm.org>

On Monday 09 August 2004 05:30 pm, Thomas B. Passin wrote:
 > I wish that the Python installer would provide for multiple
 > installations of the same version on Windows, but it doesn't.

This gets a little better in Python 2.4, which supports --home for all 
platforms.


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>

From uche.ogbuji at fourthought.com  Tue Aug 10 02:30:06 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Tue Aug 10 02:30:22 2004
Subject: [XML-SIG] get the abolute path for a node
In-Reply-To: <1091719778.41125262b3d17@webmail.codeweld.com>
References: <410FC155.2000802@magic.fr> <411210CF.5090300@v.loewis.de>
	<1091710269.41122d3db3cec@webmail.codeweld.com>
	<41124298.6090705@magic.fr>
	<1091719778.41125262b3d17@webmail.codeweld.com>
Message-ID: <1092097806.810.116.camel@borgia>

On Thu, 2004-08-05 at 09:29, xmlsig@codeweld.com wrote:
> The line that translates '#text' to 'text()' has the advantage that it turns
> the path into a valid XPath; the other line, which eliminates [1], still preserves
> this valid XPath, and I thought it's nicer to look at :).
> I found the source and the cure of the problem. The source is (as you can
> easily verify with http://www.codeweld.com/files/dom_view.pyw, just use
> 'file://yourfile.xml')

Niiiiiice.  I'll have to highlight this code in one of my columns, if
that's OK with you.  Of course I think

import xml.dom.ext.reader.Sax2 as Sax2

is probably a bad idea, though I'm not sure what the best alternatives
are to

import xml.dom.ext.reader.HtmlLib as HtmlLib

Do you have any discussion or docs on this code?

> that the Sax2 reader for some reason puts in a second node
> with the same nodeName. The cure is to compare the localName instead, as
> this name seems to differ between the two. Additionally, it also differs for
> some other nodes which might otherwise cause trouble in edge cases. This
> is the new function. (I also gave one variable a more reasonable name; it was
> confusing otherwise.)
> 
> def abs_path( node ):
>     successors = 1
>     previous = node.previousSibling
>     while previous:
>         if previous.localName == node.localName: successors += 1
>         previous = previous.previousSibling
>     path = '/%s[%s]' % (node.nodeName, successors)
>     if node.parentNode.nodeName != '#document':
>         return abs_path( node.parentNode )+path
>     return path

Cool.  I took this as a starting point to add such a function to my
domtools.py

http://cvs.4suite.org/cgi-bin/viewcvs.cgi/Anobind/domtools.py

For convenience, here's my version:

from xml.dom import Node

#The abs_path is based on code developed by "Florian" on XML-SIG
#http://mail.python.org/pipermail/xml-sig/2004-August/010423.html
def abs_path( node ):
    """
    Return an XPath expression that provides a unique path to
    the given node (only supports elements, attributes and
    root nodes) within a document
    """
    #is_domlette = hasattr(node, 'rootNode')
    if node.nodeType == Node.ELEMENT_NODE:
        successors = 1
        #Determine how many previous siblings there are with the same
        #node name
        previous = node.previousSibling
        while previous:
            if previous.localName == node.localName: successors += 1
            previous = previous.previousSibling
        step = u'%s[%i]' % (node.nodeName, successors)
        ancestor = node.parentNode
    elif node.nodeType == Node.ATTRIBUTE_NODE:
        step = u'@%s' % (node.nodeName)
        ancestor = node.ownerElement
    elif not node.parentNode:
        step = u''
        ancestor = node
    else:
        raise TypeError('Unsupported node type for abs_path')
    if ancestor.parentNode:
        return abs_path(ancestor) + u'/' + step
    else:
        return u'/' + step
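
As a rough illustration of how such a function behaves, here is a condensed variant run against the standard library's xml.dom.minidom. Two things are assumptions of this sketch and not part of domtools.py: the sample document, and the choice to compare nodeName together with nodeType (instead of localName) so that it does not depend on how a particular DOM fills in localName.

```python
from xml.dom import Node, minidom


def abs_path(node):
    """Return an XPath-like absolute path for an element, attribute or root node."""
    if node.nodeType == Node.ELEMENT_NODE:
        position = 1
        # Count earlier sibling elements that carry the same name.
        previous = node.previousSibling
        while previous is not None:
            if (previous.nodeType == Node.ELEMENT_NODE
                    and previous.nodeName == node.nodeName):
                position += 1
            previous = previous.previousSibling
        step = '%s[%i]' % (node.nodeName, position)
        ancestor = node.parentNode
    elif node.nodeType == Node.ATTRIBUTE_NODE:
        step = '@%s' % node.nodeName
        ancestor = node.ownerElement
    elif node.parentNode is None:
        # The document (root) node itself.
        step = ''
        ancestor = node
    else:
        raise TypeError('Unsupported node type for abs_path')
    if ancestor.parentNode is not None:
        return abs_path(ancestor) + '/' + step
    return '/' + step


# Hypothetical sample document, used only for this demonstration.
doc = minidom.parseString('<root><item/><item id="x"/><other/></root>')
second_item = doc.getElementsByTagName('item')[1]

element_path = abs_path(second_item)
attribute_path = abs_path(second_item.getAttributeNode('id'))
document_path = abs_path(doc)
```

With this sample, the second item resolves to /root[1]/item[2], its id attribute to /root[1]/item[2]/@id, and the document node to /.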


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Decomposition, Process, Recomposition - http://www.xml.com/pub/a/2004/07/28/py-xml.html
Perspective on XML: Steady steps spell success with Google - http://www.adtmag.com/article.asp?id=9663
Managing XML libraries - http://www.adtmag.com/article.asp?id=9160
Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090
Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html
A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/

From uche.ogbuji at fourthought.com  Tue Aug 10 02:48:31 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Tue Aug 10 02:48:41 2004
Subject: [XML-SIG] saxutils bug (was: value error when parsing XML)
In-Reply-To: <410FDF31.1070809@doxdesk.com>
References: <410FDF31.1070809@doxdesk.com>
Message-ID: <1092098911.810.120.camel@borgia>

On Tue, 2004-08-03 at 12:53, Andrew Clover wrote:
> I would prefer to keep all InputSource systemIds as URIs; even when a 
> filename was originally passed in it should be converted to a URI. 
> Otherwise we cannot reliably deal with relative systemIds.

+1.  This is the hard line we took in 4Suite, and I think it really
makes everything much more sane.
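
To make the point about relative systemIds concrete, here is a minimal sketch using only the standard library. The helper name filename_to_system_id and the paths are made up for illustration, and the helper assumes an absolute POSIX-style filename; it is not what saxutils or 4Suite actually do internally.

```python
from urllib.parse import quote, urljoin


def filename_to_system_id(filename):
    """Turn an absolute POSIX filename into a file: URI (hypothetical helper)."""
    # quote() leaves '/' alone and percent-encodes characters such as spaces.
    return 'file://' + quote(filename)


# Once the base systemId is a URI, a relative systemId can be resolved
# with ordinary URI resolution rules.
base = filename_to_system_id('/data/docs/main.xml')
resolved = urljoin(base, 'entities/chars.ent')

spaced = filename_to_system_id('/my docs/a.xml')
```

If the base were kept as a bare filename, the resolution step would have to guess at platform path rules, which is the ambiguity the URI-only approach avoids.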


> However as I have not played much with SAX I'm hesitant to drop patches 
> to sourceforge just yet.

I think it's a good idea and worth an attempted patch, if you have the
cycles to work on one.  We can work out any kinks here.


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com

From uche.ogbuji at fourthought.com  Tue Aug 10 03:02:25 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Tue Aug 10 03:02:29 2004
Subject: [XML-SIG] XML Unicode and UTF-8
In-Reply-To: <20040805110455.ZZZP7107.mta01-svc.ntlworld.com@[10.137.100.68]>
References: <20040805110455.ZZZP7107.mta01-svc.ntlworld.com@[10.137.100.68]>
Message-ID: <1092099745.810.128.camel@borgia>

On Thu, 2004-08-05 at 05:03, n.youngman@ntlworld.com wrote:
> > 
> > From: "Martin v. Löwis" <martin@v.loewis.de>
> > Date: 2004/08/05 Thu AM 10:41:59 GMT
> > To: n.youngman@ntlworld.com
> > CC: xml-sig@python.org
> > Subject: Re: [XML-SIG] XML Unicode and UTF-8
> 
> <SNIP>
> 
> > State all the information that you have, preferably in the form:
> > 1. this is what I did
> > 2. this is what happened
> > 3. this is what I expected to happen instead.
> 
> Well, I was trying to state the problem and not impose my own preconceptions of how it should be done, but if you want to go straight into debugging that's fine with me.

The information in your first message was essentially useless for anyone
trying to understand your problem.  I couldn't make heads or tails of it
either.  Martin told you exactly what data we need in order to help
you.  Please take note and heed his advice when you post for help here
(and probably any other forum).


> First Pass:
> 
>                 segment_tag.appendChild( charset_tag )
>                 unicode_tag = doc.createElement( 'unicode' )

You should use Unicode objects in DOM update operations (u'unicode').


>                 unicode_tag.appendChild( doc.createTextNode( segment[0] ) )
>                 segment_tag.appendChild( unicode_tag )
> 
> Inserts binary data into the segment/unicode tag

Binary data?!?

> Saving with 
> 
>     XMLFILE = open( filename, "w" )
> 
>     xml.documentElement.writexml( XMLFILE, indent="", addindent="", newl="")
> 
>     XMLFILE.close()
> 
> Leaves binary data in the document. I have assumed that this was raw Unicode; maybe that's a flawed assumption? 

You still haven't provided enough information.  What is this "binary
data"?  What exactly are the values of the variables in the above code
snippets?


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com

From uche.ogbuji at fourthought.com  Tue Aug 10 03:11:11 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Tue Aug 10 03:11:26 2004
Subject: [XML-SIG] XML Unicode and UTF-8
In-Reply-To: <20040805122422.FVRU7107.mta01-svc.ntlworld.com@[10.137.100.68]>
References: <20040805122422.FVRU7107.mta01-svc.ntlworld.com@[10.137.100.68]>
Message-ID: <1092100271.810.135.camel@borgia>

It looks as if I should have read the whole thread before posting. 
Martin's been a great help, but I still have a couple of observations.

On Thu, 2004-08-05 at 06:22, n.youngman@ntlworld.com wrote:
> OK. I read the opaque documentation^W^W fine manual for a while, then googled for a while, and finally decided to just hack about with what I had.

I personally think the Python/Unicode docs are pretty good, but Unicode
is *hard*.  No getting around that.


> I now have
> 
>     charset_tag.appendChild( doc.createTextNode( segment[1] ) )
>     unicode = segment[0].decode( segment[1] ).encode( "utf-8")
>     unicode_tag = doc.createElement( 'unicode' )
>     unicode_tag.appendChild( doc.createTextNode( unicode ) )


I wouldn't use "unicode" as a variable name if I were you, since it's a
built-in in Python 2.2 and up.

I suggest

    unicode_tag = doc.createElement( u'unicode' )

rather than

    unicode_tag = doc.createElement( 'unicode' )

Remember that XML element and attribute names are also (a subset of)
Unicode, even though they're a smaller subset than that of character
data.
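
A minimal sketch of the overall pattern, written for a current Python 3 and the standard library's minidom (not the 2004-era setup in the thread): decode the raw bytes once, hand only text to the DOM, and choose the output encoding when serializing. The function name build_segment_doc and the sample bytes are assumptions made for the example.

```python
from xml.dom import minidom


def build_segment_doc(raw_bytes, charset):
    """Build a small document from raw bytes in a known charset (illustrative)."""
    # Decode exactly once; from here on the DOM only ever sees text.
    text = raw_bytes.decode(charset)

    doc = minidom.getDOMImplementation().createDocument(None, 'segment', None)

    charset_tag = doc.createElement('charset')
    charset_tag.appendChild(doc.createTextNode(charset))
    doc.documentElement.appendChild(charset_tag)

    unicode_tag = doc.createElement('unicode')
    unicode_tag.appendChild(doc.createTextNode(text))
    doc.documentElement.appendChild(unicode_tag)
    return doc


# 'caf\xe9' is "café" in ISO-8859-1; the sample value is hypothetical.
doc = build_segment_doc(b'caf\xe9', 'iso-8859-1')

# Encoding happens only at the output boundary.
xml_bytes = doc.toxml(encoding='utf-8')
```

Encoding to UTF-8 before calling createTextNode, as in the quoted snippet, mixes the two steps; keeping the encode at the serialization boundary is the less error-prone arrangement.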


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com

From markus.jostock at softwareag.com  Tue Aug 10 11:59:16 2004
From: markus.jostock at softwareag.com (Markus Jostock)
Date: Tue Aug 10 11:58:18 2004
Subject: [XML-SIG] DOM seems incomplete
Message-ID: <41189C74.8050902@softwareag.com>

Hi

I am parsing a string into a DOM. That works without problems. But when 
I want to access children of the first element, there seem to be none. 
But pretty printing shows them.

Maybe you have an idea what might be going wrong?

Thanks in advance for some clues.

Kind regards
    Markus


The string I parse:
string = ('<MYXML><DOCUMENT><DOCAT INFO="" '
          'STATUS="PRV"><DOCAT.HEAD.LK><LINK DOC="!NEW!" '
          '/></DOCAT.HEAD.LK><RESAT.LK><LINK DOC="!NEW!" '
          '/></RESAT.LK></DOCAT></DOCUMENT></MYXML>')

Parsing works without errors:
    from xml.dom.ext.reader import Sax2
    reader = Sax2.Reader()
    doc = reader.fromString(string)

When I pretty print it, it looks ok:
    from xml.dom.ext import PrettyPrint
    PrettyPrint(doc)
prints:
<?xml version='1.0' encoding='UTF-8'?>
<MYXML>
    <DOCUMENT>
        <DOCAT INFO='' STATUS='PRV'>
            <DOCAT.HEAD.LK>
                <LINK DOC='!NEW!'/>
            </DOCAT.HEAD.LK>
            <RESAT.LK>
                <LINK DOC='!NEW!'/>
            </RESAT.LK>
        </DOCAT>
    </DOCUMENT>
</MYXML>

Accessing doc.firstChild is ok:
print doc.firstChild.nodeName  prints MYXML

But if I want to access further children of <MYXML>, there are none:
print doc.firstChild.nodeList prints <NodeList at c43968: []> or
print doc.firstChild.firstChild prints None

Where are my children gone?

From aconrad.tlv at magic.fr  Tue Aug 10 13:45:51 2004
From: aconrad.tlv at magic.fr (Alexandre CONRAD)
Date: Tue Aug 10 13:45:54 2004
Subject: [Fwd: Re: [XML-SIG] DOM seems incomplete]
Message-ID: <4118B56F.30505@magic.fr>

Forgot to send to the list...

-------- Original Message --------
Subject: Re: [XML-SIG] DOM seems incomplete
Date: Tue, 10 Aug 2004 12:40:22 +0200
From: Alexandre CONRAD <aconrad.tlv@magic.fr>
To: Markus Jostock <markus.jostock@softwareag.com>
References: <41189C74.8050902@softwareag.com>


Markus Jostock wrote:
> Hi
> 
> I am parsing a string into a DOM. That works without problems. But when 
> I want to access children of the first element, there seem to be none. 
> But pretty printing shows them.
> 
> Maybe you have an idea what might be going wrong?
> 
> Thanks in advance for some clues.
> 
> Kind regards
>    Markus
> 
> 
> The string I parse:
> string = '<MYXML><DOCUMENT><DOCAT INFO="" 
> STATUS="PRV"><DOCAT.HEAD.LK><LINK DOC="!NEW!" 
> /></DOCAT.HEAD.LK><RESAT.LK><LINK DOC="!NEW!" 
> /></RESAT.LK></DOCAT></DOCUMENT></MYXML>'
> 
> Parsing works without errors:
>    from xml.dom.ext.reader import Sax2
>    reader = Sax2.Reader()
>    doc = reader.fromString(string)
> 
> When I pretty print it, it looks ok:
>    from xml.dom.ext import PrettyPrint
>    PrettyPrint(doc)
> prints:
> <?xml version='1.0' encoding='UTF-8'?>
> <MYXML>
>    <DOCUMENT>
>        <DOCAT INFO='' STATUS='PRV'>
>            <DOCAT.HEAD.LK>
>                <LINK DOC='!NEW!'/>
>            </DOCAT.HEAD.LK>
>            <RESAT.LK>
>                <LINK DOC='!NEW!'/>
>            </RESAT.LK>
>        </DOCAT>
>    </DOCUMENT>
> </MYXML>
> 
> Accessing doc.firstChild is ok:
> print doc.firstChild.nodeName  prints MYXML
> 
> But if I want to access further children of <MYXML>, there are none:
> print doc.firstChild.nodeList prints <NodeList at c43968: []> or
> print doc.firstChild.firstChild prints None
> 
> Where are my children gone?

Because you are PrettyPrint'ing, it parses newlines and whitespace
(indentation) as text nodes. Try

'print doc.firstChild.firstChild.firstChild'. You should find your node
(I think, maybe you'll have to add one more firstChild).

In my case, I want to keep the xml file PrettyPrint'ed. So what I do is
that I parse the PrettyPrint'ed file and strip out new lines and
whitespaces before I do anything to it :

def openDoc(self, xml_file):
     # Create Reader object
     reader = Sax2.Reader()
     # Parse the document
     doc = reader.fromStream(xml_file)
     # Strip out white spaces from doc
     xml.dom.ext.StripXml(doc)
     return doc

Now, I can play around with my 'doc' without worrying about whitespaces.
When I write it back on disk, I pretty print it again :

def write_xml(self, doc, xml_file):
     # Open XML file in write mode
     f = open(xml_file, "w")
     # Write doc pretty printed to file (PrettyPrint writes to the
     # stream it is given and returns nothing)
     xml.dom.ext.PrettyPrint(doc, f)
     # Close file
     f.close()
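
For readers without PyXML, roughly the same strip-then-pretty-print round trip can be sketched with the standard library's minidom. The helper strip_whitespace below is a hand-written stand-in written for this example, not PyXML's StripXml, and the sample document is made up.

```python
from xml.dom import Node, minidom


def strip_whitespace(node):
    """Remove whitespace-only text nodes below node (illustrative stand-in)."""
    # Iterate over a copy, because children are removed while looping.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_whitespace(child)
    return node


pretty_source = """<MYXML>
    <DOCUMENT>
        <DOCAT STATUS="PRV"/>
    </DOCUMENT>
</MYXML>"""

doc = minidom.parseString(pretty_source)

# Before stripping, the first child of the root is an indentation text node.
type_before = doc.documentElement.firstChild.nodeType

strip_whitespace(doc)

# After stripping, the first child is the DOCUMENT element itself.
name_after = doc.documentElement.firstChild.nodeName

# Re-indent only when writing the document back out.
pretty_again = doc.toprettyxml(indent='    ')
```

Stripping before any tree manipulation keeps firstChild and nextSibling pointing at elements, and the indentation is reintroduced only at serialization time.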

Regards,
-- 
Alexandre CONRAD - TLV
Research & Development
tel : +33 1 30 80 55 05
fax : +33 1 30 56 55 06
6, rue de la plaine
78860 - SAINT NOM LA BRETECHE
FRANCE



From markus.jostock at softwareag.com  Tue Aug 10 14:17:55 2004
From: markus.jostock at softwareag.com (Markus Jostock)
Date: Tue Aug 10 14:16:58 2004
Subject: [XML-SIG] DOM seems incomplete
In-Reply-To: <4118B56F.30505@magic.fr>
References: <4118B56F.30505@magic.fr>
Message-ID: <4118BCF3.9030602@softwareag.com>

Hi

Thanks for the hint, but stripping whitespaces does not seem to help:

Trying to access a child node results in an exception since the child 
does not exist (i.e. it is of type 'None').

print doc.firstChild.firstChild.nodeName
causes an exception:
Traceback (most recent call last):
  File "TUsecaseCreateEmptyDoc.py", line 54, in test01
    print structure.firstChild.firstChild.nodeName
AttributeError: 'NoneType' object has no attribute 'nodeName'

Kind regards

    Markus



From and-xml at doxdesk.com  Tue Aug 10 14:23:43 2004
From: and-xml at doxdesk.com (Andrew Clover)
Date: Tue Aug 10 14:23:08 2004
Subject: [XML-SIG] DOM seems incomplete
In-Reply-To: <41189C74.8050902@softwareag.com>
References: <41189C74.8050902@softwareag.com>
Message-ID: <4118BE4F.5020504@doxdesk.com>

Markus Jostock <markus.jostock@softwareag.com> wrote:

> Accessing doc.firstChild is ok:
> print doc.firstChild.nodeName  prints MYXML

doc.firstChild is not what you might expect:

   print doc.firstChild
   <DocumentType Node at b59d50: Name='MYXML' with [no children]>

A DocumentType node happens to have the same nodeName as the root 
element, because when you say <!DOCTYPE blah []>, 'blah' must match the 
root element.

(It's a minor wart that the 4DOM parsers always create a DocumentType 
node even when no <!DOCTYPE> was declared in the source.)

> Where are my children gone?

In doc.documentElement.childNodes.
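
The difference between the two properties can be reproduced with the standard library's minidom, as a sketch. One assumption to note: minidom only creates a DocumentType node when the source actually declares one, so the sample string below adds an explicit <!DOCTYPE MYXML> to mimic what the 4DOM reader does implicitly.

```python
from xml.dom import Node, minidom

# Shortened, hypothetical version of the document from the question,
# with an explicit DOCTYPE so that minidom creates a DocumentType node.
source = ('<!DOCTYPE MYXML>'
          '<MYXML><DOCUMENT><DOCAT STATUS="PRV"/></DOCUMENT></MYXML>')
doc = minidom.parseString(source)

# The first child of the document is the DocumentType node, which has
# the same nodeName as the root element but never has children.
first = doc.firstChild
first_is_doctype = first.nodeType == Node.DOCUMENT_TYPE_NODE

# The root element is reached through documentElement.
root = doc.documentElement
child_names = [child.nodeName for child in root.childNodes]
```

Using documentElement instead of firstChild sidesteps the question of whether a doctype, comment or processing instruction happens to precede the root element.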

-- 
Andrew Clover
mailto:and@doxdesk.com
http://www.doxdesk.com/
From and at doxdesk.com  Tue Aug 10 14:26:24 2004
From: and at doxdesk.com (Andrew Clover)
Date: Tue Aug 10 14:25:49 2004
Subject: [XML-SIG] saxutils bug (was: value error when parsing XML)
In-Reply-To: <1092098911.810.120.camel@borgia>
References: <410FDF31.1070809@doxdesk.com> <1092098911.810.120.camel@borgia>
Message-ID: <4118BEF0.2040006@doxdesk.com>

Uche Ogbuji <uche.ogbuji@fourthought.com> wrote:

> I think it's a good idea and worth an attempted patch, if you have the
> cycles to work on one.

Okay. SF Patch 1005669 is a first bash, works for me.

cheers,

-- 
Andrew Clover
mailto:and@doxdesk.com
http://www.doxdesk.com/
From markus.jostock at softwareag.com  Tue Aug 10 14:48:53 2004
From: markus.jostock at softwareag.com (Markus Jostock)
Date: Tue Aug 10 14:47:55 2004
Subject: [XML-SIG] DOM seems incomplete
In-Reply-To: <4118BE4F.5020504@doxdesk.com>
References: <41189C74.8050902@softwareag.com> <4118BE4F.5020504@doxdesk.com>
Message-ID: <4118C435.7030400@softwareag.com>

Andrew Clover wrote:

> doc.firstChild is not what you might expect:
>
>   print doc.firstChild
>   <DocumentType Node at b59d50: Name='MYXML' with [no children]>

Now that's interesting! And exactly what I see too.

>> Where are my children gone?
>
>
> In doc.documentElement.childNodes.

You are right! I found them exactly there :-D

I would never have found this myself.
Thanks a lot!

    Markus
From mike at skew.org  Tue Aug 10 18:44:03 2004
From: mike at skew.org (Mike Brown)
Date: Tue Aug 10 18:44:03 2004
Subject: [XML-SIG] DOM seems incomplete
In-Reply-To: <4118BE4F.5020504@doxdesk.com> "from Andrew Clover at Aug 10, 2004
	09:23:43 pm"
Message-ID: <200408101644.i7AGi38f003913@chilled.skew.org>

Andrew Clover wrote:
> A DocumentType node happens to have the same nodeName as the root 
> element, because when you say <!DOCTYPE blah []>, 'blah' must match the 
> root element.

That's not always true; the name in the DOCTYPE only has to match the name of 
the root element if you are validating. (It's a Validity Constraint, not a 
matter of well-formedness.)
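
A quick way to see this is with a non-validating parser, for example expat as used by the standard library's minidom. The document below is a made-up two-element example.

```python
from xml.dom import minidom

# The DOCTYPE names 'foo' but the root element is 'bar'. A non-validating
# parser accepts this, because the name match is a validity constraint only.
doc = minidom.parseString('<!DOCTYPE foo><bar/>')

doctype_name = doc.doctype.name
root_name = doc.documentElement.tagName
```

A validating parser would be expected to report the mismatch, but well-formedness checking alone does not.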
From darabi at m-creations.com  Thu Aug 12 11:42:02 2004
From: darabi at m-creations.com (Kambiz Darabi)
Date: Thu Aug 12 11:42:06 2004
Subject: [XML-SIG] Update link on web page
Message-ID: <HAEHKFFFJEMKJLALBLANEEFJDIAA.darabi@m-creations.com>

Hello,

on http://pyxml.sourceforge.net/topics/docs.html

the link "Writing an application for a SAX-compliant XML parser"

points to

http://www.hobby.nl/~scaprea/XML/

which redirects to 

http://www.leverkruid.nl/XML/index.html

and from this page, there is a link to the target article.


Maybe you would like to update the link.


... 


or maybe not


Greetings

Kambiz


From mehdi.hashemian at spirentcom.com  Thu Aug 12 18:34:30 2004
From: mehdi.hashemian at spirentcom.com (Hashemian, Mehdi)
Date: Thu Aug 12 18:34:38 2004
Subject: [XML-SIG] Missing encoding attribute
Message-ID: <629E717C12A8694A88FAA6BEF9FFCD44034BD296@brigadoon.spirentcom.com>

My problem: when creating a new XML document, my output document is missing
the "encoding" attribute:
<?xml version="1.0" ?> instead of <?xml version="1.0" encoding="UTF-8"?>
 
Linux RedHat 9.0
Python 2.2.2
 
import xml.dom.minidom
impl = xml.dom.minidom.getDOMImplementation()
newDoc = impl.createDocument(None, u'mytag', None)
 
Questions:
 
Is there support for an encoding argument for toxml and toprettyxml in minidom?
(Does not look like it is supported in 2.2.2)
Is there any other way (other than creating a wrapper around these
functions) to solve this problem?
 
Thanks!
Mehdi
From and-xml at doxdesk.com  Fri Aug 13 11:02:21 2004
From: and-xml at doxdesk.com (Andrew Clover)
Date: Fri Aug 13 11:01:44 2004
Subject: [XML-SIG] Missing encoding attribute
In-Reply-To: <629E717C12A8694A88FAA6BEF9FFCD44034BD296@brigadoon.spirentcom.com>
References: <629E717C12A8694A88FAA6BEF9FFCD44034BD296@brigadoon.spirentcom.com>
Message-ID: <411C839D.3000803@doxdesk.com>

Mehdi Hashemian <mehdi.hashemian@spirentcom.com> wrote:

> Is there support for encoding argument for toxml and toprettyxml in minidom?
> (Does not look like it is supported in 2.2.2)

It is in 2.3 onwards, and reasonably recent PyXML versions. Earlier 
versions don't do character encoding, you always get Unicode strings out.

(Note, you still can't encode to a character set which doesn't include 
all characters used in content; minidom will currently produce an error 
rather than trying to escape unencodable characters with character 
references.)
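
For reference, this is roughly what the encoding argument looks like in use. The sketch is written against a current Python 3 minidom, which is an assumption on top of the thread: details such as the return type (bytes when an encoding is given) and the handling of unencodable characters differ from the 2.2/2.3-era versions discussed above.

```python
import xml.dom.minidom

impl = xml.dom.minidom.getDOMImplementation()
newDoc = impl.createDocument(None, 'mytag', None)

# Without an encoding argument the XML declaration carries no encoding
# and the result is a text string.
without_decl = newDoc.toxml()

# With an encoding argument the declaration names the encoding and the
# result is an encoded byte string.
with_decl = newDoc.toxml(encoding='UTF-8')
```

toprettyxml accepts the same encoding argument, so no wrapper is needed on versions that support it.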

-- 
Andrew Clover
mailto:and@doxdesk.com
http://www.doxdesk.com/
From matt.price at utoronto.ca  Sat Aug 14 03:49:20 2004
From: matt.price at utoronto.ca (Matt Price)
Date: Sat Aug 14 03:49:26 2004
Subject: [XML-SIG] xslt/parameters
Message-ID: <20040814014920.GA10691@utoronto.ca>

Can someone out there tell me how I pass a parameter value to an xsl
stylesheet in python?  Right now I have the following couple lines of
code, more or less stolen from somewhere since I'm still pretty much at
sea with xml:  

    styledoc = libxml2.parseFile(xsl) 
    style = libxslt.parseStylesheetDoc(styledoc) 
    doc = libxml2.parseDoc(risxSet) 
    result = style.applyStylesheet(doc, None) 
    htmlString = style.saveResultToString(result) 

xsl is of course a variable which references a stylesheet.  The
stylesheet has a parameter setting like this:

<xsl:param name="mainTarget">http://localhost/refdb-client/index.py</xsl:param>

I'd like to pass the parameter to the stylesheet in the above code.
Can this be done in a straightforward way?  I get the impression I
should use the class libxslt.xpathParserContext(), but I really don't
understand how it's supposed to work!  I much appreciate any pointers.
thanks,

matt

-------------------------------------------
Matt Price	    matt.price@utoronto.ca
History Department, University of Toronto
(416) 978-2094
--------------------------------------------
From veillard at redhat.com  Sat Aug 14 11:17:12 2004
From: veillard at redhat.com (Daniel Veillard)
Date: Sat Aug 14 11:18:05 2004
Subject: [XML-SIG] xslt/parameters
In-Reply-To: <20040814014920.GA10691@utoronto.ca>
References: <20040814014920.GA10691@utoronto.ca>
Message-ID: <20040814091712.GN5127@redhat.com>

On Fri, Aug 13, 2004 at 09:49:20PM -0400, Matt Price wrote:
> Can someone out there tell me how I pass a parameter value to an xsl
> stylesheet in python?  Right now I have the following couple lines of
> code, more or less stolen from somewhere since I'm still pretty much at
> sea with xml:  
> 
>     styledoc = libxml2.parseFile(xsl) 
>     style = libxslt.parseStylesheetDoc(styledoc) 
>     doc = libxml2.parseDoc(risxSet) 
>     result = style.applyStylesheet(doc, None) 
>     htmlString = style.saveResultToString(result) 
> 
> xsl is of course a variable which references a stylesheet.  The
> stylesheet has a  parameter setting like this:
> 
> <xsl:param name="mainTarget">http://localhost/refdb-client/index.py</xsl:param>
> 
> I'd like to pass the parameter to the stylesheet in the above code.
> Can this be done in a straightforward way?  I get the impression I
> should use the class libxslt.xpathParserContext(), but I really don't
> understand how it's supposed to work!  I much appreciate any pointers.
> thanks,

  You're using libxml2/libxslt in that context, so you'd better ask for help
in the right channel:
     http://xmlsoft.org/XSLT/bugs.html

  The parameters to the transformation are passed as a dictionary
to applyStylesheet(): instead of passing None, pass the dictionary
containing the (name, value) pairs for all parameters.
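
  A minimal sketch of that call, under some assumptions: xpath_string()
and transform() are hypothetical helper names, not part of libxslt, and
the quoting step assumes that libxslt evaluates each parameter value as
an XPath expression, so a plain string has to be wrapped in quotes to
arrive as a string. The libxml2/libxslt imports are deferred so the
quoting helper can be tried without the bindings installed.

```python
def xpath_string(value):
    """Quote a Python string as an XPath 1.0 string literal."""
    if "'" not in value:
        return "'%s'" % value
    if '"' not in value:
        return '"%s"' % value
    # XPath 1.0 has no escape syntax; such values would need concat().
    raise ValueError("value contains both kinds of quote characters")


def transform(xsl_path, xml_text, **string_params):
    """Apply the stylesheet at xsl_path to xml_text with string parameters."""
    # Imported here so xpath_string() stays usable without the bindings.
    import libxml2
    import libxslt

    styledoc = libxml2.parseFile(xsl_path)
    style = libxslt.parseStylesheetDoc(styledoc)
    doc = libxml2.parseDoc(xml_text)
    # Parameter name -> quoted value, e.g. {"mainTarget": "'http://...'"}.
    params = dict((name, xpath_string(value))
                  for name, value in string_params.items())
    result = style.applyStylesheet(doc, params)
    try:
        return style.saveResultToString(result)
    finally:
        style.freeStylesheet()
        doc.freeDoc()
        result.freeDoc()
```

  With the code from the question this would be called along the lines
of transform(xsl, risxSet, mainTarget="http://localhost/refdb-client/index.py").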

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard@redhat.com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
From abra9823 at mail.usyd.edu.au  Sun Aug 15 05:06:38 2004
From: abra9823 at mail.usyd.edu.au (Ajay)
Date: Sun Aug 15 05:06:43 2004
Subject: [XML-SIG] import node into document
Message-ID: <1092539198.411ed33e6a3c2@www-mail.usyd.edu.au>

hi!

I have two documents 'policy' and 'dataschema'.
how can i add a node (say, noded) from 'dataschema' as a child of a
particular node in 'policy' (say, nodep)?
java has importNode; is there an equivalent function in Python? if not, how
do i go about doing it?
just doing nodep.appendChild(noded) throws an error saying they are from
different documents.
doing noded.ownerDocument = nodep.ownerDocument also throws an error saying
ownerDocument is a read-only attribute.

how do i then do the import?
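
For reference, a minimal sketch of the DOM Level 2 way to do this,
assuming the standard library's xml.dom.minidom rather than 4DOM; the
element names and the two tiny documents are made up for illustration.
importNode() is called on the target document and returns a copy that
belongs to it.

```python
from xml.dom.minidom import parseString

# Hypothetical stand-ins for the 'policy' and 'dataschema' documents.
policy = parseString("<POLICY><STATEMENT/></POLICY>")
dataschema = parseString("<DATASCHEMA><DATA-DEF name='user.name'/></DATASCHEMA>")

nodep = policy.getElementsByTagName("STATEMENT")[0]
noded = dataschema.getElementsByTagName("DATA-DEF")[0]

# importNode copies noded into the policy document (deep=True copies its
# subtree as well); the original node stays in dataschema.
imported = policy.importNode(noded, True)
nodep.appendChild(imported)
```

Since importNode is part of DOM Level 2, other DOM implementations
should offer the same call, but that is not verified here.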

thanks

cheers


----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.
From abra9823 at mail.usyd.edu.au  Sat Aug 14 13:55:13 2004
From: abra9823 at mail.usyd.edu.au (Ajay)
Date: Sun Aug 15 22:46:10 2004
Subject: [XML-SIG] python and XML resources
Message-ID: <1092484513.411dfda120873@www-mail.usyd.edu.au>

hi!

does anyone know of good online resources on XML processing in Python? I am
using the PyXML package and have read the introductory XML HOWTO.
what i am looking for is more detailed and comprehensive coverage of the
entire package - all the classes and functions etc

cheers
ajay


----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.
From abra9823 at mail.usyd.edu.au  Mon Aug 16 08:54:50 2004
From: abra9823 at mail.usyd.edu.au (Ajay)
Date: Mon Aug 16 08:54:57 2004
Subject: [XML-SIG] namespace error - how to ignore
Message-ID: <1092639290.41205a3a30528@www-mail.usyd.edu.au>

hi!

i have the following code to create a document
ssock = StringIO.StringIO(inputString)
reader = Sax2.Reader()
doc = reader.fromStream(ssock)

input string simply contains <appel:RULE></appel:RULE>
when i run it, it throws a namespace error. i can understand where the
error is coming from (i haven't defined the namespace), but is there a way
to get past it? to get it to ignore the namespace?
the same thing in Java works fine (without worrying about the namespace).

thanks

cheers


----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.
From abra9823 at mail.usyd.edu.au  Mon Aug 16 17:45:50 2004
From: abra9823 at mail.usyd.edu.au (Ajay)
Date: Mon Aug 16 17:45:56 2004
Subject: [XML-SIG] help - attributes namespace - is this a bug in PyXML
Message-ID: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au>

hi!

for the XML
<appel:RULESET xmlns:appel="http://www.w3.org/2001/02/appelv1"
xmlns:p3p="http://www.w3.org/2000/12/p3pv1">
<appel:RULE prompt="no">
<p3p:POLICY>
  <p3p:ACCESS appel:connective="non-and">
    <p3p:all/>
  </p3p:ACCESS>
</p3p:POLICY>
</appel:RULE>

if i get up to the "ACCESS" element and print its attribute name and value
using
if attribs != None and len(attribs) > 0:
    index = 0
    while index < attribs.length:
        print "attribute ", index, ": ", attribs.item(index).nodeName, \
              " has value: ", attribs.item(index).nodeValue
        index += 1

it prints ACCESS having the attribute "appel:connective" with the value
"non-and"
the statement attribs.getNamedItem("appel:connective") however returns
None.
now i think it's substituting the namespace for appel but then how would you
access the attribute, just 'connective' doesn't work, 'appel:connective'
doesn't either and http://www.w3.org/2001/02/appelv1:connective doesn't
work either.

thanks

cheers

--
Ajay Brar,
CS Honours 2004
Smart Internet Technology Research Group


----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.
From nhs at llnl.gov  Mon Aug 16 17:48:05 2004
From: nhs at llnl.gov (Norm Samuelson)
Date: Mon Aug 16 17:48:10 2004
Subject: [XML-SIG] Re: XML-SIG Digest, Vol 16, Issue 18
In-Reply-To: <20040814100006.1ABC71E4002@bag.python.org>
References: <20040814100006.1ABC71E4002@bag.python.org>
Message-ID: <6.0.0.22.2.20040816083955.031bc068@mail.llnl.gov>

At 03:00 AM 8/14/2004, you wrote:
>Date: Fri, 13 Aug 2004 21:49:20 -0400
>From: Matt Price <matt.price@utoronto.ca>
>Subject: [XML-SIG] xslt/parameters
>To: python xml SIG <xml-sig@python.org>
>Message-ID: <20040814014920.GA10691@utoronto.ca>
>Content-Type: text/plain; charset=us-ascii
>
>Can someone out there tell me how I pass a parameter value to an xsl
>stylesheet in python?  Right now I have the following couple lines of
>code, more or less stolen from somewhere since I'm still pretty much at
>sea with xml:
>
>     styledoc = libxml2.parseFile(xsl)
>     style = libxslt.parseStylesheetDoc(styledoc)
>     doc = libxml2.parseDoc(risxSet)
>     result = style.applyStylesheet(doc, None)
>     htmlString = style.saveResultToString(result)
>
>xsl is of course a variable which references a stylesheet.  The
>stylesheet has a  parameter setting like this:
>
><xsl:param 
>name="mainTarget">http://localhost/refdb-client/index.py</xsl:param>
>
>I'd like to pass the parameter to the stylesheet in the above code.
>Can this be done in a straightforward way?  I get the impression I
>should use the class libxslt.xpathParserContext(), but I really don't
>understand how it's supposed to work!  I much appreciate any pointers.
>thanks,
>
>matt

I have one xsl stylesheet that uses a param.  I use the stand-alone xalan 
xslt processor.  On the command line that starts xalan, I pass a number of 
arguments (input file name, output file name, stylesheet name, etc) also 
including the following three tokens:
    -param targetCode ale3d
The first of those signals that I'm setting a param, the second is the name 
of the param (as in the <xsl:param ...> tag), and the third is the value to 
replace the default value given in the text under that tag.

Of course, if you are not using a stand-alone version you will need to find 
a way to pass the params, but if you follow the logic of the stand-alone 
version it should become obvious how to do it.

- Norm -

Norman H. Samuelson                         nhs@llnl.gov
Lawrence Livermore National Lab      925-422-0661
P.O. Box 808, L-98
Livermore, CA 94551  

From abra9823 at mail.usyd.edu.au  Mon Aug 16 18:44:10 2004
From: abra9823 at mail.usyd.edu.au (Ajay)
Date: Mon Aug 16 18:44:13 2004
Subject: [XML-SIG] help - attributes namespace - is this a bug in PyXML
In-Reply-To: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au>
References: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au>
Message-ID: <1092674650.4120e45a4555c@www-mail.usyd.edu.au>

also getAttribute("appel:connective") returns " ", ie it is not None but
when i print it out that's what i get
funnily getAttribute("appel:connective") for an element that doesn't have
the attribute "appel:connective" still passes the test
if element.getAttribute("appel:connective") != None

so how can i retrieve an attribute of type "appel:connective", ie, prefixed
by the uri appel
and getAttributeNS doesn't work either. same as for getAttribute


Quoting Ajay <abra9823@mail.usyd.edu.au>:

> hi!
>
> for the XML
> <appel:RULESET xmlns:appel="http://www.w3.org/2001/02/appelv1"
> xmlns:p3p="http://www.w3.org/2000/12/p3pv1">
> <appel:RULE prompt="no">
> <p3p:POLICY>
>   <p3p:ACCESS appel:connective="non-and">
>     <p3p:all/>
>   </p3p:ACCESS>
> </p3p:POLICY>
> </appel:RULE>
>
> if i getupto the "ACCESS" element and print its attribute name and value
> using
> if attribs != None and len(attribs) > 0:
>     index = 0
>     while index < attribs.length:
>         print "attribute ", index, ": ", attribs.item(index).nodeName, \
>               " has value: ", attribs.item(index).nodeValue
>         index += 1
>
> it prints ACCESS having the attribute "appel:connective" with the value
> "non-and"
> the statement attribs.getNamedItem("appel:connective") however returns
> None.
> now i think its substituting the namespace for appel but then how would
> you
> access the attribute, just 'connective' doesn't work, 'appel:connective'
> doesn't either and http://www.w3.org/2001/02/appelv1:connective doesn't
> work either.
>
> thanks
>
> cheers
>
> --
> Ajay Brar,
> CS Honours 2004
> Smart Internet Technology Research Group
>
>
>
>
>
> ----------------------------------------------------------------
> This message was sent using IMP, the Internet Messaging Program.
> _______________________________________________
> XML-SIG maillist  -  XML-SIG@python.org
> http://mail.python.org/mailman/listinfo/xml-sig
>


----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.
From mike at skew.org  Mon Aug 16 20:08:26 2004
From: mike at skew.org (Mike Brown)
Date: Mon Aug 16 20:08:29 2004
Subject: [XML-SIG] help - attributes namespace - is this a bug in PyXML
In-Reply-To: <1092674650.4120e45a4555c@www-mail.usyd.edu.au> "from Ajay at Aug
	17, 2004 02:44:10 am"
Message-ID: <200408161808.i7GI8QTJ064187@chilled.skew.org>

Ajay wrote:
> also getAttribute("appel:connective") returns " ", ie it is not None but
> when i print it out thats what i get

I'm not very experienced with using minidom but that's surprising to me.

>>> from xml.dom.minidom import parseString
>>> inputString = '<appel:RULE appel:empty="" appel:full="hi"/>'
>>> doc = parseString(inputString)
>>> doc.childNodes[0].getAttribute('appel:connective')
''

You're right: an empty byte string is returned in that case. I would've 
expected None, too.

Given that an existing attribute results in a unicode object being returned, 
e.g.

>>> doc.childNodes[0].getAttribute('appel:empty')
u''
>>> doc.childNodes[0].getAttribute('appel:full')
u'hi'

it seems weird that '' and u'' mean different things, but I am guessing the 
intent was DOM conformance, and DOM demands that a string be returned (DOM is 
a poorly designed API, by the way), and minidom's implementation is probably 
supposed to return u'' in both cases. Therefore you should not be using
getAttribute()/getAttributeNS() to test for existence of an attribute.

What you should be doing is using hasAttribute or hasAttributeNS. The fact 
that these methods are not documented at 
http://www.python.org/doc/2.3.4/lib/dom-element-objects.html is a 
documentation bug.
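
A minimal sketch of the existence test, assuming the standard library's
minidom; the xmlns:appel declaration is added here so that the test
string parses with a namespace-aware parser. The value lookup gives the
same answer for a missing attribute and an empty one, while the has*
call tells them apart.

```python
from xml.dom.minidom import parseString

APPEL_NS = "http://www.w3.org/2001/02/appelv1"
doc = parseString(
    '<appel:RULE xmlns:appel="%s" appel:empty="" appel:full="hi"/>' % APPEL_NS
)
rule = doc.documentElement

# getAttributeNS cannot tell "missing" from "present but empty" ...
empty_value = rule.getAttributeNS(APPEL_NS, "empty")
missing_value = rule.getAttributeNS(APPEL_NS, "connective")

# ... whereas hasAttributeNS answers the existence question directly.
has_empty = rule.hasAttributeNS(APPEL_NS, "empty")
has_missing = rule.hasAttributeNS(APPEL_NS, "connective")
```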

> funnily getAttribute("appel:connective") for an element thats doesn't have
> the attribute "appel:connective" still passes the test
> if element.getAttribute("appel:connective") != None

Per PEP 8 (coding style guide on python.org) always use "is None" or "is not 
None" rather than "== None" or "!= None".

Again, a simple test shows why:

>>> '' != None
True
>>> '' == None
False

> so how can i retrieve an attribute of type "appel:connective", ie, prefixed
> by the uri appel
> and getAttributeNS doesn't work either. same as for getAttribute

I think you realize this, but appel is not a URI, it is a prefix. 
http://www.w3.org/2001/02/appelv1 is a URI. (Well, technically, I think folks 
are now saying that if it's being used as a namespace name, then it's not a 
URI, it's just a string that is required to match the URI syntax)

Anyway, again, you're right, and I'd offer the same explanation as for
getAttribute().

>>> doc.childNodes[0].getAttributeNS('http://www.w3.org/2001/02/appelv1', 'connective')
''

From and at doxdesk.com  Tue Aug 17 03:44:53 2004
From: and at doxdesk.com (Andrew Clover)
Date: Tue Aug 17 03:44:17 2004
Subject: [XML-SIG] help - attributes namespace - is this a bug in PyXML
In-Reply-To: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au>
References: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au>
Message-ID: <41216315.5080801@doxdesk.com>

Ajay <abra9823@mail.usyd.edu.au> wrote:

> the statement attribs.getNamedItem("appel:connective") however returns
> None.

Oh dear me. This is issue 20 from:

   http://pyxml.sourceforge.net/topics/compliance.html

Which I believed had been fixed in PyXML 0.7, but apparently not; 
certainly I can see the problem again in 0.8.3.

Using namespace-unaware methods to access attributes which have 
namespaces just doesn't seem to work in 4DOM. That's quite bad really.

> now i think its substituting the namespace for appel but then how would you
> access the attribute, just 'connective' doesn't work, 'appel:connective'
> doesn't either and http://www.w3.org/2001/02/appelv1:connective doesn't
> work either.

You'd need one of the DOM Level 2 namespace-aware methods for this:

attrs.getNamedItemNS('http://www.w3.org/2001/02/appelv1', 'connective')
element.getAttributeNS('http://www.w3.org/2001/02/appelv1', 'connective')

Alternatively both minidom and pxdom do a bit better with namespaces in 
general and allow access to DOM Level 1 and 2 methods at the same time. 
Is there a particular feature of 4DOM you need?
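
As a rough illustration with minidom (an assumption; the original
report was against 4DOM), here is the document from the question with
the closing </appel:RULESET> tag added so that it parses. Both the
Level 2 and the Level 1 accessors are expected to find the attribute.

```python
from xml.dom.minidom import parseString

APPEL_NS = "http://www.w3.org/2001/02/appelv1"
P3P_NS = "http://www.w3.org/2000/12/p3pv1"

doc = parseString(
    '<appel:RULESET xmlns:appel="%s" xmlns:p3p="%s">'
    '<appel:RULE prompt="no"><p3p:POLICY>'
    '<p3p:ACCESS appel:connective="non-and"><p3p:all/></p3p:ACCESS>'
    '</p3p:POLICY></appel:RULE></appel:RULESET>' % (APPEL_NS, P3P_NS)
)
access = doc.getElementsByTagNameNS(P3P_NS, "ACCESS")[0]

# DOM Level 2: address the attribute by namespace URI plus local name.
by_ns = access.getAttributeNS(APPEL_NS, "connective")
node_by_ns = access.attributes.getNamedItemNS(APPEL_NS, "connective")

# DOM Level 1: minidom also resolves the qualified name as written.
by_qname = access.getAttribute("appel:connective")
```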

-- 
Andrew Clover
mailto:and@doxdesk.com
http://www.doxdesk.com/
From and at doxdesk.com  Tue Aug 17 03:56:11 2004
From: and at doxdesk.com (Andrew Clover)
Date: Tue Aug 17 03:55:36 2004
Subject: [XML-SIG] help - attributes namespace - is this a bug in PyXML
In-Reply-To: <200408161808.i7GI8QTJ064187@chilled.skew.org>
References: <200408161808.i7GI8QTJ064187@chilled.skew.org>
Message-ID: <412165BB.6010002@doxdesk.com>

Mike Brown <mike@skew.org> wrote:

> I'm not very experienced with using minidom but that's surprising to me.

Probably because Ajay isn't using minidom :-)

>   from xml.dom.minidom import parseString
>   inputString = '<appel:RULE appel:empty="" appel:full="hi"/>'
>   doc = parseString(inputString)
>   doc.childNodes[0].getAttribute('appel:connective')

>    ''

> I would've expected None

'' is correct in this case. getAttribute returns an empty string if no 
attribute is found as per DOM Level 1 spec. It is getAttributeNode that 
returns None (null) when the attribute is not found.

> it seems weird that '' and u'' mean different things

They don't. Python binds the DOMString type to strings in general, so 
both unicode and narrow strings can be used. (Though it is usually best 
to use unicode, and definitely a bad idea to be putting non-ASCII 
characters in narrow binary strings.) It just happens that minidom 
returns a narrow empty string for attribute-not-found; it could just as 
easily be u''.

> Therefore you should not be using
> getAttribute()/getAttributeNS() to test for existence of an attribute.

Indeed. This can be useful when an attribute value should act as if 
defaulting to the empty string.

> What you should be doing is using hasAttribute or hasAttributeNS.

Yep. Alternatively getAttributeNode can also do the job.
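
A small sketch of the getAttributeNode route, assuming minidom;
attribute_or_default() is a hypothetical helper name. It separates "not
set" from "set to the empty string", which getAttribute alone cannot.

```python
from xml.dom.minidom import parseString

doc = parseString('<RULE prompt=""/>')
rule = doc.documentElement


def attribute_or_default(element, name, default=None):
    """Return the attribute's value, or default if it is not set at all."""
    node = element.getAttributeNode(name)
    if node is None:
        return default
    return node.value
```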

-- 
Andrew Clover
mailto:and@doxdesk.com
http://www.doxdesk.com/
From abra9823 at mail.usyd.edu.au  Tue Aug 17 04:38:00 2004
From: abra9823 at mail.usyd.edu.au (Ajay)
Date: Tue Aug 17 04:38:07 2004
Subject: [XML-SIG] help - attributes namespace - is this a bug in PyXML
In-Reply-To: <41216315.5080801@doxdesk.com>
References: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au>
	<41216315.5080801@doxdesk.com>
Message-ID: <1092710280.41216f88ab8b8@www-mail.usyd.edu.au>

no, there isn't any particular feature of 4DOM that i need.
the problem though seems that i can't use xpath in PyXML with a document
parsed using xml.dom.minidom
the following piece of code

dataNodes = xpath.Evaluate(".//*[local-name()='DATA']", document.documentElement)

works perfectly fine when i pass in a document parsed using

document = reader.fromStream(open("test.xml", 'r'))

however when i pass a document parsed using minidom i get the following
exception

  File "C:\PYTHON23\Lib\site-packages\_xmlplus\xpath\__init__.py", line 70, in Evaluate
    retval = parser.new().parse(expr).evaluate(con)
  File "C:\PYTHON23\Lib\site-packages\_xmlplus\xpath\ParsedAbbreviatedRelativeLocationPath.py", line 52, in evaluate
    res = Set.Union(res,subRt)
  File "C:\PYTHON23\Lib\site-packages\_xmlplus\xpath\Set.py", line 25, in Union
    return compare + filter(lambda x,compare = compare:x not in compare,loop)
TypeError: can only concatenate list (not "tuple") to list


i would actually prefer using just minidom and not even have xpath. the
application may be ported to a PDA and the pythonce distribution does not
include the PyXML package.
since i use xpath to just locate node subsets, i would have to rewrite
functions to do that by just looping through the different nodes (i don't
know how hard that will be) --- is there someone who has already done
that?
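
For what it's worth, a minimal sketch of such a loop using only minidom;
descendants_by_local_name() is a hypothetical helper and the sample
document is made up. It is meant to select the same elements as the
expression .//*[local-name()='DATA'] evaluated on the document element.

```python
from xml.dom.minidom import parseString


def descendants_by_local_name(node, local_name):
    """Yield descendant elements of node whose local name matches."""
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            if child.localName == local_name:
                yield child
            # Recurse so that matches at any depth are found.
            for match in descendants_by_local_name(child, local_name):
                yield match


doc = parseString(
    '<p3p:POLICY xmlns:p3p="http://www.w3.org/2000/12/p3pv1">'
    '<p3p:DATA-GROUP><p3p:DATA ref="#user.name"/><p3p:DATA ref="#user.bdate"/>'
    '</p3p:DATA-GROUP></p3p:POLICY>'
)
data_nodes = list(descendants_by_local_name(doc.documentElement, "DATA"))

# minidom's own wildcard lookup should select the same elements.
same_nodes = doc.documentElement.getElementsByTagNameNS("*", "DATA")
```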

on the PyXML documentation page under the section on compliance issues, it
says
"Never gets the attribute - always returns false for hasAttribute, empty
string for getAttribute, or null for getAttributeNode."
funny. i should have read that before trying hours on why my calls weren't
working
efficiency and a future port to a PDA are the reasons why i didn't use
pxdom. that and being a newbie meant i knew very little about the
different packages.

thanks

cheers

> Alternatively both minidom and pxdom do a bit better with namespaces in
> general and allow access to DOM Level 1 and 2 methods at the same time.
> Is there a particular feature of 4DOM you need?
>
> --
> Andrew Clover
> mailto:and@doxdesk.com
> http://www.doxdesk.com/
> _______________________________________________
> XML-SIG maillist  -  XML-SIG@python.org
> http://mail.python.org/mailman/listinfo/xml-sig
>


----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.
From xmlsig at codeweld.com  Tue Aug 17 13:59:51 2004
From: xmlsig at codeweld.com (xmlsig@codeweld.com)
Date: Tue Aug 17 13:59:53 2004
Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
In-Reply-To: <1091095679.4108cc7f0bf70@webmail.codeweld.com>
References: <1091095679.4108cc7f0bf70@webmail.codeweld.com>
Message-ID: <1092743991.4121f33704f17@webmail.codeweld.com>

> I've python 2.3.4 on windows xp with PyXML-0.8.3.win32-py2.3
>
> This code leaks substantially
>
> from xml.dom.ext.reader.HtmlLib import FromHtml
> import urllib
> from xml.dom import ext
> s = urllib.urlopen( 'http://www.google.com' ).read()
> while True:
>     root = FromHtml( s )
>     ext.ReleaseNode( root )
>
> However, this does not ( or only very minor )
>
> from xml.dom.ext.reader.Sax2 import Reader
> import urllib
> from xml.dom import ext
> s = urllib.urlopen( 'http://www.infoworld.com/rss/reviews.xml' ).read()
> while True:
>     reader = Reader()
>     root = reader.fromString( s )
>     ext.ReleaseNode( root )
>
> Any suggestions?

Could anybody reproduce the leak?
Any suggestions what I do wrong?

From fredrik at pythonware.com  Wed Aug 18 10:08:48 2004
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed Aug 18 10:07:07 2004
Subject: [XML-SIG] Re: help - attributes namespace - is this a bug in PyXML
References: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au><41216315.5080801@doxdesk.com>
	<1092710280.41216f88ab8b8@www-mail.usyd.edu.au>
Message-ID: <cfv2n6$upv$1@sea.gmane.org>

"Ajay" wrote:

> i would actually prefer using just minidom and not even have xpath. the
> application may be ported to a PDA and the pythonce distribution does not
> include the PyXML package.
> since i use xpath to just locate node subsets, i would have to rewrite
> functions to do that by just looping through the different nodes (i don't
> know how hard that will be) --- is there someone who has already done
> that?

plug: people who work on "small platforms" are known to like the
elementtree package:

    http://effbot.org/zone/element-index.htm

elementtree has limited support for XPath:

    http://effbot.org/zone/element-xpath.htm
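
a minimal sketch of that kind of lookup. it assumes the
xml.etree.ElementTree module that later Python versions ship in the
standard library (with the separate package the import would be from
the elementtree package instead), and the sample document is made up:

```python
import xml.etree.ElementTree as ET

P3P_NS = "http://www.w3.org/2000/12/p3pv1"
root = ET.fromstring(
    '<p3p:POLICY xmlns:p3p="%s">'
    '<p3p:DATA-GROUP><p3p:DATA ref="#user.name"/></p3p:DATA-GROUP>'
    '</p3p:POLICY>' % P3P_NS
)

# ElementTree spells qualified names as "{namespace-uri}localname".
data_elements = root.findall(".//{%s}DATA" % P3P_NS)
refs = [elem.get("ref") for elem in data_elements]
```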

</F> 


From mike at seligrealtors.com  Thu Aug 19 20:19:10 2004
From: mike at seligrealtors.com (Mike Selig)
Date: Thu Aug 19 20:19:18 2004
Subject: [XML-SIG] RE: Delivery reports about your e-mail
In-Reply-To: <200408191727.CFC32779@ms7.netsolmail.com>
Message-ID: <000001c48619$063c0e40$0201a8c0@mycomputer>

I received this from you unsolicited. I'm not going to open the
attachment until I can verify the source. Please provide me more info on
who you are and how you are able to fix this problem on my computer. A
web address might also be helpful.

-----Original Message-----
From: xml-sig@python.org [mailto:xml-sig@python.org] 
Sent: Thursday, August 19, 2004 1:27 PM
To: mike@seligrealtors.com
Subject: Delivery reports about your e-mail


Dear user mike@seligrealtors.com,

Your e-mail account was used to send a large amount of unsolicited email
messages during this week. Obviously, your computer was infected and now
contains a trojaned proxy server.

We recommend that you follow our instruction in order to keep your
computer safe.

Have a nice day,
seligrealtors.com user support team.


---
Incoming mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.721 / Virus Database: 477 - Release Date: 7/16/2004
 
  
---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.721 / Virus Database: 477 - Release Date: 7/16/2004
 

From uche.ogbuji at fourthought.com  Thu Aug 19 21:34:25 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Thu Aug 19 21:34:36 2004
Subject: [XML-SIG] namespace error - how to ignore
In-Reply-To: <1092639290.41205a3a30528@www-mail.usyd.edu.au>
References: <1092639290.41205a3a30528@www-mail.usyd.edu.au>
Message-ID: <1092944065.810.1351.camel@borgia>

On Mon, 2004-08-16 at 00:54, Ajay wrote:
> hi!
> 
> i have the following code to create a document
> ssock = StringIO.StringIO(inputString)
> reader = Sax2.Reader()
> doc = reader.fromStream(ssock)
> 
> input string simply contains <appel:RULE></appel:RULE>
> when i run it, it throws a namespace error. i can understand where the
> error is coming from (i haven't defined the namespace), but is there a way
> to get past it? to get it to ignore the namespace?
> the same thing in Java works fine (without worrying about the namespace).

Sax.Reader is not namespace aware, so it should accept this.  However,
you're on the wrong track:

1) Why are you trying to parse a document that is not XML namespace
compliant?  You'll have nothing but trouble.

2) I suggest not using 4DOM (i.e. xml.dom.ext.reader)


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Decomposition, Process, Recomposition - http://www.xml.com/pub/a/2004/07/28/py-xml.html
Perspective on XML: Steady steps spell success with Google - http://www.adtmag.com/article.asp?id=9663
Managing XML libraries - http://www.adtmag.com/article.asp?id=9160
Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090
Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html
A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/

From uche.ogbuji at fourthought.com  Thu Aug 19 21:38:16 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Thu Aug 19 21:38:20 2004
Subject: [XML-SIG] help - attributes namespace - is this a bug in PyXML
In-Reply-To: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au>
References: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au>
Message-ID: <1092944296.810.1356.camel@borgia>

On Mon, 2004-08-16 at 09:45, Ajay wrote:
> hi!
> 
> for the XML
> <appel:RULESET xmlns:appel="http://www.w3.org/2001/02/appelv1"
> xmlns:p3p="http://www.w3.org/2000/12/p3pv1">
> <appel:RULE prompt="no">
> <p3p:POLICY>
>   <p3p:ACCESS appel:connective="non-and">
>     <p3p:all/>
>   </p3p:ACCESS>
> </p3p:POLICY>
> </appel:RULE>
> 
> if i getupto the "ACCESS" element and print its attribute name and value
> using
> if attribs != None and len(attribs) > 0:
>     index = 0
>     while index < attribs.length:
>         print "attribute ", index, ": ", attribs.item(index).nodeName, " has value: ", attribs.item(index).nodeValue
>         index += 1
> 
> it prints ACCESS having the attribute "appel:connective" with the value
> "non-and"
> the statement attribs.getNamedItem("appel:connective") however returns
> None.
> now i think its substituting the namespace for appel but then how would you
> access the attribute, just 'connective' doesn't work, 'appel:connective'
> doesn't either and http://www.w3.org/2001/02/appelv1:connective doesn't
> work either.

If you're accessing nodes in namespaces, you have to use the
namespace-aware APIs.  These have "NS" at the ends of their names.  Then
forget the QName.  You need

getNamedItemNS("http://www.w3.org/2001/02/appelv1", "connective")
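
For illustration, here is a minimal sketch of the namespace-aware lookup. It assumes Python 3 and the standard library's xml.dom.minidom rather than 4DOM, and it adds the closing RULESET tag that the quoted snippet omits so that the document parses. The constant names are mine, not part of any API.

```python
from xml.dom import minidom

# Namespace URIs taken from the document in the question.
APPEL_NS = "http://www.w3.org/2001/02/appelv1"
P3P_NS = "http://www.w3.org/2000/12/p3pv1"

XML = """<appel:RULESET xmlns:appel="http://www.w3.org/2001/02/appelv1"
    xmlns:p3p="http://www.w3.org/2000/12/p3pv1">
  <appel:RULE prompt="no">
    <p3p:POLICY>
      <p3p:ACCESS appel:connective="non-and">
        <p3p:all/>
      </p3p:ACCESS>
    </p3p:POLICY>
  </appel:RULE>
</appel:RULESET>"""

doc = minidom.parseString(XML)
access = doc.getElementsByTagNameNS(P3P_NS, "ACCESS")[0]

# Look the attribute up by (namespace URI, local name), not by its QName.
attr = access.attributes.getNamedItemNS(APPEL_NS, "connective")
print(attr.value)

# The element-level shortcut takes the same pair of arguments.
print(access.getAttributeNS(APPEL_NS, "connective"))
```

The prefix ("appel") never appears in the lookup; only the namespace URI and the local name matter, so the code keeps working if the document author picks a different prefix.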


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com

From uche.ogbuji at fourthought.com  Thu Aug 19 21:41:42 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Thu Aug 19 21:41:49 2004
Subject: [XML-SIG] help - attributes namespace - is this a bug in PyXML
In-Reply-To: <1092710280.41216f88ab8b8@www-mail.usyd.edu.au>
References: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au>
	<41216315.5080801@doxdesk.com>
	<1092710280.41216f88ab8b8@www-mail.usyd.edu.au>
Message-ID: <1092944502.810.1359.camel@borgia>

On Mon, 2004-08-16 at 20:38, Ajay wrote:
> no, there isn't any particular feature of 4DOM that i need.
> the problem though seems that i can't use xpath in PyXML with a document
> parsed using xml.dom.minidom
> the following piece of code
> 
> dataNodes = xpath.Evaluate(".//*[local-name()='DATA']", document.documentElement)
> 
> works perfectly fine when i pass in a document parsed using
> 
> document = reader.fromStream(open("test.xml", 'r'))
> 
> however when i pass a document parsed using minidom i get the following
> exception
> 
>   File "C:\PYTHON23\Lib\site-packages\_xmlplus\xpath\__init__.py", line 70, in Evaluate
>     retval = parser.new().parse(expr).evaluate(con)
>   File "C:\PYTHON23\Lib\site-packages\_xmlplus\xpath\ParsedAbbreviatedRelativeLocationPath.py", line 52, in evaluate
>     res = Set.Union(res,subRt)
>   File "C:\PYTHON23\Lib\site-packages\_xmlplus\xpath\Set.py", line 25, in Union
>     return compare + filter(lambda x,compare = compare:x not in compare,loop)
> TypeError: can only concatenate list (not "tuple") to list
> 
> 
> i would actually prefer using just minidom and not even have xpath. the
> application may be ported to a PDA and the pythonce distribution does not
> include the PyXML package.
> since i use xpath to just locate node subsets, i would have to rewrite
> functions to do that by just looping through the different nodes (i don't
> know how hard that will be) --- is there someone who has already done
> that?
> 
> on the PyXML documentation page under the section on compliance issues, it
> says
> "Never gets the attribute - always returns false for hasAttribute, empty
> string for getAttribute, or null for getAttributeNode."
> funny. i should have read that before trying hours on why my calls weren't
> working
> efficiency and a future port to a PDA are the reasons why i didn't use
> pxdom. that and being a newbie meant i knew very little about the
> different packages.

I suggest 4Suite.  It has a very fast DOM (Domlette), and a very good
XPath impl (the one in PyXML is a much older version of 4Suite's
XPath).  It does use some C code (so does PyXML, though), so bear that
in mind for future porting thoughts.


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com

From uche.ogbuji at fourthought.com  Thu Aug 19 21:45:20 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Thu Aug 19 21:45:25 2004
Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
In-Reply-To: <1092743991.4121f33704f17@webmail.codeweld.com>
References: <1091095679.4108cc7f0bf70@webmail.codeweld.com>
	<1092743991.4121f33704f17@webmail.codeweld.com>
Message-ID: <1092944720.810.1363.camel@borgia>

On Tue, 2004-08-17 at 05:59, xmlsig@codeweld.com wrote:
> > I've python 2.3.4 on windows xp with PyXML-0.8.3.win32-py2.3
> >
> > This code leaks substantially
> >
> > from xml.dom.ext.reader.HtmlLib import FromHtml
> > import urllib
> > from xml.dom import ext
> > s = urllib.urlopen( 'http://www.google.com' ).read()
> > while True:
> >     root = FromHtml( s )
> >     ext.ReleaseNode( root )
> >
> > However, this does not ( or only very minor )
> >
> > from xml.dom.ext.reader.Sax2 import Reader
> > import urllib
> > from xml.dom import ext
> > s = urllib.urlopen( 'http://www.infoworld.com/rss/reviews.xml' ).read()
> > while True:
> >     reader = Reader()
> >     root = reader.fromString( s )
> >     ext.ReleaseNode( root )
> >
> > Any suggestions?
> 
> Could anybody reproduce the leak?
> Any suggestions what I do wrong?

I haven't done much work in HtmlLib since it was rewritten to use
sgmlop.  It will take some heavy digging to find the precise memory
leak.  What's your overall problem?  Could you use Python 2.3's
HTMLParser library instead?
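
As a rough sketch of what that could look like: the example below uses the Python 3 spelling of the module (html.parser; in Python 2.3 the module is called HTMLParser). The LinkCollector class and the sample markup are made up for illustration, and collecting href values is only an assumed stand-in for whatever information actually needs extracting.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical handler that records the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed(
    '<html><body><p>See <a href="http://www.python.org/">Python</a> '
    'and <A HREF="/sigs/xml-sig/">XML-SIG</a>.</body></html>'
)
collector.close()
print(collector.links)
```

No tree is built at all here, so there is nothing to release afterwards; the price is that the handler has to track any context it needs itself.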


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com

From and-xml at doxdesk.com  Fri Aug 20 06:08:51 2004
From: and-xml at doxdesk.com (Andrew Clover)
Date: Fri Aug 20 06:08:18 2004
Subject: [XML-SIG] help - attributes namespace - is this a bug in PyXML
In-Reply-To: <1092710280.41216f88ab8b8@www-mail.usyd.edu.au>
References: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au>	<41216315.5080801@doxdesk.com>
	<1092710280.41216f88ab8b8@www-mail.usyd.edu.au>
Message-ID: <41257953.4020701@doxdesk.com>

Ajay <abra9823@mail.usyd.edu.au> wrote:

> the problem though seems that i can't use xpath in PyXML with a document
> parsed using xml.dom.minidom

> dataNodes = xpath.Evaluate(".//*[local-name()='DATA']", doc.documentElement)
> TypeError: can only concatenate list (not "tuple") to list

Weird, works for me (0.8.3, even back to 0.6.6), and I can't see any 
reason why the Union method might be getting a tuple instead of a list 
with minidom.

> since i use xpath to just locate node subsets, i would have to rewrite
> functions to do that by just looping through the different nodes (i don't
> know how hard that will be) --- is there someone who has already done
> that?

Sounds pretty easy to me; your example could be implemented as 
documentElement.getElementsByTagNameNS('*', 'DATA'). List comprehensions 
can also simplify looking through childNodes; anything doing a depth 
search will need a few trivial recursive functions.
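
To make that concrete, here is a small sketch assuming Python 3 and xml.dom.minidom. The helper name find_by_local_name and the sample document are invented for the example; the helper is one possible shape for the "trivial recursive function", not code from any of the packages discussed.

```python
from xml.dom import minidom


def find_by_local_name(node, local_name):
    """Return all descendant elements of node with the given local name,
    in document order (roughly .//*[local-name()='...'] without XPath)."""
    found = []
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            if child.localName == local_name:
                found.append(child)
            # Depth search: recurse into every element child.
            found.extend(find_by_local_name(child, local_name))
    return found


doc = minidom.parseString(
    '<root xmlns:a="urn:example:a">'
    '<a:DATA id="1"><DATA id="2"/></a:DATA>'
    '<other><a:DATA id="3"/></other>'
    '</root>'
)
root = doc.documentElement

# Built-in route: '*' matches any namespace, including none.
via_dom = [e.getAttribute("id") for e in root.getElementsByTagNameNS("*", "DATA")]

# Hand-written route, for platforms or node types where the built-in is not enough.
via_walk = [e.getAttribute("id") for e in find_by_local_name(root, "DATA")]

print(via_dom, via_walk)
```

A list comprehension over childNodes covers the non-recursive cases, for example [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE] for the element children of one node.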

> "Never gets the attribute - always returns false for hasAttribute, empty
> string for getAttribute, or null for getAttributeNode."
> funny. i should have read that before trying hours on why my calls weren't
> working

Well quite, similar frustrations led me to compile it!

That one's a bug from old versions of cDomlette though, shouldn't affect 
4DOM. The calls fail in 4DOM under a more limited set of circumstances; 
I've updated the table to add bug 20 to the latest 4DOM too as per your 
previous bug.

> efficiency and a future port to a PDA are the reasons why i didn't use
> pxdom.

Well, a PDA port shouldn't be a problem - pxdom is pure-Python 
(compatible back to 1.5.2). Of course for efficiency as you say it's 
pretty poor.

cDomlette is the best option for efficiency, but has C parts so would 
need suitable recompiling. It has a decent XPath too. Support for DOM 
features is deliberately very limited so don't expect to be able to move 
an arbitrary DOM application to it without change.

-- 
Andrew Clover
mailto:and@doxdesk.com
http://www.doxdesk.com/
From xmlsig at codeweld.com  Fri Aug 20 08:52:47 2004
From: xmlsig at codeweld.com (xmlsig@codeweld.com)
Date: Fri Aug 20 08:52:50 2004
Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
In-Reply-To: <1092944720.810.1363.camel@borgia>
References: <1091095679.4108cc7f0bf70@webmail.codeweld.com>
	<1092743991.4121f33704f17@webmail.codeweld.com>
	<1092944720.810.1363.camel@borgia>
Message-ID: <1092984767.41259fbf40266@webmail.codeweld.com>

Quoting Uche Ogbuji <uche.ogbuji@fourthought.com>:
> On Tue, 2004-08-17 at 05:59, xmlsig@codeweld.com wrote:
> > > I've python 2.3.4 on windows xp with PyXML-0.8.3.win32-py2.3
> > >
> > > This code leaks substantially
> > >
> > > from xml.dom.ext.reader.HtmlLib import FromHtml
> > > import urllib
> > > from xml.dom import ext
> > > s = urllib.urlopen( 'http://www.google.com' ).read()
> > > while True:
> > >     root = FromHtml( s )
> > >     ext.ReleaseNode( root )
> > >
> > > However, this does not ( or only very minor )
> > >
> > > from xml.dom.ext.reader.Sax2 import Reader
> > > import urllib
> > > from xml.dom import ext
> > > s = urllib.urlopen( 'http://www.infoworld.com/rss/reviews.xml' ).read()
> > > while True:
> > >     reader = Reader()
> > >     root = reader.fromString( s )
> > >     ext.ReleaseNode( root )
> > >
> > > Any suggestions?
> >
> > Could anybody reproduce the leak?
> > Any suggestions what I do wrong?
>
> I haven't done much work in HtmlLib since it was rewritten to use
> sgmlop.  It will take some heavy digging to find the precise memory
> leak.  What's your overall problem?  Could you use Python 2.3's
> HTMLParser library instead?

The overall problem is that the FromHtml call (in this example) allocates some
100-200 k per loop that is not freed for the runtime of the process. The
leak is bigger when no ReleaseNode call is made.

I could of course use other means of extracting information from HTML, but I
thought there was no need to reinvent the wheel if somebody has already
written an HTML parser that spits out a DOM.
From fredrik at pythonware.com  Fri Aug 20 09:00:11 2004
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri Aug 20 08:58:29 2004
Subject: [XML-SIG] Re: help - attributes namespace - is this a bug in PyXML
References: <1092671150.4120d6ae1ac1e@www-mail.usyd.edu.au>	<41216315.5080801@doxdesk.com><1092710280.41216f88ab8b8@www-mail.usyd.edu.au>
	<41257953.4020701@doxdesk.com>
Message-ID: <cg47eh$150$1@sea.gmane.org>

Andrew Clover wrote:

> Well, a PDA port shouldn't be a problem - pxdom is pure-Python (compatible back to 1.5.2). Of 
> course for efficiency as you say it's pretty poor.

I'd say "pretty poor" is an understatement:

Parsing the ot.xml file from Jon Bosak's collection (3.5 MB):

minidom: 1.4 seconds, 53 megabytes
elementtree: 1.6 seconds, 14 megabytes
    same, w. sgmlop: 0.76 seconds
    same, w. Python parser: 2.9 seconds
    same, w. C element type: 0.38 seconds
pxdom: 800 seconds, 79 megabytes

That's 500 times slower than other portable implementations, and
2100 times slower than the fastest XML object implementation I
have here.  Put another way, pxdom parses 4350 bytes per second
on a 3 GHz PC.

(the factor drops somewhat with smaller files, but it's still in the "a
few kilobytes per second" range)
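
For anyone who wants to run a comparison of this kind themselves, here is a rough sketch. It is not the script behind the figures above: it assumes Python 3, uses the standard library's tracemalloc (which counts Python-level allocations and slows parsing down, so the results are not comparable with process sizes), and parses a synthetic document in place of ot.xml. The measure helper is a name I made up.

```python
import time
import tracemalloc
import xml.etree.ElementTree as ET
from xml.dom import minidom


def measure(parse, data):
    """Return (seconds, peak traced bytes) for a single call of parse(data)."""
    tracemalloc.start()
    start = time.perf_counter()
    tree = parse(data)  # keep the result alive until the peak has been read
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del tree
    return elapsed, peak


# Synthetic input; substitute the bytes of a real file to compare on real data.
data = ("<doc>" + "<v>In the beginning</v>" * 20000 + "</doc>").encode("utf-8")

for name, parse in [("minidom", minidom.parseString),
                    ("elementtree", ET.fromstring)]:
    seconds, peak = measure(parse, data)
    print("%-12s %.3f s, %.1f MB peak, %.0f bytes/s"
          % (name, seconds, peak / 1e6, len(data) / seconds))
```

The bytes-per-second column is the same arithmetic as above: input size divided by parse time, e.g. 3.5 MB over 800 seconds is a little over 4 kilobytes per second.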

</F> 


From mail_container at documentmailer.com  Sun Aug 22 17:28:20 2004
From: mail_container at documentmailer.com (mail_container@documentmailer.com)
Date: Sun Aug 22 17:28:33 2004
Subject: [XML-SIG] Returned mail: Data format error
Message-ID: <20040822152830.AAB171E4003@bag.python.org>

The message was not delivered due to the following reason:

Your message could not be delivered because the destination server was
unreachable within the allowed queue period. The amount of time
a message is queued before it is returned depends on local configura-
tion parameters.

Most likely there is a network problem that prevented delivery, but
it is also possible that the computer is turned off, or does not
have a mail system running right now.

Your message could not be delivered within 6 days:
Mail server 56.55.39.111 is not responding.

The following recipients did not receive this message:
<xml-sig@python.org>

Please reply to postmaster@documentmailer.com
if you feel this message to be in error.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mail.zip
Type: application/octet-stream
Size: 29060 bytes
Desc: not available
Url : http://mail.python.org/pipermail/xml-sig/attachments/20040822/dc2bce58/mail-0001.obj
From jbam1113 at yahoo.com  Sun Aug 22 20:51:20 2004
From: jbam1113 at yahoo.com (Jeremy Chesson)
Date: Sun Aug 22 20:50:49 2004
Subject: [XML-SIG] Buy Vicodin online today, overnight shipping xyiz kccg v
Message-ID: <20040822185120.13401.qmail@web13722.mail.yahoo.com>

how do I go about buying this?


From uche.ogbuji at fourthought.com  Mon Aug 23 18:31:11 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Mon Aug 23 18:31:23 2004
Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
In-Reply-To: <1092984767.41259fbf40266@webmail.codeweld.com>
References: <1091095679.4108cc7f0bf70@webmail.codeweld.com>
	<1092743991.4121f33704f17@webmail.codeweld.com>
	<1092944720.810.1363.camel@borgia>
	<1092984767.41259fbf40266@webmail.codeweld.com>
Message-ID: <1093278671.3314.4.camel@borgia>

On Fri, 2004-08-20 at 00:52, xmlsig@codeweld.com wrote:
> Quoting Uche Ogbuji <uche.ogbuji@fourthought.com>:
> > On Tue, 2004-08-17 at 05:59, xmlsig@codeweld.com wrote:
> > > > I've python 2.3.4 on windows xp with PyXML-0.8.3.win32-py2.3
> > > >
> > > > This code leaks substantially
> > > >
> > > > from xml.dom.ext.reader.HtmlLib import FromHtml
> > > > import urllib
> > > > from xml.dom import ext
> > > > s = urllib.urlopen( 'http://www.google.com' ).read()
> > > > while True:
> > > >     root = FromHtml( s )
> > > >     ext.ReleaseNode( root )
> > > >
> > > > However, this does not ( or only very minor )
> > > >
> > > > from xml.dom.ext.reader.Sax2 import Reader
> > > > import urllib
> > > > from xml.dom import ext
> > > > s = urllib.urlopen( 'http://www.infoworld.com/rss/reviews.xml' ).read()
> > > > while True:
> > > >     reader = Reader()
> > > >     root = reader.fromString( s )
> > > >     ext.ReleaseNode( root )
> > > >
> > > > Any suggestions?
> > >
> > > Could anybody reproduce the leak?
> > > Any suggestions what I do wrong?
> >
> > I haven't done much work in HtmlLib since it was rewritten to use
> > sgmlop.  It will take some heavy digging to find the precise memory
> > leak.  What's your overall problem?  Could you use Python 2.3's
> > HTMLParser library instead?
> 
> The overall problem is that the FromHtml call ( in this example )allocates some
> 100-200 k per loop that are not freed for the runtime of the process. The
> leak's bigger when no ReleaseNode call is made.

By "overall problem" I mean what are you actually trying to do/achieve. 
Since no one has been able to step up to diagnose the memory leak, I'm
looking to see whether there is another solution that would work for
you.

> I could of course use other means of extracting information from html, but I
> thought it would not be needed to reinvent the wheel if somebody has already
> written a html parser that spits out dom.

Honestly, I don't think DOM is the way I would personally go about
processing HTML, which is why I was trying to get at whether there was
another way for you to meet your needs.

I'm sorry that my workload is so heavy that there is no chance I could
work on figuring out a 4DOM memory leak right now.

Best of luck.


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Practical (Python) SAX Notes - http://www.xml.com/pub/a/2004/08/11/py-xml.html
XML circles the globe - http://www.javareport.com/article.asp?id=9797
Element structures for names and addresses - http://www.ibm.com/developerworks/xml/library/x-elemdes.html
Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090
Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html
A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/

From girl at chnlove.com  Tue Aug 24 04:24:29 2004
From: girl at chnlove.com (girl@chnlove.com)
Date: Tue Aug 24 04:25:11 2004
Subject: [XML-SIG] {Virus?} 
Message-ID: <20040824022509.DB8971E4002@bag.python.org>

Warning: This message has had one or more attachments removed
Warning: (mail.zip, MAIL.PIF).
Warning: Please read the "satu.pelayanweb.com-Attachment-Warning.txt" attachment(s) for more information.

Your message was not delivered due to the following reason:

Your message could not be delivered because the destination server was
unreachable within the allowed queue period. The amount of time
a message is queued before it is returned depends on local configura-
tion parameters.

Most likely there is a network problem that prevented delivery, but
it is also possible that the computer is turned off, or does not
have a mail system running right now.

Your message could not be delivered within 7 days:
Host 154.241.172.38 is not responding.

The following recipients did not receive this message:
<xml-sig@python.org>

Please reply to postmaster@python.org
if you feel this message to be in error.

-------------- next part --------------
This is a message from the MailScanner E-Mail Virus Protection Service
----------------------------------------------------------------------
The original e-mail attachment "mail.zip"
was believed to be infected by a virus and has been replaced by this warning
message.

If you wish to receive a copy of the *infected* attachment, please
e-mail helpdesk and include the whole of this message
in your request. Alternatively, you can call them, with
the contents of this message to hand when you call.

At Tue Aug 24 10:24:57 2004 the virus scanner said:
   ClamAV Module: MAIL.PIF was infected: Worm.Mydoom.M
   MailScanner: Shortcuts to MS-Dos programs are very dangerous in email (MAIL.PIF)

Note to Help Desk: Look on the satu.pelayanweb.com MailScanner in /var/spool/MailScanner/quarantine/20040824 (message 1BzQzO-0001SQ-FL).
-- 
Postmaster
MailScanner thanks transtec Computers for their support
From vitamindcouncil at charter.net  Wed Aug 25 04:37:03 2004
From: vitamindcouncil at charter.net (The Vitamin D Council)
Date: Wed Aug 25 04:37:12 2004
Subject: [XML-SIG] Amanda Schaffer and Oliver Gillie
Message-ID: <j1bwiweflngqpifbimrnz5wrcbj79i>

An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/xml-sig/attachments/20040824/093185c1/attachment.html
From hostetlerm at gmail.com  Wed Aug 25 20:54:28 2004
From: hostetlerm at gmail.com (Mike Hostetler)
Date: Wed Aug 25 20:54:38 2004
Subject: [XML-SIG] ANN: XMLBuilder 1.0
Message-ID: <c60e627c04082511546fba4299@mail.gmail.com>

I read a good blog entry about a Builder object in Ruby [1] and I
thought Python needed one.

Introducing XMLBuilder.  It's nothing special, but it works quite
well.  You create an XMLBuilder object, send it some dictionary data,
and it will generate the XML for you.  My version also allows nesting
another XMLBuilder object inside, as well as adding them together
(though that may not work like you want it to).

It's easier to show than to describe.  Here are some examples:

>>> from xmlbuilder import XMLBuilder
>>> b2 = XMLBuilder()
>>> b2.name = {"last":"flintstone", 'attr':{"type":"friend"}, "first":"fred"}
>>> print b2
<?xml version="1.0" ?>
<name type="friend"><last>flintstone</last><first>fred</first></name>
>>> b1 = XMLBuilder()
>>> b1.contacts = {"owner":"thehaas@binary.net",
...     "contact":b2}
>>> print b1
<?xml version="1.0" ?>
<contacts><owner>thehaas@binary.net</owner><contact><name type="friend"><last>f\
lintstone</last><first>fred</first></name></contact></contacts>
>>> b = b1+b2
>>> print b
<?xml version="1.0" ?>
<contacts><contacts><owner>thehaas@binary.net</owner><contact><name type="frien\
d"><last>flintstone</last><first>fred</first></name></contact></contacts><name \
type="friend"><last>flintstone</last><first>fred</first></name></contacts>

Note that "attr" isn't required to start an attribute dictionary --
any dictionary value inside a dictionary will trigger it.

The good news -- it only needs Python 2.3.  The internal XML rendering
is done with minidom.  Py23 is required because it uses importNode
when an object is nested.

Grab it at:
http://users.binary.net/thehaas/lab/files/xmlbuilder.py

[1]http://onestepback.org/index.cgi/Tech/Ruby/BuilderObjects.rdoc
-- 
Mike Hostetler
thehaas@binary.net
http://www.binary.net/thehaas
From xmlsig at codeweld.com  Wed Aug 25 22:32:31 2004
From: xmlsig at codeweld.com (xmlsig@codeweld.com)
Date: Wed Aug 25 22:32:34 2004
Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
In-Reply-To: <1093278671.3314.4.camel@borgia>
References: <1091095679.4108cc7f0bf70@webmail.codeweld.com>
	<1092743991.4121f33704f17@webmail.codeweld.com>
	<1092944720.810.1363.camel@borgia>
	<1092984767.41259fbf40266@webmail.codeweld.com>
	<1093278671.3314.4.camel@borgia>
Message-ID: <1093465951.412cf75f9f9b1@webmail.codeweld.com>

Quoting Uche Ogbuji <uche.ogbuji@fourthought.com>:

> On Fri, 2004-08-20 at 00:52, xmlsig@codeweld.com wrote:
> > Quoting Uche Ogbuji <uche.ogbuji@fourthought.com>:
> > > On Tue, 2004-08-17 at 05:59, xmlsig@codeweld.com wrote:
> > > > > I've python 2.3.4 on windows xp with PyXML-0.8.3.win32-py2.3
> > > > >
> > > > > This code leaks substantially
> > > > >
> > > > > from xml.dom.ext.reader.HtmlLib import FromHtml
> > > > > import urllib
> > > > > from xml.dom import ext
> > > > > s = urllib.urlopen( 'http://www.google.com' ).read()
> > > > > while True:
> > > > >     root = FromHtml( s )
> > > > >     ext.ReleaseNode( root )
> > > > >
> > > > > However, this does not ( or only very minor )
> > > > >
> > > > > from xml.dom.ext.reader.Sax2 import Reader
> > > > > import urllib
> > > > > from xml.dom import ext
> > > > > s = urllib.urlopen( 'http://www.infoworld.com/rss/reviews.xml' ).read()
> > > > > while True:
> > > > >     reader = Reader()
> > > > >     root = reader.fromString( s )
> > > > >     ext.ReleaseNode( root )
> > > > >
> > > > > Any suggestions?
> > > >
> > > > Could anybody reproduce the leak?
> > > > Any suggestions what I do wrong?
> > >
> > > I haven't done much work in HtmlLib since it was rewritten to use
> > > sgmlop.  It will take some heavy digging to find the precise memory
> > > leak.  What's your overall problem?  Could you use Python 2.3's
> > > HTMLParser library instead?
> >
> > The overall problem is that the FromHtml call ( in this example )allocates some
> > 100-200 k per loop that are not freed for the runtime of the process. The
> > leak's bigger when no ReleaseNode call is made.
>
> By "overall problem" I mean what are you actually trying to do/achieve.
> Since no one has been able to step up to diagnose the memory leak, I'm
> looking to see whether there is another solution that would work for
> you.
>
> > I could of course use other means of extracting information from html, but I
> > thought it would not be needed to reinvent the wheel if somebody has already
> > written a html parser that spits out dom.
>
> Honestly, I don't think DOM is the way I would personally go about
> processing HTML, which is why I was trying to get at whether there was
> another way for you to meet your needs.
>
> I'm sorry that my workload is so heavy that there is no chance I could
> work on figuring out a 4DOM memory leak right now.
>
> Best of luck.

Thanks. Hm, the general task that got me started on this is to perpetually
extract some information from a website. Specifying the location of this
information with XPath is just a very nice convenience. Can I use XPath
expressions with other parsing techniques too?

Apart from that, I just think a DOM is invaluable when there is a need to
process rather complex markup with all its leaves, say for example when you
implement a browser of sorts. Dom-view springs to mind. Use it on a few big
websites for a while and the process starts to lag your computer because it
grows into the hundreds of megabytes.
From cbearden at hal-pc.org  Wed Aug 25 22:56:39 2004
From: cbearden at hal-pc.org (Chuck Bearden)
Date: Wed Aug 25 22:56:43 2004
Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
In-Reply-To: <1093278671.3314.4.camel@borgia>
References: <1091095679.4108cc7f0bf70@webmail.codeweld.com>
	<1092743991.4121f33704f17@webmail.codeweld.com>
	<1092944720.810.1363.camel@borgia>
	<1092984767.41259fbf40266@webmail.codeweld.com>
	<1093278671.3314.4.camel@borgia>
Message-ID: <20040825205639.GA5274@hal-pc.org>

On Mon, Aug 23, 2004 at 10:31:11AM -0600, Uche Ogbuji wrote:
>
> Honestly, I don't think DOM is the way I would personally go about
> processing HTML, which is why I was trying to get at whether there was
> another way for you to meet your needs.

I think I understand what you are getting at, but personally I have
found twisted.web.microdom with 'beExtremelyLenient=True', with perhaps
an mx.Tidying stage beforehand, to be invaluable in mining data from
database-generated webpages built with crappy HTML.  Consider the pages
displaying individual patent records at the USPTO, e.g. [1].  If you 
need to treat such pages as if they were XML records to be parsed and
loaded into a database, something like twisted.web.microdom is a big 
help.

Chuck Bearden

[1] http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1=6295859.WKU.&OS=PN/6295859&RS=PN/6295859
From fredrik at pythonware.com  Thu Aug 26 17:01:35 2004
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Thu Aug 26 17:01:41 2004
Subject: [XML-SIG] Re: xml.dom.ext.reader.HtmlLib memory leak?
References: <1091095679.4108cc7f0bf70@webmail.codeweld.com><1092743991.4121f33704f17@webmail.codeweld.com><1092944720.810.1363.camel@borgia><1092984767.41259fbf40266@webmail.codeweld.com><1093278671.3314.4.camel@borgia>
	<1093465951.412cf75f9f9b1@webmail.codeweld.com>
Message-ID: <cgku0g$ff6$1@sea.gmane.org>

<xmlsig@codeweld.com> wrote:

> Apart from that, I just think a "dom" is invaluable when there is a need to
> process a rather complex markup with all leaves, say for example when you
> implement a browser of sorts. Dom-view springs to mind. Use it on a few big
> websites for a while and the process starts to lag your computer because it
> grows in the hundreds of megabytes.

Does the leak have any relation to the size of the page you're parsing?


The sgmlop parser in PyXML is a fork of the pythonware/effbot.org version, and I don't
think it supports garbage collection.  (Version 1.1 of the pythonware/effbot.org
version does.)

This means that code using it *must* make sure to explicitly kill the parser object when
parsing is done.


I don't have PyXML on this machine, but Google found this page:

    http://aspn.activestate.com/ASPN/Mail/Message/xml-checkins/678664

which contains this initialization code:

    def initParser(self, parser):
        self._parser = parser
        self._parser.register(self)
        return

which creates a cycle: self contains a reference to the parser, which contains
references to bound methods, which contain references back to self.


To break the cycle, you must arrange for the code to do e.g.

        self._parser = None

when you're done parsing.
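
The effect can be reproduced without sgmlop. The sketch below is an illustration under stated assumptions, not PyXML code: FakeParser and Builder are hypothetical stand-ins, the cyclic collector is switched off to mimic an extension type that does not take part in garbage collection, and the behaviour shown relies on CPython's reference counting.

```python
import gc
import weakref


class FakeParser:
    """Hypothetical stand-in for an sgmlop-style parser object."""

    def __init__(self):
        self.callbacks = None

    def register(self, target):
        # Bound methods each hold a reference back to target.
        self.callbacks = (target.handle_starttag, target.handle_data)


class Builder:
    """Hypothetical handler following the initParser pattern quoted above."""

    def init_parser(self, parser):
        self._parser = parser
        self._parser.register(self)  # builder -> parser -> bound methods -> builder

    def handle_starttag(self, tag, attrs):
        pass

    def handle_data(self, data):
        pass

    def close(self):
        self._parser = None  # break the cycle explicitly


def survives_without_gc(break_cycle):
    """Return True if a Builder is still alive after its last name is deleted."""
    was_enabled = gc.isenabled()
    gc.disable()  # assumption: mimics a parser type the collector cannot traverse
    try:
        builder = Builder()
        builder.init_parser(FakeParser())
        probe = weakref.ref(builder)
        if break_cycle:
            builder.close()
        del builder
        return probe() is not None
    finally:
        if was_enabled:
            gc.enable()


print(survives_without_gc(break_cycle=False))  # cycle left intact
print(survives_without_gc(break_cycle=True))   # cycle broken first
```

With the cycle left intact the builder (and everything it references, such as a DOM under construction) stays in memory; once _parser is cleared, plain reference counting is enough to free it.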


Alternatively, you could probably switch to the effbot.org version of sgmlop:

    http://effbot.org/downloads#sgmlop

(I haven't tested this with PyXML, but it might work.  Or not.)

</F> 


From hostetlerm at gmail.com  Thu Aug 26 18:16:50 2004
From: hostetlerm at gmail.com (Mike Hostetler)
Date: Thu Aug 26 18:16:53 2004
Subject: [XML-SIG] ANN: XMLBuilder 1.1
Message-ID: <c60e627c04082609167a458a8@mail.gmail.com>

Thanks to a few comments, I'm introducing XMLBuilder 1.1

I thought that changing addition to behave the way everyone (including
me) would expect would be harder than it was -- the old behavior was
mostly a mistake on my part.

Now you can also build nested XML by nesting dictionaries.  Because of
this, you have to use one of the keys "attr", "attrs", or "attributes"
to create attributes -- a fair trade-off.

The latest example run:

        b1 = XMLBuilder()
        b1.contacts = {"owner":"thehaas@binary.net"}
        print b1
        <?xml version="1.0" ?>
        <contacts><owner>thehaas@binary.net</owner></contacts>


        b2 = XMLBuilder()
        b2.name = {"person": {"attr": {"type":"friend"},"last":"flintstone",
                              "first":"fred"}}
        print b2
        <?xml version="1.0" ?>
        <name><person type="friend"><last>flintstone</last><first>fred</first></person></name>

        b1.contacts = {"owner":"thehaas@binary.net", "contact":b2}
        print b1
        <?xml version="1.0" ?>
        <contacts><owner>thehaas@binary.net</owner><contact><name><person type="friend"><last>flintstone</last><first>fred</first></person></name></contact></contacts>

        # adding example
        b1.contacts = {"owner":"thehaas@binary.net"}
        print b1+b2
        <?xml version="1.0" ?>
        <contacts><owner>thehaas@binary.net</owner><name><person type="friend"><last>flintstone</last><first>fred</first></person></name></contacts>
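
The nested-dictionary convention shown in these runs can be sketched roughly
as follows.  This is not XMLBuilder's actual implementation; the render()
function and the ATTR_KEYS name are assumptions made for illustration, and
only the three attribute keys come from the announcement:

```python
from xml.sax.saxutils import escape, quoteattr

# Keys that introduce attributes rather than child elements.
ATTR_KEYS = ("attr", "attrs", "attributes")


def render(tag, value):
    """Serialize `value` as an element named `tag` (hypothetical helper)."""
    if not isinstance(value, dict):
        # Leaf value: escaped character data.
        return "<%s>%s</%s>" % (tag, escape(str(value)), tag)
    attrs = {}
    children = []
    for key, child in value.items():
        if key in ATTR_KEYS:
            attrs.update(child)
        else:
            children.append(render(key, child))
    attr_text = "".join(
        " %s=%s" % (name, quoteattr(str(val))) for name, val in attrs.items()
    )
    return "<%s%s>%s</%s>" % (tag, attr_text, "".join(children), tag)


person = {"person": {"attr": {"type": "friend"},
                     "last": "flintstone", "first": "fred"}}
xml_text = render("name", person)
```

Child order follows dictionary insertion order, so this sketch relies on a
Python version whose dictionaries preserve it.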


The latest version is here:
   http://users.binary.net/thehaas/lab/files/xmlbuilder.py

Any comments are appreciated!
-- 
Mike Hostetler
thehaas@binary.net
http://www.binary.net/thehaas
From uche.ogbuji at fourthought.com  Thu Aug 26 20:35:50 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Thu Aug 26 20:35:53 2004
Subject: [XML-SIG] ANN: XMLBuilder 1.0
In-Reply-To: <c60e627c04082511546fba4299@mail.gmail.com>
References: <c60e627c04082511546fba4299@mail.gmail.com>
Message-ID: <1093545350.3314.1672.camel@borgia>

On Wed, 2004-08-25 at 12:54, Mike Hostetler wrote:
> I read a good blog entry about a Builder object in Ruby [1] and I
> thought Python needed one.
> 
> Introducing XMLBuilder.  It's nothing special, but it works quite
> well.  You create an XMLBuilder object, send it some dictionary data,
> and it will generate the XML for you.  My version also allows nesting
> another XMLBuilder object inside, as well as adding them together
> (though that may not work like you want it to).
> 
> It's easier to show than to describe.  Here are some examples:
> 
> >>> from xmlbuilder import XMLBuilder
> >>> b2 = XMLBuilder()
> >>> b2.name = {"last":"flintstone", 'attr':{"type":"friend"}, "first":"fred"}
> >>> print b2
> <?xml version="1.0" ?>
> <name type="friend"><last>flintstone</last><first>fred</first></name>
> >>> b1.contacts = {"owner":"thehaas@binary.net",
> ...     "contact":b2}
> >>> print b1
> <?xml version="1.0" ?>
> <contacts><owner>thehaas@binary.net</owner><contact><name type="friend"><last>f\
> lintstone</last><first>fred</first></name></contact></contacts>
> >>> b = b1+b2
> >>> print b
> <?xml version="1.0" ?>
> <contacts><contacts><owner>thehaas@binary.net</owner><contact><name type="frien\
> d"><last>flintstone</last><first>fred</first></name></contact></contacts><name \
> type="friend"><last>flintstone</last><first>fred</first></name></contacts>

So out of curiosity, do people really prefer this sort of thing to the
(IMHO more straightforward) foo.createElement() type APIs available in
many other Python packages?
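
For comparison, here is a sketch of the createElement() style using the
standard library's xml.dom.minidom, building the same document as the
XMLBuilder example with a nested person element.  The variable names are
illustrative only:

```python
from xml.dom.minidom import getDOMImplementation

# Create a document whose root element is <name>.
doc = getDOMImplementation().createDocument(None, "name", None)

person = doc.createElement("person")
person.setAttribute("type", "friend")
doc.documentElement.appendChild(person)

# Each child element needs its own createElement/createTextNode calls.
for tag, text in (("last", "flintstone"), ("first", "fred")):
    element = doc.createElement(tag)
    element.appendChild(doc.createTextNode(text))
    person.appendChild(element)

xml_text = doc.documentElement.toxml()
```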

Side note, folks looking to generate XML may want to glance at

http://www.xml.com/pub/a/2002/11/13/py-xml.html
http://www.xml.com/pub/a/2003/10/15/py-xml.html
http://www.xml.com/pub/a/2003/04/09/py-xml.html
http://www.xml.com/pub/a/2003/11/12/py-xml.html

I shall give XMLBuilder the customary plug in my next column.  Thanks
for the effort.


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Practical (Python) SAX Notes - http://www.xml.com/pub/a/2004/08/11/py-xml.html
XML circles the globe - http://www.javareport.com/article.asp?id=9797
Element structures for names and addresses - http://www.ibm.com/developerworks/xml/library/x-elemdes.html
Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090
Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html
A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/

From uche.ogbuji at fourthought.com  Thu Aug 26 20:38:09 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Thu Aug 26 20:38:26 2004
Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
In-Reply-To: <20040825205639.GA5274@hal-pc.org>
References: <1091095679.4108cc7f0bf70@webmail.codeweld.com>
	<1092743991.4121f33704f17@webmail.codeweld.com>
	<1092944720.810.1363.camel@borgia>
	<1092984767.41259fbf40266@webmail.codeweld.com>
	<1093278671.3314.4.camel@borgia>  <20040825205639.GA5274@hal-pc.org>
Message-ID: <1093545489.3314.1676.camel@borgia>

On Wed, 2004-08-25 at 14:56, Chuck Bearden wrote:
> On Mon, Aug 23, 2004 at 10:31:11AM -0600, Uche Ogbuji wrote:
> >
> > Honestly, I don't think DOM is the way I would personally go about
> > processing HTML, which is why I was trying to get at whether there was
> > another way for you to meet your needs.
> 
> I think I understand what you are getting at, but personally I have
> found twisted.web.microdom with 'beExtremelyLenient=True', with perhaps
> an mx.Tidying stage beforehand, to be invaluable in mining data from
> database-generated webpages built with crappy HTML.  Consider the pages
> displaying individual patent records at the USPTO, e.g. [1].  If you 
> need to treat such pages as if they were XML records to be parsed and
> loaded into a database, something like twisted.web.microdom is a big 
> help.

Is this available without installing all of Twisted?


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com

From cbearden at hal-pc.org  Thu Aug 26 22:00:30 2004
From: cbearden at hal-pc.org (Chuck Bearden)
Date: Thu Aug 26 22:00:35 2004
Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
In-Reply-To: <1093545489.3314.1676.camel@borgia>
References: <1091095679.4108cc7f0bf70@webmail.codeweld.com>
	<1092743991.4121f33704f17@webmail.codeweld.com>
	<1092944720.810.1363.camel@borgia>
	<1092984767.41259fbf40266@webmail.codeweld.com>
	<1093278671.3314.4.camel@borgia> <20040825205639.GA5274@hal-pc.org>
	<1093545489.3314.1676.camel@borgia>
Message-ID: <20040826200030.GA6209@hal-pc.org>

On Thu, Aug 26, 2004 at 12:38:09PM -0600, Uche Ogbuji wrote:
> On Wed, 2004-08-25 at 14:56, Chuck Bearden wrote:
> > On Mon, Aug 23, 2004 at 10:31:11AM -0600, Uche Ogbuji wrote:
> > >
> > > Honestly, I don't think DOM is the way I would personally go about
> > > processing HTML, which is why I was trying to get at whether there was
> > > another way for you to meet your needs.
> > 
> > I think I understand what you are getting at, but personally I have
> > found twisted.web.microdom with 'beExtremelyLenient=True', with perhaps
> > an mx.Tidying stage beforehand, to be invaluable in mining data from
> > database-generated webpages built with crappy HTML.  Consider the pages
> > displaying individual patent records at the USPTO, e.g. [1].  If you 
> > need to treat such pages as if they were XML records to be parsed and
> > loaded into a database, something like twisted.web.microdom is a big 
> > help.
> 
> Is this available without installing all of Twisted?

I confess I just took the easy way out and installed all of Twisted (as
I've done with 4Suite mostly thus far in order to use the nifty 
Domlette :-)

I haven't browsed through the dependencies to see which of the other
Twisted pieces microdom requires, so I can't say if it is extricable
from the wider framework.

One possibility I didn't try was to use tidy to generate real XHTML from
the crappy HTML.  It might then be possible to use something more
common like the minidom implementation to navigate the HTML.
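
A rough sketch of that second step, assuming tidy (or a similar tool) has
already produced well-formed XHTML.  The XHTML string and the text_of()
helper are made up for illustration; the tidy stage itself is not shown:

```python
from xml.dom.minidom import parseString

# Stand-in for tidy's output: well-formed XHTML.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Patent record</title></head>
  <body><p>Inventor: <b>Fred Flintstone</b></p></body>
</html>"""


def text_of(node):
    """Concatenate all text found beneath `node` (hypothetical helper)."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(text_of(child))
    return "".join(parts)


doc = parseString(XHTML)
title = text_of(doc.getElementsByTagName("title")[0])
inventor = text_of(doc.getElementsByTagName("p")[0])
doc.unlink()  # release the tree explicitly once the data is extracted
```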

For me, extracting data from malformed but consistent HTML is a 
necessary task, so I do sometimes have to make some compromises
in my selection and use of tools.

Chuck

From walter at livinglogic.de  Thu Aug 26 22:24:38 2004
From: walter at livinglogic.de (Walter Dörwald)
Date: Thu Aug 26 22:24:43 2004
Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
In-Reply-To: <20040826200030.GA6209@hal-pc.org>
References: <1091095679.4108cc7f0bf70@webmail.codeweld.com>	<1092743991.4121f33704f17@webmail.codeweld.com>	<1092944720.810.1363.camel@borgia>	<1092984767.41259fbf40266@webmail.codeweld.com>	<1093278671.3314.4.camel@borgia>
	<20040825205639.GA5274@hal-pc.org>	<1093545489.3314.1676.camel@borgia>
	<20040826200030.GA6209@hal-pc.org>
Message-ID: <412E4706.9010101@livinglogic.de>

Chuck Bearden wrote:

> [...]
> I haven't browsed through the dependencies to see what of the other
> Twisted pieces the microdom requires, so I can't say if it is extricable
> from the wider framework.
> 
> One possibility I didn't try was to use tidy to generate real XHTML from
> the crappy HTML.  It might then be possible to use something more
> common like the minidom implementation to navigate the HTML.
> 
> For me, extracting data from malformed but consistent HTML is a 
> necessary task, so I do sometimes have to make some compromises
> in my selection and use of tools.

There are already tools that make sense of broken HTML: browsers.

Is there any way to reuse that functionality from Python? I.e.
something like:

 >>> import mozilla
 >>> x = mozilla.parse("http://www.python.org")

I don't care whether I get a DOM or a string parsable by an
XML parser.

Bye,
    Walter Dörwald


From veillard at redhat.com  Thu Aug 26 23:19:00 2004
From: veillard at redhat.com (Daniel Veillard)
Date: Thu Aug 26 23:19:19 2004
Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
In-Reply-To: <412E4706.9010101@livinglogic.de>
References: <1091095679.4108cc7f0bf70@webmail.codeweld.com>
	<1092743991.4121f33704f17@webmail.codeweld.com>
	<1092944720.810.1363.camel@borgia>
	<1092984767.41259fbf40266@webmail.codeweld.com>
	<1093278671.3314.4.camel@borgia> <20040825205639.GA5274@hal-pc.org>
	<1093545489.3314.1676.camel@borgia>
	<20040826200030.GA6209@hal-pc.org>
	<412E4706.9010101@livinglogic.de>
Message-ID: <20040826211900.GX16238@redhat.com>

On Thu, Aug 26, 2004 at 10:24:38PM +0200, Walter Dörwald wrote:
> Chuck Bearden wrote:
> 
> >[...]
> >I haven't browsed through the dependencies to see what of the other
> >Twisted pieces the microdom requires, so I can't say if it is extricable
> >from the wider framework.
> >
> >One possibility I didn't try was to use tidy to generate real XHTML from
> >the crappy HTML.  It might then be possible to use something more
> >common like the minidom implementation to navigate the HTML.
> >
> >For me, extracting data from malformed but consistent HTML is a 
> >necessary task, so I do sometimes have to make some compromises
> >in my selection and use of tools.
> 
> There are already tools that make sense of broken HTML: browsers.
> 
> Is there any way to reuse that functionality from Python? I.e.
> something like:
> 
> >>> import mozilla
> >>> x = mozilla.parse("http://www.python.org")
> 
> I don't care whether I get a DOM or a string parsable by an
> XML parser.

  The libxml2 HTML parser is part of the libxml2 Python bindings.

  import libxml2

  doc = libxml2.htmlParseFile(URI, None)
  
At that point doc is a DOM tree, like you would have if you had
parsed XML; you can use XPath, navigate, extract and reserialize.
You may get a bunch of errors and warnings, but you will get a
tree even if the HTML is really bizarre.

    ctxt = doc.xpathNewContext()
    try:
        res = ctxt.xpathEval("//head/title")
        title = res[0].content
    except:
        title = "Page %s" % (resource)

  is the kind of code I use to index HTML pages and feed an
SQL database for searches on xmlsoft.org. I also do

#
# We are not interested in parsing errors here
#
def callback(ctx, str):
    return
libxml2.registerErrorHandler(callback, None)

  to ignore all errors and warnings, since I run it as cron batches.

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard@redhat.com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
From uche.ogbuji at fourthought.com  Fri Aug 27 01:30:21 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Fri Aug 27 01:30:24 2004
Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
In-Reply-To: <20040826211900.GX16238@redhat.com>
References: <1091095679.4108cc7f0bf70@webmail.codeweld.com>
	<1092743991.4121f33704f17@webmail.codeweld.com>
	<1092944720.810.1363.camel@borgia>
	<1092984767.41259fbf40266@webmail.codeweld.com>
	<1093278671.3314.4.camel@borgia> <20040825205639.GA5274@hal-pc.org>
	<1093545489.3314.1676.camel@borgia> <20040826200030.GA6209@hal-pc.org>
	<412E4706.9010101@livinglogic.de> <20040826211900.GX16238@redhat.com>
Message-ID: <1093563020.3314.2016.camel@borgia>

On Thu, 2004-08-26 at 15:19, Daniel Veillard wrote:
> On Thu, Aug 26, 2004 at 10:24:38PM +0200, Walter Dörwald wrote:
> > Chuck Bearden wrote:
> > 
> > >[...]
> > >I haven't browsed through the dependencies to see what of the other
> > >Twisted pieces the microdom requires, so I can't say if it is extricable
> > >from the wider framework.
> > >
> > >One possibility I didn't try was to use tidy to generate real XHTML from
> > >the crappy HTML.  It might then be possible to use something more
> > >common like the minidom implementation to navigate the HTML.
> > >
> > >For me, extracting data from malformed but consistent HTML is a 
> > >necessary task, so I do sometimes have to make some compromises
> > >in my selection and use of tools.
> > 
> > There are already tools that make sense of broken HTML: browsers.
> > 
> > Is there any way to reuse that functionality from Python? I.e.
> > something like:
> > 
> > >>> import mozilla
> > >>> x = mozilla.parse("http://www.python.org")
> > 
> > I don't care whether I get a DOM or a string parsable by an
> > XML parser.
> 
>   libxml2 HTML parser is part of libxml2 Python bindings.
> 
>   import libxml2
> 
>   doc = libxml2.htmlParseFile(URI, None)
>   
> at that point doc is a DOM tree, like you would have if you had
> parsed XML, you can use XPath, navigate, extract and reserialize.
> You may have got a bunch of errors and warning, but you will get a
> tree even if the HTML is really bizarre. 
> 
>     ctxt = doc.xpathNewContext()
>     try:
>         res = ctxt.xpathEval("//head/title")
>         title = res[0].content
>     except:
>         title = "Page %s" % (resource)
> 
>   is the kind of code I use to index HTML pages and feed an
> SQL database for searches on xmlsoft.org. I also do
> 
> #
> # We are not interested in parsing errors here
> #
> def callback(ctx, str):
>     return
> libxml2.registerErrorHandler(callback, None)
> 
>   to ignore all error and warning since I run it as cron batches.

Cool,  but since memory leaks are the genesis of this thread (see the
subject line), are you sure your example above takes all necessary
memory management into account?

I've had a few surprises using examples from libxml2/Python as is, and
finding out that they leaked significantly.  It turns out that there are
required memory management steps omitted from the docs.

And more importantly: are you planning to fix it so that manual memory
management is unnecessary when using libxml2/Python?  I know Martijn
Faassen is working on something along those lines in lxml, but his work
isn't really ready for "prime time" yet.

Thanks.


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Meet me at XMLOpen Sept 21-23 2004, Cambridge, UK.  http://xmlopen.org


From hostetlerm at gmail.com  Fri Aug 27 03:16:59 2004
From: hostetlerm at gmail.com (Mike Hostetler)
Date: Fri Aug 27 03:17:05 2004
Subject: [XML-SIG] ANN: XMLBuilder 1.0
In-Reply-To: <1093545350.3314.1672.camel@borgia>
References: <c60e627c04082511546fba4299@mail.gmail.com>
	<1093545350.3314.1672.camel@borgia>
Message-ID: <c60e627c0408261816120b735c@mail.gmail.com>

On Thu, 26 Aug 2004 12:35:50 -0600, Uche Ogbuji
<uche.ogbuji@fourthought.com> wrote:
> So out of curiosity, do people really prefer this sort of thing to the
> (IMHO more straightforward) foo.createElement() type APIs available in
> many other Python packages?
>

Let's not argue what's more straightforward or not -- I don't mind a
DOM-type API if I'm parsing XML, but when I'm creating it from
scratch, it's kind-of a pain.

That said, XMLBuilder hasn't been used in the real-world, though I
have a couple of products that I might plug it into and see how it
holds up.  It was mostly an experiment on my part -- seeing a cool
idea in one language and taking that concept into Python.

> Side note, folks looking to generate XML may want to glance at
> 
> http://www.xml.com/pub/a/2002/11/13/py-xml.html
> http://www.xml.com/pub/a/2003/10/15/py-xml.html
> http://www.xml.com/pub/a/2003/04/09/py-xml.html
> http://www.xml.com/pub/a/2003/11/12/py-xml.html
> 

All good stuff.

> I shall give XMLBuilder the customary plug in my next column.  Thanks
> for the effort.

Thanks!

-- 
Mike Hostetler
thehaas@binary.net
http://www.binary.net/thehaas
From uche.ogbuji at fourthought.com  Fri Aug 27 07:40:23 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Fri Aug 27 07:40:27 2004
Subject: [XML-SIG] ANN: Scimitar 0.6.0
Message-ID: <1093585223.3314.2414.camel@borgia>

http://uche.ogbuji.net/tech/4Suite/scimitar

Scimitar is an implementation of ISO Schematron that compiles a
Schematron schema into a Python validator script, making it a
faster and somewhat more flexible approach than the usual XSLT
implementations.

http://www.ascc.net/xml/resource/schematron/schematron.html

Schematron is an XML schema language in which you express a set of rules
that the document must meet, rather than expressing a full grammar for
the XML vocabulary (which is the more common approach to XML schemata).
It is by far the most flexible XML schema language available.

Scimitar supports all of Schematron except for abstract patterns.
See the TODO file for gaps in Scimitar functionality and convenience,
which are being worked on.

Scimitar is open source, provided under the 4Suite variant of the Apache
license.

The compiler program runs standalone on Python 2.2 or more recent,
although if you are using a version earlier than 2.3, you must also
install Optik 1.4.1 or more recent.  In addition to the above
requirements, the generated validators require 4Suite 1.0a3 or more
recent (really only tested with the latest 4Suite CVS).


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Meet me at XMLOpen Sept 21-23 2004, Cambridge, UK.  http://xmlopen.org


From veillard at redhat.com  Fri Aug 27 09:03:53 2004
From: veillard at redhat.com (Daniel Veillard)
Date: Fri Aug 27 09:04:06 2004
Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
In-Reply-To: <1093563020.3314.2016.camel@borgia>
References: <1092743991.4121f33704f17@webmail.codeweld.com>
	<1092944720.810.1363.camel@borgia>
	<1092984767.41259fbf40266@webmail.codeweld.com>
	<1093278671.3314.4.camel@borgia> <20040825205639.GA5274@hal-pc.org>
	<1093545489.3314.1676.camel@borgia>
	<20040826200030.GA6209@hal-pc.org>
	<412E4706.9010101@livinglogic.de>
	<20040826211900.GX16238@redhat.com>
	<1093563020.3314.2016.camel@borgia>
Message-ID: <20040827070353.GZ16238@redhat.com>

On Thu, Aug 26, 2004 at 05:30:21PM -0600, Uche Ogbuji wrote:
> On Thu, 2004-08-26 at 15:19, Daniel Veillard wrote:
> > > I don't care whether I get a DOM or a string parsable by an
> > > XML parser.
> > 
> >   libxml2 HTML parser is part of libxml2 Python bindings.
> > 
> >   import libxml2
> > 
> >   doc = libxml2.htmlParseFile(URI, None)
> >   
> > at that point doc is a DOM tree, like you would have if you had
> > parsed XML, you can use XPath, navigate, extract and reserialize.
> > You may have got a bunch of errors and warning, but you will get a
> > tree even if the HTML is really bizarre. 
> > 
> >     ctxt = doc.xpathNewContext()
> >     try:
> >         res = ctxt.xpathEval("//head/title")
> >         title = res[0].content
> >     except:
> >         title = "Page %s" % (resource)
> > 
> >   is the kind of code I use to index HTML pages and feed an
> > SQL database for searches on xmlsoft.org. I also do
> > 
> > #
> > # We are not interested in parsing errors here
> > #
> > def callback(ctx, str):
> >     return
> > libxml2.registerErrorHandler(callback, None)
> > 
> >   to ignore all error and warning since I run it as cron batches.
> 
> Cool,  but since memory leaks are the genesis of this thread (see the
> subject line), are you sure your example above takes all necessary
> memory management into account?

  In libxml2, memory management is at the document level. Once done
with a document, free it with doc.freeDoc().
All the examples in the libxml2-python package do this; they also do

import libxml2
                                                                                
# Memory debug specific
libxml2.debugMemory(1)
                                                                                
at startup and

# Memory debug specific
libxml2.cleanupParser()
if libxml2.debugMemory(1) == 0:
    print "OK"
else:
    print "Memory leak %d bytes" % (libxml2.debugMemory(1))
    libxml2.dumpMemory()

at the end, both to show that the example does not leak and to show how
to debug leaks.
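
One way to make the "free the document when done" rule hard to forget is a
small context manager.  This is only a sketch: owned_document() and FakeDoc
are hypothetical names that are not part of libxml2, and the only thing
assumed about a real document object is the freeDoc() method mentioned
above:

```python
from contextlib import contextmanager


@contextmanager
def owned_document(doc):
    """Yield `doc` and always call doc.freeDoc() afterwards."""
    try:
        yield doc
    finally:
        doc.freeDoc()


class FakeDoc:
    """Hypothetical stand-in for a libxml2 document, used to show the flow."""

    def __init__(self):
        self.freed = False

    def freeDoc(self):
        self.freed = True


def demo():
    doc = FakeDoc()
    try:
        with owned_document(doc):
            # Even if processing fails, the document is still freed.
            raise ValueError("simulated processing error")
    except ValueError:
        pass
    return doc.freed
```

With the real bindings the same helper would wrap the parse call, for
example `with owned_document(libxml2.htmlParseFile(uri, None)) as doc:`.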

> I've had a few surprises using examples from libxml2/Python as is, and
> finding out that they leaked significantly.  It turns out that there are
> required memory management steps omitted from the docs.

Usually this just means calling doc.freeDoc() when you are done with the
document.  We take documentation patches. The fact that allocation is done
at the document level, and that every document needs to be freed, either at
the C or Python level, has been written on the list, in the docs and in the
examples over and over again. Are you subscribed to the mailing list?
  
> And more importantly: are you planning to fix it so that manual memory
> management is unnecessary when using libxml2/Python?  I know Martijn

  Me? No. Doing reference counting over a document, each time you expose
a node through an XPath query result for example, is just the best way to
*have* memory leaks. I trust far more a general clear principle:
    "allocation is done at the document level"
 and then you have to keep track of the lifetime of your document,
than relying on keeping ref counts for all the possible interfaces
accessing a document, which may or may not keep a link to one of its
structures.

> Faassen is working on something along those lines in lxml, but his work
> isn't really ready for "prime time" yet.

  That requires a lot of work on top of libxml2 itself. My goal is to provide
Python APIs for the library, not transmute the library calls into something
they aren't. The library does not refcount, so my Python binding won't
refcount (at least for the C internal objects); the library uses UTF-8
for all document content, so my Python binding will also use UTF-8
for all document content. If Martijn wants to write a layer on top, fine
by me, but he will also have to maintain it.

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard@redhat.com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
From uche.ogbuji at fourthought.com  Fri Aug 27 16:05:24 2004
From: uche.ogbuji at fourthought.com (Uche Ogbuji)
Date: Fri Aug 27 16:05:28 2004
Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
In-Reply-To: <20040827070353.GZ16238@redhat.com>
References: <1092743991.4121f33704f17@webmail.codeweld.com>
	<1092944720.810.1363.camel@borgia>
	<1092984767.41259fbf40266@webmail.codeweld.com>
	<1093278671.3314.4.camel@borgia> <20040825205639.GA5274@hal-pc.org>
	<1093545489.3314.1676.camel@borgia> <20040826200030.GA6209@hal-pc.org>
	<412E4706.9010101@livinglogic.de> <20040826211900.GX16238@redhat.com>
	<1093563020.3314.2016.camel@borgia> <20040827070353.GZ16238@redhat.com>
Message-ID: <1093615524.3314.2942.camel@borgia>

It's a very unPythonic binding that requires manual ref counting and memory
management.  That's why this need has surprised me and others.

As to sending doc patches and joining more mailing lists, that's not
likely to happen.  I have my own large Python/C/XML library to maintain,
and scarcely enough time for that.  I do cover others' libraries
in my Python/XML column for XML.com, though, which is where, for
example, I ran into the problems I hinted at with libxml2.  I simply report to
my readers what I encounter wearing a user's hat.  I put a lot of work
into reading existing docs, searching archives and general googling.  If
I can't figure out how to effectively use a library that way, I say so.

But I'm not interested right now in a debate on the merits and demerits
of libxml2's Python binding.  I just wanted to be sure that people were
aware of the need for memory management in addition to the code you
posted here (since I've been bitten myself).  I think you've covered the
subject adequately.

Thanks.


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Meet me at XMLOpen Sept 21-23 2004, Cambridge, UK.  http://xmlopen.org


From walter at livinglogic.de  Fri Aug 27 19:52:16 2004
From: walter at livinglogic.de (Walter Dörwald)
Date: Fri Aug 27 19:52:30 2004
Subject: [XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
In-Reply-To: <20040826211900.GX16238@redhat.com>
References: <1091095679.4108cc7f0bf70@webmail.codeweld.com>
	<1092743991.4121f33704f17@webmail.codeweld.com>
	<1092944720.810.1363.camel@borgia>
	<1092984767.41259fbf40266@webmail.codeweld.com>
	<1093278671.3314.4.camel@borgia> <20040825205639.GA5274@hal-pc.org>
	<1093545489.3314.1676.camel@borgia>
	<20040826200030.GA6209@hal-pc.org>
	<412E4706.9010101@livinglogic.de>
	<20040826211900.GX16238@redhat.com>
Message-ID: <412F74D0.5010904@livinglogic.de>

Daniel Veillard wrote:

 > On Thu, Aug 26, 2004 at 10:24:38PM +0200, Walter Dörwald wrote:
 >
 >> [...]
 >>There are already tools that make sense of broken HTML: browsers.
 >>
 >>Is there any way to reuse that functionality from Python? I.e.
 >>something like:
 >>
 >>
 >>>>>import mozilla
 >>>>>x = mozilla.parse("http://www.python.org")
 >>
 >>I don't care whether I get a DOM or a string parsable by an
 >>XML parser.
 >
 >   libxml2 HTML parser is part of libxml2 Python bindings.
 >
 >   import libxml2
 >
 >   doc = libxml2.htmlParseFile(URI, None)

This looks great. When I dump the DOM again, the resulting
files look much better than those generated by HTMLParser
from the standard library or my own HTML parser.

BTW, I wonder why libxml2 complains about the following:

 >>> doc = libxml2.htmlParseFile("http://www.python.org", None)
http://www.python.org:3: HTML parser error : htmlParseStartTag: invalid element name
<?xml-stylesheet href="./css/ht2html.css" type="text/css"?>

I think the next version of XIST will use libxml2 instead
of uTidyLib for parsing HTML.

Bye,
    Walter Dörwald


From ken.beesley at xrce.xerox.com  Sat Aug 28 14:44:21 2004
From: ken.beesley at xrce.xerox.com (Ken Beesley)
Date: Sat Aug 28 14:44:26 2004
Subject: [XML-SIG] pulldom with XML 1.1 problem
Message-ID: <41307E25.2000009@xrce.xerox.com>


                 Newbie problem:  pulldom with XML 1.1
   
The Question: 
    How can I make pulldom parse according to XML 1.1 conventions?
    Or:  Is there an upgrade of pulldom that handles XML 1.1?
    Or:  Is there some other XML 1.1 parsing solution in Python?

Background:  I'm running
Python 2.3.3 (#1, Feb 17 2004, 11:48:35)
[GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-5)] on linux2

Illustration of my problem:

I start with the following simple xml file, call it test.xml

<?xml version="1.0" encoding="utf-8"?>
 
<foo>
  <bar>first line of text</bar>
  <bar>second line of text</bar>
  <bar>third line of text</bar>
  <bar>&#x0061;&#x0062;&#x0063;</bar>
</foo>

and the following Relax NG schema (compact syntax), call it test.rng

grammar {
  start = element foo {
    element bar {text}+
  }
}

Validation of test.xml succeeds using the Jing validating parser:

java -jar jing.jar -c test.rng test.xml

So far so good.

****** Now for XML 1.0 vs. XML 1.1 ...

In XML 1.0, all characters below x20 are invalid as characters in an XML
file except for x9, xA and xD.
So if I change test.xml to the following (call it test1.0.xml), adding &#x8;

<?xml version="1.0" encoding="utf-8"?>
 
<foo>
  <bar>first line of text</bar>
  <bar>second line of text</bar>
  <bar>third line of text</bar>
  <bar>&#x0061;&#x0062;&#x0063;&#x8;</bar>  <!-- N.B. addition of &#x8; -->
</foo>

then Jing rightly complains that the file is not XML 1.0 valid, because
of the illegal &#x8; character.

However, &#x8;  _is_ valid in XML 1.1, so the following file (call it
test1.1.xml)

<?xml version="1.1" encoding="utf-8"?>

<!-- N.B. change in line above to version="1.1" -->

<foo>
  <bar>first line of text</bar>
  <bar>second line of text</bar>
  <bar>third line of text</bar>
  <bar>&#x0061;&#x0062;&#x0063;&#x8;</bar>  <!-- N.B. addition of &#x8; -->
</foo>

is (correctly) accepted by Jing as valid XML 1.1.
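
The difference between the two versions can be sketched as a pair of
range checks based on the Char productions of the XML 1.0 and XML 1.1
recommendations.  The function names are made up for illustration.  Note
that in XML 1.1 the newly allowed control characters such as x8 may only
appear as character references (like &#x8;), not as literal characters:

```python
def is_xml10_char(cp):
    """True if code point `cp` matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp):
    """True if code point `cp` matches the XML 1.1 Char production."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# &#x8; is rejected under XML 1.0 rules but allowed under XML 1.1 rules.
backspace_ok_in_10 = is_xml10_char(0x8)
backspace_ok_in_11 = is_xml11_char(0x8)
```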

************************

Problem:  pulldom handles test.xml (which lacks the offending &#x8;) but
   chokes on both test1.0.xml (which contains an invalid &#x8;) and
   test1.1.xml (which contains a valid &#x8;).

   It should fail for test1.0.xml and succeed for test1.1.xml (just like
   Jing does).


Here's a little test script (call it test.py) using pulldom to print the
text in each <bar> element:

#!/usr/bin/env python
 
import sys
from xml.dom import pulldom
 
infile = sys.argv[1]
 
events = pulldom.parse(infile)
 
def getText(nodelist):
    rc = ""
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc += node.data
    return rc
 
for (event, node) in events:
    if event == pulldom.START_ELEMENT and node.tagName == "bar":
        events.expandNode(node)
        print getText(node.childNodes)
        
# end of script

Invoking from the command line

  test.py test.xml

succeeds and outputs

  first line of text
  second line of text
  third line of text
  abc

But invoking

   test.py test1.0.xml
or
   test.py test1.1.xml

fails and gives the following traceback:

Traceback (most recent call last):
  File "test.py", line 17, in ?
    for (event, node) in events:
  File "/opt/STools/lib/python2.3/site-packages/_xmlplus/dom/pulldom.py", line 232, in next
    rc = self.getEvent()
  File "/opt/STools/lib/python2.3/site-packages/_xmlplus/dom/pulldom.py", line 265, in getEvent
    self.parser.feed(buf)
  File "/opt/STools/lib/python2.3/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed
    self._err_handler.fatalError(exc)
  File "/opt/STools/lib/python2.3/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:7:31: reference to invalid character number

# end of Traceback

Again, this behavior (raising an exception for the "invalid character
number" &#x8;) is appropriate for the XML 1.0 file but not for the XML 1.1
file.
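
The same behavior can be reproduced without any files on disk. This is only
a sketch: it assumes the standard-library xml.dom.pulldom with its default
expat-based SAX parser rather than the PyXML build shown in the traceback,
and bar_texts() and parses() are names invented for the example.

```python
from xml.dom import pulldom
from xml.sax import SAXParseException

# In-memory stand-ins for test.xml, test1.0.xml and test1.1.xml.
PLAIN = '<?xml version="1.0"?><foo><bar>&#x61;&#x62;&#x63;</bar></foo>'
DOC_10 = '<?xml version="1.0"?><foo><bar>&#x61;&#x62;&#x63;&#x8;</bar></foo>'
DOC_11 = '<?xml version="1.1"?><foo><bar>&#x61;&#x62;&#x63;&#x8;</bar></foo>'

def bar_texts(xml_text):
    """Return the text content of every <bar> element (hypothetical helper)."""
    events = pulldom.parseString(xml_text)
    texts = []
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "bar":
            events.expandNode(node)
            texts.append("".join(child.data for child in node.childNodes
                                 if child.nodeType == child.TEXT_NODE))
    return texts

def parses(xml_text):
    """True if pulldom gets through the document without a SAX parse error."""
    try:
        bar_texts(xml_text)
        return True
    except SAXParseException:
        return False

for name, doc in (("plain", PLAIN), ("1.0", DOC_10), ("1.1", DOC_11)):
    print(name, parses(doc))
```

With an expat-based parser I would expect only the plain document to get
through, since as far as I know expat implements XML 1.0 only and so applies
the 1.0 character rules whatever the declaration says.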

******************

I have an application that needs XML 1.1, including characters like &#x8;

How can I parse such files in Python (preferably with pulldom, but I'm open
to all suggestions)?

Thanks,

Ken


From dave at allen-williams.com  Sat Aug 28 20:35:34 2004
From: dave at allen-williams.com (Dave Allen-Williams)
Date: Sat Aug 28 20:32:56 2004
Subject: [XML-SIG] XSLT stylesheet for XBEL
Message-ID: <E1C180U-0004ae-00@mail2.mail.iol.ie>

Hi, 
I noticed that your XBEL page http://pyxml.sourceforge.net/topics/xbel/ has
the following link:
    Joris Graaumans (joris@cs.uu.nl) has developed a couple of
<http://www.cs.uu.nl/~joris/stuff.html> XSLT stylesheets for XBEL
which appears to be out of date.
 
In case you might be interested in updating your page to include a current
XSLT stylesheet for XBEL, I've also developed one which uses DHTML to
navigate folders (tested with IE). 
   http://www.allen-williams.com/dave/links.xml  shows
http://www.allen-williams.com/dave/links.xslt in use.
 
Cheers, Dave.
From abra9823 at mail.usyd.edu.au  Tue Aug 31 03:19:14 2004
From: abra9823 at mail.usyd.edu.au (Ajay)
Date: Tue Aug 31 03:19:23 2004
Subject: [XML-SIG] xpath error
Message-ID: <1093915154.4133d21263cc6@www-mail.usyd.edu.au>

hi!

I parsed an XML document using minidom and then executed the following
statement:

dataNodes = xpath.Evaluate(".//*[local-name()='DATA']",
                           document.documentElement)

This gives an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\PYTHON23\Lib\site-packages\_xmlplus\xpath\__init__.py", line 70, in Evaluate
    retval = parser.new().parse(expr).evaluate(con)
  File "C:\PYTHON23\Lib\site-packages\_xmlplus\xpath\ParsedAbbreviatedRelativeLocationPath.py", line 52, in evaluate
    res = Set.Union(res,subRt)
  File "C:\PYTHON23\Lib\site-packages\_xmlplus\xpath\Set.py", line 25, in Union
    return compare + filter(lambda x,compare = compare:x not in compare,loop)
TypeError: can only concatenate list (not "tuple") to list
TypeError: can only concatenate list (not "tuple") to list

any idea why?
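
For reference, the final TypeError is what Python raises whenever a tuple is
added to a list, so one possible reading of the traceback (a guess, not
something confirmed against the PyXML source) is that one operand of
Set.Union was a list and the other a tuple. The sketch below only reproduces
that error in isolation and shows a union that tolerates mixed sequence
types. The names union() and naive_concat_fails() are invented for the
example and are not PyXML functions.

```python
def naive_concat_fails(left, right):
    """True if left + right raises TypeError, as in the traceback."""
    try:
        left + right
    except TypeError:
        return True
    return False

def union(left, right):
    """Order-preserving union that accepts any mix of lists and tuples."""
    result = list(left)
    for item in right:
        if item not in result:
            result.append(item)
    return result

print(naive_concat_fails([1], (2,)))
print(union([1, 2], (2, 3)))
```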

thanks

cheers


----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.
From abra9823 at mail.usyd.edu.au  Tue Aug 31 05:56:51 2004
From: abra9823 at mail.usyd.edu.au (Ajay)
Date: Tue Aug 31 05:56:55 2004
Subject: [XML-SIG] fast xml processing 
Message-ID: <1093924611.4133f7037b4c9@www-mail.usyd.edu.au>

hi!

I am looking for tools that allow fast processing of XML documents. I will
only be using DOM and XPath, so a lightweight package would be nice.

Going by what I have read so far, 4Suite appears to be quite fast, but it
requires a license. Are there any other fast packages? I am not overly
impressed by the speed of PyXML.
Since I will be using the package on a PDA, it would be nice if you could
also tell me how I can go about porting some of the underlying C code to a
Pocket PC. I have got the SDK, emulator etc. and will be using Microsoft
embedded Visual C++.
Would it just involve recompiling the C code in the new environment and
copying it over?

thanks
cheers


From fredrik at pythonware.com  Tue Aug 31 08:01:17 2004
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue Aug 31 07:59:36 2004
Subject: [XML-SIG] Re: fast xml processing
References: <1093924611.4133f7037b4c9@www-mail.usyd.edu.au>
Message-ID: <ch143v$tec$1@sea.gmane.org>

Ajay wrote:

> I am looking for tools that allow fast processing of XML documents. i will
> only be using DOM and xpath so a lightweight package would be nice.
>
> from what i have read so far 4Suite appears to be quite fast, but it
> requires a license.

>From what I can tell, it *has* a license, which you are supposed to read
and adhere to:

    http://4suite.org/COPYRIGHT.doc

Same applies to all other software libraries, of course.  Very few libraries are
in the public domain.

As for other lightweight tools, people have already pointed you to alternatives
to PyDOM.  It's always a good idea to read followups to your posts.

</F> 


From abra9823 at mail.usyd.edu.au  Tue Aug 31 11:22:34 2004
From: abra9823 at mail.usyd.edu.au (Ajay)
Date: Tue Aug 31 11:22:40 2004
Subject: [XML-SIG] xml parser
Message-ID: <1093944154.4134435a9c3f0@www-mail.usyd.edu.au>

hi!

Is there a pure Python XML parser - one that doesn't use any C code?
I am willing to sacrifice speed.

The Python CE release I am using does not include pyexpat, and I am not
having much luck porting code to it.

cheers


From Alexandre.Fayolle at logilab.fr  Tue Aug 31 11:57:45 2004
From: Alexandre.Fayolle at logilab.fr (Alexandre)
Date: Tue Aug 31 11:57:48 2004
Subject: [XML-SIG] xml parser
In-Reply-To: <1093944154.4134435a9c3f0@www-mail.usyd.edu.au>
References: <1093944154.4134435a9c3f0@www-mail.usyd.edu.au>
Message-ID: <20040831095745.GJ3093@crater.logilab.fr>

On Tue, Aug 31, 2004 at 07:22:34PM +1000, Ajay wrote:
> hi!
> 
> Is there a pure Python XML parser - one that doesn't use any C code?
> i am willing to sacrifice speed.

xmlproc in pyxml is such a parser.  

-- 
Alexandre Fayolle                              LOGILAB, Paris (France).
http://www.logilab.com   http://www.logilab.fr  http://www.logilab.org
From abra9823 at mail.usyd.edu.au  Tue Aug 31 16:31:10 2004
From: abra9823 at mail.usyd.edu.au (Ajay)
Date: Tue Aug 31 16:31:16 2004
Subject: [XML-SIG] xpath 
Message-ID: <1093962670.41348baedc0e7@www-mail.usyd.edu.au>

hi!

Is there an implementation of XPath written purely in Python, without any
C code? Is there one available as a standalone package?

thanks
cheers


From brian at sweetapp.com  Tue Aug 31 18:49:50 2004
From: brian at sweetapp.com (Brian Quinlan)
Date: Tue Aug 31 18:45:30 2004
Subject: [XML-SIG] Removing insignificant whitespace
In-Reply-To: <ch143v$tec$1@sea.gmane.org>
References: <1093924611.4133f7037b4c9@www-mail.usyd.edu.au>
	<ch143v$tec$1@sea.gmane.org>
Message-ID: <4134AC2E.2060404@sweetapp.com>

I'm trying to remove the whitespace-only text nodes in my XML DOM. I've 
tried two approaches:

1. StripXml - generates an exception:

   File "mac.py", line 25, in __init__
     StripXml(self.document)
   File "/usr/lib/python2.3/site-packages/_xmlplus/dom/ext/__init__.py", line 153, in StripXml
     snit = owner_doc.createNodeIterator(startNode, NodeFilter.SHOW_TEXT,
AttributeError: Document instance has no attribute 'createNodeIterator'

2. setFeature('whitespace_in_element_content', False) seems to do
    nothing

My code is here:

from xml import xpath, dom
from xml.dom.ext import StripXml
from xml.dom.xmlbuilder import DOMInputSource, DOMBuilder
from optparse import OptionParser
from pprint import pprint
import os

b = DOMBuilder()
b.setFeature('whitespace_in_element_content', False)
self.document = b.parse(...)
StripXml(self.document)

My XML does not include a DTD or any declarations regarding whitespace. 
  Can anyone offer any advice?

Cheers,
Brian
From brian at sweetapp.com  Tue Aug 31 18:53:49 2004
From: brian at sweetapp.com (Brian Quinlan)
Date: Tue Aug 31 18:49:22 2004
Subject: [XML-SIG] PyXML  XPath limitation
In-Reply-To: <ch143v$tec$1@sea.gmane.org>
References: <1093924611.4133f7037b4c9@www-mail.usyd.edu.au>
	<ch143v$tec$1@sea.gmane.org>
Message-ID: <4134AD1D.1030905@sweetapp.com>

In the unlikely event that this isn't a known problem, or in the more 
likely event that I am doing something wrong, the following code 
generates an exception for me:

nodes = xpath.Evaluate(
             '//dict[key=%r]/key' % key, self.document)

Traceback (most recent call last):
   File "mac.py", line 87, in ?
     pprint(info[options.field])
   File "mac.py", line 69, in __getitem__
     nodes = xpath.Evaluate(
   File "/usr/lib/python2.3/site-packages/_xmlplus/xpath/__init__.py", line 70, in Evaluate
     retval = parser.new().parse(expr).evaluate(con)
   File "/usr/lib/python2.3/site-packages/_xmlplus/xpath/ParsedAbbreviatedAbsoluteLocationPath.py", line 44, in evaluate
     sub_rt.extend(self._rel.select(context))
   File "/usr/lib/python2.3/site-packages/_xmlplus/xpath/ParsedRelativeLocationPath.py", line 23, in evaluate
     raise Exception("Expected node set from relative expression.  Got %s"%str(rt))
Exception: Expected node set from relative expression.  Got ()

Cheers,
Brian
From tpassin at comcast.net  Tue Aug 31 22:33:51 2004
From: tpassin at comcast.net (Thomas B. Passin)
Date: Tue Aug 31 22:31:30 2004
Subject: [XML-SIG] Removing insignificant whitespace
In-Reply-To: <4134AC2E.2060404@sweetapp.com>
References: <1093924611.4133f7037b4c9@www-mail.usyd.edu.au>	<ch143v$tec$1@sea.gmane.org>
	<4134AC2E.2060404@sweetapp.com>
Message-ID: <4134E0AF.5040209@comcast.net>

Brian Quinlan wrote:

> I'm trying to remove the whitespace-only text nodes in my XML DOM. I've 
> tried two approaches:
> 
> 1. StripXml - generates a an exception:
> 
>   File "mac.py", line 25, in __init__
>     StripXml(self.document)
>   File "/usr/lib/python2.3/site-packages/_xmlplus/dom/ext/__init__.py", 
> line 153, in StripXml
>     snit = owner_doc.createNodeIterator(startNode, NodeFilter.SHOW_TEXT,
> AttributeError: Document instance has no attribute 'createNodeIterator'
> 
> 2. setFeature('whitespace_in_element_content', False) seems to do
>    nothing
> 
> My code is here:
> 
> from xml import xpath, dom
> from xml.dom.ext import StripXml
> from xml.dom.xmlbuilder import DOMInputSource, DOMBuilder
> from optparse import OptionParser
> from pprint import pprint
> import os
> 
> b = DOMBuilder()
> b.setFeature('whitespace_in_element_content', False)
> self.document = b.parse(...)
> StripXml(self.document)
> 
> My XML does not include a DTD or any declarations regarding whitespace. 
>  Can anyone offer any advice?

What's wrong with normalize()?

Cheers,

Tom P
-- 
Thomas B. Passin
Explorer's Guide to the Semantic Web (Manning Books)
http://www.manning.com/catalog/view.php?book=passin
From fdrake at acm.org  Tue Aug 31 23:57:54 2004
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue Aug 31 23:58:06 2004
Subject: [XML-SIG] Removing insignificant whitespace
In-Reply-To: <4134E0AF.5040209@comcast.net>
References: <1093924611.4133f7037b4c9@www-mail.usyd.edu.au>
	<4134AC2E.2060404@sweetapp.com> <4134E0AF.5040209@comcast.net>
Message-ID: <200408311757.54733.fdrake@acm.org>

On Tuesday 31 August 2004 04:33 pm, Thomas B. Passin wrote:
 > What's wrong with normalize()?

What does normalize do about whitespace in content?  If anything, that's a 
bug.  normalize() only deals with how adjacent nodes containing character 
data are combined.
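
To illustrate the distinction, here is a minimal sketch. It assumes
xml.dom.minidom rather than the DOMBuilder setup from the original post, and
strip_whitespace_nodes() is a hypothetical helper, not an existing library
function. It shows that normalize() leaves whitespace-only text nodes in
place, and that removing them takes an explicit walk over the tree.

```python
from xml.dom import minidom, Node

XML_WHITESPACE = " \t\r\n"

def strip_whitespace_nodes(node):
    """Recursively remove text nodes that contain only XML whitespace."""
    # Iterate over a copy because childNodes changes as nodes are removed.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE:
            if not child.data.strip(XML_WHITESPACE):
                node.removeChild(child)
        else:
            strip_whitespace_nodes(child)

doc = minidom.parseString("<a>\n  <b>text</b>\n  <c/>\n</a>")

doc.normalize()  # merges adjacent text nodes, keeps the whitespace ones
before = len(doc.documentElement.childNodes)

strip_whitespace_nodes(doc)
after = len(doc.documentElement.childNodes)

print(before, after)
print(doc.documentElement.toxml())
```

Note that this drops every whitespace-only text node, including any that an
application might consider significant. Without a DTD there is nothing in
the document itself to tell the two cases apart.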


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>