and <@#$%> with

From fredrik at pythonware.com  Tue Jul 15 13:22:16 2008
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 15 Jul 2008 13:22:16 +0200
Subject: [XML-SIG] elementtree and uncomplete parsing
In-Reply-To: <1213994442.485c15ca1aaca@imp.free.fr>
References: <1213994442.485c15ca1aaca@imp.free.fr>
Message-ID:

jeanmarc.chourot at free.fr wrote:

> I would like to retrieve what is between the tags ... into
> strings, the "subelements" being considered as simple string and not processed
> by element tree.
> In other words, this could be badly formed HTML not processed embedded into well
> formed xml tags.
>
> i.e. :
> string1 = "This text is completely crap because
> blabla "
> string2="This is another node with
> random tags "

You say parse, but your description seems to say that you want to
serialize the contents of an XML node, but without getting the outermost
element. Is that correct?

In ET 1.3, you can do this by setting the tag to None and then
serializing the node as usual, but to do this in 1.2 (as shipped with
Python 2.5), you need to process the string afterwards. Assuming the
element you want to serialize is in the variable "node", you can do:

>>> node
>>> s = ET.tostring(node)
>>> s
'something some other thing hello text'
>>> _, _, s = s.partition(">") # chop off first tag
>>> s, _, _ = s.rpartition("<") # chop off last tag
>>> s
'something some other thing hello text'
>>>

Alternatively, you can "normalize" the node and use ordinary slicing:

>>> node.tag = "node" # make sure we know what it is
>>> node.attrib.clear()
>>> s = ET.tostring(node)
>>> s = s[6:-7]
>>> s
'something some other thing hello text'
>>>

From joshua.r.english at gmail.com  Sat Jul 19 17:27:13 2008
From: joshua.r.english at gmail.com (Josh English)
Date: Sat, 19 Jul 2008 08:27:13 -0700
Subject: [XML-SIG] Parser not resetting
Message-ID:

I have a file of XML-data objects. I wrote a ContentHandler to search
for elements and return them in a variety of formats.

Once I plugged all this into a GUI I discovered a strange problem. The
function call tells a parser to parse the same source file, but it stops
reading the source file and limits searching to previously found results.

I'm using a ConfigParser object to determine what to match, but I don't
think this is the problem.

The code uses an xml.sax.handler.ContentHandler subclass, that stores
the results in "output" and the filter function is:

def list_submissions(argstring,filters={}):
    sh=SubHandler(argstring,filters)
    parser = make_parser()
    parser.setContentHandler(sh)
    parser.parse(open(submissionpath))

    return sh.output

running this function over and over causes the problem

print len(list_submissions("--dict"))        # Get all submissions as dictionary objects
print len(list_submissions("-mAsmv --dict")) # Get all submissions to this market
print len(list_submissions("--dict"))        # Get all submissions again

the results:
152
7
7

The last call should give me 152. I'm not finding a method to reset a
parser. I've tried deleting the parser object in the function, deleting
the parser and the content handler object in the function, but nothing
seems to solve this problem.

Has anyone seen this?

I'm running Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310
32 bit (Intel)], using the standard xml files with this distribution.

Thanks,

Josh English
Joshua.R.English at gmail.com
http://joshenglish.livejournal.com

From fredrik at pythonware.com  Sat Jul 19 20:54:36 2008
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Sat, 19 Jul 2008 20:54:36 +0200
Subject: [XML-SIG] Parser not resetting
In-Reply-To:
References:
Message-ID:

Josh English wrote:

> The code uses an xml.sax.handler.ContentHandler subclass, that stores
> the results in "output" and the filter function is :
>
> def list_submissions(argstring,filters={}):
>     sh=SubHandler(argstring,filters)
>     parser = make_parser()
>     parser.setContentHandler(sh)
>     parser.parse(open(submissionpath))
>
>     return sh.output
>
> running this function over and over causes the problem

what's filters here? are you aware of this issue:

    http://effbot.org/zone/default-values.htm

?

From ericchao0613 at gmail.com  Thu Jul 24 10:29:46 2008
From: ericchao0613 at gmail.com (Eric Chao)
Date: Thu, 24 Jul 2008 16:29:46 +0800
Subject: [XML-SIG] Creating XML with Python
Message-ID:

I've been trying to convert some text that has some odd coding to xml. I
am trying to use python to create a program that will process this text:

GENESIS
CHAPTER 1
The Creation
{{01:1}}1 In the beginning God created the heavens and the earth.
{{01:1}}2 The earth was <$FOr {a waste and emptiness}>>formless and void, and darkness was over the
{{01:1}}3 Then God said, ``Let there be light"; and there was light.

to something like this:

In the beginning God created the heaven and the earth.

And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.

And God said, Let there be light: and there was light.

I am not very good with Python and I was hoping someone could offer some
advice on how to get started. I tried to write a program that produces
XML, but I think I need more of a find and replace type program. Thanks !

-Eric

From fredrik at pythonware.com  Thu Jul 24 14:04:58 2008
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Thu, 24 Jul 2008 14:04:58 +0200
Subject: [XML-SIG] Creating XML with Python
In-Reply-To:
References:
Message-ID:

Eric Chao wrote:

> I've been trying to convert some text that has some odd coding to xml. I
> am trying to use python to create a program that will process this text:
>
> GENESIS
> CHAPTER 1
> The Creation
> {{01:1}}1 In the beginning God created the heavens and
> the earth.
> {{01:1}}2 The earth was <$FOr {a waste and
> emptiness}>>formless and void, and darkness was over the
> {{01:1}}3 Then God said, ``Let there be light"; and there was light.
>
> to something like this:
>
>
>
>

In the beginning God created the heaven and the
> earth.

And the earth was without form, and void; and
> darkness was upon the face of the deep. And the Spirit of God moved upon
> the face of the waters.

And God said, Let there be light: and there was
> light.

>
> I am not very good with Python and I was hoping someone could offer some
> advice on how to get started. I tried to write a program that produces
> XML, but I think I need more of a find and replace type program. Thanks !

that looks a rather daunting task even for an experienced Python
programmer (especially mapping between different translations ;-).

I'd concentrate on parsing the original file format first, before even
thinking about how to write it out in XML.

it might be some kind of SGML, in which case the standard sgmllib
library might be helpful:

    http://effbot.org/librarybook/sgmllib.htm

if that seems to work, try building some suitable data structure from
the incoming data (lists of strings might work, but you might want to
create some simple container objects that hold the lists for you).

when you have all this in place, you can either just walk the data
structure and create XML on the fly (don't forget to escape reserved
characters; you can use cgi.escape for that), or build e.g. an
ElementTree (xml.etree) and then ask that module to serialize the tree
for you.

hope this helps, at least a little.

From stefan_ml at behnel.de  Thu Jul 24 14:14:57 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 24 Jul 2008 14:14:57 +0200 (CEST)
Subject: [XML-SIG] Creating XML with Python
In-Reply-To:
References:
Message-ID: <51226.213.61.181.86.1216901697.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Eric Chao wrote:
> advice on how to get started. I tried to write a program that produces
> XML, but I think I need more of a find and replace type program. Thanks !

You might find these interesting:

http://codespeak.net/lxml/tutorial.html#the-e-factory
http://codespeak.net/lxml/s5/lxml-ep2008.html

Stefan

From jcd at unc.edu  Thu Jul 24 14:30:18 2008
From: jcd at unc.edu (J. Clifford Dyer)
Date: Thu, 24 Jul 2008 08:30:18 -0400
Subject: [XML-SIG] Creating XML with Python
In-Reply-To:
References:
Message-ID: <1216902618.31596.2.camel@jcd-desktop>

On Thu, 2008-07-24 at 14:04 +0200, Fredrik Lundh wrote:
> Eric Chao wrote:
>
> > I've been trying to convert some text that has some odd coding to xml. I
> > am trying to use python to create a program that will process this text:
> >
> > GENESIS
> > CHAPTER 1
> > The Creation
> > {{01:1}}1 In the beginning God created the heavens and
> > the earth.
> > {{01:1}}2 The earth was <$FOr {a waste and
> > emptiness}>>formless and void, and darkness was over the
> > {{01:1}}3 Then God said, ``Let there be light"; and there was light.
> >
> > to something like this:
> >
> >
> >
> >

In the beginning God created the heaven and the
> > earth.

> >

And the earth was without form, and void; and
> > darkness was upon the face of the deep. And the Spirit of God moved upon
> > the face of the waters.

> >

And God said, Let there be light: and there was
> > light.

> >
> > I am not very good with Python and I was hoping someone could offer some
> > advice on how to get started. I tried to write a program that produces
> > XML, but I think I need more of a find and replace type program. Thanks !
>
> that looks a rather daunting task even for an experienced Python
> programmer (especially mapping between different translations ;-).
>
> I'd concentrate on parsing the original file format first, before even
> thinking about how to write it out in XML.
>
> it might be some kind of SGML, in which case the standard sgmllib
> library might be helpful:
>
> http://effbot.org/librarybook/sgmllib.htm
>
> if that seems to work, try building some suitable data structure from
> the incoming data (lists of strings might work, but you might want to
> create some simple container objects that hold the lists for you).

If it turns out not to be valid SGML, you may need to look into using
pyparsing. There was a good introduction to it in a recent issue of
python magazine. There are also a bunch of online tutorials.

--
Oook!
J. Cliff Dyer
Carolina Digital Library and Archives
UNC Chapel Hill

From jsulak at gmail.com  Sun Jul 27 18:40:13 2008
From: jsulak at gmail.com (James Sulak)
Date: Sun, 27 Jul 2008 11:40:13 -0500
Subject: [XML-SIG] Replicating DTD information using XMLFilterBase and XMLGenerator
Message-ID: <7cb78b3b0807270940r3b0d6a78n323ec6b4bf7e2319@mail.gmail.com>

Hello All,

I'm attempting to use xml.sax.saxutils.XMLFilterBase and XMLGenerator to
take an input XML document, filter out certain elements, and output
the result to a second XML file. I have it mostly working, except
that I lose the DTD declaration and anything (processing instructions
or comments) before the root element. I believe I'm supposed to be
using a LexicalHandler to get the information from the DTD, but I have
not been able to figure out how to do this, or how to integrate it
with the rest of the code.

I'm pretty new at using Python (and SAX, for that matter) to work with
XML, so I'm hoping this is a fairly simple question. I'm basing my code
off of Uche Ogbuji's example at
http://www.ibm.com/developerworks/xml/library/x-tipsaxflex.html#resources.

Any help would be appreciated.

Thanks,

-James

From stefan_ml at behnel.de  Sun Jul 27 22:38:44 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 27 Jul 2008 22:38:44 +0200
Subject: [XML-SIG] Replicating DTD information using XMLFilterBase and XMLGenerator
In-Reply-To: <7cb78b3b0807270940r3b0d6a78n323ec6b4bf7e2319@mail.gmail.com>
References: <7cb78b3b0807270940r3b0d6a78n323ec6b4bf7e2319@mail.gmail.com>
Message-ID: <488CDCD4.5040001@behnel.de>

Hi,

James Sulak wrote:
> I'm attempting to use xml.sax.saxutils.XMLFilterBase and XMLGenerator to
> take an input XML document, filter out certain elements, and output
> the result to a second XML file. I have it mostly working, except
> that I lose the DTD declaration and anything (processing instructions
> or comments) before the root element. I believe I'm supposed to be
> using a LexicalHandler to get the information from the DTD, but I have
> not been able to figure out how to do this, or how to integrate it
> with the rest of the code.
>
> I'm pretty new at using Python (and SAX, for that matter) to work with
> XML

Try lxml's iterparse() instead of SAX. It will build an in-memory tree
(including the DTD or its reference if you want, see the parser docs), but you
can remove the unwanted elements from the tree while it parses. It's still
pretty memory friendly and definitely a lot easier to work with than SAX.

http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk
http://codespeak.net/lxml/tutorial.html#parsing-from-strings-and-files

Stefan

From spencer.crissman at gmail.com  Mon Jul 28 18:13:33 2008
From: spencer.crissman at gmail.com (spencer.c)
Date: Mon, 28 Jul 2008 09:13:33 -0700 (PDT)
Subject: [XML-SIG] lxml - html entities
Message-ID: <18693223.post@talk.nabble.com>

I am using lxml to process some xhtml files. The files have html character
codes embedded in them. For instance: &#39; rather than a '. When I parse
the files, edit them, and then write them back out, I want my edits to be
the only changes in the output files, but lxml is replacing the character
codes with the actual characters they are supposed to represent as well.

So if I have:
It& #39;s an example. <-- Space inserted to help readability.

It is writing out:
It's an example.

I've tried setting resolve_entities to false, ala:
tree = etree.parse(input, etree.XMLParser(resolve_entities=False))

But this seems to have no effect.

Is there a way to tell lxml to ignore these/leave them as is?

Thanks.

-s
--
View this message in context: http://www.nabble.com/lxml---html-entities-tp18693223p18693223.html
Sent from the Python - xml-sig mailing list archive at Nabble.com.

From mnvl16 at yahoo.co.uk  Mon Jul 28 18:58:22 2008
From: mnvl16 at yahoo.co.uk (Jack Grahl)
Date: Mon, 28 Jul 2008 16:58:22 +0000 (GMT)
Subject: [XML-SIG] Catalog.py - is it possible to convert playlists to unicode on-the-fly?
Message-ID: <394083.27755.qm@web27905.mail.ukl.yahoo.com>

Hi there,

I have had serpentine choke on an .m3u playlist which, closer inspection
revealed, had filenames which included special characters in iso-latin-1
encoding. The files themselves had, I believe, their filenames
automatically converted to the same characters in utf-8 encoding.
(This was the result of ripping mp3s on a Windows machine and
transferring the files.)

I looked into the code a little bit and found that the playlist decoding
mechanism is in /usr/lib/python2.5/site-packages/Ft/Xml/Catalog.py. The
first place that python chokes on a non-unicode string is at line 332,
however the 'uri' string is used at many places after that.

Now, I understand that the module has no obligation to make sense of a
playlist which does not use the same encoding as the filename in
question, and so in fact points to a file which to all intents and
purposes isn't there. However as the behaviour is a little difficult for
the end user to deal with, I wondered if it would be possible to make
modifications to the code so that, in the event of the playlist files
not being found, it guesses at an alternate encoding, translates the
string to utf-8 based on that guess, and looks for the new filenames?
Obviously it's not difficult to check that such a guess is right if the
files in question do exist.

If anyone has any advice as to whether this is a good idea, and how to
implement it, I would be very grateful.

Yours, Jack Grahl

From jsulak at gmail.com  Tue Jul 29 02:56:32 2008
From: jsulak at gmail.com (James Sulak)
Date: Mon, 28 Jul 2008 19:56:32 -0500
Subject: [XML-SIG] Replicating DTD information using XMLFilterBase and XMLGenerator
In-Reply-To: <488CDCD4.5040001@behnel.de>
References: <7cb78b3b0807270940r3b0d6a78n323ec6b4bf7e2319@mail.gmail.com> <488CDCD4.5040001@behnel.de>
Message-ID: <7cb78b3b0807281756u780ee47ckb50bd9043b2bd548@mail.gmail.com>

Thanks, Stefan, for pointing me to lxml. It looks like a good
alternative to SAX in this situation.
However, I'm a little confused as to the best way to remove elements
from the tree while keeping their tail text. This is what I have so far:

context = etree.iterparse("test.xml")

for event, element in context:
    for title in element.xpath("child::title"):
        element.remove(title)

Do I need to explicitly assign the tail text to either the parent or
the preceding sibling? If so, what's the best way to accomplish that?

Thanks,

-James

On Sun, Jul 27, 2008 at 3:38 PM, Stefan Behnel wrote:
> Hi,
>
> James Sulak wrote:
>> I'm attempting to use xml.sax.saxutils.XMLFilterBase and XMLGenerator to
>> take an input XML document, filter out certain elements, and output
>> the result to a second XML file. I have it mostly working, except
>> that I lose the DTD declaration and anything (processing instructions
>> or comments) before the root element. I believe I'm supposed to be
>> using a LexicalHandler to get the information from the DTD, but I have
>> not been able to figure out how to do this, or how to integrate it
>> with the rest of the code.
>>
>> I'm pretty new at using Python (and SAX, for that matter) to work with
>> XML
>
> Try lxml's iterparse() instead of SAX. It will build an in-memory tree
> (including the DTD or its reference if you want, see the parser docs), but you
> can remove the unwanted elements from the tree while it parses. It's still
> pretty memory friendly and definitely a lot easier to work with than SAX.
>
> http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk
> http://codespeak.net/lxml/tutorial.html#parsing-from-strings-and-files
>
> Stefan
>

From stefan_ml at behnel.de  Tue Jul 29 07:42:03 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 29 Jul 2008 07:42:03 +0200
Subject: [XML-SIG] Replicating DTD information using XMLFilterBase and XMLGenerator
In-Reply-To: <7cb78b3b0807281756u780ee47ckb50bd9043b2bd548@mail.gmail.com>
References: <7cb78b3b0807270940r3b0d6a78n323ec6b4bf7e2319@mail.gmail.com> <488CDCD4.5040001@behnel.de> <7cb78b3b0807281756u780ee47ckb50bd9043b2bd548@mail.gmail.com>
Message-ID: <488EADAB.3000303@behnel.de>

Hi,

James Sulak wrote:
> Thanks, Stefan, for pointing me to lxml. It looks like a good
> alternative to SAX in this situation. However, I'm a little confused
> as to the best way to remove elements from the tree while keeping
> their tail text. This is what I have so far:
>
> context = etree.iterparse("test.xml")
>
> for event, element in context:
>     for title in element.xpath("child::title"):

it's likely faster to use

    for title in element.iterchildren("title"):

here.

>         element.remove(title)
>
> Do I need to explicitly assign the tail text to either the parent or
> the preceding sibling?

Yes, the tail text is part of the Element object. Take a look at the
"drop_tree" and "drop_tag" methods in lxml.html.

http://codespeak.net/svn/lxml/trunk/src/lxml/html/__init__.py

Stefan

From stefan_ml at behnel.de  Tue Jul 29 07:43:28 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 29 Jul 2008 07:43:28 +0200
Subject: [XML-SIG] lxml - html entities
In-Reply-To: <18693223.post@talk.nabble.com>
References: <18693223.post@talk.nabble.com>
Message-ID: <488EAE00.4040201@behnel.de>

(this is being discussed on the lxml mailing list)

spencer.c wrote:
> I am using lxml to process some xhtml files. The files have html character
> codes embedded in them. For instance: &#39; rather than a '. When I parse
> the files, edit them, and then write them back out, I want my edits to be
> the only changes in the output files, but lxml is replacing the character
> codes with the actual characters they are supposed to represent as well.
>
> So if I have:
> It& #39;s an example. <-- Space inserted to help readability.
>
> It is writing out:
> It's an example.
>
> I've tried setting resolve_entities to false, ala:
> tree = etree.parse(input, etree.XMLParser(resolve_entities=False))
>
> But this seems to have no effect.
>
> There a way to tell lxml to ignore these/leave them as is?
>
> Thanks.
>
> -s

From ericchao0613 at gmail.com  Tue Jul 29 08:39:16 2008
From: ericchao0613 at gmail.com (Eric Chao)
Date: Tue, 29 Jul 2008 14:39:16 +0800
Subject: [XML-SIG] find and replace
Message-ID:

Hey,

So I've been trying to find and replace some text with python using
regular expressions, but I haven't been able to make permanent changes
to the files.

-this is the data in the original file.

@#$The Creation@#$%
{{01:1}}1 In the beginning God created the heavens and the earth.
{{01:1}}2 The earth was <$FOr {a waste and emptiness}>>formless and void, and darkness was over the surface of the deep, and the Spirit of God was <$FOr {hovering}>>moving over the surface of the waters.
{{01:1}}3 Then God said, ``Let there be light"; and there was light.
{{01:1}}4 God saw that the light was good; and God separated the light from the darkness.

-I want to replace @#$ with and <@#$%> with using python. After I
replace a few elements, I can start creating xml. Can anyone offer some
insight as to how to do this ? It would be greatly appreciated.

Thanks !

-Eric
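
A minimal sketch of the kind of find-and-replace asked about in the
message above, using the standard re module (written for Python 3). The
element names "verse" and "note", the reading of {{01:1}}3 as book 01,
chapter 1, verse 3, and the reading of <$F...>> as a footnote are all
assumptions made for illustration; none of them comes from the thread.
Note that re.sub returns a new string, so a file only changes once the
result is written back out.

```python
import re
from xml.sax.saxutils import escape

# Verse marker such as "{{01:1}}3 "; the verse text runs to the next marker.
VERSE = re.compile(r"\{\{(\d+):(\d+)\}\}(\d+)\s+(.*?)\s*(?=\{\{|\Z)", re.DOTALL)
# Footnote marker such as "<$FOr {hovering}>>", matched AFTER escaping,
# so the angle brackets appear as &lt; and &gt;.
FOOTNOTE = re.compile(r"&lt;\$F(.*?)&gt;&gt;", re.DOTALL)


def convert(text):
    """Return text with verse and footnote markers turned into XML tags.

    The tag names are hypothetical placeholders, not a real schema.
    """
    text = escape(text)  # turn &, < and > into entities first
    text = FOOTNOTE.sub(r"<note>\1</note>", text)
    return VERSE.sub(
        r'<verse book="\1" chapter="\2" number="\3">\4</verse>' + "\n", text
    )


def convert_file(src_path, dst_path):
    """Read src_path, convert it, and write the result to dst_path."""
    with open(src_path, encoding="utf-8") as src:  # encoding is an assumption
        converted = convert(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(converted)
```

Escaping before substituting keeps stray "<" or "&" characters in the
source text from ending up as broken markup; the cost is that the
footnote pattern has to be written against the escaped form.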

From tjabaut at gmail.com  Thu Jul 31 20:31:42 2008
From: tjabaut at gmail.com (tjabaut)
Date: Thu, 31 Jul 2008 11:31:42 -0700 (PDT)
Subject: [XML-SIG] How to Parse XML with Single Element & Multiple Attribures
Message-ID:

I have an app where the API returns an XML state with a single element
that contains a number of attributes (depending on which API is called).

I would like to know how I can step through these attributes so that I
can populate an XHTML results page.

I am pretty new to Python, but would love to figure this out.
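
One possible way to step through the attributes of a single element, as
asked above, is the attrib mapping of xml.etree.ElementTree. This is only
a sketch (written for Python 3): the function name and the choice of a
table with one row per attribute are assumptions for illustration, and
the real API response may need different handling.

```python
import xml.etree.ElementTree as ET


def attributes_to_table(xml_text):
    """Render the root element's attributes as an XHTML table fragment."""
    element = ET.fromstring(xml_text)
    table = ET.Element("table")
    # element.attrib behaves like a dict of attribute name -> value.
    for name, value in sorted(element.attrib.items()):
        row = ET.SubElement(table, "tr")
        ET.SubElement(row, "th").text = name
        ET.SubElement(row, "td").text = value
    return ET.tostring(table, encoding="unicode")
```

Because the loop never names an attribute, the same function should work
whichever attributes a particular API call happens to return. Building
the rows as elements rather than by string concatenation also leaves the
escaping of reserved characters to the serializer.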