[Tutor] xml question
tpc at csua.berkeley.edu
Wed Oct 22 14:52:12 EDT 2003
Hi Jimmy,
If you want to extract the PCDATA from your XML file, you may want to look
at xmllib (deprecated since Python 2.0 in favour of xml.sax, but still in
the standard library):
http://www.python.org/doc/current/lib/module-xmllib.html
I have experience parsing XHTML with HTMLParser.HTMLParser, and I imagine
the xmllib parser class isn't much different. Here's an example of an
XHTML parser that ignores anything inside <style> (which for me tends to
break parsing of the page), gathers the title and content, and records
the last element parsed, so you can see where a page failed if an error
occurs:

    try:
        import HTMLParser                  # Python 2
    except ImportError:
        import html.parser as HTMLParser   # Python 3 renamed the module

    class XHTMLParser(HTMLParser.HTMLParser):
        # Simple class used to parse title and text outside of XHTML tags.
        # Also records the last element parsed, to help locate parse errors.

        def __init__(self):
            HTMLParser.HTMLParser.__init__(self)
            self.extracted_text = ''
            self.title = ''
            self.in_title = False
            self.in_style = False
            self.last_parsed_tag = ''

        def handle_starttag(self, tag, attrs):
            # Rebuild the start tag, including any attributes it carries.
            strattrs = ''.join([' %s="%s"' % (key, value)
                                for key, value in attrs])
            self.last_parsed_tag = (' Last parsed element: <' + tag +
                                    strattrs + '>')
            # Track <title> and <style> even when they have attributes,
            # e.g. <style type="text/css">.
            if tag == "title":
                self.in_title = True
            elif tag == "style":
                self.in_style = True

        def handle_endtag(self, tag):
            self.last_parsed_tag = ' Last parsed element: </' + tag + '>'
            if tag == "title":
                self.in_title = False
            elif tag == "style":
                self.in_style = False

        def handle_data(self, data):
            # Skip whitespace-only runs and anything inside <style>.
            if data.isspace() or self.in_style:
                pass
            elif self.in_title:
                self.title = self.title + data
            else:
                self.extracted_text = self.extracted_text + " " + data

        def get_text(self):
            # Return the text gathered so far and reset the buffer.
            result = self.extracted_text
            self.extracted_text = ""
            return result

        def get_title(self):
            # Return the title gathered so far and reset the buffer.
            result = self.title
            self.title = ""
            return result
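
Applied to your file, the same handler approach might look like the
minimal sketch below. It assumes Python 3 (where the module is
html.parser) and, because <A = "1"/> is not well-formed, a hypothetical
variant of your file in which each value sits in an attribute named
"value". The class name TableParser, the SAMPLE string and the "value"
attribute are my own inventions for illustration, not anything taken from
your data.

```python
from html.parser import HTMLParser

# Hypothetical, well-formed variant of the file: every value is stored in
# an attribute called "value" (an assumption; <A = "1"/> is not valid XML).
SAMPLE = """
<XYZ Version="0">
  <header>
    <A value="1"/> <B value="2"/> <C value="0"/> <D value="0"/> <E value="0"/>
  </header>
  <dir>
    <table0>
      <P value="0"/> <Q value="1"/> <R value="2"/> <S value="3"/>
    </table0>
    <table1>
      <P value="4"/> <Q value="5"/> <R value="6"/> <S value="7"/>
    </table1>
  </dir>
</XYZ>
"""


class TableParser(HTMLParser):
    """Collect the integer values under <header> and under each <tableN>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.header = []
        self.dir = []
        self.current = None  # the list that values are appended to right now

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, and by default a
        # self-closing element such as <A .../> is reported here as well.
        if tag == "header":
            self.current = self.header
        elif tag.startswith("table"):
            self.current = []          # start a new row for this table
            self.dir.append(self.current)
        else:
            value = dict(attrs).get("value")
            if value is not None and self.current is not None:
                self.current.append(int(value))

    def handle_endtag(self, tag):
        # Leaving <header> or a <tableN>: stop collecting values.
        if tag == "header" or tag.startswith("table"):
            self.current = None


parser = TableParser()
parser.feed(SAMPLE)
parser.close()
print(parser.header)
print(parser.dir)
```

With the sample above this prints [1, 2, 0, 0, 0] and then
[[0, 1, 2, 3], [4, 5, 6, 7]]. HTMLParser is lenient and folds names to
lower case, so for strict XML a real XML parser such as xml.sax is
probably the safer choice.
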
On Wed, 22 Oct 2003, Jimmy verma wrote:
> Hello,
>
> I am stuck on a problem for which I need some suggestions.
>
> I have an xml file like
>
> <XYZ Version="0">
> <header>
> <A = "1"/>
> <B = "2"/>
> <C = "0"/>
> <D = "0"/>
> <E = "0"/>
> </header>
>
> <dir>
> <table0>
> <P = "0"/>
> <Q = "1"/>
> <R = "2"/>
> <S = "3"/>
> </table0>
> <table1>
> <P = "4"/>
> <Q = "5"/>
> <R = "6"/>
> <S = "7"/>
> </table1>
> </dir>
>
> </XYZ>
>
> I want to make two tables out of this: one is 'header' and the other is 'dir'.
>
> The tables can be in the form of lists, like:
>
> header = [1,2,0,0,0]
>
> dir = [ [0,1,2,3], [4,5,6,7] ]
>
>
> I just want the values in the tables, not the tags. Is there some Pythonic
> way of doing it?
>
>
> Thanks in advance.
>
> Regards,
>
> J+