[Tutor] xml question

tpc at csua.berkeley.edu tpc at csua.berkeley.edu
Wed Oct 22 14:52:12 EDT 2003


Hi Jimmy,

If you want to extract the PCDATA from your XML file, you may want to
look at xmllib:

http://www.python.org/doc/current/lib/module-xmllib.html

I have experience parsing XHTML with HTMLParser.HTMLParser, and I imagine
xmllib.XMLParser isn't much different.  Here's an example of an XHTML
parser that ignores anything inside <style> (which for me tends to break
parsing of the page), gathers the title and the text content, and records
the last element parsed, so that it can be reported if the page causes a
parse error:

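try:
        import HTMLParser                       # Python 2
except ImportError:
        import html.parser as HTMLParser        # Python 3

class XHTMLParser(HTMLParser.HTMLParser):
        # Simple class used to parse title and text outside of XHTML tags.
        # Also records the last element parsed, for error reporting.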

        def __init__(self):
                HTMLParser.HTMLParser.__init__(self)
                self.extracted_text = ''
                self.title = ''
                self.in_title = False
                self.in_style = False
                self.last_parsed_tag = ''

        def handle_starttag(self, tag, attrs):
                # Rebuild the start tag as text, including any attributes.
                strattrs = ''.join([' %s="%s"' % (key, value) for key, value in attrs])
                self.last_parsed_tag = ' Last parsed element: <' + tag + strattrs + '>'
                # Set the flags whether or not the tag carries attributes,
                # so that <style type="text/css"> is recognised too.
                if tag == "title":
                        self.in_title = True
                elif tag == "style":
                        self.in_style = True

        def handle_endtag(self, tag):
                self.last_parsed_tag = ' Last parsed element: </' + tag + '>'
                if (tag == "title"):
                        self.in_title = False
                elif (tag == "style"):
                        self.in_style = False

        def handle_data(self, data):
                if data.isspace():
                        pass
                elif self.in_title:
                        self.title = self.title + data
                elif self.in_style:
                        pass
                else:
                        self.extracted_text = self.extracted_text + " " + data

        def get_text(self):
                result = self.extracted_text
                self.extracted_text = ""
                return result

        def get_title(self):
                result = self.title
                self.title = ""
                return result
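
To show how a parser like the one above is driven, here is a minimal
sketch, assuming Python 3 (where the module is html.parser).  The class
TextOnlyParser, the function extract_text and the sample markup are made
up for illustration; it is a cut-down stand-in for the class above, not a
replacement for it.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Hypothetical cut-down parser: collects text outside <style>."""

    def __init__(self):
        super().__init__()
        self.in_style = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_data(self, data):
        # Skip whitespace-only runs and anything inside <style>.
        if not data.isspace() and not self.in_style:
            self.chunks.append(data)


def extract_text(markup):
    parser = TextOnlyParser()
    parser.feed(markup)   # handlers fire as the markup is consumed
    parser.close()        # process anything still buffered
    return " ".join(parser.chunks)


# Illustrative input only.
sample = ('<html><head><style type="text/css">p {color: red}</style></head>'
          '<body><p>Hello</p><p>world</p></body></html>')
print(extract_text(sample))
```

You only ever call feed() and close(); the handle_* methods are called
back by the parser as it meets tags and text.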


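For the two tables you describe, here is a rough sketch using
xml.etree.ElementTree from the current standard library instead of xmllib
(xmllib was removed in Python 3).  It rests on assumptions: an element
like <A = "1"/> has no attribute name, so it is not well-formed XML, and
the sample below uses an invented attribute called "value" in its place.
That also means the numbers live in attributes rather than in PCDATA.
The names SAMPLE, values_of and build_tables are hypothetical, and the
second result is called "tables" to avoid shadowing Python's built-in
dir().

```python
import xml.etree.ElementTree as ET

# Assumed well-formed version of the posted file; the "value"
# attribute name is invented, the original has none.
SAMPLE = """
<XYZ Version="0">
    <header>
        <A value="1"/> <B value="2"/> <C value="0"/> <D value="0"/> <E value="0"/>
    </header>
    <dir>
        <table0>
            <P value="0"/> <Q value="1"/> <R value="2"/> <S value="3"/>
        </table0>
        <table1>
            <P value="4"/> <Q value="5"/> <R value="6"/> <S value="7"/>
        </table1>
    </dir>
</XYZ>
"""


def values_of(element):
    """Return the integer "value" attribute of each child, in document order."""
    return [int(child.get("value")) for child in element]


def build_tables(xml_text):
    root = ET.fromstring(xml_text)
    header = values_of(root.find("header"))
    # One inner list per <tableN> child of <dir>.
    tables = [values_of(table) for table in root.find("dir")]
    return header, tables


header, tables = build_tables(SAMPLE)
print(header)
print(tables)
```

If your real file keeps the numbers somewhere else, only values_of()
should need to change.
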

On Wed, 22 Oct 2003, Jimmy verma wrote:

> Hello,
>
> I am getting caught in a problem for which I need some suggestions.
>
> I have an xml file like
>
> <XYZ Version="0">
>     <header>
>        <A = "1"/>
>        <B = "2"/>
>        <C = "0"/>
>        <D = "0"/>
>        <E = "0"/>
>     </header>
>
>     <dir>
>        <table0>
>           <P = "0"/>
>           <Q = "1"/>
>           <R = "2"/>
>           <S = "3"/>
>        </table0>
>        <table1>
>           <P = "4"/>
>           <Q= "5"/>
>           <R = "6"/>
>           <S = "7"/>
>        </table1>
>     </dir>
>
> </XYZ>
>
> I want to make two tables out of this: one is 'header' and the other is
> 'dir'.
>
> The tables can be in the form of lists, like
> header = [1,2,0,0,0]
>
> dir = [ [0,1,2,3], [4,5,6,7] ]
>
>
> I just want the values in the tables, not the tags. Is there some pythonic
> way of doing it?
>
>
> Thanks in advance.
>
> Regards,
>
> J+



