Formatters, Writers, Parsers...

Kevin Carlson nskhcarlso at bellsouth.net
Mon Jun 16 18:14:55 EDT 2003


I am trying to write a module that will convert HTML into formatted 
text.  I have seen numerous examples in the archives and in the Python
Cookbook of stripping out HTML tags, all of which seem to work fine.

However, what I need to do is format the text according to the tags that 
the HTML contains.  In particular, I need to increase the indentation 
about 10 characters for each <td> tag that is read.  I have been 
approaching this as follows:


------- Begin code -----------
import formatter
import htmllib


class TextParser(htmllib.HTMLParser):

    def __init__(self, fmtr, verbose=0):
        htmllib.HTMLParser.__init__(self, fmtr, verbose)
        self.fmtr = fmtr
        self.insideTR = 0
        self.currentMargin = 0

    def start_tr(self, attrs):
        self.insideTR = 1

    def end_tr(self):
        self.insideTR = 0
        self.fmtr.end_paragraph(1)

    def start_td(self, attrs):
        # Ask the formatter for 10 more characters of margin per cell.
        self.currentMargin = self.currentMargin + 10
        self.fmtr.push_margin(self.currentMargin)

    def end_td(self):
        pass


def parseText(data):
    parser = TextParser(
        formatter.AbstractFormatter(formatter.DumbWriter()))
    parser.feed(data)

------- End code ---------


I guess I don't understand formatters, because I am getting exactly the 
same results as if I don't issue any of the calls to the formatter.
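
To make the intended result concrete, here is a minimal sketch of the 
behaviour described above: ten more characters of indentation for each 
<td> that is read.  It is only an illustration and not the htmllib and 
formatter approach from the code above.  It assumes Python 3's 
html.parser module, and the names IndentingParser, 
html_to_indented_text and indent_step are hypothetical.  Putting each 
cell on its own line and resetting the margin at </tr> are also 
assumptions, not something the code above does.

```python
from html.parser import HTMLParser


class IndentingParser(HTMLParser):
    """Collect text, indenting each table cell further than the last."""

    def __init__(self, indent_step=10):
        super().__init__()
        self.indent_step = indent_step  # characters added per <td>
        self.margin = 0                 # current indentation
        self.lines = []                 # finished output lines
        self.pending = []               # text fragments not yet written

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._flush()               # write any text seen before the cell
            self.margin += self.indent_step

    def handle_endtag(self, tag):
        if tag == "td":
            self._flush()
        elif tag == "tr":
            self._flush()
            self.margin = 0             # assumption: each row starts over

    def handle_data(self, data):
        self.pending.append(data)

    def _flush(self):
        # Collapse runs of whitespace, then emit one indented line.
        text = " ".join("".join(self.pending).split())
        if text:
            self.lines.append(" " * self.margin + text)
        self.pending = []


def html_to_indented_text(html, indent_step=10):
    parser = IndentingParser(indent_step)
    parser.feed(html)
    parser.close()
    parser._flush()                     # write any trailing text
    return "\n".join(parser.lines)


if __name__ == "__main__":
    sample = "<table><tr><td>Name</td><td>Kevin</td></tr></table>"
    print(html_to_indented_text(sample))
```

With this sketch the first cell of a row starts at column 10 and the 
second at column 20, which is the kind of output the formatter calls 
above were meant to produce.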

Can anyone shed some light on what I am doing wrong?

Thanks,

Kevin






