re or html parser module, for wildcard search within html document?

Sun Aug 3 01:49:44 EDT 2003

On 1 Aug 2003 19:06:53 -0700, mm2ps at yahoo.co.uk (Douglas) wrote:

>I want to search and replace some expressions within an html document.
>Specifically, I want to replace any tag containing the word "font"
>with a new tag. As I want to use some form of wild card for the
>search, eg. <*font*>, should I use a regular expression module (re) or
>one of the specific html parsers? If this should be done with an html
>parser module then which one and where is some easy going introductory
>documentation, please?
>
Do you want to change to another font? If you want to eliminate it altogether,
you will have to eliminate the </font> end tag also.

A regex seems unlikely to bomb on this, unless someone has deleted part of a tag pair, leaving the
tags unmatched, and then commented the trash out. But then they deserve more trash ;-)
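
For the elimination case, here is a minimal sketch. The pattern and the sample string are my own
illustration, not taken from the saved page. It strips both the opening and the closing font tags
and leaves the enclosed text alone:

```python
import re

# Matches <font ...> and </font> in any letter case.
# \b stops the match from extending into a longer tag name
# (e.g. a hypothetical <fontsize> tag would be left untouched).
font_tag = re.compile(r'</?font\b[^>]*>', re.IGNORECASE)

# Made-up sample input, just for illustration.
sample = '<p><FONT color="#ffffff">Hello</FONT> <font size="2">world</font></p>'

stripped = font_tag.sub('', sample)
print(stripped)  # <p>Hello world</p>
```

Like any regex approach, this assumes no stray '>' characters inside attribute values.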

Assuming you just want to change the opening font tag to another font tag, a regex like the one
in the session below should do:

Read starting info (I saved python page to disk)

 >>> html = open('www_python_org.html').read()

Make regex
 >>> import re
 >>> rxo = re.compile(r'<[Ff][Oo][Nn][Tt] [^>]*>')

Check original
 >>> rxo.findall(html)
 ['<font color="#ffffff">', '<font color="#ffffff">', '<font color="#ffffff">', '<font color="#ffffff">', '<font color="#ffffff">', '<font color="#ffffff">', '<font color="#ffffff">']

Make a new version by substitution
 >>> html2 = rxo.sub('<FONT color="#FF0000">', html)

Write it out
 >>> open('www_python_red.html','w').write(html2)

Check what we did to the data (open both files in a browser and compare the left side of the page)

 >>> rxo.findall(html2)
 ['<FONT color="#FF0000">', '<FONT color="#FF0000">', '<FONT color="#FF0000">', '<FONT color="#FF0000">', '<FONT color="#FF0000">', '<FONT color="#FF0000">', '<FONT color="#FF0000">']
 >>>
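
The session above can also be folded into a small function. This is only a sketch under my own
assumptions: the names open_font and recolor are made up, and the pattern is a slight variation
that uses re.IGNORECASE instead of the [Ff][Oo][Nn][Tt] spelling and also catches a bare <font>
with no attributes.

```python
import re

# Opening font tags only, any letter case, with or without attributes.
open_font = re.compile(r'<font\b[^>]*>', re.IGNORECASE)

def recolor(html, new_tag='<FONT color="#FF0000">'):
    """Return (new_html, count) with every opening font tag replaced by new_tag."""
    # subn() is like sub() but also reports how many substitutions were made.
    return open_font.subn(new_tag, html)

# Made-up sample input, just for illustration.
sample = '<font color="#ffffff">a</font><FONT>b</FONT>'
result, count = recolor(sample)
print(count)   # 2
print(result)  # <FONT color="#FF0000">a</font><FONT color="#FF0000">b</FONT>
```

The count gives a quick sanity check against what findall() reported on the original text.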

HTH

Regards,
Bengt Richter