re or html parser module, for wildcard search within html document?
Bengt Richter
bokr at oz.net
Sun Aug 3 01:49:44 EDT 2003
On 1 Aug 2003 19:06:53 -0700, mm2ps at yahoo.co.uk (Douglas) wrote:
>I want to search and replace some expressions within an html document.
>Specifically, I want to replace any tag containing the word "font"
>with a new tag. As I want to use some form of wild card for the
>search, eg. <*font*>, should I use a regular expression module (re) or
>one of the specific html parsers? If this should be done with an html
>parser module then which one and where is some easy going introductory
>documentation, please?
>
Do you want to change to another font? If you want to eliminate it altogether,
you will have to eliminate the </font> end tag also.
Removing both with a regex is unlikely to go wrong, unless someone has deleted part of the
markup so the tags no longer pair up, and then commented the debris out. But then they deserve more trash ;-)
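As a rough sketch of the "eliminate it altogether" case, one pattern can cover both the opening and the closing tag. The names FONT_TAG and strip_font_tags are made up for illustration, and the pattern assumes no attribute value contains a literal ">".

```python
import re

# Matches an opening <font ...> tag or a closing </font> tag, in any letter case.
# \b requires the tag name to end right after "font", so a tag such as
# <fontx> is left alone.
FONT_TAG = re.compile(r'</?font\b[^>]*>', re.IGNORECASE)

def strip_font_tags(html):
    """Return html with every opening and closing font tag removed."""
    return FONT_TAG.sub('', html)

sample = '<p><FONT color="#ffffff">Hello</FONT> <font size="2">world</font></p>'
print(strip_font_tags(sample))  # prints <p>Hello world</p>
```

The text between the tags is kept; only the tags themselves are dropped.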
Assuming you just want to change the opening font tag to another font tag, a regex like
the one in this session should do it.
Read the starting data (I saved the python.org page to disk)
>>> html = file('www_python_org.html').read()
Make the regex (it expects a space after the tag name, so a bare <font> is not matched)
>>> import re
>>> rxo = re.compile(r'<[Ff][Oo][Nn][Tt] [^>]*>')
Check the original
>>> rxo.findall(html)
['<font color="#ffffff">', '<font color="#ffffff">', '<font color="#ffffff">',
 '<font color="#ffffff">', '<font color="#ffffff">', '<font color="#ffffff">',
 '<font color="#ffffff">']
Make a new copy by substitution
>>> html2 = rxo.sub('<FONT color="#FF0000">', html)
Write it out
>>> file('www_python_red.html','w').write(html2)
Check what we did to the data (open both files in a browser and look at the effect on the left)
>>> rxo.findall(html2)
['<FONT color="#FF0000">', '<FONT color="#FF0000">', '<FONT color="#FF0000">',
 '<FONT color="#FF0000">', '<FONT color="#FF0000">', '<FONT color="#FF0000">',
 '<FONT color="#FF0000">']
>>>
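The session above could be wrapped up roughly as follows. This is only a sketch: OPEN_FONT and recolor_font_tags are invented names, re.IGNORECASE stands in for the [Ff][Oo][Nn][Tt] character classes, and \b is used instead of a required space so that a bare <font> is also caught. As before, it assumes no attribute value contains a literal ">".

```python
import re

# Opening font tags only, in any letter case, with or without attributes.
OPEN_FONT = re.compile(r'<font\b[^>]*>', re.IGNORECASE)

def recolor_font_tags(html, new_tag='<FONT color="#FF0000">'):
    """Replace every opening font tag with new_tag.

    Returns a (new_html, count) pair, where count is the number of tags replaced.
    """
    # A function is used as the replacement so that any backslashes in
    # new_tag are inserted literally instead of being read as escapes.
    return OPEN_FONT.subn(lambda match: new_tag, html)

sample = '<font color="#ffffff">a</font><FONT>b</FONT>'
new_html, count = recolor_font_tags(sample)
print(count)     # prints 2
print(new_html)  # prints <FONT color="#FF0000">a</font><FONT color="#FF0000">b</FONT>
```

The closing </font> tags are untouched, since the pattern needs "font" to come directly after the "<".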
HTH
Regards,
Bengt Richter