[Tutor] Re: extracting html source

pan@uchicago.edu pan@uchicago.edu
Tue Apr 15 07:14:02 2003


Hi all,

My apologies to the group for the redundant and unreadable emails
I sent in the previous Tutor digest, Vol 1 #2361. They were due to a webmail
client of a webhost company called icdsoft. They have checkboxes for

  [ ] don't use MIME format
  [ ] don't use HTML format

I thought I could send a pure ASCII email by unchecking both of them but
obviously I was wrong. I have no idea why the emails I sent out from their
webmail only cause problems when sending to this list. But I've given up 
using their webmail client. 

Below is a readable reply (I hope) to Wayne's question:


Hi Wayne,

Try this:

>>> import re

>>> htmlSource = 'aaa <span> bbb </span> ccc'

>>> re.findall('<span>.*</span>', htmlSource)  # <== [A]
['<span> bbb </span>']

>>> re.findall('<span>(.*)</span>', htmlSource)  # <== [B]
[' bbb ']


Note the difference between [A] and [B]
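
To spell the difference out: with no parentheses in the pattern, findall
returns the whole matched text, tags included; with one capturing group it
returns only what the parentheses matched. A small sketch of the same two
calls as a script (the names html_source, whole and inner are just mine,
for illustration):

```python
import re

html_source = 'aaa <span> bbb </span> ccc'

# [A] no group: findall returns the entire matched text, tags included
whole = re.findall('<span>.*</span>', html_source)

# [B] one capturing group: findall returns only the text inside the parentheses
inner = re.findall('<span>(.*)</span>', html_source)

print(whole)  # ['<span> bbb </span>']
print(inner)  # [' bbb ']
```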

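If there's a '\n' in between <span> and </span>, the patterns above find
nothing, because '.' does not match a newline by default. A character
class that includes whitespace works instead: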

>>> b = '''aaa <span> bb
...      bbb </span> ccc'''

>>> b
'aaa <span> bb\n     bbb </span> ccc'

>>> re.findall(r'<span>[\w\s]*</span>',b)
['<span> bb\n     bbb </span>']

>>> re.findall(r'<span>([\w\s]*)</span>',b)
[' bb\n     bbb ']


More:

>>> c = ''' aaa <span> bbb1 
... bbb2
... bbb3
... </span>'''

>>> c
' aaa <span> bbb1 \nbbb2\nbbb3\n</span>'

>>> re.findall(r'<span>([\w\s]*)</span>',c)
[' bbb1 \nbbb2\nbbb3\n']
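
One caveat: [\w\s]* only matches letters, digits, underscores and
whitespace, so it stops matching as soon as the span contains punctuation.
A possible alternative, sketched here on a made-up string (not Wayne's
actual page), is to pass re.DOTALL so that '.' also matches '\n', and to
use the non-greedy '.*?' so each span is matched on its own:

```python
import re

# Made-up sample: the first span contains punctuation and a newline
d = 'x <span> one,\ntwo! </span> y <span>three</span> z'

# The character class cannot cross ',' or '!', so the first span is missed
char_class = re.findall(r'<span>([\w\s]*)</span>', d)

# DOTALL lets '.' match '\n'; '.*?' stops at the nearest </span>
dotall = re.findall(r'<span>(.*?)</span>', d, re.DOTALL)

print(char_class)  # ['three']
print(dotall)      # [' one,\ntwo! ', 'three']
```

With a greedy '.*' and re.DOTALL the match would instead run from the first
<span> to the last </span>, so the '?' matters when there is more than one
span in the source.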


hth
pan