Regular expression help needed

Harvey Thomas hst at empolis.co.uk
Tue Sep 17 05:53:23 EDT 2002


Torkil Grindstein wrote:
> Sent: 17 September 2002 10:04
> To: python-list at python.org
> Subject: Regular expression help needed
> 
> 
> Hi.
> 
> It's been a while since I played around with regexps, and I
> realize that I really need some help here.
> 
> Mission: I have a string containing a whole WML document. I want
> to extract data from the following occurrences:
> 
> <meta name="michael" content="owen"/>
> 
> That is, I want to extract the content of any occurrences of
> the meta "michael" key. There may be several occurrences in one
> document.
> <meta name="michael" content="owen"/>
> <meta name="michael" content="jackson"/>
> will give me the results "owen" and "jackson".
> 
> Of course, it is possible that the document has written the
> tags in uppercase (META Name, META NAME, etc), but I could
> lowercase the whole document string prior to searching. (The
> content is case insensitive for my purposes.)
> 
> I would really be glad if someone out there took this challenge..:)
> 
> Cheers,
> Torkil
> -- 

Superficially, this is easy:

>>> import re
>>> s = '<meta name="michael" content="owen"/><meta name="michael" content="jackson"/>'
>>> r = re.compile(r'<meta\s+name="michael"\s+content="([^"]*)"')
>>> r.findall(s)
['owen', 'jackson']

but I can't really recommend it, as it makes too many implicit assumptions:

1) There are no comments or processing instructions that could contain similar data
2) The attributes always come in the order specified i.e. name first, content second, any others following
3) The attributes are always double quoted and there are no spaces surrounding the =
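
To make the three assumptions concrete, here is a small sketch of inputs that break the pattern above. The sample strings are invented for illustration and are not taken from the original document:

```python
import re

# Same pattern as in the session above.
pattern = re.compile(r'<meta\s+name="michael"\s+content="([^"]*)"')

# Hypothetical inputs, one per assumption.
in_comment = '<!-- <meta name="michael" content="old"/> -->'
reordered = '<meta content="owen" name="michael"/>'
single_quoted = "<meta name='michael' content='owen'/>"
spaced = '<meta name = "michael" content = "owen"/>'

print(pattern.findall(in_comment))     # ['old']: matched inside a comment
print(pattern.findall(reordered))      # []: attribute order differs
print(pattern.findall(single_quoted))  # []: single quotes are not handled
print(pattern.findall(spaced))         # []: spaces around = are not handled
```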

All of these can be worked around with REs, but as the data is (presumably well-formed) XML, an easier solution would be to parse it with SAX and simply examine the attributes of each element named "meta" or "META".
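
A minimal sketch of the SAX approach might look like the following. It assumes Python 3 and the standard library's xml.sax module; the names MetaContentHandler and meta_contents are made up for this example, and the document is assumed to be well-formed.

```python
import xml.sax


class MetaContentHandler(xml.sax.ContentHandler):
    """Collect the content attribute of matching meta elements."""

    def __init__(self, wanted_name):
        super().__init__()
        self.wanted_name = wanted_name.lower()
        self.contents = []

    def startElement(self, tag, attrs):
        # Accept meta, META, Meta, and so on.
        if tag.lower() != "meta":
            return
        # Lowercase the attribute names so NAME/Name/name all match.
        lowered = {key.lower(): value for key, value in attrs.items()}
        if lowered.get("name", "").lower() == self.wanted_name and "content" in lowered:
            self.contents.append(lowered["content"])


def meta_contents(document, wanted_name="michael"):
    """Return the content values of every meta element named wanted_name."""
    handler = MetaContentHandler(wanted_name)
    # parseString is given bytes here; the string is assumed to be UTF-8 encodable.
    xml.sax.parseString(document.encode("utf-8"), handler)
    return handler.contents


# Hypothetical sample covering a comment, reordered attributes,
# uppercase names and single quotes.
doc = (
    '<wml><head>'
    '<!-- <meta name="michael" content="comment"/> -->'
    '<meta content="owen" name="michael"/>'
    "<META NAME='michael' CONTENT='jackson'/>"
    '<meta name="other" content="x"/>'
    '</head></wml>'
)
print(meta_contents(doc))  # ['owen', 'jackson']
```

Because the parser does the tokenising, comments, attribute order, quoting style and spacing around = stop being a concern. The price is that the input really does have to be well-formed, otherwise the parser raises an exception.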

Harvey
