Regexp problem with `('
John Nagle
nagle at animats.com
Thu Mar 22 12:37:28 EDT 2007
Johny wrote:
> I have the following text
>
> <title>Goods Item 146 (174459989) - OurWebSite</title>
>
> from which I need to extract
> `Goods Item 146 '
>
> Can anyone help with regexp?
> Thank you for help
> L.
In general, parsing HTML with regular expressions is a bad idea.
Usually, you use something like BeautifulSoup to parse the HTML,
extract the desired element, such as the contents of "<title>", then
work on that text.

If you try to do this line by line with regular expressions,
it will fail when the line breaks aren't where you expect. If
you try to match against a whole document with regular expressions,
other material, such as markup inside comments, can be misrecognized.
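
To make the line-break problem concrete, here is a minimal sketch of a line-by-line approach. The pattern and the function name are hypothetical, made up for this illustration; the two inputs differ only in where the line break falls inside the title.

```python
import re

# Hypothetical line-oriented pattern: it assumes the opening <title> tag
# and the "(NNNN)" part always sit on the same line.
LINE_PATTERN = re.compile(r'<title>(.*)\(\d+\)')

def extract_line_by_line(html):
    """Return the text before "(NNNN)" from the first matching line, else None."""
    for line in html.splitlines():
        match = LINE_PATTERN.search(line)
        if match:
            return match.group(1).strip()
    return None

one_line = "<title>Goods Item 146 (174459989) - OurWebSite</title>"
wrapped = "<title>Goods Item 146\n(174459989) - OurWebSite</title>"

print(extract_line_by_line(one_line))   # the title is on one line, so this matches
print(extract_line_by_line(wrapped))    # same title, wrapped: no single line matches
```

The second call finds nothing even though both strings describe the same title, which is the failure described above.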
Try something like this:
import re
import BeautifulSoup

# Regular expression to extract group before "(NNNNN)"
kreextractitem = re.compile(r'^(.*)\(\d+\)')

# "stringcontaininghtml" holds the HTML of the page as a string.
pagetree = BeautifulSoup.BeautifulSoup(stringcontaininghtml)
titleitem = pagetree.find({'title':True, 'TITLE':True})
if titleitem :
    titletext = " ".join(titleitem.findAll(text=True, recursive=True))
    # Text of TITLE item is now in "titletext" as a string.
    groups = kreextractitem.search(titletext)
    if groups :
        goodsitem = groups.group(1).strip()
        # "goodsitem" now contains everything before "(NNNN)"
This approach will work no matter where the line breaks are in the original
HTML.
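
If BeautifulSoup is not available, the same parse-first idea can be sketched with only the standard library. This is an illustration under assumptions, not the code above: it swaps in Python 3's html.parser for BeautifulSoup, and the names TitleCollector and extract_goods_item are invented for the example. It also collapses runs of whitespace in the title to single spaces before matching, since "." does not match a newline by default.

```python
import re
from html.parser import HTMLParser

# Same pattern as in the message: capture everything before "(NNNN)".
ITEM_PATTERN = re.compile(r'^(.*)\(\d+\)')

class TitleCollector(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> also arrives as "title".
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        # Comments go to handle_comment instead, so they never land here.
        if self.in_title:
            self.parts.append(data)

def extract_goods_item(html):
    """Return the title text before "(NNNN)", or None if there is no match."""
    parser = TitleCollector()
    parser.feed(html)
    parser.close()
    # Collapse all whitespace, including line breaks, to single spaces.
    title_text = " ".join("".join(parser.parts).split())
    match = ITEM_PATTERN.search(title_text)
    return match.group(1).strip() if match else None

sample = ("<html><head><!-- <title>Old Item (1)</title> -->\n"
          "<title>Goods Item 146\n(174459989) - OurWebSite</title></head></html>")
print(extract_goods_item(sample))
```

The sample input has both a commented-out title and a line break inside the real title, the two cases that trip up a regex-only approach.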
John Nagle