Regexp problem with `('

John Nagle nagle at animats.com
Thu Mar 22 12:37:28 EDT 2007


Johny wrote:
> I have  the following text
> 
> <title>Goods Item  146 (174459989)  - OurWebSite</title>
> 
> from which I need to extract
> `Goods Item  146 '
> 
> Can anyone help with regexp?
> Thank you for help
> L.

    In general, parsing HTML with regular expressions is a bad idea.
Usually, you use something like BeautifulSoup to parse the HTML,
extract the desired field, like the contents of "<title>", then
work on that.

    If you try to do this line by line with regular expressions,
it will fail when the line breaks aren't where you expect.  If
you try to do a whole document with regular expressions, other
material such as content in comments can be misrecognized.
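
    Both failure modes can be reproduced with nothing but the "re"
module. This is only a small sketch: the pattern TITLE_RE and the two
sample documents are made up for illustration and are not taken from
the original question.

```python
import re

# Hypothetical naive pattern: grab whatever sits between <title> tags.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)

# Case 1: the title is split across two lines, so scanning the
# document one line at a time never sees both tags together.
split_html = ("<html><head><title>Goods Item  146\n"
              "(174459989)  - OurWebSite</title></head></html>")
line_hits = []
for line in split_html.splitlines():
    match = TITLE_RE.search(line)
    if match:
        line_hits.append(match.group(1))

# Case 2: searching the whole document at once finds the title that
# is commented out, because the regex knows nothing about comments.
commented_html = ("<!-- <title>Old Title (1)</title> -->\n"
                  "<title>Goods Item  146 (174459989)  - OurWebSite</title>")
first_hit = TITLE_RE.search(commented_html).group(1)

print(line_hits)   # no line contains a complete <title>...</title>
print(first_hit)   # the commented-out title, not the real one
```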

     Try something like this:

	import re
	import BeautifulSoup

	# Regular expression to extract the group before "(NNNNN)"
	kreextractitem = re.compile(r'^(.*)\(\d+\)')
	pagetree = BeautifulSoup.BeautifulSoup(stringcontaininghtml)
	# Match the title tag whether it is spelled "title" or "TITLE"
	titleitem = pagetree.find({'title':True, 'TITLE':True})
	if titleitem :
	    # Join all text nodes found inside the TITLE element
	    titletext = " ".join(titleitem.findAll(text=True, recursive=True))
	    # Text of TITLE item is now in "titletext" as a string.
	    groups = kreextractitem.search(titletext)
	    if groups :
	        goodsitem = groups.group(1).strip()
	        # "goodsitem" now contains everything before "(NNNN)"


This approach will work no matter where the line breaks are in the original
HTML.
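
If BeautifulSoup is not available, roughly the same idea can be sketched
with the standard library's html.parser module. This is an assumption-laden
sketch, not a drop-in replacement: TitleCollector, ITEM_RE and
extract_goods_item are hypothetical names, and the re.DOTALL flag is an
addition here so that a line break inside the title does not stop the match.

```python
import re
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "TITLE" arrives as "title".
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


# Everything before "(NNNNN)"; DOTALL lets "." cross line breaks.
ITEM_RE = re.compile(r"^(.*)\(\d+\)", re.DOTALL)


def extract_goods_item(html):
    """Return the title text before "(NNNNN)", or None if absent."""
    collector = TitleCollector()
    collector.feed(html)
    collector.close()
    title_text = "".join(collector.parts)
    match = ITEM_RE.search(title_text)
    return match.group(1).strip() if match else None


sample = ("<html><head><title>Goods Item  146\n"
          "(174459989)  - OurWebSite</title></head></html>")
print(extract_goods_item(sample))
```

Because the parser reports comments separately from tags, a commented-out
"<title>" is not picked up here.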

				John Nagle


