[Tutor] problem in replacing regex

Tue Apr 7 11:29:33 CEST 2009

[forwarded to list]

On Tue, 7 Apr 2009 12:23:33 +0530,
Kumar <hihiren1 at gmail.com> wrote:

> Hi denis,
> 
> Thanks a lot for the reply.
> 
> Actually, in our web application, when we display the data we parse it
> and make hyperlinks (through <a>) wherever possible. So if there is
> any url like (http://www.hello.com), then while displaying the data we
> convert it to <a href="http://www.hello.com">http://www.hello.com</a>
> and if we find any account number, we link it to our default
> account page.
> E.g. the text is "I am using 12345-45". Then while viewing we replace it
> with the following:
> I am using <a href="http://hello.com/accid/12345-45">12345-45</a>
> 
> I hope the above example clears up your question.
> 
> Now my problem: first I convert all existing links to <a> tags. This
> works perfectly fine.
> E.g. if the value is "I am using this url http://hello.com/accid/12345-45"
> then the above algorithm works fine and changes it to the
> following:
> I am using this url <a href="http://hello.com/accid/12345-45">
> http://hello.com/accid/12345-45</a>
> After that I replace all accids to convert them into links, so the above
> value becomes the following:
> I am using this url <a href="http://hello.com/accid/<a href="
> http://hello.com/accid/12345-45">12345-45</a>">http://hello.com/accid/<a
> href="http://hello.com/accid/12345-45">12345-45</a></a>
> 
> and the complete link is messed up.
> So while converting the accids into links I want to exclude the values
> which start with http (e.g. http://hello.com/accid/12345-45).
> 
> I hope it is clearer now.
> One solution I have is to exclude the accids preceded by /, as in
> http://hello.com/accid/12345-45, but that is not a perfect solution.
> 
> Any pointer would be very helpful.
> 
> Thanks,
> Kumar

OK, now I understand. You need to convert both URLs and account numbers to HTML links. If you run the two conversions one after the other, then whichever order you choose, the account numbers that appear inside URLs will be encoded twice.
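
To see the double-encoding concretely, here is a minimal sketch of the two-pass approach. The two patterns and the helper names (link_urls, link_accounts) are assumptions made up for the illustration, not the ones used in the actual application:

```python
import re

# Hypothetical formats, for illustration only
URL = re.compile(r"http://\S+")
ACCOUNT = re.compile(r"\d{5}-\d{2}")

def link_urls(text):
    # Wrap every URL in an <a> tag pointing at itself
    return URL.sub(lambda m: '<a href="%s">%s</a>' % (m.group(), m.group()), text)

def link_accounts(text):
    # Wrap every account number in a link to the (assumed) account page
    return ACCOUNT.sub(
        lambda m: '<a href="http://hello.com/accid/%s">%s</a>' % (m.group(), m.group()),
        text)

source = "I am using this url http://hello.com/accid/12345-45"
# The second pass cannot tell that the numbers now sit inside a link
two_pass = link_accounts(link_urls(source))
print(two_pass)
```

The second pass rewrites the number both inside the href attribute and inside the link text, so the single link ends up containing two nested ones.
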
My solution (maybe not the best) would be to do both conversions in one pass, using a pattern that matches both and a smarter link-writer function that can distinguish a URL from a number. Pseudocode:

toLinkPattern = re.compile("(urlFormat)|(accountNumFormat)")
def toLink(match):
   string = match.group()
   if isAccountNum(string):
      return accountNumToLink(string)
   return urlToLink(string)
result = toLinkPattern.sub(toLink, source)
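
The pseudocode above could be fleshed out roughly as follows. This is only a sketch: URL_FORMAT, ACCOUNT_FORMAT and ACCOUNT_PAGE are guesses based on the examples in the question, and the real formats are surely stricter.

```python
import re

# Assumed formats; adjust to the real URL and account-number rules
URL_FORMAT = r'http://[^\s<>"]+'
ACCOUNT_FORMAT = r"\b\d{5}-\d{2}\b"
ACCOUNT_PAGE = "http://hello.com/accid/"  # assumed base address

to_link_pattern = re.compile("(%s)|(%s)" % (URL_FORMAT, ACCOUNT_FORMAT))
account_pattern = re.compile(ACCOUNT_FORMAT)

def is_account_num(string):
    # True only if the whole string is an account number
    return account_pattern.fullmatch(string) is not None

def account_num_to_link(string):
    return '<a href="%s%s">%s</a>' % (ACCOUNT_PAGE, string, string)

def url_to_link(string):
    return '<a href="%s">%s</a>' % (string, string)

def to_link(match):
    string = match.group()
    if is_account_num(string):
        return account_num_to_link(string)
    return url_to_link(string)

source = "I am using 12345-45, see http://hello.com/accid/12345-45"
result = to_link_pattern.sub(to_link, source)
print(result)
```

This works because sub() scans the text once from left to right and never rescans what it has already replaced: a number inside a URL is consumed as part of the URL match, so it is never seen as a separate account number.
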

To make things easier, note that using groups() instead of group() also tells you what kind of thing was matched, from its position in the match tuple. E.g.:

import re
pat = re.compile(r"([1-9])|([a-z])")
print(pat.findall("a1b2c"))
def replace(match):
    print(match.group(), match.groups())
    # groups() has one slot per group; the alternative that did not match is None
    digit, letter = match.groups()
    print("digit:%s  letter:%s" % (digit, letter))
    if digit is not None:
        return "0"
    return '@'
print(pat.sub(replace, "a1b2c"))
==>
[('', 'a'), ('1', ''), ('', 'b'), ('2', ''), ('', 'c')]
a (None, 'a')
digit:None  letter:a
1 ('1', None)
digit:1  letter:None
b (None, 'b')
digit:None  letter:b
2 ('2', None)
digit:2  letter:None
c (None, 'c')
digit:None  letter:c
@0@0@

You can also use named groups:

pat = re.compile(r"(?P<digit>\d)|(?P<letter>[a-z])")
def replace(match):
    digit, letter = match.group('digit'), match.group('letter')
    print("digit:%s  letter:%s" % (digit, letter))
    # or, more directly: if match.group('digit') is not None:
    if digit is not None:
        return "0"
    return '@'
print(pat.sub(replace, "a1b2c"))
==>
digit:None  letter:a
digit:1  letter:None
digit:None  letter:b
digit:2  letter:None
digit:None  letter:c
@0@0@
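
With named groups there is one more shortcut worth knowing: the match object's lastgroup attribute holds the name of the last group that matched. When the pattern is a flat list of named alternatives, as here, that is simply the alternative that matched. A small sketch (the replacements dict is just an illustration):

```python
import re

pat = re.compile(r"(?P<digit>\d)|(?P<letter>[a-z])")

# One replacement per named alternative
replacements = {"digit": "0", "letter": "@"}

def replace(match):
    # lastgroup names the group that matched, so no None tests are needed
    return replacements[match.lastgroup]

result = pat.sub(replace, "a1b2c")
print(result)
```

The same dispatch would work for the url/account case by naming the two groups and mapping each name to its link-writer function.
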

Denis
------
la vita e estrany