python re - a not needed

Peter Otten __peter__ at web.de
Thu Dec 16 04:21:22 EST 2004


kepes.krisztian wrote:

> Hi !
> 
> I want to extract information from an HTML file, but I need to match all
> characters except <.
> That means all characters above chr(31), including those above chr(128)
> (Hungarian accented letters).
> The .* is very greedy; it eats the < characters too.
> 
> If I could use a "not", I could simply define a regexp like this:
> [not<]*</a>
> 
> That would get everything in the href.
> 
> I wrote this program, but I think it is too complex:
> 
> import re
> 
> l=[]
> for i in range(33,65):
>     if i<>ord('<') and i<>ord('>'):
>        l.append('\\'+chr(i))
> s='|'.join(l)
> all='\w|\s|\%s-\%s|%s'%(chr(128),chr(255),s)
> sre='<Subj>([%s]{1,1024})</d>'%all
> #sre='<Subj>([?!\\<]{1,1024})</d>'
> s='<Subj>xmvccv ÁÁÁ sdfkdsfj eirfie</d><A></d>'
> 
> 
> print sre
> print s
> cp=re.compile(sre)
> m=cp.search(s)
> print m.groups()
> 
> Does Python's regexp syntax have an exception or "not" function? How do I use it?
> 
> Thanks for the help,
>  kk

You could try these regexps or variants thereof:

"<Subj>([^<]*)"

A '^' at the start of a character set negates it: the set then matches any
single character except those listed after the '^'.
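
As a minimal sketch, here is the negated character set applied to the sample string from the question. It assumes Python 3; the variable names m and subject are only illustrative.

```python
import re

# Sample string taken from the original question.
s = '<Subj>xmvccv ÁÁÁ sdfkdsfj eirfie</d><A></d>'

# [^<]* matches a run of zero or more characters that are not '<',
# so the group stops right before the first '<' after <Subj>.
m = re.search(r'<Subj>([^<]*)', s)
subject = m.group(1)
print(subject)  # xmvccv ÁÁÁ sdfkdsfj eirfie
```

No explicit list of allowed characters is needed: accented letters, punctuation and whitespace are all accepted because only '<' is excluded.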

"<Subj>(.*?)<"

The '?' makes the preceding '*' non-greedy, i.e. it matches as few characters
as possible, so the following '<' matches the first '<' character encountered
in the string being searched.
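
The difference between the greedy and non-greedy forms can be seen side by side in the following sketch (again assuming Python 3; the names greedy, lazy and s2 are hypothetical and exist only for this illustration).

```python
import re

# Sample string taken from the original question.
s = '<Subj>xmvccv ÁÁÁ sdfkdsfj eirfie</d><A></d>'

# Greedy: .* grabs as much as it can, so the trailing '<' in the
# pattern ends up matching the LAST '<' in the string.
greedy = re.search(r'<Subj>(.*)<', s).group(1)

# Non-greedy: .*? grabs as little as it can, so the trailing '<'
# matches the FIRST '<' after <Subj>.
lazy = re.search(r'<Subj>(.*?)<', s).group(1)

print(greedy)  # xmvccv ÁÁÁ sdfkdsfj eirfie</d><A>
print(lazy)    # xmvccv ÁÁÁ sdfkdsfj eirfie

# One difference between the two suggested patterns: by default '.'
# does not match a newline, while [^<] does.
s2 = '<Subj>line one\nline two</d>'
dot_match = re.search(r'<Subj>(.*?)<', s2)        # no match
set_match = re.search(r'<Subj>([^<]*)', s2)       # spans the newline
```

If the text between the tags can span several lines, either use the [^<]* form or pass re.DOTALL to the '.'-based pattern.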

Peter



