How can I exclude a word by using re?

Jordan Rastrick jrastrick at student.usyd.edu.au
Mon Aug 15 21:23:27 EDT 2005


could ildg said:

> I want to use re because I want to extract something from an HTML page. It
> will be very complicated without using re. But while using re, I
> found that I must exclude a whole word, "</td>"; certainly, there are
> many, many "</td>" in this HTML.

Actually, for properly processing HTML, you shouldn't really be using
regular expressions, precisely because the problem is complicated -
regular expressions are too simple and can't properly model a language
like HTML, which is generated by a context-free grammar (tags can nest
to any depth, and a regular expression can't keep count of them).
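
To make the nesting point concrete, here is a small sketch. The markup is
made up for illustration (it is not from the page in question): one cell
holds a nested table and is followed by a sibling cell. Neither a lazy nor
a greedy pattern recovers the contents of the outer cell.

```python
import re

# Hypothetical markup: the first cell contains a nested table.
html = "<td>a <table><tr><td>inner</td></tr></table></td><td>b</td>"

# What we would like to get back for the first (outer) cell.
wanted = "a <table><tr><td>inner</td></tr></table>"

# Lazy: stops at the first "</td>", which belongs to the inner cell.
lazy = re.search(r"<td>(.*?)</td>", html).group(1)

# Greedy: runs on to the last "</td>", swallowing the sibling cell too.
greedy = re.search(r"<td>(.*)</td>", html).group(1)

print(repr(lazy))    # cut short inside the nested table
print(repr(greedy))  # overshoots into the next cell
print(lazy == wanted, greedy == wanted)
```

Both comparisons come out False: picking the right closing tag means
counting how many cells are currently open, which is exactly the part a
plain regular expression can't do.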

If that's only meaningless technical mumbo-jumbo to you, never mind -
the important point is you shouldn't really use an re. Trust me :)

What you want for a job like this is an HTML parser. There's one in the
standard library; if it doesn't suit, there are plenty of third-party
ones. I like Beautiful Soup:

http://www.crummy.com/software/BeautifulSoup/
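
As a rough sketch of the parser route, here is the standard-library parser
pulling out the .mp3 links and their text. This assumes a current Python 3,
where the module is html.parser. The class name Mp3LinkParser and the
sample markup are invented for illustration; the real page will differ.

```python
from html.parser import HTMLParser


class Mp3LinkParser(HTMLParser):
    """Collect (url, link text) pairs for every <a> whose href ends in .mp3."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (url, text) pairs
        self._url = None   # href of the <a> we are currently inside, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.lower().endswith(".mp3"):
                self._url = href
                self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside a matching link.
        if self._url is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._url is not None:
            self.links.append((self._url, "".join(self._text).strip()))
            self._url = None


# Hypothetical sample, loosely shaped like the rows described in the question.
sample = """
<table>
<tr><td valign=top>1</td><td>
<a href="music/first.mp3" target=_blank>First song</a></td></tr>
<tr><td valign=top>2</td><td>
<a href="music/second.MP3" target=_blank>Second <b>song</b></a></td></tr>
</table>
"""

parser = Mp3LinkParser()
parser.feed(sample)
parser.close()
print(parser.links)
```

The parser reports tags one at a time, so there is no need to work out
which "</td>" is the right one; markup nested inside the link (the <b>
above) is handled without any extra effort.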

If you insist on using an re, well, I'm sure someone on this group will
figure out a solution to your issue that's as good as you're going to
get...
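
For what it's worth, two standard re features come close to the
"NOT(WORD)" asked for in the question: a non-greedy quantifier (".+?") and
a negative lookahead ("(?!...)"). The sketch below reuses the front part
of the pattern from the question, with the "( )" group simplified to a
plain space and the Python 2 "ur" prefix dropped; the sample markup is an
assumption, since the real page was not posted.

```python
import re

# Hypothetical fragment shaped like the rows the pattern in the question expects.
html = (
    '<td valign=top>1</td><td> <a href="a.mp3" target=_blank>First</a></td>'
    '<td valign=top>2</td><td> <a href="b.mp3" target=_blank>Second</a></td>'
)

# Shared front part of the pattern, adapted from the question.
head = (r'valign=top>(?P<number>\d{1,2})</td><td[^>]*>\s{0,2}'
        r'<a href="(?P<url>[^<>]+\.mp3)" target=_blank>')

# Option 1: non-greedy ".+?" takes as little as possible before "</td>".
lazy = re.compile(head + r'(?P<name>.+?)</td>', re.IGNORECASE | re.DOTALL)

# Option 2: negative lookahead; each character consumed must not start "</td>".
strict = re.compile(head + r'(?P<name>(?:(?!</td>).)+)</td>',
                    re.IGNORECASE | re.DOTALL)

lazy_rows = [m.group("number", "url", "name") for m in lazy.finditer(html)]
strict_rows = [m.group("number", "url", "name") for m in strict.finditer(html)]
print(lazy_rows)
print(strict_rows)
```

On this sample both patterns find the same two rows, and each name stops
at the first "</td>" (it still includes the trailing "</a>", as the
pattern in the question would). The lookahead form is the stricter of the
two: it can never run past a "</td>", whereas ".+?" may do so if more
pattern follows and fails to match.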


>
> My re is as below:
> _____________________________________________
> r=re.compile(ur'valign=top>(?P<number>\d{1,2})</td><td[^>]*>\s{0,2}'
> ur'<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>'
> ur'(?P<name>.+)</td>',re.UNICODE|re.IGNORECASE)
> _____________________________________________
> There should be over 30 matches in the HTML. But I find nothing with
> re.finditer(html) because the last line of my re is wrong. I can't use
> "(?P<name>.+)</td>" because there are many, many "</td>" in the HTML
> and I just want the ".+" to match what comes before the first "</td>".
> So I think if there is some way I can exclude a word, this will be
> done. Assuming there is a "NOT(WORD)" that can do it, I just need to write the
> last line of the re as "(?P<name>(NOT(</td>))+)</td>".
> But I still have no idea after thinking and trying for a very long time.
>
> In other words, I want the "</td>" of "(?P<name>.+)</td>" to be
> exactly the first "</td>" in this match. And there is more than one
> match in this html, so this must be done by using re.
>
> And I can't use any of your ideas because what I want to deal with is a
> very complicated HTML page, not just a single line of words.
>
> I could copy part of the HTML here, but it's rather too lengthy.
> On 8/15/05, John Machin <sjmachin at lexicon.net> wrote:
> > could ildg wrote:
> > > In re, the character "^" can exclude a single character, but I want
> > > to exclude a whole word now. For example, I have a string "hi, how are
> > > you. hello", and I want to extract the whole part before the word "hello".
> > > I can't use ".*[^hello]" because "^" only excludes the single chars "h",
> > > "e", "l" or "o". Will somebody tell me how to do it? Thanks.
> >
> > (1) Why must you use re? It's often a good idea to use string methods
> > where they can do the job you want.
> > (2) What do you want to have happen if "hello" is not in the string?
> >
> > Example:
> >
> > C:\junk>type upto.py
> > def upto(strg, what):
> >      k = strg.find(what)
> >      if k > -1:
> >          return strg[:k]
> >      return None # or raise an exception
> >
> > helo = ("hi, how are you? HELLO I'm fine, thank you hello hello hello. "
> >         "that's it")
> >
> > print(repr(upto(helo, "HELLO")))
> > print(repr(upto(helo, "hello")))
> > print(repr(upto(helo, "hi")))
> > print(repr(upto(helo, "goodbye")))
> > print(repr(upto("", "goodbye")))
> > print(repr(upto("", "")))
> >
> > C:\junk>upto.py
> > 'hi, how are you? '
> > "hi, how are you? HELLO I'm fine, thank you "
> > ''
> > None
> > None
> > ''
> >
> > HTH,
> > John



