problem with regex, how to exclude more than one character

Charles Yan tecspring at gmail.com
Fri Nov 7 02:23:56 EST 2008


Thanks a lot for the quick reply, Chris!
Actually, I tried BeautifulSoup and it's great.
But I'm not very familiar with it, and it needs more code to parse the HTML
and get the right text.
I think a regexp would be more convenient if there is a way to filter out the
list in just one line :)
I got this far with regexps, but I'm stuck here...
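
One possible one-line approach, offered only as a sketch: use alternation so that a single findall picks up either a title="..." attribute or the plain text directly inside a <td>. It assumes the row sits on one line (the line breaks in the quoted sample look like email wrapping), and the names "pattern" and "fields" are just illustrative. It is tied to this exact markup and will break if the HTML changes shape, which is the usual argument for a real parser.

```python
import re

# Sample row from the quoted message, joined onto a single line (assumption).
content = (
    '<tr align="center" valign="middle" class="CellCss">'
    '<td valign="middle">LA</td><td valign="middle">11/10/2008</td>'
    '<td valign="middle">1340/1430</td><td valign="middle">PF1/5</td>'
    '<td valign="middle"><span title="Understanding the stock market" '
    'class="MouseCursor">Understand....</span></td>'
    '<td title="Charisma" valign="middle">Charisma</td>'
    '<td valign="middle">Booked</td><td valign="middle">'
)

# Either a title="..." attribute, or text sitting directly inside a <td>.
# Each match fills exactly one of the two groups; the other is ''.
pattern = r'title="([^"]*)"|<td[^>]*>([^<]+)</td>'
fields = [title or text for title, text in re.findall(pattern, content)]
print(fields)
```

The <td> that wraps the <span> has no text of its own, so the second alternative fails there and the scan moves on to the title attribute. The "Charisma" cell is matched as a whole <td>, so its title is not counted twice.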


On 11/7/08, Chris Rebert <clp at rebertia.com> wrote:
>
> On Thu, Nov 6, 2008 at 11:06 PM,  <tecspring at gmail.com> wrote:
> > I never know how to express "exclude an entire word" with a
> > regexp, and while using Python I ran into this problem again...
> >
> > For example, if I want to match "string" in "test a string",
> > re.findall(r"[^a]* (\w+)","test a string") will work, but what if
> > the text has "an" instead of "a" ("test an string")? [^an] will fail
> > because it stops at the first character, "a".
> >
> > I guess people don't usually filter words this way?
> > Here is the real problem I encountered:
> > I want to extract both the text in each "<td>" block and the
> > "<span>"'s title attribute.
>
> Is there any particularly good reason why you're using regexps for
> this rather than, say, an actual (X)HTML parser?
>
> Cheers,
> Chris
> --
> Follow the path of the Iguana...
> http://rebertia.com
>
> > ###################### code #############################
> > import re
> > content='''<tr align="center" valign="middle" class="CellCss"><td
> > valign="middle">LA</td><td valign="middle">11/10/2008</td><td
> > valign="middle">1340/1430</td><td valign="middle">PF1/5</td><td
> > valign="middle"><span title="Understanding the stock market"
> > class="MouseCursor">Understand....</span></td><td title="Charisma"
> > valign="middle">Charisma</td><td valign="middle">Booked</td><td
> > valign="middle">'''
> >
> > re.findall(r'''<td valign="middle">([^<]+)</td><td
> > valign="middle">([^<]+)</td><td valign="middle">([^<]+)</td><td
> > valign="middle">([^<]+)</td><td valign="middle"><span
> > title="([^"]*)"''',content)
> >
> > #################### code end ############################
> > As you can see above,
> > I get the result "LA,11/10/2008,1340/1430,PF1/5,Understanding
> > the stock market".
> > There are two "title" attributes (one on the "<span>" and one on the
> > following "<td>"), but with the regexp I can only get the first one.
> > For the second, which should be "Charisma", I need some kind of
> > [^</td>]* to match "class="MouseCursor">Understand....</span></td>",
> > and then I can continue on to match the second title.
> >
> > Maybe I didn't describe this clearly; if so, feel free to tell me :)
> > Thanks for any further replies!
> > --
> > http://mail.python.org/mailman/listinfo/python-list
> >
>
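
On the "exclude a whole word" part of the question, here is a minimal sketch of two common regexp idioms. A character class such as [^an] or [^</td>] only excludes single characters, so it cannot stand for a whole word or tag. A negative lookahead checked before each character can, and a non-greedy match up to the literal delimiter is often simpler. The sample strings and variable names below are made up for illustration.

```python
import re

text = "test an string"

# Anchor on the whole word "an" instead of using a character class.
word_after = re.findall(r"\ban (\w+)", text)
print(word_after)

cell = 'class="MouseCursor">Understand....</span></td><td title="Charisma"'

# Consume characters one at a time, but only while "</td>" does not start here.
up_to_td = re.match(r"(?:(?!</td>).)*", cell).group(0)
print(up_to_td)

# Non-greedy alternative: take as little as possible before the literal "</td>".
lazy = re.match(r"(.*?)</td>", cell).group(1)
print(lazy)
```

Both of the last two patterns stop right before the first "</td>". The lookahead form is mainly useful when the delimiter should not be consumed or when it sits inside a larger pattern.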