Why it does NOT work on Linux ?

Steve Holden sholden at holdenweb.com
Mon Feb 4 09:48:36 EST 2002


"Markus Schönhaber" <mks99 at t-online.de> wrote in message
news:mailman.1012677967.8260.python-list at python.org...
> > I have the following part of program that finds ItemID numbers.
> > Here, for example, are two
> > 146759 and 146700 .
> > This program works well under windows but on Linux it does not
> > find any number. Can you please help?
> > Thanks.
> > Ladislav
> >
> > ####################
> > import re
> > Text="""<tr BGCOLOR="#FFFFFF">
> >                       <td valign="top" align="left"><a
> > href="lead.asp?ItemID=146759">[CN] Oak, Foiled & Antique
> > Furniture</a></td>
> >                       <td valign="top" align="center">18/12/2001</td>
> >                     </tr><tr BGCOLOR="#FFFFFF">
> >                     <td valign="top" align="left"><a
> > href="lead.asp?ItemID=146700">[CN] Oak, Foiled & Antique
> > Furniture</a></td>
> >                       <td valign="top" align="center">18/12/2001</td>
> >                     </tr>"""
> >
> > IDs=re.compile('.*<a href="lead.asp\?ItemID=(\d{5,10}).*')
> > Results=re.findall(IDs,Text)
> > print Results
>
> The interesting thing is, that it works at all for you. It definitely
> doesn't on my WinXP machine.
>
> 1.) Here
> > IDs=re.compile('.*<a href="lead.asp\?ItemID=(\d{5,10}).*')
> > Results=re.findall(IDs,Text)
>
> you call re.findall with a regular expression object as a first parameter
> which should be a string. What you want to do is
>
> Results = IDs.findall(Text)
>
> i. e. call the appropriate method on the re object you created.
>
While your suggestion is correct, it is incorrect to say that Ladislav's
method is wrong. The re module's functions accept a compiled pattern object
as well as a pattern string, much like the method/function pairing between
strings and the string module, so it is just as possible to write

    Results=re.findall(IDs,Text)

as

    Results = IDs.findall(Text)
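
To make the equivalence concrete, here is a minimal sketch in current Python 3
syntax. The sample text and the shortened pattern are made up for
illustration and are not Ladislav's originals.

```python
import re

# Hypothetical sample text with two item IDs, standing in for the real page.
text = 'href="lead.asp?ItemID=146759" href="lead.asp?ItemID=146700"'
ids = re.compile(r'ItemID=(\d{5,10})')

# Module-level function: the first argument may be a compiled pattern.
via_function = re.findall(ids, text)

# Method on the compiled pattern object.
via_method = ids.findall(text)

print(via_function)
print(via_method)
```

Both calls return the same list; which spelling to use is a matter of taste.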

>
> 2.) There are two whitespaces (a space and a newline - the latter may be
> inserted by your or my mail agent) between "<a" and "href...". So you should
> replace your re with something like this:
>
> IDs = re.compile('<a\s*href="lead.asp\?ItemID=(\d{5,10})', re.MULTILINE)
>
> Since you are using findall, the enclosing ".*" expressions are superfluous.
>
>
> BTW: Be careful regarding backslashes in REs since the string gets
> interpreted two times.
>
This is why most people recommend using raw strings (the r"..." form) - in a
raw string Python leaves almost all backslashes alone, so they reach the re
module as written and the expressions are easier to read.
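
A small sketch of the difference, in current Python 3 syntax. The first part
uses "\b", which a normal string literal turns into a backspace character
before re ever sees it. The second part is one possible raw-string spelling
of the ItemID pattern; the sample text is a shortened stand-in for the
original, and the use of \s+ and the escaped dot are my own choices.

```python
import re

# In a normal literal '\b' is a single backspace character; in a raw literal
# it stays as backslash + 'b', which re interprets as a word boundary.
plain = '\bcat\b'
raw = r'\bcat\b'

print(len(plain), len(raw))

plain_hits = re.findall(plain, 'a cat sat')   # looks for backspace characters
raw_hits = re.findall(raw, 'a cat sat')       # looks for the word "cat"
print(plain_hits, raw_hits)

# Shortened stand-in for the original HTML, with a newline between
# "<a" and "href" as in the posted text.
text = '<td><a\nhref="lead.asp?ItemID=146759">[CN] Oak</a></td>'

# \s+ matches any run of whitespace, including the newline.
item_ids = re.compile(r'<a\s+href="lead\.asp\?ItemID=(\d{5,10})')
found = item_ids.findall(text)
print(found)
```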

Ultimately, unless you *know* how the HTML is generated and can guarantee
its format, processing HTML (or XML, or *ML ;-) with re's is likely to get
very complex very quickly, and some sort of parsing solution (such as
htmllib/sgmllib) will be more fruitful in the long term.
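
As a rough sketch of the parsing approach: htmllib and sgmllib were removed
in Python 3, so this example swaps in html.parser.HTMLParser from the current
standard library instead. The class name ItemIDCollector is made up for
illustration, and the HTML is a trimmed copy of the posted sample.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class ItemIDCollector(HTMLParser):
    """Collect ItemID query values from the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.item_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                # Split the query string instead of pattern-matching the URL.
                query = parse_qs(urlparse(value).query)
                self.item_ids.extend(query.get('ItemID', []))


html = """<tr BGCOLOR="#FFFFFF">
  <td valign="top" align="left"><a
href="lead.asp?ItemID=146759">[CN] Oak, Foiled & Antique Furniture</a></td>
</tr><tr BGCOLOR="#FFFFFF">
  <td valign="top" align="left"><a
href="lead.asp?ItemID=146700">[CN] Oak, Foiled & Antique Furniture</a></td>
</tr>"""

collector = ItemIDCollector()
collector.feed(html)
collector.close()
print(collector.item_ids)
```

The parser does not care how the whitespace between "<a" and "href" is laid
out, which is exactly the detail that tripped up the regular expression.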

regards
 Steve
--
Consulting, training, speaking: http://www.holdenweb.com/
Python Web Programming: http://pydish.holdenweb.com/pwp/







