Why does Python not return first line?

Mon Mar 16 00:10:36 EDT 2009

On Mar 16, 11:25 am, Gilles Ganault <nos... at nospam.com> wrote:
> On Mon, 16 Mar 2009 01:14:00 +0100, Gilles Ganault <nos... at nospam.com>
> wrote:
>
> >I'm stuck at why Python doesn't return the first line in this simple
> >regex
>
> Found it: Python does extract the token, but displaying it requires
> removing hidden chars:
>
> =====
> response = ("<span>Address :</span></td>\r\t\t<td>\r\t\t\t3 Abbey Road, "
>             "St Johns Wood <br />\r\t\t\tLondon, NW8 9AY\t\t</td>")
>
> re_address = re.compile('<span>Address :</span></td>.+?<td>(.+?)</td>',
>                         re.I | re.S | re.M)
>
> address = re_address.search(response)
> if address:
>         address = address.group(1).strip()

When in doubt, use the repr() function (2.X) or the ascii() function
(3.X); it will show you unambiguously exactly what you have in a
string; in this case:

'3 Abbey Road, St Johns Wood <br />\r\t\t\tLondon, NW8 9AY'
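
As a small sketch of why this helps (assuming a 3.X interpreter and the
captured text shown above; the variable name "shown" is only for
illustration), compare print() with repr():

```python
address = "3 Abbey Road, St Johns Wood <br />\r\t\t\tLondon, NW8 9AY"

# print() sends the raw "\r" and "\t" characters to the terminal, so the
# carriage return can make the second half overwrite the first on screen.
print(address)

# repr() renders the control characters as visible escape sequences.
shown = repr(address)
print(shown)
```

For a pure-ASCII string like this one, ascii() gives the same text as
repr() under 3.X.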

>
>         #Important!
>         for item in ["\t","\r"," <br />"]:
>                 address = address.replace(item,"")
>
>         print "address is %s" % address

and the result is:

3 Abbey Road, St Johns WoodLondon, NW8 9AY

WoodLondon ??

Consider the possibility that whether the webpage originated on *x or
not, the author inserted that "<br />" with beneficial intent i.e. not
just to annoy you. You may wish to replace it with something instead
of discarding it.

If you really want the address to look tidy, you could do something
like this:

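def norm_space(s):
    # Collapse each run of whitespace (spaces, tabs, "\r") into one space.
    return ' '.join(s.split())

# Turn each "<br />" into a comma, then tidy every comma-separated part.
parts = address.replace('<br />', ',').strip(' ,').split(',')
tidy = ", ".join([norm_space(x) for x in parts])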

Perhaps the "<br />" has even more significance (line break?) than a
comma ... in which case you should split the address into lines first,
and apply the tidy process to each line.
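
A rough sketch of that line-by-line variant, assuming the address text
captured earlier in this thread and that "<br />" is always spelled
exactly that way; tidy_lines is a made-up name for illustration:

```python
def norm_space(s):
    # Collapse each run of whitespace (spaces, tabs, "\r") into one space.
    return ' '.join(s.split())

def tidy_lines(raw):
    # Hypothetical helper: treat each "<br />" as a line break, then
    # tidy the comma-separated parts within each line.
    lines = []
    for line in raw.split('<br />'):
        parts = [norm_space(p) for p in line.split(',')]
        cleaned = ', '.join(p for p in parts if p)  # skip empty parts
        if cleaned:
            lines.append(cleaned)
    return lines

address = "3 Abbey Road, St Johns Wood <br />\r\t\t\tLondon, NW8 9AY"
for line in tidy_lines(address):
    print(line)
```

This keeps "3 Abbey Road, St Johns Wood" and "London, NW8 9AY" as
separate lines; real-world pages may spell the tag "<br>" or "<BR/>",
which this sketch does not handle.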

HTH,
John