string encoding regex problem

Peter Otten __peter__ at web.de
Sat Aug 23 17:13:07 EDT 2014


Philipp Kraus wrote:

> I have created a short script:
> 
> ---------
> #!/usr/bin/env python
> 
> import re, urllib2
> 
> 
> def URLReader(url) :
>     f = urllib2.urlopen(url)
>     data = f.read()
>     f.close()
>     return data
> 
> 
> print re.match( "\<small\ \>.*\<\/small\>",
> URLReader("http://sourceforge.net/projects/boost/") )
> ---------
> 
> Within the data the string "<small>boost_1_56_0.tar.gz</small>" should
> be matched, but I always get None from re.match, and re.search
> also returns None.

>>> help(re.match)
Help on function match in module re:

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found.

The page does not begin with text that matches your regex, so re.match()
cannot succeed here. re.search(), however, works for me:

>>> import re, urllib2
>>> 
>>> 
>>> def URLReader(url) :
...     f = urllib2.urlopen(url)
...     data = f.read()
...     f.close()
...     return data
... 
>>> data = URLReader("http://sourceforge.net/projects/boost/")
>>> re.search("\<small\ \>.*\<\/small\>", data)
<_sre.SRE_Match object at 0x7f282dd58718>
>>> _.group()
'<small >boost_1_56_pdf.7z</small>'
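
The difference can also be reproduced without a network connection. The
following is only a minimal sketch: SAMPLE is a made-up stand-in for the
downloaded page, not the real HTML, and PATTERN is the regex from the script
rewritten as a raw string without the backslashes, which are not needed for
these characters. It should run unchanged under Python 2 and Python 3.

```python
import re

# Hypothetical stand-in for the downloaded page; not the real HTML.
SAMPLE = '<html><body><small >boost_1_56_pdf.7z</small></body></html>'

# The regex from the script, as a raw string and without redundant escapes.
PATTERN = r'<small >.*</small>'

# re.match() only tries the pattern at position 0, where SAMPLE has "<html>".
at_start = re.match(PATTERN, SAMPLE)

# re.search() scans forward and reports the first position that matches.
anywhere = re.search(PATTERN, SAMPLE)

print(at_start)            # None
print(anywhere.group())    # <small >boost_1_56_pdf.7z</small>
```

If the pattern really has to match at the very beginning, re.match() is the
right tool; to find a fragment somewhere inside a larger document, use
re.search().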


> I have tested the regex under http://regex101.com/ with the HTML code
> and on the page the regex is matched.
> 
> Can you please help me fix the problem? I don't understand why the
> match returns None.
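
One more detail that may be worth checking: the pattern contains an escaped
space before the closing ">", so it only matches "<small >" (with a space),
which is what the session above found, and not the "<small>" you expected.
Here is a small sketch of that effect. The two snippets are invented for
illustration, and RELAXED is just one assumed way to loosen the pattern, not
necessarily the best one for the real page.

```python
import re

# The regex from the script, as a raw string and without redundant escapes.
PATTERN = r'<small >.*</small>'

# Hypothetical snippets; the real page content may differ.
with_space = '<small >boost_1_56_pdf.7z</small>'
without_space = '<small>boost_1_56_0.tar.gz</small>'

# The literal space in PATTERN must be present in the text.
print(re.search(PATTERN, with_space) is not None)     # True
print(re.search(PATTERN, without_space) is not None)  # False

# Assumed variant: " ?" makes the space optional, and ".*?" stops at the
# first closing tag instead of running on to the last one in the text.
RELAXED = r'<small ?>(.*?)</small>'
print(re.search(RELAXED, without_space).group(1))     # boost_1_56_0.tar.gz
```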




