Searching in an html file

david_ullrich at my-deja.com
Fri Jul 28 13:18:56 EDT 2000


In article <8lrrgd$9o7$1 at nnrp1.deja.com>,
  pauljolly at my-deja.com wrote:
> Dear all,
>
> I have managed to write the code to retrieve an html file from a given
> address, and transfer this to a variable. I am now trying to search for
> a ° within the file, using the re module. I cannot however get it
> to work. My code is as follows:
>
> import re
> searchstring=re.compile('°')
> result=searchstring.search(htmlcode)
>
> where htmlcode is the code from the page retrieved.
>
> This gives result=NONE. What is going wrong here? I know that ° can
> be found within the page, I have checked in TextPad. I am completely
> stuck because when I try the same search but on the following string:
>
> "Thisisastringonwhichiwilltestthecodetofind°withinthetext"
>
> it finds the ° without any problem. The only difference I can see
> between the two strings is their length. The htmlcode string is 24KB,
> the string above much less. What is going on here?

    Where did htmlcode come from? (If it came from something that
knows about html, the entity reference may already have been
replaced by the character it stands for before you search for it.
If you just said htmlcode=open('thefilename.html','r').read() then
never mind...)
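
    A rough sketch of that possibility, under the assumption that the
page spells the degree sign as the entity reference &deg; and that
some HTML-aware step decoded it before the search. The sample text is
made up, and html.unescape (from the present-day standard library)
only stands in for whatever might have processed the real page:

```python
import re
from html import unescape  # stand-in for any HTML-aware processing step

raw = "Temperature: 21&deg;C"   # hypothetical page source, as stored on disk
decoded = unescape(raw)         # entity reference replaced by the character

pattern = re.compile("&deg;")

# The entity reference is present in the raw text...
found_in_raw = pattern.search(raw) is not None
# ...but no longer present once the text has been decoded.
found_in_decoded = pattern.search(decoded) is not None
# The decoded text holds the degree character itself instead.
char_in_decoded = "\u00b0" in decoded

print(found_in_raw, found_in_decoded, char_in_decoded)
```

If the search fails on the retrieved text but works on a hand-typed
test string, comparing the two this way may show which form the
retrieved text actually contains.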

    I have no problem with strings much larger than 24K:

import re
searchstring=re.compile('°')
htmlcode='0'*300000 + '°'
result=searchstring.search(htmlcode)
print(result.start())
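
    Since string length is not the issue, one way to narrow things
down is to check which spelling of the degree sign the text really
holds. This is only a sketch: find_degree_forms is a made-up helper
name, and the list of spellings is an assumption about what a page
might use, not something taken from the original page.

```python
import re

def find_degree_forms(text):
    """Return the offset of each spelling of the degree sign, or None."""
    forms = {
        "character": "\u00b0",          # the literal degree character
        "named entity": "&deg;",
        "decimal reference": "&#176;",
        "hex reference": "&#xB0;",
    }
    hits = {}
    for label, form in forms.items():
        # re.escape guards against metacharacters; IGNORECASE is a lenient
        # choice so that a lowercase hex spelling (&#xb0;) is also found.
        match = re.search(re.escape(form), text, re.IGNORECASE)
        hits[label] = match.start() if match else None
    return hits

# Same shape as the test above: a long string with the target at the end.
hits = find_degree_forms("0" * 300000 + "&#176;")
for label, offset in hits.items():
    print(label, offset)
```

Whichever entry comes back with an offset is the form to search for.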

DU

> Any help is much appreciated.
>
> Paul Jolly
>
> Sent via Deja.com http://www.deja.com/
> Before you buy.
>



More information about the Python-list mailing list