python tags on websites timeout problem

Lee Harr missive at frontiernet.net
Sun Jul 20 18:36:41 EDT 2003


In article <cdac0350.0307191527.755df3e1 at posting.google.com>, jeff wrote:
> Hiya
> 
> I'm trying to pull tags off a website using Python. I've got a few things
> running that have the potential to work; it's just that I can't get them
> to because of certain errors.
> 
> Basically, I don't want to download the images and all the other stuff,
> just the HTML, and then work from there. I think it's timing out because
> it's trying to download the images as well, which I don't want, as that
> would decrease the speed of what I'm trying to achieve. The URL used is
> only an example.
> 

A web page is made up of many separate components. When you
"download a webpage" you are generally fetching just the HTML,
and you will not get any images unless you specifically
download each one by its own URL.
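
For instance, with plain urllib the page fetch brings back only the
markup; an image comes down only if you request its own URL.  A quick
sketch (the image path here is made up for illustration):

import urllib

# fetching the page only brings back the HTML markup
html = urllib.urlopen('http://www.example.org/').read()
print len(html), 'bytes of HTML, no images'

# an image only comes down if you request its own URL
# (this image path is just made up for illustration)
urllib.urlretrieve('http://www.example.org/images/logo.gif', 'logo.gif')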



> this is my source
> 
> --------------------------------------------------------------------------------
> 
> #!/usr/bin/env python
> import re
> import urllib
> 
> file = urllib.urlretrieve("http://images.google.com/images?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=rabbit"
> , "temp1.tmp")
> 

Two things:

Don't use the name "file" for your variable: in Python 2.2 and later,
file is a builtin (the file type, which can be used like open()), and
rebinding the name shadows it.
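
A quick illustration of the shadowing problem:

>>> file = "temp1.tmp"       # rebinds the builtin name to a string
>>> f = file("notes.txt")    # fails; the name no longer refers to the file type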

Why save the file and then read it back in?

I might do something like...

import urllib

# urlopen returns a file-like object; no need to save to disk first
text = urllib.urlopen('http://www.example.org')
for line in text.readlines():
    print line,   # lines keep their newlines, so the trailing comma avoids doubling them


> # searching the file content line by line:
> keyword = re.compile(r"</a>")
> 
> for line in text:
>     result = keyword.search (line)
>     if result:
>        print result.group(1), ":", line,

There are no parentheses (no capturing groups) in your regex,
so there will never be a group(1):

>>> import re
>>> keyword = re.compile(r"</a>")
>>> x = 'abc </a> def'
>>> z = keyword.search(x)
>>> z.groups()
()
>>> z.group(0)
'</a>'
>>> z.group(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: no such group


>>> keyword = re.compile(r"(</a>)")
>>> z=keyword.search(x)
>>> z.group(0)
'</a>'
>>> z.group(1)
'</a>'
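
If what you are after is, say, the link targets, put the capturing
group around the part you want to keep.  A rough sketch along the
lines of your script (it assumes the href is double-quoted and sits
on one line; real HTML is messier, so the HTMLParser or sgmllib
modules are sturdier for serious scraping):

import re
import urllib

# capture the href value inside each anchor tag
link = re.compile(r'<a [^>]*href="([^"]*)"')

text = urllib.urlopen('http://www.example.org')
for line in text.readlines():
    result = link.search(line)
    if result:
        print result.group(1), ":", line,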



> --------------------------------------------------------------------------------
> and these are the errors I'm getting

> 
> C:\Python22>python tagyourit.py
> Traceback (most recent call last):
>   File "tagyourit.py", line 5, in ?
>     file = urllib.urlretrieve("http://images.google.com/image
> 8&oe=UTF-8&q=rabbit" , "temp1.tmp")

Is that line break (between "image" and "8") really in the URL,
or is it just your console wrapping the line?  Maybe there is a
problem with the URL...


>   File "C:\PYTHON22\lib\urllib.py", line 80, in urlretrieve
>     return _urlopener.retrieve(url, filename, reporthook, dat
>   File "C:\PYTHON22\lib\urllib.py", line 210, in retrieve
>     fp = self.open(url, data)
>   File "C:\PYTHON22\lib\urllib.py", line 178, in open
>     return getattr(self, name)(url)
>   File "C:\PYTHON22\lib\urllib.py", line 292, in open_http
>     h.endheaders()
>   File "C:\PYTHON22\lib\httplib.py", line 695, in endheaders
>     self._send_output()
>   File "C:\PYTHON22\lib\httplib.py", line 581, in _send_outpu
>     self.send(msg)
>   File "C:\PYTHON22\lib\httplib.py", line 548, in send
>     self.connect()
>   File "C:\PYTHON22\lib\httplib.py", line 532, in connect
>     raise socket.error, msg
> --------------------------------------------------------------------------------


I think maybe you are just not getting any response at all
from your attempt to fetch.  Can you get any other URL?
Maybe Google is watching User-Agent strings to try to keep
spiders out of its pages.
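
If it is the User-Agent, one way around it (just a sketch, with a
made-up agent string) is to subclass FancyURLopener, which lets you
override the string urllib sends:

import urllib

class MyOpener(urllib.FancyURLopener):
    # 'version' is the User-Agent header urllib sends; this value is made up
    version = 'Mozilla/4.0 (compatible; tagyourit)'

# urllib.urlopen() uses the module-level opener, so install ours
urllib._urlopener = MyOpener()

text = urllib.urlopen('http://images.google.com/images?q=rabbit')
print text.read()[:200]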





