Modification of a urllib2 object?

Fri Oct 10 16:02:22 EDT 2008

On Oct 10, 2:32 pm, vincehofmeis... at gmail.com wrote:
> I have several ways to the following problem.
>
> This is what I have:
>
> ...
> import ClientForm
> import BeautifulSoup from BeautifulSoup
>
> request = urllib2.Request('http://form.com/)
>
> self.first_object = urllib2.open(request)
>
> soup = BeautifulSoup(self.first_object)
>
> forms = ClienForm.ParseResponse(self.first_object)
>
> Now, when I do this, forms raises an index error because no forms
> are returned, but the BeautifulSoup registers fine.

First off, please copy and paste working code; the above has several
syntax errors, so it can't raise IndexError (or anything else for that
matter).

> Now, when I switch the order to this:
>
> import ClientForm
> import BeautifulSoup from BeautifulSoup
>
> request = urllib2.Request('http://form.com/)
>
> self.first_object = urllib2.open(request)
>
> forms = ClienForm.ParseResponse(self.first_object)
>
> soup = BeautifulSoup(self.first_object)
>
> Now, the form is returned correctly, but the BeautifulSoup objects
> returns empty.
>
> So what I can draw from this is both methods erase the properties of
> the object,

No, that's not the case. The HTTP response object returned by
urllib2.urlopen() is a file-like object that can be read only once.
Whichever of ClientForm.ParseResponse or BeautifulSoup runs first
consumes it, and the second call has nothing left to read.
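
To illustrate the read-once behaviour, here is a minimal sketch. It
uses io.BytesIO as a stand-in for the response object (an assumption,
so that it runs without a network connection), and the markup is made
up for the example:

```python
import io

# Stand-in for the HTTP response: any file-like object behaves the same way.
# The HTML below is a made-up placeholder, not the page from the question.
response = io.BytesIO(b"<html><form action='/go'></form></html>")

first = response.read()   # the first consumer gets the whole body
second = response.read()  # the stream is already at EOF, so this is empty

print(len(first))   # length of the full body
print(len(second))  # 0
```

Nothing about the object has been "erased"; its read position has
simply reached the end of the data.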

The easiest solution is to save the request object and call
urllib2.urlopen twice, at the cost of a second HTTP request.
Alternatively, check if ClientForm has a parse function that accepts
strings instead of response objects, and then read and save the HTML
text explicitly:

>>> text = urllib2.urlopen(request).read()
>>> soup = BeautifulSoup(text)
>>> forms = ClientForm.ParseString(text)
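
If the parser only accepts file-like objects, the saved text can be
wrapped in a fresh in-memory stream for each consumer. A rough sketch
of that idea follows. The helper names fetch_once and fresh_stream are
hypothetical, and io.BytesIO again stands in for the real response so
the example runs without a network:

```python
import io


def fetch_once(response):
    """Read the body a single time and return it as bytes."""
    return response.read()


def fresh_stream(body):
    """Wrap the saved body in a new file-like object positioned at the start."""
    return io.BytesIO(body)


# Stand-in for the HTTP response (assumption: placeholder markup, no network).
response = io.BytesIO(b"<html><form action='/go'></form></html>")
body = fetch_once(response)

# Each consumer gets its own independent stream over the same bytes,
# so the order in which they read no longer matters.
for_soup = fresh_stream(body)
for_forms = fresh_stream(body)

soup_input = for_soup.read()
forms_input = for_forms.read()
```

Either stream can then be handed to whichever parser expects
something with a read() method, and only one HTTP request is made.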

HTH,
George