Modification of a urllib2 object ?

jowillia at gmail.com jowillia at gmail.com
Sat Nov 22 10:14:54 EST 2008


On Oct 10, 10:57 pm, George Sakkis <george.sak... at gmail.com> wrote:
> On Oct 10, 6:12 pm, vincehofmeis... at gmail.com wrote:
>
>
>
> > On Oct 10, 1:02 pm, George Sakkis <george.sak... at gmail.com> wrote:
>
> > > On Oct 10, 2:32 pm, vincehofmeis... at gmail.com wrote:
>
> > > > I have tried several ways to solve the following problem.
>
> > > > This is what I have:
>
> > > > ...
> > > > import ClientForm
> > > > import BeautifulSoup from BeautifulSoup
>
> > > > request = urllib2.Request('http://form.com/)
>
> > > > self.first_object = urllib2.open(request)
>
> > > > soup = BeautifulSoup(self.first_object)
>
> > > > forms = ClienForm.ParseResponse(self.first_object)
>
> > > > Now, when I do this, forms raises an index error because no forms
> > > > are returned, but BeautifulSoup parses the page fine.
>
> > > First off, please copy and paste working code; the above has several
> > > syntax errors, so it can't raise IndexError (or anything else for that
> > > matter).
>
> > > > Now, when I switch the order to this:
>
> > > > import ClientForm
> > > > import BeautifulSoup from BeautifulSoup
>
> > > > request = urllib2.Request('http://form.com/)
>
> > > > self.first_object = urllib2.open(request)
>
> > > > forms = ClienForm.ParseResponse(self.first_object)
>
> > > > soup = BeautifulSoup(self.first_object)
>
> > > > Now, the form is returned correctly, but the BeautifulSoup object
> > > > comes back empty.
>
> > > > So what I can draw from this is both methods erase the properties of
> > > > the object,
>
> > > No, that's not the case. What happens is that the http response object
> > > returned by urllib2.urlopen() is read to the end by either
> > > ClientForm.ParseResponse or BeautifulSoup - whichever is called first -
> > > and the second call has nothing left to read.
>
> > > The easiest solution is to save the request object and call
> > > urllib2.urlopen twice. Alternatively, check if ClientForm has a parse
> > > function that accepts strings instead of urllib2 responses and then read
> > > and save the html text explicitly:
>
> > > >>> text = urllib2.urlopen(request).read()
> > > >>> soup = BeautifulSoup(text)
> > > >>> forms = ClientForm.ParseString(text)
>
> > > HTH,
> > > George
>
> > request = urllib2.Request(settings.register_page)
>
> >                 self.url_obj = urllib2.urlopen(request).read()
>
> >                 soup = BeautifulSoup(self.url_obj);
>
> >                 forms = ClientForm.ParseResponse(self.url_obj,
> > backwards_compat=False)
>
> > Now I am getting this error:
>
> > Traceback (most recent call last):
> >   File "C:\Python25\Lib\site-packages\PyQt4\POS Pounder\Oct7\oct.py",
> > line 1251, in createAccounts
> >     forms = ClientForm.ParseResponse(self.url_obj,
> > backwards_compat=False)
> >   File "C:\Python25\lib\site-packages\clientform-0.2.9-py2.5.egg
> > \ClientForm.py", line 1054, in ParseResponse
> > AttributeError: 'str' object has no attribute 'geturl'
>
> Did you read what I wrote? ClientForm.ParseResponse() expects a
> response object, not a string. Browsing through its docs, it seems
> there is an alternative parsing function, ClientForm.ParseFile(file,
> base_uri, ...).
>
> The following should work (untested):
>
> from cStringIO import StringIO
>
> request = urllib2.Request(settings.register_page)
> response = urllib2.urlopen(request)
> text = response.read()
> soup = BeautifulSoup(text)
> forms = ClientForm.ParseFile(StringIO(text), response.geturl(),
>                              backwards_compat=False)
>
> HTH,
> George

Hello George,

I seem to be running into the same problem as Vince.  Your solution
seems very good, but ClientForm gets a little bit more from the handle
than just the text.

> The following should work (untested):
>
> from cStringIO import StringIO
>
> request = urllib2.Request(settings.register_page)
> response = urllib2.urlopen(request)
> text = response.read()
> soup = BeautifulSoup(text)
> forms = ClientForm.ParseFile(StringIO(text), response.geturl(),
>                              backwards_compat=False)

When running your code in my program, which is doing something very
similar to what Vince is doing, I get:

AttributeError: 'cStringIO.StringI' object has no attribute 'geturl'

This makes perfect sense given the way ClientForm uses the handle.
It seems that, short of figuring out how to deepcopy the handle,
you're going to be stuck making the request twice.  But that is
going to hit the URL (server) twice, which I would say is a bad idea.
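
One way around the double request might be to read the body once into
memory and hand both parsers a small wrapper that can be rewound.  The
sketch below is only an illustration: ReplayableResponse is a
hypothetical helper name, not part of urllib2 or ClientForm, and it
assumes the parser only needs read(), readline(), geturl() and info()
from the handle.

```python
import io


class ReplayableResponse(object):
    """Hypothetical helper: buffers a response so it can be read again."""

    def __init__(self, response):
        # Read the body exactly once; this is the only network read.
        self._buffer = io.BytesIO(response.read())
        # Keep the metadata a parser may ask the handle for.
        self._url = response.geturl()
        self._headers = response.info()

    def read(self, *args):
        return self._buffer.read(*args)

    def readline(self, *args):
        return self._buffer.readline(*args)

    def seek(self, pos, whence=0):
        # Rewind with seek(0) before handing the object to a second reader.
        return self._buffer.seek(pos, whence)

    def geturl(self):
        return self._url

    def info(self):
        return self._headers

    def close(self):
        # Nothing to release; the body lives in memory.
        pass


# Intended use with the code from this thread (untested assumption):
#   response = ReplayableResponse(urllib2.urlopen(request))
#   soup = BeautifulSoup(response.read())
#   response.seek(0)
#   forms = ClientForm.ParseResponse(response, backwards_compat=False)
```

The server is contacted once, and the seek(0) call between the two
parsers is what gives the second one something to read.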

I've been struggling with this issue for some time now, and this is
the first place I've found a solid discussion about it.

-Josh


