Modification of a urllib2 object ?

Fri Oct 10 23:57:33 EDT 2008

On Oct 10, 6:12 pm, vincehofmeis... at gmail.com wrote:

> On Oct 10, 1:02 pm, George Sakkis <george.sak... at gmail.com> wrote:
>
>
>
> > On Oct 10, 2:32 pm, vincehofmeis... at gmail.com wrote:
>
> > > I have tried several approaches to the following problem.
>
> > > This is what I have:
>
> > > ...
> > > import ClientForm
> > > import BeautifulSoup from BeautifulSoup
>
> > > request = urllib2.Request('http://form.com/)
>
> > > self.first_object = urllib2.open(request)
>
> > > soup = BeautifulSoup(self.first_object)
>
> > > forms = ClienForm.ParseResponse(self.first_object)
>
> > > Now, when I do this, indexing forms raises an IndexError because no
> > > forms are returned, but BeautifulSoup parses the page fine.
>
> > First off, please copy and paste working code; the above has several
> > syntax errors, so it can't raise IndexError (or anything else for that
> > matter).
>
> > > Now, when I switch the order to this:
>
> > > import ClientForm
> > > import BeautifulSoup from BeautifulSoup
>
> > > request = urllib2.Request('http://form.com/)
>
> > > self.first_object = urllib2.open(request)
>
> > > forms = ClienForm.ParseResponse(self.first_object)
>
> > > soup = BeautifulSoup(self.first_object)
>
> > > Now, the form is returned correctly, but the BeautifulSoup object
> > > comes back empty.
>
> > > So what I can draw from this is both methods erase the properties of
> > > the object,
>
> > No, that's not the case. What happens is that the HTTP response object
> > returned by urllib2.urlopen() is read by ClientForm.ParseResponse or
> > BeautifulSoup - whichever happens first - and the second call has
> > nothing left to read.
>
> > The easiest solution is to save the request object and call
> > urllib2.urlopen twice. Alternatively, check if ClientForm has a parse
> > function that accepts strings instead of response objects, and then
> > read and save the HTML text explicitly (ParseString below is a
> > placeholder name, not a checked API):
>
> > >>> text = urllib2.urlopen(request).read()
> > >>> soup = BeautifulSoup(text)
> > >>> forms = ClientForm.ParseString(text)
>
> > HTH,
> > George
>
> request = urllib2.Request(settings.register_page)
>
>                 self.url_obj = urllib2.urlopen(request).read()
>
>                 soup = BeautifulSoup(self.url_obj);
>
>                 forms = ClientForm.ParseResponse(self.url_obj,
> backwards_compat=False)
>
>
>
> Now I am getting this error:
>
> Traceback (most recent call last):
>   File "C:\Python25\Lib\site-packages\PyQt4\POS Pounder\Oct7\oct.py",
> line 1251, in createAccounts
>     forms = ClientForm.ParseResponse(self.url_obj,
> backwards_compat=False)
>   File "C:\Python25\lib\site-packages\clientform-0.2.9-py2.5.egg
> \ClientForm.py", line 1054, in ParseResponse
> AttributeError: 'str' object has no attribute 'geturl'

Did you read what I wrote? ClientForm.ParseResponse() expects a
response object, not a string. Browsing through its docs, it seems
there is an alternative parsing function, ClientForm.ParseFile(file,
base_uri, ...).

The following should work (untested):

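import urllib2
from cStringIO import StringIO

import ClientForm
from BeautifulSoup import BeautifulSoup

request = urllib2.Request(settings.register_page)
response = urllib2.urlopen(request)
# Read the body once; the response stream cannot be re-read afterwards.
text = response.read()
soup = BeautifulSoup(text)
# ParseFile takes a file-like object plus the base URI, so wrap the saved text.
forms = ClientForm.ParseFile(StringIO(text), response.geturl(),
                             backwards_compat=False)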
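
The read-once behaviour behind all this can be sketched without touching
the network. This is only an illustration: io.BytesIO stands in for the
response object, the HTML snippet is made up, and the io module needs
Python 2.6 or later.

```python
from io import BytesIO

# Stand-in for the object returned by urllib2.urlopen(); any file-like
# stream behaves the same way once it has been read to the end.
response = BytesIO(b"<html><form action='/register'></form></html>")

first = response.read()    # the first consumer gets the whole body
second = response.read()   # the second consumer finds the stream exhausted

# Reading once and wrapping the saved bytes gives each parser its own
# fresh stream over the same content.
copy_a = BytesIO(first)
copy_b = BytesIO(first)
```

Whichever parser is handed the original stream second sees what `second`
holds here, which is why swapping the two calls only moved the problem.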

HTH,
George