urllib2 (py2.6) vs urllib.request (py3)

Tue Mar 17 14:16:23 EDT 2009

On Tue, 17 Mar 2009 15:40:02 +0000, R. David Murray wrote:

> mattia <gervaz at gmail.com> wrote:
>> On Tue, 17 Mar 2009 10:55:21 +0000, R. David Murray wrote:
>> 
>> > mattia <gervaz at gmail.com> wrote:
>> >> Hi all, can you tell me why the module urllib.request (py3) adds
>> >> extra characters (b'fef\r\n and \r\n0\r\n\r\n') in a simple example
>> >> like the following, while urllib2 (py2.6) correctly does not?
>> >> 
>> >> py2.6
>> >> >>> import urllib2
>> >> >>> f = urllib2.urlopen("http://www.google.com").read()
>> >> >>> fd = open("google26.html", "w")
>> >> >>> fd.write(f)
>> >> >>> fd.close()
>> >> 
>> >> py3
>> >> >>> import urllib.request
>> >> >>> f = urllib.request.urlopen("http://www.google.com").read()
>> >> >>> with open("google30.html", "w") as fd:
>> >> ...     print(f, file=fd)
>> >> ...
>> >> >>>
>> >> Opening the two HTML pages in Firefox, I get different results (the
>> >> extra characters mentioned earlier). Why?
>> > 
>> > The problem isn't a difference between urllib2 and urllib.request, it
>> > is between fd.write and print.  This produces the same result as your
>> > first example:
>> > 
>> > 
>> > >>> import urllib.request
>> > >>> f = urllib.request.urlopen("http://www.google.com").read()
>> > >>> with open("temp3.html", "wb") as fd:
>> > ...     fd.write(f)
>> > 
>> > 
>> > The "b'....'" is the stringified representation of a bytes object,
>> > which is what urlopen(...).read() returns in python3.  Note the 'wb',
>> > which is a critical difference from the python2.6 case.  If you omit
>> > the 'b' in python3, it will complain that you can't write bytes to
>> > the file object.
>> > 
>> > The thing to keep in mind is that print converts its argument to
>> > string before writing it anywhere (that's the point of using it), and
>> > that bytes (or buffer) and string are very different types in
>> > python3.
>> 
>> Well... now in the saved file I've got extra characters "fef" at the
>> beginning and "0" at the end...
> 
> The 'fef' is reminiscent of a BOM.  I don't see any such thing in the
> data file produced by my code snippet above.  Did you try running that,
> or did you modify your code?  If the latter, maybe if you post your
> exact code I can try to run it and see if I can figure out what is going
> on.
> 
> I'm far from an expert in unicode issues, by the way :)  Oh, and I'm
> running 3.1a1+ from svn, so it is also possible there's been a bug fix
> of some sort.

The extra characters were produced using Python 3.0. This afternoon
I downloaded version 3.0.1, and the previous example works fine with
the "wb" mode. Now that I know urlopen(...).read() returns bytes, I've
also figured out how to decode the result (I deal only with HTML pages,
no JPG, PDF, etc.), so I just need to know the charset of the page (if
available).
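
A minimal sketch of that decoding step, assuming the charset is taken
from the Content-Type header and that UTF-8 is an acceptable fallback
when none is declared. The helper name decode_body and the fallback
choice are illustrative, not something from this thread. The headers
are built by hand so the example runs without a network connection;
in current Python 3 the headers attribute of the object returned by
urlopen is an email.message.Message subclass, so the same call should
apply to a real response.

```python
from email.message import Message


def decode_body(raw, headers, fallback="utf-8"):
    """Decode response bytes using the charset named in Content-Type.

    `raw` is the bytes object returned by read(); `headers` is an
    email.message.Message (hypothetical helper, fallback is an assumption).
    """
    # get_content_charset() returns the lower-cased charset parameter,
    # or None when the header is missing or declares no charset.
    charset = headers.get_content_charset() or fallback
    # errors="replace" avoids a UnicodeDecodeError on stray bytes; an
    # unknown charset name would still raise LookupError.
    return raw.decode(charset, errors="replace")


# Hand-built headers standing in for a real HTTP response.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"
raw = "caffè".encode("iso-8859-1")

text = decode_body(raw, headers)

# For contrast: this is what print(raw, file=fd) writes, the repr of
# the bytes object rather than its decoded contents.
stringified = str(raw)
```

With a live response this would be used roughly as
decode_body(resp.read(), resp.headers). Pages that declare their
charset only in a <meta> tag are not covered by this sketch.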