Python 2.x or 3.x, which is faster?

Steven D'Aprano steve at pearwood.info
Wed Mar 9 10:28:01 EST 2016


On Thu, 10 Mar 2016 01:39 am, BartC wrote:


> This is the code:
> 
> f=open("input","r")
> t=f.read(1000)
> f.close()

If you don't give read an argument, it will try to read the entire file:

t = f.read()


> print ("T",type(t),len(t))
> print (t)
> for i in t:
>     print (ord(i))
> 
> This doesn't specify any specific code encoding; I don't know how, and
> Steven didn't mention anything other than a text file. 

I did warn you that, and I quote, "There's more, but that's the basics". You
could always read the Fine Manual, or even the interactive help (always a
boon for the busy programmer):

help(open) starts with:


open(...)
    open(file, mode='r', buffering=-1, encoding=None,
         errors=None, newline=None, closefd=True, opener=None) 
    -> file object

    Open file and return a stream.  Raise IOError upon failure.


To specify an encoding, pass the name of the encoding as an argument:

    open(filename, "r", encoding="utf-8-sig")

for UTF-8 files as created by Notepad, and 

    open(filename, "r", encoding="utf-8")

for UTF-8 files without the leading 3-byte signature.
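To make the difference concrete, here is a small round-trip sketch (the
temp-file path is my own scaffolding, not part of the original example):
writing with "utf-8-sig" prepends the three signature bytes, and reading
with the same codec strips them again, while plain "utf-8" keeps them.

```python
import os
import tempfile

# Write three euro signs with a UTF-8 signature (BOM), as Notepad would.
path = os.path.join(tempfile.mkdtemp(), "input")
with open(path, "w", encoding="utf-8-sig") as f:
    f.write("\u20ac\u20ac\u20ac")

# The raw bytes start with EF BB BF, followed by three E2 82 AC triples.
with open(path, "rb") as f:
    raw = f.read()
print(raw)  # b'\xef\xbb\xbf\xe2\x82\xac\xe2\x82\xac\xe2\x82\xac'

# "utf-8-sig" strips the signature on reading: three characters.
with open(path, "r", encoding="utf-8-sig") as f:
    print(len(f.read()))  # 3

# Plain "utf-8" decodes the signature as U+FEFF: four characters.
with open(path, "r", encoding="utf-8") as f:
    print(len(f.read()))  # 4
```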


> The input data is 
> represented by this dump, and this is also what binary mode gives:
> 
> 0000: ef bb bf e2 82 ac e2 82 ac e2 82 ac    ............

That matches the bytes I suggested in a previous post:

b'\xef\xbb\xbf\xe2\x82\xac\xe2\x82\xac\xe2\x82\xac'

but not the values you quoted, specifically the triples of:

226, 8218, 172 (decimal) or in hex: e2 201a ac

Obviously hex 201a is too big to fit in a byte. I'm not sure how you could
have got that. Human error perhaps?
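For what it's worth, one way (pure speculation on my part) those exact
values can arise is by decoding the euro sign's UTF-8 bytes as Windows
cp1252, where byte 0x82 maps to U+201A:

```python
# Mistakenly decoding the UTF-8 bytes of one euro sign (e2 82 ac)
# as cp1252 yields three characters with exactly the quoted ordinals.
mistake = b'\xe2\x82\xac'.decode('cp1252')
print([ord(c) for c in mistake])  # [226, 8218, 172]
```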


>> Unicode handling is easy as long as you (a) understand the fundamental
>> difference between text and bytes, and (b) declare your encodings.
>> Python isn't magical. It can't know the encoding without being told.
> 
> Hence the BOM bytes.

Alas, if only it were that simple. But encoding is *metadata*, not data, and
cannot reliably be read from the file itself. It may be a useful heuristic,
which is *mostly* reliable, but it cannot be considered foolproof.

How do you distinguish between a UTF-8 signature and a Latin-1 file that
happens to start with these three characters "ï»¿"? Or a MacRoman file that
happens to start with the three characters "Ôªø"? To mention just a few.

The problem is, any stream of bytes can only be correctly recognised as text
if you know what encoding the bytes represent:


py> dump = b'\xef\xbb\xbf\x2d\x2d\x2d'
py> dump.decode('utf-8-sig')
'---'
py> dump.decode('latin-1')
'ï»¿---'
py> dump.decode('MacRoman')
'Ôªø---'
py> dump.decode('cp1251')
'п»ї---'



> (Isn't it better that it's automatic? Someone sends you a text file that
> you want to open within a Python program. Are you supposed to analyze it
> first, or expect the sender to tell you what it is (they probably won't
> know) then need to hack the program to read it properly?)

You cannot know for sure what encoding a text file uses unless that
information has been recorded somewhere outside of the text file and
transmitted "out of band". That is, you ask the sender. And you are right,
they probably won't know. Then you try to guess, and if you guess wrong, the
text you read will contain mojibake:

https://en.wikipedia.org/wiki/Mojibake

See also:

https://en.wikipedia.org/wiki/Charset_detection

https://en.wikipedia.org/wiki/Bush_hid_the_facts
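If you must guess, one common heuristic (a sketch only, and the candidate
list here is my own assumption, not a recommendation from this thread) is to
try a few likely codecs in order and accept the first one that decodes
without error:

```python
def guess_decode(data, candidates=("utf-8-sig", "utf-8", "cp1252", "latin-1")):
    """Try each candidate encoding in turn; return (text, encoding_used).

    latin-1 never fails, so it acts as a last-resort fallback -- which
    also means this can silently produce mojibake. A heuristic, not proof.
    """
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")

text, enc = guess_decode(b'\xef\xbb\xbf\xe2\x82\xac\xe2\x82\xac\xe2\x82\xac')
print(enc)   # 'utf-8-sig'
print(text)  # the three euro signs, signature stripped
```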



-- 
Steven



