Python 2.x or 3.x, which is faster?

Chris Angelico rosuav at gmail.com
Wed Mar 9 09:54:44 EST 2016


On Thu, Mar 10, 2016 at 1:39 AM, BartC <bc at freeuk.com> wrote:
> On 09/03/2016 14:11, Chris Angelico wrote:
>>
>> On Thu, Mar 10, 2016 at 1:03 AM, BartC <bc at freeuk.com> wrote:
>>>
>>> I've just tried a UTF-8 file and getting some odd results. With a file
>>> containing [three euro symbols]:
>>>
>>> €€€
>>>
>>> (including a 3-byte utf-8 marker at the start), and opened in text mode,
>>> Python 3 gives me this series of bytes (ie. the ord() of each character):
>>>
>>> 239
>>> 187
>>> 191
>>> 226
>>> 8218
>>> 172
>>> 226
>>> 8218
>>> 172
>>> 226
>>> 8218
>>> 172
>>>
> >>> And prints the resulting string as: ï»¿â‚¬â‚¬â‚¬.
>>
>>
>> The first three bytes are the "UTF-8 BOM", which suggests you may have
>> created this in a broken editor like Notepad.
>
>
> Yes, that's what I used, but what's broken about it? If Python doesn't
> understand the BOM, it should still resynchronise after a few bytes.

It's an extra character. You thought the file contained three
characters; it actually contained four.
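
A minimal sketch of the difference, assuming the file holds nothing but
the UTF-8 BOM followed by three euro signs. The variable names are only
for illustration; "utf-8-sig" is the standard codec that drops a
leading BOM if one is present.

```python
import codecs

# Assumed file contents: the UTF-8 BOM followed by three euro signs.
raw = codecs.BOM_UTF8 + "€€€".encode("utf-8")

# Plain UTF-8 keeps the BOM as a real character, U+FEFF.
plain = raw.decode("utf-8")

# "utf-8-sig" strips a leading BOM when there is one.
stripped = raw.decode("utf-8-sig")

print(len(plain), hex(ord(plain[0])))  # 4 0xfeff
print(len(stripped))                   # 3
```

So if you know a file may have come from Notepad, opening it with
encoding="utf-8-sig" gives you just the text.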

>> For the rest, I'm not sure how you told Python to open this as text,
>> but you certainly did NOT specify an encoding of UTF-8. The 8218
>> entries in there are completely bogus. Can you show your code, please,
>> and also what you get if you open the file as binary?
>
> This is the code:
>
> f=open("input","r")
> t=f.read(1000)
> f.close()
>
> print ("T",type(t),len(t))
>
> print (t)
>
> for i in t:
>         print (ord(i))
>
> This doesn't specify any specific code encoding; I don't know how, and
> Steven didn't mention anything other than a text file. The input data is
> represented by this dump, and this is also what binary mode gives:
>
> 0000: ef bb bf e2 82 ac e2 82 ac e2 82 ac    ............

Okay. Try changing your first line to this:

f = open("input", encoding="utf-8")

By default, you get a system-specific encoding, which in your case
appears to be one of the Windows codepages. That's why you're getting
nonsense out of it - you write in one encoding and read in another.
It's commonly called mojibake.
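
Here is a rough sketch that reproduces your numbers from the bytes in
your dump. It assumes the codepage in play is cp1252, which is my guess
from the 8218 values rather than something I know about your system.

```python
# The twelve bytes from the hex dump: UTF-8 BOM plus three euro signs.
raw = bytes.fromhex("efbbbfe282ace282ace282ac")

# Assumption: the system default encoding was cp1252.
wrong = raw.decode("cp1252")
print([ord(c) for c in wrong])

# Decoding with the encoding the file was actually saved in.
right = raw.decode("utf-8")
print([hex(ord(c)) for c in right])
```

In cp1252, byte 0x82 is a low single quotation mark, U+201A, which is
8218 in decimal. That is where the bogus values come from.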

>> Unicode handling is easy as long as you (a) understand the fundamental
>> difference between text and bytes, and (b) declare your encodings.
>> Python isn't magical. It can't know the encoding without being told.
>
>
> Hence the BOM bytes.
>
> (Isn't it better that it's automatic? Someone sends you a text file that you
> want to open within a Python program. Are you supposed to analyze it first,
> or expect the sender to tell you what it is (they probably won't know) then
> need to hack the program to read it properly?)

No, it's not better to be automatic. They are supposed to tell you
what it is. Someone somewhere saved the file using a particular
encoding. In this example, you chose when you told Notepad to save it
as UTF-8; so you carry that information with the file, and open it
using the encoding="UTF-8" parameter.

Analyzing files to try to guess their encodings is fundamentally hard.
I have a source of occasional text files that basically just dumps
stuff on me without any metadata, and I have to figure out (a) what
the encoding is, and (b) what language the text is in. I can generally
assume that the files are ASCII-compatible (on the rare occasions when
they're not, they're usually going to be UTF-16, which is fairly easy
to spot), and then I have two levels of heuristics to try to guess a
most-likely encoding - but ultimately, the script just decodes the
text as best it can, and then hands the result up to the human. If the
result looks mostly like Spanish but has acute accents instead of
tildes over the n's, it's probably the wrong codepage. Or if the text
is all completely meaningless junk, it's probably Cyrillic or Greek
letters, and needs to be decoded using an appropriate eight-bit
encoding. It often ends up being trial-and-error to figure out what
encoding was actually used.
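
The trial-and-error part could be sketched roughly like this. It is
not my actual script: the candidate list and the decode_candidates
name are made up for illustration, and the language heuristics are
left out entirely. It only collects every decoding that does not fail
outright, so that a human can look at them.

```python
import codecs

# Hypothetical candidate list; a real one depends on where the files come from.
CANDIDATES = ("utf-8", "cp1252", "iso8859-5", "iso8859-7")

def decode_candidates(raw):
    """Return (encoding, text) pairs for every candidate that decodes."""
    # A UTF-16 BOM is the easy case to spot, so check for it first.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return [("utf-16", raw.decode("utf-16"))]
    results = []
    for enc in CANDIDATES:
        try:
            results.append((enc, raw.decode(enc)))
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
    return results

# Spanish text saved in cp1252: not valid UTF-8, but several
# eight-bit encodings accept the bytes without complaint.
for enc, text in decode_candidates("mañana".encode("cp1252")):
    print(enc, text)
```

Note that the eight-bit encodings almost never raise an error, which is
exactly why the final decision still needs a human.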

Trying to guess the encoding of text in a file full of bytes is like
trying to guess the modem settings (8N1? 7E1?). If the other end
doesn't tell you, you'll probably end up with something that carries
some decodable content, but not the original content. It's almost
completely useless.

ChrisA


