Python 2.x or 3.x, which is faster?

Wed Mar 9 09:54:54 EST 2016

On Thu, 10 Mar 2016 01:03 am, BartC wrote:

> On 09/03/2016 02:18, Steven D'Aprano wrote:
>> On Wed, 9 Mar 2016 12:28 pm, BartC wrote:
>>
>>> (Which wasn't as painful as I'd expected. However the next project I
>>> have in mind is 20K lines rather than 0.7K. For that I'm looking at some
>>> mechanical translation I think. And probably some library to wrap around
>>> Python's i/o.)
>>
>> You almost certainly don't need another wrapper around Python's I/O,
>> making it slower still. You need to understand what Python's I/O is
>> doing.
> 
> Well, the original project will be using its file i/o library. So it'll
> use the same interface that will be reimplemented on top of Python i/o.

Just don't complain that it's slow :-)

> And input operations mainly consist of grabbing an entire file at once.

with open(pathname) as f:
    data = f.read()
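
A minimal sketch of the same read with the encoding spelled out, assuming
the file is UTF-8 with a BOM like the one described further down. Without
an encoding argument, open() in text mode falls back on a platform-dependent
default. The helper name read_text and the temporary demo file are made up
for illustration; "utf-8-sig" is the standard codec that drops a leading BOM.

```python
import os
import tempfile


def read_text(pathname, encoding="utf-8-sig"):
    """Read a whole text file, decoding it with an explicit encoding."""
    with open(pathname, encoding=encoding) as f:
        return f.read()


# Illustration only: write a UTF-8 BOM followed by three euro signs,
# then read the file back as text.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"\xef\xbb\xbf" + "\u20ac\u20ac\u20ac".encode("utf-8"))

text = read_text(demo_path)
os.remove(demo_path)

# Three code points, one per euro sign, and no BOM.
print([ord(c) for c in text])
```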

> Output is a little more mixed.

It often is.

> I've just tried a UTF-8 file and getting some odd results. With a file
> containing [three euro symbols]:
> 
> €€€
> 
> (including a 3-byte utf-8 marker at the start), and opened in text mode,
> Python 3 gives me this series of bytes (ie. the ord() of each character):
> 
> 239
> 187
> 191
> 226
> 8218
> 172
> 226
> 8218
> 172
> 226
> 8218
> 172

Er, do you think that 8218 is a *byte*? (Hint: 1 byte = 8 bits, at least on
any platform you are likely to be running.)

Bart, you have a bad habit of giving us the output of your code, with an
implied "explain this", but without showing us the code you used to
generate the output. Without seeing the code you used, I have *no idea* how
you could get that result. If you read the file in binary, you should get
this:

b'\xef\xbb\xbf\xe2\x82\xac\xe2\x82\xac\xe2\x82\xac'

Or in decimal:

239, 187, 191, 226, 130, 172, 226, 130, 172, 226, 130, 172

How you are getting 8218 instead of 130, I have no idea!

If you read the file as text, but using the wrong encoding, say Latin-1, you
would get this:

'ï»¿â\x82¬â\x82¬â\x82¬'

or in decimal:

239, 187, 191, 226, 130, 172, 226, 130, 172, 226, 130, 172

Without seeing your code, I cannot possibly diagnose what you are doing.
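
For comparison, here is a sketch that decodes the same twelve bytes three
ways. The cp1252 case is only a guess at what may have happened, since the
original code was not shown: Windows-1252 is a common default text encoding
on Windows, and it maps byte 0x82 to U+201A (decimal 8218), whereas Latin-1
maps it to code point 130.

```python
# The raw file contents: UTF-8 BOM followed by three euro signs.
raw = b"\xef\xbb\xbf\xe2\x82\xac\xe2\x82\xac\xe2\x82\xac"

# Iterating over bytes in Python 3 yields integers in range(256).
as_bytes = list(raw)

# Latin-1 maps every byte to the code point with the same number.
as_latin1 = [ord(c) for c in raw.decode("latin-1")]

# cp1252 differs from Latin-1 in the 0x80-0x9F range: 0x82 becomes U+201A.
as_cp1252 = [ord(c) for c in raw.decode("cp1252")]

# The intended reading: strip the BOM and decode as UTF-8.
as_utf8 = [ord(c) for c in raw.decode("utf-8-sig")]

print(as_bytes)
print(as_latin1)
print(as_cp1252)
print(as_utf8)
```

The cp1252 list contains 8218 wherever the byte list has 130, and printing
the cp1252-decoded string gives a BOM rendered as three characters followed
by three copies of a-circumflex, low quote, not-sign.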

> And prints the resulting string as: ï»¿â‚¬â‚¬â‚¬. Although this latter
> might depend on my console's code page setting. 

That is very likely to be the reason for printing strange things. Life is
much easier on Linux and OS X, where the console works with UTF-8 by
default.
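
A small sketch for checking what encoding Python thinks the console uses,
and for printing text without garbling or raising UnicodeEncodeError when
the console cannot represent a character. The helper name console_safe is
hypothetical, and falling back to ASCII when no encoding is reported is an
assumption made for the example.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the encoding lacks shown as escapes."""
    if encoding is None:
        # Assumption: fall back to ASCII if stdout reports no encoding.
        encoding = getattr(sys.stdout, "encoding", None) or "ascii"
    # Unencodable characters become \xNN, \uNNNN or \UNNNNNNNN escapes.
    encoded = text.encode(encoding, errors="backslashreplace")
    return encoded.decode(encoding)


# The encoding Python selected for standard output.
print(sys.stdout.encoding)

# Prints euro signs on a UTF-8 console, escapes on one that lacks them.
print(console_safe("\u20ac\u20ac\u20ac"))
```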

> Changing it to UTF-8 however (CHCP 65001 in Windows) gives me this error
> when I run the program again:
> 
> ----------
> Fatal Python error: Py_Initialize: can't initialize sys standard streams
> LookupError: unknown encoding: cp65001
> 
> This application has requested the Runtime to terminate it in an unusual
> way.
> Please contact the application's support team for more information.
> ----------

I'm afraid I don't know how to deal with that. It's a Windows-specific
issue.

-- 
Steven