[Python-Dev] Deprecation warning on integer shifts and such

Oren Tirosh oren-py-d@hishome.net
Wed, 14 Aug 2002 10:25:23 +0300


On Tue, Aug 13, 2002 at 09:07:49AM +0200, Martin v. Loewis wrote:
> Oren Tirosh <oren-py-d@hishome.net> writes:
> > I think that this will produce the smallest number of
> > incompatibilities for existing code and maintain compatibility with
> > C header files on 32 bit platforms. In this case 0xff000000 will
> > always be interpreted as -16777216 and the 'i' parser will happily
> > convert it to either 0xFF000000 or 0xFFFFFFFFFF000000, depending on
> > the native platform word size - which is probably what the
> > programmer meant.
> 
> This means you suggest that PEP 237 is not implemented, or at least
> frozen at the current stage.

Not at all! Removing the differences between ints and longs is good. 
My reservations are about the hexadecimal representation.  The PEP
says:

    - Currently, the '%u', '%x', '%X' and '%o' string formatting
      operators and the hex() and oct() built-in functions behave
      differently for negative numbers: negative short ints are
      formatted as unsigned C long, while negative long ints are
      formatted with a minus sign.  This will be changed to use the
      long int semantics in all cases (but without the trailing 'L'
      that currently distinguishes the output of hex() and oct() for
      long ints).  Note that this means that '%u' becomes an alias for
      '%d'.  It will eventually be removed.

In Python up to 2.2 it's inconsistent between ints and longs:
>>> hex(-16711681)
'0xff00ffff'
>>> hex(-16711681L)
'-0xff0001L'		# ??!?!?

The hex representation of ints gives me useful information about their 
bit structure. After all, it is not immediately apparent to most mortals 
that the number above is a mask for bits 16-23.
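
The two views of this value can be reproduced in a current Python 3, where
every integer prints the way the old long did. This is only a minimal
sketch: the helper name as_unsigned is an assumption made for illustration
and is not part of PEP 237 or of this proposal.

```python
def as_unsigned(value, bits=32):
    """Return the two's-complement bit pattern of value as a non-negative int.

    Hypothetical helper for illustration only.
    """
    return value & ((1 << bits) - 1)


mask = -16711681

# Sign-and-magnitude form, the way longs are printed.
signed_view = hex(mask)

# Bit pattern of the same value in a 32-bit and in a 64-bit word.
pattern_32 = hex(as_unsigned(mask, 32))
pattern_64 = hex(as_unsigned(mask, 64))

print(signed_view, pattern_32, pattern_64)
```

Only the masked forms show at a glance that bits 16-23 are the ones that
are cleared.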

The hex representation of longs is something I find quite misleading, and 
I think it's also unprecedented.  This wart has bothered me for a long 
time, but I had no use for it so I didn't mind too much.  Now that it is 
proposed to extend this useless representation to ints, I do mind.

So we have two elements of the language that are inconsistent. One of 
them is in widespread use and the other is... ahem... 

Which one of them should be changed to conform to the other? 

My proposal: 

On 32 bit platforms:
>>> hex(-16711681)
'0xff00ffff'
>>> hex(-16711681L)
'0xff00ffff'

On 64 bit platforms:
>>> hex(-16711681)
'0xffffffffff00ffffLL'
>>> hex(-16711681L)
'0xffffffffff00ffffLL'

The 'LL' suffix means that this number is to be treated as a 64 bit
*signed* number. This is consistent with the way it is interpreted by 
GCC and other Unix compilers on both 32 and 64 bit platforms.  

What to do about numbers from 2**31 to 2**32-1?

>>> hex(4278255615)
'0xff00ffffU'

The U suffix, also borrowed from C, makes it unambiguous on 32 and 64 bit 
platforms for both Python and C. 

Representation of positive numbers:

 0x00000000   -         0x7fffffff   : unambiguous on all platforms
 0x80000000U  -         0xffffffffU  : representation adds U suffix
0x100000000LL - 0x7fffffffffffffffLL : representation adds LL suffix

Representation of negative numbers:
 0x80000000  - 0xffffffff (-2147483648 to -1):
	8 digits on 32 bit platforms
 0xffffffff80000000LL  - 0xffffffffffffffffLL  (same range):
	16 digits and LL suffix on 64 bit platforms

 Other negative numbers: 16 digits and LL suffix on all platforms.

This makes the hex representation of a number informative and consistent 
between int and long on all platforms. It is also consistent with the
C compiler on the same platform. Yes, it will produce a different text
representation of some numbers on different platforms but this conveys
important information about the bit structure of the number which really
is different between platforms. eval()ing it back to a number is still 
consistent.
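
The ranges listed above could be sketched roughly as follows in present-day
Python 3. The function name proposed_hex and the word_bits parameter are
assumptions made for illustration, and raising ValueError outside the 64
bit range is just a placeholder, since the representation of such numbers
is not settled here.

```python
def proposed_hex(value, word_bits=32):
    """Format value using the range rules sketched in this message.

    Hypothetical helper: word_bits stands in for the native word size.
    """
    if word_bits not in (32, 64):
        raise ValueError("word_bits must be 32 or 64")
    if not -(1 << 63) <= value < (1 << 63):
        # Placeholder behaviour; the proposal leaves this range open.
        raise ValueError("value is outside the 64 bit signed range")

    if value >= 0:
        if value < (1 << 31):
            return "0x%x" % value        # unambiguous everywhere
        if value < (1 << 32):
            return "0x%xU" % value       # needs U to stay unsigned
        return "0x%xLL" % value          # needs 64 bits

    if value >= -(1 << 31) and word_bits == 32:
        # Fits a signed 32 bit word: 8 digits, no suffix.
        return "0x%08x" % (value & 0xFFFFFFFF)
    # 64 bit platform, or too negative for 32 bits: 16 digits plus LL.
    return "0x%016xLL" % (value & 0xFFFFFFFFFFFFFFFF)


for bits in (32, 64):
    print(bits, proposed_hex(-16711681, bits), proposed_hex(4278255615, bits))
```

Note that the positive branches never look at word_bits, which is the
point of the U and LL suffixes: only negative numbers that fit in 32 bits
are printed differently on the two kinds of platform.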

When converting in the other direction (hex representation to number) 
there is an ambiguous range from 0x80000000 to 0xffffffff.  Should it be 
treated as signed or unsigned?  The current interpretation is signed. PEP
237 proposes to change it to unsigned. I propose to do neither - this range
should be deprecated and some explicit notation should be used instead.

There's no need to be in a hurry about deprecating it, though. The
overwhelming majority of Python code will run on 32 bit platforms for some
time yet.

I propose that on 32 bit platforms this will produce a silent warning. No 
code will break. Running the program with -Wall will inform the programmer 
that the code may not work for some future version of Python.

On 64 bit platforms this will be interpreted the same way as on a 32 bit 
platform (signed 32 bits) but produce a noisy warning.  If the code was 
written on a 64 bit platform and the programmer meant the number to be 
treated as unsigned, an explicit U suffix can be added to make it 
unambiguously unsigned. If the code was written on a 32 bit platform and 
the programmer meant the number to be treated as signed, it's possible to 
just live with the warning (the code should still run correctly) or add 8 
leading 'F's and an 'LL' suffix to make it unambiguously signed. The 
modified code will run without warning on both 32 and 64 bit platforms.
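
The reading direction could be sketched like this in present-day Python 3.
Everything named here is an assumption made for illustration: the function
parse_proposed_hex, the word_bits parameter, and the choice of
PendingDeprecationWarning (hidden by default) for the silent case and
FutureWarning (shown by default) for the noisy one. Unsuffixed literals
above 0xffffffff are simply returned as positive values, which is also an
assumption.

```python
import warnings


def parse_proposed_hex(text, word_bits=32):
    """Parse a hex literal with an optional U or LL suffix.

    Hypothetical helper: word_bits stands in for the native word size.
    """
    body = text.lower()
    if not body.startswith("0x"):
        raise ValueError("expected a literal starting with 0x")
    body = body[2:]

    if body.endswith("ll"):
        # LL: a 64 bit *signed* pattern, so the top bit is the sign.
        raw = int(body[:-2], 16)
        if raw >= (1 << 64):
            raise ValueError("LL literal does not fit in 64 bits")
        return raw - (1 << 64) if raw >= (1 << 63) else raw

    if body.endswith("u"):
        # U: explicitly unsigned, taken at face value.
        return int(body[:-1], 16)

    raw = int(body, 16)
    if 0x80000000 <= raw <= 0xFFFFFFFF:
        # Ambiguous range: keep the signed 32 bit reading, but warn.
        if word_bits == 32:
            category = PendingDeprecationWarning   # silent by default
        else:
            category = FutureWarning               # noisy by default
        warnings.warn(
            "ambiguous hex literal %s; add a U or LL suffix" % text,
            category,
            stacklevel=2,
        )
        return raw - (1 << 32)
    return raw


print(parse_proposed_hex("0xff00ffffU"))
print(parse_proposed_hex("0xffffffffff00ffffLL"))
```

Both suffixed spellings parse without any warning, whatever word_bits is,
which matches the claim that the modified code runs cleanly on both kinds
of platform.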

Notes: 

The number 4000000000 would be represented in hex as 0xEE6B2800U whether 
it's as an int on a 64 bit platform or a long on either 32 or 64 bit 
platforms.  The representation depends only on the numeric value, not the 
type. This proposal therefore does not contradict the purpose of PEP 237
because ints and longs are treated identically.

What's the hex representation of numbers outside the range of 64 bit 
integers? Frankly, I don't care.  I'll go with any proposed solution as
long as eval(hex(x)) == x.

On Microsoft platforms 64 bit literals use the suffix 'i64', not 'LL'.
Python may either use 'LL' exclusively or produce 'i64' on Microsoft
platforms and 'LL' on other platforms. In the latter case it should 
accept either suffix on all platforms.

Yes, this proposal is more complicated and has special treatment for
different ranges but that is because the issue is not trivial and cannot
be brushed aside using a one-size-doesn't-fit-anyone approach. This
reminds me a lot of Unicode issues.

What about the L suffix? This proposal adopts the LL and U suffixes from
C and ensures that they are interpreted consistently in both languages.
But the L suffix is not consistent with C for the range 0x80000000L to 
0xFFFFFFFFL. Should the L suffix be deprecated? Should it produce a 
warning for the possibly ambiguous range?

	Oren