[Python-Dev] issues with int/long on 64bit platforms - eg stringobject (PR#306)

Trent Mick trentm@ActiveState.com
Fri, 28 Apr 2000 15:08:57 -0700


> > Guido van Rossum wrote:
> > >
> > > The email below is a serious bug report.  A quick analysis shows that
> > > UserString.count() calls the count() method on a string object, which
> > > calls PyArg_ParseTuple() with the format string "O|ii".  The 'i'
> > > format code truncates integers.  It probably should raise an overflow
> > > exception instead.  But that would still cause the test to fail --
> > > just in a different way (more explicit).  Then the string methods
> > > should be fixed to use long ints instead -- and then something else
> > > would probably break...
> >
MAL wrote:
> > All uses in stringobject.c and unicodeobject.c use INT_MAX
> > together with integers, so there's no problem on that side
> > of the fence ;-)
> >
> > Since strings and Unicode objects use integers to describe the
> > length of the object (as well as most if not all other
> > builtin sequence types), the correct default value should
> > thus be something like sys.maxlen which then gets set to
> > INT_MAX.
> >
> > I'd suggest adding sys.maxlen and then modifying UserString.py,
> > re.py and sre_parse.py accordingly.
>
Guido wrote:
> Hm, I'm not so sure.  It would be much better if passing sys.maxint
> would just WORK...  Since that's what people have been doing so far.
>
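
For reference, the truncation happens because the 'i' converter effectively
fetches the value as a C long and then casts it down to a C int; on LP64
platforms (sizeof(long) == 8, sizeof(int) == 4) the high bits of sys.maxint
are silently dropped. Roughly (a paraphrase with a made-up helper name, not
the actual getargs.c source):

    #include "Python.h"

    /* Paraphrase only -- not the real getargs.c code. */
    static int
    convert_i(PyObject *arg)
    {
        long ival = PyInt_AsLong(arg);  /* == LONG_MAX when sys.maxint is passed */
        return (int)ival;               /* high 32 bits silently discarded       */
    }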

Possible solutions (I give 4 of them):


1. The 'i' format code could raise an overflow exception and the
PyArg_ParseTuple() call in string_count() could catch it and clamp the value
to INT_MAX (reasoning that any overflow of the end position of a string can
safely be clamped to INT_MAX, because that is the maximum length of any
string in Python). A sketch of the idea follows.
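
Concretely, the catch-and-clamp might look something like this. This is only
a rough sketch, not the actual stringobject.c code, and it presumes the 'i'
code has already been changed to raise OverflowError:

    static PyObject *
    string_count(PyStringObject *self, PyObject *args)
    {
        PyObject *subobj;
        int i = 0, last = INT_MAX;

        if (!PyArg_ParseTuple(args, "O|ii:count", &subobj, &i, &last)) {
            /* If anything other than an overflow went wrong, fail as usual. */
            if (!PyErr_ExceptionMatches(PyExc_OverflowError))
                return NULL;
            /* Otherwise assume (subtly and fragilely, see the cons below)
               that the overflow came from 'last', and clamp it to the
               largest possible string length. */
            PyErr_Clear();
            last = INT_MAX;
        }
        /* ... the rest of the existing counting code is unchanged ... */
    }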

Pros:
- This "would just WORK" for usage of sys.maxint.

Cons:
- This overflow-exception catching would then reasonably have to be
propagated to other similar functions (string.endswith(), etc.).
- We have to assume that the exception raised in the PyArg_ParseTuple(args,
"O|ii:count", &subobj, &i, &last) call is for the second integer (i.e.
'last'). This is subtle and ugly.

Pro or Con:
- Do we want to start raising overflow exceptions for the other conversion
formats as well (i.e. 'b', 'h' and 'l'; the latter *can* overflow on Win64,
where sizeof(long) < sizeof(void*))? I think this is a good idea in principle
but it may break code (even if it *does* identify bugs in that code).



2. Just change the definitions of the UserString methods to take a
variable-length argument list instead of default-valued parameters. For
example, change UserString.count() from:

    def count(self, sub, start=0, end=sys.maxint):
        return self.data.count(sub, start, end)

to:

    def count(self, *args):
        return self.data.count(*args)

The result is that the default value for 'end' is now set by string_count()
rather than by the UserString implementation:
>>> from UserString import UserString
>>> s = 'abcabcabc'
>>> u = UserString('abcabcabc')
>>> s.count('abc')
3
>>> u.count('abc')
3

Pros:
- Easy change.
- Fixes the immediate bug.
- This is a safer way to copy the string behaviour in UserString anyway (is
it not?).

Cons:
- Does not fix the general problem of the (common?) use of sys.maxint as if
it meant INT_MAX, when it is actually LONG_MAX (the difference matters on
64-bit Unices).
- The UserString code is no longer really self-documenting.



3. As MAL suggested: add something like sys.maxlen (set to INT_MAX), which
makes explicit the logical difference from sys.maxint (set to LONG_MAX):
 - sys.maxint == "the largest value a Python integer can hold"
 - sys.maxlen == "the largest value for the length of an object in Python
(e.g. length of a string, length of an array)"
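
The C side of this is tiny: one more entry in Python/sysmodule.c next to the
existing "maxint" one, something like the following (a sketch only; it
assumes the same style as the existing "maxint" entry):

    PyDict_SetItemString(sysdict, "maxlen",
                         v = PyInt_FromLong((long)INT_MAX));
    Py_XDECREF(v);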

Pros:
- More explicit in that it separates two distinct meanings for sys.maxint
(which now makes a difference on 64-bit Unices).
- The code changes should be fairly straightforward.

Cons:
- Places in the code that still use sys.maxint where they should use
sys.maxlen will keep silently overflowing C ints and triggering this bug.
- Something else for coders to know about.



4. Add something like sys.maxlen, but set it to SIZET_MAX (cf. the ANSI
size_t type). It is probably not a biggie, but Python currently assumes that
strings never exceed INT_MAX in length. While this assumption is not likely
to be proven false, it technically could be on 64-bit systems. As well, when
you start compiling on Win64 (where sizeof(int) == sizeof(long) <
sizeof(size_t)) you are going to be annoyed by hundreds of warnings about
implicit casts from size_t (64 bits) to int (32 bits) for every strlen, str*,
fwrite, and sizeof call that you make.
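
To illustrate the warning problem: code of this shape, which appears all over
the interpreter and extension modules, triggers one of those truncation
warnings (a made-up example, not a quote from the Python sources):

    #include <string.h>

    /* On Win64, sizeof(size_t) == 8 but sizeof(int) == 4, so the
       assignment below draws an implicit-truncation warning. */
    static int
    example_length(const char *s)
    {
        int len = strlen(s);    /* warning: conversion from size_t to int */
        return len;
    }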

Pros:
- IMHO logically more correct.
- Might clean up some subtle bugs.
- Cleans up annoying and disconcerting warnings.
- Will probably mean less pain down the road as 64-bit systems (esp. Win64)
become more prevalent.

Cons:
- Lot of coding changes.
- As Guido said: "and then something else would probably break". (Though on
current 32-bit systems there should be no effective change.) Only 64-bit
systems should be affected and, I would hope, the effect there would be a
cleanup.


I apologize for not being succinct. Note that I am volunteering here.
Opinions and guidance please.

Trent