classification
Title: Compromise PyUnicode_EncodeUTF8
Type:
Stage:
Components: Interpreter Core
Versions: Python 2.3
process
Status: closed
Resolution: accepted
Dependencies:
Superseder:
Assigned To: tim.peters
Nosy List: lemburg, tim.peters
Priority: normal
Keywords: patch

Created on 2002-04-27 04:35 by tim.peters, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Files
File name Uploaded Description
utf8.patch tim.peters, 2002-04-27 04:35 PyUnicode_EncodeUTF8 replacement
Messages (4)
msg39733 - Author: Tim Peters (tim.peters) (Python committer) Date: 2002-04-27 04:35
This combines various ideas from Python-Dev.  It 
overallocates, but:

1) For short strings it does the conversion into a 
stack buffer, and allocates exactly as much string 
space as it turns out it needs at the end.  So it 
should be faster, but not waste any small-block memory.

2) For long strings it knows it's going to end up in 
the system malloc/realloc, so it asks for the maximum 
possibly needed at the start, returning the excess 
untouched at the end.  This gets rid of all the 
embedded "but did I really get enough memory yet?" 
tests and reallocations.
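
The two cases above can be sketched roughly as follows. This is a minimal illustration, not the code in utf8.patch: the function name `encode_utf8_sketch`, the helper `put_utf8`, the threshold `MAX_SHORT_CHARS`, and the use of plain `malloc`/`realloc` on an array of `uint32_t` code points (instead of the PyString routines and Py_UNICODE) are all assumptions made for the example. Surrogates and out-of-range values are not validated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SHORT_CHARS 300  /* hypothetical cutoff for the stack-buffer path */

/* Write one code point as UTF-8; return a pointer just past the last byte. */
static unsigned char *put_utf8(unsigned char *p, uint32_t ch)
{
    if (ch < 0x80) {
        *p++ = (unsigned char)ch;
    } else if (ch < 0x800) {
        *p++ = (unsigned char)(0xC0 | (ch >> 6));
        *p++ = (unsigned char)(0x80 | (ch & 0x3F));
    } else if (ch < 0x10000) {
        *p++ = (unsigned char)(0xE0 | (ch >> 12));
        *p++ = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
        *p++ = (unsigned char)(0x80 | (ch & 0x3F));
    } else {
        *p++ = (unsigned char)(0xF0 | (ch >> 18));
        *p++ = (unsigned char)(0x80 | ((ch >> 12) & 0x3F));
        *p++ = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
        *p++ = (unsigned char)(0x80 | (ch & 0x3F));
    }
    return p;
}

/* Returns a malloc'ed, NUL-terminated UTF-8 string, or NULL on failure. */
char *encode_utf8_sketch(const uint32_t *s, size_t size, size_t *out_len)
{
    unsigned char stackbuf[4 * MAX_SHORT_CHARS];
    unsigned char *buf, *p;
    char *result;
    size_t i, n;

    if (size <= MAX_SHORT_CHARS) {
        buf = stackbuf;               /* case 1: convert on the stack */
    } else {
        if (size > (SIZE_MAX - 1) / 4)
            return NULL;              /* 4*size + 1 would overflow */
        buf = malloc(4 * size + 1);   /* case 2: ask for the maximum up front */
        if (buf == NULL)
            return NULL;
    }

    /* No "did I get enough memory yet?" tests inside the loop. */
    p = buf;
    for (i = 0; i < size; i++)
        p = put_utf8(p, s[i]);
    n = (size_t)(p - buf);
    assert(n <= 4 * size);            /* debug-build check of the bound */

    if (buf == stackbuf) {
        result = malloc(n + 1);       /* allocate exactly what was needed */
        if (result == NULL)
            return NULL;
        memcpy(result, buf, n);
    } else {
        result = realloc(buf, n + 1); /* hand back the untouched excess */
        if (result == NULL)
            result = (char *)buf;     /* shrink failed; original block is still valid */
    }
    result[n] = '\0';
    *out_len = n;
    return result;
}
```

The caller owns the returned buffer and releases it with `free`. Both paths share one conversion loop; only the allocation strategy differs.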
msg39734 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2002-04-27 14:53
Cool. I like it.

You'd better make sure the stack buffer doesn't overrun, though.
I've only skimmed the implementation, but would suggest adding
an explicit test for this that is executed in release builds
too, not only in the debug build.
msg39735 - Author: Tim Peters (tim.peters) (Python committer) Date: 2002-04-27 17:41
Well, the overallocation is exactly the same whether it's 
on the stack or on the heap:  where size is the # of 
Unicode characters, it's guaranteed that 4*size bytes are 
available for writing.  The PyString_xyz routines guarantee 
to make an additional byte available to store a trailing 
\0, and indeed they add a trailing \0 automatically.

So the only question remaining is whether 4*size is a 
correct upper bound.  I think it's clear enough from your 
code that it is, and so I'm happy to leave verification of 
that to the debug build.  What it could use more is runtime 
release-build verification that 4*size doesn't overflow a C 
int.
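
One way to check at runtime that 4*size fits in a C int is to compare size against INT_MAX / 4 before multiplying, so the multiplication itself can never overflow. The sketch below is an assumption about how such a check could look, not the check that was committed; the function name `worst_case_utf8_bytes` is hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: store the worst-case UTF-8 byte count (4 per
 * character) in *nbytes and return 1, or return 0 if 4*size would not
 * fit in an int.  The comparison is done before the multiplication,
 * so no signed overflow can occur. */
int worst_case_utf8_bytes(int size, int *nbytes)
{
    if (size < 0 || size > INT_MAX / 4)
        return 0;
    *nbytes = 4 * size;
    return 1;
}
```

Because the test is an ordinary `if` rather than an `assert`, it stays active in release builds, which is the property asked for above.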
msg39736 - Author: Tim Peters (tim.peters) (Python committer) Date: 2002-04-27 18:05
I added runtime release-build verification that 4*size 
doesn't overflow a C int, and cleaned up the patch a 
little.  Since you and Martin both seem basically happy 
with it, I just checked it in:

Objects/unicodeobject.c new revision: 2.146
History
Date User Action Args
2022-04-10 16:05:16 admin set github: 36510
2002-04-27 04:35:55 tim.peters create