Issue 549375: Compromise PyUnicode_EncodeUTF8

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/36510

classification

Title:	Compromise PyUnicode_EncodeUTF8
Type:		Stage:
Components:	Interpreter Core	Versions:	Python 2.3

process

Status:	closed	Resolution:	accepted
Dependencies:		Superseder:
Assigned To:	tim.peters	Nosy List:	lemburg, tim.peters
Priority:	normal	Keywords:	patch

Created on 2002-04-27 04:35 by tim.peters, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Files
File name	Uploaded	Description
utf8.patch	tim.peters, 2002-04-27 04:35	PyUnicode_EncodeUTF8 replacement

Messages (4)
msg39733	Author: Tim Peters (tim.peters) *	Date: 2002-04-27 04:35
This combines various ideas from Python-Dev. It overallocates, but:
1) For short strings it does the conversion into a stack buffer, and allocates exactly as much string space as it turns out it needs at the end. So it should be faster, but not waste any small-block memory.
2) For long strings it knows it's going to end up in the system malloc/realloc, so it asks for the maximum possibly needed at the start, returning the excess untouched at the end.
This gets rid of all the embedded "but did I really get enough memory yet?" tests and reallocations.
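The two-path strategy described in this message can be sketched roughly as follows. This is a minimal standalone illustration, not the code in utf8.patch: the names `encode_utf8_sketch`, `put_utf8` and `MAX_SHORT_CHARS` are hypothetical, the threshold value is an arbitrary assumption, the input is assumed to be an array of 32-bit code points no larger than 0x10FFFF, and plain malloc/realloc stand in for the PyString routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical threshold for the "short string" path; not taken from the patch. */
#define MAX_SHORT_CHARS 300

/* Write one code point at p and return the number of bytes used (1 to 4).
   Assumes cp <= 0x10FFFF; surrogates get no special handling in this sketch. */
static size_t put_utf8(unsigned char *p, uint32_t cp)
{
    if (cp < 0x80) {
        p[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        p[0] = (unsigned char)(0xC0 | (cp >> 6));
        p[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        p[0] = (unsigned char)(0xE0 | (cp >> 12));
        p[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        p[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    p[0] = (unsigned char)(0xF0 | (cp >> 18));
    p[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    p[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    p[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Return a malloc'ed, NUL-terminated UTF-8 buffer and store its length
   (excluding the NUL) in *out_len. Returns NULL on failure. */
char *encode_utf8_sketch(const uint32_t *s, size_t size, size_t *out_len)
{
    unsigned char stackbuf[4 * MAX_SHORT_CHARS];
    unsigned char *buf, *p;
    char *result;
    size_t i, n;
    int on_heap = size > MAX_SHORT_CHARS;

    if (on_heap) {
        /* Long input: ask for the maximum possibly needed up front. */
        if (size > (SIZE_MAX - 1) / 4)
            return NULL;
        buf = malloc(4 * size + 1);
        if (buf == NULL)
            return NULL;
    } else {
        /* Short input: convert into the stack buffer first. */
        buf = stackbuf;
    }

    /* No "did I get enough memory yet?" checks inside the loop. */
    p = buf;
    for (i = 0; i < size; i++)
        p += put_utf8(p, s[i]);
    n = (size_t)(p - buf);
    assert(n <= 4 * size);  /* debug-build check of the upper bound */

    if (on_heap) {
        /* Give back the unused tail; keep the original block if shrinking fails. */
        result = realloc(buf, n + 1);
        if (result == NULL)
            result = (char *)buf;
    } else {
        /* Allocate exactly what turned out to be needed. */
        result = malloc(n + 1);
        if (result == NULL)
            return NULL;
        memcpy(result, buf, n);
    }
    result[n] = '\0';
    *out_len = n;
    return result;
}
```

The caller owns the returned buffer and releases it with `free`. The encoding loop is identical on both paths; only where the scratch space lives and how the final block is obtained differ.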
msg39734	Author: Marc-Andre Lemburg (lemburg) *	Date: 2002-04-27 14:53
Cool. I like it. You'd better make sure the stack buffer doesn't overrun though -- I've only skimmed the implementation, but would suggest adding an explicit test for this which is executed not only in the debug build.
msg39735	Author: Tim Peters (tim.peters) *	Date: 2002-04-27 17:41
Well, the overallocation is exactly the same whether it's on the stack or on the heap: where size is the # of Unicode characters, it's guaranteed that 4*size bytes are available for writing. The PyString_xyz routines guarantee to make an additional byte available to store a trailing \0, and indeed they add a trailing \0 automatically. So the only question remaining is whether 4*size is a correct upper bound. I think it's clear enough from your code that it is, and so I'm happy to leave verification of that to the debug build. What it could use more is runtime release-build verification that 4*size doesn't overflow a C int.
msg39736	Author: Tim Peters (tim.peters) *	Date: 2002-04-27 18:05
I added runtime release-build verification that 4*size doesn't overflow a C int, and cleaned up the patch a little. Since you and Martin both seem basically happy with it, I just checked it in:
Objects/unicodeobject.c new revision: 2.146
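The check that was actually committed is not quoted in this thread. One common way to verify that 4*size fits in a C int, shown here only as a sketch with the hypothetical helper name `max_utf8_bytes`, is to compare size against INT_MAX / 4 before multiplying, so the overflow never happens in the first place:

```c
#include <limits.h>

/* Hypothetical helper: store 4*size in *nbytes and return 1 when the
   product fits in an int; return 0 (leaving *nbytes untouched) otherwise. */
int max_utf8_bytes(int size, int *nbytes)
{
    if (size < 0 || size > INT_MAX / 4)
        return 0;
    *nbytes = 4 * size;
    return 1;
}
```

Because the comparison runs unconditionally rather than inside an assert, it stays active in release builds, which is the property asked for above.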

History
Date	User	Action	Args
2022-04-10 16:05:16	admin	set	github: 36510
2002-04-27 04:35:55	tim.peters	create