Issue 1101726: Patch for potential buffer overrun in tokenizer.c

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/41437

classification

Title:	Patch for potential buffer overrun in tokenizer.c
Type:		Stage:
Components:	Interpreter Core	Versions:	Python 2.4

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	loewis	Nosy List:	doerwalter, glchapman, loewis
Priority:	high	Keywords:	patch

Created on 2005-01-13 15:45 by glchapman, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
tokenizer.c.3.diff	glchapman, 2005-02-27 15:22	patch without utf8 manipulation
pep263tests.zip	glchapman, 2005-02-27 15:23	test files
test_pep263.py.diff	glchapman, 2005-03-21 13:47
tokenizer.c.4.diff	doerwalter, 2005-05-16 14:11

Messages (9)
msg47529 - (view)	Author: Greg Chapman (glchapman)	Date: 2005-01-13 15:45
The fp_readl function in tokenizer.c has a potential buffer overrun; see: www.python.org/sf/1089395 It is also triggered by trying to generate the pycom file for the Excel 11.0 typelib, which is where I ran into it. The attached patch allows successful generation of the Excel file; it also runs the fail.py script from the above report without an access violation. It also doesn't break any of the tests in the regression suite (probably something like fail.py should be added as a test). It is not as efficient as it might be; with a function for determining the number of unicode characters in a utf8 string, you could avoid some memory allocations. Perhaps such a function should be added to unicodeobject.c? And, of course, the patch definitely needs review. I'm especially concerned that my use of tok->decoding_buffer might be violating some kind of assumption that I missed.
msg47530 - (view)	Author: Greg Chapman (glchapman)	Date: 2005-01-23 14:26
Logged In: YES user_id=86307 I'm attaching a new patch (tokenizer.c.2.diff), which should be better since it avoids some unnecessary allocations and decoding/encoding. Hopefully, I haven't made any unwarranted assumptions about UTF8.
msg47531 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2005-02-23 19:48
Logged In: YES user_id=89016 I think the patch looks good. Staring at it for a while I believe the reference counts are OK. Can you incorporate fail.py into the test suite?
msg47532 - (view)	Author: Greg Chapman (glchapman)	Date: 2005-02-27 15:22
Logged In: YES user_id=86307 After thinking about it some more, I realized that fp_readl doesn’t have to do any utf8 manipulation. If the string it copies into the buffer does not include a \n, the buffer is expanded and fp_readl is called again until a \n is returned (or the end of the file is reached). It doesn’t matter if, during this loop, the buffer contains temporarily broken utf8 characters. So I’m attaching another new patch. This copies as many characters as fit into the buffer and saves any overflow into a new string object (stored in tok >decoding_buffer). When it is called again by the loop, it uses this overflow string to copy characters into the buffer. I’m also attaching a zip file with a modified test_pep263.py and some ancillary files. To test fp_readl, python needs to read from a file. Perhaps it would be better to generate temporary files rather than include these extra files. Also, although the tests will trigger the assert using a debug build of the 2.4 source, I’m not getting an access violation running regrtest against a release build (of 2.4) – perhaps different tests are in order?
msg47533 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2005-03-14 21:43
Logged In: YES user_id=89016 Maybe I've put the test files in the wrong spot, but what I'm getting is: > ./python Lib/test/test_pep263.py test_pep263 (__main__.PEP263Test) ... ok test_evilencoding (__main__.EvilEncodingTest) ... ok Segmentation fault You should probably put pep263_evilencoding.py and pep263_longline.py alongside test_pep263.py and then find then via os.path.join(__file__.rsplit(os.sep, 1)[0], "pep263_longline.py")
msg47534 - (view)	Author: Greg Chapman (glchapman)	Date: 2005-03-21 13:47
Logged In: YES user_id=86307 You're quite right, 2.4 AV's reliably using the new test_pep263. The copy of python24.dll I was testing against already had (the first version of) the patch applied. Anyway, I'm attaching a diff for test_pep263 (against the original 2.4 test_pep263) which incorporates your suggestions about not assuming the current directory.
msg47535 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2005-05-16 14:11
Logged In: YES user_id=89016 Here is another version of the patch, this version doesn't pass any arguments to readline(), because it's no longer neccessary to limit the line length. What do you think?
msg47536 - (view)	Author: Greg Chapman (glchapman)	Date: 2005-05-17 13:47
Logged In: YES user_id=86307 Looks good to me.
msg47537 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2005-07-12 21:59
Logged In: YES user_id=89016 Checked in as: Parser/tokenizer.c 2.78 Parser/tokenizer.c 2.76.2.1

History
Date	User	Action	Args
2022-04-11 14:56:09	admin	set	github: 41437
2005-01-13 15:45:20	glchapman	create