[New-bugs-announce] [issue22649] Use _PyUnicodeWriter in case_operation()

Wed Oct 15 22:30:45 CEST 2014

New submission from STINNER Victor:

The case_operation() function in Objects/unicodeobject.c implements the case operations: lower, upper, casefold, etc.

Currently, the function uses a buffer of Py_UCS4 and overallocates the buffer by 300%. The function sizes the buffer for the worst case: every character replaced with 3 characters.
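
The worst case can be checked from Python itself. The following is only a rough sketch: max_expansion() is a helper name made up for this message, and it scans every code point, so it takes a moment to run.

```python
import sys


def max_expansion(op):
    """Return the largest len(op(ch)) over every single code point ch.

    Illustrative helper only; it is not part of CPython.
    """
    return max(len(op(chr(cp))) for cp in range(sys.maxunicode + 1))


if __name__ == "__main__":
    # Print the largest per-character growth for a few case operations.
    for op in (str.lower, str.upper, str.casefold):
        print(op.__name__, max_expansion(op))
```

A character such as U+FB03 (the "ffi" ligature) reaches the limit for upper(): one input character becomes three output characters.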

I propose to use the _PyUnicodeWriter API to be able to optimize the most common case: each character is replaced by exactly one other character, and the output string uses the same unicode kind (UCS1, UCS2 or UCS4).

The patch preallocates the writer using the kind of the input string, but in some cases, the result uses a lower kind (e.g. latin1 => ASCII). "Special" characters taking the slow path from unit tests:

- test_capitalize: 'ﬁnnish' => 'FInnish' (ascii)
- test_casefold: 'ß' => 'ss', 'ﬁ' => 'fi'
- test_swapcase: 'ﬁ' => 'FI', 'ß' => 'SS'
- test_title: 'ﬁNNISH' => 'Finnish'
- test_upper: 'ﬁ' => 'FI', 'ß' => 'SS'
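
To see the kind change from Python, here is a small sketch. storage_class() is a hypothetical helper written for this message: it only looks at the highest code point in a string and does not inspect the real internal representation.

```python
def storage_class(s):
    """Name the narrowest storage that could hold s (hypothetical helper)."""
    top = max(map(ord, s), default=0)
    if top < 0x80:
        return "ascii"
    if top < 0x100:
        return "latin1"
    if top < 0x10000:
        return "ucs2"
    return "ucs4"


# "\xdf" is the sharp s, "\ufb01" is the "fi" ligature.
for src, op in [("\xdf", str.upper), ("\xdf", str.casefold),
                ("\ufb01", str.upper), ("\ufb01", str.casefold)]:
    out = op(src)
    print(op.__name__, ascii(src), storage_class(src),
          "=>", ascii(out), storage_class(out))
```

In each of these cases the result fits in a narrower storage than the input, so a writer preallocated with the input kind ends up wider than the result needs.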

The writer only uses overallocation if a character is replaced by more than one character. Bad cases where the length changes:

- test_capitalize: 'ῳῳῼῼ' => 'ΩΙῳῳῳ', 'hİ' => 'Hi̇', 'ῒİ' => 'Ϊ̀i̇', 'ﬁnnish' => 'FInnish'
- test_casefold: 'ß' => 'ss', 'ﬁ' => 'fi'
- test_lower: 'İ' => 'i̇'
- test_swapcase: 'ﬁ' => 'FI', 'İ' => 'i̇', 'ß' => 'SS', 'ῒ' => 'Ϊ̀'
- test_title: 'ﬁNNISH' => 'Finnish'
- test_upper: 'ﬁ' => 'FI', 'ß' => 'SS', 'ῒ' => 'Ϊ̀'
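
Some of the length-changing cases above can be reproduced with a short script. This is a sketch, not part of the patch: CASES and grows() are names made up here, and the capitalize examples are left out on purpose because capitalize() changed in Python 3.8 to titlecase the first character, so its results depend on the Python version.

```python
# (operation, input, expected output), written with escapes for clarity.
CASES = [
    (str.casefold, "\xdf", "ss"),                       # sharp s
    (str.casefold, "\ufb01", "fi"),                     # fi ligature
    (str.lower, "\u0130", "i\u0307"),                   # I with dot above
    (str.swapcase, "\u0130", "i\u0307"),
    (str.title, "\ufb01NNISH", "Finnish"),
    (str.upper, "\ufb01", "FI"),
    (str.upper, "\xdf", "SS"),
    (str.upper, "\u1fd2", "\u0399\u0308\u0300"),        # iota, dialytika, varia
]


def grows(op, text):
    """Return True if applying op makes text longer (illustrative helper)."""
    return len(op(text)) > len(text)


for op, src, expected in CASES:
    result = op(src)
    print(op.__name__, ascii(src), "=>", ascii(result),
          "ok" if result == expected else "MISMATCH",
          "grows" if grows(op, src) else "same length")
```

An ordinary input such as "hello" does not grow under upper(), which is the common case the writer is meant to handle without overallocation.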

----------
files: case_writer.patch
keywords: patch
messages: 229497
nosy: haypo
priority: normal
severity: normal
status: open
title: Use _PyUnicodeWriter in case_operation()
type: performance
versions: Python 3.5
Added file: http://bugs.python.org/file36942/case_writer.patch

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue22649>
_______________________________________