[issue10542] Py_UNICODE_NEXT and other macros for surrogates

Sun Nov 28 01:46:51 CET 2010

Ezio Melotti <ezio.melotti at gmail.com> added the comment:

AFAIU the macro returns lone surrogates as they are, this means that:
  1) if the string contains only surrogate pairs, Py_UNICODE_NEXT will iterate on scalar values[0];
  2) if the string contains only lone surrogates, it will iterate on codepoints[1];
  3) if it contains both it will be half and half (i.e. scalar values if the surrogates are in pair, or falling back on codepoints if they aren't);
(for strings without surrogates, iterating on scalar values or codepoints is the same).

Is this semantic correct for all (or at least most of) the places where the macro will be used?
Would a stricter version (that rejects lone surrogates and iterates on scalar values only) be useful in addition or in alternative to Py_UNICODE_NEXT?

[0]: http://unicode.org/glossary/#unicode_scalar_value
[1]: http://unicode.org/glossary/#code_point

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue10542>
_______________________________________