[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

Thu Sep 8 20:56:11 CEST 2011

Terry J. Reedy <tjreedy at udel.edu> added the comment:

On 9/8/2011 4:32 AM, Ezio Melotti wrote:
> So to summarize a bit, there are different possible levels of strictness:
>    1) all the possible encodable values, including the ones > 10FFFF;
>    2) values in range 0..10FFFF;
>    3) values in range 0..10FFFF except surrogates (aka scalar values);
>    4) values in range 0..10FFFF except surrogates and noncharacters;
>
> and this is what is currently available in Python:
>    1) not available, probably it will never be;
>    2) available through the 'surrogatepass' error handler;
>    3) default behavior (i.e. with the 'strict' error handler);
>    4) currently not available.
>
> Now, assume that we don't care about option 1 and want to implement the missing option 4 (which I'm still not 100% sure about).  The possible options are:
>    * add a new codec (actually one for each UTF encoding);
>    * add a new error handler that explicitly disallows noncharacters;
>    * change the meaning of 'strict' to match option 4.
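
For reference, the current behavior described in the quoted summary can be
checked with a short sketch. It assumes Python 3 and the UTF-8 codec; the
variable names are only illustrative.

```python
# A lone surrogate is not a scalar value, so 'strict' (option 3) rejects it.
lone_surrogate = "\ud800"
try:
    lone_surrogate.encode("utf-8")  # errors='strict' is the default
    strict_accepts_surrogate = True
except UnicodeEncodeError:
    strict_accepts_surrogate = False

# 'surrogatepass' (option 2) lets the same code point through.
passed = lone_surrogate.encode("utf-8", "surrogatepass")

# A noncharacter such as U+FFFE is a scalar value, so 'strict' encodes it.
# Option 4 would reject it, which is why option 4 is "currently not available".
nonchar_bytes = "\ufffe".encode("utf-8")

print(strict_accepts_surrogate, passed, nonchar_bytes)
```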

If 'strict' meant option 4, then 'scalarpass' could mean option 3.
'surrogatepass' would then mean 'pass surrogates also, in addition to
noncharacter scalar values'.
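
A minimal sketch of how those three names would nest, assuming Python 3.
The helper encode_utf8 and its 'level' argument are hypothetical and exist
only for illustration; 'scalarpass' is not an existing Python error handler,
and 'strict' here has the proposed meaning (option 4), not the current one.

```python
def is_surrogate(cp):
    # Surrogate code points: U+D800..U+DFFF.
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp):
    # The 66 noncharacters: U+FDD0..U+FDEF, plus the last two code points
    # of each plane (U+xxFFFE and U+xxFFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def encode_utf8(text, level="strict"):
    """Hypothetical helper illustrating the proposed naming.

    'strict'        -> option 4: reject surrogates and noncharacters
    'scalarpass'    -> option 3: reject surrogates only
    'surrogatepass' -> option 2: accept everything in 0..10FFFF
    """
    if level not in ("strict", "scalarpass", "surrogatepass"):
        raise ValueError("unknown level: %r" % (level,))
    for i, ch in enumerate(text):
        cp = ord(ch)
        if level == "strict" and is_noncharacter(cp):
            raise UnicodeEncodeError("utf-8", text, i, i + 1,
                                     "noncharacters not allowed")
        if level != "surrogatepass" and is_surrogate(cp):
            raise UnicodeEncodeError("utf-8", text, i, i + 1,
                                     "surrogates not allowed")
    # Validation is done above, so the real codec only has to serialize.
    return text.encode("utf-8", "surrogatepass")
```

Each level accepts everything the one before it accepts, so a caller would
loosen the check by moving from 'strict' to 'scalarpass' to 'surrogatepass'.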

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________