[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Terry J. Reedy
report at bugs.python.org
Thu Sep 8 20:56:11 CEST 2011
Terry J. Reedy <tjreedy at udel.edu> added the comment:
On 9/8/2011 4:32 AM, Ezio Melotti wrote:
> So to summarize a bit, there are different possible levels of strictness:
> 1) all the possible encodable values, including the ones > 10FFFF;
> 2) values in range 0..10FFFF;
> 3) values in range 0..10FFFF except surrogates (aka scalar values);
> 4) values in range 0..10FFFF except surrogates and noncharacters;
>
> and this is what is currently available in Python:
> 1) not available, probably it will never be;
> 2) available through the 'surrogatepass' error handler;
> 3) default behavior (i.e. with the 'strict' error handler);
> 4) currently not available.
>
> Now, assume that we don't care about option 1 and want to implement the missing option 4 (which I'm still not 100% sure about). The possible options are:
> * add a new codec (actually one for each UTF encoding);
> * add a new error handler that explicitly disallows noncharacters;
> * change the meaning of 'strict' to match option 4;
If 'strict' meant option 4, then 'scalarpass' could mean option 3.
'surrogatepass' would then mean 'pass surrogates also, in addition to
non-character scalars'.
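
To make the levels concrete, here is a minimal sketch of how options 2-4
partition the code points, plus what the UTF-8 codec does today. The helper
names (is_surrogate, is_noncharacter, strictness_level, try_encode) are
made up for illustration and are not part of Python; 'scalarpass' is only
the name proposed above and does not exist as an error handler.

```python
def is_surrogate(cp):
    """True for the surrogate code points U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp):
    """True for the 66 Unicode noncharacters.

    These are U+FDD0..U+FDEF plus the last two code points of each
    of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF).
    """
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def strictness_level(cp):
    """Return the strictest option (2, 3 or 4) that still admits cp."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside 0..10FFFF (only option 1 would allow it)")
    if is_surrogate(cp):
        return 2   # admitted only when surrogates are passed through
    if is_noncharacter(cp):
        return 3   # a scalar value, but a noncharacter
    return 4       # admitted at every level


def try_encode(text, errors="strict"):
    """Encode to UTF-8, returning None if the codec rejects the text."""
    try:
        return text.encode("utf-8", errors)
    except UnicodeEncodeError:
        return None


# Current behaviour: 'strict' is option 3, 'surrogatepass' is option 2.
print(try_encode("\ud800"))                   # lone surrogate, rejected
print(try_encode("\ud800", "surrogatepass"))  # lone surrogate, passed
print(try_encode("\uffff"))                   # noncharacter, accepted
```

Under the renaming suggested above, the third call would fail with 'strict'
and would need 'scalarpass' (or 'surrogatepass') to succeed.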
----------
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________