[issue40980] group names of bytes regexes are strings

Quentin Wenger report at bugs.python.org
Tue Jun 16 16:37:53 EDT 2020


Quentin Wenger <wenger.quentin at bluewin.ch> added the comment:

You questioned my knowledge of encodings. Let's quote from one of the most famous introductory articles on the subject (https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/):

> It does not make sense to have a string without knowing what encoding it uses

So I have a bytestring that comes from somewhere. Maybe it was originally encoded as utf-8 or cp1250 or ..., but I won't tell or don't know; the only thing I swear is that it was originally a valid Python identifier.
Now I pass it as a group name in re.match (it was a valid Python identifier, so that has to be alright per the docs) and I get back a (unicode) string.
re.match, how dare you give me back a string when _you have no clue what my bytestring originally represented, or rather what it was originally encoded with_?
Maybe re.match will even crash, because it wrongly assumes the bytestring to have been latin-1 encoded!
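
To make the complaint concrete, here is a minimal sketch of the behaviour in question. The group name "word" and the input b"hello" are made up for illustration and are deliberately plain ASCII, so the result should not depend on how a particular Python version treats non-ASCII group names.

```python
import re

# A bytes pattern with a named group; "word" is an arbitrary, illustrative name.
match = re.match(rb"(?P<word>\w+)", b"hello")
groups = match.groupdict()

group_name = next(iter(groups))
print(type(group_name).__name__)          # the group name comes back as str
print(type(groups[group_name]).__name__)  # the matched value stays bytes
```

Everything went in as bytes, yet the key of the resulting dict is a (unicode) string.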

So: latin-1 is an arbitrary choice that is no better than any other, and the fact that it "naturally" converts bytes to unicode code points is an implementation detail.
If you want to keep it so, it ought (cf. the quote above) to be made clear in the docs that group names come out as strings decoded as latin-1, with all the restrictions that follow from that choice.
But the more logical way would be to renounce this arbitrary encoding altogether.
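
As a rough illustration of why the choice of latin-1 matters, the sketch below takes a valid identifier, encodes it as utf-8, and decodes the bytes both ways. It does not call re at all, and the identifier "é" is just an assumed example; it only shows what a blanket latin-1 decoding does to bytes that were never latin-1.

```python
identifier = "é"                  # a valid Python identifier (illustrative)
raw = identifier.encode("utf-8")  # two bytes: 0xC3 0xA9

as_utf8 = raw.decode("utf-8")
as_latin1 = raw.decode("latin-1")

# latin-1 never fails: each byte becomes the code point with the same value.
print([ord(c) for c in as_latin1] == list(raw))  # True
print(as_utf8.isidentifier())                    # True
print(as_latin1.isidentifier())                  # False
```

The same bytes give a valid identifier under one assumption and an invalid one under the other, and nothing in the bytestring itself says which assumption is right.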

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue40980>
_______________________________________

