[issue40980] group names of bytes regexes are strings

Tue Jun 16 20:26:30 EDT 2020

Quentin Wenger <wenger.quentin at bluewin.ch> added the comment:

I just had an "aha moment": What re claims is that, rather than doing as I suggested:

> ```
> # consider the following bytestring pattern
> >>> p = b"(?P<\xc3\xba>)"
> 
> # what character does the group name correspond to?
> # maybe we can try to infer it by decoding the bytestring?
> # let's try to do it with the default encoding... that's natural, right?
> >>> p.decode()
> '(?P<ú>)'
> ```

the actual way to know what group name is represented would be to look at the (unicode) string with the same "graphical representation":

```
# consider the following bytestring pattern
>>> p = b"(?P<\xc3\xba>)"

# what character does the group name correspond to?
# to discover it, we instead consider the string that "looks the same":
>>> "(?P<\xc3\xba>)"
'(?P<Ãº>)'

# ok so the group name will be "Ãº"
```

This way of going from bytes to strings _naively_ (which happens to be called latin-1) makes IMHO as much sense as saying that 0x10, 0b10 and 0o10 should be the same value, just because they "look the same" in the source code.

This is like throwing away everything we ever learned about Unicode and how a code point is fundamentally different from what is stored in memory.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue40980>
_______________________________________