[Python-ideas] RFC: bytestring as a str representation [was: a new bytestring type?]

Tue Jan 7 21:58:12 CET 2014

On 2014-01-07 19:43, Ethan Furman wrote:
> On 01/07/2014 11:32 AM, MRAB wrote:
>> On 2014-01-07 18:38, Ethan Furman wrote:
>>> On 01/07/2014 10:22 AM, MRAB wrote:
>>>>> On Jan 7, 2014, at 7:44, Steven D'Aprano <steve at pearwood.info> wrote:
>>>>>
>>>>>> Suppose we take a byte-string with a non-ASCII byte:
>>>>>>
>>>>>>    b'abc\xFF'.decode('ascii-compatible')
>>>>>>
>>>> That would be:
>>>>
>>>>      bytestring(b'abc\xFF')
>>>>
>>>> Bytes outside the ASCII range would be mapped to Unicode low
>>>> surrogates:
>>>>
>>>>      bytestring(b'abc\xFF') == bytestring('abc\uDCFF')
>>>
>>> Not sure what you mean here.  The resulting bytes should be 'abc\xFF' and of length 4.
>>>
>> 'abc\xFF' is a Unicode string, but you wouldn't be able to convert it
>> to a bytestring because '\xFF' is a codepoint outside the ASCII range
>> and not a low surrogate.
>
> I can see terminology is going to be a pain in this thread.  ;)
>
> My vision for a bytestring type (more refined):
>
>     - made up of single bytes in the range 0 - 255 (no unicode anywhere)
>
>     - indexing returns a bytestring of length 1, not an integer (as bytes does)
>
>     - `bytestring(7)` either fails, or returns `bytestring('\x07')`, not `bytestring(0, 0, 0, 0, 0, 0, 0)`
>
> So my statement above of 'abc\xFF' should not be interpreted as a unicode string... I guess I'll use 'y' as an
> abbreviation for now: y'abc\xFF'.
>
No disagreement there.
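
For concreteness, here is a rough sketch of those three points as a
subclass of bytes. The class below is only my illustration, not an
agreed design, and it assumes the "fails" option for bytestring(7):

```python
class bytestring(bytes):
    """Hypothetical sketch of the proposed type; not an existing class."""

    def __new__(cls, source=b""):
        # Assumption: an integer argument is an error, so bytestring(7)
        # fails instead of producing seven zero bytes as bytes(7) does.
        if isinstance(source, int):
            raise TypeError("bytestring() does not accept an integer")
        return super().__new__(cls, source)

    def __getitem__(self, index):
        result = super().__getitem__(index)
        if isinstance(result, int):
            # A single index gives a bytestring of length 1, not an int.
            return type(self)(bytes([result]))
        # A slice gives a bytestring rather than plain bytes.
        return type(self)(result)


y = bytestring(b'abc\xFF')
assert len(y) == 4                 # four single bytes, no Unicode involved
assert y[3] == b'\xff'             # indexing returns a length-1 bytestring
assert y[1:3] == b'bc'
```

Iteration still yields integers in this sketch, because bytes iteration
does not go through __getitem__; a real type would have to settle that
too.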

My point about Unicode concerns how a bytestring could behave when
mixed with Unicode strings.
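
The surrogate mapping I described is what the existing
'surrogateescape' error handler (PEP 383) already does for str and
bytes. A small sketch of the round trip, where to_text and to_bytes are
hypothetical helper names and not part of any proposal:

```python
def to_text(raw):
    # Non-ASCII bytes 0x80-0xFF become lone low surrogates U+DC80-U+DCFF.
    return raw.decode('ascii', errors='surrogateescape')


def to_bytes(text):
    # Low surrogates U+DC80-U+DCFF are turned back into the original bytes.
    return text.encode('ascii', errors='surrogateescape')


text = to_text(b'abc\xFF')
assert text == 'abc\udcff'
assert len(text) == 4
assert to_bytes(text) == b'abc\xFF'

# 'abc\xFF' contains U+00FF, which is neither ASCII nor a low surrogate,
# so it cannot be converted back.
try:
    to_bytes('abc\xFF')
except UnicodeEncodeError:
    pass
else:
    raise AssertionError("expected UnicodeEncodeError")
```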