[Python-ideas] RFC: bytestring as a str representation [was: a new bytestring type?]

Tue Jan 7 19:11:07 CET 2014

I think there are three problems with your proposal--all of which I mentioned in the long reply to Steven, but I suspect many people tl;dr'd over that, and I like your proposal enough that I want to make sure either I'm wrong, or you fix them. So:

On Jan 6, 2014, at 10:37, "Stephen J. Turnbull" <stephen at xemacs.org> wrote:

> So ... now that we have the flexible string representation (PEP 393),
> let's add a 7-bit representation!

The name has confused both Steven and Nick into misinterpreting the idea. It confused me too until I read over the details twice and it finally clicked, and the name still doesn't make sense to me now that I understand what you mean.

This is an 8-bit representation where non-ASCII byte values are stored as themselves to smuggle non-ASCII bytes. Just like the existing 16-bit representation where surrogate escapes are used to smuggle non-ASCII bytes. It's not a 7-bit representation unless there's nothing but ASCII in it--and it's never used in the case where there's nothing but ASCII. I'm not sure what the right word is, but this isn't it.

> 1.  It is only produced on input by a new 'ascii-compatible' codec,

This name might also be confusing people.
> 
> 3.  When combined with a str in 8-bit representation:
> 
>    a.  If the 8-bit str contains any Latin-1 or C1 characters, both
>        strs are promoted to 16-bit, and non-ASCII characters in the
>        7-bit string are converted by the surrogateescape handler.

This part worries me a bit. The bytes 61 62 63 FF in this new representation actually _mean_ 'abc' followed by a smuggled FF byte. But the words 0061 0062 0063 DCFF in a 16-bit representation just mean 'abc\uDCFF', which _can be interpreted_, via the surrogate-escape mechanism, as 'abc' and a smuggled byte, but don't actually _mean_ that. It seems like your proposal only works if we change it so that they really _do_ mean that.
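To make the distinction concrete, here is a minimal sketch using only what exists today. The 'ascii-compatible' codec is hypothetical, so this uses the real 'ascii' codec with the surrogateescape handler (PEP 383) as a stand-in. It shows that \uDCFF is just a code point in the str, and only becomes a smuggled FF byte if you ask for that interpretation when encoding:

```python
raw = b'abc\xff'

# Decoding with surrogateescape maps the undecodable byte FF to U+DCFF.
s = raw.decode('ascii', errors='surrogateescape')
assert s == 'abc\udcff'
assert ord(s[3]) == 0xDCFF

# Interpreted via surrogateescape, U+DCFF turns back into the byte FF.
assert s.encode('ascii', errors='surrogateescape') == raw

# Interpreted via surrogatepass, the same code point is encoded as itself.
surrogatepass_bytes = s.encode('utf-8', errors='surrogatepass')

# With the default strict handler, the lone surrogate is simply an error.
try:
    s.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

So the same str yields three different outcomes depending on the handler, which is what I mean by "can be interpreted as" rather than "means".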

> 6.  On output the 'ascii-compatible' codec simply memcpy's 7-bit str
>    and pure ASCII 8-bit str, and raises on anything else.

So if a 7-bit string gets converted to a surrogate-escaped 16-bit string, it can never be written out again? For a contrived example:

(b'abc\xff'.decode('ascii-compatible') + '\u1234')[:4].encode('ascii-compatible')

I'd expect to get back my b'abc\xff'. But your rules give me an exception.

Maybe you were expecting this to be taken care of in the slicing, but rule 1 makes that impossible; you can never get a 7-bit string by doing anything but decoding ascii-compatible (or combining two 7-bit strings).
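For comparison, the same round trip already works today if you spell it with surrogateescape explicitly. This is only an analogy, assuming the real 'ascii' codec plus the surrogateescape handler stands in for the proposed (and so far nonexistent) 'ascii-compatible' codec:

```python
raw = b'abc\xff'

# Stand-in for .decode('ascii-compatible'): FF becomes U+DCFF.
decoded = raw.decode('ascii', errors='surrogateescape')

# Combining with a non-Latin-1 character, then slicing it back off.
combined = decoded + '\u1234'
sliced = combined[:4]

# Stand-in for .encode('ascii-compatible'): U+DCFF becomes FF again.
result = sliced.encode('ascii', errors='surrogateescape')
assert result == raw
```

That is the behavior I'd expect the new codec to preserve, whatever internal representation the sliced string ends up in.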

I think ascii-compatible has to accept non-8-bit-repr strings (by encoding ASCII as ASCII and surrogate escapes as bytes, with everything else raising an exception). This is necessary because 61 62 63 FF (7-bit) and 0061 0062 0063 DCFF (16-bit) are the same string anyway. But it's especially necessary because the former can be silently converted into the latter (and there's no way to even test whether that's happened).
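Spelled out, the encoding rule I'm suggesting looks roughly like the sketch below. The function name encode_ascii_compatible is made up for illustration, and this is pure Python working on code points, not the memcpy-based C implementation the proposal has in mind:

```python
def encode_ascii_compatible(s: str) -> bytes:
    """Hypothetical encoder: ASCII as ASCII, surrogate escapes as bytes."""
    out = bytearray()
    for i, ch in enumerate(s):
        cp = ord(ch)
        if cp < 0x80:
            # Plain ASCII is written out unchanged.
            out.append(cp)
        elif 0xDC80 <= cp <= 0xDCFF:
            # A surrogate escape U+DCxx stands for the raw byte xx.
            out.append(cp - 0xDC00)
        else:
            # Anything else cannot be represented and is an error.
            raise UnicodeEncodeError(
                'ascii-compatible', s, i, i + 1,
                'not ASCII or a surrogate-escaped byte')
    return bytes(out)
```

For example, encode_ascii_compatible('abc\udcff') gives b'abc\xff', while encode_ascii_compatible('abc\u1234') raises UnicodeEncodeError. As far as I can tell this is the same result you get today from s.encode('ascii', 'surrogateescape').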

Of course that means biting the bullet and saying that \uDCFF in Python really means a smuggled FF byte, rather than just being a way to smuggle an FF byte through Unicode if you want to do so explicitly. But as I said above, I think you've already bitten that bullet.