[Python-ideas] a new bytestring type?

Nick Coghlan ncoghlan at gmail.com
Mon Jan 6 15:58:30 CET 2014


On 6 Jan 2014 19:16, "Andrew Barnert" <abarnert at yahoo.com> wrote:
>
> From: Nick Coghlan <ncoghlan at gmail.com>
> Sent: Sunday, January 5, 2014 2:57 PM
>
>
> >I actually expected someone to have experimented with an "encodedstr"
> >type by now. This would be a type that behaved like the Python 2 str type,
> >but had an encoding attribute. On encountering Unicode text strings, it
> >would encode them appropriately.
>
> I did something like this when I was first playing with 3.0, and I
> managed to find it.
>
> I tried two different implementations: a bytes subclass that fakes being
> a str as well as possible by decoding on the fly (or, in some cases, by
> encoding its arguments on the fly), and a str that fakes being a bytes as
> well as possible by doing the opposite.
>
> >However, people have generally instead followed the model of decoding to
> >text and operating in that domain, since it avoids a lot of subtle issues
> >(like accidentally embedding byte order marks when concatenating strings).
>
>
> It's also conceptually cleaner to work with text as text instead of as
> bytes that you can sort of use as text.
>
> Also, one major reason people resist working with text (or upgrading to
> 3.x) is the perceived performance costs of dealing with Unicode. But if you
> want to do any kind of string processing on your text beyond searching for
> ASCII header names and the like, you pretty much have to do it as Unicode
> or it's wrong. So, you'd need something that allows you to do those ASCII
> header searches in 8-bit-land, but either doesn't allow full string
> processing, or automatically decodes and re-encodes on the fly (which
> obviously isn't going to be faster).
>
> >This is likely encouraged by the fact that str, bytes and bytearray
> >don't currently implement type coercion correctly (which in turn is due to
> >a long-standing bug in the way the abstract C API handles sequence types
> >defined in C rather than Python), so an encodedstr type would need to
> >inherit from str or bytes to get interoperability, and then wouldn't
> >interoperate with the other one.
>
>
> What's the bug?

http://bugs.python.org/issue11477

CPython doesn't check for NotImplemented results from sq_concat or
sq_repeat, so the sequence implementations raise TypeError directly and the
RHS doesn't get consulted to see if it can handle the operation.
Subclassing works anyway because subclasses are always checked first even
when they're the RHS.
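
For anyone unfamiliar with the protocol being described, here is a rough
pure-Python sketch of it. It does not reproduce the C-level issue itself; it
only illustrates the difference between returning NotImplemented and raising
TypeError directly, plus the subclass-first rule. All class names are
hypothetical and exist only for illustration.

```python
class Polite:
    def __add__(self, other):
        # Decline the operation so Python can try the right-hand operand.
        return NotImplemented


class Strict:
    def __add__(self, other):
        # Raising directly ends the operation; the RHS is never asked.
        raise TypeError("unsupported operand")


class Handler:
    def __radd__(self, other):
        return "handled by Handler.__radd__"


class Base:
    def __add__(self, other):
        return "Base.__add__"


class Sub(Base):
    def __radd__(self, other):
        return "Sub.__radd__"


# LHS returns NotImplemented, so the reflected method on the RHS runs.
polite_result = Polite() + Handler()

# LHS raises TypeError itself, so Handler.__radd__ never gets a chance.
try:
    Strict() + Handler()
    strict_result = "no error"
except TypeError:
    strict_result = "TypeError"

# RHS is a subclass of the LHS type and provides its own __radd__,
# so it is tried before Base.__add__.
subclass_result = Base() + Sub()

print(polite_result, strict_result, subclass_result, sep="\n")
```

The Strict case is the closest analogy to what the C sequence
implementations do today, and the Base/Sub case is why subclassing str or
bytes still works.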

Thanks for the info on your experiences with attempting to implement an
encodedstr type. I still feel there is potential merit to the concept, but
it's certainly going to take some thought.
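
As a starting point for that thinking, a minimal sketch of the bytes-based
variant might look like the following. The class name EncodedStr, the
default encoding, and the choice to handle only the left-hand operand case
are all assumptions made for illustration, not a worked-out design.

```python
class EncodedStr(bytes):
    """Hypothetical bytes subclass that remembers its encoding."""

    def __new__(cls, value, encoding="utf-8"):
        # Accept text as well as bytes; text is encoded up front.
        if isinstance(value, str):
            value = value.encode(encoding)
        self = super().__new__(cls, value)
        self.encoding = encoding
        return self

    def __add__(self, other):
        if isinstance(other, str):
            # Encode text operands with our own encoding before joining.
            other = other.encode(self.encoding)
        if not isinstance(other, (bytes, bytearray)):
            # Let the other operand have a go rather than raising here.
            return NotImplemented
        return type(self)(bytes(self) + bytes(other), self.encoding)

    def __str__(self):
        return self.decode(self.encoding)


header = EncodedStr("Content-Type: ", encoding="latin-1")
line = header + "text/plain"
print(type(line).__name__, line.encoding, str(line))
```

This deliberately ignores the harder questions, such as what to do when two
instances with different encodings meet, or how to avoid embedding byte
order marks with codecs that emit them on every encode.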

Cheers,
Nick.