[Python-Dev] bytes / unicode

Mon Jun 28 01:31:21 CEST 2010

I've been watching this discussion with intense interest, but I have
been so far behind in following the thread that I haven't replied.
I caught up today....

On Sun, 27 Jun 2010 15:53:59 +1000, Nick Coghlan <ncoghlan at gmail.com> wrote:
> The difference is that we have three classes of algorithm here:
> - those that work only on octet sequences
> - those that work only on character sequences
> - those that can work on either
> 
> Python 2 lumped all 3 classes of algorithm together through the
> multi-purpose 8-bit str type. The unicode type provided some scope to
> separate out the second category, but the divisions were rather
> blurry.
> 
> Python 3 forces the first two to be separated by using either octets
> (bytes/bytearray) or characters (str). There are a *very small* number
> of APIs where it is appropriate to be polymorphic, but this is
> currently difficult due to the need to supply literals of the
> appropriate type for the objects being operated on.
> 
> This isn't ever going to happen automagically due to the need to
> explicitly provide two literals (one for octet sequences, one for
> character sequences).

In email6 I'm currently handling this by putting the algorithm on a
base class and the literals on 'Bytes...' and 'String...' subclasses as
class variables.  Slightly ugly, but it works.
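
A minimal sketch of that arrangement, for illustration only: the class,
attribute, and method names below are hypothetical and are not taken from
the email6 code.  The algorithm is written once on the base class and only
touches literals through class variables, so it works unchanged on bytes
or str.

```python
class BaseHeaderParser:
    """Holds the algorithm; subclasses supply literals of the right type."""

    # Hypothetical literal slots, filled in by the subclasses below.
    COLON = None
    CRLF = None
    SPACE = None

    def unfold(self, raw):
        # A line break followed by a space is a folded continuation line;
        # collapse it to a single space.
        return raw.replace(self.CRLF + self.SPACE, self.SPACE)

    def split(self, raw):
        # Split "Name: value" into its two parts, whatever the input type.
        name, _, value = self.unfold(raw).partition(self.COLON)
        return name.strip(), value.strip()


class BytesHeaderParser(BaseHeaderParser):
    COLON = b":"
    CRLF = b"\r\n"
    SPACE = b" "


class StringHeaderParser(BaseHeaderParser):
    COLON = ":"
    CRLF = "\r\n"
    SPACE = " "
```

Calling BytesHeaderParser().split(b"Subject: hello\r\n world") and
StringHeaderParser().split("Subject: hello\r\n world") runs the same code
path and returns a pair of bytes or a pair of str respectively.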

The current design also speaks to an earlier point someone made about the
fact that we are really dealing with more complex, domain-specific data,
not simply "byte strings".  A "BytesMessage" contains lots of
structured encoding information as well as the possibility of 'garbage'
bytes.  A StringMessage contains text and data decoded into objects
(e.g. an image object), possibly with some PEP 383 surrogates included
(haven't quite figured that part out yet).  So, a BytesMessage object
isn't just a byte string, it's a load of structured data that requires
the associated algorithms to convert into meaningful text and objects.
Going the other way, the decisions made about character encodings need to
be encoded into the structured bytes representation that could ultimately
go out on the wire.
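
For reference, this is how the PEP 383 "surrogateescape" error handler
treats 'garbage' bytes in general.  It is only a sketch of the standard
codec behaviour, with a made-up header value, and says nothing about how
email6 will end up using it.

```python
# A header containing a byte that is not valid ASCII (sample data).
raw = b"Subject: caf\xe9"

# Undecodable bytes are smuggled through as lone surrogates U+DC80..U+DCFF.
text = raw.decode("ascii", errors="surrogateescape")
assert text == "Subject: caf\udce9"

# Encoding with the same handler restores the original bytes exactly.
assert text.encode("ascii", errors="surrogateescape") == raw
```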

I suspect that the same thing needs to be done for URIs/IRIs, and
HTML/MIME and the corresponding text and objects.  It is my hope that
the email6 work will lay a firm foundation for the latter, but URI/IRI
is a whole different protocol that I'm glad I don't have to deal with :)

> The virtues of a separate poly_str type are that:

Having such a poly_str type would probably make my life easier.

I would also like to vent a little frustration at having to
use single-character-slice notation when I want to index a character
in a string in my algorithms....
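
To illustrate the frustration: in Python 3, indexing a bytes object yields
an int while indexing a str yields a one-character str, so code meant to
work on either type has to use a one-element slice.  The helper name below
is hypothetical.

```python
def first_unit(data):
    # data[0] would be an int for bytes but a str for str;
    # a one-element slice keeps the type of the input.
    return data[0:1]


assert b"abc"[0] == 97            # indexing bytes gives an int
assert "abc"[0] == "a"            # indexing str gives a str
assert first_unit(b"abc") == b"a"
assert first_unit("abc") == "a"
```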

--
R. David Murray                                      www.bitdance.com