[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Wed Sep 17 11:37:43 CEST 2014

Seriously, can this discussion move somewhere else?
This has no place on python-dev.

Thank you

Antoine.

On Wed, 17 Sep 2014 18:56:02 +1000
Steven D'Aprano <steve at pearwood.info> wrote:
> On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote:
> 
> > Guido's mantra is something like "Python's str doesn't contain
> > characters or even code points[1], it contains code units."
> 
> But is that true? If it were true, I would expect to be able to make 
> Python text strings containing code units that aren't code points, e.g. 
> something like "\U12340000" or chr(0x12340000) should work, but neither
> does. As far as I can tell, there is no way to build a string containing
> items which aren't code points.
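
The claim above is easy to check. A minimal sketch, assuming CPython 3.3
or later (the helper name try_chr is only for illustration): chr() rejects
any value past U+10FFFF, and an out-of-range \U escape is refused when the
source is compiled.

```python
import sys

# The largest code point a str item can hold (0x10FFFF on Python 3.3+).
assert sys.maxunicode == 0x10FFFF


def try_chr(value):
    """Return chr(value), or the exception raised for an invalid value."""
    try:
        return chr(value)
    except (ValueError, OverflowError) as exc:
        return exc


top = try_chr(0x10FFFF)        # the last valid code point: a 1-item str
too_big = try_chr(0x12340000)  # not a code point: a ValueError instance

# The equivalent string escape is rejected at compile time.
try:
    compile(r'"\U12340000"', "<example>", "eval")
    escape_ok = True
except SyntaxError:
    escape_ok = False
```

Both routes fail, so at the Python level there seems to be no way to get an
item outside range(0x110000) into a str.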
> 
> I don't think it is useful to say that strings *contain* code units, 
> more that they *are made up from* code units. Code units are the 
> implementation: 16-bit code units in narrow builds, 32-bit code units 
> in wide builds, and either 8-, 16- or 32-bit code units in Python 3.3 and 
> beyond. (I don't know of any Python implementation which uses UTF-8 
> internally, but if there were one, it would use 8-bit code units.)
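
To make the code unit / code point distinction concrete, here is a small
sketch that counts the code units needed for the same three code points in
each encoding form. The helper code_units is hypothetical, and the counts
say nothing about what a given interpreter stores internally.

```python
def code_units(text, encoding, unit_bytes):
    """Count the code units needed to encode text in the given encoding."""
    return len(text.encode(encoding)) // unit_bytes


s = "a\u20ac\U0001F40D"  # LATIN SMALL LETTER A, EURO SIGN, SNAKE

utf8_units = code_units(s, "utf-8", 1)       # 1 + 3 + 4 code units
utf16_units = code_units(s, "utf-16-le", 2)  # 1 + 1 + 2 (surrogate pair)
utf32_units = code_units(s, "utf-32-le", 4)  # 1 + 1 + 1
code_points = len(s)                         # what str exposes on 3.3+
```

The number of code units varies with the encoding form, while len() on
Python 3.3 and later reports the three code points in every case.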
> 
> It isn't very useful to say that in Python 3.3 the string "A" *contains*
> the 8-bit code unit 0x41. That's conflating two different levels of 
> explanation (the high-level interface and the underlying implementation)
> and potentially leads to user confusion like
> 
> # 8-bit code units are bytes, right?
> assert b'\x41' in "A"
> 
> which is Not Even Wrong.
> http://rationalwiki.org/wiki/Not_even_wrong
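
For the record, a sketch of what Python 3 actually does with that kind of
test (the helper name contains is made up for this example): the
containment check does not return False, it raises TypeError, because bytes
and str live at different levels.

```python
def contains(needle, haystack):
    """Return `needle in haystack`, or the TypeError it raises."""
    try:
        return needle in haystack
    except TypeError as exc:
        return exc


bytes_in_str = contains(b"\x41", "A")  # a TypeError instance, not False
str_in_str = contains("\x41", "A")     # True: "\x41" is the code point U+0041
same_number = ord("A") == 0x41         # True: the code point value is 0x41
```

The code point value can be compared as an integer, but the byte itself is
never an element of the string.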
> 
> I think it is correct to say that Python strings are sequences of 
> Unicode code points U+0000 through U+10FFFF. There are no other 
> restrictions, e.g. strings can contain surrogates, noncharacters, or 
> nonsensical combinations of code points such as a U+0300 COMBINING GRAVE 
> ACCENT combined with U+000A (newline).
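
A short sketch of the "no other restrictions" point, assuming Python 3.3 or
later: a lone surrogate, a noncharacter and a newline followed by a
combining accent are all accepted as str items. Strictness only shows up
at the codec boundary, where UTF-8 refuses the lone surrogate.

```python
import unicodedata

surrogate = "\ud800"  # a lone surrogate code point
nonchar = "\uffff"    # a Unicode noncharacter
odd = "\n\u0300"      # newline followed by COMBINING GRAVE ACCENT

lengths = [len(surrogate), len(nonchar), len(odd)]
categories = [unicodedata.category(surrogate), unicodedata.category(nonchar)]

# Encoding is stricter than the str type itself.
try:
    surrogate.encode("utf-8")
    surrogate_encodable = True
except UnicodeEncodeError:
    surrogate_encodable = False

nonchar_bytes = nonchar.encode("utf-8")  # noncharacters do encode
```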
> 
> 
> > Implying
> > that dealing with characters (or the grapheme globs that occasionally
> > raise their ugly heads here) is an issue for higher-level facilities
> > than str to deal with.
> 
> Agreed that Python doesn't offer a string type based on graphemes, and 
> that such a facility belongs as a high-level library, not a built-in 
> type.
> 
> Also agreed that talking about characters is sloppy. Nevertheless, for 
> English speakers at least, "code point = character" isn't too awful a 
> first approximation.
> 
> 
> > The point being that
> > 
> >  > Basically, we are pretending that each smuggled byte is a single
> >  > character
> > 
> > is something of a misstatement (good enough for the present purpose of
> > discussing email, but not good enough for the general case of
> > understanding how this is supposed to work when porting the construct
> > to other Python implementations), while
> > 
> >  > for string parsing purposes...but they don't match any of our
> >  > parsing constants.
> > 
> > is precisely Pythonically correct.  You might want to add "because all
> > parsing constants contain only valid characters by construction."
> 
> I don't understand what you are trying to say here.
> 
> 
> >  > [*] I worried a lot that this was re-introducing the bytes/string
> >  > problem from python2.
> > 
> > It isn't, because the bytes/str problem was that given a str object
> > out of context you could not tell whether it was a binary blob or
> > text, and if text, you couldn't tell if it was external encoded text
> > or internal abstract text.
> > 
> > That is not true here because the representations of characters vs.
> > smuggled bytes in str are disjoint sets.
> 
> Nor am I sure what you are trying to say here.
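
One possible reading of the "disjoint sets" remark quoted above, sketched
under the assumption that the smuggled bytes are those produced by the
surrogateescape error handler (PEP 383): each undecodable byte 0x80-0xFF
becomes a lone surrogate in U+DC80..U+DCFF, a range no validly decoded text
occupies, so the two kinds of item can always be told apart and the bytes
recovered.

```python
raw = b"caf\xe9 \xff"  # contains two bytes that are not valid ASCII

text = raw.decode("ascii", errors="surrogateescape")


def is_smuggled(ch):
    """True if ch is a surrogate standing in for an undecodable byte."""
    return 0xDC80 <= ord(ch) <= 0xDCFF


smuggled = [ch for ch in text if is_smuggled(ch)]
ordinary = "".join(ch for ch in text if not is_smuggled(ch))

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("ascii", errors="surrogateescape")
```

A parsing constant such as ":" or "@" can never equal one of the smuggled
items, since those constants contain only ordinary code points.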
> 
> 
> > Footnotes: 
> > [1]  In Unicode terminology, a code unit is the smallest computer
> > object that can represent a character (this is uniquely and sanely
> > defined for all real Unicode transformation formats aka UTFs).  A code
> > point is an integer 0 - (17*256*256-1) that can represent a character,
> > but many code points such as surrogates and 0xFFFF are defined to be
> > non-characters.
> 
> Actually not quite. "Noncharacter" is concretely defined in Unicode, and 
> there are only 66 of them, many fewer than the surrogate code points 
> alone. Surrogates are reserved, not noncharacters.
> 
> http://www.unicode.org/glossary/#surrogate_code_point
> http://www.unicode.org/faq/private_use.html#nonchar1
> 
> It is wrong to talk about "surrogate characters", but perhaps you mean 
> to say that surrogates (by which I understand you to mean surrogate code 
> points) are "not human-meaningful characters", which is not the same 
> thing as a Unicode noncharacter.
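
The numbers above can be checked with a few lines. A sketch using the
ranges as defined by Unicode: the noncharacters are U+FDD0..U+FDEF plus the
last two code points of each of the 17 planes, and the surrogates are
U+D800..U+DFFF.

```python
import unicodedata

# Surrogate code points: U+D800..U+DFFF.
surrogates = range(0xD800, 0xE000)

# Noncharacters: U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in every plane.
noncharacters = list(range(0xFDD0, 0xFDF0))
noncharacters += [
    plane * 0x10000 + low
    for plane in range(17)
    for low in (0xFFFE, 0xFFFF)
]

surrogate_count = len(surrogates)
nonchar_count = len(noncharacters)
overlap = set(surrogates) & set(noncharacters)

# General categories: Cs (surrogate) versus Cn (unassigned).
surrogate_cats = {unicodedata.category(chr(cp)) for cp in surrogates}
nonchar_cats = {unicodedata.category(chr(cp)) for cp in noncharacters}

# The code point range from the footnote: 0 through 17*256*256 - 1.
last_code_point = 17 * 256 * 256 - 1
```

The two sets do not overlap: 2048 surrogates against 66 noncharacters.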
> 
> 
> > Characters are those code points that may be assigned
> > an interpretation as a character, including undefined characters
> > (private space and reserved).
> 
> So characters are code points which are characters, including undefined 
> characters? :-)
> 
> http://www.unicode.org/glossary/#character