How do I display unicode value stored in a string variable using ord()

Terry Reedy tjreedy at udel.edu
Sun Aug 19 13:34:09 EDT 2012


On 8/19/2012 4:04 AM, Paul Rubin wrote:


> Meanwhile, an example of the 393 approach failing:

I am completely baffled by this, as this example is one where the 393 
approach potentially wins.

> I was involved in a
> project that dealt with terabytes of OCR data of mostly English text.
> So the chars were mostly ascii,

3.3 stores ASCII pages at 1 byte/char rather than 2 or 4.
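
A rough illustration (the reported sizes include a fixed per-object 
overhead and vary a bit by platform, but the per-char cost is the point):

    import sys
    ascii_s = 'a' * 1000            # all codepoints < 128  -> 1 byte/char
    bmp_s   = '\u0100' * 1000       # BMP, above latin-1    -> 2 bytes/char
    astral  = '\U00010000' * 1000   # supplementary plane   -> 4 bytes/char
    for name, s in [('ascii', ascii_s), ('bmp', bmp_s), ('astral', astral)]:
        print(name, sys.getsizeof(s))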

> but there would be occasional non-ascii
> chars including supplementary plane characters, either because of
> special symbols that were really in the text, or the typical OCR
> confusion emitting those symbols due to printing imprecision.

I doubt that there are really any non-BMP chars. As Steven said, reject 
such false identifications.

> That's a natural for UTF-8

3.3 would convert to utf-8 for storage on disk.
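
Something like this (a minimal sketch; the file name is made up):

    # Regardless of the internal (PEP 393) representation, writing with an
    # explicit encoding stores UTF-8 bytes on disk: 1 byte per ascii char,
    # more only for the occasional non-ascii char.
    text = 'mostly ascii with an occasional \u00e9 or \U0001d49e'
    with open('ocr_page.txt', 'w', encoding='utf-8') as f:
        f.write(text)
    print(len(text), len(text.encode('utf-8')))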

> but the PEP-393 approach would bloat up the memory
> requirements by a factor of 4.

3.2- wide builds would *always* use 4 bytes/char. Isn't 'occasionally' 
better than 'always'?

>      py> s = chr(0xFFFF + 1)
>      py> a, b = s
>
> That looks like Python 3.2 is buggy and that sample should just throw an
> error.  s is a one-character string and should not be unpackable.

That looks like a 3.2- narrow build. Narrow builds treat unicode strings 
as sequences of code units rather than sequences of codepoints. That is 
not an implementation bug, but a compromise design that goes back about 
a decade, to when unicode was added to Python. At that time, there were 
only a few defined non-BMP chars and their usage was extremely rare. 
There are now more extended chars defined than BMP chars, and their 
usage will become more common even in English text.
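
For comparison, here is what the quoted snippet does on 3.3 (narrow-build 
behavior noted in the comments):

    # Python 3.3+: strings are sequences of codepoints, so U+10000 is one char.
    s = chr(0xFFFF + 1)      # U+10000, first supplementary-plane character
    print(len(s))            # 1 on 3.3+; 2 on a 3.2- narrow build (surrogate pair)
    print(hex(ord(s)))       # 0x10000 on 3.3+; ord() errors on the 2-unit string
    try:
        a, b = s             # unpacking now fails, as it should for a 1-char string
    except ValueError as err:
        print(err)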

Pre 3.3, there are really two sub-versions of every Python version: a 
narrow build and a wide build, with poorly documented differences in 
behavior for any string containing extended chars. That is, and would 
have become, an increasing problem as extended chars are used more and 
more. If you want to say that what was once a practical compromise has 
become a design bug, I would not argue. In any case, 3.3 fixes that 
split and returns Python to being one cross-platform language.
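
If you are not sure which sub-version you are running, sys.maxunicode 
tells you (a trivial check, nothing 393-specific):

    import sys
    # 0xFFFF on a narrow build (UTF-16 code units), 0x10FFFF on a wide build.
    # On 3.3+ it is always 0x10FFFF, since the split is gone.
    if sys.maxunicode == 0xFFFF:
        print('narrow build: len of U+10000 literal is', len('\U00010000'))
    else:
        print('wide build or 3.3+: full codepoint indexing')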

> I realize the folks who designed and implemented PEP 393 are very smart
> cookies and considered stuff carefully, while I'm just an internet user
> posting an immediate impression of something I hadn't seen before (I
> still use Python 2.6), but I still have to ask: if the 393 approach
> makes sense, why don't other languages do it?

Python has often copied or borrowed, with adjustments. This time it is 
the first to take this approach. We will see how it goes, but it has 
already been tested for nearly a year.

> Ropes of UTF-8 segments seems like the most obvious approach and I
> wonder if it was considered.  By that I mean pick some implementation
> constant k (say k=128) and represent the string as a UTF-8 encoded byte
> array, accompanied by a vector n//k pointers into the byte array, where
> n is the number of codepoints in the string.  Then you can reach any
> offset analogously to reading a random byte on a disk, by seeking to the
> appropriate block, and then reading the block and getting the char you
> want within it.  Random access is then O(1) though the constant is
> higher than it would be with fixed width encoding.

I would call it O(k), where k is a selectable constant. Slowing access 
by a factor of 100 is hardly acceptable to me. For strings shorter than 
k, access is O(len). I believe slicing would also require re-indexing.

As 393 was near adoption, I proposed a scheme using utf-16 (narrow 
builds) with a supplementary index of extended chars when there are any. 
That makes access O(1) if there are none and O(log(k)), where k is the 
number of extended chars in the string, if there are some.
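
A minimal sketch of how that index translation would work (names 
invented; this was a proposal, not code that ever went into CPython):

    from bisect import bisect_right

    class Utf16Indexed:
        """UTF-16 code units plus the codepoint positions of extended chars."""
        def __init__(self, text):
            self.units = []      # one unit per BMP char, two per supplementary char
            self.astral = []     # codepoint indexes of supplementary chars, in order
            for i, ch in enumerate(text):
                cp = ord(ch)
                if cp > 0xFFFF:
                    self.astral.append(i)
                    cp -= 0x10000
                    self.units.append(0xD800 + (cp >> 10))    # high surrogate
                    self.units.append(0xDC00 + (cp & 0x3FF))  # low surrogate
                else:
                    self.units.append(cp)
            self.n = len(text)

        def __len__(self):
            return self.n

        def __getitem__(self, i):
            if not 0 <= i < self.n:
                raise IndexError(i)
            # O(1) if astral is empty, O(log k) otherwise: count extended chars
            # before i to find how many extra code units to skip.
            extra = bisect_right(self.astral, i - 1)
            u = i + extra
            cp = self.units[u]
            if 0xD800 <= cp < 0xDC00:   # high surrogate: combine with the next unit
                cp = 0x10000 + ((cp - 0xD800) << 10) + (self.units[u + 1] - 0xDC00)
            return chr(cp)

    s = Utf16Indexed('a\U00010000b')
    print(len(s), s[1] == '\U00010000', s[2])   # 3 True b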

-- 
Terry Jan Reedy



