How do I display unicode value stored in a string variable using ord()

Paul Rubin no.email at nospam.invalid
Sun Aug 19 13:48:06 EDT 2012


Terry Reedy <tjreedy at udel.edu> writes:
>> Meanwhile, an example of the 393 approach failing:
> I am completely baffled by this, as this example is one where the 393
> approach potentially wins.

What?  The 393 approach is supposed to avoid memory bloat, and in that
example it does the opposite.

>> I was involved in a project that dealt with terabytes of OCR data of
>> mostly English text.  So the chars were mostly ascii,
> 3.3 stores ascii pages 1 byte/char rather than 2 or 4.

But they are not ascii pages, they are (as stated) MOSTLY ascii.
E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
a much more memory-expensive encoding than UTF-8.
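
To make the comparison concrete, here is a rough sketch. It assumes
CPython 3.3 or later (where PEP 393 is implemented) and uses
sys.getsizeof as a stand-in for real memory use. The sample strings and
the helper name footprint are made up for illustration; they are not the
OCR data from the project.

```python
import sys

def footprint(text):
    """Return (size of the str object, size of its UTF-8 encoding) in bytes."""
    return sys.getsizeof(text), len(text.encode("utf-8"))

n = 10000
# Hypothetical samples: pure ASCII, 1% BMP characters, 1% supplementary characters.
ascii_only = "a" * n
mostly_ascii_bmp = "a" * (n - 100) + "\u4e2d" * 100
mostly_ascii_astral = "a" * (n - 100) + "\U0001F600" * 100

for name, text in [("ascii", ascii_only),
                   ("1% BMP", mostly_ascii_bmp),
                   ("1% astral", mostly_ascii_astral)]:
    in_memory, as_utf8 = footprint(text)
    print(name, "str object:", in_memory, "bytes; UTF-8:", as_utf8, "bytes")
```

The UTF-8 sizes are 10000, 10200 and 10300 bytes.  On CPython the str
object goes to roughly 2 bytes per character as soon as one character
above U+00FF appears, and roughly 4 bytes per character as soon as one
supplementary character appears.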

> I doubt that there are really any non-bmp chars.

You may be right about this.  I thought about it some more after
posting and I'm not certain that there were supplemental characters.

> As Steven said, reject such false identifications.

Reject them how?

>> That's a natural for UTF-8
> 3.3 would convert to utf-8 for storage on disk.

They are already in utf-8 on disk, though that doesn't matter since
they are also compressed.

>> but the PEP-393 approach would bloat up the memory
>> requirements by a factor of 4.
> 3.2- wide builds would *always* use 4 bytes/char. Is not occasionally
> better than always?

The bloat is in comparison with utf-8, in that example.

> That looks like a 3.2- narrow build. Such builds treat unicode strings
> as sequences of code units rather than sequences of codepoints. Not an
> implementation bug, but a compromise design that goes back about a
> decade to when unicode was added to Python.

I thought the whole point of Python 3's disruptive incompatibility with
Python 2 was to clean up past mistakes and compromises, of which unicode
headaches were near the top of the list.  So I'm surprised they seem to
have repeated a mistake there.
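
For reference, the code unit versus code point distinction can be
sketched as below.  This assumes Python 3.3 or later, where len() counts
code points; the helper utf16_code_units is a hypothetical name that
imitates what a narrow build would count by encoding to UTF-16.

```python
def utf16_code_units(text):
    """Count the 16-bit code units needed for text, as a narrow build would."""
    return len(text.encode("utf-16-le")) // 2

s = "a\U0001F600"                 # one ASCII char plus one supplementary char
code_points = len(s)              # counts code points on Python 3.3+
code_units = utf16_code_units(s)  # the supplementary char needs a surrogate pair
print(code_points, code_units)
```

The supplementary character is one code point but two UTF-16 code
units, so the two counts differ by one.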

> I would call it O(k), where k is a selectable constant. Slowing access
> by a factor of 100 is hardly acceptable to me. 

If k is constant then O(k) is the same as O(1).  That is how O notation
works.  I wouldn't believe the 100x figure without seeing it measured in
real-world applications.
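
A minimal sketch of the kind of scheme in question, under my own
assumptions: the text is kept as UTF-8 bytes plus a side table holding
the byte offset of every k-th character, so indexing walks at most k - 1
characters from a checkpoint.  The class name IndexedUTF8, the default
k=64 and the layout are all hypothetical; this is not anything CPython
actually does.

```python
class IndexedUTF8:
    """Hypothetical UTF-8 string with a byte-offset checkpoint every k chars."""

    def __init__(self, text, k=64):
        self.k = k
        self.data = text.encode("utf-8")
        self.length = len(text)
        # offsets[j] is the byte position of character number j * k.
        self.offsets = []
        pos = 0
        for i, ch in enumerate(text):
            if i % k == 0:
                self.offsets.append(pos)
            pos += len(ch.encode("utf-8"))

    def __len__(self):
        return self.length

    @staticmethod
    def _width(lead_byte):
        """Byte length of a UTF-8 sequence, judged from its first byte."""
        if lead_byte < 0x80:
            return 1
        if lead_byte < 0xE0:
            return 2
        if lead_byte < 0xF0:
            return 3
        return 4

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        # Jump to the nearest checkpoint at or before index ...
        pos = self.offsets[index // self.k]
        # ... then step forward at most k - 1 characters.
        for _ in range(index % self.k):
            pos += self._width(self.data[pos])
        end = pos + self._width(self.data[pos])
        return self.data[pos:end].decode("utf-8")


s = "abc\u00e9\u4e2d\U0001F600xyz" * 50
w = IndexedUTF8(s, k=8)
print(w[5] == s[5])
```

The work per lookup is bounded by k no matter how long the string is,
which is why it counts as O(1) for fixed k.  How large the constant is
in practice is exactly what would need measuring.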


