Handle foreign character web input

Michael Torrie torriem at gmail.com
Sun Jun 30 10:10:24 EDT 2019


On 06/30/2019 06:21 AM, Richard Damon wrote:
> On 6/30/19 4:00 AM, moi wrote:
>> Unfortunately not.
>>
>> The only thing Python succeeds to propose is a mechanism
>> which does the opposite of UTF-8 when it comes to handle
>> memory *and* - at the same time - which also does the opposite
>> of UTF-32 regarding performance.

I guess "moi" is banned from the mailing list for posting this kind of
rubbish, just like our other old unicode troll, as I see no trace of his
post on the list.  Which is just as well.  It's completely wrong.  The
in-memory, internal byte encoding of unicode is irrelevant to the
programmer. In Python 3 we deal with unicode. Period. Any performance
issues he or our other unicode troll (perhaps the same person?) has run
into stem from not understanding the nature of immutable strings.
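
To illustrate the immutability point, here is a minimal sketch (the
function names are made up for this example). Because a str cannot be
changed in place, building one up with += in a loop may copy the
accumulated text on each pass, whereas str.join() assembles the result
in one go. Both give the same string regardless of how Python stores it
internally:

```python
# Strings are immutable: "changing" one really means building a new one.

def build_with_concat(parts):
    """Accumulate with +=; each step may copy everything built so far."""
    result = ""
    for part in parts:
        result += part
    return result


def build_with_join(parts):
    """Let str.join() size the result once and copy each part once."""
    return "".join(parts)


# Mixed 1-, 2- and 4-byte-wide code points; the caller never has to care.
parts = ["\u00e9", "\u20ac", "\U0001d11e"] * 1000

assert build_with_concat(parts) == build_with_join(parts)
print(len(build_with_join(parts)))  # number of code points, not bytes
```

CPython can sometimes optimize the += case, so treat this as a sketch of
the general idea rather than a benchmark.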

>> For some other reasons, this mechanism leads to buggy
>> code.

No it doesn't.  He offers no evidence to back this up; it is a complete
fabrication on moi's part.

> My understanding was that the Python 3 'String' class always used a
> Unicode encoding (never a code-page encoding). If you indexed into a
> string you would get at each location the full code point value of that
> character. Now Unicode isn't just UTF-8 or UTF-32/UCS-4 or the like,
> those are just different ways to encode into memory/a stream Unicode
> code points. It may be that Python makes some awkward choices of how it
> wants to store the characters in memory, but to the programmer, it is
> just Unicode code points. If you specifically want something like a
> UTF-8 encoding, that is one of the uses of Bytes.

That's correct.  It doesn't matter what format Python chooses to use in
memory.
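
A small sketch of what Richard describes, using an arbitrary sample
string of my own choosing: indexing a str yields whole code points, and
a particular encoding such as UTF-8 only appears once you explicitly
encode to bytes.

```python
# "naïve €5 𝄞" written with escapes so the code points are unambiguous.
s = "na\u00efve \u20ac5 \U0001d11e"

# Indexing and len() work in code points, whatever the internal storage.
print(len(s))            # 10 code points
print(hex(ord(s[2])))    # 0xef    (LATIN SMALL LETTER I WITH DIAERESIS)
print(hex(ord(s[-1])))   # 0x1d11e (MUSICAL SYMBOL G CLEF), one index

# Encodings only show up when converting to bytes.
utf8 = s.encode("utf-8")
utf32 = s.encode("utf-32-le")
print(len(utf8))         # 16 bytes: variable width per code point
print(len(utf32))        # 40 bytes: fixed 4 bytes per code point

# Decoding gives back the same sequence of code points.
assert utf8.decode("utf-8") == s
assert utf32.decode("utf-32-le") == s
```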

Some argue that O(1) indexing of a unicode string is not important
because indexing a unicode string by code point (a "character") is
incorrect some/much of the time, owing to the fact that sometimes what
is seen as a single character on the screen (a grapheme cluster) is
actually composed of more than one code point. Hence, they argue, using
UTF-8 internally would be good enough, and it would make encoding to
bytes a no-op (fast).
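
For anyone who hasn't run into this, a minimal sketch using only the
standard library (the sample strings are my own). One visible character
can span several code points, so indexing by code point can split it:

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: displays as a single "é".
decomposed = "e\u0301"
print(len(decomposed))        # 2 code points
print(decomposed[0])          # plain 'e': indexing split the character

# NFC normalization composes this pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
print(len(composed))          # 1 code point
print(decomposed == composed) # False, although they render the same

# Normalization cannot always help: an emoji sequence joined with
# ZERO WIDTH JOINER (man, ZWJ, woman, ZWJ, girl) has no composed form.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(len(family))            # 5 code points, typically drawn as one glyph
```

Splitting text into grapheme clusters properly needs the Unicode
segmentation rules, which the standard library does not provide, so that
part is left out here.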



