Handle foreign character web input

Chris Angelico rosuav at gmail.com
Sun Jun 30 10:04:15 EDT 2019


On Sun, Jun 30, 2019 at 10:26 PM Richard Damon <Richard at damon-family.org> wrote:
>
> On 6/30/19 4:00 AM, moi wrote:
> > On Saturday, 29 June 2019 at 19:25:40 UTC+2, Richard Damon wrote:
> >>
> >> Now (as I understand it), all Python (3) 'Strings' are internally
> >> Unicode, if you need something with a different encoding it needs to be
> >> in Bytes.
> >>
> >> --
> >
> > Unfortunately not.
> >
> > The only thing Python manages to offer is a mechanism
> > that does the opposite of UTF-8 when it comes to memory
> > handling *and*, at the same time, does the opposite
> > of UTF-32 regarding performance.
> >
> > For some other reasons, this mechanism leads to buggy
> > code.
> >
>
> My understanding was that the Python 3 'String' class always used a
> Unicode encoding (never a code-page encoding). If you indexed into a
> string you would get at each location the full code point value of that
> character. Now Unicode isn't just UTF-8 or UTF-32/UCS-4 or the like;
> those are just different ways to encode Unicode code points into memory
> or a stream. It may be that Python makes some awkward choices of how it
> wants to store the characters in memory, but to the programmer, it is
> just Unicode code points. If you specifically want something like a
> UTF-8 encoding, that is one of the uses of Bytes.
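
The str/bytes split described above can be seen in a few lines. This is
only a minimal sketch; the sample text and variable names are made up
for illustration and are not from anyone's code in this thread. It
shows that indexing and len() work on code points, and that a concrete
encoding only appears once you call encode():

```python
# A str is a sequence of Unicode code points; bytes hold one specific encoding.
# The sample text is illustrative: 'café ' followed by one emoji outside the BMP.
text = "caf\u00e9 \U0001F600"

# len() and indexing count code points, however CPython stores them internally.
length = len(text)                 # number of code points
fourth = text[3]                   # the single code point U+00E9
last_code_point = ord(text[-1])    # the full value of the emoji's code point

# An encoding only exists on the bytes side.
utf8 = text.encode("utf-8")        # variable width: 1 to 4 bytes per code point
utf32 = text.encode("utf-32-le")   # fixed width: 4 bytes per code point, no BOM

# Decoding with the matching codec gets the original str back.
round_trip = utf8.decode("utf-8")

print(length, len(utf8), len(utf32), hex(last_code_point))
```

The same six code points take 10 bytes as UTF-8 and 24 bytes as
UTF-32, but the str itself behaves identically either way.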

I didn't see who you were quoting, but it looks like our old "Py3's
Unicode is buggy" troll is back (or maybe he never left, he just got
banned from the mailing list). The ONLY bugginess he has ever shown is
a performance regression on a very specific sort of operation where
the old (Py2) behaviour was *actually* buggy. Take no notice of him;
he is either trolling or somehow deluded, and either way, he never
listens to people's responses. Waste none of your time on him.

ChrisA


