Handle foreign character web input

Richard Damon Richard at Damon-Family.org
Sat Jun 29 13:25:22 EDT 2019


On 6/29/19 3:19 AM, Thomas Jollans wrote:
> On 28/06/2019 22:25, Tobiah wrote:
>> A guy comes in and enters his last name as RÖnngren.
> With a capital Ö in the middle? That's unusual.
>>
>> So what did the browser really give me; is it encoded
>> in some way, like latin-1?  Does it depend on whether
>> the name was cut and pasted from a Word doc. etc?
>> Should I handle these internally as unicode?  Right
>> now my database tables are latin-1 and things seem
>> to usually work, but not always.
>
>
> If your database is using latin-1, German and French names will work,
> but Croatian and Polish names often won't. Not to mention people using
> other writing systems.
>
> So Günther and François are ok, but Bolesław turns into Boles?aw and
> don't even think about anybody called Владимир or محمد. 

I would say that currently, the only real reason to use an encoding
other than a Unicode one (normally UTF-8) is historical inertia. A
field that will only ever hold plain ASCII characters could perhaps use
ASCII (such a field would never hold real natural-language words, only
computer-generated codes). The various 'codepages' were useful in
their day, when machines were less capable and Unicode hadn't been
invented, wasn't well supported, or was too expensive to use.
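
As a small illustration of the latin-1 limitation Thomas describes
(only a sketch; the helper name fits_in is made up for this example
and is not from anyone's actual code), Python can show which of the
names survive each encoding:

```python
names = ["Günther", "François", "Bolesław", "Владимир", "محمد"]

def fits_in(text, encoding):
    """Hypothetical helper: True if every character of text can be
    represented in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Names that latin-1 can store without loss
latin1_ok = [n for n in names if fits_in(n, "latin-1")]

# Names that UTF-8 can store without loss (it covers all of Unicode)
utf8_ok = [n for n in names if fits_in(n, "utf-8")]

# errors="replace" substitutes "?" for each unencodable character,
# which is one way a name can end up silently damaged in storage
lossy = "Bolesław".encode("latin-1", errors="replace")

print(latin1_ok)
print(utf8_ok == names)
print(lossy)
```

Whether a given database or driver replaces, rejects, or mangles such
characters depends on its configuration, so the "?" substitution here
is only one possible outcome.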

Now (as I understand it), all Python 3 strings (str) are Unicode
internally; if you need text in some particular encoding, it has to be
held as bytes.
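
A minimal sketch of that str/bytes split, using the name from the
original question (the variable names are just for illustration, and
which encoding a real browser submits depends on the page and form
settings, so the codecs below are assumptions):

```python
name = "RÖnngren"  # str: a sequence of Unicode code points

# bytes: the same text in two different concrete encodings
utf8_bytes = name.encode("utf-8")
latin1_bytes = name.encode("latin-1")

# Ö takes two bytes in UTF-8 but a single byte in latin-1
print(len(name), len(utf8_bytes), len(latin1_bytes))

# Decoding with the matching codec restores the original str
assert utf8_bytes.decode("utf-8") == name
assert latin1_bytes.decode("latin-1") == name

# Decoding UTF-8 bytes as latin-1 does not fail, it just yields
# the wrong characters (mojibake)
mojibake = utf8_bytes.decode("latin-1")
print(mojibake == name)
```

The last step is a plausible source of "usually works, but not
always" behaviour: pure ASCII names are byte-identical in both
encodings, so a mismatch only shows up on names with accented
characters.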

-- 
Richard Damon



