Handle foreign character web input

Jon Ribbens jon+usenet at unequivocal.eu
Sat Jun 29 16:15:02 EDT 2019


On 2019-06-28, Chris Angelico <rosuav at gmail.com> wrote:
> On Sat, Jun 29, 2019 at 6:31 AM Tobiah <toby at tobiah.org> wrote:
>> A guy comes in and enters his last name as RÖnngren.
>>
>> So what did the browser really give me; is it encoded
>> in some way, like latin-1?  Does it depend on whether
>> the name was cut and pasted from a Word doc. etc?
>> Should I handle these internally as unicode?  Right
>> now my database tables are latin-1 and things seem
>> to usually work, but not always.
>
> Definitely handle them as Unicode. You'll receive them in some
> encoding, probably UTF-8, and it depends on the browser.

You can basically assume it is the encoding used by the page the form
was on, which is a good reason to always explicitly specify UTF-8
encoding on HTML pages.
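
On the server side that might look roughly like the sketch below. It
assumes you have the raw bytes of the submitted field and that the form
page was served as UTF-8; 'decode_field' is a made-up helper name, not
part of any framework, and the cp1252 fallback is only a guess at what a
legacy page might have sent.

```python
def decode_field(raw: bytes) -> str:
    """Decode one submitted form field, assuming the page was UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: bytes that are not valid UTF-8 came from a legacy
        # page. cp1252 is a guess, not a guarantee; undecodable bytes
        # become U+FFFD rather than raising.
        return raw.decode("cp1252", errors="replace")
```

Most web frameworks do the decoding for you, in which case the only
thing to get right is the charset declared on the page itself.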

>> Also, what do people do when searching for a record.
>> Is there some way to get 'Ronngren' to match the other
>> possible foreign spellings?
>
> Ehh....... probably not. That's a human problem, not a programming
> one. Best of luck.

And yet there are many programs which attempt to solve it. The Python
module 'unidecode' will make a decent stab at it if the language is
vaguely European. Certainly, storing the UTF-8 string and also the
'unidecoded' ASCII string and searching on both is unlikely to hurt
and will often help. Additionally, using Metaphone or a similar
phonetic algorithm will probably also help.
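
To illustrate the "store both and search on both" idea, here is a rough
sketch using only the standard library. Note that it does not use
'unidecode': it swaps in 'unicodedata' accent stripping as a cruder
stand-in, and 'ascii_fold' and 'search_keys' are names invented for this
example.

```python
import unicodedata


def ascii_fold(name: str) -> str:
    """Crude stand-in for unidecode: strip accents, drop the rest."""
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Anything still non-ASCII is silently dropped here.
    return stripped.encode("ascii", "ignore").decode("ascii")


def search_keys(name: str):
    """Return (original, ASCII-folded) keys, both case-folded."""
    return name.casefold(), ascii_fold(name).casefold()
```

You would store both keys alongside the record and apply the same
function to the search term, matching on either one. Letters that have no
decomposition (such as 'ø' or 'ß') are simply lost by this version, which
is one reason to prefer 'unidecode' itself if you can install it.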


