Handle foreign character web input

Sat Jun 29 16:15:02 EDT 2019

On 2019-06-28, Chris Angelico <rosuav at gmail.com> wrote:
> On Sat, Jun 29, 2019 at 6:31 AM Tobiah <toby at tobiah.org> wrote:
>> A guy comes in and enters his last name as RÖnngren.
>>
>> So what did the browser really give me; is it encoded
>> in some way, like latin-1?  Does it depend on whether
>> the name was cut and pasted from a Word doc. etc?
>> Should I handle these internally as unicode?  Right
>> now my database tables are latin-1 and things seem
>> to usually work, but not always.
>
> Definitely handle them as Unicode. You'll receive them in some
> encoding, probably UTF-8, and it depends on the browser.

You can basically assume it is the encoding used by the page that
contained the form, which is a good reason to always explicitly
specify UTF-8 encoding on HTML pages.
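
As a rough sketch of what that means on the server side, assuming you
can get at the raw bytes of the submitted field: try the encoding the
page declared, and fall back to Windows-1252 for the odd client that
sends something else. The function name and the choice of fallback are
my own assumptions, not anything a particular web framework provides.

```python
def decode_form_value(raw: bytes, declared: str = "utf-8") -> str:
    """Decode a submitted form field into a Python str (Unicode).

    `declared` is the encoding the HTML page was served with.
    The cp1252 fallback is an assumption for illustration; it suits
    text pasted from Windows applications but is only a guess.
    """
    try:
        return raw.decode(declared)
    except UnicodeDecodeError:
        # Not valid in the declared encoding: fall back, and replace
        # anything that still cannot be mapped instead of raising.
        return raw.decode("cp1252", errors="replace")


# Bytes as a UTF-8 page would submit them.
print(decode_form_value("Rönngren".encode("utf-8")))
# Bytes as a latin-1 client might submit them.
print(decode_form_value("Rönngren".encode("latin-1")))
```

Once the value is a str, keep it that way internally and only encode
again at the edges (database, output).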

>> Also, what do people do when searching for a record.
>> Is there some way to get 'Ronngren' to match the other
>> possible foreign spellings?
>
> Ehh....... probably not. That's a human problem, not a programming
> one. Best of luck.

And yet there are many programs which attempt to solve it. The Python
module 'unidecode' will make a decent stab at it if the language is
vaguely European. Certainly, storing the UTF-8 string alongside the
'unidecoded' ASCII string and searching on both is unlikely to hurt
and will often help. Adding a phonetic algorithm such as Metaphone
will probably help as well.
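
To illustrate the "search on both" idea, here is a minimal sketch. It
uses the standard library's unicodedata as a stand-in for unidecode so
that it runs without installing anything; it only strips accents, so
characters with no decomposition (such as 'ø') are simply dropped,
which is exactly the sort of case a proper transliteration module is
meant to cover. The function names are hypothetical.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Crude ASCII folding: decompose, then drop combining marks.

    A stand-in for unidecode, not a replacement for it.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Discard whatever is still not ASCII after decomposition.
    return stripped.encode("ascii", "ignore").decode("ascii")


def find_matches(query: str, names: list[str]) -> list[str]:
    """Return names matching the query exactly or after folding."""
    exact = query.casefold()
    folded = ascii_fold(query).casefold()
    return [
        name for name in names
        if name.casefold() == exact
        or ascii_fold(name).casefold() == folded
    ]


names = ["Rönngren", "Smith"]
print(find_matches("Ronngren", names))
print(find_matches("rönngren", names))
```

In a real database you would store the folded form in its own indexed
column when the record is written, rather than folding every row at
search time.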