[I18n-sig] basic question about collation strategies

kirby urner kirby.urner at gmail.com
Sun May 10 19:29:06 CEST 2015


My nonprofit is only beginning to address
non-Latin-1 characters in full names in
corporate listings.  My current plan is to
allow a Full_name field in any script, e.g.
Devanagari, but then require at least a
single letter A-Z in each of the Last and
First name fields.  Some examples:

https://flic.kr/p/sqV14G
(using religious types from Wikipedia
for pseudo-records)

Although I've worked in libraries which
have alphabetization worked out across
multiple languages (I could return Arabic
titles to their proper place in my heyday),
I am less sure of how Unicode handles
collation across all language boundaries.

It seemed easier to use the Roman alphabet
to force a simple last, first collation, whereas
Full_name is not used for collation at all and
may be in any script supported by Unicode.
Since Roman letters have phonetic value, one
looks up a Full_name by how one would sound
it out in "Romaji" (the Japanese name for
Roman-letter scripts, the same letters used in
Python's keywords and Standard Library).
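
To make the plan concrete, here is a minimal sketch
of what I have in mind.  The class name Listing, the
ROMAN_ONLY pattern and the sample records are all
made up for illustration; only the three field names
come from the plan above.

```python
import re
from dataclasses import dataclass

# Assumed validation rule: one or more ASCII letters A-Z, either case.
# A single initial such as "D" is allowed.
ROMAN_ONLY = re.compile(r"[A-Za-z]+")


@dataclass
class Listing:
    full_name: str  # any Unicode script; displayed, never used for sorting
    last: str       # Roman letters only; primary sort field
    first: str      # Roman letters only; secondary sort field

    def __post_init__(self):
        # Reject Last/First values containing anything other than A-Z.
        for field_name in ("last", "first"):
            value = getattr(self, field_name)
            if not ROMAN_ONLY.fullmatch(value):
                raise ValueError(
                    f"{field_name} must be letters A-Z only: {value!r}"
                )


def sort_key(rec):
    # Case-insensitive "last, first" ordering; Full_name is ignored.
    return (rec.last.upper(), rec.first.upper())


# Illustrative pseudo-records, not real data.
records = [
    Listing("鈴木 大拙", "Suzuki", "D"),
    Listing("मोहनदास गांधी", "Gandhi", "Mohandas"),
    Listing("Thomas Merton", "Merton", "Thomas"),
]

ordered = sorted(records, key=sort_key)
for rec in ordered:
    print(f"{rec.last}, {rec.first}: {rec.full_name}")
```

Because the sort key only ever sees A-Z, plain string
comparison is enough here and no locale-aware collation
is involved.  That is exactly the simplification I am
asking about.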

Is there an industry standard I should know
about, and is my simplification of alphabetical
searching an accepted strategy?

Kirby