[Python-Dev] unicode alphanumerics

Finn Bock bckfnn@worldonline.dk
Mon, 03 Jul 2000 17:06:05 GMT


[M.-A. Lemburg]

>"M.-A. Lemburg" wrote:
>> 
>> Fredrik Lundh wrote:
>> > how about this plan:
>> >
>> > -- you add a Py_UNICODE_ALPHA to unicodeobject.h asap,
>> >    which does exactly that (or I can do that, if you prefer).
>> >    (and maybe even a Py_UNICODE_ALNUM)
>> 
>> Ok, I'll add Py_UNICODE_ISALPHA and Py_UNICODE_ISALNUM
>> (first with approximations of the sort you give above and
>> later with true implementations using tables in unicodectype.c)
>> on Monday... gotta run now.
>> 
>> > -- I change SRE to use that asap.
>> >
>> > -- you, I, or someone else add a better implementation,
>> >    some other day.
>
>I've just looked into this... the problem here is what to
>consider as being "alpha" and what "numeric". 
>
>I could add two new tables for the characters with category 'Lo'
>(other letters, not cased) and 'Lm' (letter modifiers)
>to match all letters in the Unicode database, but those
>tables have some 5200 entries (note that there are only 804 lower
>case letters and 686 upper case ones).
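
To get a feel for the table sizes being discussed, here is a minimal sketch that counts code points per letter category using Python's standard unicodedata module. The function name count_letter_categories is made up for illustration. The numbers it prints depend on the Unicode version bundled with the interpreter, and unicodedata expands ranges such as the CJK ideographs into individual code points, so they will not match the figures quoted above.

```python
import sys
import unicodedata
from collections import Counter


def count_letter_categories():
    """Count code points in each Unicode letter category (Lu, Ll, Lt, Lm, Lo)."""
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        cat = unicodedata.category(chr(cp))
        # All letter categories start with "L"; everything else is skipped.
        if cat.startswith("L"):
            counts[cat] += 1
    return counts


if __name__ == "__main__":
    for cat, n in sorted(count_letter_categories().items()):
        print(cat, n)
```

The uncased categories 'Lo' and 'Lm' are the ones that would need the two extra tables.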

In JDK 1.3, Character.isLetter(..), Character.isDigit(..) and
Character.isLetterOrDigit(..) are documented at:

  http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.html#isLetter(char)
  http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.html#isDigit(char)
  http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.html#isLetterOrDigit(char)

I guess that Java uses the extra-large tables, i.e. ones that also
cover the 'Lo' and 'Lm' categories.
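
For comparison, a category-based definition along those lines could be sketched in Python as below. This is only an illustration, not the Java implementation: it assumes that "letter" means general category Lu, Ll, Lt, Lm or Lo and "digit" means Nd, which is how I read the Java documentation, and the function names are made up here.

```python
import unicodedata

# Assumption: a "letter" is any code point in one of these general categories.
LETTER_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo"})
# Assumption: a "digit" is a decimal digit number (Nd) only.
DIGIT_CATEGORY = "Nd"


def is_letter(ch):
    """Return True if ch is a single character in a letter category."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


def is_digit(ch):
    """Return True if ch is a single character in category Nd."""
    return unicodedata.category(ch) == DIGIT_CATEGORY


def is_letter_or_digit(ch):
    """Return True if ch is a letter or a decimal digit."""
    return is_letter(ch) or is_digit(ch)
```

With these definitions an uncased letter such as a CJK ideograph counts as a letter, which is exactly the case the cased-only tables would miss.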

regards,
finn