Re: How to allow special character's like ï, ù, acute e etc...

John Machin sjmachin at lexicon.net
Tue Sep 5 19:08:04 EDT 2006


sonald wrote:
> Dear All,
> I am working on a module that validates the provided CSV data in a text
> format, which must be in a predefined format.
> We check for the :
>
[snip]
>
> 3. valid-text expressions,
>         Example:
>             ValidText('Minor', '[yYnN]')
>
>     Parameters:
>             name    => field name
>             regex   => the regular expression y/Y for Yes & n/N for No
>
> Recently we are getting data, where, the name contains non-english
> characters like: ' ATHUMANIù ', ' LUCIANA S. SENGïONGO '...etc

The offending characters are (unusually) lowercase in otherwise
uppercase strings; is this actual data or are you typing what you think
you see instead of copy/paste?

>
> Using the Text function, these names are not validated as they contain
> special characters or non-english characters (ï,ù). But the data is
> correct.

It would help a great deal if you told us (1) which regex you are
using, (2) what encoding you believe or know the data is written in,
and (3) whether your app calls locale.setlocale() at start-up. If the
following guesses are wrong, please say so.

Guess (1): either (a) you are using the pattern "[A-Za-z]" to check for
alphabetic characters, or (b) you are using the "\w" pattern to check
for alphanumeric characters and then using "[\d_]" to reject digits and
underscores.
Guess (2): "cp1252" or "latin1" or "unknown" -- all pretty much
equivalent :-)
Guess (3): No.

If guess (1b) is correct: the so-called "special" characters are not
being interpreted as alphabetic because, with neither the LOCALE nor
the UNICODE flag set, "\w" in the re module matches ASCII characters
only. Here is what the re docs have to say:
"""
\w
When the LOCALE and UNICODE flags are not specified, matches any
alphanumeric character and the underscore; this is equivalent to the
set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus
whatever characters are defined as alphanumeric for the current locale.
If UNICODE is set, this will match the characters [0-9_] plus whatever
is classified as alphanumeric in the Unicode character properties
database.
"""

If you are not using (1b) or something like it, you need to move in
that direction.
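
To make the docs excerpt concrete, here is a minimal sketch of what the
flag changes. It is written for Python 3, which is an assumption on my
part: there, str patterns use Unicode matching by default and re.ASCII
reproduces the flagless behaviour described in the excerpt. The sample
name is one of those quoted earlier.

```python
import re

name = "ATHUMANI\u00f9"  # last character is u-grave (ù)

# ASCII-only matching: \w is [a-zA-Z0-9_], so the accented letter fails.
ascii_match = re.fullmatch(r"\w+", name, re.ASCII)

# Unicode matching: \w follows the Unicode character properties.
unicode_match = re.fullmatch(r"\w+", name, re.UNICODE)

print(ascii_match is None)
print(unicode_match is not None)
```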

Please bear this in mind: the locale is meant to be an
attribute/property of the *user* of your application; it is *not* meant
to be an attribute of the input data. Read the docs of the locale
module -- switching locales on the fly is *not* a good idea.

> Is there any function that can allow such special character's but not
> numbers...?

The righteous way of handling your problem is:
(1) decode each field in the incoming 8-bit string data to Unicode,
using what you know/guess to be the correct encoding setting. Then
string methods like isalpha() and isdigit() will use the Unicode
character properties and your "special" characters will be recognised
for what they are.
(2) use the UNICODE flag in re.
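
A rough sketch of those two steps, in Python 3 syntax (an assumption;
adjust for your version). The helper name is_alpha_field, the constant
LETTERS_ONLY and the default encoding "cp1252" are made up for
illustration; substitute whatever encoding you actually know the data
to be in.

```python
import re

# Hypothetical pattern: letters only. [^\W\d_] reads as "a \w character
# that is neither a digit nor an underscore".
LETTERS_ONLY = re.compile(r"[^\W\d_]+", re.UNICODE)


def is_alpha_field(raw: bytes, encoding: str = "cp1252") -> bool:
    """Decode an 8-bit field, then check that it is purely alphabetic."""
    text = raw.decode(encoding)
    return LETTERS_ONLY.fullmatch(text) is not None


raw_field = b"ATHUMANI\xf9"      # 0xF9 is u-grave in cp1252
text = raw_field.decode("cp1252")

print(text.isalpha())            # str.isalpha() uses Unicode properties
print(is_alpha_field(raw_field))
print(is_alpha_field(b"ATHUMANI9"))
```

This version rejects spaces and punctuation as well, so a real name
field would need a wider pattern.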

>
> Secondly, If I were to get the data in Russian text, are there any
> (lingual) packages available so that i can use the same module for
> validation.

If you are getting the data as 8-bit strings, then the above approach
should still "work" at the basic level ... you decode it using 'cp1251'
or whatever, and the Cyrillic letter equivalents of "Ivanov" would pass
muster as alphabetic.
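
For illustration only (Python 3 syntax assumed), here is the same check
on Cyrillic bytes. The second half shows why the choice of encoding is
on you: decoding with the wrong single-byte codec raises no error for
this input, it just yields different letters that still pass a basic
alphabetic test.

```python
# Stand-in for an 8-bit field read from a cp1251 file.
raw = "Иванов".encode("cp1251")

text = raw.decode("cp1251")
print(text.isalpha())

# Wrong guess: the same bytes are valid cp1252 too, so nothing complains.
wrong = raw.decode("cp1252")
print(wrong == text)
print(wrong.isalpha())
```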

> Such that I just have to import the package and the module can be used
> for validating russian text or japanese text....

Chinese, Japanese and Korean ("CJK") names are written natively in
characters that are not alphabetic in the linguistic sense. The number
of characters that could possibly be written in a name is rather large.
However the CJK characters are classified as Unicode category "Lo"
(Letter, other) and do actually match \w in re.
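
A quick way to check this yourself, again assuming Python 3. The sample
name is my own choice, not from the original data.

```python
import re
import unicodedata

name = "\u5c71\u7530"  # two CJK ideographs, a common Japanese family name

# Look up the Unicode general category of each character.
categories = [unicodedata.category(ch) for ch in name]
print(categories)

# \w accepts them under Unicode matching.
print(re.fullmatch(r"\w+", name, re.UNICODE) is not None)
```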

So with a minimal amount of work, you can provide a basic level of
validation across the board. Anything fancier needs local knowledge
[not a c.l.py topic].

Some points for consideration:
(1) You may wish not to reject digits irrevocably -- some jurisdictions
do permit people to change their legal name to "4567" or whatever.
(2) You are of course allowing space, hyphen and apostrophe as valid
characters in "English" names e.g. "mac Intyre", "O'Brien-Smith". Bear
in mind that other punctuation characters may be valid in other
languages -- see "local knowledge" above.
(3) If you are given data encoded as utf16* or utf32, you won't be able
to use the csv module (neither the ObjectCraft one nor the Python one
(read the docs)) directly. You will need to recode the file as UTF8,
read it using the csv module, and *then* decode each text field from
utf8.
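
On point (3): the recode-then-decode routine applies to the csv module
as it stood when this was written. As a hedged sketch of the equivalent
in Python 3 (an assumption), the csv module there works on text, so the
decoding can happen once in the stream wrapper. The in-memory bytes
below stand in for a real file.

```python
import csv
import io

# Stand-in for the contents of a UTF-16 encoded CSV file (BOM included).
utf16_bytes = "name,minor\r\nLUCIANA S. SENGÏONGO,N\r\n".encode("utf-16")

# Decode while reading; newline="" is what the csv docs recommend.
stream = io.TextIOWrapper(
    io.BytesIO(utf16_bytes), encoding="utf-16", newline=""
)
rows = list(csv.reader(stream))
print(rows)
```

With a file on disk, open(path, encoding="utf-16", newline="") plays the
role of the wrapper.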

HTH,
John



