[Python-Dev] Regular expressions, Unicode etc.

Wed Aug 8 23:54:06 CEST 2007

>> In the mediate term, locale-based testing will go away/be not
>> implementable (in particular, Py3k won't have a byte-oriented
>> character string type, so we can't use isprint). In general,
>> isprint is unsuitable since it doesn't support multi-byte
>> character sets.
> 
> Well, iswprint isn't so restricted :-) 

Yes. However, it is even more difficult to convert from
Py_UNICODE to wchar_t in general.

> I don't see the relevance
> of this, as EXACTLY the same problem applies to isalnum and \w.

There is no problem for isalnum: it will just go away if
byte-oriented characters go away. Fortunately, we have a
replacement for the Unicode case.

The relevance is that your specification of "printing character"
as "isprint returns true" is nearly useless, as it only applies
to byte-oriented characters.

> If you can solve one problem (and you have to solve the latter),
> you can solve the other.

Unicode-isalnum is defined as isalpha|isdecimal|isdigit|isnumeric.
isalpha means categories Ll, Lu, Lt, Lo, Lm. isdecimal means the
character has the decimal property. isdigit means the character has
the digit property. isnumeric means the character has the numeric
property.
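
As a rough illustration of that definition, here is a minimal sketch
built on the standard unicodedata module. The helper names (is_alpha,
is_decimal, is_digit, is_numeric, is_alnum) are hypothetical and chosen
only for this example; it is not the actual implementation of the
Unicode string methods, just the rule above spelled out.

```python
import unicodedata

# Letter categories that count as alphabetic under the definition above.
ALPHA_CATEGORIES = {"Ll", "Lu", "Lt", "Lo", "Lm"}


def is_alpha(ch):
    """True if the character's general category is a letter category."""
    return unicodedata.category(ch) in ALPHA_CATEGORIES


def is_decimal(ch):
    """True if the character has the decimal property."""
    return unicodedata.decimal(ch, None) is not None


def is_digit(ch):
    """True if the character has the digit property."""
    return unicodedata.digit(ch, None) is not None


def is_numeric(ch):
    """True if the character has the numeric property."""
    return unicodedata.numeric(ch, None) is not None


def is_alnum(ch):
    """isalpha | isdecimal | isdigit | isnumeric, as described above."""
    return is_alpha(ch) or is_decimal(ch) or is_digit(ch) or is_numeric(ch)
```

For example, is_alnum("\u00bd") (VULGAR FRACTION ONE HALF) is true only
because of the numeric property, while is_alnum(" ") is false.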

>> Can you please explain the concept of "printing character"? If
>> you have a Unicode code point, how do you determine whether it
>> is printing? If rendering it would generate black pixels on white
>> background?
> 
> Eh?  This is a character set we are talking about.  The proposed
> extensions to include font and colour are an aberration that I shall
> thankfully be long retired before they hit.

It was a proposal for a definition. English is not my native
language, and "printing character" means nothing to me. So
I kindly asked for a definition, and suggested one possibility.
I would not have guessed that you consider white-space characters
as "printing", as they don't actually print anything.

> The point about an escape for printing characters is to check
> for bad characters in text input, and the rule I mentioned is
> fine for that.  What's the problem with it?

The problem is that you did not quite mention a rule, or else
I missed it.

You seem to be asking for a way to express "not a control
character". I propose that this is best done with the UTS#18
property syntax, in which you would write

  [\P{C}] # or \P{Other}

If this is what you want, I'm all in favor of having it
implemented.
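
Until such a syntax exists in the re module, the same test can be
approximated outside of regular expressions. The following is a
minimal sketch, assuming that "bad character" means any code point
whose general category is in the Other group (Cc, Cf, Cs, Co, Cn),
which is what \P{C} excludes. The function names are hypothetical.
Note that under this rule tab and newline are also reported, since
both are in category Cc.

```python
import unicodedata


def is_other(ch):
    """True if the character is in the Unicode 'Other' group (C*)."""
    return unicodedata.category(ch).startswith("C")


def find_bad_characters(text):
    """Return (index, character) pairs for every 'Other' character."""
    return [(i, ch) for i, ch in enumerate(text) if is_other(ch)]


if __name__ == "__main__":
    sample = "good text\x07 with a bell"
    for index, ch in find_bad_characters(sample):
        # Print the code point rather than the raw character.
        print("bad character U+%04X at index %d" % (ord(ch), index))
```

A caller that wants to allow line breaks would have to whitelist them
explicitly before applying the check.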

Regards,
Martin