[Python-Dev] Regular expressions, Unicode etc.

Wed Aug 8 23:16:37 CEST 2007

"Martin v. Löwis" <martin at v.loewis.de> wrote:
> 
> >> Before discussing the escape, I'd like to see a specification of
> >> it first - what characters precisely would classify as "printing"?
> > 
> > For basic ASCII and locale-based testing, whatever isprint() says.
> > Just as for isalpha().
> 
> In the medium term, locale-based testing will go away or become
> unimplementable (in particular, Py3k won't have a byte-oriented
> character string type, so we can't use isprint). In general,
> isprint is unsuitable since it doesn't support multi-byte
> character sets.

Well, iswprint isn't so restricted :-)  I don't see the relevance
of this, as EXACTLY the same problem applies to isalnum and \w.
If you can solve one problem (and you have to solve the latter),
you can solve the other.

> > For Unicode, whatever people agree!  I use the criterion that it
> > has a defined category that doesn't start with 'C' - which is what
> > I think that most people will accept.
> 
> -1. There must be a better specification than that.
> 
> Can you please explain the concept of "printing character"? If
> you have a Unicode code point, how do you determine whether it
> is printing? If rendering it would generate black pixels on white
> background?

Eh?  This is a character set we are talking about.  The proposed
extensions to include font and colour are an aberration that I shall
thankfully be long retired before they hit.

Unicode has a two letter classification of each character, with
the main category being in upper case and the subsidiary one in
lower.  Let's ignore the latter, as it is irrelevant.  The main
categories are 'Z' (spaces), 'L' (letters), 'N' (numbers),
'S' (symbols), 'P' (punctuation), 'M' (marks) and 'C' (control
characters).
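
To make the classification concrete, here is a rough sketch that
looks up the two-letter category with Python's standard unicodedata
module and keeps only the upper-case main letter.  The helper name
main_category is made up for illustration; it is not part of any
existing or proposed API.

```python
import unicodedata

def main_category(ch):
    """Return the upper-case main category letter for a one-character string."""
    # unicodedata.category() returns the two-letter code, e.g. 'Lu' or 'Cc';
    # the first letter is the main category, the second the subsidiary one.
    return unicodedata.category(ch)[0]

# One sample per main category: letter, number, space, symbol,
# punctuation, combining mark, control.
for ch in ["A", "7", " ", "+", "!", "\u0301", "\n"]:
    print(repr(ch), unicodedata.category(ch), main_category(ch))
```

The samples are only one character per main category, chosen to show
that the subsidiary letter can be dropped for this purpose.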

There are some pretty weird entries in 'L' and 'N' and the
difference between 'S', 'P' and 'M' is arcane, to a degree.  But
all of the categories except 'C' are things that display, and
'C' is mainly the ASCII controls we know and, er, love - with
some similar extras.

Obviously, unclassified characters should not be called printing,
and equally obviously controls shouldn't.  There is no clear
reason why the others should not be - especially as the difference
between a modifying accent and a free-standing one is something
so obscure that most people don't even know that there IS one.

The point about an escape for printing characters is to check
for bad characters in text input, and the rule I mentioned is
fine for that.  What's the problem with it?
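
As a minimal sketch of that rule, assuming Python's standard
unicodedata module: the names is_printing and bad_characters are
invented for this example, and nothing here is a proposed
implementation of the escape.  unicodedata reports unassigned code
points as 'Cn', so a single test on the leading 'C' covers both the
unclassified characters and the controls.

```python
import unicodedata

def is_printing(ch):
    """True if ch has a category that does not start with 'C'."""
    # Unassigned code points come back as 'Cn', so they are rejected too.
    return not unicodedata.category(ch).startswith("C")

def bad_characters(text):
    """Return (index, character) pairs for every non-printing character."""
    return [(i, ch) for i, ch in enumerate(text) if not is_printing(ch)]

if __name__ == "__main__":
    # The BEL character at index 3 is the only one flagged here.
    print(bad_characters("abc\x07d"))
```

Note that under this rule newline and tab are controls as well, so
anything checking whole lines of input would have to allow them
separately.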

Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email:  nmm1 at cam.ac.uk
Tel.:  +44 1223 334761    Fax:  +44 1223 334679