languages with full unicode support

Dr.Ruud rvtol+news at isolution.nl
Sat Jul 1 06:51:27 EDT 2006


Chris Uppal wrote:

> Since the interpretation of characters which are yet to be added to
> Unicode is undefined (will they be digits, "letters", operators,
> symbol, punctuation.... ?), there doesn't seem to be any sane way
> that a language could allow an unrestricted choice of Unicode in
> identifiers.

The Perl code below counts, per character class, how many code points
match, and prints:

xdigit
    22 /194522 =  0.011%  (lower:     6, upper:     6)
ascii
   128 /194522 =  0.066%  (lower:    26, upper:    26)
\d
   268 /194522 =  0.138%
digit
   268 /194522 =  0.138%
IsNumber
   612 /194522 =  0.315%
alpha
 91183 /194522 = 46.875%  (lower:  1380, upper:  1160)
alnum
 91451 /194522 = 47.013%  (lower:  1380, upper:  1160)
word
 91801 /194522 = 47.193%  (lower:  1380, upper:  1160)
graph
102330 /194522 = 52.606%  (lower:  1380, upper:  1160)
print
102349 /194522 = 52.616%  (lower:  1380, upper:  1160)
blank
    18 /194522 =  0.009%
space
    24 /194522 =  0.012%
punct
   374 /194522 =  0.192%
cntrl
  6473 /194522 =  3.328%
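
The denominator 194522 is simply the number of code points the script
walks through. As a cross-check, here is a minimal Python sketch (not
part of the Perl script; RANGES, total_code_points and percentage are
names made up for this illustration) that recomputes it from the same
ranges, along with the percentage shown for 'word':

```python
# Inclusive code point ranges, copied from @codepoints in the Perl script.
RANGES = [
    (0x0000, 0xD7FF),
    (0xE000, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0x1FFFD),
    (0x20000, 0x2FFFD),
]


def total_code_points(ranges):
    """Number of code points covered by inclusive (first, last) pairs."""
    return sum(last - first + 1 for first, last in ranges)


def percentage(count, total):
    """Share of count in total, as a percentage."""
    return 100 * count / total


if __name__ == "__main__":
    total = total_code_points(RANGES)
    print(total)                               # 194522
    print(f"{percentage(91801, total):.3f}%")  # 47.193%
```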


Look especially at 'word', which is the same as \w: for ASCII that is
just [0-9A-Za-z_], yet over these code points it matches 91801
characters (about 47%).
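
The same ASCII-versus-Unicode contrast for \w can be sketched in
Python 3, where \w on str patterns is Unicode-aware by default and
re.ASCII restricts it to [0-9A-Za-z_]. This is only an illustration
under those assumptions: RANGES and count_matches are hypothetical
names, and the Unicode count depends on the Unicode version the
interpreter ships with, so it need not equal Perl's figure.

```python
import re

# Inclusive code point ranges, copied from @codepoints in the Perl script.
RANGES = [
    (0x0000, 0xD7FF),
    (0xE000, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0x1FFFD),
    (0x20000, 0x2FFFD),
]

WORD = re.compile(r"\w")                  # Unicode-aware for str patterns
ASCII_WORD = re.compile(r"\w", re.ASCII)  # restricted to [0-9A-Za-z_]


def count_matches(pattern, ranges):
    """Count the code points in ranges whose character matches pattern."""
    return sum(
        1
        for first, last in ranges
        for cp in range(first, last + 1)
        if pattern.match(chr(cp))
    )


if __name__ == "__main__":
    print("ascii \\w  :", count_matches(ASCII_WORD, RANGES))  # 63
    print("unicode \\w:", count_matches(WORD, RANGES))  # version-dependent
```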


==8<===================
#!/usr/bin/perl
# Program-Id: unicount.pl
# Subject: show Unicode statistics

  use strict ;
  use warnings ;

  use Data::Alias ;  # CPAN module, provides alias()

  binmode STDOUT, ':utf8' ;

  my @table =
  # +--Name------+---qRegexp--------+-C-+-L-+-U-+
  (
    [ 'xdigit'   , qr/[[:xdigit:]]/ , 0 , 0 , 0 ] ,
    [ 'ascii'    , qr/[[:ascii:]]/  , 0 , 0 , 0 ] ,
    [ '\\d'      , qr/\d/           , 0 , 0 , 0 ] ,
    [ 'digit'    , qr/[[:digit:]]/  , 0 , 0 , 0 ] ,
    [ 'IsNumber' , qr/\p{IsNumber}/ , 0 , 0 , 0 ] ,
    [ 'alpha'    , qr/[[:alpha:]]/  , 0 , 0 , 0 ] ,
    [ 'alnum'    , qr/[[:alnum:]]/  , 0 , 0 , 0 ] ,
    [ 'word'     , qr/[[:word:]]/   , 0 , 0 , 0 ] ,
    [ 'graph'    , qr/[[:graph:]]/  , 0 , 0 , 0 ] ,
    [ 'print'    , qr/[[:print:]]/  , 0 , 0 , 0 ] ,
    [ 'blank'    , qr/[[:blank:]]/  , 0 , 0 , 0 ] ,
    [ 'space'    , qr/[[:space:]]/  , 0 , 0 , 0 ] ,
    [ 'punct'    , qr/[[:punct:]]/  , 0 , 0 , 0 ] ,
    [ 'cntrl'    , qr/[[:cntrl:]]/  , 0 , 0 , 0 ] ,
  ) ;

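  my @codepoints =
  (
     0x0000 ..  0xD7FF,   # BMP up to the surrogates (D800..DFFF skipped)
     0xE000 ..  0xFDCF,   # skips the noncharacters FDD0..FDEF
     0xFDF0 ..  0xFFFD,   # skips the noncharacters FFFE, FFFF
     0x10000 .. 0x1FFFD,  # plane 1, without 1FFFE, 1FFFF
     0x20000 .. 0x2FFFD,  # plane 2, without 2FFFE, 2FFFF
#    0x30000 .. 0x3FFFD, # etc.
  ) ;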

  for my $row ( @table )
  {
    # alias (not copy) the row's fields, so the counters update @table
    alias my ($name, $qrx, $count, $lower, $upper) = @$row ;

    printf "\n%s\n", $name ;

    my $n = 0 ;

    for my $codepoint ( @codepoints )
    {
      my $char = chr $codepoint ;  # int-2-char conversion
      $n++ ;

      if ( $char =~ $qrx )
      {
        $count++ ;
        $lower++ if $char =~ / [[:lower:]] /x ;
        $upper++ if $char =~ / [[:upper:]] /x ;
      }
    }

    my $show_lower_upper =
      ($lower || $upper)
      ? sprintf( "  (lower:%6d, upper:%6d)"
               , $lower
               , $upper
               )
      : '' ;

    printf "%6d /%6d =%7.3f%%%s\n"
           , $count
           , $n
           , 100 * $count / $n
           , $show_lower_upper ;
  }
__END__
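
On the quoted concern about characters that are yet to be added:
code points that are not assigned carry the general category Cn, so a
tally can set them apart from letters, numbers and so on. A rough
Python sketch of such a tally follows. It uses unicodedata general
categories rather than the POSIX classes above, so the numbers are
not directly comparable with the Perl output and they vary with the
interpreter's Unicode version. RANGES and tally_categories are
hypothetical names.

```python
import unicodedata
from collections import Counter

# Inclusive code point ranges, copied from @codepoints in the Perl script.
RANGES = [
    (0x0000, 0xD7FF),
    (0xE000, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0x1FFFD),
    (0x20000, 0x2FFFD),
]


def tally_categories(ranges):
    """Count code points per major general category (L, N, P, ...).

    Unassigned code points (Cn) are kept apart from the other C* ones.
    """
    counts = Counter()
    for first, last in ranges:
        for cp in range(first, last + 1):
            category = unicodedata.category(chr(cp))
            counts[category if category == "Cn" else category[0]] += 1
    return counts


if __name__ == "__main__":
    counts = tally_categories(RANGES)
    total = sum(counts.values())
    for name, n in sorted(counts.items()):
        print(f"{name:<2} {n:6d} /{total:6d} = {100 * n / total:7.3f}%")
```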

-- 
Affijn, Ruud

"Gewoon is een tijger."




