Where to contribute Unicode General Category encoding/decoding

Thu Dec 13 19:06:23 EST 2012

On Thu, 13 Dec 2012 07:30:57 -0800, Pander Musubi wrote:

> I was expecting PyPI. Here is the code, please advise on where to submit
> it:
>   http://pastebin.com/dbzeasyq

If anywhere, either a third-party module, or the unicodedata standard 
library module.

Some unanswered questions:

- when would somebody need this function?

- why is it called "decodeUnicodeGeneralCategory" when it 
  doesn't seem to have anything to do with decoding?

- why is the parameter "sortable" called sortable, when it
  doesn't seem to have anything to do with sorting?

If this is useful at all, it would be more useful to just expose the data 
as a dict, and forget about an unnecessary wrapper function:

from collections import namedtuple
r = namedtuple("record", "other name desc")  # better field names needed!

GC = {
    'C' : r('Other', 'Other', 'Cc | Cf | Cn | Co | Cs'),
    'Cc': r('Control', 'Control', 
            'a C0 or C1 control code'), # a.k.a. cntrl
    'Cf': r('Format', 'Format', 'a format control character'),
    'Cn': r('Unassigned', 'Unassigned', 
            'a reserved unassigned code point or a noncharacter'),
    'Co': r('Private Use', 'Private_Use', 'a private-use character'),
    'Cs': r('Surrogate', 'Surrogate', 'a surrogate code point'),
    'L' : r('Letter', 'Letter', 'Ll | Lm | Lo | Lt | Lu'),
    'LC': r('Letter, Cased', 'Cased_Letter', 'Ll | Lt | Lu'),
    'Ll': r('Letter, Lowercase', 'Lowercase_Letter', 
            'a lowercase letter'),
    'Lm': r('Letter, Modifier', 'Modifier_Letter', 'a modifier letter'),
    'Lo': r('Letter, Other', 'Other_Letter', 
            'other letters, including syllables and ideographs'),
    'Lt': r('Letter, Titlecase', 'Titlecase_Letter', 
            'a digraphic character, with first part uppercase'),
    'Lu': r('Letter, Uppercase', 'Uppercase_Letter', 
            'an uppercase letter'),
    'M' : r('Mark', 'Mark', 'Mc | Me | Mn'), # a.k.a. Combining_Mark
    'Mc': r('Mark, Spacing', 'Spacing_Mark', 
            'a spacing combining mark (positive advance width)'),
    'Me': r('Mark, Enclosing', 'Enclosing_Mark',
            'an enclosing combining mark'),
    'Mn': r('Mark, Nonspacing', 'Nonspacing_Mark', 
            'a nonspacing combining mark (zero advance width)'),
    'N' : r('Number', 'Number', 'Nd | Nl | No'),
    'Nd': r('Number, Decimal', 'Decimal_Number', 
            'a decimal digit'), # a.k.a. digit
    'Nl': r('Number, Letter', 'Letter_Number', 
            'a letterlike numeric character'),
    'No': r('Number, Other', 'Other_Number',
            'a numeric character of other type'),
    'P' : r('Punctuation', 'Punctuation',          
            'Pc | Pd | Pe | Pf | Pi | Po | Ps'), # a.k.a. punct
    'Pc': r('Punctuation, Connector', 'Connector_Punctuation', 
            'a connecting punctuation mark, like a tie'),
    'Pd': r('Punctuation, Dash', 'Dash_Punctuation', 
            'a dash or hyphen punctuation mark'),
    'Pe': r('Punctuation, Close', 'Close_Punctuation', 
            'a closing punctuation mark (of a pair)'),
    'Pf': r('Punctuation, Final', 'Final_Punctuation', 
            'a final quotation mark'),
    'Pi': r('Punctuation, Initial', 'Initial_Punctuation',
            'an initial quotation mark'),
    'Po': r('Punctuation, Other', 'Other_Punctuation', 
            'a punctuation mark of other type'),
    'Ps': r('Punctuation, Open', 'Open_Punctuation',
            'an opening punctuation mark (of a pair)'),
    'S' : r('Symbol', 'Symbol', 'Sc | Sk | Sm | So'),
    'Sc': r('Symbol, Currency', 'Currency_Symbol', 'a currency sign'),
    'Sk': r('Symbol, Modifier', 'Modifier_Symbol',
            'a non-letterlike modifier symbol'),
    'Sm': r('Symbol, Math', 'Math_Symbol', 
            'a symbol of mathematical use'),
    'So': r('Symbol, Other', 'Other_Symbol', 'a symbol of other type'),
    'Z' : r('Separator', 'Separator', 'Zl | Zp | Zs'),
    'Zl': r('Separator, Line', 'Line_Separator',
            'U+2028 LINE SEPARATOR only'),
    'Zp': r('Separator, Paragraph', 'Paragraph_Separator',
            'U+2029 PARAGRAPH SEPARATOR only'),
    'Zs': r('Separator, Space', 'Space_Separator', 
            'a space character (of various non-zero widths)'),
    }

del r

Usage is then trivially the same as normal dict and attribute access:

py> GC['Ps'].desc
'an opening punctuation mark (of a pair)'
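
As a rough sketch of how such a table might be combined with what the 
standard library already offers: unicodedata.category() returns the 
two-letter abbreviation for a character, which can then be used as a key 
into the dict. The helper name "describe" and the abbreviated four-entry 
table below are assumptions made purely for illustration; they are not 
part of the pastebin code or of unicodedata.

```python
import unicodedata
from collections import namedtuple

# Field names copied from the table above; "Record" is an illustrative name.
Record = namedtuple("Record", "other name desc")

# Abbreviated, illustrative subset of the full GC table.
GC = {
    'Lu': Record('Letter, Uppercase', 'Uppercase_Letter',
                 'an uppercase letter'),
    'Nd': Record('Number, Decimal', 'Decimal_Number', 'a decimal digit'),
    'Ps': Record('Punctuation, Open', 'Open_Punctuation',
                 'an opening punctuation mark (of a pair)'),
    'Zs': Record('Separator, Space', 'Space_Separator',
                 'a space character (of various non-zero widths)'),
}

def describe(char):
    """Hypothetical helper: return (abbreviation, long name) for one character."""
    abbr = unicodedata.category(char)  # e.g. 'Lu' for 'A'
    return abbr, GC[abbr].name

if __name__ == "__main__":
    for ch in "A7( ":
        print(repr(ch), *describe(ch))
```

With the full table in place of the subset, the same lookup should work 
for any character, since unicodedata.category() only ever returns 
two-letter abbreviations.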

-- 
Steven