Where to contribute Unicode General Category encoding/decoding

Fri Dec 14 08:07:51 EST 2012

On Friday, December 14, 2012 1:06:23 AM UTC+1, Steven D'Aprano wrote:
> On Thu, 13 Dec 2012 07:30:57 -0800, Pander Musubi wrote:
> 
> > I was expecting PyPI. Here is the code, please advise on where to submit
> > it:
> >   http://pastebin.com/dbzeasyq
> 
> If anywhere, either a third-party module, or the unicodedata standard
> library module.
> 
> Some unanswered questions:
> 
> - when would somebody need this function?
> 

When working with Unicode metadata; see below.

> - why is it called "decodeUnicodeGeneralCategory" when it
>   doesn't seem to have anything to do with decoding?

It is actually just a simple lookup table (LUT). I like your improvements below.

> - why is the parameter "sortable" called sortable, when it
>   doesn't seem to have anything to do with sorting?

The values returned are alphabetically sortable.

> If this is useful at all, it would be more useful to just expose the data
> as a dict, and forget about an unnecessary wrapper function:
> 
> from collections import namedtuple
> 
> r = namedtuple("record", "other name desc")  # better field names needed!
> 
> GC = {
>     'C' : r('Other', 'Other', 'Cc | Cf | Cn | Co | Cs'),
>     'Cc': r('Control', 'Control',
>             'a C0 or C1 control code'), # a.k.a. cntrl
>     'Cf': r('Format', 'Format', 'a format control character'),
>     'Cn': r('Unassigned', 'Unassigned',
>             'a reserved unassigned code point or a noncharacter'),
>     'Co': r('Private Use', 'Private_Use', 'a private-use character'),
>     'Cs': r('Surrogate', 'Surrogate', 'a surrogate code point'),
>     'L' : r('Letter', 'Letter', 'Ll | Lm | Lo | Lt | Lu'),
>     'LC': r('Letter, Cased', 'Cased_Letter', 'Ll | Lt | Lu'),
>     'Ll': r('Letter, Lowercase', 'Lowercase_Letter',
>             'a lowercase letter'),
>     'Lm': r('Letter, Modifier', 'Modifier_Letter', 'a modifier letter'),
>     'Lo': r('Letter, Other', 'Other_Letter',
>             'other letters, including syllables and ideographs'),
>     'Lt': r('Letter, Titlecase', 'Titlecase_Letter',
>             'a digraphic character, with first part uppercase'),
>     'Lu': r('Letter, Uppercase', 'Uppercase_Letter',
>             'an uppercase letter'),
>     'M' : r('Mark', 'Mark', 'Mc | Me | Mn'), # a.k.a. Combining_Mark
>     'Mc': r('Mark, Spacing', 'Spacing_Mark',
>             'a spacing combining mark (positive advance width)'),
>     'Me': r('Mark, Enclosing', 'Enclosing_Mark',
>             'an enclosing combining mark'),
>     'Mn': r('Mark, Nonspacing', 'Nonspacing_Mark',
>             'a nonspacing combining mark (zero advance width)'),
>     'N' : r('Number', 'Number', 'Nd | Nl | No'),
>     'Nd': r('Number, Decimal', 'Decimal_Number',
>             'a decimal digit'), # a.k.a. digit
>     'Nl': r('Number, Letter', 'Letter_Number',
>             'a letterlike numeric character'),
>     'No': r('Number, Other', 'Other_Number',
>             'a numeric character of other type'),
>     'P' : r('Punctuation', 'Punctuation',
>             'Pc | Pd | Pe | Pf | Pi | Po | Ps'), # a.k.a. punct
>     'Pc': r('Punctuation, Connector', 'Connector_Punctuation',
>             'a connecting punctuation mark, like a tie'),
>     'Pd': r('Punctuation, Dash', 'Dash_Punctuation',
>             'a dash or hyphen punctuation mark'),
>     'Pe': r('Punctuation, Close', 'Close_Punctuation',
>             'a closing punctuation mark (of a pair)'),
>     'Pf': r('Punctuation, Final', 'Final_Punctuation',
>             'a final quotation mark'),
>     'Pi': r('Punctuation, Initial', 'Initial_Punctuation',
>             'an initial quotation mark'),
>     'Po': r('Punctuation, Other', 'Other_Punctuation',
>             'a punctuation mark of other type'),
>     'Ps': r('Punctuation, Open', 'Open_Punctuation',
>             'an opening punctuation mark (of a pair)'),
>     'S' : r('Symbol', 'Symbol', 'Sc | Sk | Sm | So'),
>     'Sc': r('Symbol, Currency', 'Currency_Symbol', 'a currency sign'),
>     'Sk': r('Symbol, Modifier', 'Modifier_Symbol',
>             'a non-letterlike modifier symbol'),
>     'Sm': r('Symbol, Math', 'Math_Symbol',
>             'a symbol of mathematical use'),
>     'So': r('Symbol, Other', 'Other_Symbol', 'a symbol of other type'),
>     'Z' : r('Separator', 'Separator', 'Zl | Zp | Zs'),
>     'Zl': r('Separator, Line', 'Line_Separator',
>             'U+2028 LINE SEPARATOR only'),
>     'Zp': r('Separator, Paragraph', 'Paragraph_Separator',
>             'U+2029 PARAGRAPH SEPARATOR only'),
>     'Zs': r('Separator, Space', 'Space_Separator',
>             'a space character (of various non-zero widths)'),
>     }
> 
> del r
> 
> Usage is then trivially the same as normal dict and attribute access:
> 
> py> GC['Ps'].desc
> 'an opening punctuation mark (of a pair)'
> 

Thank you for the improvements. I have a few more dicts built in this way, such as for:
  http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
where this general category is being used. This information is useful when handling Unicode metadata.
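
To illustrate that use, here is a rough sketch only, assuming Python 3. The names GC_NAMES, category_from_record and describe are made up for this example and are not part of the pastebin code or of any existing package. GC_NAMES is a small subset of the table quoted above. The sketch reads the general category from one UnicodeData.txt record (the third semicolon-separated field) and looks up the same abbreviation with the standard unicodedata.category() function:

```python
import unicodedata

# Hypothetical subset of the general-category table quoted above:
# abbreviation -> long property value name.
GC_NAMES = {
    'Lu': 'Uppercase_Letter',
    'Ll': 'Lowercase_Letter',
    'Nd': 'Decimal_Number',
    'Ps': 'Open_Punctuation',
    'Zs': 'Space_Separator',
}


def category_from_record(line):
    """Return (code point, general category) from one UnicodeData.txt line.

    Fields are separated by semicolons; field 0 is the code point in
    hexadecimal and field 2 is the general category abbreviation.
    """
    fields = line.split(';')
    return int(fields[0], 16), fields[2]


def describe(char):
    """Return (abbreviation, long name) for a single character."""
    abbr = unicodedata.category(char)
    # Abbreviations missing from the subset above fall back to 'unknown'.
    return abbr, GC_NAMES.get(abbr, 'unknown')


# The record for U+0041 as it appears in UnicodeData.txt.
sample = '0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;'
codepoint, abbr = category_from_record(sample)
print(hex(codepoint), abbr, describe(chr(codepoint)))
```

With the full dict in place of the subset, the 'unknown' fallback would only be needed for abbreviations added by a newer Unicode version.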

I think I will approach both
  http://pypi.python.org/pypi/unicodeblocks/
and
  http://pypi.python.org/pypi/unicodescript/
to see who will adopt this.

Perhaps it would be in their mutual interest to merge their packages into something like unicodemetadata. Further ideas on this are still welcome.

Thanks for all your help,

Pander

> 
> 
> 
> 
> -- 
> 
> Steven