Where to contribute Unicode General Category encoding/decoding
Pander Musubi
pander.musubi at gmail.com
Fri Dec 14 11:22:31 EST 2012
On Friday, December 14, 2012 2:07:51 PM UTC+1, Pander Musubi wrote:
> On Friday, December 14, 2012 1:06:23 AM UTC+1, Steven D'Aprano wrote:
> > On Thu, 13 Dec 2012 07:30:57 -0800, Pander Musubi wrote:
> >
> > > I was expecting PyPI. Here is the code, please advise on where to submit
> > > it:
> > > http://pastebin.com/dbzeasyq
> >
> > If anywhere, either a third-party module, or the unicodedata standard
> > library module.
> >
> > Some unanswered questions:
> >
> > - when would somebody need this function?
>
> When working with Unicode metadata, see below.
>
> > - why is it called "decodeUnicodeGeneralCategory" when it
> >   doesn't seem to have anything to do with decoding?
>
> It is actually a simple lookup table (LUT). I like your improvements below.
>
> > - why is the parameter "sortable" called sortable, when it
> >   doesn't seem to have anything to do with sorting?
>
> The values returned are alphabetically sortable.
>
> > If this is useful at all, it would be more useful to just expose the data
> > as a dict, and forget about an unnecessary wrapper function:
> >
> > from collections import namedtuple
> >
> > r = namedtuple("record", "other name desc") # better field names needed!
> >
> > GC = {
> >     'C' : r('Other', 'Other', 'Cc | Cf | Cn | Co | Cs'),
> >     'Cc': r('Control', 'Control',
> >             'a C0 or C1 control code'), # a.k.a. cntrl
> >     'Cf': r('Format', 'Format', 'a format control character'),
> >     'Cn': r('Unassigned', 'Unassigned',
> >             'a reserved unassigned code point or a noncharacter'),
> >     'Co': r('Private Use', 'Private_Use', 'a private-use character'),
> >     'Cs': r('Surrogate', 'Surrogate', 'a surrogate code point'),
> >     'L' : r('Letter', 'Letter', 'Ll | Lm | Lo | Lt | Lu'),
> >     'LC': r('Letter, Cased', 'Cased_Letter', 'Ll | Lt | Lu'),
> >     'Ll': r('Letter, Lowercase', 'Lowercase_Letter',
> >             'a lowercase letter'),
> >     'Lm': r('Letter, Modifier', 'Modifier_Letter', 'a modifier letter'),
> >     'Lo': r('Letter, Other', 'Other_Letter',
> >             'other letters, including syllables and ideographs'),
> >     'Lt': r('Letter, Titlecase', 'Titlecase_Letter',
> >             'a digraphic character, with first part uppercase'),
> >     'Lu': r('Letter, Uppercase', 'Uppercase_Letter',
> >             'an uppercase letter'),
> >     'M' : r('Mark', 'Mark', 'Mc | Me | Mn '), # a.k.a. Combining_Mark
> >     'Mc': r('Mark, Spacing', 'Spacing_Mark',
> >             'a spacing combining mark (positive advance width)'),
> >     'Me': r('Mark, Enclosing', 'Enclosing_Mark',
> >             'an enclosing combining mark'),
> >     'Mn': r('Mark, Nonspacing', 'Nonspacing_Mark',
> >             'a nonspacing combining mark (zero advance width)'),
> >     'N' : r('Number', 'Number', 'Nd | Nl | No'),
> >     'Nd': r('Number, Decimal', 'Decimal_Number',
> >             'a decimal digit'), # a.k.a. digit
> >     'Nl': r('Number, Letter', 'Letter_Number',
> >             'a letterlike numeric character'),
> >     'No': r('Number, Other', 'Other_Number',
> >             'a numeric character of other type'),
> >     'P' : r('Punctuation', 'Punctuation',
> >             'Pc | Pd | Pe | Pf | Pi | Po | Ps'), # a.k.a. punct
> >     'Pc': r('Punctuation, Connector', 'Connector_Punctuation',
> >             'a connecting punctuation mark, like a tie'),
> >     'Pd': r('Punctuation, Dash', 'Dash_Punctuation',
> >             'a dash or hyphen punctuation mark'),
> >     'Pe': r('Punctuation, Close', 'Close_Punctuation',
> >             'a closing punctuation mark (of a pair)'),
> >     'Pf': r('Punctuation, Final', 'Final_Punctuation',
> >             'a final quotation mark'),
> >     'Pi': r('Punctuation, Initial', 'Initial_Punctuation',
> >             'an initial quotation mark'),
> >     'Po': r('Punctuation, Other', 'Other_Punctuation',
> >             'a punctuation mark of other type'),
> >     'Ps': r('Punctuation, Open', 'Open_Punctuation',
> >             'an opening punctuation mark (of a pair)'),
> >     'S' : r('Symbol', 'Symbol', 'Sc | Sk | Sm | So'),
> >     'Sc': r('Symbol, Currency', 'Currency_Symbol', 'a currency sign'),
> >     'Sk': r('Symbol, Modifier', 'Modifier_Symbol',
> >             'a non-letterlike modifier symbol'),
> >     'Sm': r('Symbol, Math', 'Math_Symbol',
> >             'a symbol of mathematical use'),
> >     'So': r('Symbol, Other', 'Other_Symbol', 'a symbol of other type'),
> >     'Z' : r('Separator', 'Separator', 'Zl | Zp | Zs'),
> >     'Zl': r('Separator, Line', 'Line_Separator',
> >             'U+2028 LINE SEPARATOR only'),
> >     'Zp': r('Separator, Paragraph', 'Paragraph_Separator',
> >             'U+2029 PARAGRAPH SEPARATOR only'),
> >     'Zs': r('Separator, Space', 'Space_Separator',
> >             'a space character (of various non-zero widths)'),
> > }
> >
> > del r
> >
> > Usage is then trivially the same as normal dict and attribute access:
> >
> > py> GC['Ps'].desc
> > 'an opening punctuation mark (of a pair)'
>
> Thank you for the improvements. I have some more dicts like this, for data such as
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
> where this general category is being used. This information is useful when handling Unicode metadata.
>
> I think I will approach both
> http://pypi.python.org/pypi/unicodeblocks/
> and
> http://pypi.python.org/pypi/unicodescript/
> to see who will adopt this.
>
> Perhaps it might be in their mutual interest to merge their packages into e.g. unicodemetadata or something similar. Further ideas on this are still welcome.
>
> Thanks for all your help,
>
> Pander
>
> > --
> > Steven

Ah, it will become a feature request for http://docs.python.org/3/library/unicodedata.html
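
For illustration, here is a minimal sketch of how such a table might be used next to the existing module. Only unicodedata.category() is a real standard library call. The names Category, label, alias and describe are hypothetical, picked just for this example, and the table is a short excerpt of the one quoted above, not the full set of categories.

```python
import unicodedata
from collections import namedtuple

# Hypothetical record type and field names; the thread itself notes that
# better field names are needed.
Category = namedtuple("Category", "label alias desc")

# Short excerpt of the table quoted above (assumption: a real module
# would carry all General_Category values).
GC = {
    "Ll": Category("Letter, Lowercase", "Lowercase_Letter",
                   "a lowercase letter"),
    "Lu": Category("Letter, Uppercase", "Uppercase_Letter",
                   "an uppercase letter"),
    "Nd": Category("Number, Decimal", "Decimal_Number", "a decimal digit"),
    "Ps": Category("Punctuation, Open", "Open_Punctuation",
                   "an opening punctuation mark (of a pair)"),
    "Zs": Category("Separator, Space", "Space_Separator",
                   "a space character (of various non-zero widths)"),
}


def describe(char):
    """Return (abbreviation, record) for one character.

    The record is None when the abbreviation is not in the excerpt.
    """
    abbr = unicodedata.category(char)  # two-letter code such as 'Lu'
    return abbr, GC.get(abbr)


if __name__ == "__main__":
    for ch in "aA1( ":
        abbr, record = describe(ch)
        print(repr(ch), abbr, record.alias if record else "(not in excerpt)")
```

Keeping the data as a plain dict means the lookup is one GC.get() call, and a missing abbreviation comes back as None instead of raising.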