[Patches] [Patch #101663] Regression test for Unicode database

Tue, 26 Sep 2000 05:33:58 -0700

Patch #101663 has been updated. 

Project: 
Category: library
Status: Open
Summary: Regression test for Unicode database

Follow-Ups:

Date: 2000-Sep-26 03:27
By: lemburg

Comment:
This is a regression test for the available Unicode database
methods and functions.

There's one problem with it: it takes a few seconds to run because
it has to check 64k characters...

-------------------------------------------------------

Date: 2000-Sep-26 04:08
By: none

Comment:
Umm. You did mean "return h.hexdigest()" rather than
"return repr(h)", didn't you?

I also think the code is quite a bit more hypergeneralized (read: slow)
than it really has to be... (e.g. if a method is missing or chokes on
the data, why pretend it returned an empty string?)

</F>
-------------------------------------------------------

Date: 2000-Sep-26 04:13
By: lemburg

Comment:
RE: .hexdigest: good idea!

RE: generalization: the exception handling is needed because
some methods raise errors for e.g. non-numbers. I don't think
that inlining the tests will change much about the execution
speed... it will still be slow.

-------------------------------------------------------

Date: 2000-Sep-26 04:43
By: none

Comment:
RE: exceptions: but if you spell things out, you can use
the "default" argument to get rid of the exception.
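
To illustrate the point about the "default" argument, here is a minimal
sketch. It is written for present-day Python 3 rather than the interpreter
under discussion (an assumption), and the helper name digit_or_none is
hypothetical. unicodedata.digit raises ValueError for a character without a
digit value unless a second argument is given, in which case that value is
returned instead.

```python
import unicodedata


def digit_or_none(char):
    """Hypothetical helper: the exception-based way to get a digit value."""
    try:
        return unicodedata.digit(char)
    except ValueError:
        # Raised for characters that have no digit value, e.g. letters.
        return None


# Passing a default gives the same answer without any exception handling.
for char in ("7", "x"):
    assert digit_or_none(char) == unicodedata.digit(char, None)
```

The same pattern applies to unicodedata.numeric and unicodedata.decimal,
which also accept a default.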

Here's the main body from my version of this script (this
assumes that 'repr' does the right thing, of course).  To
cope with a missing unicodedata module, just split the tuple
in two parts, updating two different digests.

<pre>
# setup assumed here; the original post showed only the loop
import sha, unicodedata
h = sha.new()
for i in range(65536):
    char = unichr(i)
    data = (
        # ctype predicates
        char.isalnum(),
        char.isalpha(),
        char.isdecimal(),
        char.isdigit(),
        char.islower(),
        char.isnumeric(),
        char.isspace(),
        char.istitle(),
        char.isupper(),
        # ctype mappings
        char.lower(),
        char.upper(),
        char.title(),
        # properties
        unicodedata.digit(char, None),
        unicodedata.numeric(char, None),
        unicodedata.decimal(char, None),
        unicodedata.category(char),
        unicodedata.bidirectional(char),
        unicodedata.decomposition(char),
        unicodedata.mirrored(char),
        unicodedata.combining(char)
        )
    h.update(repr(data))
</pre>
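
The "two digests" idea mentioned above could look roughly like the sketch
below. It is a Python 3 adaptation, not the script from the patch: chr and
hashlib stand in for unichr and the sha module, ascii() is used instead of
repr() so the bytes fed to the digest are plain ASCII, and the function name
database_digests and its limit parameter are hypothetical.

```python
import hashlib

try:
    import unicodedata
except ImportError:  # treat the module as optional, as suggested above
    unicodedata = None


def database_digests(limit=65536):
    """Return (methods_digest, properties_digest) as hex strings.

    properties_digest is None when unicodedata is unavailable.
    """
    h_methods = hashlib.sha1()
    h_props = hashlib.sha1()
    for i in range(limit):
        char = chr(i)
        # First half of the tuple: string methods only.
        methods = (
            char.isalnum(), char.isalpha(), char.isdecimal(),
            char.isdigit(), char.islower(), char.isnumeric(),
            char.isspace(), char.istitle(), char.isupper(),
            char.lower(), char.upper(), char.title(),
        )
        h_methods.update(ascii(methods).encode("ascii"))
        if unicodedata is not None:
            # Second half: database properties, with defaults where allowed.
            props = (
                unicodedata.digit(char, None),
                unicodedata.numeric(char, None),
                unicodedata.decimal(char, None),
                unicodedata.category(char),
                unicodedata.bidirectional(char),
                unicodedata.decomposition(char),
                unicodedata.mirrored(char),
                unicodedata.combining(char),
            )
            h_props.update(ascii(props).encode("ascii"))
    props_digest = h_props.hexdigest() if unicodedata is not None else None
    return h_methods.hexdigest(), props_digest
```

The digest values depend on the Unicode version the interpreter ships with,
so none are quoted here.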

BTW, note that "islower/upper/title" test more than just the
IsLower/etc predicates; maybe the test should call them on "char*3"
or "char+'constant string'" to catch bogus combinations of attributes.
My first checkin had a broken IsTitle table, but the test didn't spot
that because it just checked a single character...
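
A small sketch of why multi-character inputs exercise more of the tables
(Python 3; the helper name case_predicates and the suffix "abc" are
hypothetical, chosen only for illustration). A single uppercase letter
counts as titlecased on its own, but the same letter repeated three times
does not, because istitle also looks at what precedes each character.

```python
def case_predicates(char, suffix="abc"):
    """Hypothetical helper: (islower, isupper, istitle) for the character
    alone, repeated three times, and followed by a constant string."""
    return tuple(
        (s.islower(), s.isupper(), s.istitle())
        for s in (char, char * 3, char + suffix)
    )


results = case_predicates("A")
# "A" alone is titlecased, "AAA" is not, "Aabc" is.
assert results[0][2] and not results[1][2] and results[2][2]
```

Feeding all three variants into the digest makes a wrong table entry more
likely to change the result than a single-character check would.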
-------------------------------------------------------

Date: 2000-Sep-26 04:49
By: gvanrossum

Comment:
Marc-Andre, the test output is not portable -- it contains three
instances of an object repr with a memory address:
<SHA object at 0x82273b0>.
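
For illustration, a minimal Python 3 sketch of the difference (hashlib
stands in for the sha module of the time, and the payload is a placeholder;
both are assumptions). In CPython the repr of a hash object typically
embeds the object's address, which changes between runs and machines, while
hexdigest() depends only on the data fed to the digest.

```python
import hashlib

payload = b"unicode database test"  # placeholder data, not from the patch

h = hashlib.sha1(payload)
print(repr(h))        # object description; not stable across runs
print(h.hexdigest())  # 40 hex characters determined solely by the payload

# A second object over the same data yields the same hex digest,
# which is what makes it usable as expected test output.
assert hashlib.sha1(payload).hexdigest() == h.hexdigest()
```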
-------------------------------------------------------

Date: 2000-Sep-26 05:33
By: lemburg

Comment:
Here's an updated patch that fixes all of the above...

-------------------------------------------------------

For more info, visit:

http://sourceforge.net/patch/?func=detailpatch&patch_id=101663&group_id=5470