Best search algorithm to find condition within a range

Steven D'Aprano steve+comp.lang.python at pearwood.info
Wed Apr 8 22:32:26 EDT 2015


On Wed, 8 Apr 2015 11:49 am, Chris Angelico wrote:

> You could use base 1,114,112 fairly readily in any decent modern
> programming language. That'll happily represent base one-million.


Well, not really...

Here is the breakdown of Unicode code points by category, as of Python 3.3:

# Other
Cc: 65 (control characters)
Cf: 139 (format characters)
Cn: 864415 (unassigned)
Co: 137468 (private use)
Cs: 2048 (surrogates)

# Letters
Ll: 1751 (lowercase)
Lm: 237 (modifier)
Lo: 97553 (other)
Lt: 31 (titlecase)
Lu: 1441 (uppercase)

# Marks
Mc: 353 (spacing combining)
Me: 12 (enclosing)
Mn: 1280 (nonspacing)

# Numbers
Nd: 460 (decimal digit)
Nl: 224 (letter)
No: 464 (other)

# Punctuation
Pc: 10 (connector)
Pd: 23 (dash)
Pe: 71 (close)
Pf: 10 (final quote)
Pi: 12 (initial quote)
Po: 434 (other)
Ps: 72 (open)

# Symbols
Sc: 48 (currency)
Sk: 115 (modifier)
Sm: 952 (math)
So: 4404 (other)

# Separator
Zl: 1 (line)
Zp: 1 (paragraph)
Zs: 18 (space)
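
For anyone who wants to reproduce a table like this, here is a minimal sketch of one way to do it with the standard unicodedata module. The helper name category_counts is only for illustration, and the exact numbers depend on the Unicode version your Python ships with, so newer releases will report fewer unassigned code points and more of nearly everything else.

```python
import sys
import unicodedata
from collections import Counter

def category_counts():
    """Count the code points in each Unicode general category.

    Walks every code point from 0 to sys.maxunicode (0x10FFFF),
    surrogates included, and tallies the two-letter category.
    """
    return Counter(
        unicodedata.category(chr(cp)) for cp in range(sys.maxunicode + 1)
    )

counts = category_counts()
for cat in sorted(counts):
    print("{}: {}".format(cat, counts[cat]))
```

Whatever the Unicode version, the counts always add up to 1,114,112, which is where the figure in the quoted text comes from.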


Clearly we shouldn't use control or format characters, surrogates,
separators, marks, etc. (At least, I hope it is clear why you don't want,
say, newlines, to be used as digits.) Punctuation is borderline, as are
symbols, since they won't interoperate well with anything else. How can you
parse number+number if the numbers themselves might contain + signs? I
wouldn't use unassigned code points, as that is all but guaranteed to lead
to future problems, but I might reluctantly allow private use. That leaves
us the following which *may* be suitable:

Co: 137468 (private use)
Ll: 1751 (lowercase)
Lo: 97553 (other)
Lt: 31 (titlecase)
Lu: 1441 (uppercase)
Nd: 460 (decimal digit)
Nl: 224 (letter)
No: 464 (other)
Sc: 48 (currency)
Sm: 952 (math)
So: 4404 (other)

which comes to a total of 244796, far short of a million. Add in the 632
punctuation marks if you like, and we're still short.
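
The arithmetic is easy to check. This sketch just re-adds the figures quoted above (the Python 3.3 counts); the variable names are mine.

```python
# Candidate "digit" categories, using the Python 3.3 counts quoted above.
candidates = {
    "Co": 137468, "Ll": 1751, "Lo": 97553, "Lt": 31, "Lu": 1441,
    "Nd": 460, "Nl": 224, "No": 464, "Sc": 48, "Sm": 952, "So": 4404,
}
# The seven punctuation categories, in case you want to allow them too.
punctuation = {
    "Pc": 10, "Pd": 23, "Pe": 71, "Pf": 10, "Pi": 12, "Po": 434, "Ps": 72,
}

total = sum(candidates.values())
with_punctuation = total + sum(punctuation.values())

print("candidates:", total)
print("plus punctuation:", with_punctuation)
print("fraction of one million: {:.1%}".format(with_punctuation / 10**6))
```

Even with punctuation thrown in, that is roughly a quarter of what base one-million would need.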

There are other problems too:

- Confusables. Can you tell the difference between AΑА versus АAΑ, 
  or ВΒB versus BΒВ? Or even O versus 0?

- Lack of glyphs for the majority of those code points in most fonts. 
  Most numbers will look like a sequence of boxes.

- Difficulty of data entry.

- Some people's digits will not have the value that they expect:
  a digit '1' might not have the numeric value 1. That is true for
  at least 450 of the 460 decimal digits in use, since at most ten
  of them can keep their usual values.

- Realistically, who is going to use this?
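
Two of the points above, confusables and digits that already have a value, are easy to demonstrate. This is only an illustration using the standard unicodedata module, with sample characters of my own choosing.

```python
import unicodedata

# Three capital letters that look nearly identical in most fonts,
# yet are three distinct code points.
lookalikes = ["\u0041", "\u0391", "\u0410"]
names = [unicodedata.name(ch) for ch in lookalikes]
for ch, name in zip(lookalikes, names):
    print("U+{:04X} {}".format(ord(ch), name))

# Decimal digits (category Nd) already carry a fixed value 0-9,
# and Python's int() honours it.
arabic_indic = "\u0661\u0662\u0663"  # ARABIC-INDIC DIGIT ONE, TWO, THREE
digit_values = [unicodedata.digit(ch) for ch in arabic_indic]
parsed = int(arabic_indic)
print(digit_values, parsed)
```

A base-one-million scheme that handed each of those digits a new value would contradict what int() and every other Unicode-aware parser already believes about them.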

Even as an intellectual exercise, using huge bases for human input and
output isn't very useful. The idea of using massive implicit bases for the
internal implementation of BigNums is quite reasonable, but for human input
and output, it doesn't fly.



-- 
Steven



