[Python-Dev] Python and the Unicode Character Database

haiyang kang cornsea at gmail.com
Fri Dec 3 04:18:43 CET 2010


> Furthermore, data can well originate from texts that were written
> hundreds or even thousands of years ago, so there is plenty of
> material available for processing.

Hmm... for this, I think we need a specially tuned language
processing system, with one subsystem per language :)
(Sometimes a single word is not enough; we also need context.)

Take pi for example: in modern math it is written as 3.1415...;
in old China it was sometimes written as 三一四一五 or
三点一四一五 or 叁点壹肆壹伍.
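
To make the point concrete, here is a minimal sketch of a per-character lookup
with the unicodedata module. The helper name digits_from_text and the set
DECIMAL_MARKS are made up for illustration, and treating 点 as a decimal
point is an assumption:

```python
import unicodedata

# Assumption: these characters act as a decimal point in the raw text.
DECIMAL_MARKS = {"点", "點", "."}


def digits_from_text(raw):
    """Map each character to a decimal digit using the Unicode Character Database.

    Hypothetical helper: it only handles digit-by-digit notation and
    rejects anything that is not a single digit 0-9 or a decimal mark.
    """
    out = []
    for ch in raw:
        if ch in DECIMAL_MARKS:
            out.append(".")
            continue
        # The second argument is returned when the character has no numeric value.
        value = unicodedata.numeric(ch, None)
        if value is None or value != int(value) or not 0 <= value <= 9:
            raise ValueError(f"no single-digit value for {ch!r}")
        out.append(str(int(value)))
    return "".join(out)


print(digits_from_text("三点一四一五"))  # 3.1415
print(digits_from_text("三一四一五"))    # 31415
```

The second call shows why context matters: without the 点 the lookup can
only give back the digits 31415, and something else has to know that the
text is talking about pi. The accounting forms such as 叁点壹肆壹伍 should go
through the same lookup as long as the database in use carries numeric
values for them.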

And if these texts are extracted with a scanner
(OCR or other image processing tech), then in my POV
it is the job of this image processing subsystem
(or some other subsystem between the image processing and the database)
to do the mapping between the number and the raw text data. Example table in DB:
text   | raw data   | raw image data
-------|------------|---------------
3.1415 | 三一四一五 | image...
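
A table like this could be sketched with the sqlite3 module as follows. The
table name scanned_numbers, the column names and the image placeholder are
all hypothetical; the only point is that the normalized number is stored
next to the raw text and the raw image, so the mapping is done once by the
subsystem that writes the row:

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database holding the one example row from the table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE scanned_numbers ("
        " normalized_text TEXT NOT NULL,"  # the 'text' column, e.g. 3.1415
        " raw_data TEXT NOT NULL,"         # characters as recognized by OCR
        " raw_image BLOB)"                 # the scanned image itself
    )
    conn.execute(
        "INSERT INTO scanned_numbers VALUES (?, ?, ?)",
        ("3.1415", "三一四一五", b"<image bytes placeholder>"),
    )
    conn.commit()
    return conn


conn = build_demo_db()
row = conn.execute(
    "SELECT normalized_text, raw_data FROM scanned_numbers WHERE raw_data = ?",
    ("三一四一五",),
).fetchone()
print(row)
```

Later processing can then query the normalized column and never needs to
parse the old notation again.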

br,
khy

