How do I display unicode value stored in a string variable using ord()

DJC djc at news.invalid
Sun Aug 19 11:32:06 EDT 2012


On 19/08/12 15:25, Steven D'Aprano wrote:

> Not necessarily. Presumably you're scanning each page into a single
> string. Then only the pages containing a supplementary plane char will be
> bloated, which is likely to be rare. Especially since I don't expect your
> OCR application would recognise many non-BMP characters -- what does
> U+110F3, "SORA SOMPENG DIGIT THREE", look like? If the OCR software
> doesn't recognise it, you can't get it in your output. (If you do, the
> OCR software has a nasty bug.)
>
> Anyway, in my ignorant opinion the proper fix here is to tell the OCR
> software not to bother trying to recognise Imperial Aramaic, Domino
> Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't
> expecting them in your source material. Not only will the scanning go
> faster, but you'll get fewer wrong characters.
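
To check whether a page actually contains any supplementary plane characters, something along these lines might do. It is only a sketch, assuming Python 3.3 or later (where a non-BMP character is a single code point); the helper name non_bmp_chars and the sample strings are made up for illustration.

```python
import sys
import unicodedata

def non_bmp_chars(text):
    """Return (index, code point, name) for each character outside the BMP."""
    return [
        (i, "U+{:04X}".format(ord(ch)), unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 0xFFFF  # anything above U+FFFF is in a supplementary plane
    ]

# Hypothetical sample pages, not real OCR output.
page = "plain ASCII page"
odd_page = "page with \U000110F3 in it"

print(non_bmp_chars(page))
print(non_bmp_chars(odd_page))

# Per-string memory use; the exact numbers depend on the interpreter build.
print(sys.getsizeof(page), sys.getsizeof(odd_page))
```

Running this per page shows which pages, if any, would pay the wider storage cost.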

Consider the automated recognition of a CAPTCHA. Since the user has to
type the characters on a keyboard, only a very basic character set can
be used, so the set of characters that can possibly occur is quite small.
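
A rough sketch of that idea in Python: flag anything in the recognised text that falls outside the expected character set. The allowed set (ASCII letters and digits) and the name unexpected_chars are assumptions for illustration, not part of any OCR package.

```python
import string

# Assumed whitelist: what a user could plausibly type for a CAPTCHA.
ALLOWED = set(string.ascii_letters + string.digits)

def unexpected_chars(text, allowed=ALLOWED):
    """Return (index, character, ord value) for each character not in allowed."""
    return [(i, ch, ord(ch)) for i, ch in enumerate(text) if ch not in allowed]

# Hypothetical recognition results.
print(unexpected_chars("aB3x9"))
print(unexpected_chars("aB3\u00e9"))
```

An empty list means every character was in the expected set; anything else points at a likely misrecognition.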


