How to read fonts in python

Robert Kern robert.kern at gmail.com
Mon Nov 17 15:22:51 EST 2008


Steve Holden wrote:
> ganesh gajre wrote:
>> Hello all,
>>  I am writing a program to convert indic true type font to unicode. For
>> which i need to know how to read the any file i.e Text, Doc, Excel file
>> in python and identify the font used in which that file is written. So
>> that using Map file can convert the file in unicode.
> 
> You are getting too ambitious. Text files don't have any font
> information associated with them. Not only that, but the encoding of
> Unicode character data is independent of the font used to render the
> readable glyphs as text.
> 
> This makes it look as though you don't really know what you are doing.
> Perhaps you should start more slowly, and try explaining the real problem.
> 
> I'm not even sure what "converting a font to Unicode" means, so you
> might start by explaining that.

Fonts associate numbers with glyphs. Using Unicode code points for most of this 
mapping is reasonably common nowadays, but there are many older fonts that use 
any number of other mappings. Sometimes they used fairly standard text encodings 
like the ISO-8859-* series, but sometimes they used ad hoc mappings so that the 
glyphs could be typed easily on Latin keyboards.

For some older WYSIWYG word processor documents using these fonts, the text's 
"encoding" is specified in an ad hoc fashion only by the font. The word 
processor file may say that character 10 is the ASCII 'A' (or at least, the byte 
0x41), but the font may map 0x41 to some Indic glyph. The only thing in the file 
which says that the byte 0x41 should be interpreted as that Indic glyph is the 
font. As you say, this is irrelevant to real text files, but might be useful for 
Word documents which use these hacky fonts.
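
To make that concrete, here is a minimal sketch of the remapping step. It 
assumes a hypothetical font called "HackyIndic" whose glyphs at 0x41, 0x42 and 
0x43 happen to draw three Devanagari letters. The font name and the table are 
invented for illustration; a real map file has to be built by inspecting the 
actual font.

```python
# Hypothetical per-font map files: legacy character code -> Unicode text.
# These entries are made up for illustration, not taken from any real font.
LEGACY_FONT_MAPS = {
    "HackyIndic": {
        0x41: "\u0915",  # DEVANAGARI LETTER KA
        0x42: "\u0916",  # DEVANAGARI LETTER KHA
        0x43: "\u0917",  # DEVANAGARI LETTER GA
    },
}


def legacy_to_unicode(text, font_name):
    """Remap text typed in a legacy font to real Unicode.

    Characters with no entry in the font's table pass through unchanged,
    as does all text in fonts we have no table for.
    """
    table = LEGACY_FONT_MAPS.get(font_name)
    if table is None:
        return text
    return "".join(table.get(ord(ch), ch) for ch in text)


# The same stored characters mean different things depending on the font.
print(legacy_to_unicode("ABC 1", "HackyIndic"))
print(legacy_to_unicode("ABC 1", "Arial"))
```

The table values are strings rather than single code points, so one legacy 
code can expand to several Unicode characters if a font needs that.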

Ganesh, you should take a look at FontTools to handle parsing TTF files.

   http://sourceforge.net/projects/fonttools/
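
As a rough sketch of what that looks like, the following reads a font's 
character-to-glyph table (the "cmap") with FontTools and reports which glyph 
names sit behind a range of character codes. The helper names load_cmap and 
glyphs_for_codes are mine, not part of FontTools, and glyph names are only a 
hint: a font is free to call its glyphs anything.

```python
def load_cmap(font_path):
    """Return {character code: glyph name} from a TrueType/OpenType file.

    Requires the third-party fontTools package, imported lazily here so
    the pure helper below can be used without it.
    """
    from fontTools.ttLib import TTFont

    font = TTFont(font_path)
    try:
        # getBestCmap() returns the font's preferred Unicode cmap subtable
        # as a dict, or None if the font has no Unicode subtable.
        return dict(font.getBestCmap() or {})
    finally:
        font.close()


def glyphs_for_codes(cmap, codes):
    """Return (code, glyph name or None) for each requested character code."""
    return [(code, cmap.get(code)) for code in codes]
```

For example, glyphs_for_codes(load_cmap("SomeFont.ttf"), range(0x41, 0x5B)) 
lists what the font draws for 'A' through 'Z', where "SomeFont.ttf" is a 
placeholder path. If those glyph names are not Latin letters, the font is 
probably one of the hacky ones.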

To read specific document types, you will need to find a parser for each of 
the file types you want to handle. Be aware that many of these parsers skip 
the font information, since they are geared toward extracting just the text.
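
As one illustration of a format where the font information is reachable, here 
is a sketch that pulls (font, text) pairs out of a Word document using only the 
standard library. It assumes the file is in the zipped XML .docx format rather 
than the older binary .doc format, and it only looks at fonts set directly on a 
run; fonts inherited from styles show up as None. The function names are mine.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx file.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def runs_with_fonts(document_xml):
    """Yield (font name or None, text) for each run in a document.xml payload."""
    root = ET.fromstring(document_xml)
    for run in root.iter("{%s}r" % W):
        # <w:rPr><w:rFonts w:ascii="..."/></w:rPr> names the run's font,
        # but only when it is set directly on the run.
        fonts = run.find("{%s}rPr/{%s}rFonts" % (W, W))
        font = fonts.get("{%s}ascii" % W) if fonts is not None else None
        text = "".join(t.text or "" for t in run.iter("{%s}t" % W))
        yield font, text


def docx_runs_with_fonts(path):
    """Read word/document.xml out of a .docx file and list its runs."""
    with zipfile.ZipFile(path) as zf:
        return list(runs_with_fonts(zf.read("word/document.xml")))
```

Each pair gives the text of one run along with the font that decides how its 
character codes should be interpreted, which is what a per-font map needs as 
input.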

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco



