From python to LaTeX in emacs on windows

Benjamin Niemann b.niemann at betternet.de
Tue Aug 31 05:18:18 EDT 2004


Brian Elmegaard wrote:
> Benjamin Niemann <b.niemann at betternet.de> writes:
> 
> Thanks for the help. I solved the problem by specifying the cp1252
> encoding both for the Python file (via a magic comment) and for the
> input data file.
> 
> 
>>When you read the file contents in Python, you get the "raw" byte
>>sequence, in this case the UTF-8 encoding of unicode text. But
>>you probably want a unicode string. Use "text = unicode(data,
>>'utf-8')" where "data" is the file content you read. After processing
>>you probably want to write it back to a file. Before you do this, you
>>will have to convert the unicode string back to a byte sequence. Use
>>"data = text.encode('utf-8')".
>>
> 
> 
> This worked, but when I try to print text I get:
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 9-10: ordinal not in range(128)
> Why is that?
The console only understands "byte streams". To print a unicode string, 
Python tries to encode it using the default encoding, which is 'ascii' 
in your case. That encoding cannot represent characters like 'ü' or 'ä', 
which causes the exception. What I usually do is something like:
print text.encode("cp1252", "ignore")

The 'ignore' argument causes all characters that cannot be represented 
in cp1252 to be silently dropped - which is OK if the output is only 
used to track progress, for example.
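
A small sketch of what the error argument does, assuming a Western European
Windows console that uses cp1252 (the encoding Brian mentioned). It is written
in Python 3 syntax, where "str" is the unicode type and "bytes" is the raw byte
sequence; the sample string is made up for illustration.

```python
# Sample text: the umlauts exist in cp1252, the arrow (U+2192) does not.
text = "Grüße aus Köln \u2192 ok"

# 'ascii' cannot represent the umlauts, so a strict encode fails,
# which is the exception quoted above.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# "ignore" silently drops every character the codec cannot represent.
dropped = text.encode("cp1252", "ignore")

# "replace" substitutes "?" instead, so the loss stays visible.
marked = text.encode("cp1252", "replace")

print(ascii_failed)
print(dropped)
print(marked)
```

If you would rather see that something went missing, "replace" is probably the
friendlier choice for progress output.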

Don't know if there's a way for Python to do this automagically for all 
unicode strings passed to stdout...
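
One possible approach, offered as a sketch rather than a tested recipe for this
setup: the standard "codecs" module has getwriter(), which wraps a byte stream
in an object that encodes every unicode string written to it. The helper name
wrap_stream and the BytesIO stand-in for the console are assumptions made for
illustration; the code uses Python 3 syntax.

```python
import codecs
import io


def wrap_stream(byte_stream, encoding="cp1252", errors="replace"):
    """Return a writer that encodes unicode text before writing bytes.

    Hypothetical helper: the encoding and error policy are assumptions.
    """
    return codecs.getwriter(encoding)(byte_stream, errors)


# BytesIO stands in for a byte-oriented console stream.
buffer = io.BytesIO()
out = wrap_stream(buffer)

# The caller writes unicode; the wrapper takes care of the encoding.
out.write("Grüße \u2192 ok\n")

print(buffer.getvalue())
```

In Python 2 the same wrapper could presumably be assigned to sys.stdout once at
program start, so that every later print goes through it.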

> 
> 
>>Handling character encodings correctly *is* difficult. 
> 
> 
> What makes it difficult? The OS, the editor, python, latex?
At least for me it is difficult, because I'm used to thinking "1 byte = 1 
character", and when I read/write files I could simply handle the data as 
strings. Unless you begin to parse arbitrary data from the internet, 
there is little chance that you encounter text encodings different from 
your operating system's default, and you start to believe that e.g. 
"ord('ü') == 252" is a universal rule sent by the gods...
If you do it right, you should convert all data that 'enters' your 
application to unicode as early as possible and encode it back when you 
print/save/send it - this way you'll only have to deal with unicode 
strings in your application code. The most difficult part is probably 
changing old habits ;)
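
A minimal sketch of the "decode early, encode late" habit, in Python 3 syntax.
The function names load_text, process and save_text are made up for
illustration, and the input bytes are built in place instead of being read from
a real file.

```python
def load_text(raw, encoding):
    """Decode incoming bytes to unicode as soon as they enter the program."""
    return raw.decode(encoding)


def process(text):
    """Application code: works on unicode text only, never on raw bytes."""
    return text.upper()


def save_text(text, encoding):
    """Encode back to bytes only at the point of output."""
    return text.encode(encoding)


# Pretend these bytes came from a cp1252-encoded input file.
raw_in = "für".encode("cp1252")

text = load_text(raw_in, "cp1252")
raw_out = save_text(process(text), "utf-8")

# The code point of 'ü' is 252, but its byte representation depends on
# the encoding: one byte in cp1252, two bytes in UTF-8.
code_point = ord("ü")
cp1252_bytes = "ü".encode("cp1252")
utf8_bytes = "ü".encode("utf-8")

print(len(text), len(raw_out))
print(code_point, cp1252_bytes, utf8_bytes)
```

The character count of the text stays at three throughout, while the byte count
changes with the output encoding, which is exactly where the "1 byte = 1
character" habit breaks down.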


