[Tutor] Unicode trouble

Kent Johnson kent37 at tds.net
Thu Dec 1 14:29:23 CET 2005


Øyvind wrote:
>>The important question is, what is the actual encoding of your source data?
>>
>>>Is there anything else I could try?
>>
>>Understand why the above question is important, then answer it. Until you
>>do you are just thrashing around in the dark.
> 
> The source is a text-document that as far as I know only contains English
> and Norwegian letters. It can be opened with Notepad and Excel. I tried to
> run thru it in Python by:
> 
> f = open('c://file.txt')
> 
> for i in f:
>     print i
> 
> and that doesn't seem to give any problem. It prints all characters
> without any trouble.

That doesn't narrow it down much, though it does point towards latin-1 (or cp1252).
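
A minimal sketch of why the two are hard to tell apart for Norwegian text, written for a current Python 3 rather than the Python of this thread. The byte values are the latin-1 codes for the six Norwegian letters; the euro sign is just a convenient example of a byte where the two encodings disagree.

```python
# The Norwegian letters ae, o-slash, a-ring (lower and upper case) as single bytes.
norwegian = bytes([0xE6, 0xF8, 0xE5, 0xC6, 0xD8, 0xC5])

# Both codecs map these bytes to the same characters.
same = norwegian.decode("latin-1") == norwegian.decode("cp1252")
print(same)

# They differ in the range 0x80-0x9F: cp1252 puts printable characters
# there (0x80 is the euro sign), latin-1 has control characters.
euro = b"\x80"
from_cp1252 = euro.decode("cp1252")
from_latin1 = euro.decode("latin-1")
print(ascii(from_cp1252))
print(ascii(from_latin1))
```

So a file containing only English and Norwegian letters decodes identically either way; the choice only matters if it also holds characters such as the euro sign or curly quotes.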

> How would I find what encoding the document is in? All I can find is by
> opening Notepad, selecting Font/Script and it says 'Western'.

That is a setting for the display font; it doesn't tell you anything about the encoding of the doc. Try opening the file in your browser. Most browsers have an encoding menu (View / Character Encoding in Firefox, View / Encoding in IE). Find the selection in this menu that makes the text display correctly; that's the encoding of the file.
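
The same trial-and-error can be sketched in Python 3 (not the Python of this thread). The helper name try_decodings and the list of candidates are my own assumptions, and the sample bytes stand in for what open(path, 'rb').read() would return for your file.

```python
# Hypothetical candidate list; extend it with whatever encodings you suspect.
CANDIDATES = ["utf-8", "cp1252", "latin-1"]

def try_decodings(raw, candidates=CANDIDATES):
    """Map each encoding name to the decoded text, or None if decoding fails."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

# Stand-in for the file contents: "blabaer" with a-ring and ae as latin-1 bytes.
sample = b"bl\xe5b\xe6r"
results = try_decodings(sample)
for name, text in results.items():
    # ascii() shows non-ASCII characters as escapes, so the console's own
    # encoding cannot get in the way of seeing what was decoded.
    print(name, ascii(text))
```

A None result rules an encoding out. A successful decode does not prove the encoding is right (latin-1 accepts any byte sequence), so you still have to look at the text and check that the Norwegian letters come out as expected.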

> Might the problem only be related to Win32com, not Python since Python
> prints it without trouble?

That's another issue. First you need to know what you are starting with.
> 
>>Do you know what a character encoding is? Do you understand the
>>difference between utf-8 and latin-1?
> 
> Earlier characters had values 1-255. (Ascii). Now, you have a wider
> choice. In our part of the world we can use an extended version which
> contains a lot more, latin-1. UTF-8 is a part of Unicode and contains a
> lot more characters than Ascii.
> 
> My knowledge about character encoding doesn't go much farther than this.
> Simply said, I understand that the document that I want to read includes
> characters beyond Ascii, and therefore I need to use UTF-8 or Latin-1. Why
> I should use one instead of the other, I have no idea.

You really should read this:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html
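
To make the utf-8 / latin-1 difference concrete, here is a small sketch in Python 3 (again, not the Python of this thread). It encodes the same text both ways and then shows what happens when the bytes are decoded with the wrong codec; the variable names are only for illustration.

```python
# 'Oyvind' with a leading O-slash, written with an escape so the source stays ASCII.
name = "\u00d8yvind"

as_latin1 = name.encode("latin-1")  # every character becomes exactly one byte
as_utf8 = name.encode("utf-8")      # the O-slash becomes two bytes, the rest one each

print(as_latin1)
print(as_utf8)
print(len(as_latin1), len(as_utf8))

# Decoding UTF-8 bytes as latin-1 raises no error; it silently produces
# two wrong characters in place of the O-slash.
wrong = as_utf8.decode("latin-1")
print(ascii(wrong))
```

The text is the same in both cases; only the bytes used to store it differ. That is why you have to know which encoding a file was written in before you can read it correctly.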

Kent
-- 
http://www.kentsjohnson.com


