Text to unicode

Max M maxm at mxm.dk
Thu Oct 3 08:39:40 EDT 2002


I have a series of text files that are really UTF-8 files in disguise.

What I mean is that an editor sees them as plain text files, but each 
character that should be a single special character is displayed as two 
*wrong* characters.

As far as I understand, the first two bytes in a Unicode file have a 
special value that tells what kind of encoding the text file has. My 
guess is that my files are missing these two bytes, so my editor and 
Python believe them to be plain text files.
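
A guess like this can be checked directly by looking at the first few
bytes of a file. The following is only a minimal sketch, assuming a
modern Python 3 interpreter; `sniff_bom` and the `BOMS` table are
hypothetical names made up for illustration, while the `codecs.BOM_*`
constants come from the standard library. Note that the UTF-8 mark is
three bytes rather than two, and that UTF-8 files commonly carry no mark
at all, so a missing mark would not by itself explain a decoding error.

```python
import codecs

# Known byte order marks. The UTF-32 marks come first because the
# UTF-32-LE mark begins with the same two bytes as the UTF-16-LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

Reading the first four bytes of a file in binary mode and passing them
to `sniff_bom` is enough; a result of `None` only means there is no
mark, not that the file is not UTF-8.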

(They have been saved by a third-party tool.)

They are HTML files, and the encoding declared in them is utf-8. When I 
view them in a browser I see the correct characters.

I cannot seem to get my mind around how to convert them into Unicode.

content = unicode(f.read(), 'utf-8')
 >>> UnicodeError: UTF-8 decoding error: unexpected code byte

content = unicode(f.read())
 >>> UnicodeError: ASCII decoding error: ordinal not in range(128)
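
One way to narrow down an error like the first one is to ask the decoder
where it gave up. This is only a sketch under assumptions: it uses
Python 3, where `bytes.decode` takes the place of the `unicode()`
builtin, and `find_bad_bytes` is a hypothetical helper name. It reports
the offset of the first byte that is not valid in the given encoding,
along with a few surrounding bytes, which can help tell whether a file
has, for example, a few single-byte Latin-1 characters mixed in.

```python
def find_bad_bytes(data, encoding="utf-8", context=10):
    """Return (offset, nearby bytes) for the first undecodable byte, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start and err.end delimit the offending byte sequence.
        start = max(err.start - context, 0)
        return err.start, data[start:err.end + context]
    return None
```

If only the readable part of the text matters, `data.decode("utf-8",
errors="replace")` is another option: it substitutes U+FFFD for each
invalid sequence instead of raising.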

Any takers?

-- 


regards Max M

The reason I don't reach any higher is that I stand on the shoulders of 
little people.



