How do you read unicode files?

Mike C. Fletcher mcfletch at rogers.com
Fri Jun 7 01:15:32 EDT 2002


Well, on my win2k machine, I can create a text file using notepad and 
specify that it's in "Unicode" format (utf16) and load it with:

unicode( open( filename, 'rb').read(), 'utf16' )

That works even if there are ANSI characters > 127, and even if I specify 
"big-endian unicode" in notepad.

unicode( open( filename,'rb').read(), 'utf8' )

works if I specify "UTF-8" format in notepad.
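
The big-endian case works because notepad writes a byte order mark (BOM) at the start of the file, and the generic utf16 codec uses it to pick the byte order. Here's a rough sketch of that, assuming a Python recent enough to have b'' literals and the codecs.BOM_* constants (the sample text is made up for illustration). It also shows one thing to watch for: the plain utf8 codec does not drop a BOM, so you get a leading u'\ufeff' character back.

```python
import codecs

text = u'Testing'

# Build the byte layouts by hand: a BOM followed by the encoded text.
little = codecs.BOM_UTF16_LE + text.encode('utf-16-le')
big = codecs.BOM_UTF16_BE + text.encode('utf-16-be')
with_utf8_bom = codecs.BOM_UTF8 + text.encode('utf-8')

# The generic utf-16 codec reads the BOM, picks the byte order, and drops the BOM.
from_little = little.decode('utf-16')
from_big = big.decode('utf-16')

# The plain utf-8 codec keeps the BOM as a leading U+FEFF character...
from_utf8 = with_utf8_bom.decode('utf-8')
# ...so strip it by hand if it is not wanted.
stripped = from_utf8.lstrip(u'\ufeff')
```

Both from_little and from_big come out as the same unicode text, whichever byte order the file used.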

Depending on what format you want for the "standard string", you'd then 
just call, for instance, .encode( 'utf8') on the resulting unicode object.
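
If you'd rather have the decoding happen while reading, something along these lines should work. It's only a sketch: it assumes a Python that has io.open (2.6 or later, and the builtin open in 3.x), and the temporary file is just a stand-in for a file saved from notepad. Passing newline='' keeps the '\r\n' line endings as they are in the file.

```python
import io
import os
import tempfile

# Create a small UTF-16 file to stand in for one saved from notepad.
fd, sample_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with open(sample_path, 'wb') as f:
    f.write(u'Testing unicode\r\n\xe1\xed'.encode('utf-16'))

# io.open decodes while reading, so read() returns unicode text directly.
with io.open(sample_path, 'r', encoding='utf-16', newline='') as f:
    u = f.read()

# Encode to whichever byte format the rest of the program expects.
as_utf8 = u.encode('utf-8')
as_latin1 = u.encode('iso8859-1')

os.remove(sample_path)
```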

Here's a sample session:
 >>> data = open( filename,'r').read()
 >>> data
'\xff\xfeT\x00e\x00s\x00t\x00i\x00n\x00g\x00 \x00u\x00n\x00i\x00c\x00o\x00d\x00e\x00\r\x00\n\x00\xe1\x00\xed\x00'
 >>> u = unicode( data, 'utf16' )
 >>> u
u'Testing unicode\r\n\xe1\xed'
 >>> u.encode( 'utf8')
'Testing unicode\r\n\xc3\xa1\xc3\xad'
 >>> u.encode( 'iso8859-1' )
'Testing unicode\r\n\xe1\xed'
 >>>

That last one is the unicode text as a simple Python string in a plain, 
windows-native encoding (well, my windows-native encoding ;) ).
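
For the search-and-replace tool, one possible approach is to sniff the BOM, decode, do the replacement on the unicode text, and then re-encode with the same BOM and byte order so the file type doesn't change. This is only a sketch under some assumptions: the names sniff, replace_text and _BOMS are made up for illustration, files without a BOM are assumed to be iso8859-1 ("ANSI") text, and UTF-32 files are not handled.

```python
import codecs

# (BOM, codec) pairs to try in order; a hypothetical table for this sketch.
_BOMS = [
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]


def sniff(raw, default='iso8859-1'):
    """Return (bom, codec) for raw bytes; BOM-less data is assumed to be `default`."""
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            return bom, codec
    return b'', default


def replace_text(raw, old, new):
    """Replace old with new in raw file bytes, keeping the original encoding."""
    bom, codec = sniff(raw)
    text = raw[len(bom):].decode(codec)
    # Put the original BOM back so the byte order and file type are unchanged.
    return bom + text.replace(old, new).encode(codec)
```

You'd read the file with open( filename, 'rb').read(), pass the bytes through replace_text, and write the result back with mode 'wb'.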

HTH,
Mike

Matt Gerrans wrote:
> How do you read in a unicode file and convert it to a standard string?
> 
> It seems that when you open a file and read it, what you get is a string of
> single-byte characters. I've tried all kinds of permutations of calls to
> unicode(), decode(), encode(), etc. with different flavors of encoding
> ('utf-8', 'utf-16' and so on).
> 
> I could parse the data myself (skipping the initial two bytes and then every
> other one -- I'm only working with ASCII in double byte format, so the high
> order byte is always 0), but I imagine there must be a way to get the
> existing tools to work.
> 
> What I want to be able to do is write a search and replace tool that will
> work equally well on ANSI and Unicode (or double-byte) text files (without
> changing the file type, of course)...
> 
> 


-- 
_______________________________________
   Mike C. Fletcher
   http://members.rogers.com/mcfletch/






