How do you read unicode files?
Mike C. Fletcher
mcfletch at rogers.com
Fri Jun 7 01:15:32 EDT 2002
Well, on my win2k machine, I can create a text file using notepad and
specify that it's in "Unicode" format (utf16) and load it with:
unicode( open( filename, 'rb').read(), 'utf16' )
That works even if there are ANSI characters above 127, and even if I
specify "big-endian unicode" in notepad.
unicode( open( filename,'r').read(), 'utf8' )
works if I specify "UTF-8" format in notepad.
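As far as I can tell, the byte order does not matter because the utf16 codec looks at the byte-order mark at the start of the data ('\xff\xfe' for little-endian, '\xfe\xff' for big-endian) and decodes accordingly. Here is a minimal sketch of that, written so it also runs on Python 3, where byte strings are decoded with .decode() instead of unicode(). The "BOM plus payload" layout for the big-endian file is my assumption about what notepad writes:

```python
import codecs

# Sample text, the same as in the session in this message.
text = u'Testing unicode\r\n\xe1\xed'

# Assumed file contents for each byte order: BOM followed by the payload.
little = codecs.BOM_UTF16_LE + text.encode('utf-16-le')
big = codecs.BOM_UTF16_BE + text.encode('utf-16-be')

# The generic 'utf-16' codec inspects the BOM, picks the matching byte
# order, and drops the BOM from the decoded result.
decoded_little = little.decode('utf-16')
decoded_big = big.decode('utf-16')

print(decoded_little == decoded_big == text)
```

If the file has no byte-order mark at all, you would have to name the byte order yourself ('utf-16-le' or 'utf-16-be').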
Depending on what format you want for the "standard string", you'd then
just call, for instance, .encode( 'utf8' ) on the resulting unicode object.
Here's a sample session:
>>> data = open( filename,'rb').read()
>>> data
'\xff\xfeT\x00e\x00s\x00t\x00i\x00n\x00g\x00 \x00u\x00n\x00i\x00c\x00o\x00d\x00e\x00\r\x00\n\x00\xe1\x00\xed\x00'
>>> u = unicode( data, 'utf16' )
>>> u
u'Testing unicode\r\n\xe1\xed'
>>> u.encode( 'utf8')
'Testing unicode\r\n\xc3\xa1\xc3\xad'
>>> u.encode( 'iso8859-1' )
'Testing unicode\r\n\xe1\xed'
>>>
That last one is a plain Windows-native encoding (well, my Windows-native
encoding ;) ) of the unicode as a simple Python string.
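For the search and replace tool described in the question quoted below, one possible approach is to sniff the byte-order mark, decode, do the replacement on the unicode text, and encode back with the same encoding and mark so the file type does not change. This is only a rough sketch in Python 3 syntax, not a tested tool. The names replace_in_bytes and replace_in_file are made up for illustration, and treating every file without a mark as iso8859-1 is an assumption, not real detection:

```python
import codecs

# Byte-order marks to look for. A file with none of them is treated as
# a single-byte "ANSI" file (an assumption, see the fallback below).
BOMS = [
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def replace_in_bytes(data, old, new, fallback='iso8859-1'):
    """Replace old with new in encoded data, keeping its encoding."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            # Strip the mark, work on the decoded text, then put the
            # same mark back in front of the re-encoded result.
            text = data[len(bom):].decode(encoding)
            return bom + text.replace(old, new).encode(encoding)
    # No mark found: iso8859-1 maps every byte to a character and back,
    # so bytes outside the match are left unchanged.
    text = data.decode(fallback)
    return text.replace(old, new).encode(fallback)

def replace_in_file(path, old, new):
    """Rewrite path in place; binary mode keeps the bytes untouched."""
    with open(path, 'rb') as handle:
        data = handle.read()
    with open(path, 'wb') as handle:
        handle.write(replace_in_bytes(data, old, new))

# Usage on in-memory data shaped like the notepad "Unicode" file above.
sample = codecs.BOM_UTF16_LE + u'Testing unicode\r\n'.encode('utf-16-le')
result = replace_in_bytes(sample, u'unicode', u'Unicode')
```

Note that the replacement text has to be representable in the file's encoding, otherwise the final .encode() call raises an error.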
HTH,
Mike
Matt Gerrans wrote:
> How do you read in a unicode file and convert it to a standard string?
>
> It seems that when you open a file and read it, what you get is a string of
> single-byte characters. I've tried all kinds of permutations of calls to
> unicode(), decode(), encode(), etc. with different flavors of encoding
> ('utf-8', 'utf-16' and so on).
>
> I could parse the data myself (skipping the initial two bytes and then every
> other one -- I'm only working with ASCII in double byte format, so the high
> order byte is always 0), but I imagine there must be a way to get the
> existing tools to work.
>
> What I want to be able to do is write a search and replace tool that will
> work equally well on ANSI and Unicode (or double-byte) text files (without
> changing the file type, of course)...
>
>
--
_______________________________________
Mike C. Fletcher
http://members.rogers.com/mcfletch/