Newbie Q: Extra spaces after conversion from utf-8 to utf-16-le ?
Josiah Carlson
jcarlson at uci.edu
Sun Apr 11 14:11:03 EDT 2004
> I am an absolute Newbie who has done a good amount of googling with
> the keywords utf-8, utf-16, python, convert and has reasoned that the
> following code could be used to convert a utf-8 text file to a
> utf-16-le (I believe this is what Windows uses for Unicode):
>
> s1 = open("utf8_file_generated_with_perl.txt", "r").read()
> s2 = unicode(s1, "utf-8")
> s3 = s2.encode("utf-16-le")
> open ("new_file_supposedly_in_utf16le", "w").write(s3)
>
> Well, this code kind of works (meaning I do not get any errors), but
> the produced file contains an extra space after every character (l i k
> e t h i s) and Windows believes this is an ANSI (i.e. non-unicode
> file). Clearly, what I think is working is actually not.
For standard ASCII characters, utf-16-le encodes each character as two
bytes: the ASCII byte followed by a null byte ('\x00'). Many viewers
display that null byte as a space, which is the "extra space" you are
seeing after every character. For example:
>>> s = unicode("hello", "ascii")
>>> s
u'hello'
>>> s2 = s.encode("utf-16-le")
>>> s2
'h\x00e\x00l\x00l\x00o\x00'
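Generally, Windows applications cannot reliably guess the encoding of a
file, and without a hint many of them fall back to ANSI. What many (not
all) systems do to tell the app what encoding is being used is place
what is known as a 'BOM' (byte order mark) at the beginning of the file.
For utf-16-le the BOM is the two bytes '\xff\xfe', and the utf-16-le
codec does not write it for you. Check unicode.org for more information.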
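As a rough illustration of the BOM point, here is a small sketch. It is written in Python 3 syntax, which is an assumption on my part (the session above is Python 2), and it uses only the standard codecs module. The variable names are just for illustration.

```python
import codecs

text = "hello"

# The explicit little-endian codec emits no BOM, only the encoded text.
no_bom = text.encode("utf-16-le")

# Prepending the BOM by hand marks the data as little-endian UTF-16.
with_bom = codecs.BOM_UTF16_LE + no_bom

print(no_bom)    # b'h\x00e\x00l\x00l\x00o\x00'
print(with_bom)  # b'\xff\xfeh\x00e\x00l\x00l\x00o\x00'
```

A reader that understands the BOM, such as Python's generic "utf-16" codec, will strip it and decode the rest as little-endian.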
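You will also likely find that opening files in binary mode ('rb' and
'wb') on Windows, when working with unicode, goes a long way towards
producing correct output. In text mode, Windows translates each '\n'
byte to '\r\n' on write, which corrupts utf-16 data.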
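Putting both suggestions together, a minimal sketch of the conversion might look like the following. Again this assumes Python 3, and the function name utf8_to_utf16le and its parameters are hypothetical names chosen for the example, not anything from the original code.

```python
import codecs


def utf8_to_utf16le(src_path, dst_path):
    """Convert a UTF-8 file to UTF-16-LE with a leading BOM (hypothetical helper)."""
    # Read raw bytes so no newline translation happens on Windows.
    with open(src_path, "rb") as src:
        text = src.read().decode("utf-8")

    # Write raw bytes: the BOM first, then the encoded text.
    with open(dst_path, "wb") as dst:
        dst.write(codecs.BOM_UTF16_LE)
        dst.write(text.encode("utf-16-le"))
```

It would be called as utf8_to_utf16le("utf8_file_generated_with_perl.txt", "new_file_supposedly_in_utf16le"), using the file names from the question. Writing the BOM by hand keeps the byte order explicit; the generic "utf-16" codec also writes a BOM, but in the platform's native byte order.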
- Josiah