Newbie Q: Extra spaces after conversion from utf-8 to utf-16-le ?

Sun Apr 11 14:11:03 EDT 2004

> I am an absolute Newbie who has done a good amount of googling with
> the keywords utf-8, utf-16, python, convert and has reasoned that the
> following code could be used to convert a utf-8 text file to a
> utf-16-le (I believe this is what Windows uses for Unicode):
> 
> s1 = open("utf8_file_generated_with_perl.txt", "r").read()
> s2 = unicode(s1, "utf-8")
> s3 = s2.encode("utf-16-le")
> open ("new_file_supposedly_in_utf16le", "w").write(s3)
> 
> Well, this code kind of works (meaning I do not get any errors), but
> the produced file contains an extra space after every character
> (l i k e  t h i s) and Windows believes this is an ANSI (i.e.
> non-Unicode) file. Clearly, what I think is working is actually not.

For standard ASCII characters, utf-16-le encodes each character as two
bytes: the ASCII byte followed by a null byte. Those null bytes are the
'spaces' you are seeing after every character...
    >>> s = unicode("hello", "ascii")
    >>> s
    u'hello'
    >>> s2 = s.encode("utf-16-le")
    >>> s2
    'h\x00e\x00l\x00l\x00o\x00'

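Windows itself does not record a file's encoding, and an application
that gets no hint will generally fall back to treating the file as ANSI.
What many (not all) systems do to tell the app what encoding is being
used is place what is known as a 'BOM' (byte order mark) at the
beginning of the file.  For utf-16-le the BOM is the two bytes FF FE,
and the "utf-16-le" codec does not write it for you.  Check unicode.org
for more information.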
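
As a rough sketch of what the BOM looks like in practice (written for
Python 3, where str is already Unicode, rather than the Python 2 used
above; the variable names are only for illustration):

```python
import codecs

text = "hello"

# The "utf-16-le" codec writes no BOM, so a reader has to guess the encoding.
without_bom = text.encode("utf-16-le")

# Prepending the little-endian BOM (bytes FF FE) marks the data as UTF-16-LE.
with_bom = codecs.BOM_UTF16_LE + without_bom

print(without_bom)  # b'h\x00e\x00l\x00l\x00o\x00'
print(with_bom)     # b'\xff\xfeh\x00e\x00l\x00l\x00o\x00'
```

A reader that honours the BOM, such as Python's own "utf-16" codec,
strips it again when decoding and gives back the original text.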

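You will also likely find that opening files in binary mode ("rb" and
"wb") on Windows, when working with unicode, goes a long way towards
producing correct output.  In text mode, Windows rewrites every newline
byte as a carriage return plus newline, which throws off the two-byte
alignment of utf-16 data.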
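
Putting the pieces together, the conversion might look something like
the following.  This is only a sketch, assuming Python 3 and a source
file that really is valid utf-8; utf8_to_utf16le is a made-up helper
name, not anything from the standard library:

```python
import codecs


def utf8_to_utf16le(src_path, dst_path):
    """Convert a UTF-8 file to UTF-16-LE with a leading BOM (hypothetical helper)."""
    # Binary mode: read the raw bytes with no newline translation.
    with open(src_path, "rb") as src:
        text = src.read().decode("utf-8")

    # Binary mode again, so Windows does not rewrite the newline bytes.
    with open(dst_path, "wb") as dst:
        dst.write(codecs.BOM_UTF16_LE)       # FF FE, identifies the encoding
        dst.write(text.encode("utf-16-le"))  # the text itself, without a BOM
```

It would be called as utf8_to_utf16le("in.txt", "out.txt"), where both
file names are placeholders.  Writing the BOM by hand keeps the byte
order explicit; the plain "utf-16" codec also writes a BOM, but in the
machine's native byte order.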

  - Josiah