Python unicode and Windows cmd.exe

Terry Reedy tjreedy at udel.edu
Sun Mar 14 17:37:29 EDT 2010


On 3/14/2010 4:40 PM, Guillermo wrote:
> Hi,
>
> I would appreciate it if someone could point out what I am doing wrong
> here.
>
> Basically, I need to save a string containing non-ascii characters to
> a file encoded in utf-8.
>
> If I stay in python, everything seems to work fine, but the moment I
> try to read the file with another Windows program, everything goes to
> hell.
>
> So here's the script unicode2file.py:
> ===================================================================
> # encoding=utf-8
> import codecs
>
> f = codecs.open("m.txt",mode="w", encoding="utf8")
> a = u"mañana"
> print repr(a)
> f.write(a)
> f.close()
>
> f = codecs.open("m.txt", mode="r", encoding="utf8")
> a = f.read()
> print repr(a)
> f.close()
> ===================================================================
>
> That gives the expected output, both calls to repr() yield the same
> result.
>
> But now, if I do type m.txt in cmd.exe, I get garbled characters
> instead of "ñ".
>
> I then open the file with my editor (Sublime Text), and I see "mañana"
> normally. I save (nothing to be saved, really), go back to the dos
> prompt, do type m.txt and I get again the same garbled characters.
>
> I then open the file m.txt with notepad, and I see "mañana" normally.
> I save (again, no actual modifications), go back to the dos prompt, do
> type m.txt and this time it works! I get "mañana". When notepad opens
> the file, the encoding is already UTF-8, so short of a UTF-8 bom being

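Strictly speaking, utf-8 has no byte order to mark: it is a sequence of
single bytes, so a utf-8 'byte order mark' is an oxymoron. What some
programs write is an optional three-byte signature (EF BB BF).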

> added to the file, I don't know what happens when I save the
> unmodified file. Also, I would think that the python script should
> save a valid utf-8 file in the first place...

Adding the three bytes (EF BB BF) that some call a 'utf-8 bom' is
neither required nor recommended by the Unicode standard, and tools that
do not expect them treat them as data. However, I suspect that notepad wrote the file in the system
encoding, which can encode n with tilde and which cmd.exe does 
understand. If you started with a file with encoded cyrillic, arabic, 
hindi, and chinese characters (for instance), I suspect you would get a 
different result.
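
One way to check what notepad actually wrote is to look at the raw
bytes instead of trusting any program's rendering. The following is
only a minimal sketch, not a definitive diagnosis. The helper names
describe_encoding and as_console_would_show are made up for
illustration, and cp1252 (system encoding) and cp850 (console code
page) are assumptions about a typical Western European Windows setup;
neither is stated in this thread. Run chcp in cmd.exe to see the code
page your console really uses.

```python
# -*- coding: utf-8 -*-
import codecs


def describe_encoding(raw):
    """Classify raw bytes as utf-8 with signature, plain utf-8, or other."""
    if raw.startswith(codecs.BOM_UTF8):  # the three bytes EF BB BF
        return "utf-8 with signature (EF BB BF)"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not utf-8 (possibly a legacy code page)"
    return "utf-8 without signature"


def as_console_would_show(raw, code_page="cp850"):
    """Decode bytes the way a console using code_page would.

    cp850 is an assumed default, not something taken from the thread.
    """
    return raw.decode(code_page, "replace")


text = u"ma\u00f1ana"

# The same text as bytes under each candidate encoding.
for codec in ("utf-8", "utf-8-sig", "cp1252", "cp850"):
    raw = text.encode(codec)
    print("%-10s %-30r %s" % (codec, raw, describe_encoding(raw)))

# utf-8 bytes read as cp850 come out garbled; cp850 bytes read as cp850 do not.
garbled = as_console_would_show(text.encode("utf-8"))
intact = as_console_would_show(text.encode("cp850"))
print(garbled != text, intact == text)
```

To apply this to the real file, read it in binary mode, for example
open("m.txt", "rb").read(), and pass the result to describe_encoding
before and after saving from notepad.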

tjr




