[ python-Bugs-1503789 ] Cannot write source code in UTF16

SourceForge.net noreply at sourceforge.net
Fri Jun 23 03:31:10 CEST 2006


Bugs item #1503789, was opened at 2006-06-09 17:38
Message generated for change (Comment added) made by tungwaiyip
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1503789&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Parser/Compiler
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Wai Yip Tung (tungwaiyip)
Assigned to: Nobody/Anonymous (nobody)
Summary: Cannot write source code in UTF16

Initial Comment:
I intend to create some source code in UTF16. I start 
the file with the encoding declaration line:

----------------------------------------------
# -*- coding: UTF-16LE -*-
print "Hello world"
----------------------------------------------

Unfortunately, Python does not decode it as UTF-16 as 
expected. I found some language in PEP 0263 that 
says "It does not include encodings which use two or 
more bytes for all characters like e.g. UTF-16." While 
I am disappointed, I accept that this limitation is 
necessary to keep the parser simple. So my first 
complaint is that this fact should be documented in

http://www.python.org/doc/ref/encodings.html

Then I tried to save the source code with a BOM. I think 
there should be no excuse not to decode it as UTF-16 in 
that case. Unfortunately, Python does not support this 
either.

Indeed, the only way to get it to work is to write the 
encoding declaration line in ASCII and the rest of the 
file in UTF-16 (see u16_hybrid.py). Obviously, most text 
editors would not support this.
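
A minimal sketch of why the declaration line is not seen when the
whole file is UTF-16. It assumes a current Python 3 interpreter
(not the Python 2.4 of this report) and applies the coding pattern
from PEP 0263 to raw bytes. The helper name find_declaration is
hypothetical; it only illustrates the byte-level problem and is not
the tokenizer's actual code.

```python
import codecs
import re

# Coding-declaration pattern as given in PEP 0263, applied to raw bytes.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

# Python 3 print syntax is used here; the report's example is Python 2.
source = '# -*- coding: UTF-16LE -*-\nprint("Hello world")\n'

ascii_bytes = source.encode("ascii")
utf16_bytes = codecs.BOM_UTF16_LE + source.encode("utf-16-le")


def find_declaration(raw):
    """Return the encoding named on the first line of raw, or None."""
    first_line = raw.split(b"\n", 1)[0]
    match = CODING_RE.search(first_line)
    return match.group(1).decode("ascii") if match else None


# In ASCII the declaration is found; in UTF-16 every character is
# followed by a NUL byte, so the byte pattern never matches.
print(find_declaration(ascii_bytes))
print(find_declaration(utf16_bytes))
print(utf16_bytes[:8])
```

The declaration can only be read once the encoding is already
known, which is why a BOM (or an ASCII first line) is needed.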

I bring this up because Microsoft has adopted UTF-16 in 
various places.





----------------------------------------------------------------------

>Comment By: Wai Yip Tung (tungwaiyip)
Date: 2006-06-22 18:31

Message:
Logged In: YES 
user_id=561546

Turns out the code is already written but disabled. Simply 
turning it on would work.

tokenizer.c(321):
#if 0
	/* Disable support for UTF-16 BOMs until a decision
	   is made whether this needs to be supported.  */
	} else if (ch == 0xFE) {
		ch = get_char(tok); if (ch != 0xFF) goto NON_BOM;
		if (!set_readline(tok, "utf-16-be")) return 0;
		tok->decoding_state = -1;
	} else if (ch == 0xFF) {
		ch = get_char(tok); if (ch != 0xFE) goto NON_BOM;
		if (!set_readline(tok, "utf-16-le")) return 0;
		tok->decoding_state = -1;
#endif
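
The byte checks in the disabled branch can be sketched in Python as
follows. This is only an illustration of the same BOM test (FE FF
for big-endian, FF FE for little-endian), assuming Python 3; the
function name sniff_utf16_bom is hypothetical and does not exist in
tokenizer.c.

```python
import codecs


def sniff_utf16_bom(raw):
    """Return (codec_name, bom_length) for a UTF-16 BOM, else (None, 0)."""
    if raw.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be", 2
    if raw.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le", 2
    return None, 0


# A little-endian sample file: BOM followed by UTF-16-LE text.
sample = codecs.BOM_UTF16_LE + 'print("Hello world")\n'.encode("utf-16-le")
codec, skip = sniff_utf16_bom(sample)

# Skip the BOM, then decode the rest with the detected codec.
text = sample[skip:].decode(codec)
print(codec, repr(text))
```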


Executing a UTF-16 text file with a BOM would work. 
However, if I also include an encoding declaration plus a BOM, 
like this

  # -*- coding: UTF-16le -*-


It would result in this error, due to some logic in the code 
that I couldn't sort out {tokenizer.c(291)}:


  g:\bin\py_repos\python-svn\PCbuild>python_d.exe test16le.py
    File "test16le.py", line 1
  SyntaxError: encoding problem: utf-8


If you need a justification for checking the UTF-16 BOM, it 
is Microsoft. As an early adopter of Unicode, before UTF-8 
became popular, it has some software that generates UTF-16 
by default. This is not a fatal issue, but I see no reason 
not to support it either.


----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2006-06-13 10:06

Message:
Logged In: YES 
user_id=21627

The parser code is in the Parser subdirectory. It would be
good if you could follow the existing parsing strategy, i.e.
convert the input to UTF-8, and then proceed with the normal
parsing procedure.
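
The strategy described above (convert the input to UTF-8, then
parse as usual) can be sketched in Python. This is an illustration
under stated assumptions, not the Parser code: it assumes Python 3,
a UTF-16 file that starts with a BOM, and a hypothetical helper
named to_utf8_source.

```python
import codecs


def to_utf8_source(raw):
    """Transcode BOM-marked UTF-16 source bytes to UTF-8 bytes.

    Input without a UTF-16 BOM is returned unchanged.
    """
    if raw.startswith(codecs.BOM_UTF16_BE):
        text = raw[2:].decode("utf-16-be")
    elif raw.startswith(codecs.BOM_UTF16_LE):
        text = raw[2:].decode("utf-16-le")
    else:
        return raw
    return text.encode("utf-8")


# A UTF-16-LE source file with a BOM, as an editor might save it.
raw = codecs.BOM_UTF16_LE + 'greeting = "Hello world"\n'.encode("utf-16-le")

# After transcoding, the normal UTF-8 compilation path applies.
utf8_source = to_utf8_source(raw)
namespace = {}
exec(compile(utf8_source, "test16le.py", "exec"), namespace)
print(namespace["greeting"])
```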

----------------------------------------------------------------------

Comment By: Wai Yip Tung (tungwaiyip)
Date: 2006-06-13 09:27

Message:
Logged In: YES 
user_id=561546

That sounds good. It is probably a good time to try out the 
instructions to build Python on Windows.

http://groups.google.com/group/comp.lang.python/browse_thread/thread/f09c49f77bf0a578/3e076bfcafb064cd?hl=en#3e076bfcafb064cd

Can you point me to the relevant source code?




----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2006-06-10 05:27

Message:
Logged In: YES 
user_id=21627

Would you like to work on a patch?

There is, of course, a fairly obvious reason that this
doesn't work: nobody has put effort into making it work.

Personally, I suggest that you use some other encoding for
source code, e.g. UTF-8.

----------------------------------------------------------------------

Comment By: Wai Yip Tung (tungwaiyip)
Date: 2006-06-09 17:39

Message:
Logged In: YES 
user_id=561546




----------------------------------------------------------------------
