PEP 263 comments

Fri Mar 1 02:34:26 EST 2002

"Jason Orendorff" <jason at jorendorff.com> writes:

> > Java does accept iso-latin-1 files as input. In fact on my machine (Mac
> > OSX) it doesn't even accept utf-8 files with the utf-8 signature. And
> > strings containing utf-8 are interpreted as just 8-bit characters, meaning
> > every byte is a character.
> 
> Oh!  Yes, it works this way on Windows, too.  javac assumes source
> files are latin-1, and System.out.println() encodes output in latin-1.

That is not completely true. javac has the -encoding command-line
option, which lets you specify the source encoding; it defaults to the
platform default encoding (which was probably latin-1 or
windows-1252, respectively, on your systems).

> I'm referring to this paragraph in Martin's original post:
> 
>   The only problem with this approach is that encodings where " or '
>   could be the second byte of a multi-byte character cannot be
>   supported as a source encoding. Python supports no such encoding
>   in the standard library at the moment, anyway, so this should not
>   be a problem.
> 
> \x22 is a double-quote mark.  Martin is a little off on the last
> bit, though; UTF-16 can produce \x22 bytes.

Right. Source encodings (at least under the initial implementation)
need to be an ASCII superset (in the sense that source code that uses
only ASCII characters is ASCII-encoded); I see no way to allow UTF-16
as a source encoding.
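
Both points can be checked with a short Python sketch. It is only an
illustration and not part of any PEP 263 implementation; the helper
names contains_quote_byte and is_ascii_superset_for are hypothetical,
and the second one tests a single ASCII sample rather than proving the
property in general.

```python
QUOTE_BYTE = 0x22  # the byte value of the ASCII double-quote mark


def contains_quote_byte(text, encoding):
    """Return True if encoding `text` yields a 0x22 byte."""
    return QUOTE_BYTE in text.encode(encoding)


def is_ascii_superset_for(encoding, sample="x = 'abc'"):
    """Return True if an ASCII-only sample is encoded as plain ASCII."""
    return sample.encode(encoding) == sample.encode("ascii")


# U+2200 (FOR ALL) contains no quote character, yet one of its two
# UTF-16 bytes is 0x22. In UTF-8 every byte of a non-ASCII character
# is 0x80 or above, so the problem cannot occur there.
for_all = "\u2200"
print(contains_quote_byte(for_all, "utf-16-le"))
print(contains_quote_byte(for_all, "utf-8"))

# ASCII-only source stays ASCII under utf-8 and latin-1, but not under
# UTF-16, where every character takes two bytes.
for name in ("utf-8", "latin-1", "utf-16-le"):
    print(name, is_ascii_superset_for(name))
```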

Regards,
Martin