Unicode formatting for Strings

Mon Feb 5 17:23:53 EST 2007

On Feb 6, 8:05 am, robson.cozendey... at gmail.com wrote:
> On Feb 5, 7:00 pm, "Chris Mellon" <arka... at gmail.com> wrote:
>
>
>
> > On 2/5/07, Kent Johnson <k... at kentsjohnson.com> wrote:
>
> > > robson.cozendey... at gmail.com wrote:
> > > > Hi,
>
> > > > I'm trying desperately to tell the interpreter to put an 'á' in my
> > > > string, so here is the code snippet:
>
> > > > # -*- coding: utf-8 -*-
> > > > filename = u"Ataris Aquáticos #2.txt"
> > > > f = open(filename, 'w')
>
> > > > Then I save it with Windows Notepad, in the UTF-8 format. So:
>
> > > > 1) I put the "magic comment" at the start of the file
> > > > 2) I write u"" to specify my unicode string
> > > > 3) I save it in the UTF-8 format
>
> > > > And even so, I get an error!
>
> > > >   File "Ataris Aqußticos #2.py", line 1
> > > > SyntaxError: Non-ASCII character '\xff' in file Ataris Aqußticos #2.py
> > > > on line 1
>
> > > It looks like you are saving the file in Unicode format (not utf-8) and
> > > Python is choking on the Byte Order Mark that Notepad puts at the
> > > beginning of the document.
>
> > Notepad does support saving to UTF-8, and I was able to do this
> > without the problem the OP was having. I also saved both with and
> > without a BOM (in UTF-8) using SciTe, and Python worked correctly in
> > both cases.
>
> > > Try using an editor that will save utf-8 without a BOM, e.g. jedit or
> > > TextPad.
>
> > > Kent
> > > --
> > > > http://mail.python.org/mailman/listinfo/python-list
>
> I saved it in UTF-8 with Notepad.

Please consider that you might possibly be mistaken.

Here are dumps of 4 varieties of file:

>>> for i in range(4):
...     print '\nFile %d:\n%r' % (i, open('robson' + str(i) + '.py', 'rb').read())
...

File 0:
'\xef\xbb\xbf# -*- coding: utf-8 -*-\r\nfilename = u"Ataris Aqu\xc3\xa1ticos #2.txt"\r\nf = open(filename, \'w\')'

File 1:
'# -*- coding: utf-8 -*-\r\nfilename = u"Ataris Aqu\xc3\xa1ticos #2.txt"\r\nf = open(filename, \'w\')'

File 2:
'# -*- coding: cp1252 -*-\r\nfilename = u"Ataris Aqu\xe1ticos #2.txt"\r\nf = open(filename, \'w\')'

File 3:
'\xff\xfe#\x00 \x00-\x00*\x00-\x00 \x00c\x00o\x00d\x00i\x00n\x00g\x00:\x00 \x00u\x00t\x00f\x00-\x008\x00 \x00-\x00*\x00-\x00\r\x00\n\x00f\x00i\x00l\x00e\x00n\x00a\x00m\x00e\x00 \x00=\x00 \x00u\x00"\x00A\x00t\x00a\x00r\x00i\x00s\x00 ]
[snip]

File 0 was saved in UTF-8 with Notepad. Notepad puts a "UTF-8 BOM"
(\xef\xbb\xbf) at the front of the file. The script works (that is, it
creates a file with the a-acute character in its name). There is no
\xff character in line 1 for Python to complain about.

File 1 was saved in UTF-8 with another editor. No BOM, no problem.
Works.

File 2, which specifies cp1252 encoding (my default, and probably
yours too), was saved normally (i.e. without the stuffing about
necessary to get UTF-8). Works.

File 3 was saved in "Unicode" (really utf_16_le) using Notepad. As you
can see, it has a UTF-16-LE BOM (which contains \xff) at the start.
Python is not amused, giving exactly the same error message as you
reported.
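
The four dumps can be told apart by their first few bytes alone. Here is a minimal sketch of that check, written for Python 3 rather than the Python 2 used in the transcript above. The name sniff_bom and the labels it returns are my own inventions for illustration. It looks only for the UTF-8 and UTF-16 marks, so a UTF-32-LE file (whose BOM also starts with \xff\xfe) would be mislabelled.

```python
import codecs

# Byte order marks to look for, paired with a human-readable label.
# The constants come from the standard codecs module.
BOMS = [
    (codecs.BOM_UTF8, "utf-8 with BOM"),   # b'\xef\xbb\xbf'
    (codecs.BOM_UTF16_LE, "utf-16-le"),    # b'\xff\xfe'
    (codecs.BOM_UTF16_BE, "utf-16-be"),    # b'\xfe\xff'
]


def sniff_bom(data):
    """Return a label for the BOM at the start of data, or None if absent."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    return None
```

Calling sniff_bom(open('robson3.py', 'rb').read(4)) on a file like File 3 should report "utf-16-le", while Files 1 and 2 have no BOM and give None.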

So:

(1) If you still believe that you are getting a problem with a file
saved as UTF-8, please present reproducible credible evidence: for
example, a copy/paste of what happens when you (a) dump the file,
immediately followed by (b) running the file with Python.

(2) Consider using your "native" encoding (e.g. cp1252) with your
normal/usual editor/IDE.
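
For point (1), the evidence can be gathered in one step: show the raw bytes of the first line, and check whether the whole file really decodes as UTF-8. The following is a rough sketch only, in Python 3. The function name describe_source is hypothetical, and the file name is passed in rather than assumed.

```python
def describe_source(path):
    """Return (first_line_bytes, decodes_as_utf8) for the file at path."""
    # Read raw bytes so that no decoding happens behind our back.
    with open(path, "rb") as f:
        data = f.read()

    # Everything up to (not including) the first b"\n" byte.
    first_line = data.split(b"\n", 1)[0]

    # A file genuinely saved as UTF-8 (with or without a BOM) decodes
    # cleanly; a UTF-16 or cp1252 file containing 'á' does not.
    try:
        data.decode("utf-8")
        decodes_as_utf8 = True
    except UnicodeDecodeError:
        decodes_as_utf8 = False

    return first_line, decodes_as_utf8
```

Printing repr() of the first element gives the same kind of dump as shown earlier. If it starts with \xff\xfe and the second element is False, the file was not saved as UTF-8.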

> I was thinking here... It can be a
> limitation of file.open() method?

No, it can't.

> Have anyone tested that?

Unlikely.

HTH,
John