From tree@basistech.com Tue May 15 20:28:05 2001 From: tree@basistech.com (Tom Emerson) Date: Tue, 15 May 2001 15:28:05 -0400 Subject: [I18n-sig] Extending definition of errors argument to codecs Message-ID: <15105.33605.300173.26763@cymru.basistech.com> I'd like to propose an extension to the Codec error reporting mechanism: The 'errors' argument to encode/decode et al. would be much more useful as a callable object. The current semantics of 'strict', 'ignore', and 'replace' are trivially implemented in this scheme, while allowing a specific application to extend a codec with custom error handling if necessary. Something along the lines of: class CodecError: def __call__(self, bytes): pass class CodecError_Replace ( CodecError ): def __call__(self, bytes): return u'\uFFFD' class CodecError_Strict ( CodecError ): def __call__(self, bytes): raise UnicodeError, "cannot map byte range to Unicode" Why would this be useful? I'm working with text that purports to be in Big 5, but in fact it is encoded with CP950. CP950 is identical to Big 5 except that it has a handful of extra codepoints in the 0xF9 VDA block (taken from the Eten extension). When using the current Big 5 codec on these files I sometimes blow up because of these extended characters. I would love to be able to do something like: class CodecError_CP950 ( CodecError_Strict ): def __call__(self, bytes): if bytes == '\xf9\xd6': return u'\u7881' return CodecError_Strict.__call__(self, bytes) This effectively allows me to expand upon the repertoire encoded by the codec without modifying the tables and rebuilding (as I do now as a workaround), generating new tables, or whatever else. Food for thought. The above design is off-the-cuff, but I think it is close to my thoughts on the matter. OK, flame away. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Tue May 15 21:12:22 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 15 May 2001 22:12:22 +0200 Subject: [I18n-sig] Extending definition of errors argument to codecs References: <15105.33605.300173.26763@cymru.basistech.com> Message-ID: <3B018DA6.B346732A@lemburg.com> Tom Emerson wrote: > > I'd like to propose an extension to the Codec error reporting mechanism: > > The 'errors' argument to encode/decode et al. would be much more > useful as a callable object. The current semantics of 'strict', > 'ignore', and 'replace' are trivially implemented in this scheme, > while allowing a specific application to extend a codec with custom > error handling if necessary. This has been proposed some months ago already. The problem with this approach is that it seriously breaks binary compatibility at the C level, since all C APIs use const char *error. The call interface would also have to be a little more context aware, so that the callback actually has a chance of modifying the current codec process -- simply returning a usable replacement character isn't enough in the general case, where one might want to be able to resync with the input stream in case there's a break in synchronization. If you can come up with a patch which maintains backward compatibility e.g. by adding a compatibility layer using lots of PyUnicode_EncodeEx() APIs, there's a good chance of getting this into the core. 
Still, it's lots of work and I'm not sure whether it wouldn't be more worthwhile adding these sorts of special error handling schemes to the codecs in question rather than making them a generic option for all codecs. > Something along the lines of: > > class CodecError: > def __call__(self, bytes): > pass > > class CodecError_Replace ( CodecError ): > def __call__(self, bytes): > return u'\uFFFD' > > class CodecError_Strict ( CodecError ): > def __call__(self, bytes): > raise UnicodeError, "cannot map byte range to Unicode" > > Why would this be useful? I'm working with text that purports to be in Big > 5, but in fact it is encoded with CP950. CP950 is identical to Big 5 > except that it has a handful of extra codepoints in the 0xF9 VDA block > (taken from the Eten extension). When using the current Big 5 codec on > these files I sometimes blow up because of these extended > characters. I would love to be able to do something like: > > class CodecError_CP950 ( CodecError_Strict ): > def __call__(self, bytes): > if bytes == '\xf9\xd6': > return u'\u7881' > return CodecError_Strict.__call__(self, bytes) > > This effectively allows me to expand upon the repertoire encoded by > the codec without modifying the tables and rebuilding (as I do now as > a workaround), generating new tables, or whatever else. > > Food for thought. The above design is off-the-cuff, but I think it is > close to my thoughts on the matter. > > OK, flame away. > > -tree > > -- > Tom Emerson Basis Technology Corp. > Sr. Sinostringologist http://www.basistech.com > "Beware the lollipop of mediocrity: lick it once and you suck forever" > > _______________________________________________ > I18n-sig mailing list > I18n-sig@python.org > http://mail.python.org/mailman/listinfo/i18n-sig -- Marc-Andre Lemburg ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Tue May 15 22:09:52 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 15 May 2001 23:09:52 +0200 Subject: [I18n-sig] Extending definition of errors argument to codecs In-Reply-To: <3B018DA6.B346732A@lemburg.com> (mal@lemburg.com) References: <15105.33605.300173.26763@cymru.basistech.com> <3B018DA6.B346732A@lemburg.com> Message-ID: <200105152109.f4FL9q804004@mira.informatik.hu-berlin.de> > This has been proposed some months ago already. The problem with > this approach is that it seriously breaks binary compatibility > at the C level, since all C APIs use const char *error. As discussed last time, this is not a serious problem. You could move the existing API to use callable objects as arguments, and provide wrapper functions that still accept strings. > simply returning a usable replacement character isn't enough in the > general case That points to the major problem we had last time: We could not agree on what the general case is. In every demonstrated use case, a simple replacement string would have been enough (remember that, in the XML case, it would have also been a replacement *string*, e.g. "Ⴓ") > Still, it's lots of work and I'm not sure whether it wouldn't > be more worthwhile adding these sorts of special error handling > schemes to the codecs in question rather than making them > a generic option for all codecs. Ok, this is an improvement over the last time this discussion came up, where we only agreed to implement an "XML" error handling or some such. Regards, Martin
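For comparison, the callback mechanism that later landed in Python (codecs.register_error, PEP 293) can express Tom's CP950 fallback roughly as sketched below; the 'big5' codec name and its support for decode-time error callbacks are assumptions here, not something available at the time of this thread.

import codecs

def cp950_fallback(exc):
    # Decode error handler: map the CP950/Eten byte pair F9 D6 to U+7881,
    # re-raise anything else so other errors still fail loudly.
    if isinstance(exc, UnicodeDecodeError):
        if exc.object[exc.start:exc.start + 2] == '\xf9\xd6':
            return (u'\u7881', exc.start + 2)
    raise exc

codecs.register_error('cp950-fallback', cp950_fallback)
# assumed usage: text = data.decode('big5', 'cp950-fallback')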
From paulp@ActiveState.com Wed May 16 18:32:43 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 16 May 2001 10:32:43 -0700 Subject: [I18n-sig] UTF-8 and BOM Message-ID: <3B02B9BB.E1F6AE39@ActiveState.com> Notepad always saves UTF-8 documents with a BOM. Visual Studio 7 gives users an option. Python 2.1's UTF-8 decoder seems to treat the BOM as a real leading character. The UTF-16 decoder removes it. I recognize that the BOM is not useful as a "byte order mark" for UTF-8 data but I would still suggest that the UTF-8 decoder should remove it for these reasons: 1) Microsoft has taken the stance that a BOM is legal on UTF-8 data 2) Doing so is legal: "Q: Is the UTF-8 encoding scheme the same irrespective of whether the underlying processor is little endian or big endian? A: Yes. Since UTF-8 is interpreted as a sequence of bytes, there is no endian problem as there is for encoding forms that use 16-bit or 32-bit code units. Where a BOM is used with UTF-8, it is only to distinguish UTF-8 from other UTF encodings - it has nothing to do with byte order. [KW]" http://www.unicode.org/unicode/faq/utf_bom.html 3) I think that distinguishing UTF-8 from other encodings through the BOM is actually a great idea and I wish that every UTF-8 creator would do it! 4) The behavior would be consistent with the UTF-16 behavior. ---- import codecs with_bom = u"\uFEFFabcd" utf_8 = with_bom.encode("utf-8") utf_16 = with_bom.encode("utf-16") print repr(codecs.utf_8_decode(utf_8)) (u'\ufeffabcd', 7) print repr(codecs.utf_16_decode(utf_16)) (u'abcd', 12) -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From mal@lemburg.com Wed May 16 19:48:51 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 16 May 2001 20:48:51 +0200 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> Message-ID: <3B02CB93.A9DCFD8@lemburg.com> Paul Prescod wrote: > > Notepad always saves UTF-8 documents with a BOM. Visual Studio 7 gives > users an option. > > Python 2.1's UTF-8 decoder seems to treat the BOM as a real leading > character. The UTF-16 decoder removes it. I recognize that the BOM is > not useful as a "byte order mark" for UTF-8 data but I would still > suggest that the UTF-8 decoder should remove it for these reasons: > 1) Microsoft has taken the stance that a BOM is legal on UTF-8 data BOMs are standard Unicode char points, so they are legal in all Unicode encodings. > 2) Doing so is legal: > > "Q: Is the UTF-8 encoding scheme the same irrespective of whether the > underlying processor is little endian or big endian? > > A: Yes. Since UTF-8 is interpreted as a sequence of bytes, there is no > endian problem as there is for encoding forms that use 16-bit or 32-bit > code units. Where a BOM is used with UTF-8, it is only to distinguish > UTF-8 from other UTF encodings - it has nothing to do with byte order. > [KW]" > > http://www.unicode.org/unicode/faq/utf_bom.html ... as I said :-) > 3) I think that distinguishing UTF-8 from other encodings through the > BOM is actually a great idea and I wish that every UTF-8 creator would > do it! Uhm, I can't follow you here... BOMs in UTF-8 look like this: >>> u'\ufeff'.encode('utf-8') '\xef\xbb\xbf' which is somewhat different from '\xff\xfe' or '\xfe\xff'. > 4) The behavior would be consistent with the UTF-16 behavior. 
>>> u'\ufeff'.encode('utf-16') '\xff\xfe\xff\xfe' >>> u'\ufeff'.encode('utf-16-le') '\xff\xfe' >>> u'\ufeff'.encode('utf-16-be') '\xfe\xff' >>> u'\ufeff'.encode('utf-8') '\xef\xbb\xbf' -- Marc-Andre Lemburg ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From guido@digicool.com Wed May 16 20:55:35 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 16 May 2001 14:55:35 -0500 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: Your message of "Wed, 16 May 2001 20:48:51 +0200." <3B02CB93.A9DCFD8@lemburg.com> References: <3B02B9BB.E1F6AE39@ActiveState.com> <3B02CB93.A9DCFD8@lemburg.com> Message-ID: <200105161955.OAA04144@cj20424-a.reston1.va.home.com> > > 3) I think that distinguising UTF-8 from other encodings through the > > BOM is actually a great idea and I wish that every UTF-8 creator would > > do it! > > Uhm, I can't follow you here... BOMs in UTF-8 look like this: > > >>> u'\ufeff'.encode('utf-8') > '\xef\xbb\xbf' > > which is somewhat different from '\xff\xfe' or '\xfe\xff'. I think he meant that this serves as a sort-of "magic number" for UTF-8 files. I find that kind of cute myself. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From paulp@ActiveState.com Wed May 16 20:06:55 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 16 May 2001 12:06:55 -0700 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <3B02CB93.A9DCFD8@lemburg.com> <200105161955.OAA04144@cj20424-a.reston1.va.home.com> Message-ID: <3B02CFCF.A26624E8@ActiveState.com> Guido van Rossum wrote: > >... > > I think he meant that this serves as a sort-of "magic number" for > UTF-8 files. I find that kind of cute myself. :-) What he said. Thanks to this trick, notepad and Visual Studio are extremely good at auto-detecting encodings for Unicode text files created with either tool. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From paulp@ActiveState.com Wed May 16 20:26:41 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 16 May 2001 12:26:41 -0700 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <3B02CB93.A9DCFD8@lemburg.com> Message-ID: <3B02D471.6628A0@ActiveState.com> "M.-A. Lemburg" wrote: > >... > > BOMs are standard Unicode char points, so they are legal in all > Unicode encodings. My point is that it is legal to interpret it as a BOM and not just a character. >... > Uhm, I can't follow you here... BOMs in UTF-8 look like this: > > >>> u'\ufeff'.encode('utf-8') > '\xef\xbb\xbf' > > which is somewhat different from '\xff\xfe' or '\xfe\xff'. That's what's great about it! >... > >>> u'\ufeff'.encode('utf-16') > '\xff\xfe\xff\xfe' It is curious that decoding this removes both FEFF characters. Is it right that the decoder removes all BOM sequences? >>> codecs.utf_16_decode( codecs.BOM*10 + "a".encode("UTF-16") + codecs.BOM*10) (u'a', 44) -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From mal@lemburg.com Wed May 16 20:59:50 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 16 May 2001 21:59:50 +0200 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <3B02CB93.A9DCFD8@lemburg.com> <3B02D471.6628A0@ActiveState.com> Message-ID: <3B02DC36.113E7BE9@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > >... 
> > > > BOMs are standard Unicode char points, so they are legal in all > > Unicode encodings. > > My point is that it is legal to interpret it as a BOM and not just a > character. That's correct (and also the reasoning behind adding BOMs in files or streams and being allowed to remove them at your own will). > >... > > Uhm, I can't follow you here... BOMs in UTF-8 look like this: > > > > >>> u'\ufeff'.encode('utf-8') > > '\xef\xbb\xbf' > > > > which is somewhat different from '\xff\xfe' or '\xfe\xff'. > > That's what's great about it! Ok, now I get it: you want to use '\xef\xbb\xbf' as a file encoding identifier. Sounds like a good idea ! > >... > > >>> u'\ufeff'.encode('utf-16') > > '\xff\xfe\xff\xfe' > > It is curious that decoding this removes both FEFF characters. Is it > right that the decoder removes all BOM sequences? > > >>> codecs.utf_16_decode( codecs.BOM*10 + "a".encode("UTF-16") + codecs.BOM*10) > (u'a', 44) Yes. The codec is smart enough to even handle input streams with mixed byte orders (it switches dynamically based on what it finds in the stream). Note that BYTE ORDER MARK is only a comment for char point '\ufeff'. The real name is: ZERO WIDTH NO-BREAK SPACE. Adding or removing these will not cause any visible effect in the text or change the formatting. That's why you can add or remove them at your own will. So what do you want to see in 2.2 ? ... Have the UTF-8 codec remove all BOM marks from its input, or add BOM marks in some places or add a codec utf-8-bom which prepends BOM to the start of all encoded strings ? -- Marc-Andre Lemburg ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Wed May 16 22:07:49 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 16 May 2001 23:07:49 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <3B02B9BB.E1F6AE39@ActiveState.com> (message from Paul Prescod on Wed, 16 May 2001 10:32:43 -0700) References: <3B02B9BB.E1F6AE39@ActiveState.com> Message-ID: <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> > Python 2.1's UTF-8 decoder seems to treat the BOM as a real leading > character. The UTF-16 decoder removes it. I recognize that the BOM is > not useful as a "byte order mark" for UTF-8 data but I would still > suggest that the UTF-8 decoder should remove it for these reasons: I think it is good to remove the BOM when decoding UTF-8. Most likely, the only reason that this is not done is that nobody thought that there might be one. I disagree that putting the BOM into a file is a good thing - I think it is stupid to do so. First of all, auto-detection can always be fooled, so there should be a higher-level protocol for reliable data processing. UTF-8 is relatively easy to auto-detect if you believe in auto-detection - it's just that looking at the first few bytes is not sufficient. OTOH, UTF-8 is concatenation-safe: you can reliably concatenate two UTF-8 files to get another UTF-8 file. That property is lost if there is a BOM in the file. Regards, Martin From mal@lemburg.com Wed May 16 22:27:13 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 16 May 2001 23:27:13 +0200 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> Message-ID: <3B02F0B1.8863FDB1@lemburg.com> "Martin v. 
Loewis" wrote: > > > Python 2.1's UTF-8 decoder seems to treat the BOM as a real leading > > character. The UTF-16 decoder removes it. I recognize that the BOM is > > not useful as a "byte order mark" for UTF-8 data but I would still > > suggest that the UTF-8 decoder should remove it for these reasons: > > I think it is good to remove the BOM when decoding UTF-8. Most likely, > the only reason that this is not done is that nobody thought that > there might be one. > > I disagree that putting the BOM into a file is a good thing - I think > it is stupid to do so. First of all, auto-detection can always be > fooled, so there should be a higher-level protocol for reliable data > processing. UTF-8 is relatively easy to auto-detect if you believe in > auto-detection - it's just that looking at the first few bytes it not > sufficient. > > OTOH, UTF-8 is concatenation-safe: you can reliably concatenate two > UTF-8 files to get another UTF-8 file. That properly is lost if there > is a BOM in the file. Why should a BOM behave any different than any other Unicode character ? BOMs can be added and deleted in pretty much all places of a Unicode text -- that's their intent after all, so I don't see how they could break any property of an encoding. Or did you have the same misunderstanding as I did ? ... Paul is talking about the UTF-8 encoding of the BOM mark ('\xef\xbb\xbf'), not the FF FE or FE FF byte sequence as is seen in UTF-16 streams. -- Marc-Andre Lemburg ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From paulp@ActiveState.com Wed May 16 22:41:35 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 16 May 2001 14:41:35 -0700 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> Message-ID: <3B02F40F.C6C1CE4A@ActiveState.com> "Martin v. Loewis" wrote: > > > Python 2.1's UTF-8 decoder seems to treat the BOM as a real leading > > character. The UTF-16 decoder removes it. I recognize that the BOM is > > not useful as a "byte order mark" for UTF-8 data but I would still > > suggest that the UTF-8 decoder should remove it for these reasons: > > I think it is good to remove the BOM when decoding UTF-8. Most likely, > the only reason that this is not done is that nobody thought that > there might be one. Okay good. > I disagree that putting the BOM into a file is a good thing - I think > it is stupid to do so. First of all, auto-detection can always be > fooled, so there should be a higher-level protocol for reliable data > processing. There should be but there isn't always. What is the standard way for tagging UTF-8 documents on the Windows file system? > UTF-8 is relatively easy to auto-detect if you believe in > auto-detection - it's just that looking at the first few bytes it not > sufficient. Yes, we're going to autodetect by trying to decode the data but that's a pretty expensive operation. You never know if the very first non-ASCII char will appear in the last few bytes of the file. Anyhow, it doesn't matter. If I want a BOM in files I write out, I can add it. My main goal is to have the reader do the right thing with "Microsoft-format" Unicode files. > OTOH, UTF-8 is concatenation-safe: you can reliably concatenate two > UTF-8 files to get another UTF-8 file. That properly is lost if there > is a BOM in the file. So what if there is a BOM in the middle of the data stream. 
MAL's decoder will just remove it anyhow. :) -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From paulp@ActiveState.com Wed May 16 22:57:06 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 16 May 2001 14:57:06 -0700 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <3B02CB93.A9DCFD8@lemburg.com> <3B02D471.6628A0@ActiveState.com> <3B02DC36.113E7BE9@lemburg.com> Message-ID: <3B02F7B2.F932C084@ActiveState.com> "M.-A. Lemburg" wrote: > >... > > Note that BYTE ORDER MARK is only a comment for char point > '\ufeff'. The real name is: ZERO WIDTH NO-BREAK SPACE. Adding > or removing these will not cause any visible effect in the > text or change the formatting. That's why you can add or > remove them at your own will. I'm not sure I buy that, but one could argue that a Zero width no-break space character is a legitimate character whether you can see it on a computer screen or not...but I don't care enough to make that argument. > So what do you want to see in 2.2 ? ... Have the UTF-8 codec remove > all BOM marks from its input, or add BOM marks in some places > or add a codec utf-8-bom which prepends BOM to the start of > all encoded strings ? I'd like the UTF-8 codec to treat BOMs (especially leading BOMs) as the UTF-16 one does. Probably BOM_UTF8 should be added to codecs.py. I'm not sure whether we need another codec. Probably not... -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From mal@lemburg.com Wed May 16 23:20:49 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 17 May 2001 00:20:49 +0200 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <3B02CB93.A9DCFD8@lemburg.com> <3B02D471.6628A0@ActiveState.com> <3B02DC36.113E7BE9@lemburg.com> <3B02F7B2.F932C084@ActiveState.com> Message-ID: <3B02FD41.20675BC3@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > >... > > > > Note that BYTE ORDER MARK is only a comment for char point > > '\ufeff'. The real name is: ZERO WIDTH NO-BREAK SPACE. Adding > > or removing these will not cause any visible effect in the > > text or change the formatting. That's why you can add or > > remove them at your own will. > > I'm not sure I buy that, but one could argue that a Zero width no-break > space character is a legitimate character whether you can see it on a > computer screen or not...but I don't care enough to make that argument. Text data is different than binary data. Unicode text which uses combining characters (e.g. accent and 'e' to produce 'é') is equivalent to text which uses the combined character point directly. This corner of Unicode is not well covered yet in Python's Unicode implementation. The two major missing items are normalization and collation support. > > So what do you want to see in 2.2 ? ... Have the UTF-8 codec remove > > all BOM marks from its input, or add BOM marks in some places > > or add a codec utf-8-bom which prepends BOM to the start of > > all encoded strings ? > > I'd like the UTF-8 codec to treat BOMs (especially leading BOMs) as the > UTF-16 one does. Probably BOM_UTF8 should be added to codecs.py. I'm not > sure whether we need another codec. Probably not... You have to be careful here: UTF-16 prepends a BOM mark to every string pushed through the codec -- even small snippets. You certainly don't want to make that the default for the much more common UTF-8 which has no real requirement to include BOM marks at all... 
having the decoder automatically remove BOM marks is easy to implement and won't cause any harm, but carelessly adding them will get us into trouble. -- Marc-Andre Lemburg ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From paulp@ActiveState.com Wed May 16 23:26:56 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 16 May 2001 15:26:56 -0700 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <3B02CB93.A9DCFD8@lemburg.com> <3B02D471.6628A0@ActiveState.com> <3B02DC36.113E7BE9@lemburg.com> <3B02F7B2.F932C084@ActiveState.com> <3B02FD41.20675BC3@lemburg.com> Message-ID: <3B02FEB0.2A4135A6@ActiveState.com> "M.-A. Lemburg" wrote: > >... > > You have to be careful here: UTF-16 prepends a BOM mark to > every string pushed through the codec -- even small snippets. > You certainly don't want to make that the default for the > much more common UTF-8 which has no real requirement to include > BOM marks at all... having the decoder automatically remove > BOM marks is easy to implement and won't cause any harm, > but carelessly adding them will get us into trouble. Yes, I meant to say that the standard decoder should remove them and left it up to you whether we should have another codec where the encoder adds them. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From martin@loewis.home.cs.tu-berlin.de Thu May 17 05:22:42 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 17 May 2001 06:22:42 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <3B02F0B1.8863FDB1@lemburg.com> (mal@lemburg.com) References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> Message-ID: <200105170422.f4H4MgC01079@mira.informatik.hu-berlin.de> > Why should a BOM behave any different than any other Unicode > character ? BOMs can be added and deleted in pretty much all > places of a Unicode text -- that's their intent after all, so > I don't see how they could break any property of an encoding. > > Or did you have the same misunderstanding as I did ? ... > Paul is talking about the UTF-8 encoding of the BOM mark ('\xef\xbb\xbf'), > not the FF FE or FE FF byte sequence as is seen in UTF-16 streams. So am I, and I think that when decoding UTF-8, the first Unicode character should be removed when it is the BOM, by the UTF-8 decoder. It should be removed in that place because it was inserted only to identify UTF-8 (just as the byte sequence FF FE was inserted into the UTF-16 stream to identify it as UTF-16, and to identify the byte order). I don't think the decoder should remove the BOM from any other location in the text, since removing it *does* change the content of the text. It may be removed as part of applying some normalization, but that should not happen unless the application explicitly requests that normalization. In fact, none of the Unicode normalization forms removes the BOM (see TR #15). The BOM is recommended to be a valid character in identifiers, and it is recommended to remove it before comparing identifiers (since it is a formatting character). Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu May 17 05:28:56 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Thu, 17 May 2001 06:28:56 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <3B02F7B2.F932C084@ActiveState.com> (message from Paul Prescod on Wed, 16 May 2001 14:57:06 -0700) References: <3B02B9BB.E1F6AE39@ActiveState.com> <3B02CB93.A9DCFD8@lemburg.com> <3B02D471.6628A0@ActiveState.com> <3B02DC36.113E7BE9@lemburg.com> <3B02F7B2.F932C084@ActiveState.com> Message-ID: <200105170428.f4H4Su501137@mira.informatik.hu-berlin.de> > "M.-A. Lemburg" wrote: > > > >... > > > > Note that BYTE ORDER MARK is only a comment for char point > > '\ufeff'. The real name is: ZERO WIDTH NO-BREAK SPACE. No, and yes. "BYTE ORDER MARK" is not in the comment field of the database, but in the "Unicode 1.0 name" of the database. [Paul] > I'm not sure I buy that, but one could argue that a Zero width no-break > space character is a legitimate character whether you can see it on a > computer screen or not...but I don't care enough to make that argument. I do. A reader must not remove the BOM, unless it is clearly meant to indicate the encoding of a document. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu May 17 05:25:36 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 17 May 2001 06:25:36 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <3B02F40F.C6C1CE4A@ActiveState.com> (message from Paul Prescod on Wed, 16 May 2001 14:41:35 -0700) References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F40F.C6C1CE4A@ActiveState.com> Message-ID: <200105170425.f4H4Pa401135@mira.informatik.hu-berlin.de> > > I disagree that putting the BOM into a file is a good thing - I think > > it is stupid to do so. First of all, auto-detection can always be > > fooled, so there should be a higher-level protocol for reliable data > > processing. > > There should be but there isn't always. What is the standard way for > tagging UTF-8 documents on the Windows file system? There probably is none, although giving them a .txt extension is a good starting point. What is the standard for tagging KOI8-R documents on the Windows file system? > So what if there is a BOM in the middle of the data stream. MAL's > decoder will just remove it anyhow. :) Yes, and I think this is a bug. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu May 17 05:32:24 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 17 May 2001 06:32:24 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <3B02FD41.20675BC3@lemburg.com> (mal@lemburg.com) References: <3B02B9BB.E1F6AE39@ActiveState.com> <3B02CB93.A9DCFD8@lemburg.com> <3B02D471.6628A0@ActiveState.com> <3B02DC36.113E7BE9@lemburg.com> <3B02F7B2.F932C084@ActiveState.com> <3B02FD41.20675BC3@lemburg.com> Message-ID: <200105170432.f4H4WOw01160@mira.informatik.hu-berlin.de> > Text data is different than binary data. Unicode text > which uses combining characters (e.g. accent and 'e' to produce > 'é') is equivalent to text which uses the combined character > point directly. Are you saying that the BOM is removed under normalization? Which normalization form? > You have to be careful here: UTF-16 prepends a BOM mark to > every string pushed through the codec -- even small snippets. That seems like an error also. When writing to a UTF-16 stream, I want the BOM to appear only in the first bytes of the resulting file. 
Regards, Martin From paulp@ActiveState.com Thu May 17 17:46:12 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 17 May 2001 09:46:12 -0700 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F40F.C6C1CE4A@ActiveState.com> <200105170425.f4H4Pa401135@mira.informatik.hu-berlin.de> Message-ID: <3B040054.102EBE46@ActiveState.com> "Martin v. Loewis" wrote: > >... > > There probably is none, although giving them a .txt extension is a > good starting point. What is the standard for tagging KOI8-R documents > on the Windows file system? There isn't one. But utf-8 is an encoding that is growing in popularity and KOI8-R is one that is shrinking. The unreliability of "code pages" is a big part of what Unicode is supposed to fix. > > So what if there is a BOM in the middle of the data stream. MAL's > > decoder will just remove it anyhow. :) > > Yes, and I think this is a bug. Nevertheless, I don't see how concatenating two BOM-prefixed UTF-8 streams is any more or less problematic than concatenating two BOM-prefixed UTF-16 streams. I'll repeat that I'm not saying that the UTF-8 encoder should add a BOM. Until this convention is more common, we shouldn't try to be innovative. But I still think that BOMs on UTF-8 are a good idea. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From paulp@ActiveState.com Fri May 18 19:45:22 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 18 May 2001 11:45:22 -0700 Subject: [I18n-sig] Transparent Encoding Message-ID: <3B056DC2.D143F641@ActiveState.com> I would like to suggest that if the "data_encoding" parameter of EncodedFile is missing or None, the encoding "unicode_internal" should be used. Right now it is not really clear how to use the EncodedFile to *encode* or *decode* as opposed to *transcode* (translate between encodings). In fact it is documented only as a transcoder even though I think that it will more often be used as an encoder or decoder. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From mal@lemburg.com Fri May 18 21:05:20 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 18 May 2001 22:05:20 +0200 Subject: [I18n-sig] Transparent Encoding References: <3B056DC2.D143F641@ActiveState.com> Message-ID: <3B058080.7CEFF26C@lemburg.com> Paul Prescod wrote: > > I would like to suggest that if the "data_encoding" parameter of > EncodedFile is missing or None, the encoding "unicode_internal" should > be used. Right now it is not really clear how to use the EncodedFile to > *encode* or *decode* as opposed to *transcode* (translate between > encodings). In fact it is documented only as a transcoder even though I > think that it will more often be used as an encoder or decoder. EncodedFile() creates an object which interfaces between two worlds: the file and the program. In this sense it is always a recoder. I don't see why you want to make unicode-internal the default for data_encoding... if you don't want an encoding, you shouldn't use EncodedFile() at all. 
-- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From paulp@ActiveState.com Fri May 18 21:40:37 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 18 May 2001 13:40:37 -0700 Subject: [I18n-sig] Transparent Encoding References: <3B056DC2.D143F641@ActiveState.com> <3B058080.7CEFF26C@lemburg.com> Message-ID: <3B0588C5.97E5E2E8@ActiveState.com> "M.-A. Lemburg" wrote: > >... > > I don't see why you want to make unicode-internal the default > for data_encoding... if you don't want an encoding, you shouldn't > use EncodedFile() at all. What's a better idiom for stream = codecs.EncodedFile(fileobj, "unicode-internal", "utf-8") if I want to wrap a writable fileobj in a transparent UTF-8 encoder? ---- Also, in Python 2.1 I just noticed that this code does some weird pointer thing that crashes Python sometimes: >>> for i in (1,2,3): ... codecs.EncodedFile(open("foo.txt","w"), "unicode-internal", "utf-8").write(u"\u2222") ... >>> ^Z Sometimes it crashes immediately and sometimes it only crashes when you try to shut down Python. I can submit a bug report if you can't diagnose this easily and haven't heard of it before. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From fw@deneb.enyo.de Fri May 18 22:05:51 2001 From: fw@deneb.enyo.de (Florian Weimer) Date: 18 May 2001 23:05:51 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <3B02F0B1.8863FDB1@lemburg.com> ("M.-A. Lemburg"'s message of "Wed, 16 May 2001 23:27:13 +0200") References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> Message-ID: <877kze74gg.fsf@deneb.enyo.de> "M.-A. Lemburg" writes: > Why should a BOM behave any different than any other Unicode > character ? BOMs can be added and deleted in pretty much all > places of a Unicode text -- that's their intent after all, so > I don't see how they could break any property of an encoding. The BOM is overloaded with two meanings, it's certainly not a no-op character. From fw@deneb.enyo.de Fri May 18 22:04:13 2001 From: fw@deneb.enyo.de (Florian Weimer) Date: 18 May 2001 23:04:13 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <3B02B9BB.E1F6AE39@ActiveState.com> (Paul Prescod's message of "Wed, 16 May 2001 10:32:43 -0700") References: <3B02B9BB.E1F6AE39@ActiveState.com> Message-ID: <87bsoq74j6.fsf@deneb.enyo.de> Paul Prescod writes: > 3) I think that distinguishing UTF-8 from other encodings through the > BOM is actually a great idea and I wish that every UTF-8 creator would > do it! I think it's even mandated by ISO/IEC 10646-1:2000. However, the BOM is incompatible with the traditional Unix tools, so most people (especially the Linux-UTF-8 folks) recommend not to use it. From martin@loewis.home.cs.tu-berlin.de Fri May 18 22:05:09 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Fri, 18 May 2001 23:05:09 +0200 Subject: [I18n-sig] Transparent Encoding In-Reply-To: <3B0588C5.97E5E2E8@ActiveState.com> (message from Paul Prescod on Fri, 18 May 2001 13:40:37 -0700) References: <3B056DC2.D143F641@ActiveState.com> <3B058080.7CEFF26C@lemburg.com> <3B0588C5.97E5E2E8@ActiveState.com> Message-ID: <200105182105.f4IL59I02050@mira.informatik.hu-berlin.de> > What's a better idiom for > > stream = codecs.EncodeFile(fileobj, "unicode-internal", "utf-8") > > I want to a writable fileobj in a transparent UTF-8 encoder? Is fileobj already given as open, or do you have a filename for it? If the latter, just do stream = codecs.open(filename, "w", encoding="utf-8") If the former, do writer = codecs.lookup("utf-8")[3] # or # enc, dec, reader, writer = codecs.lookup("utf-8") stream = writer(fileobj) An EncodedFile is not suitable since it has byte strings on both ends, and Unicode strings only inside. Regards, Martin From paulp@ActiveState.com Fri May 18 23:01:57 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 18 May 2001 15:01:57 -0700 Subject: [I18n-sig] Transparent Encoding References: <3B056DC2.D143F641@ActiveState.com> <3B058080.7CEFF26C@lemburg.com> <3B0588C5.97E5E2E8@ActiveState.com> <200105182105.f4IL59I02050@mira.informatik.hu-berlin.de> Message-ID: <3B059BD5.CE05BBF3@ActiveState.com> "Martin v. Loewis" wrote: > >... > An EncodedFile is not suitable since it has byte strings on both ends, > and Unicode strings only inside. EncodedFile seems to work as I ask if I pass it the encoding name as "unicode-internal". Furthermore, code that does that is much simpler than code that looks up the codec manually. I'm not a big fan of those codec tuples. Current: writer = codecs.lookup("utf-8")[3] stream = writer(fileobj) Proposed: codecs.EncodedFile(fileobj, None, "utf-8") As I understand it, you can almost always go without looking up the encoder tuple thanks to the .encode method. And you can almost always go without looking up the decoder, thanks to the .decode method. This EncodedFile convention would allow most common cases of wrapping Unicode to avoid looking up the tuple also. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From mal@lemburg.com Sat May 19 11:16:55 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 19 May 2001 12:16:55 +0200 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> Message-ID: <3B064817.95FB5F5E@lemburg.com> Florian Weimer wrote: > > "M.-A. Lemburg" writes: > > > Why should a BOM behave any different than any other Unicode > > character ? BOMs can be added and deleted in pretty much all > > places of a Unicode text -- that's their intent after all, so > > I don't see how they could break any property of an encoding. > > The BOM is overloaded with two meanings, it's certainly not a no-op > character. I didn't say that a BOM is a no-op character, just that adding or removing a BOM character doesn't break the encoding. For more infos on BOMs and how they are intended to be used, please see the Unicode FAQ: http://www.unicode.org/unicode/faq/utf_bom.html#24 The problem with BOMs is that they are supposed to appear at the start of a string. However, if you concatenate two such strings, the BOM in the middle will turn into a normal ZWNBSP character. 
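For illustration, a small sketch of that effect (the result shown assumes a decoder that strips only a leading BOM; a decoder that strips them all would return u'abcdef' instead):

>>> data = u"abc".encode("utf-16") + u"def".encode("utf-16")
>>> unicode(data, "utf-16")
u'abc\ufeffdef'

The first BOM is consumed as a byte order mark, while the second survives as an ordinary ZWNBSP in the middle of the text.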
To be fully standards compliant, string concat of a UTF-16 string (which start with BOM marks) would have to be special cased. This is not possible though, since strings don't have any encoding information. The only way to properly deal with all this is at application level, since only the programmer knows which string will actually form the start of a file or a larger text string. What I could do, is add a UTF-8 codec which prepends a BOM mark and removes it from the stream during decode. The programmer would have to do use this codec in case she wants to prepend UTF-8 files with a BOM then. I'm still unsure whether I should change the UTF-16 decoder to only remove the BOM at the start of the stream -- the above case where BOMs are inserted due to string concatenation is very common (each .write() to a file will produce such a BOM mark). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Sat May 19 13:08:08 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 19 May 2001 14:08:08 +0200 Subject: [I18n-sig] Transparent Encoding References: <3B056DC2.D143F641@ActiveState.com> <3B058080.7CEFF26C@lemburg.com> <3B0588C5.97E5E2E8@ActiveState.com> <200105182105.f4IL59I02050@mira.informatik.hu-berlin.de> <3B059BD5.CE05BBF3@ActiveState.com> Message-ID: <3B066228.96022493@lemburg.com> Paul Prescod wrote: > > "Martin v. Loewis" wrote: > > > >... > > An EncodedFile is not suitable since it has byte strings on both ends, > > and Unicode strings only inside. > > EncodedFile seems to work as I ask if I pass it the encoding name as > "unicode-internal". Furthermore, code that does that is much simpler > than code that looks up the codec manually. I'm not a big fan of those > codec tuples. > > Current: > > writer = codecs.lookup("utf-8")[3] > stream = writer(fileobj) > > Proposed: > > codecs.EncodedFile(fileobj, None, "utf-8") > > As I understand it, you can almost always go without looking up the > encoder tuple thanks to the .encode method. And you can almost always go > without looking up the decoder, thanks to the .decode method. This > EncodedFile convention would allow most common cases of wrapping Unicode > to avoid looking up the tuple also. Paul, I still don't understand what you really want to achieve. Do you want a file-like object which writes utf-8 and can take Unicode as input for write (as well as strings which are then handled in the usual ASCII way) and returns Unicode for .read() ? The encoding 'unicode-internal' is really only meant for low-level access to how we chose to represent Unicode at C level. This could well change in some future version (note that Unicode is still evolving and probably will continue to do so for some time; e.g. Unicode 3.1 is just out the door and adds another 50k character points, using the non-BMP space for the first time...). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From guido@digicool.com Sat May 19 16:35:18 2001 From: guido@digicool.com (Guido van Rossum) Date: Sat, 19 May 2001 11:35:18 -0400 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: Your message of "Sat, 19 May 2001 12:16:55 +0200." 
<3B064817.95FB5F5E@lemburg.com> References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> <3B064817.95FB5F5E@lemburg.com> Message-ID: <200105191535.LAA01629@cj20424-a.reston1.va.home.com> > The problem with BOMs is that they are supposed to appear at > the start of a string. Taken out of context, this strikes me as nonsense. Strings in memory (Python Unicode strings anyway) have absolutely no need for a byte order mark since they are always in the right (native) byte order. It is *files* that are supposed to have a BOM at the start. I think the difference is worth noting: I don't mind if apps that read files have to deal with the BOM (including, of course, using the proper byte order to read the rest of the file). But it is absurd to expect code dealing with *strings* to handle BOMs. --Guido van Rossum (home page: http://www.python.org/~guido/) From martin@loewis.home.cs.tu-berlin.de Sat May 19 07:59:10 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sat, 19 May 2001 08:59:10 +0200 Subject: [I18n-sig] Transparent Encoding In-Reply-To: <3B059BD5.CE05BBF3@ActiveState.com> (message from Paul Prescod on Fri, 18 May 2001 15:01:57 -0700) References: <3B056DC2.D143F641@ActiveState.com> <3B058080.7CEFF26C@lemburg.com> <3B0588C5.97E5E2E8@ActiveState.com> <200105182105.f4IL59I02050@mira.informatik.hu-berlin.de> <3B059BD5.CE05BBF3@ActiveState.com> Message-ID: <200105190659.f4J6xAs01264@mira.informatik.hu-berlin.de> > > An EncodedFile is not suitable since it has byte strings on both ends, > > and Unicode strings only inside. > > EncodedFile seems to work as I ask if I pass it the encoding name as > "unicode-internal". What do you mean, "seems to work". The encoding "unicode-internal" still produces byte strings, e.g. >>> s=u"Hallo" >>> s.encode("unicode-internal") 'H\x00a\x00l\x00l\x00o\x00' >>> s u'Hallo' A unicode-internal encoded byte string is *not* the same thing as a Unicode string. > Furthermore, code that does that is much simpler > than code that looks up the codec manually. I'm not a big fan of those > codec tuples. > > Current: > > writer = codecs.lookup("utf-8")[3] > stream = writer(fileobj) > > Proposed: > > codecs.EncodedFile(fileobj, None, "utf-8") -0. Regards, Martin From tdickenson@geminidataloggers.com Mon May 21 11:06:46 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Mon, 21 May 2001 11:06:46 +0100 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <200105191535.LAA01629@cj20424-a.reston1.va.home.com> References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> <3B064817.95FB5F5E@lemburg.com> <200105191535.LAA01629@cj20424-a.reston1.va.home.com> Message-ID: On Sat, 19 May 2001 11:35:18 -0400, Guido van Rossum wrote: >> The problem with BOMs is that they are supposed to appear at >> the start of a string. > >Taken out of context, this strikes me as nonsense. Strings in memory >(Python Unicode strings anyway) have absolutely no need for a byte >order mark since they are always in the right (native) byte order. Thats true for Unicode strings. However, a python plain string containing an encoded Unicode string (in *any* character encoding) is no different to a file here - its just a block-o-bytes. >it is absurd to >expect code dealing with *strings* to handle BOMs. 
I agree with that, and is a good reason why the codecs should always remove them. "M.-A. Lemburg" wrote: >I'm still unsure whether I should change the UTF-16 decoder >to only remove the BOM at the start of the stream -- the above >case where BOMs are inserted due to string concatenation >is very common (each .write() to a file will produce such >a BOM mark). Toby Dickenson tdickenson@geminidataloggers.com From walter@livinglogic.de Mon May 21 12:08:34 2001 From: walter@livinglogic.de (Walter Doerwald) Date: Mon, 21 May 2001 13:08:34 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> <3B064817.95FB5F5E@lemburg.com> <200105191535.LAA01629@cj20424-a.reston1.va.home.com> Message-ID: <200105211308340281.00448C08@mail.livinglogic.de> On 21.05.01 at 11:06 Toby Dickenson wrote: > [...] > >it is absurd to > >expect code dealing with *strings* to handle BOMs. > > I agree with that, and is a good reason why the codecs should always > remove them. ??? This is a good reason why the codec should pass the \ufeff through, because a \ufeff in a unicode object should not be considered to be a BOM but a ZWNBSP (it might e.g. be used to give hints to a hyphenation or ligature algorithm.) > "M.-A. Lemburg" wrote: > > >I'm still unsure whether I should change the UTF-16 decoder > >to only remove the BOM at the start of the stream -- the above > >case where BOMs are inserted due to string concatenation > >is very common (each .write() to a file will produce such > >a BOM mark). Then the write function has an error. A BOM should only be written at the start of the file and not on every call to write(). The Unicode FAQ (http://www.unicode.org/unicode/faq/utf_bom.html#24) states: Q: I am using a protocol that has BOM at the start of text. How do I represent an initial ZWNBSP? A: Use the sequence FEFF FEFF But with the current decoder implementation *both* \ufeffs will be removed, so the ZWNBSP disappears. Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From mal@lemburg.com Mon May 21 12:45:46 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 21 May 2001 13:45:46 +0200 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> <3B064817.95FB5F5E@lemburg.com> <200105191535.LAA01629@cj20424-a.reston1.va.home.com> <200105211308340281.00448C08@mail.livinglogic.de> Message-ID: <3B08FFEA.7130A83F@lemburg.com> Walter Doerwald wrote: > > On 21.05.01 at 11:06 Toby Dickenson wrote: > > > [...] > > >it is absurd to > > >expect code dealing with *strings* to handle BOMs. > > > > I agree with that, and is a good reason why the codecs should always > > remove them. > > ??? This is a good reason why the codec should pass the \ufeff > through, because a \ufeff in a unicode object should not be > considered to be a BOM but a ZWNBSP (it might e.g. be used to > give hints to a hyphenation or ligature algorithm.) True. > > "M.-A. Lemburg" wrote: > > > > >I'm still unsure whether I should change the UTF-16 decoder > > >to only remove the BOM at the start of the stream -- the above > > >case where BOMs are inserted due to string concatenation > > >is very common (each .write() to a file will produce such > > >a BOM mark). > > Then the write function has an error. 
A BOM should only be > written at the start of the file and not on every call to > write(). That's hard to implement... how would the codec know where the stream starts -- it only interfaces to the underyling stream using .read() and .write() ? > The Unicode FAQ (http://www.unicode.org/unicode/faq/utf_bom.html#24) > states: > Q: I am using a protocol that has BOM at the start of text. > How do I represent an initial ZWNBSP? > > A: Use the sequence FEFF FEFF > > But with the current decoder implementation *both* \ufeffs > will be removed, so the ZWNBSP disappears. Note that this only happens in the UTF-16 codec. All other codecs pass through the BOMs as-is. Perhaps I should modify the UTF-16 codec to only remove BOMs when used in UTF-16 mode (without byte order indication) and not in UTF-16-LE/UTF-16-BE mode ?! ... and then only at the start of a string. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Mon May 21 15:40:56 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Mon, 21 May 2001 16:40:56 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: (message from Toby Dickenson on Mon, 21 May 2001 11:06:46 +0100) References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> <3B064817.95FB5F5E@lemburg.com> <200105191535.LAA01629@cj20424-a.reston1.va.home.com> Message-ID: <200105211440.f4LEeuB01307@mira.informatik.hu-berlin.de> > Thats true for Unicode strings. > > However, a python plain string containing an encoded Unicode string > (in *any* character encoding) is no different to a file here - its > just a block-o-bytes. The problem with that approach is that writing to a UTF-16-encoded file (as obtained by codecs.open(filename, "w", encoding="utf-16")) will put the BOM in front of every chunk of data as passed to .write(). That is an error, IMO, the stream writer should only put the BOM into the beginning of the entire file. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Mon May 21 15:44:20 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Mon, 21 May 2001 16:44:20 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <200105211308340281.00448C08@mail.livinglogic.de> (walter@livinglogic.de) References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> <3B064817.95FB5F5E@lemburg.com> <200105191535.LAA01629@cj20424-a.reston1.va.home.com> <200105211308340281.00448C08@mail.livinglogic.de> Message-ID: <200105211444.f4LEiKQ01309@mira.informatik.hu-berlin.de> > > >it is absurd to > > >expect code dealing with *strings* to handle BOMs. > > > > I agree with that, and is a good reason why the codecs should always > > remove them. > > ??? This is a good reason why the codec should pass the \ufeff > through, because a \ufeff in a unicode object should not be > considered to be a BOM but a ZWNBSP (it might e.g. be used to > give hints to a hyphenation or ligature algorithm.) I agree. The decoder should *never* remove the BOM in the middle of a string. > Then the write function has an error. A BOM should only be > written at the start of the file and not on every call to > write(). I agree. 
Fixing that should not be too difficult; the codec instance just needs to change its .encode and .decode attributes after the first write. This raises the question what: f = open("/tmp/foo","w",encoding="utf-16") f.close() should give: an empty file, or a file containing just the BOM? Regards, Martin From martin@loewis.home.cs.tu-berlin.de Mon May 21 15:50:41 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Mon, 21 May 2001 16:50:41 +0200 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <3B08FFEA.7130A83F@lemburg.com> (mal@lemburg.com) References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> <3B064817.95FB5F5E@lemburg.com> <200105191535.LAA01629@cj20424-a.reston1.va.home.com> <200105211308340281.00448C08@mail.livinglogic.de> <3B08FFEA.7130A83F@lemburg.com> Message-ID: <200105211450.f4LEofx01332@mira.informatik.hu-berlin.de> > That's hard to implement... how would the codec know where the > stream starts -- it only interfaces to the underyling stream > using .read() and .write() ? The stream readers and writers should assume that the first read and write operation use the ZWNBSP as the BOM, so they should stop giving a byte-order meaning to the BOM once they have seen the first chunk of data. That is best implemented by replacing the .encode function with utf_16_be/le_encode (as appropriate). > Note that this only happens in the UTF-16 codec. All other codecs > pass through the BOMs as-is. Perhaps I should modify the UTF-16 > codec to only remove BOMs when used in UTF-16 mode (without byte > order indication) and not in UTF-16-LE/UTF-16-BE mode ?! You may want to study the RFC just to be sure, but I think this is how UTF-16-[BL]E are defined. Regards, Martin From mal@lemburg.com Mon May 21 18:02:35 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 21 May 2001 19:02:35 +0200 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> <3B064817.95FB5F5E@lemburg.com> <200105191535.LAA01629@cj20424-a.reston1.va.home.com> <200105211308340281.00448C08@mail.livinglogic.de> <3B08FFEA.7130A83F@lemburg.com> <200105211450.f4LEofx01332@mira.informatik.hu-berlin.de> Message-ID: <3B094A2B.D7192F4C@lemburg.com> "Martin v. Loewis" wrote: > > > That's hard to implement... how would the codec know where the > > stream starts -- it only interfaces to the underyling stream > > using .read() and .write() ? > > The stream readers and writers should assume that the first read and > write operation use the ZWNBSP as the BOM, so they should stop giving > a byte-order meaning to the BOM once they have seen the first chunk of > data. That is best implemented by replacing the .encode function with > utf_16_be/le_encode (as appropriate). Patches are welcome :-) > > Note that this only happens in the UTF-16 codec. All other codecs > > pass through the BOMs as-is. Perhaps I should modify the UTF-16 > > codec to only remove BOMs when used in UTF-16 mode (without byte > > order indication) and not in UTF-16-LE/UTF-16-BE mode ?! > > You may want to study the RFC just to be sure, but I think this is how > UTF-16-[BL]E are defined. According to the Unicode FAQ, BOM marks should only be used where the byte order is not immediatly clear. In the case -LE and -BE, this information is available, which is why the codecs don't prepend a BOM mark. 
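A rough sketch of the approach Martin suggests (swap the writer's encode function after the first chunk), assuming the codecs.StreamWriter interface where write() calls self.encode(object, self.errors); this is purely illustrative, not the actual patch:

import codecs, sys

class UTF16BOMOnceWriter(codecs.StreamWriter):
    # The first write goes through utf_16_encode, which emits a BOM in
    # native byte order; afterwards switch to the endian-specific encoder
    # so that no further BOMs are written.
    def __init__(self, stream, errors='strict'):
        codecs.StreamWriter.__init__(self, stream, errors)
        self.encode = codecs.utf_16_encode

    def write(self, object):
        codecs.StreamWriter.write(self, object)
        if sys.byteorder == 'little':
            self.encode = codecs.utf_16_le_encode
        else:
            self.encode = codecs.utf_16_be_encode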
Ok, I will modify the UTF-16-LE and -BE decoders to not remove BOMs anymore and fix the UTF-16 decoder to only remove BOMs at the start of the string. With these changes you should be able to fix the UTF-16 stream codec to be more RFC compliant. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From guido@digicool.com Mon May 21 17:55:21 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 21 May 2001 12:55:21 -0400 Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: Your message of "Mon, 21 May 2001 13:45:46 +0200." <3B08FFEA.7130A83F@lemburg.com> References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> <3B064817.95FB5F5E@lemburg.com> <200105191535.LAA01629@cj20424-a.reston1.va.home.com> <200105211308340281.00448C08@mail.livinglogic.de> <3B08FFEA.7130A83F@lemburg.com> Message-ID: <200105211657.f4LGtcs20688@odiug.digicool.com> > > Then the write function has an error. A BOM should only be > > written at the start of the file and not on every call to > > write(). > > That's hard to implement... how would the codec know where the > stream starts -- it only interfaces to the underyling stream > using .read() and .write() ? To me this looks like it should be an application issue. The application should write an explicit BOM at the start of each file it writes. The codecs shouldn't do anything with BOMs -- just pass them through. I'm pretty sure that's what the intention of BOMs in the Unicode standard was, because it's the only reasonable approach -- if it isn't, I'd like to see chapter and verse quoted. ;-) --Guido van Rossum (home page: http://www.python.org/~guido/) From barry@wooz.org Mon May 21 20:49:33 2001 From: barry@wooz.org (Barry A. Warsaw) Date: Mon, 21 May 2001 15:49:33 -0400 Subject: [I18n-sig] pygettext.py extraction of docstrings References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> Message-ID: <15113.29005.357449.812516@anthem.wooz.org> A very long time ago I wrote: >> I have a tentative patch for Tools/i18n/pygettext.py which adds >> optional extraction of module, class, method, and function >> docstrings. >> One question: should docstring extraction be turned on my >> default? >>>>> And "MvL" == Martin v Loewis >>>>> responded: MvL> I'd say so, yes. People who are confronted with gettext for MvL> the first time will say "Wow, it even does that!". In the MvL> rare cases where doc strings would confuse the meat of the MvL> catalog, people will be able to turn that off. Perhaps it MvL> may be good to indicate in the catalog that this is a doc MvL> string? I'm thinking of MvL> #, py-doc MvL> I don't know the exact specification of the #, comments, but MvL> it can look like MvL> #, c-format, fuzzy MvL> i.e. it appears to be a comma-separated list of informative MvL> flags. Translators could then decide to deal with doc strings MvL> in a different manner (e.g follow different grammatical MvL> conventions). Nearest I can tell, according to http://www.gnu.org/manual/gettext/html_chapter/gettext_2.html#SEC9 I think the correct thing to do is to mark docstring extractions with #. docstring comments. 
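For reference, a catalog entry marked this way would look roughly like the following (the msgid and source reference are only an example):

#. docstring
#: Mailman/Archiver/Archiver.py:142
msgid "The mbox name where messages are left for archive construction."
msgstr ""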
I'm going to check in a patch to do this now, although for backwards compatibility I think I will still leave docstring extraction disabled by default (enabled it with -D / --docstrings). -Barry From mal@lemburg.com Tue May 22 09:57:43 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 22 May 2001 10:57:43 +0200 Subject: [I18n-sig] UTF-8 and BOM References: <3B02B9BB.E1F6AE39@ActiveState.com> <200105162107.f4GL7nU01574@mira.informatik.hu-berlin.de> <3B02F0B1.8863FDB1@lemburg.com> <877kze74gg.fsf@deneb.enyo.de> <3B064817.95FB5F5E@lemburg.com> <200105191535.LAA01629@cj20424-a.reston1.va.home.com> <200105211308340281.00448C08@mail.livinglogic.de> <3B08FFEA.7130A83F@lemburg.com> <200105211450.f4LEofx01332@mira.informatik.hu-berlin.de> <3B094A2B.D7192F4C@lemburg.com> Message-ID: <3B0A2A07.85EFB484@lemburg.com> "M.-A. Lemburg" wrote: > > "Martin v. Loewis" wrote: > > > > > That's hard to implement... how would the codec know where the > > > stream starts -- it only interfaces to the underyling stream > > > using .read() and .write() ? > > > > The stream readers and writers should assume that the first read and > > write operation use the ZWNBSP as the BOM, so they should stop giving > > a byte-order meaning to the BOM once they have seen the first chunk of > > data. That is best implemented by replacing the .encode function with > > utf_16_be/le_encode (as appropriate). Patches are welcome :-) > > > Note that this only happens in the UTF-16 codec. All other codecs > > > pass through the BOMs as-is. Perhaps I should modify the UTF-16 > > > codec to only remove BOMs when used in UTF-16 mode (without byte > > > order indication) and not in UTF-16-LE/UTF-16-BE mode ?! > > > > You may want to study the RFC just to be sure, but I think this is how > > UTF-16-[BL]E are defined. > > According to the Unicode FAQ, BOM marks should only be used > where the byte order is not immediatly clear. In the case -LE and > -BE, this information is available, which is why the codecs > don't prepend a BOM mark. > > Ok, I will modify the UTF-16-LE and -BE decoders to not remove > BOMs anymore and fix the UTF-16 decoder to only remove BOMs at > the start of the string. With these changes you should be able > to fix the UTF-16 stream codec to be more RFC compliant. Done. See the CVS versions of Misc/NEWS and Include/unicodeobject.h for details. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tkg@menteith.com Wed May 23 04:17:58 2001 From: tkg@menteith.com (Tony Graham) Date: Tue, 22 May 2001 23:17:58 -0400 (EST) Subject: [I18n-sig] UTF-8 and BOM In-Reply-To: <118422204@toto.iv> Message-ID: <15115.11238.318000.934980@menteith.com> At 21 May 2001 12:55 -0400, Guido van Rossum wrote: > I'm pretty sure that's what the intention of BOMs in the Unicode > standard was, because it's the only reasonable approach -- if it > isn't, I'd like to see chapter and verse quoted. ;-) See Section 5.6 in http://www.unicode.org/unicode/uni2book/ch13.pdf. I could also quote a chapter from "Unicode: A Primer," but it doesn't have any verses. Regards, Tony Graham. 
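A minimal sketch of the application-level approach Guido suggests, using a byte-order-specific codec so that the codec itself never inserts a BOM (the file name is arbitrary):

import codecs

f = codecs.open('report.txt', 'w', encoding='utf-16-le')
f.write(u'\ufeff')            # explicit BOM, written once by the application
f.write(u'first chunk, ')
f.write(u'second chunk\n')    # no BOM here
f.close()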
From walter@livinglogic.de Wed May 23 10:35:32 2001 From: walter@livinglogic.de (Walter Doerwald) Date: Wed, 23 May 2001 11:35:32 +0200 Subject: [I18n-sig] Re: [XML-SIG] XML and Unicode In-Reply-To: <20010522193314.E22396@mnot.net> References: <20010522150638.C22396@mnot.net> <3B0AEA6A.9CCD2A1F@lemburg.com> <20010522193314.E22396@mnot.net> Message-ID: <200105231135320031.00663C0E@mail.livinglogic.de> On 22.05.01 at 19:33 Mark Nottingham wrote: > OK, so I'm not getting something then. The attached test script (and > data file) is the problem pared down - if u'string' is a neutral > encoding, and .encode('utf-8') generates a utf-8 encoded string of > that encoding, then the utf-8.html output file should display > correctly; however, it doesn't, while the latin-1 output does > (because the input is latin-1). >>> open("ISO-8859-1.xml","rb").read() '\r\nNet 21 \x96 The Survivors\r\n\r\n' The character contained in your test XML file seems to be \x96, which is a control character in Unicode, but in Windows it's used as an endash. If you want a "real" endash you should use the Unicode ndash U+2013: "Net 21 – The Survivors". But then encoding the output with latin-1 will no longer work. > [...] BTW, you might want to try several variants for the name of the output encoding, because although Python's encode method recognises the name, your web browser might not. Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From keichwa@gmx.net Thu May 24 21:02:47 2001 From: keichwa@gmx.net (Karl Eichwalder) Date: 24 May 2001 22:02:47 +0200 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <15113.29005.357449.812516@anthem.wooz.org> References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> Message-ID: barry@wooz.org (Barry A. Warsaw) writes: > >>>>> And "MvL" == Martin v Loewis > >>>>> responded: > MvL> #, py-doc > > MvL> I don't know the exact specification of the #, comments, but > MvL> it can look like > > MvL> #, c-format, fuzzy > > MvL> i.e. it appears to be a comma-separated list of informative > MvL> flags. Translators could then decide to deal with doc strings > MvL> in a different manner (e.g follow different grammatical > MvL> conventions). > I think the correct thing to do is to mark docstring extractions with > > #. docstring > > comments. No, #. is reserved for literally extracted comments; #, is for meta-comments. Martin's proposal sounds better. -- work : ke@suse.de | ,__o : http://www.suse.de/~ke/ | _-\_<, home : keichwa@gmx.net | (*)/'(*) From barry@wooz.org Fri May 25 00:15:50 2001 From: barry@wooz.org (Barry A. Warsaw) Date: Thu, 24 May 2001 19:15:50 -0400 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> Message-ID: <15117.38438.361043.255768@anthem.wooz.org> >>>>> "KE" == Karl Eichwalder writes: >> I think the correct thing to do is to mark docstring >> extractions with #. docstring comments. KE> No, #. is reserved for literally extracted comments; #, is for KE> meta-comments. Martin's proposal sounds better. You probably know better than me, but, is that opinion based on more information than is available in the GNU gettext manual?
http://www.gnu.org/manual/gettext/html_node/gettext_9.html#SEC9 seems to imply to me that #, comments define only two flags (i.e. "fuzzy" and "c-format" / "no-c-format") and it doesn't say that the flags are extensible or user definable. Then again, it doesn't say that #. comments are reserved. It basically just says that #-whitespace comments are reserved for the translators. I'm happy to switch it, but I'd really like to have a reference I can point to to short-circuit any further discussion. Even a mailing list archive url would be fine. Thanks, -Barry From keichwa@gmx.net Fri May 25 06:11:57 2001 From: keichwa@gmx.net (Karl Eichwalder) Date: 25 May 2001 07:11:57 +0200 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <15117.38438.361043.255768@anthem.wooz.org> References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> Message-ID: barry@wooz.org (Barry A. Warsaw) writes: > You probably know better than me, but, is that opinion based on more > information than is available in the GNU gettext manual? This is another piece of info you'll find within the gettext manual: -=-=-=-=-=-=-=-=-=-=-=-=-=- cut here -=-=-=-=-=-=-=-=-=-=-=-=-=- Therefore the `xgettext' adds a special tag to those messages it thinks might be a format string. There is no absolute rule for this, only a heuristic. In the `.po' file the entry is marked using the `c-format' flag in the `#,' comment line (*note PO Files::). The careful reader now might say that this again can cause problems. The heuristic might guess it wrong. This is true and therefore `xgettext' knows about special kind of comment which lets the programmer take over the decision. If in the same line or the immediately preceding line of the `gettext' keyword the `xgettext' program find a comment containing the words `xgettext:c-format' it will mark the string in any case with the `c-format' flag. This kind of comment should be used when `xgettext' does not recognize the string as a format string but is really is one and it should be tested. Please note that when the comment is in the same line of the `gettext' keyword, it must be before the string to be translated. -=-=-=-=-=-=-=-=-=-=-=-=-=- cut here -=-=-=-=-=-=-=-=-=-=-=-=-=- > http://www.gnu.org/manual/gettext/html_node/gettext_9.html#SEC9 > > seems to imply to me that #, comments define only two flags > (i.e. "fuzzy" and "c-format" / "no-c-format") and it doesn't say that > the flags are extensible or user definable. Then again, it doesn't > say that #. comments are reserved. It basically just says that > #-whitespace comments are reserved for the translators. You're right. The term AUTOMATIC-COMMENTS is not properly defined. Also FLAG leaves open some questions. > I'm happy to switch it, but I'd really like to have a reference I can > point to to short-circuit any further discussion. Even a mailing list > archive url would be fine. It's now Bruno Haible who maintains the gettext suite. There's a po-utils-forum mailing list at IRO.UMontreal.CA initiated by François (thanks); mostly for my own amusement ;) The mailing list is archived -- at the moment I don't know where. You can start browsing here: http://www.iro.umontreal.ca/~pinard/po-utils/HTML/ But right now "titan" (François' workstation?) does not want to talk to me.
Please, try again later. The other gettext forum is gnu.utils.bugs . Karl --=20 work : ke@suse.de | ,__o : http://www.suse.de/~ke/ | _-\_<, home : keichwa@gmx.net | (*)/'(*) From barry@wooz.org Fri May 25 15:20:58 2001 From: barry@wooz.org (Barry A. Warsaw) Date: Fri, 25 May 2001 10:20:58 -0400 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> Message-ID: <15118.27210.930905.339141@anthem.wooz.org> Ah cool. For those just coming in: the issue is that pygettext.py extracts Python docstrings if you give it the -D/--docstring flag. I want to mark such docstrings in the .pot file because translators may not want or need to translate every docstring. The documentation for .po file comments is a little sparse here. I agree that the logical place for such markings is in the #, comments, e.g.: #, docstring #: Mailman/Archiver/Archiver.py:142 msgid "The mbox name where messages are left for archive construction." msgstr "" But the po-file format documentation doesn't say that additional flags can be defined for #, comments. It seems to me a simple omission in the documentation, right? Is the intent of #, flags that the extraction tools can define additional, language-specific flags? -Barry From martin@loewis.home.cs.tu-berlin.de Fri May 25 21:12:42 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 25 May 2001 22:12:42 +0200 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <15118.27210.930905.339141@anthem.wooz.org> (barry@wooz.org) References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15118.27210.930905.339141@anthem.wooz.org> Message-ID: <200105252012.f4PKCg801160@mira.informatik.hu-berlin.de> > But the po-file format documentation doesn't say that additional flags > can be defined for #, comments. It seems to me a simple omission in > the documentation, right? Is the intent of #, flags that the > extraction tools can define additional, language-specific flags? I'd say that nobody has thought of that. Bruno is probably the person to give a definitive yay or nay here, but I'd hope that tools shouldn't go into flames if they see an extra flag. Atleast GNU msgmerge does not show any concern. Of course, it would be better if this possibility could be codified somewhere, and if gettext.texi could serve as the repository of well-known flags - even if they don't all have a meaning to GNU gettext. Adding such documentation is probably an issue of submitting patches against gettext.texi. Regards, Martin From tree@basistech.com Wed May 30 22:37:16 2001 From: tree@basistech.com (Tom Emerson) Date: Wed, 30 May 2001 17:37:16 -0400 Subject: [I18n-sig] Unicode normalization and collation implementation? Message-ID: <15125.26636.297182.646562@cymru.basistech.com> I need to use the Unicode collation algorithm from Python --- has anyone implemented this yet? I'd rather not do it, so if someone else has code, share the wealth. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Thu May 31 08:16:41 2001 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Thu, 31 May 2001 09:16:41 +0200 Subject: [I18n-sig] Unicode normalization and collation implementation? References: <15125.26636.297182.646562@cymru.basistech.com> Message-ID: <3B15EFD9.E6BAD2A9@lemburg.com> Tom Emerson wrote: > > I need to use the Unicode collation algorithm from Python --- has > anyone implemented this yet? I'd rather not do it, so if someone else > has code, share the wealth. No. It's been on the plate for some time now, though. Note that if your are going to start working in this direction, you should focus on normalization form C since this is probably the most often used (and practical) one: http://www.unicode.org/unicode/reports/tr15/ -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tree@basistech.com Thu May 31 15:37:20 2001 From: tree@basistech.com (Tom Emerson) Date: Thu, 31 May 2001 10:37:20 -0400 Subject: [I18n-sig] Unicode normalization and collation implementation? In-Reply-To: <3B15EFD9.E6BAD2A9@lemburg.com> References: <15125.26636.297182.646562@cymru.basistech.com> <3B15EFD9.E6BAD2A9@lemburg.com> Message-ID: <15126.22304.663571.552971@cymru.basistech.com> M.-A. Lemburg writes: > Note that if your are going to start working in this direction, > you should focus on normalization form C since this is probably > the most often used (and practical) one: No, I need form D for the collation algorithm, so this is what I'm doing first. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Thu May 31 15:51:25 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 31 May 2001 16:51:25 +0200 Subject: [I18n-sig] Unicode normalization and collation implementation? References: <15125.26636.297182.646562@cymru.basistech.com> <3B15EFD9.E6BAD2A9@lemburg.com> <15126.22304.663571.552971@cymru.basistech.com> Message-ID: <3B165A6C.31B390C5@lemburg.com> Tom Emerson wrote: > > M.-A. Lemburg writes: > > Note that if your are going to start working in this direction, > > you should focus on normalization form C since this is probably > > the most often used (and practical) one: > > No, I need form D for the collation algorithm, so this is what I'm > doing first. Does that mean you are going to start working in that direction ? (would be great !) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tree@basistech.com Thu May 31 15:56:04 2001 From: tree@basistech.com (Tom Emerson) Date: Thu, 31 May 2001 10:56:04 -0400 Subject: [I18n-sig] Unicode normalization and collation implementation? In-Reply-To: <3B165A6C.31B390C5@lemburg.com> References: <15125.26636.297182.646562@cymru.basistech.com> <3B15EFD9.E6BAD2A9@lemburg.com> <15126.22304.663571.552971@cymru.basistech.com> <3B165A6C.31B390C5@lemburg.com> Message-ID: <15126.23428.889431.364510@cymru.basistech.com> M.-A. Lemburg writes: > > No, I need form D for the collation algorithm, so this is what I'm > > doing first. > > Does that mean you are going to start working in that direction ? > (would be great !) Yes, as I said, I need the Unicode collation algorithm now, so I'll be working on normalization and collation over the next week or two. 
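To make forms C and D concrete -- note that unicodedata.normalize() did not exist at the time of this thread; it only appeared in later Python releases (2.3 and up), so it is shown here purely as an illustration of the two forms:

import unicodedata   # normalize() is available from Python 2.3 on

s = u'\u00e9'                         # LATIN SMALL LETTER E WITH ACUTE
d = unicodedata.normalize('NFD', s)   # form D: base letter + combining acute
c = unicodedata.normalize('NFC', d)   # form C: precomposed again

print repr(d)   # u'e\u0301'
print repr(c)   # u'\xe9'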
-tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Thu May 31 18:13:03 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 31 May 2001 19:13:03 +0200 Subject: [I18n-sig] XML and UTF-16 Message-ID: <3B167B9F.344D6992@lemburg.com> What is the standard file layout to use for storing an XML file in UTF-16 ? 1) encode the whole file in UTF-16 (possibly prepended with a BOM) or 2) write the first line containing the XML header (which has the encoding information) in ASCII and then proceed with UTF-16 starting after the newline character or 3) none of the above: you simply don't do this ;-) Thanks, -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tree@basistech.com Thu May 31 18:23:31 2001 From: tree@basistech.com (Tom Emerson) Date: Thu, 31 May 2001 13:23:31 -0400 Subject: [I18n-sig] XML and UTF-16 In-Reply-To: <3B167B9F.344D6992@lemburg.com> References: <3B167B9F.344D6992@lemburg.com> Message-ID: <15126.32275.110670.236066@cymru.basistech.com> M.-A. Lemburg writes: > What is the standard file layout to use for storing an XML file > in UTF-16 ? I thought this was covered in the XML specification as a non-normative appendix. Maybe not. > 1) encode the whole file in UTF-16 (possibly prepended with a BOM) Yes. You can then pretty easily autodetect the which Unicode transformation format is being used by looking at the first ten or so bytes. If the BOM is present, that's a big clue right there. UTF-16-BE will have the first "<?xml" encoded as 003C 003F 0078 006D 006E, while UTF-16-LE will have it encoded as 3C00 3F00 7800 6D00 6E00, and ASCII and UTF-8 will just have 3C 3F 78 6D 6E. > 2) write the first line containing the XML header (which has the > encoding information) in ASCII and then proceed with UTF-16 > starting after the newline character Ugh, no. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Thu May 31 18:39:17 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 31 May 2001 19:39:17 +0200 Subject: [I18n-sig] XML and UTF-16 References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> Message-ID: <3B1681C5.71FD484D@lemburg.com> Tom Emerson wrote: > > M.-A. Lemburg writes: > > What is the standard file layout to use for storing an XML file > > in UTF-16 ? > > I thought this was covered in the XML specification as a non-normative > appendix. Maybe not. I was too lazy to look it up :-) > > 1) encode the whole file in UTF-16 (possibly prepended with a BOM) > > Yes. You can then pretty easily autodetect the which Unicode > transformation format is being used by looking at the first ten or > so bytes. > > If the BOM is present, that's a big clue right there. > > UTF-16-BE will have the first "<?xml" encoded as > 003C 003F 0078 006D 006E > > while UTF-16-LE will have it encoded as > > 3C00 3F00 7800 6D00 6E00 > > ASCII and UTF-8 will just have > > 3C 3F 78 6D 6E Perhaps we should have some smart auto-detection API somewhere which does this automagically ?! Something like guess_xml_encoding(data) -> encoding string It could work by looking at the first 256 bytes of the data string and then apply all the tricks needed to extract the encoding information (or default to UTF-8 if no such information is given).
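A rough sketch of such a function -- the name and signature are only the suggestion made above, and this version deliberately ignores UTF-32 and the more unusual byte orders:

import re

def guess_xml_encoding(data):
    # 1) Byte order marks are the strongest clue.
    if data[:2] == '\xff\xfe':
        return 'utf-16-le'
    if data[:2] == '\xfe\xff':
        return 'utf-16-be'
    if data[:3] == '\xef\xbb\xbf':
        return 'utf-8'
    # 2) No BOM: see how '<?' comes out in the two UTF-16 byte orders.
    if data[:4] == '\x00<\x00?':
        return 'utf-16-be'
    if data[:4] == '<\x00?\x00':
        return 'utf-16-le'
    # 3) ASCII-compatible start: trust the encoding pseudo-attribute, if any.
    m = re.match(r"<\?xml[^>]*encoding=['\"]([A-Za-z0-9._-]+)", data[:256])
    if m:
        return m.group(1)
    # 4) Nothing recognisable: the XML recommendation says it must be UTF-8.
    return 'utf-8'

Feeding it the first few hundred bytes of the document, e.g. open(name, 'rb').read(256), is enough.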
> > 2) write the first line containing the XML header (which has the > > encoding information) in ASCII and then proceed with UTF-16 > > starting after the newline character > > Ugh, no. Thought so :-) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tree@basistech.com Thu May 31 18:52:11 2001 From: tree@basistech.com (Tom Emerson) Date: Thu, 31 May 2001 13:52:11 -0400 Subject: [I18n-sig] XML and UTF-16 In-Reply-To: <3B1681C5.71FD484D@lemburg.com> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> Message-ID: <15126.33995.327715.84261@cymru.basistech.com> M.-A. Lemburg writes: > Perhaps we should have some smart auto-detection API somewhere > which does this automagically ?! Something like > > guess_xml_encoding(data) -> encoding string > > It could work by looking at the first 256 bytes of the data > string and then apply all the tricks needed to extract the > encoding information (or default to UTF-8 if no such information > is given). Yes, I think this would be a good idea. I would use something along the lines of: 0) Assume UTF-8. 1) Look for the UTF-16 and UTF-32 uniBOMs. If you find one, assume the appropriate transmission format and endian nature. Goto 4. 2) Look for the UTF-8 uniBOM, since some editors like putting that in. Ignore it and goto 4. 3) Look for the sundry forms of '<?xml' with appropriate endian variants. If found, assume the detected encoding. Goto 4. -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From paulp@ActiveState.com Thu May 31 22:17:18 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 31 May 2001 14:17:18 -0700 Subject: [I18n-sig] XML and UTF-16 References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> Message-ID: <3B16B4DE.B0E8ADD4@ActiveState.com> Tom Emerson wrote: > >... > > Yes. You can then pretty easily autodetect the which Unicode > transformation format is being used by looking at the first ten or > so bytes. Actually, the first four bytes are sufficient to get you started. Then you have to look at the encoding declaration if present. > If the BOM is present, that's a big clue right there. """Entities encoded in UTF-16 must begin with the Byte Order Mark described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.""" -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From paulp@ActiveState.com Thu May 31 22:21:24 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 31 May 2001 14:21:24 -0700 Subject: [I18n-sig] XML and UTF-16 References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> Message-ID: <3B16B5D4.730D8E30@ActiveState.com> "M.-A. Lemburg" wrote: > >... > > Perhaps we should have some smart auto-detection API somewhere > which does this automagically ?!
Something like > > guess_xml_encoding(data) -> encoding string > > It could work by looking at the first 256 bytes of the data > string and then apply all the tricks needed to extract the > encoding information (or default to UTF-8 if no such information > is given). This might help: http://aspn.activestate.com/ASPN/Python/Cookbook/Recipe/52257 I think Lars has a version too... -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From tree@basistech.com Thu May 31 22:23:00 2001 From: tree@basistech.com (Tom Emerson) Date: Thu, 31 May 2001 17:23:00 -0400 Subject: [I18n-sig] XML and UTF-16 In-Reply-To: <3B16B4DE.B0E8ADD4@ActiveState.com> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B16B4DE.B0E8ADD4@ActiveState.com> Message-ID: <15126.46644.277960.763113@cymru.basistech.com> Paul Prescod writes: > Tom Emerson wrote: > > Yes. You can then pretty easily autodetect the which Unicode > > transformation format is being used by looking at the first ten or > > so bytes. > > > > Actually, the first four bytes are sufficient to get you started. Then > you have to look at the encoding declaration if present. Even for UTF-32? -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From martin@loewis.home.cs.tu-berlin.de Thu May 31 21:28:31 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 31 May 2001 22:28:31 +0200 Subject: [I18n-sig] XML and UTF-16 In-Reply-To: <15126.32275.110670.236066@cymru.basistech.com> (message from Tom Emerson on Thu, 31 May 2001 13:23:31 -0400) References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> Message-ID: <200105312028.f4VKSVe02837@mira.informatik.hu-berlin.de> > M.-A. Lemburg writes: > > What is the standard file layout to use for storing an XML file > > in UTF-16 ? > > I thought this was covered in the XML specification as a non-normative > appendix. Maybe not. Indeed it is. In addition to the procedure you outline, they also anticipate that a higher-level protocol (such as HTTP) may identify a content type. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu May 31 21:46:31 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 31 May 2001 22:46:31 +0200 Subject: [I18n-sig] XML and UTF-16 In-Reply-To: <15126.33995.327715.84261@cymru.basistech.com> (message from Tom Emerson on Thu, 31 May 2001 13:52:11 -0400) References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> Message-ID: <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> > Yes, I think this would be a good idea. I would use something along > the lines of: Please have a look at xml.parsers.xmlproc.EntityParser.autodetect_encoding. This almost follows the procedure in the XML recommendation, except that it does not expect "unusual" byte orders (2134, 3412), and that it does not detect EBCDIC. > 0) Assume UTF-8. > > 1) Look for the UTF-16 and UTF-32 uniBOMs. If you find one, assume the > appropriate transmission format and endian nature. Goto 4. > > 2) Look for the UTF-8 uniBOM, since some editors like putting that in. > Ignore it and goto 4. I see this was added to the XML recommendation only in the second edition, so I should also add it to xmlproc. > 3) Look for the sundry forms of '<?xml' with appropriate endian variants.
If found, assume the detected > encoding. Goto 4. Please note that ASCII is not detectable this way: If you see ' References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> Message-ID: <15126.46901.610405.498190@cymru.basistech.com> Martin v. Loewis writes: > Please note that ASCII is not detectable this way: If you see ' then you don't know anything about the encoding except that you should > be able to parse the encoding= attribute successfully if present. Yes, of course --- I wasn't sufficiently explicit. If you see " <15126.32275.110670.236066@cymru.basistech.com> <3B16B4DE.B0E8ADD4@ActiveState.com> <15126.46644.277960.763113@cymru.basistech.com> Message-ID: <3B16B8E6.D1E083@ActiveState.com> Tom Emerson wrote: > > Paul Prescod writes: > > Tom Emerson wrote: > > > Yes. You can then pretty easily autodetect the which Unicode > > > transformation format is being used by looking at the first ten or > > > so bytes. > > > > Actually, the first four bytes are sufficient to get you started. Then > > you have to look at the encoding declaration if present. > > Even for UTF-32? I think so. UTF-32 is a 32-bit encoding and 32 bits are 4 bytes. You only need one character (either a BOM or a "<") sign to know what you are dealing with. You were right that it is an appendix to the spec: http://www.w3.org/TR/REC-xml.html#sec-guessing -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From tree@basistech.com Thu May 31 22:35:30 2001 From: tree@basistech.com (Tom Emerson) Date: Thu, 31 May 2001 17:35:30 -0400 Subject: [I18n-sig] XML and UTF-16 In-Reply-To: <3B16B8E6.D1E083@ActiveState.com> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B16B4DE.B0E8ADD4@ActiveState.com> <15126.46644.277960.763113@cymru.basistech.com> <3B16B8E6.D1E083@ActiveState.com> Message-ID: <15126.47394.654300.731399@cymru.basistech.com> Paul Prescod writes: > I think so. UTF-32 is a 32-bit encoding and 32 bits are 4 bytes. You > only need one character (either a BOM or a "<") sign to know what you > are dealing with. Well, you know that the first UTF-32 character is "<", but no more. I'd at least look for " <15126.32275.110670.236066@cymru.basistech.com> <3B16B4DE.B0E8ADD4@ActiveState.com> <15126.46644.277960.763113@cymru.basistech.com> <3B16B8E6.D1E083@ActiveState.com> <15126.47394.654300.731399@cymru.basistech.com> Message-ID: <3B16BB45.1A15560D@ActiveState.com> Tom Emerson wrote: > > Paul Prescod writes: > > I think so. UTF-32 is a 32-bit encoding and 32 bits are 4 bytes. You > > only need one character (either a BOM or a "<") sign to know what you > > are dealing with. > > Well, you know that the first UTF-32 character is "<", but no > more. I'd at least look for " also overly paranoid. You could be looking at " such. Would it matter if you were looking at References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B16B4DE.B0E8ADD4@ActiveState.com> <15126.46644.277960.763113@cymru.basistech.com> <3B16B8E6.D1E083@ActiveState.com> <15126.47394.654300.731399@cymru.basistech.com> <3B16BB45.1A15560D@ActiveState.com> Message-ID: <15126.47990.808992.298339@cymru.basistech.com> Paul Prescod writes: > Would it matter if you were looking at document without an XML declaration would be in error. 
The declaration > is required for everything other than UTF-8 and UTF-16. I guess my point is that it is better to be overly conservative up front and look for at least two complete characters (in whatever encoding) before attempting to process the document. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From martin@loewis.home.cs.tu-berlin.de Thu May 31 23:12:11 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 1 Jun 2001 00:12:11 +0200 Subject: [I18n-sig] XML and UTF-16 In-Reply-To: <15126.47394.654300.731399@cymru.basistech.com> (message from Tom Emerson on Thu, 31 May 2001 17:35:30 -0400) References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B16B4DE.B0E8ADD4@ActiveState.com> <15126.46644.277960.763113@cymru.basistech.com> <3B16B8E6.D1E083@ActiveState.com> <15126.47394.654300.731399@cymru.basistech.com> Message-ID: <200105312212.f4VMCBl04236@mira.informatik.hu-berlin.de> > Well, you know that the first UTF-32 character is "<", but no > more. According to the procedure specified in the XML recommendation, this is enough for auto-detection, so you clearly don't need to look at more bytes when parsing XML. In any case, what would you do if you find out that the next few bytes cannot be interpreted as ?xml in UTF-32? You would probably signal an error. So would you if the document is not well-formed XML if treated as UTF-32 after looking at the first few bytes. Regards, Martin
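To make the four-byte argument concrete, the following is essentially the table from the recommendation's non-normative detection appendix, minus the unusual byte orders and EBCDIC; the helper name is made up, and the returned strings are just labels (Python had no UTF-32 codec at the time):

# First four bytes of a document that begins with "<?xml ...":
#
#   00 00 00 3C   UCS-4 / UTF-32, big-endian
#   3C 00 00 00   UCS-4 / UTF-32, little-endian
#   00 3C 00 3F   UTF-16, big-endian, no BOM
#   3C 00 3F 00   UTF-16, little-endian, no BOM
#   3C 3F 78 6D   UTF-8, ASCII and other ASCII-compatible encodings

PREFIXES = {
    '\x00\x00\x00<': 'utf-32-be',
    '<\x00\x00\x00': 'utf-32-le',
    '\x00<\x00?':    'utf-16-be',
    '<\x00?\x00':    'utf-16-le',
    '<?xm':          'utf-8',
}

def tentative_encoding(first_four_bytes):
    # Good enough to start reading up to the encoding declaration.
    return PREFIXES.get(first_four_bytes, 'utf-8')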