From gp@pooryorick.com  Wed Jan 15 22:48:49 2003
From: gp@pooryorick.com (Poor Yorick)
Date: Wed, 15 Jan 2003 15:48:49 -0700
Subject: [I18n-sig] codecs module, readlines and xreadlines
Message-ID: <3E25E551.4010202@pooryorick.com>

The following code shows an inconsistency between open.readlines and
codecs.open.readlines, and also between open.xreadlines and
codecs.open.xreadlines.  The call to open.readlines returns '\n' as the
line terminator, whereas codecs.open.readlines returns '\r\n'.  Any plans
to fix this?

  >>> import codecs
  >>> fh = open('test2.txt', 'r')
  >>> lines = fh.readlines()
  >>> print lines
['1120, "Serial Number", 1016993947\n', '1122, "msconfig.exe",
1016994129\n', '1123, "Microsoft Windows XP", 1016994141\n', '1124,
"Version", 1016994143\n', '1125, "XP", 1016994156\n', '1126, "Microsoft
Windows", 1016994169\n', '1127, "Component", 1016994468']

  >>> fh = codecs.open('test1.txt', 'r', 'utf-16')
  >>> lines = fh.readlines()
  >>> print lines
[u'1120, "Serial Number", 1016993947\r\n', u'1122, "msconfig.exe",
1016994129\r\n', u'1123, "Microsoft Windows XP", 1016994141\r\n',
u'1124, "Version", 1016994143\r\n', u'1125, "XP", 1016994156\r\n',
u'1126, "Microsoft Windows", 1016994169\r\n', u'1127, "Component",
1016994468']

  >>> fh = open('test2.txt', 'r')
  >>> lines = fh.xreadlines()
  >>> lines.next()
'1120, "Serial Number", 1016993947\n'
  >>> lines.next()
'1122, "msconfig.exe", 1016994129\n'

  >>> fh = codecs.open('test1.txt', 'r', 'utf-16')
  >>> lines = fh.xreadlines()
  >>> lines.next()
'\xff\xfe1\x001\x002\x000\x00,\x00
\x00"\x00S\x00e\x00r\x00i\x00a\x00l\x00
\x00N\x00u\x00m\x00b\x00e\x00r\x00"\x00,\x00
\x001\x000\x001\x006\x009\x009\x003\x009\x004\x007\x00\r\x00\n'
  >>> lines.next()
'\x001\x001\x002\x002\x00,\x00
\x00"\x00m\x00s\x00c\x00o\x00n\x00f\x00i\x00g\x00.\x00e\x00x\x00e\x00"\x00,\x00 
\x001\x000\x001\x006\x009\x009\x004\x001\x002\x009\x00\r\x00\n'
  >>>

Poor Yorick
gp@pooryorick.com


From martin@v.loewis.de  Wed Jan 15 23:06:15 2003
From: martin@v.loewis.de (Martin v. Löwis)
Date: 16 Jan 2003 00:06:15 +0100
Subject: [I18n-sig] codecs module, readlines and xreadlines
In-Reply-To: <3E25E551.4010202@pooryorick.com>
References: <3E25E551.4010202@pooryorick.com>
Message-ID: <m3znq29fo8.fsf@mira.informatik.hu-berlin.de>

Poor Yorick <gp@pooryorick.com> writes:

> The following code shows an inconsistency between open.readlines and
> codecs.open.readlines, and also between open.xreadlines and
> codecs.open.xreadlines.  The call to open.readlines returns '\n' as the
> line terminator, whereas codecs.open.readlines returns '\r\n'.  Any plans
> to fix this?

Not without a bug report, or better yet, an actual patch. I think it
would be best if codecs supported the "universal newlines" feature of
Python 2.3.

Regards,
Martin


From mal@lemburg.com  Thu Jan 16 09:15:37 2003
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 16 Jan 2003 10:15:37 +0100
Subject: [I18n-sig] codecs module, readlines and xreadlines
In-Reply-To: <3E25E551.4010202@pooryorick.com>
References: <3E25E551.4010202@pooryorick.com>
Message-ID: <3E267839.50301@lemburg.com>

Poor Yorick wrote:
> The following code shows an inconsistency between open.readlines and
> codecs.open.readlines, and also between open.xreadlines and
> codecs.open.xreadlines.  The call to open.readlines returns '\n' as the
> line terminator, whereas codecs.open.readlines returns '\r\n'.  Any plans
> to fix this?

On Windows, the 'r' mode opens the file in text mode, which mangles the
line-end information. You should try to open the file in 'rb' (binary) mode
for comparison.

codecs.open() automatically appends the 'b' to the 'r' for you,
so this is probably the cause of the problem.
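
A minimal sketch of this effect, assuming a current Python 3 (the thread
itself concerns Python 2.x). The file name and sample text are made up for
illustration, and codecs.getreader() over a binary stream stands in for
codecs.open(), which likewise reads the underlying file in binary mode:

```python
import codecs
import os
import tempfile

# Hypothetical sample data with Windows-style line ends.
text = '1120, "Serial Number"\r\n1122, "msconfig.exe"\r\n'

path = os.path.join(tempfile.mkdtemp(), "test1.txt")
with open(path, "wb") as fh:
    fh.write(text.encode("utf-16"))

# Codec layer on top of a binary stream: line ends pass through untouched.
with open(path, "rb") as raw:
    codec_lines = codecs.getreader("utf-16")(raw).readlines()

# Built-in text mode (universal newlines by default in Python 3).
with open(path, "r", encoding="utf-16") as fh:
    builtin_lines = fh.readlines()

print(codec_lines)
print(builtin_lines)
```

The codec-based reader keeps '\r\n' on every line, while the built-in text
mode hands back '\n'.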

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/


From martin@v.loewis.de  Thu Jan 16 10:09:05 2003
From: martin@v.loewis.de (Martin v. Löwis)
Date: 16 Jan 2003 11:09:05 +0100
Subject: [I18n-sig] codecs module, readlines and xreadlines
In-Reply-To: <3E267839.50301@lemburg.com>
References: <3E25E551.4010202@pooryorick.com> <3E267839.50301@lemburg.com>
Message-ID: <m3hec98kzi.fsf@mira.informatik.hu-berlin.de>

"M.-A. Lemburg" <mal@lemburg.com> writes:

> On Windows, the 'r' opens the file in text which mangles the line-end
> information. You should try to open the file in 'rb' (binary) mode
> for comparison.

The issue is, of course, that codecs.open is usually meant for text
data, so comparing 'r' to 'r' is fair, IMO.

> codecs.open() automatically appends the 'b' to the 'r' for you,
> so this is probably the cause of the problem.

That is an implementation detail which shouldn't be visible to the
user. I understand that it is necessary to open the underlying stream
in binary mode, but then the higher layers should hide that fact.

Regards,
Martin


From gp@pooryorick.com  Thu Jan 16 15:59:48 2003
From: gp@pooryorick.com (Poor Yorick)
Date: Thu, 16 Jan 2003 08:59:48 -0700
Subject: [I18n-sig] codecs module, readlines and xreadlines
References: <3E25E551.4010202@pooryorick.com> <3E267839.50301@lemburg.com> <m3hec98kzi.fsf@mira.informatik.hu-berlin.de>
Message-ID: <3E26D6F4.8090802@pooryorick.com>


Martin v. Löwis wrote:

>"M.-A. Lemburg" <mal@lemburg.com> writes:
>
>>On Windows, the 'r' opens the file in text which mangles the line-end
>>information. You should try to open the file in 'rb' (binary) mode
>>for comparison.
>>
>
>The issue is, of course, that codecs.open is usually meant for text
>data, so comparing 'r' to 'r' is fair, IMO.
>
>>codecs.open() automatically appends the 'b' to the 'r' for you,
>>so this is probably the cause of the problem.
>>
>
Whether the file is opened in binary mode or in text mode, the '\r' 
character is still there.  It isn't mangled, it's just that in the 
utf-16 encoding all characters are encoded as double-byte characters, 
and \r\n becomes \x00\r\x00\n.
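
A small sketch of the byte layout involved, assuming Python 3 and made-up
sample text. Which side the zero byte falls on depends on the byte order: a
file that starts with the BOM \xff\xfe is little-endian, so "\r\n" is stored
there as \r\x00\n\x00. The sketch also shows why splitting the undecoded
bytes at "\n" yields chunks that begin with a stray \x00:

```python
# How "\r\n" looks in the two UTF-16 byte orders.
crlf_le = "\r\n".encode("utf-16-le")   # bytes 0d 00 0a 00
crlf_be = "\r\n".encode("utf-16-be")   # bytes 00 0d 00 0a

# Hypothetical two-line sample, little-endian as in a file starting \xff\xfe.
raw = "ab\r\ncd\r\n".encode("utf-16-le")

# Splitting the undecoded bytes at the single byte 0x0a cuts each
# two-byte "\n" in half, so later chunks start with a stray \x00.
chunks = raw.split(b"\n")

# Decoding first and splitting afterwards keeps the code units intact.
lines = raw.decode("utf-16-le").splitlines(True)

print(crlf_le, crlf_be)
print(chunks)
print(lines)
```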

The thing is that I AM processing text data.  It just happens to be
Unicode text data.  The example I used turns into perfectly legible
Chinese characters once it's decoded in Python.  I think that people
using the codecs module on Windows to read Unicode text files would
expect codecs.open.readlines to behave exactly like the builtin
open.readlines.

open.readlines automatically removes the "\r" character on Windows 
systems when the file is opened and read in text mode, and inserts a \r 
character when a \n is written to a file, so to be consistent, 
codecs.open.readlines should do the same thing and remove \x00\r when 
the file is opened in text mode.

Poor Yorick
gp@pooryorick.com


From martin@v.loewis.de  Thu Jan 16 16:08:28 2003
From: martin@v.loewis.de (Martin v. Löwis)
Date: 16 Jan 2003 17:08:28 +0100
Subject: [I18n-sig] codecs module, readlines and xreadlines
In-Reply-To: <3E26D6F4.8090802@pooryorick.com>
References: <3E25E551.4010202@pooryorick.com> <3E267839.50301@lemburg.com>
 <m3hec98kzi.fsf@mira.informatik.hu-berlin.de>
 <3E26D6F4.8090802@pooryorick.com>
Message-ID: <m3vg0p13ib.fsf@mira.informatik.hu-berlin.de>

Poor Yorick <gp@pooryorick.com> writes:

> The thing is that I AM processing text data.  It just happens to be
> unicode text data.  The example I used turns into perfectly legible
> chinese characters once it's decoded in Python.  I think that people
> using the codecs module on Windows to read Unicode text files would
> expect codecs.open.readlines to behave exactly like the builtin
> open.readlines.  

Would you like to work on a patch to fix this problem?

> open.readlines automatically removes the "\r" character on Windows
> systems when the file is opened and read in text mode, and inserts a
> \r character when a \n is written to a file, so to be consistent,
> codecs.open.readlines should do the same thing and remove \x00\r
> when the file is opened in text mode.

It is not Python code which does that, though: instead, the Microsoft
C library does the removal/insertion of \r. For Unicode, this is
useless, since we cannot open the file in text mode: The C library
would *still* remove \r (only), leaving us with an extra null byte.

Notice that a similar problem exists on the Mac, where \r should be
replaced by \n.

Regards,
Martin


From mal@lemburg.com  Thu Jan 16 16:14:38 2003
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 16 Jan 2003 17:14:38 +0100
Subject: [I18n-sig] codecs module, readlines and xreadlines
In-Reply-To: <3E26D6F4.8090802@pooryorick.com>
References: <3E25E551.4010202@pooryorick.com> <3E267839.50301@lemburg.com>	<m3hec98kzi.fsf@mira.informatik.hu-berlin.de> <3E26D6F4.8090802@pooryorick.com>
Message-ID: <3E26DA6E.9000306@lemburg.com>

Poor Yorick wrote:
> Martin v. Löwis wrote:
>
>> "M.-A. Lemburg" <mal@lemburg.com> writes:
>>
>>> On Windows, the 'r' opens the file in text which mangles the line-end
>>> information. You should try to open the file in 'rb' (binary) mode
>>> for comparison.
>>>
>>
>> The issue is, of course, that codecs.open is usually meant for text
>> data, so comparing 'r' to 'r' is fair, IMO.
>>
>>> codecs.open() automatically appends the 'b' to the 'r' for you,
>>> so this is probably the cause of the problem.
>>>
>>
> Whether the file is opened in binary mode or in text mode, the '\r'
> character is still there.  It isn't mangled, it's just that in the
> utf-16 encoding all characters are encoded as double-byte characters,
> and \r\n becomes \x00\r\x00\n.
>
> The thing is that I AM processing text data.  It just happens to be
> unicode text data.  The example I used turns into perfectly legible
> chinese characters once it's decoded in Python.  I think that people
> using the codecs module on Windows to read Unicode text files would
> expect codecs.open.readlines to behave exactly like the builtin
> open.readlines.
>
> open.readlines automatically removes the "\r" character on Windows
> systems when the file is opened and read in text mode, and inserts a \r
> character when a \n is written to a file,

That's what I meant by mangling. I don't see any code
in fileobject.c which would do the above, so unless I've
overlooked something, the MS C lib must apply this
operation.

> so to be consistent,
> codecs.open.readlines should do the same thing and remove \x00\r when
> the file is opened in text mode.

But only on Windows, right? (On Unix, text mode and binary mode
behave identically.)

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/


From gp@pooryorick.com  Thu Jan 16 16:32:34 2003
From: gp@pooryorick.com (Poor Yorick)
Date: Thu, 16 Jan 2003 09:32:34 -0700
Subject: [I18n-sig] codecs module, readlines and xreadlines
References: <3E25E551.4010202@pooryorick.com> <3E267839.50301@lemburg.com>	<m3hec98kzi.fsf@mira.informatik.hu-berlin.de>	<3E26D6F4.8090802@pooryorick.com> <m3vg0p13ib.fsf@mira.informatik.hu-berlin.de>
Message-ID: <3E26DEA2.3040902@pooryorick.com>


Martin v. Löwis wrote:

>Poor Yorick <gp@pooryorick.com> writes:
>
>>The thing is that I AM processing text data.  It just happens to be
>>unicode text data.  The example I used turns into perfectly legible
>>chinese characters once it's decoded in Python.  I think that people
>>using the codecs module on Windows to read Unicode text files would
>>expect codecs.open.readlines to behave exactly like the builtin
>>open.readlines.  
>>
>
>Would you like to work on a patch to fix this problem?
>

Alas, I only wish I had the skills that would require!  Perhaps someday.
I just wanted to point the issue out.

I could, however, probably help by creating a tutorial about the subject 
for others like me, which I will try to do.

Thank you,

Poor Yorick,
gp@pooryorick.com


From Scott.Daniels@Acm.Org  Thu Jan 16 21:45:52 2003
From: Scott.Daniels@Acm.Org (Scott David Daniels)
Date: Thu, 16 Jan 2003 13:45:52 -0800
Subject: [I18n-sig] codecs module, readlines and xreadlines
In-Reply-To: <3E26DA6E.9000306@lemburg.com>
References: <3E25E551.4010202@pooryorick.com> <3E267839.50301@lemburg.com>	<m3hec98kzi.fsf@mira.informatik.hu-berlin.de> <3E26D6F4.8090802@pooryorick.com> <3E26DA6E.9000306@lemburg.com>
Message-ID: <3E272810.9050600@dsl-only.net>


M.-A. Lemburg wrote:
> Poor Yorick wrote:
>> so to be consistent, codecs.open.readlines should do the same thing 
>> and remove \x00\r when the file is opened in text mode.
> But only on Windows, right ? (On Unix text mode and binary mode
> behave identically)

Actually, on Apple's systems, lines are delimited with \r, removing
the \n.  As painful as it is for me to acknowledge this, Microsoft
is actually the most standards-compliant of the three major
interpretations. C (and hence Unix) considered that it was redundant
to have two distinct characters indicating end-of-line.

The Unix choice was the only irreversible character of the pair
(the line feed).  For a while, MIT had a non-standard control
character that they called the "line-starve" which reversed the
effect of the line feed.  On the old Teletype Model 33s, though,
the line feed was irreversible, while the carriage return was
simple horizontal positioning (and equivalent to the appropriate
number of backspaces).

Apple, I suspect, was thinking of the analogue to the keyboard.
Very few typists ever type the line feed character; they type
a return which emits the \r character.  Unix solves this by
conversion if the terminal is not in "raw" mode; Apple doesn't
have to make a distinction.

The least reasonable (but most standard-conforming) choice is \r\n,
which (if you interpret the early ASCII standards literally)
should be interpreted the same as \n\r.  It is also uncomfortably
true that \r\n\n should be exactly equivalent to \r\n\r\n.  So, a
lot of code is simplified if there is a single EOL (End-Of-Line)
character.  C declared this so, and anyone who does not use LF (\n)
as a line delimiter in the environment where their C runtimes work
is supposed to translate their local convention to the C standard
in the I/O runtimes.

To summarize briefly, after being hopelessly long-winded, Apple
non-raw should probably convert \r to \n, Microsoft non-raw
should similarly convert \r\n to \n.  What should be done in
non-binary mode for the other line terminators in Unicode (I
_think_ some exist) might be a source of hopelessly long-winded
debate.
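
The conversions summarized above can be sketched as follows, assuming
Python 3 and text that has already been decoded. normalize_newlines is a
hypothetical helper name, not an existing library function. The last lines
show that str.splitlines() does recognize further Unicode line boundaries
such as NEL (U+0085), LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR
(U+2029):

```python
def normalize_newlines(text):
    # Replace the two-character "\r\n" first so it does not become "\n\n",
    # then map any remaining lone "\r" (old Mac convention) to "\n".
    return text.replace("\r\n", "\n").replace("\r", "\n")


normalized = normalize_newlines("dos\r\nmac\runix\n")
print(repr(normalized))

# str.splitlines() treats these Unicode characters as line boundaries too.
parts = "a\u0085b\u2028c\u2029d".splitlines()
print(parts)
```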


From guido@python.org  Thu Jan 16 21:57:09 2003
From: guido@python.org (Guido van Rossum)
Date: Thu, 16 Jan 2003 16:57:09 -0500
Subject: [I18n-sig] codecs module, readlines and xreadlines
In-Reply-To: Your message of "Thu, 16 Jan 2003 13:45:52 PST."
 <3E272810.9050600@dsl-only.net>
References: <3E25E551.4010202@pooryorick.com> <3E267839.50301@lemburg.com> <m3hec98kzi.fsf@mira.informatik.hu-berlin.de> <3E26D6F4.8090802@pooryorick.com> <3E26DA6E.9000306@lemburg.com>
 <3E272810.9050600@dsl-only.net>
Message-ID: <200301162157.h0GLv9Z14045@odiug.zope.com>

> To summarize briefly, after being hopelessly long-winded, Apple
> non-raw should probably convert \r to \n, Microsoft non-raw
> should similarly convert \r\n to \n.  What should be done in
> non-binary mode for the other line terminators in UniCode (I
> _think_ some exist) might be a source of hopelessly long-winded
> debate.

That's exactly what Universal newlines does.  Have I missed something?

--Guido van Rossum (home page: http://www.python.org/~guido/)


From martin@v.loewis.de  Thu Jan 16 23:52:09 2003
From: martin@v.loewis.de (Martin v. Löwis)
Date: 17 Jan 2003 00:52:09 +0100
Subject: [I18n-sig] codecs module, readlines and xreadlines
In-Reply-To: <200301162157.h0GLv9Z14045@odiug.zope.com>
References: <3E25E551.4010202@pooryorick.com> <3E267839.50301@lemburg.com>
 <m3hec98kzi.fsf@mira.informatik.hu-berlin.de>
 <3E26D6F4.8090802@pooryorick.com> <3E26DA6E.9000306@lemburg.com>
 <3E272810.9050600@dsl-only.net>
 <200301162157.h0GLv9Z14045@odiug.zope.com>
Message-ID: <m3n0m0fyae.fsf@mira.informatik.hu-berlin.de>

Guido van Rossum <guido@python.org> writes:

> That's exactly what Universal newlines does.  Have I missed something?

The issue is whether codecs.open should follow the platform
conventions for text mode if neither "b" nor "U" is passed. Builtin
open currently does, codecs.open does not (instead, it treats a plain
"r" just as if "rb" had been passed).

Furthermore, the *implementation* of universal newlines is useless for
codecs.open, as the newline conversion must happen after decoding, not
before.
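
A small sketch of why the order matters, assuming Python 3. The sample
character is chosen for illustration: U+0D0A (MALAYALAM LETTER UU) is stored
in big-endian UTF-16 as the bytes 0d 0a, which a byte-level filter cannot
tell apart from a CR LF pair:

```python
# One letter followed by a real "\r\n" line end.
text = "\u0d0a\r\n"
raw = text.encode("utf-16-be")          # bytes 0d 0a 00 0d 00 0a

# Translating before decoding hits the letter, not the real line end.
mangled = raw.replace(b"\r\n", b"\n")   # bytes 0a 00 0d 00 0a

try:
    mangled.decode("utf-16-be")
    byte_level_ok = True
except UnicodeDecodeError:
    # An odd number of bytes is left over, so decoding fails.
    byte_level_ok = False

# Translating after decoding leaves the letter alone.
translated = raw.decode("utf-16-be").replace("\r\n", "\n")

print(byte_level_ok)
print(repr(translated))
```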

Regards,
Martin


From mal@lemburg.com  Fri Jan 17 09:04:57 2003
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 17 Jan 2003 10:04:57 +0100
Subject: [I18n-sig] codecs module, readlines and xreadlines
In-Reply-To: <m3n0m0fyae.fsf@mira.informatik.hu-berlin.de>
References: <3E25E551.4010202@pooryorick.com> <3E267839.50301@lemburg.com>	<m3hec98kzi.fsf@mira.informatik.hu-berlin.de>	<3E26D6F4.8090802@pooryorick.com> <3E26DA6E.9000306@lemburg.com>	<3E272810.9050600@dsl-only.net>	<200301162157.h0GLv9Z14045@odiug.zope.com> <m3n0m0fyae.fsf@mira.informatik.hu-berlin.de>
Message-ID: <3E27C739.9050802@lemburg.com>

Martin v. Löwis wrote:
> Guido van Rossum <guido@python.org> writes:
>
>>That's exactly what Universal newlines does.  Have I missed something?
>
> The issue is whether codecs.open should follow the platform
> conventions for text mode if neither "b" nor "U" is passed. Builtin
> open currently does, codecs.open does not (instead, it treats a plain
> "r" just as if "rb" had been passed).

I'd say: let the codecs decide what to do here. After all, codecs.open()
only provides an interface to the codecs and leaves all the processing to
them. If a codec thinks that line ends should all be converted to '\n'
then so be it. That's also why codecs.open() appends a 'b' to the
mode in case it is not already there: otherwise opening files in e.g.
UTF-16 on Windows would lose big.

I think that the codecs.open() kind of treatment is more reliable
than the open() one for text files. Simply because you always know
what will happen and can then apply whatever conversion needs
to be done in the program.

> Furthermore, the *implementation* of universal newlines is useless for
> codecs.open, as the newline conversion must happen after decoding, not
> before.

Right.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/


From martin@v.loewis.de  Fri Jan 17 10:15:31 2003
From: martin@v.loewis.de (Martin v. Löwis)
Date: 17 Jan 2003 11:15:31 +0100
Subject: [I18n-sig] codecs module, readlines and xreadlines
In-Reply-To: <3E27C739.9050802@lemburg.com>
References: <3E25E551.4010202@pooryorick.com> <3E267839.50301@lemburg.com>
 <m3hec98kzi.fsf@mira.informatik.hu-berlin.de>
 <3E26D6F4.8090802@pooryorick.com> <3E26DA6E.9000306@lemburg.com>
 <3E272810.9050600@dsl-only.net>
 <200301162157.h0GLv9Z14045@odiug.zope.com>
 <m3n0m0fyae.fsf@mira.informatik.hu-berlin.de>
 <3E27C739.9050802@lemburg.com>
Message-ID: <m3el7crsjg.fsf@mira.informatik.hu-berlin.de>

"M.-A. Lemburg" <mal@lemburg.com> writes:

> I'd say: let the codecs decide what to do here. 

Certainly. Unfortunately, this is not possible at the moment, since it
is already codecs.open which uses binary mode, and the codec has no
way of knowing what the original opening mode was.

> After all, codecs.open() only provide an interface to the codecs and
> leaves all the processing to them. If a codec thinks that line ends
> should all be converted to '\n' then so be it. That's also why
> codecs.open() appends an 'b' to the mode in case it is not already
> there: otherwise opening files in e.g.  UTF-16 on Windows would lose
> big.

Again: I do think that it is correct to open the underlying stream in
binary. The question is whether the codec should perform newline
translation (in addition to decoding, and probably after it).

> I think that the codecs.open() kind of treatment is more reliable
> than the open() one for text files. Simply because you always know
> what will happen [...]

This is not really true. The OP complains that you *cannot* know
how line ends will be represented. For the builtin open, you know that
a line end will always be \n in text mode, even more so in universal
mode. As it is, the representation of a line end in the Unicode data
is platform dependent, which is bad for portability.
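
One possible shape for such a translation layer, as a sketch only and
assuming Python 3: the underlying stream stays binary, the codec decodes it,
and only then are "\r\n" and "\r" mapped to "\n". translated_lines is a
hypothetical helper, not part of the codecs module, and the sample data is
made up. Note that the codecs stream reader splits lines with
str.splitlines(), so it also breaks at other Unicode line boundaries, which
this sketch leaves untranslated:

```python
import codecs
import io


def translated_lines(binary_stream, encoding):
    # Hypothetical helper: decode first, then normalize the line end.
    reader = codecs.getreader(encoding)(binary_stream)
    for line in reader:
        if line.endswith("\r\n"):
            line = line[:-2] + "\n"
        elif line.endswith("\r"):
            line = line[:-1] + "\n"
        yield line


# In-memory stand-in for a UTF-16 file with mixed line ends.
raw = io.BytesIO("one\r\ntwo\rthree\nfour".encode("utf-16"))
lines = list(translated_lines(raw, "utf-16"))
print(lines)
```

With this, the line end seen by the program is "\n" regardless of which
platform produced the file.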

Regards,
Martin


From Tex Texin <tex@XenCraft.com>  Fri Jan 17 12:41:22 2003
From: Tex Texin <tex@XenCraft.com> (Tex Texin)
Date: Fri, 17 Jan 2003 07:41:22 -0500
Subject: [I18n-sig] Register now for early bird rates for the 23rd Unicode conference,
 Prague
Message-ID: <3E27F9F2.2A4DC1BE@i18nguy.com>

It is time - time to register for the 23rd Unicode conference
in Prague!  Please see the details below, and check out the program which has
been updated.

**************************************************************************
     Twenty-third Internationalization and Unicode Conference (IUC23)
      Unicode, Internationalization, the Web: The Global Connection
                    http://www.unicode.org/iuc/iuc23
                           March 24-26, 2003
                        Prague, Czech Republic
*************************************************************************
Register now! > Just 10 weeks to go > Register now! > Just 10 weeks to go
*************************************************************************

NEWS

 > Visit the Conference Web site ( http://www.unicode.org/iuc/iuc23 )
   to check the updated Conference program and register.  To help you
   choose Conference sessions, we've included abstracts of talks and
   speakers' biographies.

 > Hotel guest room group rate valid to March 1.

 > Early bird registration rate valid to March 1.

 > Find out about the Workshop on Managing Localization Projects, organised
   by XenCraft, and taking place in the same venue on 27 March -- See:
   http://www.unicode.org/iuc/iuc23


CONFERENCE SPONSORS

   Agfa Monotype Corporation
   Basis Technology Corporation
   Microsoft Corporation
   Sun Microsystems, Inc.
   World Wide Web Consortium (W3C)

GLOBAL COMPUTING SHOWCASE

   Visit the Showcase to find out more about products supporting the
   Unicode Standard, and products and services that can help you
   globalize/localize your software, documentation and Internet content.

   For the first time, we will have an Exhibitors' track as part of the
   Conference.  For more information, please visit the Web site at:

   http://www.unicode.org/iuc/iuc23/showcase.html

CONFERENCE VENUE

The Conference will take place at:

     Marriott Prague Hotel
     V Celnici 8
     Prague, 110 00
     Czech Republic

     Tel:  (+420 2) 2288 8888
     Fax:  (+420 2) 2288 8889

CONFERENCE MANAGEMENT

   Global Meeting Services Inc.
   8949 Lombard Place, #416
   San Diego, CA 92122, USA

   Tel: +1 858 638 0206 (voice)
        +1 858 638 0504 (fax)

   Email: info@global-conference.com
      or: conference@unicode.org

THE UNICODE CONSORTIUM

The Unicode Consortium was founded as a non-profit organization in 1991.
It is dedicated to the development, maintenance and promotion of The
Unicode Standard, a worldwide character encoding. The Unicode Standard
encodes the characters of the world's principal scripts and languages,
and is code-for-code identical to the international standard ISO/IEC
10646. In addition to cooperating with ISO on the future development of
ISO/IEC 10646, the Consortium is responsible for providing character
properties and algorithms for use in implementations. Today the
membership base of the Unicode Consortium includes major computer
corporations, software producers, database vendors, research
institutions, international agencies and various user groups.

For further information on the Unicode Standard, visit the Unicode Web
site at http://www.unicode.org or e-mail <info@unicode.org>

                           *  *  *  *  *

Unicode(r) and the Unicode logo are registered trademarks of Unicode,
Inc. Used with permission.