From victor.stinner at haypocalc.com  Thu Aug  9 02:41:08 2007
From: victor.stinner at haypocalc.com (Victor Stinner)
Date: Thu, 9 Aug 2007 02:41:08 +0200
Subject: [Email-SIG] fix email module for python 3000 (bytes/str)
Message-ID: <200708090241.08369.victor.stinner@haypocalc.com>

(This email was first sent to the python-3000 mailing list. Guido van Rossum 
suggested that I send it to email-sig, so that is what I am doing :-))

Hi,

I have started porting the email module to Python 3000, but I have trouble 
working out whether a function should return bytes or str (because I don't 
know the email module).

Header.encode() -> bytes?
Message.as_string() -> bytes?
decode_header() -> list of (bytes, str|None) or (str, str|None)?
base64MIME.encode() -> bytes?

message_from_string() <- bytes?

Message.get_payload() -> bytes or str?

The type of a charset name is str, right?

---------------

Things to change to get bytes:
 - replace StringIO with BytesIO
 - add the 'b' prefix, e.g. '' becomes b''
 - replace "%s=%s" % (x, y) with b''.join((x, b'=', y))
   => is this the best way to concatenate bytes?

Problems (porting Python 2.x code to 3000):
 - When obj.lower() is used, I expect obj to be str but it's bytes
 - obj.strip() doesn't work when obj is a bytes object: it requires an
   argument, but I don't know the right value! Maybe b'\n\r\v\t '?
 - iterating over a bytes object yields numbers, not bytes objects, e.g.
      for c in b"small text":
         if re.match("(\n|\r)", c): ...
   Is it possible to use a regex on bytes? re.compile(b"x") raises an
   exception
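
For reference, a minimal sketch of how the bytes operations asked about above behave in a released Python 3. This assumes a current interpreter rather than the 2007 py3k branch under discussion, and the sample data and variable names are purely illustrative.

```python
import re

data = b"  Small Text\r\n"

# strip() and lower() exist on bytes; with no argument, strip() removes
# ASCII whitespace.
cleaned = data.strip().lower()

# Iterating over bytes yields integers; slice to get one-byte bytes objects.
as_ints = list(b"ab")
as_bytes = [b"ab"[i:i + 1] for i in range(2)]

# A pattern compiled from bytes matches bytes (not str).
newline = re.compile(b"\r|\n")
has_newline = newline.search(data) is not None

# Two ways to build b"key=value" from bytes operands.
joined = b"=".join((b"key", b"value"))
formatted = b"%s=%s" % (b"key", b"value")  # bytes %-formatting needs 3.5+
```

Slicing rather than indexing is the usual way to keep working with bytes objects when walking through binary data one position at a time.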

-- 
Victor Stinner aka haypo
http://hachoir.org/

From victor.stinner at haypocalc.com  Sat Aug 11 01:49:10 2007
From: victor.stinner at haypocalc.com (Victor Stinner)
Date: Sat, 11 Aug 2007 01:49:10 +0200
Subject: [Email-SIG] fix email module for python 3000 (bytes/str)
In-Reply-To: <200708090241.08369.victor.stinner@haypocalc.com>
References: <200708090241.08369.victor.stinner@haypocalc.com>
Message-ID: <200708110149.10939.victor.stinner@haypocalc.com>

Hi,

On Thursday 09 August 2007 02:41:08 Victor Stinner wrote:
> I started to work on email module to port it for Python 3000, but I have
> trouble to understand if a function should returns bytes or str (because I
> don't know email module).

It's really hard to convert the email module to Python 3000 because it mixes 
byte strings and (unicode) character strings...

I wrote some notes about bytes/str to help people migrate Python 2.x code 
to Python 3000, or at least to explain the difference between the Python 
2.x "str" type and the Python 3000 "bytes" type:
   http://wiki.python.org/moin/BytesStr

About the email module, some deductions:
 test_email.py: openfile() must use 'rb' file mode for all tests
 base64MIME.decode() and base64MIME.encode() should accept bytes and str
 base64MIME.decode() result type is bytes
 base64MIME.encode() result type should be... bytes or str, no idea

Other decode() and encode() functions should use the same rules about types.

The Python modules binascii and base64 chose the bytes type for the encode
result.

Victor Stinner aka haypo
http://hachoir.org/

From barry at python.org  Sun Aug 12 16:50:05 2007
From: barry at python.org (Barry Warsaw)
Date: Sun, 12 Aug 2007 09:50:05 -0500
Subject: [Email-SIG] [Python-3000] fix email module for python 3000
	(bytes/str)
In-Reply-To: <200708110149.10939.victor.stinner@haypocalc.com>
References: <200708090241.08369.victor.stinner@haypocalc.com>
	<200708110149.10939.victor.stinner@haypocalc.com>
Message-ID: <8B640CF2-EB88-45A5-A85F-1267AF24749E@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Aug 10, 2007, at 6:49 PM, Victor Stinner wrote:

> It's really hard to convert email module to Python 3000 because it does mix
> byte strings and (unicode) character strings...

Indeed, but I'm making progress.

Just a very quick follow up now, with hopefully more detail soon.   
I'm cross posting this one on purpose because of a couple of more  
general py3k issues involved.

In r56957 I committed changes to sndhdr.py and imghdr.py so that they  
compare what they read out of the files against proper byte  
literals.  AFAICT, neither module has a unittest, and if you run them  
from the command line, you'll see that they're completely broken  
(without my fix).  The email package uses these to guess content type  
subparts for the MIMEAudio and MIMEImage subclasses.  I didn't add  
unittests, just some judicious 'b' prefixes, and a quick command line  
test seems to make the situation better.  This also makes a bunch of  
email unittests pass.

Another general Python thing that bit me was when an exception gets  
raised with a non-ascii message, e.g.

 >>> raise RuntimeError('oops')
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
RuntimeError: oops
 >>> raise RuntimeError('oo\xfcps')
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 >>>

Um, what?  (I'm using an XEmacs shell buffer on OS X, but you get
something similar in an iTerm and Terminal window.)  In the email
unittests, I was getting one unexpected exception that had a non-ascii
character in it, but this crashed the unittest harness because when it
tried to print the exception message out, you'd instead get an
exception in io.py and the test run would exit.  Okay, that all makes
sense, but IWBNI py3k could do better <wink>.

Fixing other simple issues (not checked in yet), I'm down to 20  
failures, 13 errors out of 247 tests.  I'm running  
test_email_renamed.py only because test_email.py will go away (we  
should remove the old module names and bump the email pkg version  
number too).

As for the other questions Victor raises, we definitely need to  
answer them, but that should be for another reply.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRr8eHXEjvBPtnXfVAQIrJgQAoWGaoN82/KFLggu0IIM0BSghIQppiFVv
9weB+Kq6oAcgN95XKGSCZmPwA8jHkeUAWRpm8gZn7k44N2fJuZw11Klajy0tzUPW
Y4b5y8jPVU85phOKinynmHb9suXroyb35ZgMSp+WipL4L5PkOMv/x9q59Rs6ldjZ
cQu3Sssai9I=
=QG9j
-----END PGP SIGNATURE-----

From janssen at parc.com  Sun Aug 12 19:09:26 2007
From: janssen at parc.com (Bill Janssen)
Date: Sun, 12 Aug 2007 10:09:26 PDT
Subject: [Email-SIG] [Python-3000] fix email module for python 3000
	(bytes/str)
In-Reply-To: <200708110149.10939.victor.stinner@haypocalc.com> 
References: <200708090241.08369.victor.stinner@haypocalc.com>
	<200708110149.10939.victor.stinner@haypocalc.com>
Message-ID: <07Aug12.100928pdt."57996"@synergy1.parc.xerox.com>

>  base64MIME.decode() and base64MIME.encode() should accept bytes and str
>  base64MIME.decode() result type is bytes
>  base64MIME.encode() result type should be... bytes or str, no idea
> 
> Other decode() and encode() functions should use same rules about types.

Victor,

Here's my take on this:

base64MIME.decode converts string to bytes
base64MIME.encode converts bytes to string

Pretty straightforward.

Bill

From victor.stinner at haypocalc.com  Mon Aug 13 02:26:03 2007
From: victor.stinner at haypocalc.com (Victor Stinner)
Date: Mon, 13 Aug 2007 02:26:03 +0200
Subject: [Email-SIG] [Python-3000] fix email module for python 3000
	(bytes/str)
In-Reply-To: <8B640CF2-EB88-45A5-A85F-1267AF24749E@python.org>
References: <200708090241.08369.victor.stinner@haypocalc.com>
	<200708110149.10939.victor.stinner@haypocalc.com>
	<8B640CF2-EB88-45A5-A85F-1267AF24749E@python.org>
Message-ID: <200708130226.03670.victor.stinner@haypocalc.com>

On Sunday 12 August 2007 16:50:05 Barry Warsaw wrote:
> In r56957 I committed changes to sndhdr.py and imghdr.py so that they
> compare what they read out of the files against proper byte
> literals.

So nobody read my patches? :-( See my emails "[Python-3000] Fix imghdr module 
for bytes" and "[Python-3000] Fix sndhdr module for bytes" from last 
Saturday. But well, my patches look similar.

Barry's patch is incomplete: test_voc() is wrong.

I attached a new patch:
 - fix "h[sbseek] == b'\1'" and "ratecode = ord(h[sbseek+4])" in test_voc()
 - avoid division by zero
 - use startswith method: replace h[:2] == b'BM' by h.startswith(b'BM')
 - use aifc.open() instead of old aifc.openfp()
 - use ord(b'P') instead of ord('P')
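
The indexing pitfall behind the first item, and the startswith() replacement, can be sketched as follows. This assumes a released Python 3; looks_like_bmp() and the sample header are hypothetical and are not taken from the attached patch.

```python
def looks_like_bmp(header: bytes) -> bool:
    """Hypothetical helper: True if the data starts with the BMP magic 'BM'."""
    return header.startswith(b"BM")


header = b"BM\x01\x00rest"

# Indexing bytes gives an int, so compare against an int, never against a
# bytes literal.
third_is_one = header[2] == 1
wrong_comparison = header[2] == b"\x01"   # always False: int vs bytes

# Slicing keeps the bytes type, so this comparison works as expected.
slice_is_one = header[2:3] == b"\x01"

# ord() accepts a one-byte bytes object as well as a one-character str.
code_of_p = ord(b"P")
```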

Victor Stinner aka haypo
http://hachoir.org/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: py3k-imgsnd-hdr.patch
Type: text/x-diff
Size: 5326 bytes
Desc: not available
Url : http://mail.python.org/pipermail/email-sig/attachments/20070813/081c76b4/attachment.bin 

From victor.stinner at haypocalc.com  Tue Aug 14 04:22:36 2007
From: victor.stinner at haypocalc.com (Victor Stinner)
Date: Tue, 14 Aug 2007 04:22:36 +0200
Subject: [Email-SIG] Questions about email bytes/str (python 3000)
Message-ID: <200708140422.36818.victor.stinner@haypocalc.com>

Hi,

After many tests, I'm unable to convert the email module to Python 3000. I'm 
also unable to decide on the best type for some content.


(1) Should email parts be stored as byte strings or character strings?

Related methods: Generator class, Message.get_payload(), Message.as_string().

Let's take an example: multipart (MIME) email with latin-1 and base64 (ascii) 
sections. Mix latin-1 and ascii => mix bytes. So the best type should be 
bytes.

=> bytes


(2) Parsing a file (raw string): use bytes or str for parsing?

The parser uses str methods like splitlines(), lower() and strip(). But 
it should be easy to rewrite or avoid these methods. I think that low-level 
parsing should be done on bytes. At the end, or when we know the charset, we 
can convert to str.

=> bytes


About base64, I agree with Bill Janssen:
 - base64MIME.decode converts string to bytes
 - base64MIME.encode converts bytes to string

But decode may accept bytes as input (as the base64 module does): use 
str(value, 'ascii', 'ignore') or str(value, 'ascii', 'strict').


I wrote 4 different (non-working) patches. So if you want to work on the 
email module for Python 3000, please contact me first. When I have a better 
patch, I will submit it.


Victor Stinner aka haypo
http://hachoir.org/

From barry at python.org  Tue Aug 14 15:30:58 2007
From: barry at python.org (Barry Warsaw)
Date: Tue, 14 Aug 2007 09:30:58 -0400
Subject: [Email-SIG] [Python-3000] fix email module for python 3000
	(bytes/str)
In-Reply-To: <200708130226.03670.victor.stinner@haypocalc.com>
References: <200708090241.08369.victor.stinner@haypocalc.com>
	<200708110149.10939.victor.stinner@haypocalc.com>
	<8B640CF2-EB88-45A5-A85F-1267AF24749E@python.org>
	<200708130226.03670.victor.stinner@haypocalc.com>
Message-ID: <E50AC401-C936-4202-903A-1691BD56ABE5@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Aug 12, 2007, at 8:26 PM, Victor Stinner wrote:

> On Sunday 12 August 2007 16:50:05 Barry Warsaw wrote:
>> In r56957 I committed changes to sndhdr.py and imghdr.py so that they
>> compare what they read out of the files against proper byte
>> literals.
>
> So nobody read my patches? :-( See my emails "[Python-3000] Fix imghdr
> module for bytes" and "[Python-3000] Fix sndhdr module for bytes" from last
> saturday. But well, my patches look similar.

Victor, sorry but my email was very spotty and I definitely missed  
your original patches.  Sorry for duplicating work and thanks for  
fixing the last few things in these modules.  Glad Guido got these  
committed.

I'll follow up on email package more in a bit.
- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRsGuknEjvBPtnXfVAQLbfgQAqfiBeaVwIN35nXn9D7DZXItkzoZSd+1V
f/a4PnzBHTdvFZgggisK/7o5b1uULOaHILLSmiQMFp0W/zV2JFCvKI7kc1/SkjSo
UgIXK3o9WtmljH3aj1njc6fgy3VCVfa09NDKf89/rCy15AaSxF21YinIDIqF/yGN
Sn2RQJqvNPc=
=KpZC
-----END PGP SIGNATURE-----

From barry at python.org  Tue Aug 14 17:39:29 2007
From: barry at python.org (Barry Warsaw)
Date: Tue, 14 Aug 2007 11:39:29 -0400
Subject: [Email-SIG] [Python-3000] Questions about email bytes/str
	(python 3000)
In-Reply-To: <200708140422.36818.victor.stinner@haypocalc.com>
References: <200708140422.36818.victor.stinner@haypocalc.com>
Message-ID: <E8DCAEF8-B7F8-4946-8256-AD0732492C51@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Aug 13, 2007, at 10:22 PM, Victor Stinner wrote:

> After many tests, I'm unable to convert email module to Python 3000. I'm
> also unable to take decision of the best type for some contents.

I made a lot of progress on the email package while I was traveling,  
though I haven't checked things in yet.  I probably will very soon,  
even if I haven't yet fixed the last few remaining problems.  I'm  
down to 7 failures, 9 errors of 247 tests.

> (1) Email parts should be stored as byte or character string?

Strings.  Email messages are conceptually strings so I think it makes
sense to represent them internally as such.  The FeedParser should
expect strings and the Generator should output strings.  One place
where I think bytes should show up would be in decoded payloads, but
in that case I really want to make an API change so that
.get_payload(decode=True) is deprecated in favor of a separate method.

I'm proposing other API changes to make things work better, a few of  
which are in my current patch, but others I want to defer if they  
don't directly contribute to getting these tests to pass.

> Related methods: Generator class, Message.get_payload(),  
> Message.as_string().
>
> Let's take an example: multipart (MIME) email with latin-1 and base64
> (ascii) sections. Mix latin-1 and ascii => mix bytes. So the best type
> should be bytes.
>
> => bytes

Except that by the time they're parsed into an email message, they  
must be ascii, either encoded as base64 or quoted-printable.  We also  
have to know at that point the charset being used, so I think it  
makes sense to keep everything as strings.

> (2) Parsing file (raw string): use bytes or str in parsing?
>
> The parser use methods related to str like splitlines(), lower(), strip().
> But it should be easy to rewrite/avoid these methods. I think that
> low-level parsing should be done on bytes. At the end, or when we know the
> charset, we can convert to str.
>
> => bytes

Maybe, though I'm not totally convinced.  It's certainly easier to
get the tests to pass if we stick with parsing strings.
email.message_from_string() should continue to accept strings,
otherwise obviously it would have to be renamed, but also because
its primary use case is turning a triple quoted string literal into
an email message.

I alluded to the one crufty part of this in a separate thread.  In  
order to accept universal newlines but preserve end-of-line  
characters, you currently have to open files in binary mode.  Then,  
because my parser works on strings you have to convert those bytes to  
strings, which I am successfully doing now, but which I suspect is  
ultimately error prone.  I would like to see a flag to preserve line  
endings on files opened in text + universal newlines mode, and then I  
think the hack for Parser.parse() would go away.  We'd define how  
files passed to this method must be opened.  Besides, I think it is  
much more common to be parsing strings into email messages anyway.
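
For comparison, released Python 3 ended up with an option along these lines: text-mode streams accept newline="", which still recognises all line-ending styles but returns them untranslated. A minimal sketch, assuming a current interpreter; the sample message is made up, and an in-memory stream stands in for a real file.

```python
import io

raw = b"Subject: test\r\n\r\nline one\nline two\r\n"

# newline="" recognises \n, \r and \r\n as line ends but returns them
# untranslated; open(path, newline="") accepts the same argument.
text_file = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline="")
lines = text_file.readlines()
```

Each element of lines keeps whatever terminator the original data used, so a parser working on str can still reproduce the input exactly.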

> About base64, I agree with Bill Janssen:
>  - base64MIME.decode converts string to bytes
>  - base64MIME.encode converts bytes to string

I agree.

> But decode may accept bytes as input (as base64 modules does): use
> str(value, 'ascii', 'ignore') or str(value, 'ascii', 'strict').

Hmm, I'm not sure about this, but I think that .encode() may have to  
accept strings.

> I wrote 4 differents (non-working) patches. So I you want to work on email
> module and Python 3000, please first contact me. When I will get a better
> patch, I will submit it.

Like I said, I also have an extensive patch that gets me most of the
way there.  I don't want to have dueling patches, so I think what
I'll do is put a branch in the sandbox and apply my changes there for
now.  Then we will have real code to discuss.

A few other things from my notes and diff:

Do we need email.message_from_bytes() and Message.as_bytes()?  While  
I'm (currently <wink>) pretty well convinced that email messages  
should be strings, the use case for bytes includes reading them  
directly to or from sockets, though in this case because the RFCs  
generally require ascii with encodings and charsets clearly  
described, I think a bytes-to-string wrapper may suffice.
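
Both functions do exist in released Python 3 (email.message_from_bytes() and Message.as_bytes()), although nothing in this thread had settled that yet. A minimal sketch of the bytes round trip, assuming a current interpreter and a made-up message:

```python
import email

raw = b"Subject: hello\n\nbody text\n"

msg = email.message_from_bytes(raw)   # parse bytes, e.g. read from a socket
subject = msg["Subject"]              # header values come back as str
wire = msg.as_bytes()                 # serialise back to bytes
```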

Charset class: How do we do conversions from input charset to output  
charset?  This is required by e.g. Japanese to go from euc-jp to  
iso-2022-jp IIUC.  Currently I have to use a crufty string-to-bytes  
converter like so:

 >>> bytes(ord(c) for c in s)

rather than just bytes(s).  I'm sure there's a better way I haven't  
found yet.
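
One way to express both conversions in a released Python 3 is an explicit decode followed by an encode. This is only a sketch: transcode() is a hypothetical helper, not part of the Charset class, and the sample strings are illustrative.

```python
def transcode(data: bytes, input_charset: str, output_charset: str) -> bytes:
    """Hypothetical helper: re-encode bytes from one charset to another."""
    return data.decode(input_charset).encode(output_charset)


text = "\u65e5\u672c\u8a9e"            # three kanji
euc = text.encode("euc_jp")
jis = transcode(euc, "euc_jp", "iso2022_jp")

# The ord() generator maps each code point below 256 to one byte, which is
# what the latin-1 codec does.
s = "caf\xe9"
via_ord = bytes(ord(c) for c in s)
via_codec = s.encode("latin-1")
```

Unlike the generator form, encode("latin-1") raises UnicodeEncodeError with a clear message when a character does not fit in one byte.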

Generator._write_headers() and the _is8bitstring() test aren't really
appropriate or correct now that everything's a unicode.  This
affected quite a few tests because long headers that previously were
getting split were now not getting split.  I ended up ditching the
_is8bitstring() test, but that led me into an API change for
Message.__str__() and Message.as_string(), which I've long wanted to
do anyway.  First, Message.__str__() no longer includes the Unix-From
header, but more importantly, .as_string() takes the maxheaderlen as
an argument and defaults to no header wrapping.  By changing various
related tests to call .as_string(maxheaderlen=78), these split header
tests can be made to pass again.  I think these changes make
str(some_message) saner and more explicit (because it does not split
headers) but these may be controversial in the email-sig.
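
A minimal sketch of the wrapping behaviour described here, as it appears in released Python 3 with the default (compat32) policy, where as_string() accepts maxheaderlen and defaults to no wrapping. The long subject is made up for illustration.

```python
from email.message import Message

msg = Message()
msg["Subject"] = " ".join(["word"] * 40)   # roughly 200 characters
msg.set_payload("body\n")

unwrapped = msg.as_string()                # maxheaderlen defaults to 0
wrapped = msg.as_string(maxheaderlen=78)   # fold long headers at 78 columns

longest_unwrapped = max(len(line) for line in unwrapped.splitlines())
longest_wrapped = max(len(line) for line in wrapped.splitlines())
```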

You asked earlier about decode_header().  This should definitely  
return a list of tuples of (bytes, charset|None).
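
In released Python 3, email.header.decode_header() returns encoded words as (bytes, charset) pairs, while a header with no encoded words comes back as a single (str, None) pair. A sketch of consuming that result; decode_to_text() is a hypothetical helper and the header values are made up.

```python
from email.header import decode_header


def decode_to_text(raw_header: str) -> str:
    """Hypothetical helper: join decoded header chunks into one str."""
    chunks = []
    for value, charset in decode_header(raw_header):
        if isinstance(value, bytes):
            # Encoded words come back as bytes plus their charset name.
            value = value.decode(charset or "ascii", errors="replace")
        chunks.append(value)
    return "".join(chunks)


parts = decode_header("=?utf-8?q?caf=C3=A9?=")
text = decode_to_text("=?utf-8?q?caf=C3=A9?=")
plain = decode_to_text("hello")
```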

Header is going to need some significant revision.  First, there's the
whole mess of .encode() vs. __str__() vs. __unicode__() to sort out.
It's insane that the latter two had different semantics w.r.t.
whitespace preservation between encoded words, so let's fix that.
Also, if the common use case is to do something like this:

 >>> msg['subject'] = 'a subject string'

then I wonder if we shouldn't be doing more sanity checking on the  
header value.  For example, if the value had a non-ascii character in  
it, then what should we do?  One way would be to throw an exception,  
requiring the use of something like:

 >>> msg['subject'] = Header('a \xfc subject', 'utf-8')

or we could do the most obvious thing and try to convert to 'ascii'  
then 'utf-8' if no charset is given explicitly.  I thought about  
always turning headers into Header instances, but I think that might  
break some common use cases.  It might be possible to define equality  
and other operations on Header instances so that these common cases  
continue to work.  The email-sig can address that later.
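
The "try ascii, then utf-8" idea can be sketched with the Header class from a released Python 3. header_value() is a hypothetical helper, not an API of the email package, and the subject text is the example used above.

```python
from email.header import Header, decode_header, make_header
from email.message import Message


def header_value(text: str):
    """Hypothetical helper: plain str for ASCII, utf-8 Header otherwise."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return Header(text, "utf-8")
    return text


subject = header_value("a \xfc subject")
msg = Message()
msg["Subject"] = subject

encoded = subject.encode()                 # RFC 2047 encoded-word form
decoded = str(make_header(decode_header(encoded)))
```

Returning a plain str for ASCII values keeps the common msg['subject'] = '...' case working unchanged, which is the compatibility concern raised above.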

However, if all Header instances are unicode and have a valid  
charset, I wonder if the splittable tests are still relevant, and  
whether we can simplify header splitting.  I have to think about this  
some more.

As for the remaining failures and errors, they come down to
simplifying the splittable logic, dealing with Message.__str__() vs.
Message.__unicode__(), verifying that the UnicodeErrors some tests
expect to be raised don't make sense any more, and fixing a couple of
other small issues I haven't gotten to yet.

I will create a sandbox branch and apply my changes later today so we  
have something concrete to look at.

Cheers,
- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRsHMsXEjvBPtnXfVAQLfCwP8CeHi9RBW5ULri3w6sBz5a1fkdVCftk71
uW8q0LercTJSa2ewvtrlWdKm9F403IabYjh2Bg8cZfHmYyZ+/b18oU64zzkZylo/
pHw9Iyvk9ZW6G7mwJRwpV9c6JXJNvsQtKRWipuue0ZMagI5OJBXR8vhRIDGkt+NC
ARhIrHXPEW8=
=DBLp
-----END PGP SIGNATURE-----

From janssen at parc.com  Wed Aug 15 03:44:54 2007
From: janssen at parc.com (Bill Janssen)
Date: Tue, 14 Aug 2007 18:44:54 PDT
Subject: [Email-SIG] [Python-3000] Questions about email bytes/str
	(python 3000)
In-Reply-To: <E8DCAEF8-B7F8-4946-8256-AD0732492C51@python.org> 
References: <200708140422.36818.victor.stinner@haypocalc.com>
	<E8DCAEF8-B7F8-4946-8256-AD0732492C51@python.org>
Message-ID: <07Aug14.184454pdt."57996"@synergy1.parc.xerox.com>

> > Let's take an example: multipart (MIME) email with latin-1 and base64
> > (ascii) sections. Mix latin-1 and ascii => mix bytes. So the best type
> > should be bytes.
> >
> > => bytes
> 
> Except that by the time they're parsed into an email message, they  
> must be ascii, either encoded as base64 or quoted-printable.  We also  
> have to know at that point the charset being used, so I think it  
> makes sense to keep everything as strings.

Actually, Victor's right here -- it makes more sense to treat them as
bytes.  It's RFC 821 (SMTP) that requires 7-bit ASCII, not the MIME
format.  Non-SMTP mail transports do exist, and are popular in various
places.  Email transported via other transport mechanisms may, for
instance, use a Content-Transfer-Encoding of "binary" for some
sections of the message.  Some parts of the top-most header of the
message may be counted on to be encoded as ASCII strings, but not the
whole message in general.

> > About base64, I agree with Bill Janssen:
> >  - base64MIME.decode converts string to bytes
> >  - base64MIME.encode converts bytes to string
> 
> I agree.
> 
> > But decode may accept bytes as input (as base64 modules does): use
> > str(value, 'ascii', 'ignore') or str(value, 'ascii', 'strict').
> 
> Hmm, I'm not sure about this, but I think that .encode() may have to  
> accept strings.

Personally, I think it would avoid more errors if it didn't.  Let the
user explicitly encode the string to a particular representation
before calling base64.encode().

Bill

From barry at python.org  Wed Aug 15 07:50:30 2007
From: barry at python.org (Barry Warsaw)
Date: Wed, 15 Aug 2007 01:50:30 -0400
Subject: [Email-SIG] [Python-3000] Questions about email bytes/str
	(python 3000)
In-Reply-To: <E8DCAEF8-B7F8-4946-8256-AD0732492C51@python.org>
References: <200708140422.36818.victor.stinner@haypocalc.com>
	<E8DCAEF8-B7F8-4946-8256-AD0732492C51@python.org>
Message-ID: <F4B589A6-09C3-490E-95A7-070E6E2CBCEF@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Aug 14, 2007, at 11:39 AM, Barry Warsaw wrote:

> I will create a sandbox branch and apply my changes later today so  
> we have something concrete to look at.

Done.  See:

http://svn.python.org/view/sandbox/trunk/emailpkg/5_0-exp/

I'm down to 5 failures and 6 errors (in test_email.py only), and I  
think most if not all of them are related to the broken header  
splittable stuff.

Please take a look.
- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRsKUJnEjvBPtnXfVAQISBQQAnEKytL8fqLbe+HADIyIBr1gDFtzbc4nw
zY4oEDPV+d4zFiAj9Ap5uePCfQxnqRdBMsHhkbCkB9k0XSDoWv2NxC10KLdE2CEO
YMLB+BB5uMjTCkHhaUVr/rIdKv/4LKZFy1v9dJv5X3BF5clugWa3L+tioe0kPk9X
jDkjZKc59LE=
=73uN
-----END PGP SIGNATURE-----

From victor.stinner at haypocalc.com  Wed Aug 15 21:52:38 2007
From: victor.stinner at haypocalc.com (Victor Stinner)
Date: Wed, 15 Aug 2007 21:52:38 +0200
Subject: [Email-SIG] [Python-3000] Questions about email bytes/str
	(python 3000)
In-Reply-To: <07Aug14.184454pdt."57996"@synergy1.parc.xerox.com>
References: <200708140422.36818.victor.stinner@haypocalc.com>
	<E8DCAEF8-B7F8-4946-8256-AD0732492C51@python.org>
	<07Aug14.184454pdt."57996"@synergy1.parc.xerox.com>
Message-ID: <200708152152.38839.victor.stinner@haypocalc.com>

On Wednesday 15 August 2007 03:44:54 Bill Janssen wrote:
> > (...) I think that base64MIME.encode() may have to accept strings.
>
> Personally, I think it would avoid more errors if it didn't.

Yeah, how can you guess which charset the user wants to use? For most users, 
there is only one charset: latin-1. So if you use UTF-8, they will not 
understand the conversion errors.

Another argument: I like bidirectional codecs:
   decode(encode(x)) == x
   encode(decode(x)) == x

If encode() accepts both bytes and str, these relations no longer hold.
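
A minimal sketch of a strictly typed pair for which both relations hold, built on the standard base64 module of a released Python 3. The encode() and decode() wrappers here are hypothetical and are not the base64MIME functions themselves.

```python
import base64


def encode(data: bytes) -> str:
    """Hypothetical wrapper: bytes in, ASCII str out."""
    return base64.b64encode(data).decode("ascii")


def decode(text: str) -> bytes:
    """Hypothetical wrapper: ASCII str in, bytes out."""
    return base64.b64decode(text.encode("ascii"))


payload = "caf\xe9".encode("latin-1")   # the caller picks the charset
armored = encode(payload)
```

Because the caller chooses the charset before calling encode(), the codec never has to guess between latin-1 and UTF-8.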

Victor Stinner aka haypo
http://hachoir.org/

From barry at python.org  Sun Aug 19 22:19:09 2007
From: barry at python.org (Barry Warsaw)
Date: Sun, 19 Aug 2007 16:19:09 -0400
Subject: [Email-SIG] The performance issue of the email package,
	and how about a cEmail?
In-Reply-To: <20070818173801.787E.ICEBERG@21cn.com>
References: <20070818173801.787E.ICEBERG@21cn.com>
Message-ID: <257CD72B-8935-4239-AF9C-8A3A137910EE@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Aug 18, 2007, at 6:11 AM, Iceberg wrote:

> It seems the only way to boost up the performance is to rewrite the  
> key part (mostly FeedParser.py) in C. And I believe the effort will  
> be worthy for such a fundamental, widespread, core package. In  
> fact, we saw similar happen, such as: cString, cPickle, cProfile.
>
> I read old mail archive (http://mail.python.org/pipermail/email-sig/)
> since 2005, but found no thread on this topic. So, I would venture to
> ask, is there any plan for a cEmail package in near future?

I have no plans to do this, but if you're interested, I think it
could be a worthy goal.  Without looking at those existing packages,
there are two things I'd say.  I doubt that either package would be
included in Python by default, either because it's C++ or because of
a license incompatibility.  OTOH, it may or may not be worth enabling
optional building of a cFeedParser based on whether these packages
are available or not.

OTOH, it might be nice to provide something like a cFeedParser as a  
third-party egg, and if it works out, and is enough of a performance  
boost, I'd probably support extending the email package to use it if  
it's available.

Cheers,
- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRsilvnEjvBPtnXfVAQJDAwP9HBWew9kpf8IS8DM90/cnoWr8udblxcGi
W6YgLcG8JR9B22aUThC8t/5wMuu3mBZhgouyPgNCUK/j4kL1zMC33zYoinGzzrke
F4f9ZXQ8Z1eG+GreDGhjxD6psrcpDAj+/04XtyL1tr7FE5GWcEN90f9InhzFGbQF
Uu3PPLIZ9n0=
=NPq5
-----END PGP SIGNATURE-----

From barry at python.org  Wed Aug 22 00:12:40 2007
From: barry at python.org (Barry Warsaw)
Date: Tue, 21 Aug 2007 18:12:40 -0400
Subject: [Email-SIG] [Python-3000] Py3k Sprint Tasks (Google Docs &
	Spreadsheets)
In-Reply-To: <c09ffb51ed04383961b5e8ff223d43@gmail.com>
References: <c09ffb51ed04383961b5e8ff223d43@gmail.com>
Message-ID: <93DBB66F-5D0D-4E46-8480-D2BFC693722A@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Aug 21, 2007, at 1:56 PM, gvanrossum at gmail.com wrote:

> I've shared a document with you called "Py3k Sprint Tasks":
> http://spreadsheets.google.com/ccc?key=pBLWM8elhFAmKbrhhh0ApQA&inv=python-3000 at python.org&t=3328567089265242420&guest
>
> It's not an attachment -- it's stored online at Google Docs &  
> Spreadsheets. To open this document, just click the link above.
>
> (resend, I'm not sure this made it out the first time)
>
> This spreadsheet is where I'm organizing the tasks for the Google
> Sprint starting tomorrow.
>
> Feel free to add. If you're coming to the sprint, feel free to claim
> ownership of a task.

I have approval to spend some official time at this sprint, though  
I'll be working from home and will be on IRC, Skype, etc.

I've been spending hours of my own time on the email package for py3k  
this week and every time I think I'm nearing success I get defeated  
again.  I think Victor Stinner came to similar conclusions.  To put  
it mildly, the email package is effed up!  But I'm determined to  
solve the worst of the problems this week.

I only have Wednesday and Thursday to work on this, with most of my  
time available on Thursday.  I'd really like to find one or two other  
folks to connect with to help work out the stickiest issues.  Please  
contact me directly or on this list to arrange a time with me.  I'm  
UTC-4 if that helps.  I'll be on #python-dev (barry) too.

Remember that the current code is in the python sandbox (under  
emailpkg/5_0-exp).  I have some uncommitted code which I'll try to  
check in tonight, though I don't know if it will make matters better  
or worse. ;)

Cheers,
- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRstjWXEjvBPtnXfVAQLQcQP+Lo/D1YH1+w/51kNyQN1+zrzu1Cov7ERk
1xtT5L2LlaPjXGeVMlc6Xz0bbLVc96kSQ4SIrkc5RRNorcYzMf8kID4rLkO6S+kU
CXtpOVgmzkX9zotAL9O72v2uOHT6c0fcK8ag44EiAtWei3Tdf+R2rL6lOzo0lHgj
qmVPFzlzGCA=
=t1nr
-----END PGP SIGNATURE-----

From stephen at xemacs.org  Sat Aug 25 08:10:04 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sat, 25 Aug 2007 15:10:04 +0900
Subject: [Email-SIG] [Python-3000] Py3k Sprint Tasks (Google Docs &
	Spreadsheets)
In-Reply-To: <93DBB66F-5D0D-4E46-8480-D2BFC693722A@python.org>
References: <c09ffb51ed04383961b5e8ff223d43@gmail.com>
	<93DBB66F-5D0D-4E46-8480-D2BFC693722A@python.org>
Message-ID: <87y7g0401v.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:

 > I've been spending hours of my own time on the email package for py3k  
 > this week and every time I think I'm nearing success I get defeated  
 > again.

I'm ankle deep in the Big Muddy (daughter tested positive for TB as
expected -- the Japanese inoculate all children against it because of
the sins of their fathers -- and school starts on Tuesday, so we need
to make a bunch of extra trips to doctors and whatnot), so what thin
hope I had of hanging out with the big boys at the Python-3000 sprint
long since evaporated.

However, starting next week I should have a day a week or so I can
devote to email stuff -- if you want to send any thoughts or
requisitions my way (or a URL to sprint IRC transcripts), I'd love to
help.  Of course you'll get it all done and leave none for me, right?

 > But I'm determined to solve the worst of the problems this week.

Bu-wha-ha-ha!

Steve


From barry at python.org  Sun Aug 26 20:30:47 2007
From: barry at python.org (Barry Warsaw)
Date: Sun, 26 Aug 2007 14:30:47 -0400
Subject: [Email-SIG] [Python-3000] Py3k Sprint Tasks (Google Docs &
	Spreadsheets)
In-Reply-To: <87y7g0401v.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <c09ffb51ed04383961b5e8ff223d43@gmail.com>
	<93DBB66F-5D0D-4E46-8480-D2BFC693722A@python.org>
	<87y7g0401v.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <9CBCCF2F-B428-4D37-8C18-1EAFB86CD7D9@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Aug 25, 2007, at 2:10 AM, Stephen J. Turnbull wrote:

> Barry Warsaw writes:
>
>> I've been spending hours of my own time on the email package for py3k
>> this week and every time I think I'm nearing success I get defeated
>> again.
>
> I'm ankle deep in the Big Muddy (daughter tested positive for TB as
> expected -- the Japanese innoculate all children against it because of
> the sins of their fathers -- and school starts on Tuesday, so we need
> to make a bunch of extra trips to doctors and whatnot), so what thin
> hope I had of hanging out with the big boys at the Python-3000 sprint
> long since evaporated.

Stephen, sorry to hear about your daughter and I hope she's going to  
be okay of course!

> However, starting next week I should have a day a week or so I can
> devote to email stuff -- if you want to send any thoughts or
> requisitions my way (or an URL to sprint IRC transcripts), I'd love to
> help.  Of course you'll get it all done and leave none for me, right?

Unfortunately, we didn't really sprint much on it, but I did get a
chance to spend time on the branch.  I think I see the light at the
end of the tunnel for getting the existing tests to pass, though I
haven't even looked at test_email_codecs.py yet.  Because of the way
things are going to work with input and output codecs, I'll
definitely want to get some sanity checks with Asian codecs.  I'll
try to put together a list of issues and questions and get those sent
out next week.

>> But I'm determined to solve the worst of the problems this week.
>
> Bu-wha-ha-ha!

Heh, well I'm getting closer.  We're definitely going to have some  
API changes, so I'll outline those as well.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRtHG13EjvBPtnXfVAQKCngP+PUTm82FjnVpqz7HvPLS/zPXBMelDNhkK
AKGIk5hveka180QEbA/DMsu7LZmPK2jXOQJWxufRsLfuzwKL3WtDF1IIyiICkC/I
HoR04bHZJzUdEzZuZPL53I704JoO8QBpXEOn/JdauFEaZ6qakueLdnqx1Ab0LbSP
RCLiVh9BxtU=
=6Ngh
-----END PGP SIGNATURE-----

From stephen at xemacs.org  Tue Aug 28 05:36:56 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 28 Aug 2007 12:36:56 +0900
Subject: [Email-SIG] [Python-3000] Py3k Sprint Tasks (Google Docs &
	Spreadsheets)
In-Reply-To: <9CBCCF2F-B428-4D37-8C18-1EAFB86CD7D9@python.org>
References: <c09ffb51ed04383961b5e8ff223d43@gmail.com>
	<93DBB66F-5D0D-4E46-8480-D2BFC693722A@python.org>
	<87y7g0401v.fsf@uwakimon.sk.tsukuba.ac.jp>
	<9CBCCF2F-B428-4D37-8C18-1EAFB86CD7D9@python.org>
Message-ID: <87tzqkz5wn.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:

 > Stephen, sorry to hear about your daughter and I hope she's going to  
 > be okay of course!

Oh, she's *fine*.  There's just a conflict between the Japanese
practice of vaccinating all school children against TB, and the
U.S. practice of testing for TB antibodies.  About 1 in 3 kids coming
from Japan to U.S. schools get snagged.  Annoying, but I'll trade this
for the problems with visas and the like that colleagues have had
*any* day.

 > haven't even looked at test_email_codecs.py yet.  Because of the way  
 > things are going to work with in put and output codecs, I'll  
 > definitely want to get some sanity checks with Asian codecs.

OK, *that* I can help with!