help with unicode email parse

Fri Sep 8 01:24:49 EDT 2006

John, you can look at my code below.
It downloads email and saves it to the local filesystem (both the
filenames and the emails contain non-English characters).
It works now.
But I still think Python's unicode strings are not as straightforward as
Java's. A string SHOULD always be unicode; otherwise I have trouble
dealing with strings that are in different encodings, because before
using one I must first find out whether it is unicode or 8-bit, and in
which encoding. And why does the system use ascii to decode? You have
explained some of this, but I cannot quite follow you. However, I never
have encoding problems when using strings in Java.

#-*- coding:utf8 -*-
import poplib
from poplib import POP3
import Utils
#from datetime import datetime
import time
import email
from email.Header import decode_header

conf = Utils.getValues()

def getMail():
	a = POP3( conf.get( "mailhost" ) )
	print a.getwelcome()
	a.user( conf.get( "mailuser" ) )
	a.pass_( conf.get( "mailpass" ) )
	a.list()
	( numMsgs, totalSize ) = a.stat()
	print "==begin==total %s mail,%s bytes" % ( numMsgs, totalSize )
	for i in range( 1, numMsgs + 1 ):
		( header, msg, octets ) = a.retr( i )
		print "Message %d:" % i
		text= list2txt( msg )
		save( text )
		#print octets
		#print header
	a.quit()
	print "==finish==total %s mail,%s bytes" % ( numMsgs, totalSize )

def list2txt( l ):
	return "\r\n".join( l )

def save( text ):
	stamp = getStamp()
	store=conf.get( "mailstore" )
	msg = email.message_from_string( text )
	path = getPath( text, msg )
	title, charset = decode_header( msg["Subject"] )[0]
	# decode with the charset declared in the header; fall back to utf8
	title = title.decode( charset or "utf8" )
	print repr( title )
	title = encodeFilename( title )
	print repr( title )
	fn = "%s/%s/%s-%s.mail"%( store.encode( "utf8" ),
							path.encode( "utf8" ),
							stamp.encode( "utf8" ),
							title )
	print repr( fn )
#	fn = fn.decode( "utf8" )

	import os
	path =os.path.dirname( fn )
	if not os.path.exists( path ) :
		os.makedirs( path )

	print repr( fn )
	f = file( fn, "wb" )
	f.write( text )
	f.close()

def encodeFilename( s ):
	slist=[]
	for ch in s:
		#print "CH", repr( ch )
		if "\":?*/\\<>|".find( ch ) >= 0:
			#print "here"
			slist.append( "_" )
		else:
			#print "there"
			slist.append( ch )
	#print repr( slist )
	return "".join( slist )
#encodeFilename( "abc:dd" )

def getPath( text, msg ):
	import Classify
	return Classify.run( text, msg )

def getStamp():
	s = repr( int( time.clock() * 1000000000000000L ) )
	#print s
	return unicode( s )
#print repr( getStamp() )

def test():
	subject = decode_header( "=?UTF-8?B?5rWL6K+V?=" )
	print "s1=", repr( subject )
	t1 = subject[0][0]
	print "t1=", repr( t1 )
	fn = "%s/%s-%s.mail"%( "d:/mail", "12345", '\xe6\xb5\x8b\xe8\xaf\x95' )
	print "fn=", repr( fn )
	fn = fn.decode( "utf8" )
	print "fn=", repr( fn )
	f = file( fn, "wb" )
	f.write( "test" )
	f.close()

#test()
getMail()
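
decode_header returns each part of a Subject header together with the
charset that part declares, so the title can be decoded with that
charset rather than a guessed one. A minimal sketch of doing this for
every part of the header, written for Python 3 (where the module is
email.header and encoded parts come back as bytes). The helper name
decode_subject and the ascii fallback are assumptions for illustration,
not part of the script above:

```python
from email.header import decode_header


def decode_subject(raw_subject):
    """Return the Subject header as text, using each part's declared charset."""
    parts = []
    for value, charset in decode_header(raw_subject):
        if isinstance(value, bytes):
            # Encoded words carry their own charset; bytes without one are
            # treated as ascii, replacing anything that does not fit.
            value = value.decode(charset or "ascii", errors="replace")
        parts.append(value)
    return "".join(parts)


print(repr(decode_subject("=?UTF-8?B?5rWL6K+V?=")))
print(repr(decode_subject("=?ISO-8859-1?Q?caf=E9?=")))
```

The result is always text, so the filename can be built from it without
having to remember which encoding the original header used.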

John Machin wrote:
> neoedmund wrote:
> [top-posting corrected]
> > John Machin wrote:
> > > neoedmund wrote:
> > > > I want to get the subject from an email and construct a filename
> > > > from it, but whatever I try, I always get an error like this:
> > > > UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 4:
> > > > ordinal not in range(128)
> > > >
> > > >
> > > > 	msg = email.message_from_string( text )
> > > > 	title = decode_header( msg["Subject"] )
> > > > 	title= title[0][0]
> > > > 	#title=title.encode("utf8")
> > >
> > > Why is that commented out?
> > >
> > > > 	print title
> > > > 	fn = ""+path+"/"+stamp+"-"+title+".mail"
> > > >
> > > >
> > > > The variable "text" comes from something like this:
> > > > ( header, msg, octets ) = a.retr( i )
> > > > text= list2txt( msg )
> > > > def list2txt( l ):
> > > > 	return reduce( lambda x, y:x+"\r\n"+y, l )
> > > >
> > > > Can anyone help me out? Thanks.
> > >
> > > Not without a functional crystal ball.
> > >
> > > You could help yourself considerably by (1) working out which line of
> > > code the problem occurs in [the traceback will tell you that] (2)
> > > working out which string is being decoded into Unicode, and has '\xe9'
> > > as its 5th byte. Either that string needs to be decoded using something
> > > like 'latin1' [should be specified in the message headers] rather than
> > > the default 'ascii', or the code has a deeper problem ...
> > >
> > > If you can't work it out for yourself, show us the exact code that ran,
> > > together with the traceback. If (for example) title is the problem,
> > > insert code like:
> > >     print 'title=', repr(title)
> > > and include that in your next post as well.
> > >
> > > HTH,
> > > John
> > Thank you, John and Diez.
> > I found that
> > fn = "%s/%s-%s.mail"%("d:/mail", "12345", '\xe6\xb5\x8b\xe8\xaf\x95' )
> > is OK, but
> > fn = "%s/%s-%s.mail"%(u"d:/mail", "12345", '\xe6\xb5\x8b\xe8\xaf\x95' )
> > results in:
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0:
> > ordinal not in range(128)
> > So "str"%(param) does not accept unicode, and only accepts a byte array?
>
> No, quite the contrary. And that's no "byte array", it's a string.
>
> The first substitution is in unicode, so the "%" operation  ups the
> ante from 8-bit string, and tries to decode the remaining
> substitutions, using the default ascii codec, which barfs on the 3rd
> substitution, which isn't ascii.
>
> If you want fn to be in some 8-bit encoding, then don't put the u in
> front of the first substitution.
> If you want fn to be in unicode, then you'll have to determine what
> encoding you're dealing with, and specify that explicitly.
>
> By the way, what has this "fn" stuff to do with your original problem?
> 
> Cheers,
> John
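
The implicit ascii decode John describes can be reproduced by doing the
decode explicitly. A minimal sketch, written for Python 3, where bytes
and text never mix silently, so the step Python 2 performed behind the
scenes has to be spelled out. The function name build_filename and its
parameters are assumptions for illustration:

```python
def build_filename(directory, stamp, raw_title, encoding):
    """Build a text filename from a byte-string title in a known encoding."""
    # Decode explicitly with the real encoding instead of relying on a default.
    title = raw_title.decode(encoding)
    return "%s/%s-%s.mail" % (directory, stamp, title)


raw = b"\xe6\xb5\x8b\xe8\xaf\x95"  # the utf8 bytes from the example above

try:
    # Decoding non-ascii bytes as ascii is the step that fails when
    # 8-bit strings and unicode are mixed.
    build_filename("d:/mail", "12345", raw, "ascii")
except UnicodeDecodeError as exc:
    print("ascii failed:", exc)

print(repr(build_filename("d:/mail", "12345", raw, "utf8")))
```

Naming the encoding at the point where the bytes become text is what
"specify that explicitly" amounts to in practice.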