help with unicode email parse

Fri Sep 8 00:01:16 EDT 2006

neoedmund wrote:
[top-posting corrected]
> John Machin wrote:
> > neoedmund wrote:
> > > I want to get the subject from an email and construct a filename from
> > > it, but whatever I try, I always get an error like this:
> > > UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 4:
> > > ordinal not in range(128)
> > >
> > >
> > > 	msg = email.message_from_string( text )
> > > 	title = decode_header( msg["Subject"] )
> > > 	title= title[0][0]
> > > 	#title=title.encode("utf8")
> >
> > Why is that commented out?
> >
> > > 	print title
> > > 	fn = ""+path+"/"+stamp+"-"+title+".mail"
> > >
> > >
> > > The variable "text" comes from something like this:
> > > ( header, msg, octets ) = a.retr( i )
> > > text= list2txt( msg )
> > > def list2txt( l ):
> > > 	return reduce( lambda x, y:x+"\r\n"+y, l )
> > >
> > > Can anyone help me out? Thanks.
> >
> > Not without a functional crystal ball.
> >
> > You could help yourself considerably by (1) working out which line of
> > code the problem occurs in [the traceback will tell you that] (2)
> > working out which string is being decoded into Unicode, and has '\xe9'
> > as its 5th byte. Either that string needs to be decoded using something
> > like 'latin1' [should be specified in the message headers] rather than
> > the default 'ascii', or the code has a deeper problem ...
> >
> > If you can't work it out for yourself, show us the exact code that ran,
> > together with the traceback. If (for example) title is the problem,
> > insert code like:
> >     print 'title=', repr(title)
> > and include that in your next post as well.
> >
> > HTH,
> > John
> Thank you, John and Diez.
> I found that
> fn = "%s/%s-%s.mail"%("d:/mail", "12345", '\xe6\xb5\x8b\xe8\xaf\x95' )
> is OK, but
> fn = "%s/%s-%s.mail"%(u"d:/mail", "12345", '\xe6\xb5\x8b\xe8\xaf\x95' )
> results in:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0:
> ordinal not in range(128)
> So does "str"%(param) not accept unicode, and only accept a byte array?

No, quite the contrary. And that's no "byte array", it's a string.

The first substitution is a unicode object, so the "%" operation ups the
ante from 8-bit string to unicode and tries to decode the remaining str
substitutions using the default ascii codec, which barfs on the 3rd
substitution because it isn't ascii.
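
A minimal sketch of that decode step, written for Python 3 as an
assumption (the code in this thread is Python 2, where the decode happens
implicitly inside "%"). Python 3 never decodes implicitly, so the sketch
calls .decode('ascii') by hand on the same bytes to reproduce the error;
the names raw and error are illustrative only.

```python
# The 3rd substitution from the failing example, as raw bytes.
raw = b'\xe6\xb5\x8b\xe8\xaf\x95'

error = None
try:
    # Roughly what Python 2 does behind the scenes once a unicode
    # object takes part in the "%" operation.
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    error = exc

# The exception records which codec failed and where.
print(error.encoding, hex(raw[error.start]), error.start)
print(error)
```

The first byte, 0xe6, is already outside range(128), which is why the
reported position is 0.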

If you want fn to be in some 8-bit encoding, then don't put the u in
front of the first substitution.
If you want fn to be in unicode, then you'll have to determine what
encoding you're dealing with, and specify that explicitly.
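
As a hedged illustration of the unicode route, here is a sketch in
Python 3 (an assumption; the thread's code is Python 2). It uses
email.header.decode_header, which returns (value, charset) pairs, to find
out which encoding the subject claims, and decodes with that before
building fn. The helper name subject_to_text and the sample header are
made up for the example; the sample assumes the subject was sent as UTF-8,
which the bytes in the quoted post happen to be consistent with but which
the poster never stated.

```python
from email.header import decode_header


def subject_to_text(raw_subject):
    """Decode an RFC 2047 encoded Subject header into text.

    Hypothetical helper: each encoded part is decoded with the charset
    the header declares, falling back to ascii when none is declared.
    """
    parts = []
    for value, charset in decode_header(raw_subject):
        if isinstance(value, bytes):
            value = value.decode(charset or "ascii")
        parts.append(value)
    return "".join(parts)


# Sample header (assumed): the same six bytes as in the quoted post,
# declared as utf-8 using quoted-printable ("q") encoding.
title = subject_to_text("=?utf-8?q?=E6=B5=8B=E8=AF=95?=")

# Every substitution is now text, so nothing needs an implicit decode.
fn = "%s/%s-%s.mail" % ("d:/mail", "12345", title)
print(repr(fn))
```

If the header declares a different charset, such as latin1, the same code
picks it up from the header instead of guessing.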

By the way, what has this "fn" stuff to do with your original problem?

Cheers,
John