[Mailman-Developers] [PATCH] Header q-p/base64 RFC 2047 encoding for email module

Ben Gertzfield che@debian.org
Wed, 14 Nov 2001 19:21:43 +0900


The following patch to the email module implements the RFC
2047-specified Base64 and quoted-printable (called "B" and "Q"
encoding by the RFC) for header-safe encoding of 8-bit strings, for
From:, To:, Subject:, and other fields.  
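
For readers who have not seen the format: an RFC 2047 encoded word has the shape =?charset?encoding?encoded-text?=, where the encoding is "q" or "b". The sketch below is not part of the patch. It assumes Python 3 and uses the standard library's email.header.decode_header to show what a MIME-aware reader recovers from two example words taken from the patch's docstrings. The helper name decode_words is made up for illustration.

```python
from email.header import decode_header


def decode_words(value):
    """Turn a header value containing RFC 2047 encoded words into text."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Encoded words come back as raw bytes plus their charset.
            parts.append(chunk.decode(charset or "ascii"))
        else:
            # A value with no encoded words at all comes back as str.
            parts.append(chunk)
    return "".join(parts)


# Example words taken from the docstrings in the patch.
q_word = "=?iso-8859-1?q?Kevin_Phillips_B=F6ng?="
b_word = "=?iso-8859-1?b?SvxyZ2VuIEL2aW5n?="

name_from_q = decode_words(q_word)
name_from_b = decode_words(b_word)
```

Both encodings carry the charset inside the word itself, which is what lets the receiving side pick the right decoder without any other header.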

It includes charset information within the encoded strings themselves,
which, along with the special line-wrapping algorithm needed for B and
Q encoding, makes this a very useful general feature for
internationalized Python email programs.

Most MIME-aware mail readers in use today understand the RFC 2047
convention, and in the East Asian world, subject and address fields
effectively must be sent in Base64 encoding.

Mailman needs this functionality in order to send out localized
emails from the virgin queue; without it, it's very possible that 
8-bit characters will be blindly placed into the Subject: and To:
fields.  This also allows localized List-Id fields, as a bonus!

This patch adds the following functions to email.Utils:

encode_address(real_name, address, charset="iso-8859-1", encoding=QP):
    MIME-encode a header field intended for an address (from, to, cc, etc.)

encode_header(header, charset="iso-8859-1", encoding=QP):
    MIME-encode a general email header field (eg. Subject).

encode_header_chunks(header_chunks):
    MIME-encode a header with many different charsets and/or encodings.
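
How the three functions relate can be sketched as follows: encode_address and encode_header are thin wrappers that build a list of triplets for encode_header_chunks. This is a rough Python 3 approximation, not the patch's code. It assumes the standard library's email.quoprimime.header_encode and email.base64mime.header_encode as stand-ins for the patch's header_qencode and header_bencode, takes str input, and skips line wrapping by joining chunks with a single space.

```python
from email import base64mime, quoprimime

QP, BASE64 = 1, 2  # same flag values the patch defines in email.Utils


def encode_header_chunks(header_chunks):
    """Encode (text, charset, encoding) triplets; this sketch does no wrapping."""
    parts = []
    for text, charset, encoding in header_chunks:
        if encoding is None:
            # Unencoded chunks (such as "<user@host>") pass through as-is.
            parts.append(text)
        elif encoding == QP:
            parts.append(quoprimime.header_encode(text.encode(charset), charset))
        elif encoding == BASE64:
            parts.append(base64mime.header_encode(text.encode(charset), charset))
        else:
            raise ValueError("unknown encoding flag: %r" % (encoding,))
    return " ".join(parts)


def encode_header(header, charset="iso-8859-1", encoding=QP):
    """Single-chunk convenience wrapper."""
    return encode_header_chunks([(header, charset, encoding)])


def encode_address(real_name, address, charset="iso-8859-1", encoding=QP):
    """Encode only the real name; the address itself stays plain ASCII."""
    return encode_header_chunks([(real_name, charset, encoding),
                                 ("<%s>" % address, None, None)])


to_field = encode_address("Kevin Phillips B\u00f6ng",
                          "philips@slightly.silly.party.go.uk")
subject = encode_header("J\u00fcrgen B\u00f6ing", encoding=BASE64)
```

Passing None as the encoding for the address chunk is what keeps the angle-bracketed address readable and routable while the display name is encoded.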

It also adds the following support functions to email.Encoders:

header_qencode(header, charset="iso-8859-1", maxlinelen=75):
    Encode a header line with quoted-printable (like) encoding.

header_bencode(header, charset, maxlinelen=75):
    Encode a header line with Base64 encoding and a charset specification.
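
The wrapping these two helpers perform can be sketched as below. This is a Python 3 approximation that works on bytes rather than the 8-bit strings the patch takes, and the names q_encode_wrapped and b_encode_wrapped are hypothetical. The Q version never splits a "=XX" escape across lines; the B version cuts the input at a multiple of 3 bytes so each piece encodes to whole 4-character groups. One assumption worth stating loudly: cutting on byte boundaries is only safe for single-byte charsets, since RFC 2047 requires each encoded word to hold a whole number of characters.

```python
import base64

CRLFSPACE = "\r\n "
# Bytes that may appear literally in "Q" text, matching the set used in the patch.
Q_SAFE = frozenset(
    b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!*+-/"
)


def q_encode_wrapped(raw, charset="iso-8859-1", maxlinelen=75):
    # "=?" + "?q?" + "?=" add 7 characters around the charset name.
    maxlen = maxlinelen - len(charset) - 7
    lines = [""]
    for byte in raw:
        if byte == 0x20:
            token = "_"
        elif byte in Q_SAFE:
            token = chr(byte)
        else:
            token = "=%02X" % byte
        # Start a new encoded word rather than split a "=XX" escape.
        if len(lines[-1]) + len(token) < maxlen:
            lines[-1] += token
        else:
            lines.append(token)
    return CRLFSPACE.join("=?%s?q?%s?=" % (charset, line) for line in lines)


def b_encode_wrapped(raw, charset, maxlinelen=75):
    # Every 3 input bytes become 4 Base64 characters, so cut the input
    # at a multiple of 3 bytes that still fits on one line.
    step = ((maxlinelen - len(charset) - 7) // 4) * 3
    pieces = [raw[i:i + step] for i in range(0, len(raw), step)]
    return CRLFSPACE.join(
        "=?%s?b?%s?=" % (charset, base64.b64encode(piece).decode("ascii"))
        for piece in pieces
    )
```

For charset "iso-8859-1" and maxlinelen 75 the arithmetic works out to (75 - 10 - 7) // 4 * 3 = 42 input bytes per Base64 word, which encode to 56 characters and give a 73-character encoded word.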

I needed to re-implement the quoted-printable algorithm in
header_qencode because the "Q" encoding specified by RFC 2047 is
different in a few key areas from the one implemented in quopri.py,
and the line-wrapping at 75 characters got too hairy with just
quopri.py.
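
One such difference can be shown with a small Python 3 sketch (not part of the patch). Body-style quoted-printable leaves spaces, "?" and "_" alone, while RFC 2047 does not allow any of them to appear literally inside an encoded word. The comparison below assumes the standard library's quopri.encodestring for the body form and email.quoprimime.header_encode as a stand-in for a header-safe "Q" encoder; the sample string is made up.

```python
import quopri
from email import quoprimime

sample = b"Re: what? a_b = c"

# Body-style quoted-printable, as used for message bodies.
body_qp = quopri.encodestring(sample)

# Header "Q" encoding, wrapped as a complete encoded word.
header_q = quoprimime.header_encode(sample, "us-ascii")

# Characters that survive literally in the body form but would be
# ambiguous inside =?charset?q?...?= delimiters.
unsafe_in_header = [ch for ch in (b" ", b"?", b"_") if ch in body_qp]
```

A literal "?" inside the encoded text could be mistaken for a delimiter, and "_" is reserved to mean space, which is why the body encoder's output cannot simply be dropped between the =? and ?= markers.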

Patch follows, against email 0.95. (Sorry, I tried CVS, but I didn't
want to install Python 2.2 beta just yet.)

I will work on integrating this into Mailman tomorrow.

diff -ruN email.orig/Encoders.py email/Encoders.py
--- email.orig/Encoders.py	Tue Oct  2 04:29:38 2001
+++ email/Encoders.py	Wed Nov 14 19:07:24 2001
@@ -6,8 +6,10 @@
 
 import base64
 import quopri
+from binascii import b2a_base64
 from cStringIO import StringIO
 
+CRLFSPACE = "\015\012 "
 
 
 # Helpers
@@ -24,6 +26,15 @@
         return value[:-1]
     return value
 
+def _max_append(list, str, maxlen):
+    if len(list) == 0:
+        list.append(str)
+        return
+    
+    if len(list[-1] + str) < maxlen:
+        list[-1] += str
+    else:
+        list.append(str)
 
 def _bencode(s):
     # We can't quite use base64.encodestring() since it tacks on a "courtesy
@@ -78,3 +89,91 @@
 
 def encode_noop(msg):
     """Do nothing."""
+
+
+def header_qencode(header, charset="iso-8859-1", maxlinelen=75):
+    """Encode a header line with quoted-printable (like) encoding.
+
+    Defined in RFC 2047, this "Q" encoding is similar to
+    quoted-printable, but used specifically for email header fields to
+    allow charsets with mostly 7 bit characters (and some 8 bit) to
+    remain more or less readable in non-RFC 2047 aware mail clients.
+
+    The resulting string will be in the form:
+
+    "=?charset?q?I_f=E2rt_in_your_g=E8n=E8ral_dire=E7tion?=\r\n
+      =?charset?q?Silly_=C8nglish_Kn=EEghts?="
+
+    with each line wrapped safely at, at most, maxlinelen characters.
+    It is safe to use verbatim in any email header field, as the
+    wrapping is performed in a quoted-printable aware way and each
+    linefeed is a \r\n.
+
+    charset defaults to "iso-8859-1", and maxlinelen defaults to 75
+    characters.
+    """
+    quoted = []
+
+    # =? plus ?q? plus ?= is 7 characters
+    maxlen = maxlinelen - len(charset) - 7
+    
+    for c in header:
+        # Space may be represented as _ instead of =20 for readability
+        if c == ' ':
+            _max_append(quoted, "_", maxlen)
+        # These characters can be included verbatim
+        elif ((c >= 'a' and c <= 'z') or (c >= 'A' and c <= 'Z') or
+              (c >= '0' and c <= '9') or (c in ('!', '*', '+', '-', '/'))):
+            _max_append(quoted, c, maxlen)
+        # Otherwise, replace with hex value like =E2
+        else:
+            _max_append(quoted, "=%02X" % (ord(c)), maxlen)
+
+    encoded = ""
+
+    for q in quoted:
+        # Any chunk past the first must start with "\r\n "
+        if len(encoded) > 0:
+            encoded += CRLFSPACE
+        encoded += "=?%s?q?%s?=" % (charset, q)
+
+    return encoded
+
+def header_bencode(header, charset, maxlinelen=75):
+    """Encode a header line with Base64 encoding and a charset specification.
+    
+    Defined in RFC 2047, this "B" encoding is identical to normal
+    Base64 encoding, except that each line must be intelligently
+    wrapped (respecting the Base64 encoding), and subsequent lines must
+    start with a space.
+
+    The resulting string will be in the form:
+
+    "=?charset?b?WW/5ciBtYXp66XLrIHf8eiBhIGhhbXBzdGHuciBBIFlv+XIgbWF6euly?=\r\n
+      =?charset?b?6yB3/HogYSBoYW1wc3Rh7nIgQkMgWW/5ciBtYXp66XLrIHf8eiBhIGhh?="
+      
+    with each line wrapped at, at most, maxlinelen characters. It is
+    safe to use verbatim in any email header field, as the wrapping is
+    performed in a Base64-aware way and each linefeed is a
+    \r\n.
+
+    charset has no default and must be given; maxlinelen defaults to
+    75 characters.
+    """
+    base64ed = []
+
+    maxlen = ((maxlinelen - len(charset) - 7) / 4) * 3
+    num_lines = (len(header) + maxlen - 1) / maxlen
+
+    for i in xrange(0, num_lines):
+        base64ed.append(b2a_base64(header[i*maxlen:(i+1)*maxlen]))
+
+    encoded = ""
+
+    for b in base64ed:
+        if len(encoded) > 0:
+            encoded += CRLFSPACE
+        # We ignore the last character of each line, which is a \n.
+        encoded += "=?%s?b?%s?=" % (charset, b[:-1])
+
+    return encoded
diff -ruN email.orig/Utils.py email/Utils.py
--- email.orig/Utils.py	Sat Nov 10 02:07:44 2001
+++ email/Utils.py	Wed Nov 14 19:16:00 2001
@@ -17,11 +17,16 @@
 import base64
 
 # Intrapackage imports
-from Encoders import _bencode, _qencode
+from Encoders import _bencode, _qencode, header_qencode, header_bencode
 
 COMMASPACE = ', '
 UEMPTYSTRING = u''
 
+CRLFSPACE = "\015\012 "
+
+# Flags for types of header encodings
+QP     = 1  # Quoted-Printable
+BASE64 = 2  # Base64
 
 
 # Helpers
@@ -56,6 +61,16 @@
         return value[:-1]
     return value
 
+def _chunk_append(chunks, header, goodlinelen=75):
+    if len(chunks) == 0:
+        chunks.append(header)
+        return
+    
+    for chunk in header.split(CRLFSPACE):
+        if len(chunks[-1] + chunk) < goodlinelen:
+            chunks[-1] += " " + chunk
+        else:
+            chunks.append(chunk)
 
 
 def getaddresses(fieldvalues):
@@ -156,3 +171,90 @@
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'][now[1] - 1],
         now[0], now[3], now[4], now[5],
         zone)
+
+def encode_address(real_name, address, charset="iso-8859-1", encoding=QP):
+    """MIME-encode a header field intended for an address (from, to, cc, etc.)
+
+    Given an 8-bit string containing a real name, an email address,
+    and optionally the real name's character set, and the encoding
+    you wish to use with it, return a 7-bit MIME-encoded string
+    suitable for use in a From, To, Cc, or other email header
+    field.
+    
+    The encoding can be email.Utils.QP (quoted-printable, for
+    ASCII-like character sets like iso-8859-1), email.Utils.BASE64
+    (Base64, for non-ASCII like character sets like KOI8-R and
+    iso-2022-jp), or None (no encoding).
+    
+    The charset defaults to "iso-8859-1", and the encoding defaults
+    to email.Utils.QP.
+    
+    The resulting string will be in the format:
+    
+    "=?charset?q?Kevin_Phillips_B=F6ng?= <philips@slightly.silly.party.go.uk>"
+    
+    and can be included verbatim in an email header field.  Even
+    very long addresses are handled properly with this method:
+    
+    "=?charset?q?T=E4rquin_Fintimlinbinhinbimlim_Bus_St=F6p_Poontang_Poont?=\r\n
+      =?charset?q?ang_Ol=E9_Biscuit-Barrel?=\r\n
+      <tarquin@very.silly.party.go.uk>"
+    """       
+    
+    return encode_header_chunks([ [real_name, charset, encoding],
+                                  ["<%s>" % address, None, None] ])
+
+def encode_header(header, charset="iso-8859-1", encoding=QP):
+    """MIME-encode a general email header field (eg. Subject).
+
+    Given an 8-bit header string, and optionally its charset and the
+    encoding you wish to use, return a 7-bit MIME-encoded string
+    suitable for use in a general email header (but most useful for
+    the Subject: line).
+
+    The encoding can be email.Utils.QP (quoted-printable, for
+    ASCII-like character sets like iso-8859-1), email.Utils.BASE64
+    (Base64, for non-ASCII like character sets like KOI8-R and
+    iso-2022-jp), or None (no encoding).
+    
+    The charset defaults to "iso-8859-1", and the encoding defaults
+    to email.Utils.QP.
+    """
+    return encode_header_chunks([[header, charset, encoding]])
+
+def encode_header_chunks(header_chunks):
+    """MIME-encode a header with many different charsets and/or encodings.
+
+    Given a list of triplets [ [string, charset, encoding] ], return a
+    MIME-encoded string suitable for use in a header field.  Each triplet
+    may have different charsets and/or encodings, and the resulting header
+    will accurately reflect each setting.
+
+    Each encoding can be email.Utils.QP (quoted-printable, for
+    ASCII-like character sets like iso-8859-1), email.Utils.BASE64
+    (Base64, for non-ASCII like character sets like KOI8-R and
+    iso-2022-jp), or None (no encoding).
+    
+    Each triplet will be represented on a separate line; the resulting
+    string will be in the format:
+
+    "=?charset1?q?Mar=EDa_Gonz=E1lez_Alonso?=\r\n
+      =?charset2?b?SvxyZ2VuIEL2aW5n?="
+    """
+    chunks = []
+    
+    for header, charset, encoding in header_chunks:
+        encoded = ""
+        encoding_char = ""
+
+        if encoding is None:
+            _chunk_append(chunks, header)
+        else:
+            if encoding == QP:
+                _chunk_append(chunks, header_qencode(header, charset))
+                    
+            elif encoding == BASE64:
+                _chunk_append(chunks, header_bencode(header, charset))
+
+    return CRLFSPACE.join(chunks)
+

-- 
Brought to you by the letters A and H and the number 10.
"Wuzzle means to mix."
Debian GNU/Linux maintainer of Gimp and Nethack -- http://www.debian.org/