[Mailman-i18n] "Funny" characters in real names?

Martin v. Löwis loewis@informatik.hu-berlin.de
19 Sep 2002 09:37:38 +0200


barry@python.org (Barry A. Warsaw) writes:

> Martin, sometimes this Unicode stuff makes my head hurt. ;)

In an application that deals with multiple charsets on a regular basis
(such as Mailman), I recommend not mixing byte strings and Unicode
strings. This can be achieved by
- converting all byte strings that represent text data to Unicode
  at the earliest possible point in processing,
- converting all Unicode strings back to byte strings just before
  output.

If most data is likely ASCII, it is tempting to use byte strings for
pure-ASCII, and Unicode for everything else. Try to resist this
temptation.

If you follow this strategy, you will find that processing becomes
much simpler.
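
As a rough sketch of this strategy: the helper names (to_unicode,
to_bytes, make_greeting) and the sample name are made up for
illustration and are not Mailman code. I use the .decode/.encode
methods with explicit b"" and u"" literals here, on the assumption
that your Python accepts those prefixes.

```python
def to_unicode(raw, charset):
    # Input boundary: byte string -> Unicode string, as early as possible.
    return raw.decode(charset)

def to_bytes(text, charset):
    # Output boundary: Unicode string -> byte string, just before output.
    return text.encode(charset)

def make_greeting(name):
    # Interior code only ever sees Unicode strings, even for pure ASCII.
    return u"Hello, " + name + u"!"

raw_name = b"L\xf6wis"                       # hypothetical latin-1 input
name = to_unicode(raw_name, "iso-8859-1")    # decode once, on the way in
greeting = make_greeting(name)               # all processing in Unicode
wire = to_bytes(greeting, "utf-8")           # encode once, on the way out
```

Nothing between the two boundaries needs to know which charset the
data arrived in or which one it will leave in.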

> So it seems like name.encode('us-ascii') is my only choice.  What am I
> missing?

If you are following the above strategy, you will know whether name is
a Unicode string or a byte string. If it is Unicode,
name.encode('us-ascii') is fine. If it is a byte string,
unicode(name,'ascii') will work. Either call raises UnicodeError if
name contains non-ASCII characters.

I admit that the strategy has two problems:

1. In some cases, it might be impossible to generate a Unicode string
   for text data. In MIME, the encoding may not be specified, or it
   may be unknown to Mailman, or the data may fail to convert.

   In these cases, it may be acceptable to "force" the data to
   Unicode: If there is no encoding, guess latin-1. If the string
   fails to convert, convert it with "replace". If the encoding is
   unknown, replace all non-printable characters with question marks.

   Whether this is acceptable depends on how frequently the problem
   occurs and whose fault it is (e.g. an unknown encoding should be
   added to Mailman).

2. When converting an application that used to be byte-oriented to
   Unicode, adding conversions at all required places might be too
   much effort, or breakage because of incorrect data might be
   unacceptable.

   In these cases, I recommend adding type tests at strategic places,
   and papering over any incorrect data.

E.g. in this case, you could write a function

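import types

DEBUG = 0    # placeholder flag; set to 1 to catch stray byte strings

class DebugError(Exception):    # placeholder exception class
    pass

def unicode_is_pure_ascii(text):
    # Expected case: text is already a Unicode string.
    if type(text) is types.UnicodeType:
        try:
            text.encode("ascii")
            return 1
        except UnicodeError:
            return 0
    # A byte string slipped through: complain when debugging,
    # otherwise paper over it and test the bytes instead.
    if DEBUG:
        raise DebugError("string not unicode: " + repr(text))
    try:
        unicode(text, "ascii")
        return 1
    except UnicodeError:
        return 0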

If you expect name to be a byte string, the function would be
bytes_are_ascii, of course.

Regards,
Martin