[I18n-sig] encoding support for Docutils: please review

David Goodger goodger@users.sourceforge.net
Thu, 27 Jun 2002 23:56:12 -0400


I'm implementing support for Unicode and encodings in Docutils, and
have some questions about locales and determining encodings.  I want
Docutils to be able to handle files using any encoding.  Having read
Skip Montanaro's "Using Unicode in Python"
(http://manatee.mojam.com/~skip/unicode/unicode/) and "Introduction to
i18n" by Tomohiro KUBOTA
(http://www.debian.org/doc/manuals/intro-i18n/), I came up with the
following heuristics:

- Try the encoding specified by a command-line option, if any.

- Try the locale's encoding.

- Try UTF-8.

- Try platform-specific encodings: CP-1252 on Windows, Mac-Roman on
  MacOS, perhaps Latin-9 (iso-8859-15) otherwise.
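
The order above can be sketched as a function that builds the candidate
list.  This is only an illustration in present-day Python 3, not Docutils
code: ``candidate_encodings`` and its parameters are hypothetical names,
and the ``sys.platform`` tests (``win*`` for Windows, ``darwin`` for MacOS)
are an assumption about how the platform would be detected.

```python
import sys


def candidate_encodings(option_encoding=None, locale_encoding=None,
                        platform=sys.platform):
    """Return encoding names to try, in the order of the list above."""
    # Platform-specific last resort (assumed mapping, see lead-in).
    if platform.startswith('win'):
        fallback = 'cp1252'
    elif platform == 'darwin':
        fallback = 'mac-roman'
    else:
        fallback = 'iso-8859-15'

    ordered = [option_encoding, locale_encoding, 'utf-8', fallback]

    # Skip empty entries and drop case-insensitive duplicates,
    # keeping the first occurrence so the priority order is preserved.
    result = []
    seen = set()
    for name in ordered:
        if not name or name.lower() in seen:
            continue
        seen.add(name.lower())
        result.append(name)
    return result
```

A caller would pass the command-line option and something like
``locale.getlocale()[1]`` as the first two arguments; empty values are
skipped and duplicates are dropped so that no codec is tried twice.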

Does this look right, or am I missing something?

My questions:

- Does the application have to call
  ``locale.setlocale(locale.LC_ALL, '')``, and if so, where?  Is it OK
  to call setlocale from within the decoding function, or should it be
  left up to the client application?
  
- Should I use the result of ``locale.getlocale()``?  On
  Win2K/Python2.2.1, I get this::

      >>> import locale
      >>> locale.getlocale()
      (None, None)
      >>> locale.getdefaultlocale()
      ('en_US', 'cp1252')
      
  Looks good so far.
  
      >>> locale.setlocale(locale.LC_ALL, '')
      'English_United States.1252'
      >>> locale.getlocale()
      ['English_United States', '1252']
  
  "1252"?  What happened to the "cp"?

      >>> s='abcd'
      >>> s.decode('1252')
      Traceback (most recent call last):
        File "<stdin>", line 1, in ?
      LookupError: unknown encoding

  How can I use ``locale.getlocale()`` when it doesn't return a
  known encoding?  Or put another way, how can I get a known
  encoding out of ``locale.getlocale()``?

- Does ``locale.getdefaultlocale()[1]`` reliably produce the
  platform-specific encoding?
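
For the ``'1252'`` case in the second question, one possible workaround
is to validate whatever the locale reports with ``codecs.lookup()`` and to
retry a bare code-page number with a ``cp`` prefix.  A minimal sketch in
present-day Python 3, where ``known_encoding`` is a hypothetical helper
and the ``cp`` retry rests on the assumption that an all-digit name is a
Windows code page:

```python
import codecs


def known_encoding(name):
    """Return a codec name Python recognizes for `name`, or None."""
    if not name:
        return None
    candidates = [name]
    if name.isdigit():
        # Assumption: a bare number such as '1252' means code page 1252.
        candidates.append('cp' + name)
    for candidate in candidates:
        try:
            # codecs.lookup() raises LookupError for unknown names.
            return codecs.lookup(candidate).name
        except LookupError:
            continue
    return None
```

Filtering the locale's answer through a helper like this before decoding
would turn an unusable name into ``None``, which the decoding loop already
skips.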

Here's the decoding code I've written::

    def decode(self, data):
        """
        Decode a string, `data`, heuristically into Unicode.
        Raise UnicodeError if unsuccessful.
        """
        encodings = [self.options.input_encoding, # command-line option
                     locale.getlocale()[1],
                     'utf-8',
                     locale.getdefaultlocale()[1],]
        # is locale.getdefaultlocale() platform-specific?
        for enc in encodings:
            if not enc:
                continue
            try:
                decoded = unicode(data, enc)
                return decoded
            except (UnicodeError, LookupError):
                # LookupError: the codec name is unknown (e.g. '1252')
                pass
        raise UnicodeError(
            'Unable to decode input data.  Tried the following encodings: '
            '%s.' % ', '.join([repr(enc) for enc in encodings if enc]))
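
One thing a first-success loop like this is sensitive to is the order of
the candidates: single-byte codecs such as cp1252 accept almost any byte
string, so if one of them is tried before UTF-8, UTF-8 input decodes
"successfully" into the wrong characters.  A small illustration in
present-day Python 3 (the code above is Python 2.2); ``first_success`` is
a hypothetical stand-in for the loop, not the Docutils method:

```python
def first_success(data, encodings):
    """Return (text, encoding) for the first codec that decodes `data`."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except (UnicodeError, LookupError):
            continue
    raise UnicodeError('Unable to decode input data.')


# U+00E9 (e with acute accent) encoded as UTF-8 is the two bytes C3 A9.
utf8_bytes = '\u00e9'.encode('utf-8')

# cp1252 first: both bytes are valid cp1252, so the wrong text wins.
wrong = first_success(utf8_bytes, ['cp1252', 'utf-8'])

# UTF-8 first: the strict UTF-8 decoder accepts the input as intended.
right = first_success(utf8_bytes, ['utf-8', 'cp1252'])

# A lone E9 byte is invalid UTF-8, so the loop falls through to cp1252.
legacy = first_success(b'\xe9', ['utf-8', 'cp1252'])
```

Because UTF-8 rejects most byte strings that were not written as UTF-8,
trying it ahead of any 8-bit locale or platform encoding costs little and
avoids the first outcome.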

Suggestions for improvement and/or pointers to other resources would
be most appreciated.  Thank you.

-- 
David Goodger  <goodger@users.sourceforge.net>  Open-source projects:
  - Python Docutils: http://docutils.sourceforge.net/
    (includes reStructuredText: http://docutils.sf.net/rst.html)
  - The Go Tools Project: http://gotools.sourceforge.net/