[I18n-sig] encoding support for Docutils: please review
David Goodger
goodger@users.sourceforge.net
Thu, 27 Jun 2002 23:56:12 -0400
I'm implementing support for Unicode and encodings in Docutils, and
have some questions about locales and determining encodings. I want
Docutils to be able to handle files using any encoding. Having read
Skip Montanaro's "Using Unicode in Python"
(http://manatee.mojam.com/~skip/unicode/unicode/) and "Introduction to
i18n" by Tomohiro KUBOTA
(http://www.debian.org/doc/manuals/intro-i18n/), I came up with the
following heuristics:
- Try the encoding specified by a command-line option, if any.
- Try the locale's encoding.
- Try UTF-8.
- Try platform-specific encodings: CP-1252 on Windows, Mac-Roman on
  MacOS, perhaps Latin-9 (iso-8859-15) otherwise.
Does this look right, or am I missing something?
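
The heuristics above can be sketched as a function that builds the ordered list of candidates. This is only an illustration under stated assumptions, not Docutils code: the name ``candidate_encodings`` is hypothetical, the sketch targets a modern Python 3 (where ``locale.getpreferredencoding()`` is available), and mapping ``sys.platform == 'darwin'`` to Mac-Roman is an assumption standing in for "MacOS".

```python
import codecs
import locale
import sys


def candidate_encodings(option_encoding=None):
    """Return the encodings to try, in order, skipping blanks and repeats."""
    # Platform-specific fallback, following the list above (assumed mapping).
    if sys.platform.startswith('win'):
        platform_fallback = 'cp1252'
    elif sys.platform == 'darwin':
        platform_fallback = 'mac_roman'
    else:
        platform_fallback = 'iso-8859-15'

    ordered = [
        option_encoding,                     # command-line option, if any
        locale.getpreferredencoding(False),  # the locale's encoding
        'utf-8',
        platform_fallback,
    ]

    result = []
    seen = set()
    for name in ordered:
        if not name:
            continue
        try:
            canonical = codecs.lookup(name).name
        except LookupError:
            continue  # not a codec this Python knows about
        if canonical not in seen:
            seen.add(canonical)
            result.append(name)
    return result
```

Passing ``False`` to ``getpreferredencoding()`` asks it not to call ``setlocale`` itself, which leaves that decision to the client application.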
My questions:

- Does the application have to call
  ``locale.setlocale(locale.LC_ALL, '')``, and if so, where? Is it OK
  to call setlocale from within the decoding function, or should it be
  left up to the client application?

- Should I use the result of ``locale.getlocale()``? On
  Win2K/Python2.2.1, I get this::

      >>> import locale
      >>> locale.getlocale()
      (None, None)
      >>> locale.getdefaultlocale()
      ('en_US', 'cp1252')

  Looks good so far.

      >>> locale.setlocale(locale.LC_ALL, '')
      'English_United States.1252'
      >>> locale.getlocale()
      ['English_United States', '1252']

  "1252"? What happened to the "cp"?

      >>> s='abcd'
      >>> s.decode('1252')
      Traceback (most recent call last):
        File "<stdin>", line 1, in ?
      LookupError: unknown encoding

  How can I use ``locale.getlocale()`` when it doesn't return a
  known encoding? Or put another way, how can I get a known
  encoding out of ``locale.getlocale()``?

- Does ``locale.getdefaultlocale()[1]`` reliably produce the
  platform-specific encoding?
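
One possible way to get a usable codec name out of a bare code-page number such as ``'1252'`` is to ask the codec registry, and retry with a ``cp`` prefix if the name is all digits. This is a sketch only: ``normalize_encoding`` is a hypothetical helper, and it is written for a modern Python 3. Newer Python versions may already accept a bare ``'1252'`` as an alias, in which case the first lookup succeeds and the retry is never needed.

```python
import codecs


def normalize_encoding(name):
    """Return a codec name Python recognizes for `name`, or None."""
    if not name:
        return None
    candidates = [name]
    if name.isdigit():
        # Windows locales report a bare code-page number; try 'cpNNNN' too.
        candidates.append('cp' + name)
    for candidate in candidates:
        try:
            return codecs.lookup(candidate).name
        except LookupError:
            continue
    return None
```

A caller could then use ``normalize_encoding(locale.getlocale()[1])`` and simply skip the locale's encoding when the result is ``None``.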
Here's the decoding code I've written::

    # Requires ``import locale`` at module level.
    def decode(self, data):
        """
        Decode a string, `data`, heuristically into Unicode.
        Raise UnicodeError if unsuccessful.
        """
        encodings = [self.options.input_encoding, # command-line option
                     locale.getlocale()[1],
                     'utf-8',
                     locale.getdefaultlocale()[1],]
        # is locale.getdefaultlocale() platform-specific?
        for enc in encodings:
            if not enc:
                continue
            try:
                decoded = unicode(data, enc)
                return decoded
            except (UnicodeError, LookupError):
                # LookupError: the locale reported a name (such as
                # '1252') that is not a known codec; try the next one.
                pass
        raise UnicodeError(
            'Unable to decode input data.  Tried the following encodings: '
            '%s.' % ', '.join([repr(enc) for enc in encodings if enc]))

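
For comparison, here is roughly how the same loop might look as a standalone function on a modern Python 3, where the input is a ``bytes`` object. It is a sketch under assumptions, not the Docutils implementation: the name ``decode_bytes`` and the choice to return the successful encoding alongside the text are both illustrative.

```python
def decode_bytes(data, encodings):
    """
    Decode `data` (bytes) with the first encoding in `encodings` that works.

    Return a (text, encoding) pair. Raise UnicodeError if none succeeds.
    """
    tried = []
    for enc in encodings:
        if not enc:
            continue  # e.g. no command-line option, or no locale encoding
        try:
            return data.decode(enc), enc
        except (UnicodeError, LookupError):
            # UnicodeError: the bytes are not valid in this encoding.
            # LookupError: the name is not a codec Python knows about.
            tried.append(enc)
    raise UnicodeError(
        'Unable to decode input data. Tried the following encodings: '
        '%s.' % ', '.join(repr(enc) for enc in tried))
```

Something like ``decode_bytes(raw, [option, 'utf-8', 'cp1252'])`` would then try each candidate in turn, with ``raw`` and ``option`` supplied by the caller.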
Suggestions for improvement and/or pointers to other resources would
be most appreciated. Thank you.
--
David Goodger <goodger@users.sourceforge.net> Open-source projects:
- Python Docutils: http://docutils.sourceforge.net/
(includes reStructuredText: http://docutils.sf.net/rst.html)
- The Go Tools Project: http://gotools.sourceforge.net/