[Doc-SIG] non-ascii docstrings

Fri Mar 24 14:53:47 CET 2006

[Edward Loper]
> I've been working on epydoc, and the question has come up of how I
> should treat non-unicode docstrings that contain non-ascii
> characters.  An example of such a file is
> "python2.4/encodings/string_escape.py", whose module docstring
> contains an 'o' with an umlaut.
>
> In particular, the question is whether I should assume that the
> docstring is encoded with the encoding specified by the "-*- coding
> -*-" directive at the top of the file.

I think that although it's the only possible assumption, it's also
potentially a wrong assumption.  In other words, don't assume
anything.

> The reason why we *wouldn't* use the encoding is that PEP 263 [1],
> which defines the coding directive, says that it does *not* apply to
> non-unicode string literals.  In particular, PEP 263 says that the
> entire file should be read & tokenized using the specified coding,
> but once string objects are created, they should be reencoded back
> into 8-bit strings using the file encoding.

One reason is that the module code may expect such string literals to
have their original encoding.  String literals can contain arbitrary
8-bit data (strings are bytes, not characters).  Attempting to decode
such strings is inviting misinterpretation.
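
To illustrate the risk, here is a minimal sketch.  It assumes a modern
Python 3 interpreter, where a bytes object stands in for the 8-bit
string literals discussed above; the byte value and the three codecs
are chosen only for illustration.

```python
# One byte, three candidate encodings: nothing in the byte itself
# says which interpretation the author intended.
raw = b"\xf6"

as_latin1 = raw.decode("latin-1")   # o with umlaut (U+00F6)
as_cp1251 = raw.decode("cp1251")    # Cyrillic letter tse (U+0446)

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False                 # a lone 0xF6 is not valid UTF-8
```

Each decode that succeeds returns a different character, so a tool
that picks a codec on the author's behalf may silently show the wrong
text.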

Another reason is simple: "In the face of ambiguity, refuse the
temptation to guess."

> So the "correct" fix is for the author of the module to use unicode
> literals instead of string literals for docstrings that contain
> non-ascii characters.  This has the advantage that if a user tries
> to look at the docstring via introspection, it will be correct.
>
> On the other hand, epydoc is often used by people other than the
> author of a module, and requiring them to go through and replace all
> string literal docstrings with unicode literals seems a bit
> unreasonable.

Yes, it's unreasonable.  But such code is buggy IMO.  It's also
unreasonable to expect Epydoc to correctly interpret garbage input.
Don't do it.

> So the question is: should epydoc (and other tools like it) be
> compliant with PEP 263 (and consistent with Python); or should they
> "do what I mean, not what I say" and treat non-ascii docstrings as
> if they were encoded using the module's encoding?

Be compliant with PEP 263, issue a warning (PEP 263, Implementation,
step 1), and either ignore such string literals or represent them as
strings of bytes (using "\xYY" notation).
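
A rough sketch of the "warn and show bytes" option follows.  It is not
Epydoc's code: render_docstring and its parameters are hypothetical
names, and it assumes Python 3.5 or later, with a bytes object standing
in for the raw 8-bit docstring.

```python
import warnings

def render_docstring(raw, name="<unknown>"):
    """Return raw as text without guessing an encoding.

    Hypothetical helper: pure-ASCII input is returned unchanged;
    anything else triggers a warning and is shown with \\xYY escapes.
    """
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError:
        warnings.warn(
            "%s: non-ASCII bytes in a non-unicode docstring" % name
        )
        # backslashreplace turns each undecodable byte into \xYY
        return raw.decode("ascii", errors="backslashreplace")
```

For example, render_docstring(b"K\xf6ln", "mymodule") warns once and
returns the text K\xf6ln, with the byte left visible as an escape
rather than decoded under an assumed codec.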

-- 
David Goodger <http://python.net/~goodger>
