[Doc-SIG] Hello and feedback

David Goodger goodger@python.org
Sat, 15 Mar 2003 01:01:25 -0500


David Priest wrote:
>  The two key missing puzzle-pieces are as follows:
>  - I am unable to insert Unicode character entities into the text.

As I recently wrote on the docutils-users mailing list,

    With Unicode and encodings available (UTF-8, Latin-1, etc.), we
    can directly code &nbsp; as \u00A0 or \xA0.  Same with soft
    hyphens (&shy;: \u00AD or \xAD), and any other Unicode character.
    This lessens (eliminates?) the need for the "escape sequence ->
    concrete character" kind of character processing.

    <http://article.gmane.org/gmane.text.docutils.user/190>

Why are you unable to insert the actual, encoded characters into the
text?  What *are* you able to insert?  What encoding are your files
using?  What platform (OS, editor, etc.)?  It could be that you *can*
insert real characters but don't know it.

In any case, the world is moving inexorably towards Unicode.  I don't
see the point in complicating reStructuredText in order to duplicate
what can be had for free with properly encoded input files.
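
To illustrate the point about encoded characters, here is a minimal
sketch assuming Python 3.  The variable names are illustrative only and
nothing here is Docutils code; it just shows that the escapes named
above are ordinary characters that any suitable encoding can carry.

```python
# "\u00A0" and "\xA0" name the same code point (no-break space);
# likewise "\u00AD" and "\xAD" (soft hyphen).
NBSP = "\u00A0"
SHY = "\u00AD"
assert NBSP == "\xA0" and SHY == "\xAD"

# Real characters placed directly in the text, no entity syntax needed.
text = "10" + NBSP + "km and a hy" + SHY + "phenated word"

# Both characters fit in Latin-1 (one byte each) and in UTF-8 (two bytes each).
latin1_bytes = text.encode("latin-1")
utf8_bytes = text.encode("utf-8")

# Decoding with the matching encoding gives back the identical text.
assert latin1_bytes.decode("latin-1") == text
assert utf8_bytes.decode("utf-8") == text
```

The only requirement is that whatever reads the file decodes it with
the same encoding that was used to write it.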

> Substitutions are the answer, so I've created a |subst|
> replace:&xAAA; table, as included in the zipfile.  Alas, the
> DocUtils insist on <PRE>coding or &amp;-substituting the replacement
> text.

XML character entities (&xNNNN;) are unknown to reStructuredText and
are the wrong way to do it.  Docutils is correct in substituting
"&amp;" in HTML output for every "&" in the input file (how could it
tell the difference between *using* an XML character entity and just
*talking* about one?).
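
The escaping behaviour can be sketched with the Python standard
library.  This assumes Python 3 and uses html.escape purely as a
stand-in; it is not the code Docutils itself runs.

```python
from html import escape

# One sentence that *talks about* an entity and also *uses* a real en-dash.
source = "Write &ndash; to get an en-dash, or type \u2013 directly."

# An HTML writer has to escape every "&" so the reader sees the text as typed.
html_out = escape(source, quote=False)

print(html_out)
```

The literal "&ndash;" comes out as "&amp;ndash;", which a browser
displays as the text "&ndash;", while the real en-dash character passes
through untouched.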

Your substitution table idea ought to work, but you need to use real
characters instead of "&xNNNN;" in it.  Use a character encoding that
contains all the characters you want -- Latin-1 if that's enough,
your platform's native encoding (e.g. a Windows CP-* encoding), or
UTF-8 for full Unicode capability.  If your normal text editor can't
handle the encoding you choose, use Emacs or Vim; if you can't handle
them, get one of your programmers to help.  The special editing will be
limited to the one file.

For this to work, the encoding of the substitution table file has to
be the same as that of the input file.  Alternatively, some work has
to be done on the Docutils code to allow different encodings.
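
A substitution table with real characters could be generated rather
than typed.  The following is a rough sketch assuming Python 3; the
substitution names and the helper functions are hypothetical, chosen
only for illustration.

```python
# Hypothetical substitution names mapped to the real characters they stand for.
SUBSTITUTIONS = {
    "deg": "\u00B0",    # degree sign
    "ndash": "\u2013",  # en-dash
    "mdash": "\u2014",  # em-dash
}

def build_table(subs):
    """Return reStructuredText substitution definitions, one per line."""
    lines = [".. |%s| replace:: %s" % (name, char)
             for name, char in sorted(subs.items())]
    return "\n".join(lines) + "\n"

def fits(text, encoding):
    """Report whether every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

def write_table(path, subs, encoding="utf-8"):
    """Write the table using the same encoding as the input file."""
    with open(path, "w", encoding=encoding) as handle:
        handle.write(build_table(subs))

table = build_table(SUBSTITUTIONS)
```

Here fits(table, "latin-1") is false because the dashes lie outside
Latin-1, which is the practical reason to prefer UTF-8 for a table like
this one.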

Also, the "\ " (backslash-space) escape sequence trick (also
<http://article.gmane.org/gmane.text.docutils.user/190>) needs to be
implemented to allow for character-level inline markup or
substitutions inside words.  The implementation should be easy.

>  - I am unable to use regex macros to manipulate the input text.
> There are two uses for this:
> 
>   1) To do standardized typographic replacement on naive-user's
> text.  For instance -- in print documentation -- the long dash, as I
> just used it, should be replaced using an em-dash surrounded by thin
> spaces.  This is a detail that no naive user should have to worry
> about: that's my job as technical writer.  Likewise, certain
> measurement units require marking-up: I'd like my naive users to
> write "36.8degrees" and get 36.8[degree sign], or 10m3 and get
> 10m[superscripted 3].
> 
>      .. use the formal en-dash with number ranges
>      .. regex:: ([0-9]+?)-([0-9]+?) = \1&ndash;\2

As I said in private email, at first glance,

    I doubt it [a regex substitution mechanism] will make its way into
    the Docutils/reStructuredText core.  Such a function could easily
    be handled by pre-processing though.  A script could convert ' -- '
    into a UTF-8 em-dash, etc.  Such a pre-processing system *could*
    be made part of Docutils, but separate from the core code I think.

I realize that there is a problem with pre-processing.  "10m3" inside
a paragraph and "10m3" inside a literal block or inline literal should
probably be handled differently, and a simple pre-processor would find
it difficult to differentiate.
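
The difficulty is easy to demonstrate.  This is a deliberately naive
sketch assuming Python 3; the function name and the sample text are
made up for the example.

```python
EM_DASH = "\u2014"

def naive_preprocess(text):
    """Replace every ' -- ' (including its spaces) with an em-dash."""
    return text.replace(" -- ", EM_DASH)

sample = (
    "Save early -- save often.\n"
    "\n"
    "Type this::\n"
    "\n"
    "    ls -- *.txt\n"
)

result = naive_preprocess(sample)
print(result)
```

The sentence is converted as intended, but so is the command inside the
literal block, which should have been left alone.  Telling the two
apart requires knowing the document structure, which a plain text
filter does not have.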

Personally, I would go with an 80/20 auto/manual system.  Write a
quick script (or Emacs lisp function) to do the changes (convert
"10m3" to "10m\ :sup:`3`", for example) and verify manually.  If you
try to add this conversion to the processing pipeline (in whatever
form), there will someday be an exception where you *won't* want the
conversion to take place.
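
The "auto" half of such a system might look like the sketch below,
assuming Python 3.  The pattern is a hypothetical rule covering only
"m" followed by an exponent of 2 or 3, and the replacement follows the
"10m\ :sup:`3`" form given above.  It only proposes changes; applying
them is left to a person.

```python
import re

# Hypothetical rule: a number, the unit "m", then an exponent 2 or 3 ("10m3").
UNIT_EXPONENT = re.compile(r"\b(\d+(?:\.\d+)?m)([23])\b")

def propose(line):
    """Return the line with each match rewritten as unit + superscript role."""
    return UNIT_EXPONENT.sub(r"\1\\ :sup:`\2`", line)

def review(lines):
    """Yield (line number, old, new) for every line that would change."""
    for number, old in enumerate(lines, start=1):
        new = propose(old)
        if new != old:
            yield number, old, new

if __name__ == "__main__":
    text = ["The tank holds 10m3 of water.", "Nothing to change here."]
    for number, old, new in review(text):
        print("line %d:\n  - %s\n  + %s" % (number, old, new))
```

Printing the before and after lines side by side keeps the manual
verification step cheap, and an unwanted conversion can simply be
skipped.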

OTOH, if a good design appears and somebody implements it, who knows?
The idea needs a lot of fleshing out first though.  As with the
substitution table file idea, "&ndash;" is no good; it would have to
be a properly encoded en-dash character.

>      Note that the parenthesis in the prefix part of the equation
> indicate grouping: ...

I don't think we need a lesson in regular expression syntax, thanks.
;-)

>  If a file has a regex definition, it will be run before all other
> macros and substitutions, mais non?  I think so.

What "other macros"?  reStructuredText doesn't have macros.

>  2) To supplement the :role:` ` coding.  In particular, the software
> documentation needs to indicate GUI elements.  In HTML they would be
> rendered bold; in DocBook as <guilabel>.  It would be appropriate
> for the source text to read " open the :gui:`File` menu and choose
> :gui:`Save` ".
>  A :role: transformer program would be overkill for this.  All that
> is needed is the ability to run an appropriate regex against the
> enclosed text, appropriate being defined by the destination file
> type.

There's absolutely no need for any regex substitution here.  That's
exactly what the Writers are for.  The ":gui:`File`" input may become
"<gui>File</gui>" in the internal document tree.  The HTML Writer
would write it with bold (or, better yet, with <span class="gui">,
made bold by the stylesheet).  The DocBook Writer would write it with
<guilabel>.
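
As a toy model of that division of labour (assuming Python 3; these
functions and the tuple-based "node" are hypothetical and are not the
Docutils Writer API), the same parsed element is handed to two
different output routines:

```python
from html import escape as html_escape
from xml.sax.saxutils import escape as xml_escape

# Hypothetical mapping from role name to DocBook element name.
DOCBOOK_TAGS = {"gui": "guilabel"}

def render_html(node):
    """Render a (role, text) pair as a span; the stylesheet makes it bold."""
    role, text = node
    return '<span class="%s">%s</span>' % (role, html_escape(text))

def render_docbook(node):
    """Render a (role, text) pair using the matching DocBook element."""
    role, text = node
    tag = DOCBOOK_TAGS[role]
    return "<%s>%s</%s>" % (tag, xml_escape(text), tag)

# What ":gui:`File`" might become once parsed: role name plus enclosed text.
node = ("gui", "File")

print(render_html(node))
print(render_docbook(node))
```

The source text is parsed once, and each output format decides for
itself how a "gui" element should look, so no regex needs to touch the
enclosed text.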

There has been some discussion about parameterizing the interpreted
text system somehow, to avoid proliferation of element types (gui,
keypress, etc.).  No decision or action yet.

> (there are some other gaps, but I see that they've been addressed by
> others.  Figure/Table autonumbering and referencing are a biggie.  I
> see that someone has suggested a naming convention to allow
> referencing, and the autonumbering shouldn't be any problem at all,
> so I anticipate a solution is on the way.)

They're on the To Do list, but I don't know when they'll be
implemented.  Contributions welcome!

-- David Goodger    http://starship.python.net/~goodger

Programmer/sysadmin for hire: http://starship.python.net/~goodger/cv