[Doc-SIG] Tokens for labels & endnotes

Edward D. Loper edloper@gradient.cis.upenn.edu
Wed, 21 Mar 2001 14:03:24 EST


> I'm assuming we're talking about paragraph labels.
Actually, I think we were talking about [endnotes].  But the same
questions apply to labels.

> I think we should just go with the English definition of a word, which
> means [-A-Za-z], and leave it at that. It is *meant* to look like a
> word.

Is that too Anglo-centric?

> I think "keep it simple" is required here - these labels are meant to be
> few and simple, so English words seems sensible to me. I would thus vote
> against underlines and against digits.

It might be that underlines and digits are more applicable for 
endnotes.  Some people might like this [1] or this [noam_chomsky97].
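
As a rough sketch of what allowing digits and underscores in endnote
tokens could look like (the names ENDNOTE_RE and find_endnotes are
made up for illustration and are not part of any agreed spec):

```python
import re

# Assumption: an endnote token is a bracketed run of ASCII letters,
# digits, and underscores, with no spaces. This is only one candidate
# definition, not a decided rule.
ENDNOTE_RE = re.compile(r"\[([A-Za-z0-9_]+)\]")

def find_endnotes(text):
    """Return the endnote tokens found in text, in order of appearance."""
    return ENDNOTE_RE.findall(text)
```

Under this definition both [1] and [noam_chomsky97] would count as
endnotes, while bracketed text containing spaces would not.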

> Also, validation aside, I don't *use* a regular expression - I look for
> the right "shape" of paragraph (1 line, colon in it) and check what is
> to the left of the colon against the dictionary. From *my* point of view
> the legitimate characters idea only comes in with a validation phase (of
> course, it would be different for Edward).

This may be different if you want [this to not be an endnote].
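
For comparison, the shape-based label check described in the quoted
text might be sketched roughly as follows. KNOWN_LABELS and
paragraph_label are hypothetical names, and the label set is invented
for the example; the real dictionary isn't given in this thread.

```python
# Hypothetical label dictionary, for illustration only.
KNOWN_LABELS = {"Arguments", "Returns", "Raises"}

def paragraph_label(paragraph):
    """Return the label if the paragraph has the label shape, else None.

    "Label shape" here means: exactly one line, containing a colon,
    where the text left of the first colon is in the dictionary.
    """
    lines = paragraph.strip().splitlines()
    if len(lines) != 1 or ":" not in lines[0]:
        return None
    candidate = lines[0].split(":", 1)[0].strip()
    return candidate if candidate in KNOWN_LABELS else None
```

No regular expression is involved, so the question of legal
characters only arises if the label is validated separately.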

> > Basically re defines '\w' = '[0-9a-zA-Z_]
> 
> Erm - basically it doesn't - it invokes "locales" which makes life more
> complex (and I have no idea what sre does about '\w').

If the LOCALE and UNICODE flags aren't used when compiling a regexp,
\w = [a-zA-Z0-9_] (at least according to "the Python library
reference manual for re":<http://www.python.org/doc/current/lib/re-syntax.html>).
Furthermore, \w will always match '_', regardless of LOCALE and
UNICODE (again, according to the ref. manual).
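
A quick way to check this, as a sketch only. It assumes Python 3,
where str patterns are Unicode-aware by default, so the re.ASCII flag
is needed to get the narrow [a-zA-Z0-9_] meaning of \w. The helper
names are invented for the example.

```python
import re

# Assumption: Python 3. re.ASCII restricts \w to [a-zA-Z0-9_];
# without it, \w on a str pattern also matches non-ASCII letters.
ASCII_WORD = re.compile(r"\w+", re.ASCII)
UNICODE_WORD = re.compile(r"\w+")

def is_ascii_word(token):
    """True if the whole token consists of ASCII \\w characters."""
    return ASCII_WORD.fullmatch(token) is not None

def is_unicode_word(token):
    """True if the whole token consists of Unicode \\w characters."""
    return UNICODE_WORD.fullmatch(token) is not None
```

Either way '_' is accepted; the two only differ on characters
outside the ASCII range.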

-Edward