Useful RE patterns (was: Variable Interpolation - status of PEP 215)
Mike C. Fletcher
mcfletch at rogers.com
Tue Jul 2 15:29:54 EDT 2002
Fredrik, it would be nice to see the list you collect (not just the
selected final entries).
I'm actually doing something very similar for SimpleParse (pre-built
parsers for common constructs that can automatically be included in your
grammars).
<plug>The library feature will appear in SimpleParse 2.0 (now under
development). Watch for more details in the weeks ahead, or join in
the development effort...</plug>
So far I have:
int
hex
float
number := hex/float/int
double quoted string
single quoted string
string := (dqs/sqs)
semi_colon_comment (ini-files and the like)
hash_comment (python-style)
slashslash_comment (C++ // comments)
slashbang_comment (C /* */ non-nesting comments)
slashbang_nest_comment (as previous, but allows nesting)
(and common character classes as well, but RE has those already)
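For comparison, here is a rough sketch of what the numeric entries above might look like as plain RE patterns in Python. The pattern names and the classify() helper are made up for illustration and are not taken from SimpleParse; the alternation order mirrors number := hex/float/int, so hex is tried first and int last.

```python
import re

# Hypothetical RE equivalents of the int/hex/float entries listed above.
INT = r"[+-]?\d+"
HEX = r"[+-]?0[xX][0-9a-fA-F]+"
FLOAT = r"[+-]?(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?|[+-]?\d+[eE][+-]?\d+"

# Ordered choice: hex, then float, then int (number := hex/float/int).
NUMBER = re.compile(
    r"(?P<hex>%s)|(?P<float>%s)|(?P<int>%s)" % (HEX, FLOAT, INT)
)

def classify(text):
    """Return (kind, matched_text) for a number at the start of text, or None."""
    m = NUMBER.match(text)
    if m is None:
        return None
    # lastgroup names whichever alternative actually matched.
    return m.lastgroup, m.group()
```

The order matters: with int first, "0x1F" would match as the int "0" and "3.14" as the int "3".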
Common ones I'm thinking of adding:
Identifiers (e.g. Python, XML, HTML, C, filenames, URIs)
Dates (with a decent selection of formats, suitable for human data entry
processing)
Times (again, a number of formats)
Display-formatted numbers (e.g. 200,000.00 or 200 000,00 or (200,00);
locale-specific by default, possibly offering a few common international
formats)
Common units of measurement (SI units only, or maybe Imperial as well?
Either way, the types would be something like unit_weight, unit_distance
or unit_energy; parsers would then define expression,unit to require a
unit, or expression,unit? to provide a default unit)
Imaginary numbers (under numbers, i or j forms)
Monetary values (locale-specific base, possibly with a "world" version to
allow for parsing Pounds, Francs, Euros, Yen, Dollars etc. without
needing to switch locales, support for surrounding brackets meaning
negative, those kinds of things :) ).
IP Addresses
Dotted identifiers
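To give an idea of the scale of a few of the planned entries, here is a minimal RE sketch in Python of IP addresses (IPv4 only), dotted identifiers, and one display-number format. Everything here is an assumption for illustration: the names are hypothetical, the identifier rule is ASCII-only, and the number format is fixed at commas for thousands, "." for the decimal point and surrounding brackets for negative, rather than being locale-driven.

```python
import re

# One IPv4 octet, 0-255, with no leading zeros.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
IPV4 = re.compile(r"%s(?:\.%s){3}\Z" % (OCTET, OCTET))

# ASCII identifiers joined by single dots, e.g. os.path.join
IDENT = r"[A-Za-z_][A-Za-z0-9_]*"
DOTTED = re.compile(r"%s(?:\.%s)*\Z" % (IDENT, IDENT))

# Comma-grouped number; a closing bracket is required only if one was opened.
GROUPED = re.compile(
    r"(?P<neg>\()?(?P<int>\d{1,3}(?:,\d{3})*)(?:\.(?P<frac>\d+))?(?(neg)\))\Z"
)

def parse_grouped(text):
    """Return the float value of a grouped number, or None if it doesn't match."""
    m = GROUPED.match(text)
    if m is None:
        return None
    digits = m.group("int").replace(",", "")
    value = float(digits + "." + (m.group("frac") or "0"))
    return -value if m.group("neg") else value
```

A locale-aware version would have to build the separator characters into the pattern at runtime instead of hard-coding them.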
Higher-level constructs under consideration:
Mathematical expressions
Lists, tuples, dicts (not sure how to make this generic without requiring
a specific name for key/value expressions)
Possible Python-specific additions (seen in tokenize.py for your purposes):
Calling/parameter-lists (definition and use)
Triple quoted strings (under strings)
SGML/XML/HTML-specific, thinking of including them as
simpleparse/common/sgml.py:
Identifier
Tag
Attribute
Comment
Entity References
Processing instruction
Various DTD elements (not sure if worth the trouble)
For most of those you could probably find RE versions in various modules
of the standard library (after all, they're common :) ).
Enjoy,
Mike
Michael Hudson wrote:
> Norman Shelley <Norman_Shelley-RRDN60 at email.sps.mot.com> writes:
>
>
>>Fredrik Lundh wrote:
>>
>>
>>>...
>>>If I were to add a dozen (or so) patterns to the (S)RE module,
>>>what should I pick? What patterns do you find yourself using
>>>over and over again?
>>
>>All kinds of numerics, e.g. scientific (1e-6, 2e6, ...) and
>>engineering (1u, 2M and/or 2MEG, ...) notation.
>>
>>Python identifiers as previously mentioned.
>
>
> Well, *they're* already in the tokenize module.
>
> Cheers,
> M.
>
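The scientific and engineering notations Norman mentions could be covered by something along these lines. This is only a sketch: the suffix table is an assumption (the thread gives just 1u, 2M and 2MEG as examples), and M is treated as mega here, whereas some tools read it as milli. The names ENG, SUFFIXES and parse_eng are hypothetical.

```python
import re

# Assumed multipliers; "M" is taken to mean mega, same as "MEG".
SUFFIXES = {
    "MEG": 1e6, "G": 1e9, "M": 1e6, "k": 1e3,
    "m": 1e-3, "u": 1e-6, "n": 1e-9, "p": 1e-12,
}

# A plain or scientific number, optionally followed by one suffix.
# "MEG" is listed before the single letters so it wins over "M".
ENG = re.compile(
    r"(?P<num>[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)"
    r"(?P<suffix>MEG|[GMkmunp])?\Z"
)

def parse_eng(text):
    """Return the float value of text such as '1u', '2MEG' or '1e-6', or None."""
    m = ENG.match(text)
    if m is None:
        return None
    # A missing suffix means a multiplier of 1.
    return float(m.group("num")) * SUFFIXES.get(m.group("suffix"), 1.0)
```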
--
_______________________________________
Mike C. Fletcher
http://members.rogers.com/mcfletch/