Useful RE patterns (was: Variable Interpolation - status of PEP 215)

Mike C. Fletcher mcfletch at rogers.com
Tue Jul 2 15:29:54 EDT 2002


Fredrik, it would be nice to see the list you collect (not just the
selected final entries).

I'm actually doing something very similar for SimpleParse (pre-built 
parsers for common constructs that can automatically be included in your 
grammars).

<plug>The library feature will appear in SimpleParse 2.0 (now under 
development).  Watch for more details in the weeks ahead, or join in 
the development effort...</plug>

So far I have:
	int
	hex
	float
	number := hex/float/int
	double quoted string
	single quoted string
	string := (dqs/sqs)
	semi_colon_comment (ini-files and the like)
	hash_comment (python-style)
	slashslash_comment (C++ // comments)
	slashbang_comment (C /* */ non-nesting comments)
	slashbang_nest_comment (as previous, but allows nesting)

	(and common character classes as well, but RE has those already)
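
For comparison, here is a rough sketch of how the non-nesting entries
above might look as plain `re` patterns. The names mirror my list, but
the pattern definitions are illustrative assumptions, not the actual
SimpleParse productions (the nesting comment is left out, since a plain
RE can't count nesting depth):

```python
import re

# Illustrative patterns only: the names follow the list above, but the
# definitions are assumptions, not taken from SimpleParse.
INT = r"[+-]?\d+"
HEX = r"[+-]?0[xX][0-9a-fA-F]+"
FLOAT = r"[+-]?(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?|[+-]?\d+[eE][+-]?\d+"
# Order matters: hex and float are tried before int (number := hex/float/int),
# otherwise "0x1F" would match as the int "0".
NUMBER = r"(?:%s)|(?:%s)|(?:%s)" % (HEX, FLOAT, INT)

# Quoted strings with backslash escapes, confined to a single line.
DQS = r'"(?:[^"\\\n]|\\.)*"'
SQS = r"'(?:[^'\\\n]|\\.)*'"
STRING = r"(?:%s)|(?:%s)" % (DQS, SQS)

# Comments running to end of line.
SEMI_COLON_COMMENT = r";[^\n]*"
HASH_COMMENT = r"#[^\n]*"
SLASHSLASH_COMMENT = r"//[^\n]*"
# Non-nesting C comment; compile or match with re.DOTALL.
SLASHBANG_COMMENT = r"/\*.*?\*/"


def leading_match(pattern, text, flags=0):
    """Return the text matched at the start of `text`, or None."""
    m = re.match(pattern, text, flags)
    return m.group(0) if m else None
```

For example, `leading_match(NUMBER, "0x1F + 2")` picks up the whole hex
literal rather than stopping at the leading zero.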


Common ones I'm thinking of adding:
	Identifiers (e.g. Python, XML, HTML, C, filenames, URIs)
	Dates (with a decent selection of formats, suitable for processing
human data entry)
	Times (again, a number of formats)
	Display-formatted numbers (e.g.  200,000.00 or 200 000,00 or (200,00) ; 
locale specific by default, possibly offering a few common international 
formats)
	Common units of measurement (SI units only? or maybe Imperial as well. 
Anyway, type would be something like: unit_weight or unit_distance or 
unit_energy (parsers would then define expression,unit to require a unit 
or expression,unit? to provide a default unit))
	Imaginary numbers (under numbers, i or j forms)
	Monetary values (locale-specific base, possibly with a "world" version to 
allow for parsing Pounds, Francs, Euros, Yen, Dollars, etc. without 
needing to switch locales, support for surrounding brackets meaning 
negative, those kinds of things :) ).
	IP Addresses
	Dotted identifiers
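
A few of these are simple enough to sketch as REs already. The patterns
below are hypothetical illustrations: they assume IPv4 for "IP
Addresses", ASCII Python-style names for dotted identifiers, and a
single fixed display format (comma for thousands, dot for decimals)
rather than anything locale-aware. `parse_grouped` is a made-up helper
name.

```python
import re

# Hypothetical patterns for illustration; not SimpleParse definitions.

# IPv4 only: four decimal octets in the range 0-255.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
IPV4 = r"%s(?:\.%s){3}" % (OCTET, OCTET)

# ASCII identifiers joined by single dots, e.g. os.path.join
IDENTIFIER = r"[A-Za-z_][A-Za-z0-9_]*"
DOTTED_IDENTIFIER = r"%s(?:\.%s)*" % (IDENTIFIER, IDENTIFIER)

# One fixed display format: comma thousands separator, dot decimal point.
GROUPED_NUMBER = r"\d{1,3}(?:,\d{3})*(?:\.\d+)?"


def parse_grouped(text):
    """Return the float value of a number such as '200,000.00', or None."""
    if re.fullmatch(GROUPED_NUMBER, text) is None:
        return None
    return float(text.replace(",", ""))
```

A locale-specific version would have to swap the separators (and handle
the bracketed-negative form) instead of hard-coding them as done here.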


Higher-level constructs under consideration:
	Mathematical expressions
	Lists, tuples, dicts (not sure how to make this generic without requiring 
a specific name for key/value expressions)


Possible Python-specific additions (seen in tokenize.py for your purposes):
	Calling/parameter-lists (definition and use)
	Triple quoted strings (under strings)


SGML/XML/HTML-specific, thinking of including them as
simpleparse/common/sgml.py:
	Identifier
	Tag
	Attribute
	Comment
	Entity References
	Processing instruction
	Various DTD elements (not sure if worth the trouble)


For most of those you could probably find RE versions in various modules 
of the standard library (after all, they're common :) ).

Enjoy,
Mike

	

Michael Hudson wrote:
 > Norman Shelley <Norman_Shelley-RRDN60 at email.sps.mot.com> writes:
 >
 >
 >>Fredrik Lundh wrote:
 >>
 >>
 >>>...
 >>>If I were to add a dozen (or so) patterns to the (S)RE module,
 >>>what should I pick?  What patterns do you find yourself using
 >>>over and over again?
 >>
 >>All kinds of numerics, e.g. scientific (1e-6, 2e6, ...) and engineering
 >>(1u, 2M and/or 2MEG, ...) notation.
 >>
 >>Python identifiers as previously mentioned.
 >
 >
 > Well, *they're* already in the tokenize module.
 >
 > Cheers,
 > M.
 >


-- 
_______________________________________
    Mike C. Fletcher
    http://members.rogers.com/mcfletch/







