[Tutor] Parsing problem

Paul McGuire paul at alanweberassociates.com
Mon Jul 25 05:11:43 CEST 2005


Liam -

The two arguments to Word work this way:
- the first argument lists valid *initial* characters
- the second argument lists valid *body* or subsequent characters

For example, in the identifier definition, 

identifier = pp.Word(pp.alphas, pp.alphanums + "_/:.")

identifiers *must* start with an alphabetic character, which may then be
followed by zero or more alphanumeric characters or any of "_", "/", ":"
and ".".  If only one argument is supplied, the same string of characters
is used for both the initial and the body characters.  Identifiers are a
typical use for two-argument Words, as they often start with alphas but
then accept digits and other punctuation.  No whitespace is permitted
within a Word.  Matching of a Word ends when a non-body character is seen.
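
As a small sketch of the initial/body rule, the snippet below tries the
identifier definition on three strings.  The helper name matched and the
sample strings are made up for illustration; only Word, alphas, alphanums
and parseString come from pyparsing.

```python
import pyparsing as pp

# First argument: allowed initial characters; second: allowed body characters.
identifier = pp.Word(pp.alphas, pp.alphanums + "_/:.")

def matched(expr, text):
    """Return the leading text matched by expr, or None if there is no match.

    Hypothetical helper, not part of pyparsing.
    """
    try:
        return expr.parseString(text)[0]
    except pp.ParseException:
        return None

full = matched(identifier, "path/to:file.txt")  # all characters are body chars
no_match = matched(identifier, "9lives")        # a digit cannot start the Word
stops = matched(identifier, "abc def")          # matching ends at the space

print(full, no_match, stops)
```

Without parseAll=True, parseString is satisfied with a match at the start
of the string, which is why the third case returns only the part before the
space.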

Using this definition:

integer = pp.Word(pp.nums+"-+.", pp.nums)

This will accept "+123", "-345", "678", and ".901".  But in a real number, a
period may occur anywhere in the number, not just as the initial character,
as in "3.14159".  So your body characters must also include a ".", as in:

integer = pp.Word(pp.nums+"-+.", pp.nums+".")
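
To see the difference between these two definitions, here is a rough
comparison.  The names first_try, second_try and accepts are invented for
this sketch; parseAll=True is used so that leftover input counts as a
failure.

```python
import pyparsing as pp

first_try = pp.Word(pp.nums + "-+.", pp.nums)         # "." only as first char
second_try = pp.Word(pp.nums + "-+.", pp.nums + ".")  # "." allowed anywhere

def accepts(expr, text):
    """True if expr matches the whole of text (hypothetical helper)."""
    try:
        expr.parseString(text, parseAll=True)
        return True
    except pp.ParseException:
        return False

samples = ("+123", ".901", "3.14159")
first_results = [accepts(first_try, s) for s in samples]
second_results = [accepts(second_try, s) for s in samples]

print(first_results)
print(second_results)
```

The first definition stops after the "3" in "3.14159", so the whole-string
match fails; the second one consumes the entire string.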

Let me say, though, that this is a very permissive definition of integer.
For one thing, we really should rename it to something like "number", since
it now accepts non-integers as well!  Also, there is no restriction on how
often a body character may appear, so this definition would accept a
"number" that looks like "3.4.3234.111.123.3234".  If you are certain that
you will only receive valid inputs, then this simple definition will be
fine.  But if you have to handle and reject erroneous inputs, then you might
do better with a number definition like:

point = pp.Literal(".")   # if not already defined in your grammar
number = pp.Combine( pp.Word( "+-"+pp.nums, pp.nums ) +
                     pp.Optional( point + pp.Optional( pp.Word( pp.nums ) ) ) )

This will handle "+123", "-345", "678", and "0.901", but not ".901".  If you
want to accept numbers that begin with a ".", then you'll need to tweak this
a bit further.
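
Here is a sketch that checks these claims and shows one possible tweak for
a leading ".".  It assumes point is Literal("."), which the post does not
show.  The names permissive, leading_point, number_or_fraction and accepts
are made up for illustration, and the tweak is only one of several ways to
do it.

```python
import pyparsing as pp

point = pp.Literal(".")  # assumed definition of point

permissive = pp.Word(pp.nums + "-+.", pp.nums + ".")
number = pp.Combine(pp.Word("+-" + pp.nums, pp.nums)
                    + pp.Optional(point + pp.Optional(pp.Word(pp.nums))))

# Possible tweak (an assumption, not from the post): add an alternative
# that starts with the point and requires digits after it.
leading_point = pp.Combine(point + pp.Word(pp.nums))
number_or_fraction = number | leading_point

def accepts(expr, text):
    """True if expr matches the whole of text (hypothetical helper)."""
    try:
        expr.parseString(text, parseAll=True)
        return True
    except pp.ParseException:
        return False

bad = "3.4.3234.111.123.3234"
print(accepts(permissive, bad), accepts(number, bad))
print(accepts(number, "0.901"), accepts(number, ".901"))
print(accepts(number_or_fraction, ".901"))
print(accepts(number, "+"))  # a bare sign is a valid one-character Word
```

Because the sign is listed among the initial characters of the Word, a bare
"+" or "-" also matches this definition.  If that matters for your input,
the sign would need to be split out into its own Optional expression.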

One last thing: you may want to start using setName() on some of your
expressions, as in:

number = pp.Combine( pp.Word( "+-"+pp.nums, pp.nums ) +
                     pp.Optional( point + pp.Optional( pp.Word( pp.nums ) ) )
                   ).setName("number")

Note, this is *not* the same as setResultsName.  Here setName attaches a
name to the pattern itself, so that when the pattern appears in an
exception, the name is used instead of an encoded pattern string (such as
W:012345...).  There is no need to do this for Literals, since the literal
string itself is used when it appears in an exception.
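
A minimal sketch of the difference, using a plain Word of digits rather
than the full number expression.  The names unnamed, named, tagged and
failure_message are invented here, and the exact wording of the exception
text varies between pyparsing versions, so the messages are printed rather
than quoted.

```python
import pyparsing as pp

unnamed = pp.Word(pp.nums)
named = pp.Word(pp.nums).setName("integer")       # name used in error messages
tagged = pp.Word(pp.nums).setResultsName("value") # name used to fetch results

def failure_message(expr, text):
    """Return the ParseException text for text, or None if parsing succeeds.

    Hypothetical helper, not part of pyparsing.
    """
    try:
        expr.parseString(text)
    except pp.ParseException as err:
        return str(err)
    return None

unnamed_msg = failure_message(unnamed, "abc")
named_msg = failure_message(named, "abc")
print(unnamed_msg)  # mentions an encoded pattern string
print(named_msg)    # mentions "integer" instead

# setResultsName does something different: it labels the matched token.
value = tagged.parseString("42")["value"]
print(value)
```

Newer pyparsing releases also offer snake_case spellings such as set_name
and parse_string; the camelCase forms are used here to match the rest of
this post.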

-- Paul



