[Tutor] How to match strange characters

Mon Sep 8 16:09:11 CEST 2008

Instead of trying to match the weird characters in order to remove them,
here is a pyparsing program that ignores those header lines and extracts
just the interesting data for each section.

In a pyparsing program, you start by defining what patterns you want to look
for.  This is similar to the re module, but uses friendlier names like
OneOrMore, Group, and Combine instead of special characters that require
backslashes and so on.  By default, pyparsing skips over whitespace between
expressions, so we use Combine to override this (as in realnum, in which we
want to match "3.1415", but not "3 . 1415").
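
To make the Combine point concrete, here is a small sketch (not part of the
program itself) that compares the same expression with and without Combine.
The names loose_realnum, loose_tokens, tight_tokens and spaced_matched are
made up for this illustration:

```python
from pyparsing import Combine, Word, nums, ParseException

# without Combine: whitespace between the pieces is skipped, and the
# pieces come back as separate tokens
loose_realnum = Word(nums) + "." + Word(nums)

# with Combine: the pieces must be adjacent, and come back as one string
realnum = Combine(Word(nums) + "." + Word(nums))

loose_tokens = loose_realnum.parseString("3 . 1415").asList()
tight_tokens = realnum.parseString("3.1415").asList()
print(loose_tokens)   # ['3', '.', '1415']
print(tight_tokens)   # ['3.1415']

# the Combine'd version refuses the spaced-out form
try:
    realnum.parseString("3 . 1415")
    spaced_matched = True
except ParseException:
    spaced_matched = False
print(spaced_matched)   # False
```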

Here is the opening part of the program, which defines the basic bits in
your data file and the input parameter prompts:

from pyparsing import Combine, Word, nums, Literal, Group, oneOf, OneOrMore

# define basic expressions
realnum = Combine(Word(nums) + "." + Word(nums))
two_digit_num = Word(nums,exact=2)
four_digit_num = Word(nums,exact=4)
date = Combine(two_digit_num + '-' + two_digit_num + '-' + four_digit_num)
timestamp = Combine(two_digit_num + ':' + two_digit_num + ':' + 
                    two_digit_num + '.' + two_digit_num)

# literal prompt strings
enter_date = Literal("Enter Date for Precession as (MM-DD-YYYY) "
                     "or C/R for ")
enter_catalog = Literal("Enter the Catalog Name or C/R for CATALOG.SRC >")
the_julian_date_is = Literal("The Julian Date is =")

# build up the header definition
enter_date_line = enter_date + date + ">"
julian_date_line = the_julian_date_is + realnum("julian_date")
header = Group(enter_date_line + date("date") + 
                enter_catalog + julian_date_line)

This next part uses a similar style to define the format of the lines of data.

# build up the definition for a line of data
field_1 = Word(nums,exact=4) + "+" + Word(nums,exact=3)
field_2 = realnum
field_3 = Combine(oneOf("+ -") + realnum)
field_4 = timestamp
field_5 = timestamp
# change the results names as appropriate - I just made these up
data_line = Group( field_1("fld1") + field_2("magnitude") + 
        field_3("phase") + field_4("start_time") + field_5("end_time") )

I guessed at/made up names for the fields in the data_line ("fld1",
"magnitude", etc.).  You should change these to names that make sense in
your application.
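
In case the expr("name") syntax looks odd: calling an expression with a
string is shorthand for expr.setResultsName("name"). Here is a minimal
sketch of how such names behave, using a made-up two-field line (the
names entry, label and count are just for illustration):

```python
from pyparsing import Word, alphas, nums

# hypothetical line format: a word followed by a number
entry = Word(alphas)("label") + Word(nums)("count")

result = entry.parseString("widgets 42")
print(result.label)           # attribute-style access
print(result["count"])        # dict-style access
print(sorted(result.keys()))  # the results names that were matched
```

Renaming a field only means changing the string inside the parentheses;
the parsing itself is not affected.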

Now a final definition that puts everything together:

# put everything together into a PRECESS run header+data section
section = header("header") + OneOrMore(data_line)("data")

And now use section.scanString to locate all the matching data in your input
file:

test = """
??????????????????????????????????????????????
? Radio Source Precession Program ?
? by John B. Doe ?
? 31 August 1992 ?
??????????????????????????????????????????????
Enter Date for Precession as (MM-DD-YYYY) or C/R for 05-28-2004 > 
05-28-2004
Enter the Catalog Name or C/R for CATALOG.SRC >
The Julian Date is = 2453153.5
0022+002 5.6564 +0.2713 00:22:37.54 00:16:16.65
0106+013 17.2117 +1.6052 01:08:50.80 01:36:18.58
"""

# use scanString to read through the input data - this will ignore the 
# parts of the header with the weird characters
for data_section, start, end in section.scanString(test):
    # each data_section returns the parsed results, which can be treated
    # like an object or a dict, using the results names for attribute names
    # or dict keys - the dump() method shows a structured output; keys(),
    # values(), and items() work just like in a dict
    print(data_section.dump())
    print(data_section.header.julian_date)
    # note the use of results name to access the "data" part
    for d in data_section.data:
        print(d.dump())
        print("   %s %s %s" % (d.start_time, d.end_time, d.phase))

Note how the results names are used to access the matched fields in the
input.

This creates the following output:
[['Enter Date for Precession as (MM-DD-YYYY) or C/R for ', '05-28-2004', ...
- data: [['0022', '+', '002', '5.6564', '+0.2713', '00:22:37.54', ...
- header: ['Enter Date for Precession as (MM-DD-YYYY) or C/R for ', ...
  - date: 05-28-2004
  - julian_date: 2453153.5
2453153.5
['0022', '+', '002', '5.6564', '+0.2713', '00:22:37.54', '00:16:16.65']
- end_time: 00:16:16.65
- fld1: ['0022', '+', '002']
- magnitude: 5.6564
- phase: +0.2713
- start_time: 00:22:37.54
   00:22:37.54 00:16:16.65 +0.2713
['0106', '+', '013', '17.2117', '+1.6052', '01:08:50.80', '01:36:18.58']
- end_time: 01:36:18.58
- fld1: ['0106', '+', '013']
- magnitude: 17.2117
- phase: +1.6052
- start_time: 01:08:50.80
   01:08:50.80 01:36:18.58 +1.6052
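
The reason the boxed banner never gets in the way is that scanString
simply steps past any text that does not match. Here is a stripped-down
sketch of that behavior, reusing the realnum definition from the program
on a made-up input string (text and found are names invented for this
example):

```python
from pyparsing import Combine, Word, nums

# same realnum definition as in the program above
realnum = Combine(Word(nums) + "." + Word(nums))

# made-up input: junk characters around the numbers we care about
text = "???? 5.6564 ?? noise ?? 17.2117 ????"

found = []
for tokens, start, end in realnum.scanString(text):
    # tokens holds the parsed result; start and end are offsets into text
    found.append(tokens[0])
print(found)
```

Only the two numbers are collected; the question marks and the word
"noise" are skipped without any pattern having to describe them.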

You can get the complete program at this pastebin URL:
http://pyparsing.pastebin.com/m6f0ae6bc

If you still want to use re's, then this program might still help you, at
least in laying out what your re's should match at different places in the
data.
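
As a rough starting point, here is how the data_line definition might
translate into a single re. This is only a sketch: it assumes the fields
are separated by whitespace, and the group names are the same made-up
ones used in the pyparsing version (data_line_re, sample and m are names
invented here):

```python
import re

# one data line: 4 digits '+' 3 digits, a real number, a signed real
# number, and two hh:mm:ss.ss timestamps, separated by whitespace
data_line_re = re.compile(r"""
    (?P<fld1>\d{4}\+\d{3})                    \s+
    (?P<magnitude>\d+\.\d+)                   \s+
    (?P<phase>[+-]\d+\.\d+)                   \s+
    (?P<start_time>\d{2}:\d{2}:\d{2}\.\d{2})  \s+
    (?P<end_time>\d{2}:\d{2}:\d{2}\.\d{2})
    """, re.VERBOSE)

sample = "0022+002 5.6564 +0.2713 00:22:37.54 00:16:16.65"
m = data_line_re.match(sample)
print(m.group("start_time"), m.group("end_time"), m.group("phase"))
```

Unlike the pyparsing version, this returns fld1 as one string
("0022+002") rather than three separate tokens.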

-- Paul