Legacy data parsing

Thomas Bartkus thomasbartkus at comcast.net
Fri Jul 8 16:03:45 EDT 2005


"gov" <Gov at mailinator.com> wrote in message
news:1120847474.604271.196220 at g49g2000cwa.googlegroups.com...
> Hi,
>
> I've just started to learn programming and was told this was a good
> place to ask questions :)
>
> Where I work, we receive large quantities of data which is currently
> all printed on large, obsolete, dot matrix printers.  This is a problem
> because the replacement parts will not be available for much longer.
>
> So I'm trying to create a program which will capture the fixed width
> text file data and convert as well as sort the data (there are several
> different report types) into a different format which would allow it to
> be printed normally, or viewed on a computer.

Text file data has no concept of "fixed width".  Somewhere in your system,
text file data is being thrown at your dot matrix printer.  It would seem a
trivial exercise to simply plug in a newer and probably inexpensive
replacement printer.

  What am I missing here?

> I've been reading up on the Regular Expression module and ways in which
> to manipulate strings however it has been difficult to think of a way
> in which to extract an address.
>
> Here's an example of the raw text that I have to work with:
>
<snip>

How are you intercepting this text data?
Are you replacing your old printer with a Python-speaking computer?
How will you deliver this data to your Python program?

> (the # = any number, and the X's are just regular text)
> I would like to extract the address information, but the two different
> text objects on the right hand side are difficult to remove.  I think
> it would be easier if I could just extract a fixed square of
> information, but I don't have a clue as to how to go about it.

Assuming you know how your Python code will "see" this data -

You would need no more than standard Python string handling to perform these
tasks.

There is no concept of a "fixed square" here.  This is a continuous stream
of (probably ASCII) characters. If you could pick the data up from a file,
you would use readlines() to build a list of individual lines.  If you were
picking the data from a serial port, you might assemble the whole thing into
one big string and use split('\n') to build your list of lines.
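
A rough sketch of both cases, in case it helps.  The helper names and the
sample text are made up for illustration; I don't know how your data will
actually arrive.  I use splitlines() rather than split('\n') here because
printer output often ends lines with '\r\n'.

```python
def lines_from_file(path):
    """Return the lines of a text file as a list, without line endings."""
    with open(path, "r") as handle:
        return handle.read().splitlines()


def lines_from_stream(text):
    """Split one big string (say, assembled from a serial port) into lines."""
    # splitlines() breaks on '\n', '\r\n' and '\r' (and also on form feeds),
    # and it drops the line-ending characters from each line.
    return text.splitlines()


# Hypothetical captured data, only to show the shape of the result.
captured = "line one\r\nline two\r\n\r\nline four\r\n"
print(lines_from_stream(captured))
```

Either way you end up with a plain list of strings, one per printed line,
with the empty lines kept as empty strings.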

Once you had a full record (print page?) as a list of individual lines you
could identify each line by its position in the list *if*, as is likely,
each item arrives at the same line position.  If not, your code can read
each line and test.  For example:
The line
"#######"
Seems to immediately precede several address lines
"                                        MRS XXX X          XXXXXXX"
"                                        #####"
"                                        ####"
"                                        ###-###-#"

If you can rely on this you would know that the line "#######" is
immediately followed by several lines of an address - up until the empty
line.  And you can look at each of those address lines and use strip() to
remove leading and trailing blanks.
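
Something along these lines might do it.  This is only a sketch: the
function names are mine, the sample page is invented, and I am assuming the
marker line is exactly seven digits, which you would have to check against
your real reports.

```python
def is_seven_digits(line):
    """Assumed marker test: the line holds exactly seven digits."""
    text = line.strip()
    return len(text) == 7 and text.isdigit()


def block_after(lines, is_marker):
    """Return the stripped lines that follow the first marker line,
    stopping at the first empty line (or at the end of the list)."""
    for index, line in enumerate(lines):
        if is_marker(line):
            block = []
            for following in lines[index + 1:]:
                if not following.strip():
                    break  # an empty line ends the address
                block.append(following.strip())
            return block
    return []  # no marker line found


# Invented sample page; real data will differ.
page = [
    "1234567",
    "                                        MRS XXX X          XXXXXXX",
    "                                        12345",
    "                                        123-456-7",
    "",
    "  LANG: XX",
]
print(block_after(page, is_seven_digits))
```

Note that strip() only removes the blanks at the two ends, so the extra text
on the right-hand side of the first address line is still there and would
need separate handling.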

Similarly, the line that begins "  LANG:" would seem to immediately precede
another address.
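
The same idea, sketched for the "LANG:" case.  Again the function name and
the sample lines are hypothetical, and I am assuming the address ends at the
next empty line, as it appears to in the first case.

```python
from itertools import takewhile


def address_after_lang(lines):
    """Return the stripped lines following the first line that begins with
    'LANG:' (ignoring leading blanks), up to the next empty line."""
    for index, line in enumerate(lines):
        if line.lstrip().startswith("LANG:"):
            rest = lines[index + 1:]
            # takewhile stops at the first line that is empty after stripping
            return [item.strip() for item in takewhile(lambda s: s.strip(), rest)]
    return []  # no 'LANG:' line found


# Invented sample lines; real data will differ.
page = [
    "  LANG: XX",
    "          MR XXX",
    "          12 XXXX ST",
    "",
    "XXXX",
]
print(address_after_lang(page))
```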

None of this is particularly difficult with standard Python.
But then - if we are merely replacing an old printer -

We are already working way too hard!
Thomas Bartkus







