Legacy data parsing
Thomas Bartkus
thomasbartkus at comcast.net
Fri Jul 8 16:03:45 EDT 2005
"gov" <Gov at mailinator.com> wrote in message
news:1120847474.604271.196220 at g49g2000cwa.googlegroups.com...
> Hi,
>
> I've just started to learn programming and was told this was a good
> place to ask questions :)
>
> Where I work, we receive large quantities of data which is currently
> all printed on large, obsolete, dot matrix printers. This is a problem
> because the replacement parts will not be available for much longer.
>
> So I'm trying to create a program which will capture the fixed width
> text file data and convert as well as sort the data (there are several
> different report types) into a different format which would allow it to
> be printed normally, or viewed on a computer.
Text file data has no concept of "fixed width". Somewhere in your system,
text file data is being thrown at your dot matrix printer. It would seem a
trivial exercise to simply plug in a newer and probably inexpensive
replacement printer.
What am I missing here?
> I've been reading up on the Regular Expression module and ways in which
> to manipulate strings however it has been difficult to think of a way
> in which to extract an address.
>
> Here's an example of the raw text that I have to work with:
>
<snip>
How are you intercepting this text data?
Are you replacing your old printer with a Python-speaking computer?
How will you deliver this data to your Python program?
> (the # = any number, and the X's are just regular text)
> I would like to extract the address information, but the two different
> text objects on the right hand side are difficult to remove. I think
> it would be easier if I could just extract a fixed square of
> information, but I don't have a clue as to how to go about it.
Assuming you know how your Python code will "see" this data -
You would need no more than standard Python string handling to perform these
tasks.
There is no concept of a "fixed square" here. This is a continuous stream
of (probably ASCII) characters. If you could pick the data up from a file,
you would use readlines() to build a list of individual lines. If you were
picking the data up from a serial port, you might assemble the whole thing
into one big string and use split('\n') to build your list of lines.
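As a rough sketch of both routes to a list of lines (the helper names and
the sample text are made up for illustration, and an in-memory file stands
in for whatever actually delivers the data):

```python
import io


def lines_from_file(fileobj):
    """Return the lines of an open text file, minus their line endings."""
    return [line.rstrip("\r\n") for line in fileobj]


def lines_from_string(text):
    """Split one big string (e.g. assembled from a serial port) into lines."""
    return text.splitlines()


# Hypothetical sample data; a real program would open the captured file
# or read from the port instead.
sample = "first line\nsecond line\n"
file_lines = lines_from_file(io.StringIO(sample))
string_lines = lines_from_string(sample)
print(file_lines)
print(string_lines)
```

splitlines() is used here rather than a plain split on the newline
character because it also copes with "\r\n" line endings and does not
leave an empty string after a trailing newline.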
Once you had a full record (print page?) as a list of individual lines you
could identify each line by its position in the list *if*, as is likely,
each item arrives at the same line position. If not, your code can read
each line and test. For example:
The line
"#######"
Seems to immediately precede several address lines
" MRS XXX X XXXXXXX"
" #####"
" ####:"
" ###-###-#"
If you can rely on this you would know that the line "#######" is
immediately followed by several lines of an address - up until the empty
line. And you can look at each of those address lines and use strip() to
remove leading and trailing blanks.
Similarly, the line that begins " LANG:" would seem to immediately precede
another address.
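The marker-line idea above might look something like this. It is only a
sketch: the function name, the sample page and the two marker tests are
assumptions for illustration, since the real layout is known only from the
snipped sample.

```python
def address_after(lines, is_marker):
    """Collect the stripped lines that follow the first marker line,
    stopping at the first empty (or all-blank) line."""
    address = []
    collecting = False
    for line in lines:
        if collecting:
            if not line.strip():
                break  # the blank line ends the address block
            address.append(line.strip())
        elif is_marker(line):
            collecting = True
    return address


# Hypothetical page; the real reports will differ.
page = [
    "REPORT HEADER",
    "1234567",
    "    MRS ANN B EXAMPLE",
    "    12345",
    "",
    " LANG: EN",
    "    MR C D SAMPLE",
    "",
]

# Assumed marker tests: an all-digit line, and a line beginning " LANG:".
first = address_after(page, lambda line: line.strip().isdigit())
second = address_after(page, lambda line: line.startswith(" LANG:"))
print(first)
print(second)
```

Passing the marker test in as a function keeps the loop the same for both
kinds of address; only the test for "this is the line just before an
address" changes.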
None of this is particularly difficult with standard Python.
But then - if we are merely replacing an old printer -
We are already working way too hard!
Thomas Bartkus