Browsing text ; Python the right tool?

Wed Jan 26 18:59:35 EST 2005

Jeff Shannon wrote:
> John Machin wrote:
>
> > Jeff Shannon wrote:
> >
> >> [...]  If each record is CRLF terminated, then
> >>you can get one record at a time simply by iterating over the file
> >>("for line in open('myfile.dat'): ...").  You can have a dictionary
> >>of classes or factory functions, one for each record type, keyed off
> >>of the 2-character identifier.  Each class/factory would know the
> >>layout of that record type,
> >
> > This is plausible only under the condition that Santa Claus is paying
> > you $X per class/factory or per line of code, or you are so speed-crazy
> > that you are machine-generating C code for the factories.
>
> I think that's overly pessimistic.  I *was* presuming a case where the
> number of record types was fairly small, and the definitions of those
> records reasonably constant.  For ~10 or fewer types whose spec
> doesn't change, hand-coding the conversion would probably be quicker
> and/or more straightforward than writing a spec-parser as you suggest.

I didn't suggest writing a "spec-parser". No (mechanical) parsing is
involved. The specs that I'm used to dealing with set out the record
layouts in a tabular fashion. The only hassle is extracting that from an
MS Word document or a PDF.

>
> If, on the other hand, there are many record types, and/or those
> record types are subject to changes in specification, then yes, it'd
> be better to parse the specs from some sort of data file.

"Parse"? No parsing, and not much code at all: The routine to "load"
(not "parse") the layout from the layout.csv file into dicts of dicts
is only 35 lines of Python code. The routine to take an input line and
serve up an object instance is about the same. It does more than the
OP's browsing requirement already. The routine to take an object and
serve up a correctly formatted output line is only 50 lines of which
1/4 is comment or blank.
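
To give a rough idea of the approach (this is a minimal sketch, not the
routines described above): the column names in the layout table, the
field names, and the helpers load_layout() and split_record() are all
made up for illustration, and the real layout.csv may look quite
different.

```python
import csv
import io

# Hypothetical layout table: record type, field name, 1-based start, length.
LAYOUT_CSV = """rectype,field,start,length
A0,rectype,1,2
A0,account,3,8
A0,name,11,20
C1,rectype,1,2
C1,amount,3,9
"""

def load_layout(fileobj):
    """Load the layout table into a dict of dicts: {rectype: {field: slice}}."""
    layout = {}
    for row in csv.DictReader(fileobj):
        start = int(row["start"])
        length = int(row["length"])
        # Convert the spec's 1-based start/length to a 0-based slice, once.
        fields = layout.setdefault(row["rectype"], {})
        fields[row["field"]] = slice(start - 1, start - 1 + length)
    return layout

def split_record(line, layout):
    """Return {field: text} for one input line, keyed off the 2-char record type."""
    fields = layout[line[:2]]
    return {name: line[sl].strip() for name, sl in fields.items()}

layout = load_layout(io.StringIO(LAYOUT_CSV))
record = split_record("A000012345John Smith          \r\n", layout)
```

Adding a record type or moving a field then means editing a row of the
table, not the code.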

>
> The O.P. didn't mention anything either way about how dynamic the
> record specs are, nor the number of record types expected.

My reasoning: He did mention A0 and C1, so one could guess that he
maybe had at least 6 record types. Also, files used to "create printed
pages by an external company" (especially by a company that had
"leaseplan" in its e-mail address) would indicate "many" and
"complicated" to me.

> I suspect
> that we're both assuming a case similar to our own personal
> experiences, which are different enough to lead to different preferred
> solutions. ;)

Indeed. You seem to have led a charmed life; may the wizards and the
rangers ever continue to protect you from the dark riders! :-)

My personal experiences and attitudes: (1) Extreme aversion to having
to type (correctly) lots of numbers (column positions and lengths), and
to having to mentally translate start = 663, len = 13 to [662:675] or
having ugliness like [663-1:663+13-1]. (2) Cases like 17 record types
and 112 fields in one file, 8 record types and 86 fields in a second --
this being a new, relatively clean, simple exercise in exchanging files
with a government department. (3) Past history of this govt dept is that
there are at least another 7 file types in regular use, and they change
the _major_ version number of each file type about once a year on
average. (4) These things tend to start out deceptively small and simple
and turn into monsters.
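
For point (1), the translation from the spec's 1-based start and length
to a Python slice can be done in one place instead of in one's head. A
small sketch; field_slice() is a made-up helper name and the sample
line is invented:

```python
def field_slice(start, length):
    """Convert a spec's 1-based start column and length to a 0-based slice."""
    return slice(start - 1, start - 1 + length)

# start = 663, len = 13 corresponds to [662:675], i.e. [663-1:663+13-1].
sl = field_slice(663, 13)

# Invented sample line: 662 filler characters, a 13-character field, padding.
line = " " * 662 + "0123456789ABC" + " " * 10
value = line[sl]
```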

Cheers,
John