Best way to parse file into db-type layout?

John Machin sjmachin at lexicon.net
Tue May 3 17:34:34 EDT 2005


On Tue, 03 May 2005 19:12:07 GMT, Peter A. Schott
<paschott at no.yahoo.spamm.com> wrote:

>Thanks for the comments.  Quick notes:
>1.  Yes, ultimate goal is to output various fields into Excel,

Peter, (again) I'd like to suggest that putting some or all of the
data into storage of type X (whether X is a database or Excel or 5x3
index cards) is unlikely to be the true ultimate goal ...

> but I need to
>enter the data into a DB as well to avoid having any strange problems. That may
>be handled by a completely different process.  I should have been more exact in
>saying that my current goal is to get this data into a database.

Let me get this straight: Process 1 is going to put some fields into
Excel. Process 2 is going to put the data into a DB to avoid "any
strange problems" -- with what? The (completely different) process 1?

>
>3.  regarding read/write each row - I started thinking about working my way
>through the dictionary.  This file could have up to 1900 pairs of data per row.
>That's an awful lot of columns to put into a database table, especially if quite
>a few will be NULL.

Focussing (briefly) on the storage details, it's also an awful lot
compared to the maximum of 256 columns in Excel. Dragging ourselves back up
to the live situation that your data is modelling, it's also an awful
lot of attributes for any object (whether real or virtual) to have. It
must be possible to break each line down further. It's extremely
unlikely to be stored like that on the mainframe; it's just been spat
out like that to go to the letter-printing bureau. You may like to
consider talking to the mainframe folk about their data structures,
and how you can more or less reconstruct those structures from the
letter file.
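
Just to illustrate (the field-number ranges below are pure invention
-- the real ones have to come from the mainframe data structures),
regrouping a parsed line into sub-records is only a few lines of
Python:

# Hypothetical layout: which field numbers form each logical
# sub-record.  These ranges are made up for illustration.
LAYOUT = {
    "customer": range(1, 10),
    "address":  range(10, 15),
    "history":  range(15, 40),
}

def regroup(fields):
    """Split one parsed line ({field_no: value}) into named
    sub-records."""
    records = {}
    for name, field_nos in LAYOUT.items():
        sub = {}
        for no in field_nos:
            if no in fields:
                sub[no] = fields[no]
        if sub:
            records[name] = sub
    return records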

>  I started thinking of stepping through each pair by each
>row.  The rows are identified by Letter type and an account number (always the
>same field #).  I would want to use those, and insert something like:
>LetterType, Account #, Field#, Field Data, (current date)

This method of representing the data is close to the other extreme,
compared to the 1900 columns per row. With this method, the five or so
components of the customer address (for example) will be in different
rows in the database i.e. not grouped together with a handle that says
"address". It doesn't appear to me to be any more usable than the
letter file.

If the customer can have more than one letter of the same type on the
same date, you will also need to store something that uniquely
identifies the letter e.g. the line number in the file; otherwise the
data from multiple letters will be mingled. Even if "they" say it's
not possible, and you believe them, it might be a good idea anyway --
it might make it easier than working with just (letter_type,
acct_num) when you are trying to glue the fragments back together.
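
Here's a sketch of the sort of thing I mean, with the line number in
the key. sqlite3, the tab delimiter, and the positions assumed for
letter type and account number are all guesses for illustration;
adjust them to suit your actual file:

import sqlite3

conn = sqlite3.connect("letters.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS letter_field (
        line_no     INTEGER,  -- identifies one letter in the file
        letter_type TEXT,
        acct_num    TEXT,
        field_no    INTEGER,
        field_data  TEXT,
        file_date   TEXT,     -- effective date of the file
        PRIMARY KEY (line_no, field_no)
    )""")

def load(path, file_date, type_field=1, acct_field=2):
    for line_no, line in enumerate(open(path), 1):
        # assumed format: alternating field#, value, field#, value ...
        items = line.rstrip("\n").split("\t")
        pairs = dict((int(items[i]), items[i + 1])
                     for i in range(0, len(items), 2))
        rows = [(line_no, pairs[type_field], pairs[acct_field],
                 no, val, file_date)
                for no, val in pairs.items()]
        conn.executemany(
            "INSERT INTO letter_field VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()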

If it's not possible because the 1900 fields = something like 100
fields x an allowance of up to 19 repetitions of a similar scenario,
then you know what to do next :-)
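
For example, if the repetitions really are laid out as consecutive
blocks of 100 fields (my assumption), divmod hands you back the
repetition number and the base field:

BLOCK_SIZE = 100  # assumed: 1900 fields = 19 repetitions of 100

def split_field(field_no):
    """Map a raw field number (1..1900) to (repetition, base_field)."""
    repetition, offset = divmod(field_no - 1, BLOCK_SIZE)
    return repetition + 1, offset + 1

# split_field(1)    -> (1, 1)
# split_field(101)  -> (2, 1)    same logical field, 2nd occurrence
# split_field(1900) -> (19, 100)

The base field could then feed straight into the regrouping above,
and (repetition, base_field) replaces the raw field number in the
table key.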

You probably mean "effective date as at which the file was produced"
rather than "current date". Only one file per day? There may be a
backlog when you start this exercise, and another in the future after
a disruption in the pipeline. You may like to check if
the mainframe assigns a unique name/number to such files and use that
instead as part of your composite key.

> No Quotes (no
>Commas in values).  Coming from the mainframe, I don't have to worry about
>embedded special characters within the file as they can't be entered into these
>fields to start with.

Yes, peace in our time, the cheque's in the mail, and of course
they'll still love you in the morning :-)

>Thanks to all.  Off to search for dictionaries and databases now.  :-)

Searching for dictionaries???

As we are by now way OT (even by this group's standards), I'll just
leave you with my previous comments:
>> 
>> What you need is (a) a clear appreciation of what you are trying to do
>> with the data at a high level (b) then develop an understanding of
>> what is the underlying data model (c) then and only then worry about
>> technical details.
>> 

It's been nice kibitzing your game.
Lots of luck,
John
