Legacy data parsing

Fri Jul 8 18:35:22 EDT 2005

On 8 Jul 2005 11:31:14 -0700, "gov" <Gov at mailinator.com> wrote:

>Hi,
>
>I've just started to learn programming and was told this was a good
>place to ask questions :)
>
>Where I work, we receive large quantities of data which is currently
>all printed on large, obsolete, dot matrix printers.  This is a problem
>because the replacement parts will not be available for much longer.
>
>So I'm trying to create a program which will capture the fixed width
>text file data and convert as well as sort the data (there are several
>different report types) into a different format which would allow it to
>be printed normally, or viewed on a computer.
>
>I've been reading up on the Regular Expression module and ways in which
>to manipulate strings; however, it has been difficult to think of a way
>in which to extract an address.
>
>Here's an example of the raw text that I have to work with:
>
>
>ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
>****************************
>
>FOR/POUR AL/LA:  20
>  CORR TYP:  A1B 2C3      P:3 CHNGD/CHANG
>  LANG: E CONS/REGR:             #######
>  MRS XXX X XXXXXXX
>  ### XXXXXXXXX ST                      DD   TYP:               P:6
>CHNGD/CHANG
>  MONCTON NB                            LANG: E CONS/REGR:
>#######
>                                        MRS XXX X          XXXXXXX
>                                        #####
>                                        ####
>                                        ###-###-#
>
>ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
>****************************
>
>FOR/POUR AL/LA:  30
>  BOTH TYP:  A1B 2D3      P:3 CHNGD/CHANG
>  LANG: E CONS/REGR:             #######
>  MISS XXXX XXXXX
>  ### XXXXXXXX ST
>  MONCTON NB
>
>EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:
>***********
>
>(the # = any number, and the X's are just regular text)
>I would like to extract the address information, but the two different
>text objects on the right hand side are difficult to remove.  I think
>it would be easier if I could just extract a fixed square of
>information, but I don't have a clue as to how to go about it.
>
>If anyone could give me suggestions as to methods for sorting this type
>of data, it would be appreciated.
>
If this is all fixed-width font characters and fixed record formats, the script
below might give you some ideas about extracting a "fixed square". I re-joined
the lines of the square with '\n'.join(<lines_of_the_square>) to print it, but
you could also pull data out of the individual lines with regexes and such.

I used your data example and added some made-up text under the second (EARNINGS)
header so there is a block to extract. (Not tested beyond what you see ;-)

----< legacy_data_parsing.py >---------------------------------------------------
data = """\
ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
****************************

FOR/POUR AL/LA:  20
  CORR TYP:  A1B 2C3      P:3 CHNGD/CHANG
  LANG: E CONS/REGR:             #######
  MRS XXX X XXXXXXX
  ### XXXXXXXXX ST                      DD   TYP:               P:6
CHNGD/CHANG
  MONCTON NB                            LANG: E CONS/REGR:
#######
                                        MRS XXX X          XXXXXXX
                                        #####
                                        ####
                                        ###-###-#

ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
****************************

FOR/POUR AL/LA:  30
  BOTH TYP:  A1B 2D3      P:3 CHNGD/CHANG
  LANG: E CONS/REGR:             #######
  MISS XXXX XXXXX
  ### XXXXXXXX ST
  MONCTON NB

EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:
***********
1  [Don't know what               [<- 1,34 This is a box of
2  goes in this kind               text with top/left
3  of record, but this             character row/col 1,34
4  is some text to show            and bottom/right at 4,62 ->]
5  how it might get
6  extracted]

"""

record_headers = [
"""\
ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
""",
"""\
EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:
"""
]

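import re

# Split on any record header. The outer parentheses form a capturing group,
# so re.split keeps the headers: odd items are headers, even items are the
# text before/between/after them.
recsplitter = re.compile('('+ '|'.join(map(re.escape,record_headers))+')')

def extract_block(tl, br, data):
    # tl and br are 0-based (row, col) corners, both inclusive.
    # Pad short lines so every row yields a slice of the full width.
    lines = [s.ljust(br[1]+1) for s in data.splitlines()]
    return '\n'.join([line[tl[1]:br[1]+1] for line in lines[tl[0]:br[0]+1]])

# Python 2 print statements throughout (the run shown below used Python 2.4).
for i, hdr_or_body in enumerate(recsplitter.split(data)):
    if i==0:
        # whatever precedes the first header (empty for this data)
        print '='*10, 'file prefix', '='*30
        data_type = ''
    elif i%2:
        # a captured header: remember which kind of record follows
        print '='*10, 'record hdr', '='*30
        data_type = hdr_or_body
    else:
        # the body belonging to the header seen just before
        print '='*10, 'record data', '='*30
    print hdr_or_body
    print '='*10
    if not i%2 and data_type == record_headers[1]: # EARNINGS etc
        # row 0 of the body is the line of asterisks, so the box starts at row 1
        print '---- earnings record right block ----'
        print extract_block((1,34),(4,62), hdr_or_body)
        print '----'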
---------------------------------------------------------------------------------
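
A note on the splitting step in the script above: re.split keeps the text
matched by a capturing group in its result, which is why the loop can tell
headers from bodies by odd/even index. Here is a minimal sketch of just that
step. The headers HDR-A: and HDR-B: and the sample text are made up for
illustration and are not from your reports; it is written so it should run
on Python 2 or 3.

```python
import re

# Hypothetical headers and sample text, for illustration only.
headers = ["HDR-A:\n", "HDR-B:\n"]
sample = "junk\nHDR-A:\nbody one\nHDR-B:\nbody two\n"

# The outer parentheses form a capturing group, so re.split returns the
# matched headers as well as the text between them.
splitter = re.compile('(' + '|'.join(map(re.escape, headers)) + ')')
parts = splitter.split(sample)

# parts[0] is the prefix, odd items are headers, even items are bodies.
prefix = parts[0]
records = list(zip(parts[1::2], parts[2::2]))  # (header, body) pairs

for hdr, body in records:
    print(repr(hdr) + ' -> ' + repr(body))
```

Pairing each header with its body up front like this is one alternative to
tracking data_type inside the loop; which reads better is a matter of taste.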

Running legacy_data_parsing.py produces:

[15:33] C:\pywk\clp>py24 legacy_data_parsing.py
========== file prefix ==============================

==========
========== record hdr ==============================
ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:

==========
========== record data ==============================
****************************

FOR/POUR AL/LA:  20
  CORR TYP:  A1B 2C3      P:3 CHNGD/CHANG
  LANG: E CONS/REGR:             #######
  MRS XXX X XXXXXXX
  ### XXXXXXXXX ST                      DD   TYP:               P:6
CHNGD/CHANG
  MONCTON NB                            LANG: E CONS/REGR:
#######
                                        MRS XXX X          XXXXXXX
                                        #####
                                        ####
                                        ###-###-#

==========
========== record hdr ==============================
ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:

==========
========== record data ==============================
****************************

FOR/POUR AL/LA:  30
  BOTH TYP:  A1B 2D3      P:3 CHNGD/CHANG
  LANG: E CONS/REGR:             #######
  MISS XXXX XXXXX
  ### XXXXXXXX ST
  MONCTON NB

==========
========== record hdr ==============================
EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:

==========
========== record data ==============================
***********
1  [Don't know what               [<- 1,34 This is a box of
2  goes in this kind               text with top/left
3  of record, but this             character row/col 1,34
4  is some text to show            and bottom/right at 4,62 ->]
5  how it might get
6  extracted]

==========
---- earnings record right block ----
[<- 1,34 This is a box of
 text with top/left
 character row/col 1,34
 and bottom/right at 4,62 ->]
----
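
The same block slice could probably be pointed at the address records to keep
the left-hand address and drop the right-hand material. What follows is a
rough sketch under some guesses taken from the sample: that the right-hand
text always starts at column 40, that the name/street/city lines sit at fixed
rows, and that the stray CHNGD/CHANG and ####### lines at column 0 are
mail-wrap artifacts (so they are left out here). The helper block_rows, the
sample body and its digits are made up for illustration. It should run on
Python 2 or 3.

```python
import re

# Made-up record body laid out like the sample above (digits invented);
# the right-hand text is assumed to start at column 40.
body = "\n".join([
    "  CORR TYP:  A1B 2C3      P:3 CHNGD/CHANG",
    "  LANG: E CONS/REGR:             1234567",
    "  MRS XXX X XXXXXXX",
    "  123 XXXXXXXXX ST" + " " * 22 + "DD   TYP:",
    "  MONCTON NB" + " " * 28 + "LANG: E CONS/REGR:",
    " " * 40 + "MRS XXX X          XXXXXXX",
])

def block_rows(tl, br, data):
    # Hypothetical variant of extract_block that returns a list of rows.
    # tl and br are 0-based (row, col) corners, both inclusive.
    lines = [s.ljust(br[1] + 1) for s in data.splitlines()]
    return [line[tl[1]:br[1] + 1] for line in lines[tl[0]:br[0] + 1]]

# Rows 2-4, columns 2-39 hold name, street and city in this assumed layout.
address = [row.rstrip() for row in block_rows((2, 2), (4, 39), body)]

# Postal code pattern guessed from "A1B 2C3" following "TYP:".
match = re.search(r'TYP:\s+([A-Z]\d[A-Z] \d[A-Z]\d)', body)
postal_code = match.group(1) if match else None

print(address)
print(postal_code)
```

If the real files turn out not to have fixed rows, the column slice alone may
still be useful, with a regex deciding which rows belong to the address.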

HTH

Regards,
Bengt Richter