Legacy data parsing
Bengt Richter
bokr at oz.net
Fri Jul 8 18:35:22 EDT 2005
On 8 Jul 2005 11:31:14 -0700, "gov" <Gov at mailinator.com> wrote:
>Hi,
>
>I've just started to learn programming and was told this was a good
>place to ask questions :)
>
>Where I work, we receive large quantities of data which is currently
>all printed on large, obsolete, dot matrix printers. This is a problem
>because the replacement parts will not be available for much longer.
>
>So I'm trying to create a program which will capture the fixed width
>text file data and convert as well as sort the data (there are several
>different report types) into a different format which would allow it to
>be printed normally, or viewed on a computer.
>
>I've been reading up on the Regular Expression module and ways in which
>to manipulate strings however it has been difficult to think of a way
>in which to extract an address.
>
>Here's an example of the raw text that I have to work with:
>
>
>ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
>****************************
>
>FOR/POUR AL/LA: 20
> CORR TYP: A1B 2C3 P:3 CHNGD/CHANG
> LANG: E CONS/REGR: #######
> MRS XXX X XXXXXXX
> ### XXXXXXXXX ST DD TYP: P:6
>CHNGD/CHANG
> MONCTON NB LANG: E CONS/REGR:
>#######
> MRS XXX X XXXXXXX
> #####
> ####
> ###-###-#
>
>ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
>****************************
>
>FOR/POUR AL/LA: 30
> BOTH TYP: A1B 2D3 P:3 CHNGD/CHANG
> LANG: E CONS/REGR: #######
> MISS XXXX XXXXX
> ### XXXXXXXX ST
> MONCTON NB
>
>EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:
>***********
>
>(the # = any number, and the X's are just regular text)
>I would like to extract the address information, but the two different
>text objects on the right hand side are difficult to remove. I think
>it would be easier if I could just extract a fixed square of
>information, but I don't have a clue as to how to go about it.
>
>If anyone could give me suggestions as to methods in sorting this type
>of data, it would be appreciated.
>
If the data is all fixed-width characters in fixed record formats, the
script below may give you some ideas about extracting a "fixed square". I
re-joined the lines of the square with '\n'.join(<lines_of_the_square>) to print
it, but you could just as well pull data out of the individual lines with
regexes and such.
I used your data example and added some under the alternate header.
(Not tested beyond what you see ;-)
----< legacy_data_parsing.py >---------------------------------------------------
data = """\
ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
****************************
FOR/POUR AL/LA: 20
CORR TYP: A1B 2C3 P:3 CHNGD/CHANG
LANG: E CONS/REGR: #######
MRS XXX X XXXXXXX
### XXXXXXXXX ST DD TYP: P:6
CHNGD/CHANG
MONCTON NB LANG: E CONS/REGR:
#######
MRS XXX X XXXXXXX
#####
####
###-###-#
ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
****************************
FOR/POUR AL/LA: 30
BOTH TYP: A1B 2D3 P:3 CHNGD/CHANG
LANG: E CONS/REGR: #######
MISS XXXX XXXXX
### XXXXXXXX ST
MONCTON NB
EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:
***********
1 [Don't know what                [<- 1,34 This is a box of
2 goes in this kind                text with top/left
3 of record, but this              character row/col 1,34
4 is some text to show             and bottom/right at 4,62 ->]
5 how it might get
6 extracted]
"""
record_headers = [
"""\
ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
""",
"""\
EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:
"""
]
import re
recsplitter = re.compile('('+ '|'.join(map(re.escape,record_headers))+')')

def extract_block(tl, br, data):
    lines = [s.ljust(br[1]+1) for s in data.splitlines()]
    return '\n'.join([line[tl[1]:br[1]+1] for line in lines[tl[0]:br[0]+1]])

for i, hdr_or_body in enumerate(recsplitter.split(data)):
    if i==0:
        print '='*10, 'file prefix', '='*30
        data_type = ''
    elif i%2:
        print '='*10, 'record hdr', '='*30
        data_type = hdr_or_body
    else:
        print '='*10, 'record data', '='*30
    print hdr_or_body
    print '='*10
    if not i%2 and data_type == record_headers[1]: # EARNINGS etc
        print '---- earnings record right block ----'
        print extract_block((1,34),(4,62), hdr_or_body)
        print '----'
---------------------------------------------------------------------------------
Produces:
[15:33] C:\pywk\clp>py24 legacy_data_parsing.py
========== file prefix ==============================
==========
========== record hdr ==============================
ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
==========
========== record data ==============================
****************************
FOR/POUR AL/LA: 20
CORR TYP: A1B 2C3 P:3 CHNGD/CHANG
LANG: E CONS/REGR: #######
MRS XXX X XXXXXXX
### XXXXXXXXX ST DD TYP: P:6
CHNGD/CHANG
MONCTON NB LANG: E CONS/REGR:
#######
MRS XXX X XXXXXXX
#####
####
###-###-#
==========
========== record hdr ==============================
ADDRESS INFORMATION/RENSEIGNEMENTS SUR L'ADRESSE:
==========
========== record data ==============================
****************************
FOR/POUR AL/LA: 30
BOTH TYP: A1B 2D3 P:3 CHNGD/CHANG
LANG: E CONS/REGR: #######
MISS XXXX XXXXX
### XXXXXXXX ST
MONCTON NB
==========
========== record hdr ==============================
EARNINGS VITAL INFORMATION/RENSEIGNEMENTS ESSENTIELS SUR LES GAINS:
==========
========== record data ==============================
***********
1 [Don't know what                [<- 1,34 This is a box of
2 goes in this kind                text with top/left
3 of record, but this              character row/col 1,34
4 is some text to show             and bottom/right at 4,62 ->]
5 how it might get
6 extracted]
==========
---- earnings record right block ----
[<- 1,34 This is a box of
 text with top/left
 character row/col 1,34
 and bottom/right at 4,62 ->]
----
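For anyone on Python 3, the same two steps (splitting on record headers, then
slicing out a rectangle) might be sketched roughly as below. The headers and
sample text are made up for illustration, and split_records is a hypothetical
helper name that does not appear in the script above; extract_block follows the
same slicing idea with zero-based, inclusive (row, col) corners.

```python
import re


def split_records(text, headers):
    """Split text into (header, body) pairs.

    The capturing group makes re.split keep the matched headers, so the
    result alternates: prefix, header, body, header, body, ...
    """
    pattern = re.compile("(" + "|".join(map(re.escape, headers)) + ")")
    parts = pattern.split(text)
    # parts[0] is whatever precedes the first header; it is skipped here.
    return list(zip(parts[1::2], parts[2::2]))


def extract_block(top_left, bottom_right, text):
    """Return the rectangle of characters between two inclusive corners."""
    (top, left), (bottom, right) = top_left, bottom_right
    # Pad short lines so that every slice has the same width.
    lines = [line.ljust(right + 1) for line in text.splitlines()]
    return "\n".join(line[left:right + 1] for line in lines[top:bottom + 1])


# Hypothetical sample: two made-up headers and a small two-column body.
HEADERS = ["NAMES:\n", "NOTES:\n"]
SAMPLE = (
    "NAMES:\n"
    + "ANNA".ljust(10) + "RIGHT A\n"
    + "BOB".ljust(10) + "RIGHT B\n"
    + "NOTES:\n"
    + "misc\n"
)

records = split_records(SAMPLE, HEADERS)
first_body = records[0][1]
left_block = extract_block((0, 0), (1, 3), first_body)     # columns 0..3
right_block = extract_block((0, 10), (1, 16), first_body)  # columns 10..16
print(left_block)
print(right_block)
```

Once the column limits of the address area are known, the left-hand address
lines could presumably be pulled out the same way, leaving the right-hand
TYP/LANG fields behind.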
HTH
Regards,
Bengt Richter