parsing tab and newline delimited text

elsa kerensaelise at hotmail.com
Tue Aug 3 23:35:34 EDT 2010


On Aug 4, 12:49 pm, Tim Chase <python.l... at tim.thechases.com> wrote:
> On 08/03/10 21:14, elsa wrote:
>
>
>
> > I have a large file of text I need to parse. Individual 'entries' are
> > separated by newline characters, while fields within each entry are
> > separated by tab characters.
>
> > So, an individual entry might have this form (in printed form):
>
> > Title    date   position   data
>
> > with each field separated by tabs, and a newline at the end of data.
> > So, I thought I could simply open a file, read each line in in turn,
> > and parse it....
>
> > f=open('MyFile')
> > line=f.readline()
> > parts=line.split('\t')
>
> > etc...
>
> > However, 'data' is a fairly random string of characters. Because the
> > files I'm processing are large, there is a good chance that in every
> > file, there is a data field that might look like this:
>
> > 899998dlKKlS\lk3#kdf\nllllKK99
>
> My first question is whether the line contains actual newline/tab
> characters within the field data, or the string-representation of
> the line.  For one of the lines in question, what does
>
>    print repr(line)

Here is what I get at the interactive prompt:

>>> line = """IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??55
... :E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;
9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;
44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99"""

>>> line
'IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??
55\n:E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;
9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;
44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99'

>>> print repr(line)
'IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??
55\n:E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;
9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;
44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99'
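
For comparison, here is a minimal sketch of how a real newline character and a backslash followed by 'n' differ once they are inside a Python string. The two sample strings are made up for illustration and are not taken from the actual file; print is called with parentheses so the snippet also runs on Python 3.

```python
# Two made-up samples (assumptions for illustration, not real file data).
real_newline = "??55\n:E?"    # \n is an escape: one newline character
backslash_n = "??55\\n:E?"    # \\n is two characters: a backslash, then 'n'

# The lengths differ by one because the second string holds two characters
# where the first holds a single newline.
print(len(real_newline))               # 8
print(len(backslash_n))                # 9

# Only the real newline splits the text into two lines.
print(len(real_newline.splitlines()))  # 2
print(len(backslash_n.splitlines()))   # 1
```

One caveat: text pasted into a non-raw string literal has any backslash-n pair interpreted as an escape, so repr() of a line read directly from the file is probably the more reliable check.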

Basically, these are numeric values encoded as ASCII symbols, so '\' is
a value, 'n' is a value, 'E' is a value, and so on; it's all part of the
same data field. It's just unfortunate that '\' and 'n' have ended up
next to each other. (I didn't design this file, by the way; I'm just
expected to process it!)
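
If the file really does hold a backslash followed by 'n' as two ordinary characters, then reading it line by line and splitting on tabs should leave the data field intact. The sketch below assumes exactly that: it writes one hypothetical entry (the title, date, and position values are invented) to a temporary file and reads it back.

```python
import os
import tempfile

# Hypothetical entry: four tab-separated fields ending in a real newline.
# The data field contains a backslash followed by 'n' (two characters).
sample = "Title\t2010-08-03\t17\t899998dlKKlS\\lk3#kdf\\nllllKK99\n"

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(path, "w") as out:
        out.write(sample)
    with open(path) as f:
        # Strip only the terminating newline, then split on real tabs.
        entries = [line.rstrip("\n").split("\t") for line in f]
finally:
    os.remove(path)

print(len(entries))          # number of entries read from the file
print(repr(entries[0][-1]))  # the data field, backslash and 'n' preserved
```

If the file instead contains real newline characters inside the data field, this approach would split one entry across two lines, and the parsing would need some other rule, such as counting tab-separated fields per entry.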

Elsa.


