Recommended data structure for newbie

Paul McGuire ptmcg at austin.rr._bogus_.com
Wed May 3 15:31:04 EDT 2006


"Paul McGuire" <ptmcg at austin.rr._bogus_.com> wrote in message
news:7Z16g.1536$CH2.1053 at tornado.texas.rr.com...
> "manstey" <manstey at csu.edu.au> wrote in message
> news:1146626916.066395.206540 at y43g2000cwc.googlegroups.com...
> > Hi,
> >
> > I have a text file with about 450,000 lines. Each line has 4-5 fields,
> > separated by various delimiters (spaces, @, etc).
> >
> > I want to load in the text file and then run routines on it to produce
> > 2-3 additional fields.
> >
>
> <snip>
>
> Matthew -
>
> If you find re's to be a bit cryptic, here is a pyparsing version that may
> be a bit more readable, and will easily scan through your input file:
>
<snip>

Lest I be accused of pushing pyparsing where it isn't appropriate, here is a
non-pyparsing version of the same program.

The biggest hangup with your sample data is that you can't predict what the
separator is going to be - sometimes it's '[', sometimes it's '^'.  If the
separator character were more predictable, you could use simple split()
calls, as in:

    data = "blah blah blah^more blah".split("^")
    elements = data[0].split() + [data[1]]
    print elements

    ['blah', 'blah', 'blah', 'more blah']

Note that this also discards the separator.  Since you had something that
goes beyond simple string split() calls, I thought you might find pyparsing to
be a simpler alternative to re's.
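
If you do end up using re's after all, one option (a rough, untested sketch)
is to put the separator in a capturing group - re.split() then keeps the
matched separator in the result list instead of discarding it:

    import re

    line = "gee fre asd[234"
    # the capturing group keeps the matched separator in the result list
    parts = re.split(r"([\[^])", line, 1)
    print parts[0].split() + parts[1:]

    ['gee', 'fre', 'asd', '[', '234']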

Here is a version that tries the different separators in turn, then builds the
appropriate list of pieces, including the matching separator.  I've also
shown an example of a generator, since you are likely to want one when
parsing hundreds of thousands of lines as you are.

-- Paul

=================
data = """gee fre asd[234
ger dsf asd[243
gwer af as.:^25a"""

# generator to process each line of data
# call using processData(listOfLines)
def processData(d):
    separators = "[^"  # expand this string if you need other separators
    for line in d:
        for s in separators:
            if s in line:
                parts = line.split(s)
                # yield the first element of parts, split on whitespace,
                # followed by the separator,
                # followed by whatever was after the separator
                yield parts[0].split() + [ s, parts[1] ]
                break
        else:
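            # no separator found - yield the whole line unchanged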
            yield line

# to call this for a text file, use something like
#      for lineParts in processData( file("xyzzy.txt").readlines() ):
for lineParts in processData( data.split("\n") ):
    print lineParts

print

# rerun processData, augmenting extracted values with additional
# computed values
for lineParts in processData( data.split("\n") ):
    # copy the extracted tokens, then append two computed fields
    tokens = lineParts[:]
    tokens.append( lineParts[0] + lineParts[1] )
    tokens.append( lineParts[-1] + lineParts[-1][-1] )
    print tokens

====================
prints:

['gee', 'fre', 'asd', '[', '234']
['ger', 'dsf', 'asd', '[', '243']
['gwer', 'af', 'as.:', '^', '25a']

['gee', 'fre', 'asd', '[', '234', 'geefre', '2344']
['ger', 'dsf', 'asd', '[', '243', 'gerdsf', '2433']
['gwer', 'af', 'as.:', '^', '25a', 'gweraf', '25aa']
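
One other note: readlines() slurps the whole file into memory, and with
450,000 lines you may prefer to iterate over the file object itself, which
hands the generator one line at a time.  Something along these lines should
work (untested, and remember to strip the trailing newlines):

    # lazy version - the file object is iterated one line at a time
    infile = file("xyzzy.txt")
    for lineParts in processData( line.rstrip("\n") for line in infile ):
        print lineParts
    infile.close()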