parsing a file for analysis

Rita rmorgan466 at gmail.com
Sat Feb 26 11:36:51 EST 2011


Yes, Yes :-). I was using awk to do all of this.  It does work, but I find
myself repeatedly reading the same data because awk does not support complex
data structures. Plus, the code is getting ugly.

I was told about Orange (http://orange.biolab.si/). Does anyone have
experience with it?



On Sat, Feb 26, 2011 at 10:53 AM, Martin Gregorie
<martin at address-in-sig.invalid> wrote:

> On Sat, 26 Feb 2011 16:29:54 +0100, Andrea Crotti wrote:
>
> > Il giorno 26/feb/2011, alle ore 06.45, Rita ha scritto:
> >
> >> I have a large text (4GB) which I am parsing.
> >>
> >> I am reading the file to collect stats on certain items.
> >>
> >> My approach has been simple,
> >>
> >> for row in open(file):
> >>     if "INFO" in row:
> >>         line = row.split()
> >>         user = line[0]
> >>         host = line[1]
> >>         __time = line[2]
> >>         ...
> >>
> >> I was wondering if there is a framework or a better algorithm to read
> >> such a large file and collect stats according to content. Also, are
> >> there any libraries, data structures or functions which can be helpful?
> >> I was told about the 'collections' module.  Here are some stats I am
> >> trying to get:
> >>
> >> *Number of unique users
> >> *Break down each user's visits according to time, t0 to t1
> >> *What user came from what host
> >> *What time had the most users?
> >>
> >> (There are about 15 different things I want to query)
> >>
> >> I understand most of these are redundant, but it would be nice to have
> >> a framework or even an object-oriented way of doing this instead of
> >> loading it into a database.
> >>
> >>
> >> Any thoughts or ideas?
> >
> > Not an expert, but it might be good to push the data into a database;
> > then you can tune the DBMS and write smart queries to get all the
> > statistics you want from it.
> >
> > It might take a while (splitting with regexps may be faster), but it's
> > done only once, and then you work with DB tools.
> >
> This is the sort of job that is best done with awk.
>
> Awk processes a text file line by line, automatically splitting each line
> into an array of words. It uses regexes to recognise lines and trigger
> actions on them. For example, building a list of visitors: assume there's
> a line containing "username logged on", you could build a list of users
> and count their visits with this statement:
>
> /logged on/ { user[$1] += 1 }
>
> where the regex, /logged on/, triggers the action, in curly brackets, for
> each line it matches. "$1" is a symbol for the first word in the line.
>
>
> --
> martin@   | Martin Gregorie
> gregorie. | Essex, UK
> org       |
> --
> http://mail.python.org/mailman/listinfo/python-list
>
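
For comparison, a rough Python equivalent of that awk statement, using
collections.Counter. The sample lines are made up, and "the first word is
the username" is assumed from the awk $1:

```python
from collections import Counter

# Made-up sample input standing in for the real log file
log = [
    "alice logged on",
    "bob logged on",
    "alice logged on",
]

user = Counter()
for line in log:
    if "logged on" in line:          # plays the role of the /logged on/ regex
        user[line.split()[0]] += 1   # $1 -> first whitespace-separated word

print(user)  # Counter({'alice': 2, 'bob': 1})
```

The difference is that in Python the Counter is just one of several
structures you can fill in the same pass, which is what the awk version
struggles to scale to.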



-- 
--- Get your facts first, then you can distort them as you please.--

