parsing a file for analysis

Sat Feb 26 10:53:51 EST 2011

On Sat, 26 Feb 2011 16:29:54 +0100, Andrea Crotti wrote:

> On 26 Feb 2011, at 06:45, Rita wrote:
> 
>> I have a large text (4GB) which I am parsing.
>> 
>> I am reading the file to collect stats on certain items.
>> 
>> My approach has been simple,
>> 
>> for row in open(file):
>>   if "INFO" in row:
>>     line=row.split()
>>     user=line[0]
>>     host=line[1]
>>     __time=line[2]
>>     ...
>> 
>> I was wondering if there is a framework or a better algorithm to read
>> such a large file and collect its stats according to content. Also, are
>> there any libraries, data structures or functions which can be helpful?
>> I was told about 'collections' container.  Here are some stats I am
>> trying to get:
>> 
>> *Number of unique users
>> *Break down each user's visit according to time, t0 to t1
>> *what user came from what host.
>> *what time had the most users?
>> 
>> (There are about 15 different things I want to query)
>> 
>> I understand most of these are redundant but it would be nice to have a
>> framework or even an object-oriented way of doing this instead of
>> loading it into a database.
>> 
>> 
>> Any thoughts or ideas?
> 
> Not an expert, but maybe it might be good to push the data into a
> database, and then you can tweak the DBMS and write smart queries to get
> all the statistics you want from it.
> 
> It might take a while (maybe splitting with a regexp is faster) but it's
> done only once and then you work with DB tools.
>
This is the sort of job that is best done with awk.

Awk processes a text file line by line, automatically splitting each line 
into an array of words. It uses regexes to recognise lines and trigger 
actions on them. For example, to build a list of visitors: assuming each 
logon produces a line containing "username logged on", you could build a 
list of users and count their visits with this statement:

/logged on/ { user[$1] += 1 }

where the regex, /logged on/, triggers the action, in curly brackets, for 
each line it matches. "$1" refers to the first word in the line, and 
"user" is an array indexed by that word, so each element holds one 
user's visit count.
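
For comparison with the Python loop in the original question, roughly the 
same pattern/action idea can be sketched in Python with 
collections.Counter. This is only a sketch: the function name 
count_visits and the sample lines are made up for illustration, and it 
assumes the user name is the first whitespace-separated word on the line.

```python
import re
from collections import Counter

# Same pattern as the awk regex /logged on/.
LOGGED_ON = re.compile(r"logged on")


def count_visits(lines):
    """Count visits per user, mirroring: /logged on/ { user[$1] += 1 }"""
    visits = Counter()
    for line in lines:
        if LOGGED_ON.search(line):
            fields = line.split()    # whitespace split, like awk's default
            visits[fields[0]] += 1   # fields[0] plays the role of awk's $1
    return visits


# Hypothetical sample lines; the real log format is an assumption.
sample = [
    "alice logged on from host1",
    "bob logged on from host2",
    "alice logged on from host3",
    "carol logged off",
]

visits = count_visits(sample)
print(len(visits), "unique users")
for user, n in visits.most_common():
    print(user, n)
```

For the real 4GB file you would pass the open file object instead of a 
list, since iterating over a file reads it one line at a time and never 
holds the whole file in memory.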

-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |