Reading log and saving data to DB

Guy Tamir guytamir1 at gmail.com
Thu Aug 15 03:23:57 EDT 2013


On Wednesday, August 14, 2013 4:46:09 PM UTC+3, mar... at python.net wrote:
> On Wed, Aug 14, 2013, at 09:18 AM, Guy Tamir wrote:
> > Hi all,
> >
> > I have an Ubuntu server running NGINX that logs data for me.
> > I want to write a python script that reads my customized logs and,
> > after a little rearrangement, saves the new data into my DB
> > (postgresql).
> >
> > The process should run about every 5 minutes and I'm expecting large
> > chunks of data in several of the 5-minute windows.
> >
> > My plan for achieving this is to install python on the server, write
> > a script and add it to cron.
> >
> > My question is: what is the simplest way to do this?
> > Should I use any python frameworks?
>
> Rarely do I put "framework" and "simplest way" in the same set.
>
> I would do 1 of 2 things:
>
> * Write a simple script that reads lines from stdin and writes to the
>   db. Make sure it gets run in init before nginx does, and tail -F -n 0
>   the log to that script. Don't worry about the 5-minute cron.
>
> * Similar to the above, but if you want to use cron, also store in the
>   db the offset of the last byte read in the file; when the cron job
>   kicks off again, seek to that position + 1 and begin reading, and at
>   EOF write the offset again.
>
> This is irrespective of any log rotating that is going on behind the
> scenes, of course.

I'm not sure I understood the first option, or what it means to run the script in init before nginx.
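
If I follow it, the idea is a long-running reader started from init (before nginx), fed by something like tail -F -n 0 access.log | python reader.py. A minimal, untested sketch of such a reader, with the actual DB write left as a stub:

    # reader.py -- untested sketch of a long-running stdin consumer.
    import sys

    def main():
        while True:
            # readline() blocks until tail emits a line; avoiding the
            # "for line in sys.stdin" iterator sidesteps its read-ahead
            # buffering, which can delay lines arriving through a pipe.
            line = sys.stdin.readline()
            if not line:        # EOF: the tail process went away
                break
            record = line.rstrip('\n')
            # ... parse `record` and write it to postgres here ...

    if __name__ == '__main__':
        main()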

The second option sounds more like what I had in mind.
Aren't there any existing components like this that I can use?
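
Roughly what I had in mind, as an untested sketch (the log path and the one-row log_state bookkeeping table are placeholder names):

    # Untested sketch of the cron-driven, offset-tracking reader.
    # Placeholders: LOG_PATH, and a bookkeeping table created once as
    #   CREATE TABLE log_state (id int PRIMARY KEY, byte_offset bigint);
    import psycopg2

    LOG_PATH = '/var/log/nginx/access.log'

    def read_new_lines(conn):
        cur = conn.cursor()
        cur.execute("SELECT byte_offset FROM log_state WHERE id = 1")
        row = cur.fetchone()
        offset = row[0] if row else 0
        with open(LOG_PATH, 'rb') as f:
            f.seek(offset)
            lines = f.readlines()   # everything appended since last run
            # f.tell() is one past the last byte read, so the next run
            # can seek straight to it (no +1 arithmetic needed).
            new_offset = f.tell()
        if row:
            cur.execute("UPDATE log_state SET byte_offset = %s "
                        "WHERE id = 1", (new_offset,))
        else:
            cur.execute("INSERT INTO log_state (id, byte_offset) "
                        "VALUES (1, %s)", (new_offset,))
        conn.commit()
        return lines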

Since the log fills up quickly, I'm having trouble reading that much data and writing it all to the DB in a reasonable amount of time.

The table receiving the new data is somewhat complex. Its purpose is to store stats about the ads shown by my app; the fields are (ad_id, user_source_site, user_location, day_date, specific_hour, views, clicks),
and each row is distinct by the first five fields, since I need to show different types of stats.
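
In DDL terms, roughly this (untested; the table name ad_stats, the column types, and the DSN are guesses on my part -- the composite primary key is what makes rows distinct by the first five fields):

    import psycopg2

    conn = psycopg2.connect('dbname=mydb')   # placeholder DSN
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE ad_stats (
            ad_id            integer,
            user_source_site text,
            user_location    text,
            day_date         date,
            specific_hour    smallint,
            views            integer NOT NULL DEFAULT 0,
            clicks           integer NOT NULL DEFAULT 0,
            PRIMARY KEY (ad_id, user_source_site, user_location,
                         day_date, specific_hour)
        )
    """)
    conn.commit()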
 
Because each new log line may or may not already have a matching row in the DB, I have to run an upsert command (update or insert) for each row.

This leads to very poor performance.
Do you have any ideas about how I can make this script more efficient?
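
One thing I've been experimenting with (untested sketch, same placeholder ad_stats table as above): aggregate the parsed lines in memory first, so the DB sees one statement per distinct key instead of one per log line, and run the whole batch in a single transaction:

    from collections import defaultdict
    import psycopg2

    def aggregate(parsed_lines):
        # parsed_lines yields tuples of
        # (ad_id, source_site, location, day, hour, views, clicks)
        totals = defaultdict(lambda: [0, 0])
        for ad_id, site, loc, day, hour, views, clicks in parsed_lines:
            t = totals[(ad_id, site, loc, day, hour)]
            t[0] += views
            t[1] += clicks
        return totals

    def upsert(conn, totals):
        cur = conn.cursor()
        for key, (views, clicks) in totals.items():
            cur.execute(
                "UPDATE ad_stats "
                "   SET views = views + %s, clicks = clicks + %s "
                " WHERE ad_id = %s AND user_source_site = %s "
                "   AND user_location = %s AND day_date = %s "
                "   AND specific_hour = %s",
                (views, clicks) + key)
            if cur.rowcount == 0:        # no existing row: insert it
                cur.execute(
                    "INSERT INTO ad_stats (ad_id, user_source_site, "
                    "    user_location, day_date, specific_hour, "
                    "    views, clicks) "
                    "VALUES (%s, %s, %s, %s, %s, %s, %s)",
                    key + (views, clicks))
        conn.commit()   # one transaction for the whole batch

With a single cron writer there's no race between the UPDATE and the INSERT, and the aggregation cuts the statement count from one per log line down to one per distinct (ad, site, location, day, hour) key.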



