Reading log and saving data to DB
Guy Tamir
guytamir1 at gmail.com
Thu Aug 15 03:23:57 EDT 2013
On Wednesday, August 14, 2013 4:46:09 PM UTC+3, mar... at python.net wrote:
> On Wed, Aug 14, 2013, at 09:18 AM, Guy Tamir wrote:
>
> > Hi all,
> >
> > I have a Ubuntu server running NGINX that logs data for me.
> > I want to write a python script that reads my customized logs and after
> > a little rearrangement save the new data into my DB (postgresql).
> >
> > The process should run about every 5 minutes and i'm expecting large
> > chunks of data on several 5 minute windows..
> >
> > My plan for achieving this is to install python on the server, write a
> > script and add it to cron.
> >
> > My question is what the simplest way to do this?
> > should i use any python frameworks?
>
> Rarely do I put "framework" and "simplest way" in the same set.
>
> I would do 1 of 2 things:
>
> * Write a simple script that reads lines from stdin, and writes to the
> db. Make sure it gets run in init before nginx does and tail -F -n 0 to
> that script. Don't worry about the 5-minute cron.
>
> * Similar to above but if you want to use cron also store in the db the
> offset of the last byte read in the file, then when the cron job kicks
> off again seek to that position + 1 and begin reading, at EOF write the
> offset again.
>
> This is irrespective of any log rotating that is going on behind the
> scenes, of course.
I'm not sure I understood the first option, or what it means to run the script in init before nginx.
The second option sounds more like what I had in mind.
Aren't there existing components like this that I can use?
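If I've understood the second option, a minimal sketch of what I'd write might look like this (the file paths are placeholders, and I'm keeping the offset in a side file here, though it could live in the DB as suggested):

```python
import os

def read_new_lines(log_path, offset_path):
    """Read lines appended to log_path since the last run.

    Stores the byte position of the next unread byte in offset_path
    (both paths are placeholders for this sketch). A cron job can call
    this every 5 minutes and only see new log lines.
    """
    try:
        with open(offset_path) as f:
            offset = int(f.read().strip())
    except (OSError, IOError, ValueError):
        offset = 0  # first run, or the offset file is missing/corrupt

    with open(log_path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() < offset:
            offset = 0  # file shrank: log was rotated/truncated, start over
        f.seek(offset)
        data = f.read()
        new_offset = f.tell()

    with open(offset_path, "w") as f:
        f.write(str(new_offset))

    return data.decode("utf-8").splitlines()
```

Calling it twice only returns the lines appended between the two calls, so nothing gets processed twice.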
Since the log fills up quickly, I'm having trouble reading so much data and writing it all to the DB in a reasonable amount of time.
The table receiving the new data is somewhat complex. Its purpose is to save data about ads shown from my app, and its fields are (ad_id, user_source_site, user_location, day_date, specific_hour, views, clicks).
Each row is distinct by the first five fields, since I need to show different types of stats.
Because each new log line may or may not already be in the DB, I have to run an upsert (update or insert) for each row.
This leads to very poor performance.
Do you have any ideas about how I can make this script more efficient?
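One thing I'm considering is collapsing duplicate keys in Python before touching the DB, so there is one upsert per distinct 5-field key instead of one per log line, and then running them all in a single transaction. A sketch of the aggregation step (the table name ad_stats and the SQL are my guesses; the ON CONFLICT syntax needs PostgreSQL 9.5+ and a unique index on the five key fields, so older servers would need an UPDATE followed by a conditional INSERT, or a PL/pgSQL function):

```python
from collections import defaultdict

def aggregate(rows):
    """Collapse parsed log rows into one (views, clicks) total per key.

    Each row is (ad_id, user_source_site, user_location, day_date,
    specific_hour, views, clicks). Summing in memory first cuts the
    number of upserts from one per log line to one per distinct key.
    """
    totals = defaultdict(lambda: [0, 0])
    for ad_id, site, loc, day, hour, views, clicks in rows:
        key = (ad_id, site, loc, day, hour)
        totals[key][0] += views
        totals[key][1] += clicks
    return {key: tuple(v) for key, v in totals.items()}

# Batched upsert for the aggregated rows, e.g. via psycopg2's
# executemany inside one transaction. Table name and column names
# are assumptions based on the fields described above.
UPSERT_SQL = """
    INSERT INTO ad_stats
        (ad_id, user_source_site, user_location, day_date,
         specific_hour, views, clicks)
    VALUES (%s, %s, %s, %s, %s, %s, %s)
    ON CONFLICT (ad_id, user_source_site, user_location,
                 day_date, specific_hour)
    DO UPDATE SET views  = ad_stats.views  + EXCLUDED.views,
                  clicks = ad_stats.clicks + EXCLUDED.clicks
"""
```

Even without changing the SQL, aggregating first should help a lot when the same ad/source/location/hour combination appears many times in one 5-minute window.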