[Tutor] suggestions for splitting file based on date

Dave Angel davea at davea.name
Fri Jul 19 22:30:23 CEST 2013


On 07/19/2013 04:00 PM, Peter Otten wrote:
> Sivaram Neelakantan wrote:
>
>> I've got some stock indices data that I plan to plot using matplotlib.
>> The data is simply date, idx_close_value, and my plan is to plot the
>> last 30-day, 90-day, 180-day & all-time graphs of the indices.
>>
>> a) I can do the date computations using the python date libs
>> b) plotting with matplotlib, I can get that done
>>
>> What is the best way to split the file into the last 30-day recs and
>> 90-day recs when the data is in increasing time order?  My initial
>> thinking is to first reverse the file, then append every rec newer
>> than the computed cutoff date to the corresponding 30/90/180-day
>> list.
>>
>> Is that the way to go or is there a better way?
>
> I'd start with a single list for the complete data, reverse that using the
> aptly named method and then create the three smaller lists using slicing.
>
> For example:
>
>>>> stock_data = range(10)
>>>> stock_data.reverse()
>>>> stock_data
> [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
>>>> stock_data[:3] # the last three days
> [9, 8, 7]
>
> On second thought I don't see why you want to reverse the data. If you omit
> that step you need to modify the slicing:
>
>>>> stock_data = range(10)
>>>> stock_data[-3:] # the last three days
> [7, 8, 9]
>
>

I see Peter has assumed that the data is already divided into day-size 
hunks, so that subscripting those hunks is possible.  He also assumed 
all the data will fit in memory at one time.
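
If the whole data set really does fit in memory, that approach is easy 
to flesh out.  A minimal sketch (the file name, column layout, and date 
format below are assumptions for illustration):

import csv
import datetime

def load_rows(path):
    # Read (date, close) pairs from a two-column CSV file,
    # oldest record first.
    rows = []
    with open(path) as f:
        for row in csv.reader(f):
            if not row:
                continue
            date_text, close_text = row
            date = datetime.datetime.strptime(date_text, "%Y-%m-%d").date()
            rows.append((date, float(close_text)))
    return rows

def last_n_days(rows, n):
    # Keep every record no older than n days before the newest date.
    cutoff = rows[-1][0] - datetime.timedelta(days=n)
    return [row for row in rows if row[0] >= cutoff]

rows = load_rows("indices.csv")
windows = dict((n, last_n_days(rows, n)) for n in (30, 90, 180))

Since the data is already in increasing time order there is nothing to 
reverse; the newest date is simply rows[-1][0].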

But in my envisioning of your description, I pictured a variable number 
of records per day, with each record being a variable-length stream of 
bytes starting with a length field.  I pictured needing to handle a 
month with either zero entries or one with 3 billion entries.  And even 
if a month is reasonable, I pictured the file as having 10 years of 
spurious data before you get to the 180-day point.

Are you looking for an optimal solution, or just one that works?  What 
order do you want the final data to be in?  How is the data organized on 
disk?  Is each record a fixed size?  If so, you can efficiently do a 
binary search in the file to find the 30-, 90-, and 180-day points.
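
For fixed-size records, that search only touches a handful of records. 
A rough sketch, assuming an invented 12-byte record layout (an 8-byte 
integer date such as 20130719 followed by a 4-byte float close); 
substitute your real format:

import os
import struct

RECORD_SIZE = 12        # assumed layout, purely for illustration
RECORD_FORMAT = "<qf"   # 8-byte integer date + 4-byte float close

def first_offset_on_or_after(f, cutoff_date):
    # Binary search a file of fixed-size, date-ordered records for the
    # byte offset of the first record dated on or after cutoff_date.
    f.seek(0, os.SEEK_END)
    lo, hi = 0, f.tell() // RECORD_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * RECORD_SIZE)
        date, _close = struct.unpack(RECORD_FORMAT, f.read(RECORD_SIZE))
        if date < cutoff_date:
            lo = mid + 1
        else:
            hi = mid
    return lo * RECORD_SIZE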

Once you determine the offsets in the file for those 180-, 90-, and 
30-day points, it's a simple matter to seek to one such spot and process 
all the records following.  Most records need never be read from disk at 
all.
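
Continuing the sketch above (it reuses RECORD_SIZE, RECORD_FORMAT and 
the struct import), reading everything from one of those offsets onward 
is a seek followed by a short read loop:

def records_from(f, offset):
    # Yield (date, close) pairs from the given byte offset to end of file.
    f.seek(offset)
    while True:
        chunk = f.read(RECORD_SIZE)
        if len(chunk) < RECORD_SIZE:
            break
        yield struct.unpack(RECORD_FORMAT, chunk)

# For example (the file name and cutoff are placeholders):
# with open("indices.dat", "rb") as f:
#     start = first_offset_on_or_after(f, cutoff_30)
#     last_30 = list(records_from(f, start))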

If the records are not fixed length, you can still do the same thing, 
but you will need one complete pass through the file to find those same 
three offsets.
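
For the length-prefixed layout described earlier, that pass might look 
like this (the 2-byte length header and the date occupying the first 8 
bytes of each record body are assumptions):

import struct

def find_cutoff_offsets(f, cutoffs):
    # One pass through a file of length-prefixed, date-ordered records,
    # remembering the byte offset of the first record dated on or after
    # each cutoff.  cutoffs must be given oldest-first.
    remaining = sorted(cutoffs)
    offsets = {}
    while remaining:
        pos = f.tell()
        header = f.read(2)
        if len(header) < 2:
            break
        (length,) = struct.unpack("<H", header)
        body = f.read(length)
        (date,) = struct.unpack("<q", body[:8])
        while remaining and date >= remaining[0]:
            offsets[remaining.pop(0)] = pos
    return offsets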

-- 
DaveA


