[Tutor] Help with date range

Martin A. Brown martin at linux-ip.net
Tue Feb 9 14:37:04 EST 2016


Greetings Chelsea,

>So what I am trying to do is take in a csv file and the format of 
>the csv file is:

While I do not have a suggestion to you about the Python you have 
sent to the list, I have a few suggestions about how to approach the 
problem differently.  I hope that this may make your task easier.

>something, something1, something2,something3, something4, 
>something5, something6, something7. Then in the csv file I want to 
>search for a certain range of dates like 1/3/2016 - 2/3/2016. I can 
>get individual dates but not a range of dates.


Option 1 for date handling:

Consider converting your date column to a datetime object, which 
then can be used in comparisons (and sorting).

  >>> d0 = datetime.datetime.strptime('1/3/2016', '%m/%d/%Y')
  >>> d1 = datetime.datetime.strptime('2/3/2016', '%m/%d/%Y')
  >>> d0 > d1
  False
  >>> d0 < d1
  True

This is good for general date and time handling.  If you are only 
handling date,s then you may wish to use the strptime() function to 
parse your date and then choose to convert it to a date.  See also 
below my point about serialized (on disk) representation and memory 
representation.


Option 2 for date handling:

Consider converting your date to a time struct, which can then by 
used in comparisons and sorting.

  >>> t0 = time.strptime('1/3/2016', '%m/%d/%Y')
  >>> t1 = time.strptime('2/3/2016', '%m/%d/%Y')
  >>> t0 > t1
  False
  >>> t0 < t1
  True

The time library is good, too.  I like it because I have, 
historically, preferred the Epoch timestamp, which is how several 
classes of computers expose their time-tracking to applications.

  >>> time.time()
  1455044423.592147


>I have an if elif statement to read row5 which is the date row. My 
>if statement is the initial pass at returning the values within the 
>date range and then my elif is the parameter and if there is no 
>parameter passed in then it returns all data.

>I am having some trouble trying to pass in a date range parameter. 
>The piece of code is under the def populate_dict function the 
>date_choice part. Here is the code:

Even though I would separate the selection of record from the 
reading and conversion of the data (see more below), I think you 
want to do something more like this (to handle a range of time):

    def populate_dict(self, filename, date_choice, key_choice):
        # -- date_choice must be a list/tuple of datetime types (datetime, datetime)
        with open(filename, 'rb') as f:
            reader = csv.reader(f)
            next(reader, None) # -- skipping the header, eh?
            for row in reader:
                eventdate = datetime.datetime.strptime(row[5], '%m/%d/%Y')

                if date_choice is None:
                   self.dict[row[2]].append(row[3])
                elif date_choice[0] <= eventdate <= date_choice[1]:
                   self.dict[row[2]].append(row[3])

                if key_choice == row[3]:
                   self.dict[row[2]].append(row[3])
                elif key_choice == "none":
                   self.dict[row[2]].append(row[3])

Handling time ranges is always a wee bit tricky.  If you don't 
convert the data into some sort of convenient format before trying 
to apply the range criteria, you'll slowly go crazy trying to figure 
out what is wrong.

How much data do you have?  I'm assuming it is small(er than half a 
million rows).  If so, then, there's no harm in loading it all into 
memory and then filtering in (or out) the stuff you want.

But, most importantly, I have a general suggestion about separating 
the serialized (on-disk) format from the format you are using in 
memory.  I will make reference to another data handling library in 
Python, but mostly to illustrate the idea.

You can read CSV data sets with a tool called pandas [0].  There are 
many reasons it is popular, but I'll examine one feature.  If you 
load a CSV file using pandas, it tries to apply a bit of inference 
logic to each data type as it reads a file.  You said that the sixth 
column of your data is a date-formatted field.

If the pandas library can determine that the sixth field in every 
row can be converted to a datetime() type (or an int() or a float(), 
etc.), it will automatically convert that field.  This is useful 
because the serialized form is a string that looks like '1/3/2016' 
or '2016-02-09 11:13:06' or 'Tue Feb  9 19:13:24 UTC 2016'.

But, you as a human are not thinking "this is a string".  You are 
thinking of it as a date.  So, simply put:

  Serialized form:  a string representation that contains a date
  Form in memory:  a date, datetime, or time type (of some kind)

Again, if you are working with a small amount of data (like 
less than 50MB), don't be afraid to load it into memory.  You can 
make things more efficient later, if need be.  If you have a larger 
data set or you want to perform all sorts of different computations, 
sorting and grouping, you may find a SQLite database a good solution 
(as Joel Goldstick has just posted).

So, finally, if you are merely reading a CSV file, I might suggest 
that you convert the serialized representation of the data (a 
string) into an object or datatype in memory that allows you to 
perform the desired sorting, computation, calculation or 
range-finding.  Also, if you have a smaller block of code and data, 
we may find it easier to make specific suggestions.

And, one last general comment....  Unless you have a specific reason 
why not to do so, it's best to convert and decode inputs into a 
canonical memory representation as soon as possible when reading 
from disk/network.

Good luck,

-Martin

 [0] http://pandas.pydata.org/

-- 
Martin A. Brown
http://linux-ip.net/


More information about the Tutor mailing list