[Tutor] Help with date range
Martin A. Brown
martin at linux-ip.net
Tue Feb 9 14:37:04 EST 2016
Greetings Chelsea,
>So what I am trying to do is take in a csv file and the format of
>the csv file is:
While I do not have a suggestion to you about the Python you have
sent to the list, I have a few suggestions about how to approach the
problem differently. I hope that this may make your task easier.
>something, something1, something2,something3, something4,
>something5, something6, something7. Then in the csv file I want to
>search for a certain range of dates like 1/3/2016 - 2/3/2016. I can
>get individual dates but not a range of dates.
Option 1 for date handling:
Consider converting your date column to a datetime object, which
then can be used in comparisons (and sorting).
>>> d0 = datetime.datetime.strptime('1/3/2016', '%m/%d/%Y')
>>> d1 = datetime.datetime.strptime('2/3/2016', '%m/%d/%Y')
>>> d0 > d1
False
>>> d0 < d1
True
This is good for general date and time handling. If you are only
handling date,s then you may wish to use the strptime() function to
parse your date and then choose to convert it to a date. See also
below my point about serialized (on disk) representation and memory
representation.
Option 2 for date handling:
Consider converting your date to a time struct, which can then by
used in comparisons and sorting.
>>> t0 = time.strptime('1/3/2016', '%m/%d/%Y')
>>> t1 = time.strptime('2/3/2016', '%m/%d/%Y')
>>> t0 > t1
False
>>> t0 < t1
True
The time library is good, too. I like it because I have,
historically, preferred the Epoch timestamp, which is how several
classes of computers expose their time-tracking to applications.
>>> time.time()
1455044423.592147
>I have an if elif statement to read row5 which is the date row. My
>if statement is the initial pass at returning the values within the
>date range and then my elif is the parameter and if there is no
>parameter passed in then it returns all data.
>I am having some trouble trying to pass in a date range parameter.
>The piece of code is under the def populate_dict function the
>date_choice part. Here is the code:
Even though I would separate the selection of record from the
reading and conversion of the data (see more below), I think you
want to do something more like this (to handle a range of time):
def populate_dict(self, filename, date_choice, key_choice):
# -- date_choice must be a list/tuple of datetime types (datetime, datetime)
with open(filename, 'rb') as f:
reader = csv.reader(f)
next(reader, None) # -- skipping the header, eh?
for row in reader:
eventdate = datetime.datetime.strptime(row[5], '%m/%d/%Y')
if date_choice is None:
self.dict[row[2]].append(row[3])
elif date_choice[0] <= eventdate <= date_choice[1]:
self.dict[row[2]].append(row[3])
if key_choice == row[3]:
self.dict[row[2]].append(row[3])
elif key_choice == "none":
self.dict[row[2]].append(row[3])
Handling time ranges is always a wee bit tricky. If you don't
convert the data into some sort of convenient format before trying
to apply the range criteria, you'll slowly go crazy trying to figure
out what is wrong.
How much data do you have? I'm assuming it is small(er than half a
million rows). If so, then, there's no harm in loading it all into
memory and then filtering in (or out) the stuff you want.
But, most importantly, I have a general suggestion about separating
the serialized (on-disk) format from the format you are using in
memory. I will make reference to another data handling library in
Python, but mostly to illustrate the idea.
You can read CSV data sets with a tool called pandas [0]. There are
many reasons it is popular, but I'll examine one feature. If you
load a CSV file using pandas, it tries to apply a bit of inference
logic to each data type as it reads a file. You said that the sixth
column of your data is a date-formatted field.
If the pandas library can determine that the sixth field in every
row can be converted to a datetime() type (or an int() or a float(),
etc.), it will automatically convert that field. This is useful
because the serialized form is a string that looks like '1/3/2016'
or '2016-02-09 11:13:06' or 'Tue Feb 9 19:13:24 UTC 2016'.
But, you as a human are not thinking "this is a string". You are
thinking of it as a date. So, simply put:
Serialized form: a string representation that contains a date
Form in memory: a date, datetime, or time type (of some kind)
Again, if you are working with a small amount of data (like
less than 50MB), don't be afraid to load it into memory. You can
make things more efficient later, if need be. If you have a larger
data set or you want to perform all sorts of different computations,
sorting and grouping, you may find a SQLite database a good solution
(as Joel Goldstick has just posted).
So, finally, if you are merely reading a CSV file, I might suggest
that you convert the serialized representation of the data (a
string) into an object or datatype in memory that allows you to
perform the desired sorting, computation, calculation or
range-finding. Also, if you have a smaller block of code and data,
we may find it easier to make specific suggestions.
And, one last general comment.... Unless you have a specific reason
why not to do so, it's best to convert and decode inputs into a
canonical memory representation as soon as possible when reading
from disk/network.
Good luck,
-Martin
[0] http://pandas.pydata.org/
--
Martin A. Brown
http://linux-ip.net/
More information about the Tutor
mailing list