Seperating CSV rows into new, seperate files
Peter Otten
__peter__ at web.de
Sat Aug 14 06:22:03 EDT 2004
TonyB wrote:
> I've searched the group and need more information and guidance on this
> issue I need to resolve next week. I work for the local school system
> and I am working on a way to parse a CSV file of class lists from MS
> Excel. It is a ~2MB file with all teachers and all their classes in
> it. I thought I would use Python to do this since many people I know
> tell me how great it is. The real sticking point is there are
> duplicates of each teacher's classes embedded in the file (don't ask
> why). I want to break up the source file into the seperate class
> files without duplication and use the teachers name, coursecode and
> section in the filename.
>
> Example Data:
> Last Name First Name Grade Period Teacher Name
> SMITH JOHN 8 1 JONES SALLY
>
> Student ID Course Code Course Title
> 12345678 1234 ALGEBRA I
>
> Course Section Session
> 1 0
>
>
> The course code and section number are the same in the duplicate
> classes. The difference is that the session number changes. I'm
> thinking that I will need to somehow iterate through the lines and
> break the class when the course code changes and/or section change
> (course codes can be the same but section numbers change with each
> class). Then I have to monitor the session number to see if there is
> a duplicate course code and section but a different session number
> would indicate that the class is a duplicate. Essentially, the session
> number will be 0, 1, or 2 for a total of three lists of the same
> class. Hopefully that makes sense. Again don't ask why the data is
> like this.
>
> My humble attempts have not been very successful. I've been trying
> the csv modules to read the file but I am not sure how to procede
> after that.
>
> Any guidance is greatly appreciated.
>From the above I think you can use the (student id, course code, course
section) tuple as the unique key when reading the data via the cvs module
into a dictionary instead of a list. (For larger data you would need a
database - the humble MS Access could have saved you from the trouble with
duplicates in the first place, btw.)
#untested
STUDENT_ID = 5
COURSE_CODE = 6
COURSE_SECTION = 8
data = {}
for row in csv.reader(instream):
# you could do more processing here,
# e. g. with nested course/student
# dictionaries, but let's keep it simple
data[row[STUDENT_ID], row[COURSE_CODE], row[COURSE_SECTION]] = row
Subsequent sessions will then just overwrite data already in the dictionary.
Note that you are throwing away some information as I am pretty sure (don't
ask why :-) that the duplicates will not be completely identical.
You can split this raw data into classes like so:
#untested
klasses = {}
for row in data.values():
key = row[COURSE_CODE], row[COURSE_SECTION]
if key in klasses:
klasses[key].append(row)
else:
klasses[key] = [row]
If you feel comfortable with (Python) classes you can make the dictionary
values instances of a Row class that lets you access attributes by name, e.
g. row.studentID instead of row[STUDENT_ID].
If these hints aren't sufficient to get you started it would help if you
showed the code you already have to avoid duplicate effort.
Peter
PS: Who am I to talk about orthography, but it's "separate".
More information about the Python-list
mailing list