Seperating CSV rows into new, seperate files

Sat Aug 14 06:22:03 EDT 2004

TonyB wrote:

> I've searched the group and need more information and guidance on this
> issue I need to resolve next week. I work for the local school system
> and I am working on a way to parse a CSV file of class lists from MS
> Excel.  It is a ~2MB file with all teachers and all their classes in
> it.  I thought I would use Python to do this since many people I know
> tell me how great it is.  The real sticking point is there are
> duplicates of each teacher's classes embedded in the file (don't ask
> why).  I want to break up the source file into the seperate class
> files without duplication and use the teachers name, coursecode and
> section in the filename.
> 
> Example Data:
> Last Name     First Name      Grade   Period  Teacher Name
> SMITH           JOHN           8      1       JONES   SALLY
> 
> Student ID            Course Code     Course Title
> 12345678              1234             ALGEBRA I
> 
> Course Section        Session
> 1             0
> 
> 
> The course code and section number are the same in the duplicate
> classes.  The difference is that the session number changes.   I'm
> thinking that I will need to somehow iterate through the lines and
> break the class when the course code changes and/or section change
> (course codes can be the same but section numbers change with each
> class).  Then I have to monitor the session number to see if there is
> a duplicate course code and section but a different session number
> would indicate that the class is a duplicate. Essentially, the session
> number will be 0, 1, or 2 for a total of three lists of the same
> class. Hopefully that makes sense.  Again don't ask why the data is
> like this.
> 
> My humble attempts have not been very successful.  I've been trying
> the csv modules to read the file but I am not sure how to procede
> after that.
> 
> Any guidance is greatly appreciated.

>From the above I think you can use the (student id, course code, course
section) tuple as the unique key when reading the data via the cvs module
into a dictionary instead of a list. (For larger data you would need a
database - the humble MS Access could have saved you from the trouble with
duplicates in the first place, btw.)

#untested
STUDENT_ID = 5
COURSE_CODE = 6
COURSE_SECTION = 8
data = {}
for row in csv.reader(instream):
    # you could do more processing here,
    # e. g. with nested course/student
    # dictionaries, but let's keep it simple
    data[row[STUDENT_ID], row[COURSE_CODE], row[COURSE_SECTION]] = row

Subsequent sessions will then just overwrite data already in the dictionary.
Note that you are throwing away some information as I am pretty sure (don't
ask why :-) that the duplicates will not be completely identical.

You can split this raw data into classes like so:

#untested
klasses = {}
for row in data.values():
    key = row[COURSE_CODE], row[COURSE_SECTION]
    if key in klasses:
        klasses[key].append(row)
    else:
        klasses[key] = [row]

If you feel comfortable with (Python) classes you can make the dictionary
values instances of a Row class that lets you access attributes by name, e.
g. row.studentID instead of row[STUDENT_ID]. 

If these hints aren't sufficient to get you started it would help if you
showed the code you already have to avoid duplicate effort.

Peter

PS: Who am I to talk about orthography, but it's "separate".