[Tutor] creating a corpus from a csv file

Sat May 4 10:29:57 CEST 2013

Treder, Robert wrote:

> I'm very new to python and am trying to figure out how to make a corpus
> from a text file. I have a csv file (actually pipe '|' delimited) where
> each row corresponds to a different text document. Each row contains a
> communication note. Other columns correspond to categories of types of
> communications. I am able to read the csv file and print the notes column
> as follows:
>  
> import csv
> with open('notes.txt', 'rb') as infile:
>     reader = csv.reader(infile, delimiter = '|')
>     i = 0
>     for row in reader:
>         if i <= 25: print row[8]
>         i = i+1
> 
> I would like to convert this to a categorized corpus with some of the
> other columns corresponding to the categories. All of the columns are text
> (i.e., strings). I have looked for documentation on how to use csv.reader
> with PlaintextCorpusReader but have been unsuccessful in finding an
> example similar to what I want to do. Can someone please help?

This mailing list is for learning Python. For problems with a specific 
library you should use the general python list 

<http://mail.python.org/mailman/listinfo/python-list>

or a forum dedicated to that library

<http://groups.google.com/group/nltk-users>

If you ask on a general forum you should give some context -- the name of 
the library would be the bare minimum.

The following comes with no warranties as I'm not an nltk user:

import csv
from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader
from itertools import islice, chain

LIMIT_SIZE = 25 # set to None if not debugging

def pairs(filename):
    """Generate (filename, list_of_categories) pairs from a csv file
    """
    with open(filename, "rb") as infile:
        rows = islice(csv.reader(infile, delimiter="|"), LIMIT_SIZE)
        for row in rows:
            # assume that column 9 contains the file name and that
            # columns 10 and above contain categories
            yield row[8], row[9:]

if __name__ == "__main__":
    import random
    FILENAME = "notes.txt"

    # assume that every filename occurs only once in the file
    file_to_categories = dict(pairs(FILENAME))

    files = list(file_to_categories)

    all_categories = set(
        chain.from_iterable(file_to_categories.itervalues()))

    reader = CategorizedPlaintextCorpusReader(
        ".", files, cat_map=file_to_categories)

    # print words for a random category
    category = random.choice(list(all_categories))
    print "words for category {}:".format(category)
    print sorted(set(reader.words(categories=category)))
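
The script above treats row[8] as a file name, while in the question that
column holds the note text itself. The corpus reader loads its documents from
files, so one possible way to bridge the gap is to write every note to its own
file first and build the file-to-categories mapping along the way. The
following is only a sketch under assumptions: it is written for Python 3
(unlike the Python 2 script above), the function name write_notes and the
note_00000.txt naming scheme are made up for illustration, and it assumes the
note is in column 9 with all later columns being categories.

```python
import csv
import os
from itertools import islice


def write_notes(csv_path, out_dir, note_col=8, limit=None):
    """Write one text file per csv row.

    Returns a dict mapping each new file name to its list of categories.
    Assumption: column `note_col` holds the note text and every later
    column holds a category.
    """
    file_to_categories = {}
    # newline="" lets the csv module handle line endings itself
    with open(csv_path, newline="", encoding="utf-8") as infile:
        # limit=None means "read all rows"
        rows = islice(csv.reader(infile, delimiter="|"), limit)
        for index, row in enumerate(rows):
            # hypothetical naming scheme: note_00000.txt, note_00001.txt, ...
            name = "note_{:05d}.txt".format(index)
            path = os.path.join(out_dir, name)
            with open(path, "w", encoding="utf-8") as outfile:
                outfile.write(row[note_col])
            # drop empty cells so they do not turn into an empty category
            categories = [cell for cell in row[note_col + 1:] if cell]
            file_to_categories[name] = categories
    return file_to_categories
```

The returned dict has the same shape as file_to_categories in the script
above, so it should be usable as the cat_map argument, with out_dir as the
corpus root instead of ".". Like the script, this is untested against nltk.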