[Tutor] creating a corpus from a csv file
Peter Otten
__peter__ at web.de
Sat May 4 10:29:57 CEST 2013
Treder, Robert wrote:
> I'm very new to python and am trying to figure out how to make a corpus
> from a text file. I have a csv file (actually pipe '|' delimited) where
> each row corresponds to a different text document. Each row contains a
> communication note. Other columns correspond to categories of types of
> communications. I am able to read the csv file and print the notes column
> as follows:
>
> import csv
> with open('notes.txt', 'rb') as infile:
>     reader = csv.reader(infile, delimiter = '|')
>     i = 0
>     for row in reader:
>         if i <= 25: print row[8]
>         i = i+1
>
> I would like to convert this to a categorized corpus with some of the
> other columns corresponding to the categories. All of the columns are text
> (i.e., strings). I have looked for documentation on how to use csv.reader
> with PlaintextCorpusReader but have been unsuccessful in finding an
> example similar to what I want to do. Can someone please help?

This mailing list is for learning Python. For problems with a specific
library you should use the general Python list
<http://mail.python.org/mailman/listinfo/python-list>
or a forum dedicated to that library
<http://groups.google.com/group/nltk-users>
If you ask on a general forum you should give some context -- the name of
the library would be the bare minimum.
The following comes with no warranties as I'm not an nltk user:

import csv
from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader
from itertools import islice, chain

LIMIT_SIZE = 25 # set to None if not debugging

def pairs(filename):
    """Generate (filename, list_of_categories) pairs from a csv file
    """
    with open(filename, "rb") as infile:
        rows = islice(csv.reader(infile, delimiter="|"), LIMIT_SIZE)
        for row in rows:
            # assume that columns 10 and above contain categories
            yield row[8], row[9:]

if __name__ == "__main__":
    import random
    FILENAME = "notes.txt"

    # assume that every filename occurs only once in the file
    file_to_categories = dict(pairs(FILENAME))
    files = list(file_to_categories)
    all_categories = set(
        chain.from_iterable(file_to_categories.itervalues()))

    reader = CategorizedPlaintextCorpusReader(".", files,
                                              cat_map=file_to_categories)

    # print words for a random category
    category = random.choice(list(all_categories))
    print "words for category {}:".format(category)
    print sorted(set(reader.words(categories=category)))
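
Note that the script treats row[8] as a file name, while the question describes that column as the note text itself. If the notes really live inside the csv file, one possible bridging step is to write each note to its own small text file first and build the category mapping from the generated names. The following is only a sketch under several assumptions: it is written for Python 3 (unlike the Python 2 code above), and the function name export_notes, the output directory, the note_00000.txt naming scheme, and the utf-8 encoding are all made up for illustration. It uses only the standard library.

```python
import csv
import os
from itertools import islice


def export_notes(csv_path, out_dir, text_col=8, first_cat_col=9, limit=None):
    """Write each note to its own file; return {filename: [categories]}.

    Hypothetical helper: the column positions follow the question
    (note text in column 9, categories assumed to follow it).
    """
    os.makedirs(out_dir, exist_ok=True)
    cat_map = {}
    # In Python 3 the csv module wants a text-mode file opened with newline="".
    # The utf-8 encoding is an assumption about the data.
    with open(csv_path, newline="", encoding="utf-8") as infile:
        # islice(..., None) passes every row through; set limit while debugging
        rows = islice(csv.reader(infile, delimiter="|"), limit)
        for index, row in enumerate(rows):
            if len(row) <= text_col:
                continue  # skip blank or truncated rows
            name = "note_{:05d}.txt".format(index)
            with open(os.path.join(out_dir, name), "w", encoding="utf-8") as out:
                out.write(row[text_col])
            # ignore empty category cells
            cat_map[name] = [cat for cat in row[first_cat_col:] if cat]
    return cat_map
```

The returned dictionary has the same shape as file_to_categories in the script above, so it could presumably be passed as cat_map, with out_dir as the corpus root and list(cat_map) as the file list. I have not tried that against nltk.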