[Tutor] creating a corpus from a csv file

Mon May 13 16:12:31 CEST 2013

Message: 1
Date: Fri, 03 May 2013 23:05:32 +0100
From: Alan Gauld <alan.gauld at btinternet.com>
To: tutor at python.org
Subject: Re: [Tutor] creating a corpus from a csv file
Message-ID: <km1cb8$ist$1 at ger.gmane.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

On 03/05/13 21:48, Treder, Robert wrote:

> I'm very new to python and am trying to figure out how to
> make a corpus from a text file.

Hi, I for one have no idea what a corpus is or looks like so you will need to help us out a little before we can help you.

> I have a csv file (actually pipe '|' delimited) where each row 
> corresponds to a different text document.

> Each row contains a communication note.
> Other columns correspond to categories of types of communications.

> I am able to read the csv file and print the notes column as follows:
>
> import csv
> with open('notes.txt', 'rb') as infile:
>      reader = csv.reader(infile, delimiter = '|')
>      i = 0
>      for row in reader:
>      if i <= 25: print row[8]
>      i = i+1

You don't need to manually manage 'i'.

You could do this instead:

with open('notes.txt', 'rb') as infile:
      reader = csv.reader(infile, delimiter = '|')
      for count, row in enumerate(reader):
          if count <= 25: print row[8]  # I assume indented?
          else: break                   # save time if it's a big file
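
A small aside on the loop above: enumerate starts counting at 0, so the test count <= 25 lets 26 rows through. If exactly 25 rows are wanted, itertools.islice is one way to say so directly. The sketch below is only an illustration: it uses made-up in-memory data instead of notes.txt, and it is written in Python 3 syntax rather than the Python 2 used elsewhere in this thread.

```python
import csv
import io
from itertools import islice

# Hypothetical stand-in for notes.txt: 30 pipe-delimited rows, 9 columns each.
sample = io.StringIO(
    "\n".join("|".join(["x"] * 8 + ["note %d" % n]) for n in range(30))
)

reader = csv.reader(sample, delimiter="|")

# islice stops after 25 rows without a manual counter or an explicit break.
first_notes = [row[8] for row in islice(reader, 25)]

print(len(first_notes))
print(first_notes[0], first_notes[-1])
```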

> I would like to convert this to a categorized corpus with
> some of the other columns corresponding to the categories.

You might be able to use a dictionary but for now I'm still not clear what you mean. Can you show us some sample input and output data?
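
To make the dictionary idea concrete: if "categorized" simply means grouping the notes by the value in a category column, a plain dict of lists may be enough. This is only a sketch under that assumption. The sample rows and the column positions are invented for illustration, and the code uses Python 3 syntax.

```python
import csv
import io
from collections import defaultdict

# Invented sample: column 0 is a category, column 1 is the note text.
sample = io.StringIO(
    "complaint|printer is broken\n"
    "request|please send the report\n"
    "complaint|still broken\n"
)

# Map each category name to the list of notes filed under it.
notes_by_category = defaultdict(list)
for category, note in csv.reader(sample, delimiter="|"):
    notes_by_category[category].append(note)

for category in sorted(notes_by_category):
    print(category, notes_by_category[category])
```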

> documentation on how to use csv.reader with PlaintextCorpusReader

Never heard of the latter. Is it an external module?

HTH
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/

Message: 7
Date: Sat, 04 May 2013 10:29:57 +0200
From: Peter Otten <__peter__ at web.de>
To: tutor at python.org
Subject: Re: [Tutor] creating a corpus from a csv file
Message-ID: <km2gtu$o7a$1 at ger.gmane.org>
Content-Type: text/plain; charset="ISO-8859-1"

Treder, Robert wrote:

> I'm very new to python and am trying to figure out how to make a corpus
> from a text file. I have a csv file (actually pipe '|' delimited) where
> each row corresponds to a different text document. Each row contains a
> communication note. Other columns correspond to categories of types of
> communications. I am able to read the csv file and print the notes column
> as follows:
>  
> import csv
> with open('notes.txt', 'rb') as infile:
>     reader = csv.reader(infile, delimiter = '|')
>     i = 0
>     for row in reader:
>     if i <= 25: print row[8]
>     i = i+1
> 
> I would like to convert this to a categorized corpus with some of the
> other columns corresponding to the categories. All of the columns are text
> (i.e., strings). I have looked for documentation on how to use csv.reader
> with PlaintextCorpusReader but have been unsuccessful in finding an
> example similar to what I want to do. Can someone please help?

This mailing list is for learning Python. For problems with a specific 
library you should use the general Python list 

<http://mail.python.org/mailman/listinfo/python-list>

or a forum dedicated to that library

<http://groups.google.com/group/nltk-users>

If you ask on a general forum you should give some context -- the name of 
the library would be the bare minimum.

The following comes with no warranties as I'm not an nltk user:

import csv
from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader
from itertools import islice, chain

LIMIT_SIZE = 25 # set to None if not debugging

def pairs(filename):
    """Generate (filename, list_of_categories) pairs from a csv file
    """
    with open(filename, "rb") as infile:
        rows = islice(csv.reader(infile, delimiter="|"), LIMIT_SIZE)
        for row in rows:
            # assume that columns 10 and above contain categories
            yield row[8], row[9:]

if __name__ == "__main__":
    import random
    FILENAME = "notes.txt"

    # assume that every filename occurs only once in the file
    file_to_categories = dict(pairs(FILENAME))

    files = list(file_to_categories)

    all_categories = set(chain.from_iterable(file_to_categories.itervalues()))

    reader = CategorizedPlaintextCorpusReader(".", files,
                                              cat_map=file_to_categories)

    # print words for a random category
    category = random.choice(list(all_categories))
    print "words for category {}:".format(category)
    print sorted(set(reader.words(categories=category)))

------------------------------
Alan, Peter, 

Thanks for your responses. Sorry about the lack of context and module information in my initial post. Peter got the context right: creating Python object(s) from a collection of text documents (the corpus) in preparation for text mining and modeling. The modified script from Peter follows. I dropped the size limitation and have included some test data below. 

Problems still exist. The code attempts to read files with names based on concatenating the first and third columns, which is the data coming from the yield. Consequently, I'm convinced I will need to write a custom csvCorpusReader. I've received some tips for that from an nltk email group. 

If anyone has additional suggestions or comments I would love to hear them. 

Thanks, 
Bob

#####  Code below here   #####

import csv
from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader
from itertools import islice, chain

#filename = 'L:/gps_pa/DEV/TextMining/EmailTickerInterest/Data/testNotes.txt'
filename = "C:/nltk_data/corpora/notes/testNotes.txt"

def pairs(filename):
    """Generate (filename, list_of_categories) pairs from a csv file
    """
    with open(filename, "rb") as infile:
        rows = csv.reader(infile, delimiter="|")
        for row in rows:
            yield row[0], row[2]
            print row[0], row[2]

if __name__ == "__main__":
    import random
    FILENAME = "C:/nltk_data/corpora/notes/testNotes.txt"

    # assume that every filename occurs only once in the file
    file_to_categories = dict(pairs(FILENAME))

    files = list(file_to_categories)

    all_categories = set(chain.from_iterable(file_to_categories.itervalues()))
    print all_categories

    reader = CategorizedPlaintextCorpusReader(".", files, cat_map=file_to_categories)

    # print words for a random category
    category = random.choice(list(all_categories))
    print "words for category {}:".format(category)
    print sorted(set(reader.words(categories=category)))
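
One detail in the script above may be worth checking: pairs() now yields row[2], a single string, where Peter's version yielded row[9:], a list of strings. chain.from_iterable then walks each string character by character, so all_categories would end up holding single characters rather than whole category names. A minimal illustration, using made-up dictionary contents in the shape pairs() would produce and Python 3 syntax (.values() in place of .itervalues()):

```python
from itertools import chain

# Values as pairs() would produce them from one column: plain strings.
as_strings = {"1": "101", "2": "102"}
print(sorted(set(chain.from_iterable(as_strings.values()))))
# prints ['0', '1', '2']

# Wrapping each value in a list keeps the category names intact.
as_lists = {"1": ["101"], "2": ["102"]}
print(sorted(set(chain.from_iterable(as_lists.values()))))
# prints ['101', '102']
```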

Some test data looks like the following, the first row being column headers: 

CID|X|MID|note
1|not|101|note 1
2|any|102|note 2
3|thing|103|note 3
4|tbd|104|note 4
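
As far as I can tell, CategorizedPlaintextCorpusReader treats the keys of cat_map as names of files under its root directory, which is presumably why the script goes looking for files that do not exist. One possible workaround, short of a custom corpus reader, is to write each note to its own small file first. The sketch below does only that step, with the standard library and the test data above. The file naming scheme (CID plus ".txt"), the choice of column X as the category, and the helper name write_notes are all assumptions, and the code uses Python 3 syntax.

```python
import csv
import io
import os
import tempfile

# The test data from this message, header row included.
TEST_DATA = """CID|X|MID|note
1|not|101|note 1
2|any|102|note 2
3|thing|103|note 3
4|tbd|104|note 4
"""

def write_notes(infile, root):
    """Write each note to <root>/<CID>.txt and return {file name: [category]}."""
    cat_map = {}
    # DictReader takes the column names from the first row.
    reader = csv.DictReader(infile, delimiter="|")
    for row in reader:
        name = row["CID"] + ".txt"      # assumed naming scheme
        with open(os.path.join(root, name), "w") as out:
            out.write(row["note"])
        cat_map[name] = [row["X"]]      # a list, even for a single category
    return cat_map

root = tempfile.mkdtemp()
cat_map = write_notes(io.StringIO(TEST_DATA), root)
print(sorted(os.listdir(root)))
print(cat_map["3.txt"])
```

The resulting directory and mapping could then be tried as CategorizedPlaintextCorpusReader(root, list(cat_map), cat_map=cat_map), following the call in the script above, though that last step is untested here.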

Modifying Peter's code to get it to run as far as possible.  
