[TriZPUG] TriZPUG Digest, Vol 52, Issue 32

Raymond Falk rymndflk at gmail.com
Wed Aug 22 15:08:51 CEST 2012


Hello All (Nathan Rice, in particular),

I am also interested in this Python NLTK/data mining/machine learning
project of public research data.
I am relatively new to Python, and I have found that integrating
existing classes and methods, and modifying them on the fly, can be
less than graceful.  I did just download NLTK.  Much work to do.

On the data mining front, one important consideration is the
segregation of plausibly independent training and validation data
sets, particularly since linkages via various criteria for association
are a target of the modeling endeavor.  I suggest seeding a collection
of 'distant' documents, selecting a neighborhood of each, and then
training on one and validating on the other.  Varying the threshold
distances and investigating the goodness of (over)fitting may
illuminate the propensity for overfitting.  Nota bene: as a
mathematical fact, finite samples are theoretically unable to
replicate scale-free networks (which are typical of citation-type
association models), so this may remain awkward.
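
A minimal sketch of that splitting idea, assuming the documents have
already been turned into vectors (n-gram counts, TF-IDF, etc.); the
seed indices and the neighborhood size k are arbitrary illustration
choices:

import numpy as np
from scipy.spatial.distance import cdist

def neighborhood_split(X, seed_a, seed_b, k):
    """Train/validation split grown from two 'distant' seed documents.

    X is an (n_docs, n_features) array of document vectors; seed_a and
    seed_b are row indices of two documents chosen to be far apart.
    Returns the indices of the k nearest neighbors of each seed.
    """
    d = cdist(X[[seed_a, seed_b]], X, metric="cosine")  # shape (2, n_docs)
    train_idx = np.argsort(d[0])[:k]   # neighborhood of seed_a
    valid_idx = np.argsort(d[1])[:k]   # neighborhood of seed_b
    # Drop any overlap, in case the seeds were not distant enough.
    valid_idx = np.setdiff1d(valid_idx, train_idx)
    return train_idx, valid_idx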

Also, it is critical to assign sensible defaults for categories of
association that go unobserved in the training data.

Ray Falk

On Sat, Aug 18, 2012 at 6:00 AM,  <trizpug-request at python.org> wrote:
>
> Today's Topics:
>
>    1. Re: Python NLTK/data mining/machine learning project of
>       public research data, anyone interested? (Nathan Rice)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 17 Aug 2012 11:47:47 -0400
> From: Nathan Rice <nathan.alexander.rice at gmail.com>
> To: James Whisnant <jwhisnant at gmail.com>,       "Triangle (North Carolina)
>         Zope and Python Users Group"    <trizpug at python.org>
> Subject: Re: [TriZPUG] Python NLTK/data mining/machine learning
>         project of public research data, anyone interested?
> Message-ID:
>         <CAOFbRm++m++UJsv9RxAVOAwRQn=2zBhQ9CoD8BKPFRXmS571pQ at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Hi James,
>
> The standard way to deal with text in analysis is n-grams.  An n-gram
> is an ordered sequence of words of length "n", generated by moving a
> sliding window over the words of the text.  An example is the easiest
> way to make this clear; the bigrams for "the dog jumped over the
> moon" would be [(None, "the"), ("the", "dog"), ("dog", "jumped"),
> ("jumped", "over"), ("over", "the"), ("the", "moon"), ("moon", None)].
> Typically the text is pre-processed to remove common words like the,
> a, it, is, etc.  The size of n is highly dependent on the size of
> your text corpus - if you have a lot of training text, larger values
> of n can work, but if you don't have a lot of examples you want to
> stick with 2.  This basically transforms the text into a vector of
> n-gram counts, and lets you use a host of linear algebraic
> techniques.  In some instances, a dimensionality reduction step is
> performed on the feature vectors generated from the n-gram counts;
> the most common technique is called latent semantic indexing.  This
> is just a fancy name for bundling the collection of n-gram count
> vectors into a matrix and taking the singular value decomposition of
> that matrix, factorizing it into a feature matrix (each feature
> represented by a unique vector), a diagonal matrix of singular values
> (the variance distribution), and a document weight matrix.
>
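> A minimal sketch of those two steps, assuming NLTK and numpy are
> available; the toy documents and the number of retained dimensions
> are just illustration choices:
>
> import numpy as np
> from nltk.util import ngrams
>
> docs = ["the dog jumped over the moon",
>         "the cow jumped over the moon"]
>
> # Bigrams padded with None, matching the example above.
> doc_bigrams = [list(ngrams(d.split(), 2, pad_left=True, pad_right=True))
>                for d in docs]
>
> # Build a bigram vocabulary and a (documents x bigrams) count matrix.
> vocab = {bg: i for i, bg in
>          enumerate({bg for doc in doc_bigrams for bg in doc})}
> counts = np.zeros((len(docs), len(vocab)))
> for row, doc in enumerate(doc_bigrams):
>     for bg in doc:
>         counts[row, vocab[bg]] += 1
>
> # Latent semantic indexing: keep only the top k singular directions.
> u, s, vt = np.linalg.svd(counts, full_matrices=False)
> k = 2
> reduced_docs = u[:, :k] * s[:k]   # one reduced vector per document
>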
> Once you have your reduced-dimensionality announcement corpus, you
> will probably want to create a generative model for the price of the
> stock as a function of its prior price points (thus an autoregressive
> model) along with some new impulse that results from the information
> vector.  Neural networks such as the restricted Boltzmann machine
> tend to work well for this, as do Gaussian processes (watch out,
> O(n^3)) and various other kernel-based methods.  I would probably
> start with a simple ARIMA model, for which ample source code and
> modules are available, and work your way up to the good stuff as you
> get more comfortable.  The continuous time dependence with limited
> trading windows is the main complicating factor in this model;
> otherwise it is pretty straightforward.
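>
> As a sketch of that starting point (assuming a reasonably recent
> statsmodels and pandas), with a random-walk price series and a single
> "news" column standing in for real prices and the reduced n-gram
> vectors:
>
> import numpy as np
> import pandas as pd
> from statsmodels.tsa.arima.model import ARIMA
>
> rng = np.random.default_rng(0)
> idx = pd.date_range("2012-01-02", periods=200, freq="D")
> impulse = pd.DataFrame({"news": rng.normal(size=200)}, index=idx)
> prices = pd.Series(
>     100 + np.cumsum(rng.normal(size=200) + 0.5 * impulse["news"]),
>     index=idx)
>
> # ARIMA(1, 1, 1) with the news impulse as an exogenous regressor;
> # the order is a placeholder, not a recommendation.
> fitted = ARIMA(prices, exog=impulse, order=(1, 1, 1)).fit()
> print(fitted.summary())
>
> # One-step-ahead forecast given the next period's impulse value.
> print(fitted.forecast(steps=1, exog=[[0.0]]))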
>
> Take care,
>
> Nathan
>
> On Fri, Aug 17, 2012 at 10:27 AM, James Whisnant <jwhisnant at gmail.com> wrote:
>> Nathan - sounds like a very interesting project.  I saw your previous
>> talk about symbolic math, which was interesting (although above my math
>> skills).  I have also been looking at NLTK for a project I am thinking
>> about: meaning extraction from a list of company news releases,
>> starting with earnings reports.  The first step would be converting the
>> text to a standard format for analysis.
>>
>> I would be interested in advice on the best way to do the math, and
>> what math needs to be done.  I always say, "I like math, but math
>> doesn't like me."  Sounds like we have some overlapping goals with NLTK.
>>
>> Being able to determine, later, something like: 80% of the time when a
>> company puts out a release that says "leveraging synergies" (n=30
>> companies), its stock value decreases by 5% (std_dev=1%) within 4 hours
>> afterwards.  So maybe I should short the company's stock and try to
>> profit when it announces it is going to "leverage synergies".  But that
>> is a goal for later.
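>>
>> As a rough sketch of that kind of check with pandas - the 'releases'
>> and 'prices' DataFrames are hypothetical, and the phrase, window, and
>> threshold are just the numbers above:
>>
>> import pandas as pd
>>
>> def drop_rate(releases, prices, phrase="leveraging synergies",
>>               window=pd.Timedelta(hours=4), threshold=-0.05):
>>     """Fraction of matching releases followed by a >=5% drop in 4 hours.
>>
>>     releases: DataFrame with columns ['time', 'ticker', 'text']
>>     prices:   DataFrame with columns ['time', 'ticker', 'price']
>>     """
>>     hits = releases[releases["text"].str.contains(phrase, case=False)]
>>     outcomes = []
>>     for _, r in hits.iterrows():
>>         px = prices[(prices["ticker"] == r["ticker"]) &
>>                     (prices["time"] >= r["time"]) &
>>                     (prices["time"] <= r["time"] + window)]
>>         px = px.sort_values("time")
>>         if len(px) < 2:
>>             continue
>>         ret = px["price"].iloc[-1] / px["price"].iloc[0] - 1
>>         outcomes.append(ret <= threshold)
>>     return sum(outcomes) / len(outcomes) if outcomes else float("nan")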
>>
>> I will keep it on my hack night radar also.
>>
>>
>> On Thu, Aug 16, 2012 at 5:17 PM, Jesse <jessebikman at gmail.com> wrote:
>>>
>>> I'll keep the next hack night on my radar, this is an interesting project.
>>>
>>>
>>> On Thu, Aug 16, 2012 at 4:56 PM, Nathan Rice
>>> <nathan.alexander.rice at gmail.com> wrote:
>>>>
>>>> On Thu, Aug 16, 2012 at 4:13 PM, Jesse <jessebikman at gmail.com> wrote:
>>>> > I don't know how helpful I'd be, but I'd like to at least check out
>>>> > what
>>>> > you're doing. I just started programming in Python last month. When
>>>> > could
>>>> > this happen? Are you near Chapel Hill?
>>>>
>>>> I work at UNC.  I could demonstrate some stuff at a hack night.  I'm
>>>> still in the planning stages for most of it; I have the PubMed
>>>> extraction code pretty well nailed, and I have a solid outline for the
>>>> article disqualification step (create a feature vector out of topic
>>>> and abstract bigrams, MeSH subject headings and journal; use an SVM
>>>> discriminator; and manually generate an ROC curve to determine the
>>>> cutoff score), but I'm still very up in the air regarding NL
>>>> extraction of things like sample size, significance, etc.  If you'd
>>>> like to learn more I would of course be happy to go over my thoughts
>>>> on the matter, and we can play around with some code.
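>>>>
>>>> As a sketch of that disqualification step (using scikit-learn as one
>>>> possible toolkit, assuming a recent version; the toy texts and labels
>>>> below stand in for the real topic/abstract and MeSH data):
>>>>
>>>> from sklearn.feature_extraction.text import CountVectorizer
>>>> from sklearn.metrics import roc_curve
>>>> from sklearn.model_selection import train_test_split
>>>> from sklearn.svm import LinearSVC
>>>>
>>>> # Toy stand-ins for topic+abstract text and include/exclude labels.
>>>> texts = ["randomized trial of drug a in mice",
>>>>          "review of prior work on drug a",
>>>>          "cohort study measuring outcome b",
>>>>          "editorial comment on policy c"] * 25
>>>> labels = [1, 0, 1, 0] * 25
>>>>
>>>> # Bigram counts; MeSH headings and journal could be appended as
>>>> # additional columns of the feature matrix.
>>>> X = CountVectorizer(ngram_range=(2, 2)).fit_transform(texts)
>>>>
>>>> X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3,
>>>>                                           random_state=0)
>>>> clf = LinearSVC().fit(X_tr, y_tr)
>>>>
>>>> # ROC curve over the decision scores; pick the cutoff by inspection.
>>>> fpr, tpr, cutoffs = roc_curve(y_te, clf.decision_function(X_te))
>>>> for f, t, c in zip(fpr, tpr, cutoffs):
>>>>     print("cutoff %.3f -> TPR %.2f, FPR %.2f" % (c, t, f))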
>>>>
>>>> Nathan
>>>
>>>
>>>
>>>
>>> --
>>> Jesse Bikman
>>>
>
>
> ------------------------------
>
>
>
> End of TriZPUG Digest, Vol 52, Issue 32
> ***************************************

