[TriZPUG] Python NLTK/data mining/machine learning project of public research data, anyone interested?

Nathan Rice nathan.alexander.rice at gmail.com
Fri Aug 17 17:47:47 CEST 2012


Hi James,

The standard way to deal with text in this kind of analysis is n-grams.
An n-gram is an ordered sequence of n words, generated by moving a
sliding window over the words of the text.  An example is the easiest
way to make this clear; the bigrams for "the dog jumped over the moon"
(padding the ends with None) would be [(None, "the"), ("the", "dog"),
("dog", "jumped"), ("jumped", "over"), ("over", "the"), ("the", "moon"),
("moon", None)].  Typically the text is pre-processed first to remove
stop words (common words like "the", "a", "it", "is", etc.).  The size
of n depends heavily on the size of your text corpus: with a lot of
training text larger values of n can work, but if you don't have many
examples you want to stick with 2.  Counting the n-grams transforms
each document into a vector of n-gram counts, which lets you use a host
of linear algebraic techniques.

In some cases a dimensionality reduction step is then performed on the
n-gram count vectors; the most common technique is called latent
semantic indexing.  That is just a fancy name for bundling the count
vectors up into a matrix and taking its singular value decomposition,
which factorizes it into a term (feature) matrix, a diagonal matrix of
singular values, and a document matrix; keeping only the largest
singular values gives you the reduced representation.
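
As a rough illustration (an untested sketch; it uses scikit-learn's
CountVectorizer and TruncatedSVD rather than NLTK, and the documents
are just placeholders), the whole bigram-count-plus-LSI pipeline looks
something like this:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the dog jumped over the moon",
        "the cow jumped over the moon",
        "the cat sat on the mat",
        "the dog sat on the mat"]

# Bigram counts, with English stop words removed before forming bigrams.
vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words="english")
counts = vectorizer.fit_transform(docs)     # documents x bigram counts

# Latent semantic indexing: truncated SVD of the count matrix.
lsi = TruncatedSVD(n_components=2)
reduced = lsi.fit_transform(counts)         # documents x latent topics
print(vectorizer.get_feature_names_out())
print(reduced)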

Once you have your reduced-dimensionality announcement corpus, you
will probably want to create a generative model for the price of the
stock as a function of its prior price points (i.e. an autoregressive
model), plus some additional impulse driven by the news feature
vector.  Neural networks such as the restricted Boltzmann machine tend
to work well for this, as do Gaussian processes (watch out, they are
O(n^3) in the number of observations) and various other kernel-based
methods.  I would probably start with a simple ARIMA model, for which
ample source code and modules are available, and work your way up to
the good stuff as you get more comfortable.  The main complicating
factor is the continuous-time dependence combined with limited trading
windows; otherwise the model is pretty straightforward.
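
A minimal sketch of that starting point (untested; it assumes
statsmodels is available, and the price series and news impulse below
are made-up placeholders rather than real data):

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Placeholder data: a random-walk price series plus a single news
# impulse (e.g. the LSI score of an announcement in that period).
prices = pd.Series(np.cumsum(np.random.randn(200)) + 100.0)
news = pd.Series(np.zeros(200))
news.iloc[50] = 1.0

# ARIMA(p, d, q) with the news feature as an exogenous regressor.
model = ARIMA(prices, exog=news, order=(1, 1, 1))
result = model.fit()

# Forecast the next 4 periods, assuming no further announcements.
print(result.forecast(steps=4, exog=np.zeros((4, 1))))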

Take care,

Nathan

On Fri, Aug 17, 2012 at 10:27 AM, James Whisnant <jwhisnant at gmail.com> wrote:
> Nathan - sounds like a very interesting project.  I saw your previous talk
> about symbolic math, which was interesting (although above my math skills).
> I have also been looking at nltk for a project I am thinking about: using
> nltk for meaning extraction from a list of company news releases, starting
> with earnings reports.  The first step would be converting the text to a
> standard format for analysis.
>
> I would be interested in advice on the best way to do the math, and what
> math needs to be done.  I always say that "I like math, but math doesn't
> like me".  Sounds like we have some overlapping goals with nltk.
>
> Being able to determine later something like: 80% of the time when a
> company puts out a release that says "leveraging synergies" (n=30
> companies), its stock value decreases by 5% (std_dev=1%) within 4 hours
> afterwards.  So maybe I should short the company's stock and try to profit
> when they announce they are going to "leverage synergies".  But that is a
> goal for later.
>
> I will keep it on my hack night radar also.
>
>
> On Thu, Aug 16, 2012 at 5:17 PM, Jesse <jessebikman at gmail.com> wrote:
>>
>> I'll keep the next hack night on my radar, this is an interesting project.
>>
>>
>> On Thu, Aug 16, 2012 at 4:56 PM, Nathan Rice
>> <nathan.alexander.rice at gmail.com> wrote:
>>>
>>> On Thu, Aug 16, 2012 at 4:13 PM, Jesse <jessebikman at gmail.com> wrote:
>>> > I don't know how helpful I'd be, but I'd like to at least check out
>>> > what
>>> > you're doing. I just started programming in Python last month. When
>>> > could
>>> > this happen? Are you near Chapel Hill?
>>>
>>> I work at UNC.  I could demonstrate some stuff at a hack night.  I'm
>>> still in the planning stages for most of the stuff; I have the PubMed
>>> extraction code pretty well nailed, and I have a solid outline for the
>>> article disqualification (create a feature vector out of topic and
>>> abstract bigrams, MeSH subject headings and journal, use an SVM
>>> discriminator and manually generate an ROC curve to determine the
>>> cutoff score), but I'm still very up in the air regarding NL extraction
>>> of things like sample size, significance, etc.  If you'd like to learn
>>> more I would of course be happy to go over my thoughts on the matter
>>> and we can play around with some code.
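>>>
>>> A rough sketch of that discriminator step (just an untested outline
>>> using scikit-learn, with a synthetic stand-in for the real bigram/MeSH
>>> feature vectors):
>>>
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.svm import SVC
>>> from sklearn.metrics import roc_curve
>>>
>>> # Synthetic stand-in for the article feature vectors and keep/discard labels.
>>> X, y = make_classification(n_samples=300, n_features=50, random_state=0)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
>>>
>>> clf = SVC(kernel="linear")
>>> clf.fit(X_train, y_train)
>>>
>>> # Decision scores rather than hard labels, so thresholds can be swept.
>>> scores = clf.decision_function(X_test)
>>> fpr, tpr, thresholds = roc_curve(y_test, scores)
>>> # Pick the cutoff score that gives an acceptable false positive rate.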
>>>
>>> Nathan
>>
>>
>>
>>
>> --
>> Jesse Bikman
>>
>> _______________________________________________
>> TriZPUG mailing list
>> TriZPUG at python.org
>> http://mail.python.org/mailman/listinfo/trizpug
>> http://trizpug.org is the Triangle Zope and Python Users Group
>
>

