[TriZPUG] Python NLTK/data mining/machine learning project of public research data, anyone interested?

Thu Aug 16 20:31:00 CEST 2012

Hi All,

Normally, my projects are pretty boring, and I prefer to endure the
suffering in solitary silence.  As luck would have it though, I
actually have an interesting project on my plate currently, and I
think it is cool enough that I wanted to give other people the
opportunity to stick their noses in and provide input or play with
some code.

I am currently involved in compiling a database of medical data
(published clinical or pre-clinical trials) on ethnomedicinal and
alternative-medicine treatments, for semi-automated meta-analysis
and treatment guidance.  In order for this to work, a lot of technical
challenges have to be overcome:

My initial tally from PubMed puts the number of articles at over
70,000; based on visual inspection, many of these are not actually
applicable, but there are limited filtering options via the Entrez web
API.  Machine learning techniques would probably be very helpful for
scoring articles for applicability, and for ignoring ones that are
clearly inapplicable.
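
To make the scoring idea concrete, here is a minimal sketch of one
possible approach: a bag-of-words naive Bayes classifier with add-one
smoothing, written against the standard library only.  The class name,
the tokenizer, and the four training sentences are all made up for
illustration; a real run would train on hand-labelled PubMed abstracts
and would likely use NLTK or similar instead.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class RelevanceScorer:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, applicable):
        """Record one labelled example (applicable is True or False)."""
        self.word_counts[applicable].update(tokenize(text))
        self.doc_counts[applicable] += 1

    def _log_likelihood(self, tokens, label):
        # Assumes at least one training example exists for each label.
        counts = self.word_counts[label]
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total = sum(counts.values()) + len(vocab)
        prior = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        return prior + sum(math.log((counts[t] + 1) / total) for t in tokens)

    def score(self, text):
        """Return log-odds of applicability; positive means likely applicable."""
        tokens = tokenize(text)
        return self._log_likelihood(tokens, True) - self._log_likelihood(tokens, False)


# Invented placeholder examples, not real PubMed records.
scorer = RelevanceScorer()
scorer.train("randomized trial of herbal extract in patients with arthritis", True)
scorer.train("placebo controlled trial of acupuncture for chronic pain", True)
scorer.train("history of traditional medicine in regional folklore", False)
scorer.train("survey of attitudes toward alternative medicine among students", False)

print(scorer.score("controlled trial of herbal extract for pain"))
print(scorer.score("survey of folklore attitudes"))
```

Returning log-odds rather than a hard yes/no keeps the borderline
articles visible, so a threshold can be tuned later or the uncertain
ones sent to a human for review.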

In order to perform meta-analysis and treatment guidance, each article
needs to be mined for the treatment, the condition, the circumstances
of both, and whether the treatment was successful (with a p-value and
sample size).  Most of this is not available as standard metadata
for the studies, and must be mined from the text.
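
As a rough starting point for the p-value and sample size part, a
couple of regular expressions can pull the most common notations out
of abstract text.  This is only a sketch: the patterns, the function
name, and the sample sentence are my own assumptions, and real
abstracts use many more notations (for example "P less than 0.05" or
"forty-two patients") that these patterns will miss.

```python
import re

# Matches forms such as "p < 0.05", "P=.2", "p > 0.10".
P_VALUE = re.compile(r"\bp\s*([<>=])\s*(0?\.\d+)", re.IGNORECASE)
# Matches forms such as "n = 42" or "N=40".
SAMPLE_SIZE = re.compile(r"\bn\s*=\s*(\d+)", re.IGNORECASE)


def extract_statistics(text):
    """Return the reported p-values (with comparison operator) and sample sizes."""
    p_values = [(op, float(value)) for op, value in P_VALUE.findall(text)]
    sample_sizes = [int(n) for n in SAMPLE_SIZE.findall(text)]
    return {"p_values": p_values, "sample_sizes": sample_sizes}


# Invented sentence for illustration, not taken from a real study.
sentence = ("Pain scores improved in the treatment group (n = 42, P < 0.05) "
            "versus control (n=40, p=.2).")
stats = extract_statistics(sentence)
print(stats)
```

Treatment, condition, and circumstances are much harder than this and
probably need real NLP (tagging, chunking, entity recognition) rather
than regular expressions.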

In addition, not all studies are equal.  Methodological errors, lack
of reproducibility, and so forth can all render a study meaningless.
Thus, studies must have a scoring mechanism so you can avoid tainting
meta-analyses with biased data.  This scoring mechanism will probably
include the impact factor of the journal, the g/h-index of the
authors, the number of (non-self) citations, etc.
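
One simple way such a score could be put together is a weighted sum
of capped, normalized features.  The sketch below is purely
hypothetical: the field names, the caps, and the weights are
placeholders I picked for illustration, not values anyone has
validated, and the real mechanism may look quite different.

```python
from dataclasses import dataclass


@dataclass
class Study:
    impact_factor: float       # journal impact factor
    max_author_h_index: int    # highest h-index among the authors
    citations: int             # total citations of the article
    self_citations: int        # citations coming from the authors themselves


# Hypothetical caps (value at which a feature saturates) and weights.
CAPS = {"impact_factor": 10.0, "h_index": 50.0, "non_self_citations": 100.0}
WEIGHTS = {"impact_factor": 0.4, "h_index": 0.3, "non_self_citations": 0.3}


def quality_score(study):
    """Return a score in [0, 1]; higher means more weight in a meta-analysis."""
    non_self = max(study.citations - study.self_citations, 0)
    features = {
        "impact_factor": study.impact_factor,
        "h_index": study.max_author_h_index,
        "non_self_citations": non_self,
    }
    # Each feature is scaled to [0, 1] by its cap before weighting.
    return sum(WEIGHTS[name] * min(value / CAPS[name], 1.0)
               for name, value in features.items())


# Invented numbers for illustration.
strong = Study(impact_factor=10.0, max_author_h_index=50, citations=120, self_citations=20)
weak = Study(impact_factor=1.0, max_author_h_index=5, citations=10, self_citations=10)
print(quality_score(strong), quality_score(weak))
```

Capping each feature keeps one outlier (say, a single very highly
cited paper) from dominating the score.  Methodological checks would
still need to be scored separately, since none of these bibliometric
inputs says anything about study design.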

As you can see, each of these is meaty, and all of them need to be
taken care of to get good results :)  If anyone is interested in
getting some serious natural language processing/data mining/machine
learning practice, I'd love to involve you.  There's no reason I
should have all the fun!

Take care,

Nathan Rice