[scikit-learn] combining datasets from different sources

Jason Rudy jcrudy at gmail.com
Tue Sep 5 13:02:57 EDT 2017


Thomas,

This is sort of related to the problem I did my M.S. thesis on years ago:
cross-platform normalization of gene expression data.  If you google that
term you'll find some papers.  The situation is somewhat different, though,
because with microarrays or RNA-seq you get thousands of data points for
each experiment, which makes it easier to estimate the batch effect.  The
principle is similar, however.

If I were in your situation, I would consider whether I have any of the
following advantages:

1. Some molecules that appear in multiple data sets
2. Detailed information about the different experimental conditions
3. Physical/chemical models of how experimental conditions influence
binding affinity

If you have any of the above, you can potentially use them to improve your
estimates.  You could also consider using experiment ID as a categorical
predictor in a sufficiently general regression method.
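
To make the last suggestion concrete, here is a minimal sketch on synthetic
data. Everything in it is an assumption for illustration: the experiment
names, the two made-up descriptors, and an additive per-experiment shift as
the batch effect. A plain linear model can only absorb such an additive
shift; a more general regressor (a tree ensemble, say) could be swapped in
if the experimental conditions interact with the descriptors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 3 experiments, 30 molecules each,
# 2 hypothetical molecular descriptors per molecule.
n_per_exp = 30
experiment_id = np.repeat(np.array(["exp_a", "exp_b", "exp_c"]), n_per_exp)
descriptors = rng.normal(size=(experiment_id.size, 2))

# Assumed data-generating process: a shared structure-activity signal plus
# an additive offset per experiment (the batch effect) plus small noise.
true_offset = {"exp_a": 0.0, "exp_b": 1.5, "exp_c": -2.0}
shift = np.array([true_offset[e] for e in experiment_id])
noise = rng.normal(scale=0.1, size=experiment_id.size)
affinity = 2.0 * descriptors[:, 0] - descriptors[:, 1] + shift + noise

# Encode experiment ID as a categorical predictor. drop="first" makes the
# first category (exp_a) the reference level, so the remaining dummy
# coefficients are offsets relative to exp_a.
encoder = OneHotEncoder(drop="first")
exp_dummies = encoder.fit_transform(experiment_id.reshape(-1, 1)).toarray()
X = np.hstack([descriptors, exp_dummies])

model = LinearRegression().fit(X, affinity)

# The last two coefficients belong to the exp_b and exp_c dummy columns.
estimated_offsets = {
    str(name): float(coef)
    for name, coef in zip(encoder.categories_[0][1:], model.coef_[-2:])
}
print(estimated_offsets)
```

To predict a new molecule on the scale of the reference experiment, you
would set its dummy columns to zero. Note that this only works because all
experiments are fit jointly with shared descriptor coefficients; with 10-50
molecules per dataset the offset estimates will be noisy.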

Lastly, you may already know this, but the term "meta-analysis" is relevant
here, and you can google for specific techniques.  Most of these would be
more limited than what you are envisioning, I think.

Best,

Jason

On Tue, Sep 5, 2017 at 6:39 AM, Thomas Evangelidis <tevang3 at gmail.com>
wrote:

> Greetings,
>
> I am working on a problem that involves predicting the binding affinity of
> small molecules on a receptor structure (it is a regression problem, not
> classification). I have multiple small datasets of molecules with measured
> binding affinities on a receptor, but each dataset was measured in
> different experimental conditions and therefore I cannot use them all
> together as a training set. So, instead of using them individually, I was
> wondering whether there is a method to combine them all into a super
> training set. The first way I could think of is to convert the binding
> affinities to Z-scores and then combine all the small datasets of
> molecules. But this would be inaccurate because, firstly, the datasets
> are very small (10-50 molecules each), and secondly, the range of binding
> affinities differs in each experiment (some datasets contain really strong
> binders, while others do not, etc.). Is there any other approach to combine
> datasets with values coming from different sources? Maybe if someone points
> me to the right reference I could read and understand if it is applicable
> to my case.
>
> Thanks,
> Thomas
>
> --
>
> ======================================================================
>
> Dr Thomas Evangelidis
>
> Post-doctoral Researcher
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/2S049,
> 62500 Brno, Czech Republic
>
> email: tevang at pharm.uoa.gr
>
>           tevang3 at gmail.com
>
>
> website: https://sites.google.com/site/thomasevangelidishomepage/
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>