[scikit-learn] combining datasets from different sources

Maciek Wójcikowski maciek at wojcikowski.pl
Thu Sep 7 10:14:51 EDT 2017


2017-09-07 15:57 GMT+02:00 Thomas Evangelidis <tevang3 at gmail.com>:

>
>
> On 7 September 2017 at 15:29, Maciek Wójcikowski <maciek at wojcikowski.pl>
> wrote:
>
>> I think StandardScaler is what you want. For each assay you will get a
>> mean and a variance. The average mean would be the "optimal" shift and the
>> average variance the spread. But would this value make any physical sense?
>>
> I think you missed my point. The problem was scaling with restraints:
> the RMSD between the binding affinities of the common ligands must be
> minimized upon scaling. Anyway, I managed to work it out using
> scipy.optimize.
>
Yes, I meant the common ligands, which would probably lead you to a similar
solution. Out of curiosity: is there a connection between the optimal shift
and the type of assay in your case?
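
To make the idea concrete, here is a minimal sketch of that kind of
restrained rescaling with scipy.optimize (toy pK numbers; one shift and one
scale for assay B, fitted by minimizing the RMSD over the ligands it shares
with a reference assay A):

import numpy as np
from scipy.optimize import minimize

# Hypothetical pK values of the ligands shared between a reference assay A
# and another assay B (made-up numbers, just to illustrate the fit).
pk_A = np.array([6.2, 7.5, 8.1, 9.0])
pk_B = np.array([5.1, 6.6, 7.0, 8.2])

def rmsd(params):
    shift, scale = params
    return np.sqrt(np.mean((scale * pk_B + shift - pk_A) ** 2))

# Start from "no transformation" (shift = 0, scale = 1).
result = minimize(rmsd, x0=[0.0, 1.0], method="Nelder-Mead")
shift_opt, scale_opt = result.x

With more than two assays you would optimize all the shifts and scales
jointly over the overlapping molecules, as Thomas describes further down
the thread.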


>
>
>
>> Regarding RF-Score-VS: in fact it is a regressor and it predicts a
>> real value, not a class. Although it is validated mostly using the
>> Enrichment Factor, the last figure shows top results for regression vs. Vina.
>>
> To my understanding, you trained the RF using class information (active,
> inactive) and the prediction was a probability value. If the probability
> is above 0.5 then the compound is an active, otherwise it is an inactive.
> This is how sklearn.ensemble.RandomForestClassifier works.
>
We trained a RandomForestRegressor with the binding affinities of the DUD-E
actives. The decoys were arbitrarily assigned an activity of 5.95 pK.
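
Schematically it was along these lines (not the actual RF-Score-VS code;
random placeholder descriptors stand in for the real protein-ligand
features):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 36)                    # placeholder descriptors
y_actives = rng.uniform(6.0, 11.0, 50)   # measured pK of the actives
y_decoys = np.full(50, 5.95)             # decoys fixed at 5.95 pK
y = np.concatenate([y_actives, y_decoys])

model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
predicted_pk = model.predict(X)          # real-valued pK, not a class label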


>
> In contrast, I train MLPRegressors using binding affinities (scalar values)
> and the predictions are binding affinities (scalar values).
>
We will have a chance to talk it through in Berlin. See you there!

>
>

>
>
>> ----
>> Pozdrawiam,  |  Best regards,
>> Maciek Wójcikowski
>> maciek at wojcikowski.pl
>>
>> 2017-09-06 20:48 GMT+02:00 Thomas Evangelidis <tevang3 at gmail.com>:
>>
>>>
>>> After some thought about this problem today, I think it is an objective
>>> function minimization problem, where the objective function can be the root
>>> mean square deviation (RMSD) between the affinities of the common molecules
>>> in the two data sets. I could work iteratively: first rescale and fit assay
>>> B to match A, then proceed to assay C, and so forth. Or, alternatively, for
>>> each assay I need to find two missing variables, the optimum shift Sh and
>>> the scale Sc. So if I have three assays A, B and C, let's say, I am looking
>>> for the optimum values of Sh_A, Sc_A, Sh_B, Sc_B, Sh_C, Sc_C that minimize
>>> the RMSD between the binding affinities of the overlapping molecules. Any
>>> idea how I can do that with scikit-learn?
>>>
>>>
>>> On 6 September 2017 at 00:29, Thomas Evangelidis <tevang3 at gmail.com>
>>> wrote:
>>>
>>>> Thanks Jason, Sebastian and Maciek!
>>>>
>>>> I believe, from all the suggestions, the most feasible solution is to
>>>> look for experimental assays which overlap by at least two compounds, and
>>>> then adjust the binding affinities of one of them by looking at their
>>>> difference in both assays. Sebastian mentioned the simplest scenario, where
>>>> the shift for both compounds is 2 kcal/mol. However, he neglected to
>>>> mention that the ratio between the affinities of the two compounds in each
>>>> assay also matters. Specifically, the ratio Ka/Kb = -7/-9 = 0.78 in assay A
>>>> but -10/-12 = 0.83 in assay B. Ideally that should also be taken into
>>>> account to select the right transformation function for the values from
>>>> assay B. Is anybody aware of a clever algorithm to select the right
>>>> transformation function for such a case? I am sure one exists.
>>>>
>>>> The other approach would be to train a different predictor on each
>>>> assay and then apply a data fusion technique (e.g. minimum rank). But that
>>>> wouldn't be very elegant.
>>>>
>>>> @Maciek To my understanding, the paper you cited addresses a
>>>> classification problem (actives vs. inactives) by implementing random
>>>> forest classifiers. My case is a regression problem.
>>>>
>>>>
>>>> best,
>>>> Thomas
>>>>
>>>>
>>>> On 5 September 2017 at 20:33, Maciek Wójcikowski <maciek at wojcikowski.pl
>>>> > wrote:
>>>>
>>>>> Hi Thomas and others,
>>>>>
>>>>> It also really depends on how many data points you have for each
>>>>> compound. If you have more than a few, then there are a few options. If
>>>>> you get two very distinct activities for one ligand, I'd discard such
>>>>> samples as ambiguous or decide on one of the assays/experiments (the one
>>>>> with the lower error). The exact problem was faced by the PDBbind
>>>>> creators; I'd also look there for details of what they did with their
>>>>> activities.
>>>>>
>>>>> To follow up on Sebastian's suggestion: have you checked how different
>>>>> the ranks/Z-scores you get are? Check out the Kendall tau.
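
For example, on the compounds shared by two assays (hypothetical numbers):

import numpy as np
from scipy.stats import kendalltau

pk_assay_A = np.array([6.2, 7.5, 8.1, 9.0])
pk_assay_B = np.array([5.1, 6.9, 6.6, 8.2])

# tau close to 1 means the two assays rank the shared compounds consistently.
tau, p_value = kendalltau(pk_assay_A, pk_assay_B)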
>>>>>
>>>>> Anyhow, you could build local models for specific experimental
>>>>> methods. In our recent publication in a slightly different area
>>>>> (protein-ligand scoring functions), we show that an RF built on one target
>>>>> is just slightly better than an RF built on many targets (we used the
>>>>> DUD-E database); check out the "horizontal" and "per-target" splits in
>>>>> https://www.nature.com/articles/srep46710. Unfortunately, this may change
>>>>> for different models. Plus it depends on the molecular descriptors used,
>>>>> which we know nothing about.
>>>>>
>>>>> I hope that helped a bit.
>>>>>
>>>>> ----
>>>>> Pozdrawiam,  |  Best regards,
>>>>> Maciek Wójcikowski
>>>>> maciek at wojcikowski.pl
>>>>>
>>>>> 2017-09-05 19:35 GMT+02:00 Sebastian Raschka <se.raschka at gmail.com>:
>>>>>
>>>>>> Another approach would be to pose this as a "ranking" problem to
>>>>>> predict relative affinities rather than absolute affinities. E.g., if you
>>>>>> have data from one (or more) molecules that has/have been tested under 2 or
>>>>>> more experimental conditions, you can rank the other molecules accordingly
>>>>>> or normalize. E.g., if you observe that the binding affinity of molecule A
>>>>>> is -7 kcal/mol in assay A and -9 kcal/mol in assay B, and say the binding
>>>>>> affinities of molecule B are -10 and -12 kcal/mol, respectively, that
>>>>>> should give you some information for normalizing the values from assay B
>>>>>> (e.g., by adding 2 kcal/mol). Of course this is not a perfect solution and
>>>>>> might be error prone, but so are experimental assays ... (when I sometimes
>>>>>> look at the std error/CI of the data I get from collaborators ... well, it
>>>>>> seems that absolute binding affinities always have to be taken with a
>>>>>> grain of salt anyway)
>>>>>>
>>>>>> Best,
>>>>>> Sebastian
>>>>>>
>>>>>> > On Sep 5, 2017, at 1:02 PM, Jason Rudy <jcrudy at gmail.com> wrote:
>>>>>> >
>>>>>> > Thomas,
>>>>>> >
>>>>>> > This is sort of related to the problem I did my M.S. thesis on
>>>>>> years ago: cross-platform normalization of gene expression data.  If you
>>>>>> google that term you'll find some papers.  The situation is somewhat
>>>>>> different, though, because with microarrays or RNA-seq you get thousands of
>>>>>> data points for each experiment, which makes it easier to estimate the
>>>>>> batch effect.  The principle is similar, however.
>>>>>> >
>>>>>> > If I were in your situation, I would consider whether I have any of
>>>>>> the following advantages:
>>>>>> >
>>>>>> > 1. Some molecules that appear in multiple data sets
>>>>>> > 2. Detailed information about the different experimental conditions
>>>>>> > 3. Physical/chemical models of how experimental conditions
>>>>>> influence binding affinity
>>>>>> >
>>>>>> > If you have any of the above, you can potentially use them to
>>>>>> improve your estimates.  You could also consider using experiment ID as a
>>>>>> categorical predictor in a sufficiently general regression method.
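
A rough sketch of the "experiment ID as a categorical predictor" idea
(made-up descriptors, assay labels and affinities; GradientBoostingRegressor
is just one possible choice of a sufficiently general regressor):

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
descriptors = rng.rand(60, 5)               # made-up molecular descriptors
assay_id = rng.randint(0, 3, size=(60, 1))  # which experiment each value came from
y = rng.uniform(5.0, 11.0, 60)              # made-up binding affinities (pK)

# One-hot encode the assay ID and append it to the descriptors,
# so the model can learn a per-experiment offset.
assay_dummies = OneHotEncoder().fit_transform(assay_id).toarray()
X = np.hstack([descriptors, assay_dummies])

model = GradientBoostingRegressor(random_state=0).fit(X, y)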
>>>>>> >
>>>>>> > Lastly, you may already know this, but the term "meta-analysis" is
>>>>>> relevant here, and you can google for specific techniques.  Most of these
>>>>>> would be more limited than what you are envisioning, I think.
>>>>>> >
>>>>>> > Best,
>>>>>> >
>>>>>> > Jason
>>>>>> >
>>>>>> > On Tue, Sep 5, 2017 at 6:39 AM, Thomas Evangelidis <
>>>>>> tevang3 at gmail.com> wrote:
>>>>>> > Greetings,
>>>>>> >
>>>>>> > I am working on a problem that involves predicting the binding
>>>>>> affinity of small molecules on a receptor structure (it is a regression
>>>>>> problem, not classification). I have multiple small datasets of molecules
>>>>>> with measured binding affinities on a receptor, but each dataset was
>>>>>> measured in different experimental conditions and therefore I cannot use
>>>>>> them all together as a training set. So, instead of using them
>>>>>> individually, I was wondering whether there is a method to combine them
>>>>>> all into a super training set. The first way I could think of is to
>>>>>> convert the binding affinities to Z-scores and then combine all the small
>>>>>> datasets of molecules. But this would be inaccurate because, firstly, the
>>>>>> datasets are very small (10-50 molecules each), and secondly, the range
>>>>>> of binding affinities differs in each experiment (some datasets contain
>>>>>> really strong binders, while others do not, etc.). Is there any other
>>>>>> approach to combine datasets with values coming from different sources?
>>>>>> Maybe if someone points me to the right reference I could read and
>>>>>> understand if it is applicable to my case.
>>>>>> >
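
For illustration, the per-dataset Z-scoring could look like this (toy
numbers; scikit-learn's StandardScaler fitted separately on each assay
before pooling):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical affinities (pK) from two assays run under different conditions.
assay_A = np.array([[6.2], [7.5], [8.1], [9.0]])
assay_B = np.array([[5.1], [6.6], [7.0], [8.2], [9.5]])

# Standardise each assay separately, then pool the Z-scores as training targets.
z_A = StandardScaler().fit_transform(assay_A)
z_B = StandardScaler().fit_transform(assay_B)
y_pooled = np.concatenate([z_A.ravel(), z_B.ravel()])
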
>>>>>> > Thanks,
>>>>>> > Thomas
>>>>>> >
>>>>>> > --
>>>>>> > ======================================================================
>>>>>> > Dr Thomas Evangelidis
>>>>>> > Post-doctoral Researcher
>>>>>> > CEITEC - Central European Institute of Technology
>>>>>> > Masaryk University
>>>>>> > Kamenice 5/A35/2S049,
>>>>>> > 62500 Brno, Czech Republic
>>>>>> >
>>>>>> > email: tevang at pharm.uoa.gr
>>>>>> >               tevang3 at gmail.com
>>>>>> >
>>>>>> > website: https://sites.google.com/site/thomasevangelidishomepage/
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > _______________________________________________
>>>>>> > scikit-learn mailing list
>>>>>> > scikit-learn at python.org
>>>>>> > https://mail.python.org/mailman/listinfo/scikit-learn
>>>>>> >
>>>>>> >
>>>>>> > _______________________________________________
>>>>>> > scikit-learn mailing list
>>>>>> > scikit-learn at python.org
>>>>>> > https://mail.python.org/mailman/listinfo/scikit-learn
>>>>>>
>>>>>> _______________________________________________
>>>>>> scikit-learn mailing list
>>>>>> scikit-learn at python.org
>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> scikit-learn mailing list
>>>>> scikit-learn at python.org
>>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> ======================================================================
>>>>
>>>> Dr Thomas Evangelidis
>>>>
>>>> Post-doctoral Researcher
>>>> CEITEC - Central European Institute of Technology
>>>> Masaryk University
>>>> Kamenice 5/A35/2S049,
>>>> 62500 Brno, Czech Republic
>>>>
>>>> email: tevang at pharm.uoa.gr
>>>>
>>>>           tevang3 at gmail.com
>>>>
>>>>
>>>> website: https://sites.google.com/site/thomasevangelidishomepage/
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> ======================================================================
>>>
>>> Dr Thomas Evangelidis
>>>
>>> Post-doctoral Researcher
>>> CEITEC - Central European Institute of Technology
>>> Masaryk University
>>> Kamenice 5/A35/2S049,
>>> 62500 Brno, Czech Republic
>>>
>>> email: tevang at pharm.uoa.gr
>>>
>>>           tevang3 at gmail.com
>>>
>>>
>>> website: https://sites.google.com/site/thomasevangelidishomepage/
>>>
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
>
> --
>
> ======================================================================
>
> Dr Thomas Evangelidis
>
> Post-doctoral Researcher
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/2S049,
> 62500 Brno, Czech Republic
>
> email: tevang at pharm.uoa.gr
>
>           tevang3 at gmail.com
>
>
> website: https://sites.google.com/site/thomasevangelidishomepage/
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>